[Data Science] Clustering (2) Partitioning Method; K-Means, PAM(K-Medoids), K-modes, CLARA

이수빈·2023년 6월 18일

CLARA Data Science K-Medoids K-Modes clustering k-means pam

강의-데이터사이언스

목록 보기

6/9

0 Partitioning Method

☑️ what) N개의 데이터를 K개의 클러스터로 나눈다.
클러스터의 representative (e.g. centroid, medoid)를 정하고, 다음 식의 클러스터별 총합이 최소가 되도록 나눈다.
🥲 pb) $K$ 가 hyper-parameter이다, non-convex 모양의 클러스터는 찾을 수 없다.

\sum^K_{m=1} \sum^{N_m}_{i=1} (c_m - t_i^m)^2

where
$K$ : the number of clusters (pre-defined)
$N_m$ : the number of data points in the m-th cluster
$c_m$ : a representative of the m-th cluster
$t_i^m$ : i-th data point in the m-th cluster

1 K-Means, PAM(K-Medoids), K-modes, CLARA

	K-Menas	PAM (Partitioning Around Medoid) ; K-Medoids
☑️ what)	Centroid를 클러스터의 기준점으로 하는 Pratitioning clustering method	Medoid를 클러스터의 기준점으로 하는 Pratitioning clustering method
cf)	centroid는 real data point가 아닌 logical center이다.	medois는 real data point이다.
Time Complexity	$O(nkt)$ where t is a # of iteration.	PAM: $O(ik(n-k)^2)$ CLARA: $O(ks^2 + k(n-k))$
👍🏻 gd)	k, t << n 이므로 시간복잡도를 $O(n)$ 으로 상정할 수 있다.	more robust than k-means; outlier의 significant한 효과를 없앨 수 있다.
🥲 pb)	local optimum에 수렴할 수도 있다. Noise, outlier를 다룰 수 없다.	$O(n^2)$ 의 시간복잡도로, 스케일이 어렵다. -> sol) CLARA
Variation	K-modes for categorical-only data - data distance: 미스매치 페어 개수, $d(X,Y) = \sum^m_{i=1}\delta(x_i, y_i)$ where $\delta =!\oplus$ is - cluster distance: $D(X,Q)=\sum^n_{i=1} d(X_i, Q)$	CLARA (Clusrterin LARge Applications) i) 데이터셋을 샘플링한다. ii) PAM을 적용한다. iii) 전체 데이터를 샘플링 데이터셋의 medoid에 따라 클러스터링한다. - sample size 大 → accurary ↑ , time ↑ - sample size 小 → accuracy ↓ , time ↓