Note #1: Clustering Basics
19 Aug 2018

Clustering goals
Partition a set of points into disjoint sets such that:
- Points in the same group are similar
- Points in different groups are dissimilar
- Formally, this can be cast as building a similarity graph on the points and finding a partition with minimal cut (sketched below)
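As a rough sketch of that formalisation (the notation here is assumed, not from the note): treat the points as nodes of a weighted graph with similarity weights $w_{ij}$; a two-way clustering is a partition $(A, \bar{A})$ with cut

$$\mathrm{cut}(A, \bar{A}) = \sum_{i \in A,\ j \in \bar{A}} w_{ij},$$

which a good clustering keeps small. In practice a normalised variant (normalised cut or conductance) is usually minimised instead, since the raw cut favours splitting off tiny clusters.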
Clustering variants
- Similarity-based: given an N x N dissimilarity matrix or distance matrix D
- Feature-based: given an N x D feature matrix or design matrix X (both input types are illustrated in the sketch after this list)
- Flat clustering: partition into disjoint sets
- Hierarchical clustering: partitions are organised in a tree structure
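A minimal sketch of the two input types on toy data (assuming a recent scikit-learn, where the precomputed-distance argument is called `metric`; older releases call it `affinity`):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # N x D feature / design matrix
D = squareform(pdist(X, metric="euclidean"))   # N x N distance matrix

# Feature-based clustering: the algorithm sees the feature matrix directly.
feature_based = AgglomerativeClustering(n_clusters=3).fit(X)

# Similarity-based clustering: the algorithm only sees precomputed distances.
similarity_based = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average").fit(D)

print(feature_based.labels_[:10], similarity_based.labels_[:10])
```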
Associated problems
- Multi-view clustering: cluster feature vectors, not data points
- Label prediction: given a new point, predict its cluster label (sketched below)
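A small sketch of label prediction with k-means (the data and cluster count are made up for illustration): fit on a set of points, then assign new points to the nearest learned centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 2))   # points used to fit the clustering
X_new = rng.normal(size=(5, 2))       # unseen points to label afterwards

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)
print(km.predict(X_new))              # cluster label for each new point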
Clustering algorithms
- k-means
- Hierarchical clustering
- Gaussian mixtures
- Spectral clustering
- Affinity propagation
- Infinite mixture model (Dirichlet process)
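As a sketch, all of the listed algorithms have scikit-learn counterparts and can be run on the same toy data; the infinite (Dirichlet process) mixture is approximated here with `BayesianGaussianMixture` using a Dirichlet-process prior (data and settings are illustrative only).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import (KMeans, AgglomerativeClustering,
                             SpectralClustering, AffinityPropagation)
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

results = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "hierarchical": AgglomerativeClustering(n_clusters=3).fit_predict(X),
    "gaussian mixture": GaussianMixture(n_components=3, random_state=0).fit_predict(X),
    "spectral": SpectralClustering(n_clusters=3, random_state=0).fit_predict(X),
    "affinity propagation": AffinityPropagation(random_state=0).fit_predict(X),
    "dp mixture": BayesianGaussianMixture(
        n_components=10,
        weight_concentration_prior_type="dirichlet_process",
        random_state=0).fit_predict(X),
}
for name, labels in results.items():
    print(f"{name}: {np.unique(labels).size} clusters found")
```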
Clustering metrics
- Graph cuts - cut score & conductance of the graph
- Purity - requires ground-truth class labels, does not penalise using a large number of clusters
- Rand index (and its adjusted variant)
- Mutual information (and its normalised variant); the label-based scores are sketched in code below
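A small sketch of the label-based metrics, assuming ground-truth labels `y_true` and predicted cluster labels `y_pred` (purity has no scikit-learn helper, so a tiny one is defined here; the other scores come from `sklearn.metrics`):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity(y_true, y_pred):
    # Fraction of points whose cluster's majority true class matches their own.
    C = contingency_matrix(y_true, y_pred)   # rows: true classes, cols: clusters
    return C.max(axis=0).sum() / C.sum()

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])

print("purity:", purity(y_true, y_pred))
print("adjusted Rand:", adjusted_rand_score(y_true, y_pred))
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
```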