K-Means Clustering
K-means clustering is an unsupervised machine learning algorithm that partitions a dataset into k groups by minimising the sum of squared distances between data points and their assigned cluster centroids.
K-means clustering is one of the most widely used unsupervised learning algorithms. Given a dataset of n points in d-dimensional space and a predetermined number of clusters k, the algorithm partitions the data such that each point is assigned to the cluster whose centroid is nearest in Euclidean distance, and the centroids themselves are positioned to minimise the total within-cluster sum of squared distances.
Algorithm
The standard formulation, known as Lloyd's algorithm, alternates two steps until convergence. The assignment step assigns every point to the cluster whose centroid is closest. The update step recomputes each centroid as the mean of the points currently assigned to it. Iteration continues until assignments no longer change, the centroid shift drops below a threshold, or a maximum iteration count is reached. The procedure is guaranteed to converge to a local optimum of the within-cluster sum of squares but not necessarily the global optimum.
Initialisation
Final cluster quality depends heavily on initial centroid placement. Random initialisation may converge to poor local optima, so practitioners typically run k-means multiple times with different seeds and retain the best result. The k-means++ scheme, introduced by Arthur and Vassilvitskii in 2007, picks initial centroids with probability proportional to squared distance from the nearest already-chosen centroid, providing both empirical and theoretical improvements. Scikit-learn, MLlib, and most modern libraries use k-means++ by default.
Choosing k
The number of clusters k must be specified in advance. Common heuristics include the elbow method, plotting within-cluster sum of squares against k and selecting the inflection point; the silhouette score, which measures how similar a point is to its own cluster compared to other clusters; the gap statistic; and the Bayesian information criterion under a Gaussian mixture assumption. Domain knowledge often overrides statistical heuristics in practice.
Variants
Several variants address limitations of the standard algorithm. Mini-batch k-means scales to very large datasets by updating centroids using small random batches. K-medoids (also known as PAM) replaces the mean centroid with an actual data point, making it robust to outliers and applicable to non-Euclidean distances. Fuzzy c-means assigns each point a degree of membership in every cluster rather than a hard assignment. Spherical k-means uses cosine distance and is preferred for text and embedding clustering. Constrained k-means enforces minimum or maximum cluster sizes.
Limitations
K-means assumes clusters are roughly spherical, similar in size, and isotropic in variance. It struggles with clusters of unequal density, non-convex shapes, and varying feature scales. Sensitivity to outliers is a known issue; feature standardisation and outlier removal are routine preprocessing steps. The Euclidean distance metric implicitly assumes that features are commensurate, which often requires scaling. Categorical features require either k-prototypes, k-modes, or transformation to numeric embeddings.
Applications
K-means is applied widely as both a standalone analysis tool and a building block for larger pipelines. Common uses include customer segmentation, image colour quantisation, document clustering, anomaly detection, vector quantisation for embedding compression, initialisation for Gaussian mixture models, and partitioning for distributed nearest-neighbour search. It is also used in approximate nearest neighbour libraries such as FAISS, where inverted-file indexes rely on k-means to define coarse cells.
References
- Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory.
- MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Berkeley Symposium on Mathematical Statistics and Probability.
- Arthur, D. and Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. SODA.
- Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12.