What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

K-Means Clustering

K-means clustering is an unsupervised machine learning algorithm that partitions a dataset into k groups by minimising the sum of squared distances between data points and their assigned cluster centroids.

4 min read | Last updated May 2026 | Foundations

K-means clustering is one of the most widely used unsupervised learning algorithms. Given a dataset of n points in d-dimensional space and a predetermined number of clusters k, the algorithm partitions the data such that each point is assigned to the cluster whose centroid is nearest in Euclidean distance, and the centroids themselves are positioned to minimise the total within-cluster sum of squared distances.
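The objective described above can be written compactly. The symbols $x_i$ (a data point), $C_j$ (the set of points in cluster $j$), and $\mu_j$ (the centroid of cluster $j$) are notation introduced here for illustration and do not appear in the original text.

```latex
\[
J(C_1, \dots, C_k) \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 ,
\qquad
\mu_j \;=\; \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i .
\]
```

K-means searches for the partition $C_1, \dots, C_k$ that makes $J$ as small as possible. For a fixed partition, the mean $\mu_j$ is the point that minimises the inner sum, which is why centroids are defined as cluster means.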

Algorithm

The standard formulation, known as Lloyd's algorithm, alternates two steps until convergence. The assignment step assigns every point to the cluster whose centroid is closest. The update step recomputes each centroid as the mean of the points currently assigned to it. Iteration continues until assignments no longer change, the centroid shift drops below a threshold, or a maximum iteration count is reached. The procedure is guaranteed to converge to a local optimum of the within-cluster sum of squares but not necessarily the global optimum.
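The two alternating steps can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function names, the `tol` and `max_iter` defaults, and the toy data are assumptions chosen for the example, and initial centroids are passed in explicitly so that the result is reproducible.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: squared_distance(point, centroids[j]))


def lloyd_kmeans(points, init_centroids, max_iter=100, tol=1e-9):
    """Alternate assignment and update steps until a stopping rule fires."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [nearest_centroid(p, centroids) for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # leave an empty cluster's centroid in place
        # Stop when assignments are unchanged or the largest centroid shift is tiny.
        shift = max(squared_distance(c, n) for c, n in zip(centroids, new_centroids))
        converged = new_labels == labels or shift < tol
        centroids, labels = new_centroids, new_labels
        if converged:
            break
    return centroids, labels


def wcss(points, centroids, labels):
    """Within-cluster sum of squares for a given clustering."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = lloyd_kmeans(points, init_centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(wcss(points, centroids, labels))
```

With one starting centroid in each group, the assignments settle after the first pass and the second pass only confirms that nothing changed. A different starting choice can lead to a different, possibly worse, local optimum.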

Initialisation

Final cluster quality depends heavily on initial centroid placement. Random initialisation may converge to poor local optima, so practitioners typically run k-means multiple times with different seeds and retain the run with the lowest within-cluster sum of squares. The k-means++ scheme, introduced by Arthur and Vassilvitskii in 2007, picks the first centroid uniformly at random and each subsequent centroid with probability proportional to its squared distance from the nearest already-chosen centroid. This seeding tends to give better results in practice and guarantees an expected cost within a factor of O(log k) of the optimum. Scikit-learn uses k-means++ by default, Spark MLlib defaults to its parallel variant k-means||, and most modern libraries follow a similar approach.
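The k-means++ seeding rule can be sketched as follows. This is an illustrative version using only the standard library, not the code of any particular library; the function name and the `seed` parameter are assumptions, and the sketch assumes the data contains at least k distinct points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_plus_plus(points, k, seed=0):
    """Pick k initial centroids by sampling in proportion to squared distance."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]  # first centroid: uniform at random
    while len(centroids) < k:
        # Squared distance from every point to its nearest chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Far-away points are proportionally more likely to be picked next;
        # already-chosen points have weight zero and cannot be picked again.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
print(kmeans_plus_plus(points, k=2))
```

The returned points would then be passed to the iterative algorithm as its starting centroids. Because the sampling is random, several seeds are still usually tried and the best run kept.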

Choosing k

The number of clusters k must be specified in advance. Common heuristics include the elbow method, which plots within-cluster sum of squares against k and selects the point where the curve bends and additional clusters yield diminishing reductions; the silhouette score, which compares each point's mean distance to its own cluster with its mean distance to the nearest other cluster; the gap statistic; and the Bayesian information criterion under a Gaussian mixture assumption. Domain knowledge often overrides statistical heuristics in practice.
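As an illustration of one of these heuristics, the silhouette score can be computed directly from a set of points and their cluster labels. The sketch below uses only the standard library and treats singleton clusters as contributing zero, which is a common convention; the function name and the two hand-written labelings are assumptions made for the example.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
two_clusters = [0, 0, 0, 1, 1, 1]    # matches the two visible groups
three_clusters = [0, 0, 2, 1, 1, 1]  # splits one group unnecessarily
print(silhouette_score(points, two_clusters))
print(silhouette_score(points, three_clusters))
```

Scores lie between -1 and 1, and higher is better. Here the two-cluster labeling scores higher than the three-cluster one, which is the signal used to prefer k = 2 for this toy data.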

Variants

Several variants address limitations of the standard algorithm. Mini-batch k-means scales to very large datasets by updating centroids using small random batches. K-medoids, commonly solved with the PAM (Partitioning Around Medoids) algorithm, replaces the mean centroid with an actual data point, making it more robust to outliers and applicable to non-Euclidean distances. Fuzzy c-means assigns each point a degree of membership in every cluster rather than a hard assignment. Spherical k-means clusters unit-normalised vectors by cosine similarity and is often preferred for text and embedding clustering. Constrained k-means enforces minimum or maximum cluster sizes.
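To make the mini-batch idea concrete, the sketch below updates centroids from small random batches using a per-centroid step size that shrinks as a centroid absorbs more points. It is a simplified illustration; the function name, the default batch size, and the number of batches are assumptions for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def minibatch_kmeans(points, init_centroids, batch_size=4, n_batches=50, seed=0):
    """Refine centroids from small random batches instead of full passes."""
    rng = random.Random(seed)
    centroids = [list(c) for c in init_centroids]
    counts = [0] * len(centroids)  # how many points each centroid has absorbed
    for _ in range(n_batches):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, with the centroids held fixed.
        assigned = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in batch
        ]
        # Then nudge each centroid towards its batch members.
        for p, j in zip(batch, assigned):
            counts[j] += 1
            rate = 1.0 / counts[j]  # per-centroid step size shrinks over time
            centroids[j] = [(1 - rate) * c + rate * x for c, x in zip(centroids[j], p)]
    return centroids


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
print(minibatch_kmeans(points, init_centroids=[points[0], points[3]]))
```

Each batch touches only a few points, so the cost per update does not grow with the dataset size. The trade-off is a slightly noisier result than full-batch iteration.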

Limitations

K-means assumes clusters are roughly spherical, similar in size, and of similar variance in every direction. It struggles with clusters of unequal density and with non-convex shapes. Because a mean is pulled towards extreme values, the algorithm is sensitive to outliers, so outlier removal is a routine preprocessing step. Euclidean distance also implicitly assumes that features are measured on comparable scales, which is why features are usually standardised before clustering. Categorical features require k-modes, k-prototypes (for mixed numeric and categorical data), or transformation to numeric embeddings.
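The effect of feature scale is easy to see with a small example. The sketch below standardises each column to zero mean and unit standard deviation (z-scores) using only the standard library. The customer records are hypothetical values invented for illustration.

```python
import math
import statistics


def standardise(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical customers: (age in years, annual income in ringgit)
customers = [(25, 30_000), (30, 32_000), (45, 120_000), (50, 115_000)]
scaled = standardise(customers)

# Before scaling, income dominates the distance almost entirely.
print(math.dist(customers[0], customers[1]))
# After scaling, age and income contribute on comparable terms.
print(math.dist(scaled[0], scaled[1]))
```

Without scaling, a five-year age gap is invisible next to a two-thousand-ringgit income gap, so clusters would effectively be formed on income alone.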

Applications

K-means is applied widely as both a standalone analysis tool and a building block for larger pipelines. Common uses include customer segmentation, image colour quantisation, document clustering, anomaly detection, vector quantisation for embedding compression, initialisation for Gaussian mixture models, and partitioning for distributed nearest-neighbour search. It is also used in approximate nearest neighbour libraries such as FAISS, where inverted-file indexes rely on k-means to define coarse cells.
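The inverted-file idea mentioned above can be illustrated with a toy index. This sketch is not the FAISS API; the function names, the `nprobe` parameter name, and the centroid values (assumed to have come from an earlier k-means run) are all illustrative.

```python
import math


def build_inverted_lists(vectors, centroids):
    """Map each coarse cell (centroid index) to the ids of the vectors it contains."""
    lists = {j: [] for j in range(len(centroids))}
    for vec_id, v in enumerate(vectors):
        cell = min(range(len(centroids)), key=lambda j: math.dist(v, centroids[j]))
        lists[cell].append(vec_id)
    return lists


def search(query, vectors, centroids, lists, nprobe=1):
    """Scan only the `nprobe` cells whose centroids are closest to the query."""
    cells = sorted(range(len(centroids)), key=lambda j: math.dist(query, centroids[j]))
    candidates = [vec_id for cell in cells[:nprobe] for vec_id in lists[cell]]
    # Exact comparison happens only within the probed cells.
    return min(candidates, key=lambda i: math.dist(query, vectors[i]), default=None)


vectors = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids = [(1.0, 1.0), (8.0, 8.0)]  # assumed output of a k-means run

lists = build_inverted_lists(vectors, centroids)
print(lists)                                              # {0: [0, 1, 2], 1: [3, 4, 5]}
print(search((8.1, 8.0), vectors, centroids, lists))      # 3
```

The query is compared against two centroids and then three vectors, instead of all six vectors. Probing more cells (a larger `nprobe`) trades speed for a lower chance of missing the true nearest neighbour.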

Malaysian Context — K-Means in Industry and Public Sector

K-means clustering is a workhorse across Malaysian industries. Maybank, CIMB, Public Bank, RHB, and Hong Leong Bank use clustering for customer segmentation, transaction behaviour analysis, and credit risk scoring, supplementing supervised models in compliance with Bank Negara Malaysia's risk management guidance. Securities Commission Malaysia-licensed asset managers use k-means for portfolio style classification and risk bucketing.

Telecommunications operators including Maxis and CelcomDigi apply k-means to subscriber segmentation, churn risk modelling, and network anomaly detection. Tenaga Nasional Berhad uses clustering on smart-meter consumption profiles to support tariff design and demand-side management. Grab Malaysia and AirAsia use clustering for trip pattern analysis and dynamic pricing inputs.

Public-sector applications include the Department of Statistics Malaysia's use of clustering on household survey data, the Ministry of Health's segmentation of hospital catchments, and the National Anti-Drug Agency's use of geographic clustering for resource allocation. Universiti Malaya, Universiti Putra Malaysia, Universiti Kebangsaan Malaysia, and Universiti Teknologi MARA teach k-means as a foundational topic in data science programmes, often within HRD Corp Claimable Courses delivered through the Centre of Applied Data Science (CADS) and MDEC-recognised Premier Digital Tech Institutions.

References

Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Berkeley Symposium on Mathematical Statistics and Probability.
Arthur, D. and Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. SODA.
Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12.

Tags: clustering, unsupervised learning, Lloyd algorithm, centroid, partitioning

Type	Unsupervised partitioning algorithm
Introduced	1957 (Lloyd), published 1982
Objective	Minimise within-cluster sum of squares
Time complexity	O(n * k * d) per iteration; O(n * k * d * i) over i iterations
Common variants	k-means++, mini-batch k-means, k-medoids