- Type: Model evaluation tool
- Applies to: Classification problems
- Cells: True positives, false positives, true negatives, false negatives
- Derived metrics: Accuracy, precision, recall, F1, specificity
- Related: Precision and recall, ROC curve, model benchmarking
A confusion matrix is a tabular summary used to evaluate the performance of a classification model. Each row of the table corresponds to the actual class of an example and each column to the class the model predicted, or the reverse depending on convention. By laying out how often predictions match reality, the matrix reveals not only how many mistakes a model makes but what kinds of mistakes, information that a single accuracy figure conceals.
Structure for Binary Classification
In the simplest case of two classes, usually called positive and negative, the matrix has four cells. A true positive (TP) is a positive example correctly predicted as positive. A true negative (TN) is a negative example correctly predicted as negative. A false positive (FP), also called a Type I error, is a negative example wrongly predicted as positive. A false negative (FN), or Type II error, is a positive example wrongly predicted as negative. Every prediction the model makes falls into exactly one of these four cells, and their sum equals the number of evaluated examples.
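As an illustration of the four cells, the following minimal sketch tallies them from paired lists of actual and predicted labels. The function name `binary_confusion_counts`, the label encoding (1 for positive, 0 for negative), and the sample data are assumptions made for this example only.

```python
from collections import Counter


def binary_confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for paired actual and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct if the example really is positive.
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the example was actually positive.
            counts["FN" if actual == positive else "TN"] += 1
    return counts["TP"], counts["FP"], counts["TN"], counts["FN"]


# Hypothetical labels for eight examples (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, tn, fn = binary_confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
```

Each example lands in exactly one cell, so the four counts add up to the number of evaluated examples.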
Metrics Derived from the Matrix
The four counts combine into several widely used metrics. Accuracy is the proportion of all predictions that are correct, computed as (TP + TN) / (TP + TN + FP + FN). Precision measures how many predicted positives are truly positive, TP / (TP + FP), and answers how trustworthy a positive prediction is. Recall, also called sensitivity or the true positive rate, measures how many actual positives were found, TP / (TP + FN). Specificity is the true negative rate, TN / (TN + FP). The F1 score is the harmonic mean of precision and recall, 2 × (precision × recall) / (precision + recall), giving a single figure that balances the two.
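The formulas above translate directly into code. The sketch below is one possible arrangement, not a reference implementation: the helper names `safe_div` and `derived_metrics` and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here for simplicity rather than a universal rule.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 for an empty denominator (an assumed convention)."""
    return numerator / denominator if denominator else 0.0


def derived_metrics(tp, fp, tn, fn):
    """Compute the standard metrics from the four confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,  # also called sensitivity or true positive rate
        "specificity": safe_div(tn, tn + fp),  # true negative rate
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical counts for a classifier evaluated on 100 examples.
metrics = derived_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```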
Different applications weight these metrics differently. In cancer screening a false negative is far more costly than a false positive, so recall is prioritised. In email spam filtering a false positive that discards a legitimate message may be worse than letting spam through, so precision matters more.
Why Accuracy Alone Misleads
The confusion matrix is especially valuable when classes are imbalanced. If ninety-five percent of transactions are legitimate, a fraud detector that labels every transaction as legitimate achieves ninety-five percent accuracy while catching no fraud at all. Its confusion matrix would immediately expose zero true positives and a large false negative count, showing the model to be useless despite its high accuracy. This is why practitioners inspect the full matrix rather than relying on a headline number.
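The fraud example can be reproduced in a few lines. The sketch assumes a hypothetical set of 1,000 transactions, 50 of them fraudulent (encoded as 1), and a trivial "model" that labels everything as legitimate (0).

```python
# Hypothetical data: 50 fraudulent (1) and 950 legitimate (0) transactions.
y_true = [1] * 50 + [0] * 950
# A degenerate detector that labels every transaction as legitimate.
y_pred = [0] * len(y_true)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)
tn = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 0)
fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)  # share of actual fraud that was caught

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

The accuracy looks strong, yet the matrix shows every fraudulent transaction sitting in the false negative cell.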
Multi-Class and Extensions
For problems with more than two classes, the confusion matrix becomes an n-by-n grid in which the diagonal holds correct predictions and off-diagonal cells show which classes are confused for which. This helps diagnose systematic errors, such as a vision model that repeatedly mistakes one animal species for a similar one. Per-class precision and recall can be computed and then combined in several ways: weighting each class equally (macro-averaging), weighting each class by its frequency (weighted averaging), or pooling the counts of all classes before computing the metric (micro-averaging).
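The sketch below builds a small multi-class matrix and compares three ways of averaging per-class precision. It assumes rows hold actual classes and columns hold predicted classes; the function names, the three animal labels, and the data are hypothetical. Macro-averaging gives each class equal weight, weighted averaging scales each class by its number of actual examples, and micro-averaging pools the counts of all classes before dividing.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Build an n-by-n matrix: rows are actual classes, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


def averaged_precision(matrix):
    """Return per-class, macro, weighted, and micro precision."""
    n = len(matrix)
    true_pos = [matrix[k][k] for k in range(n)]  # diagonal
    predicted = [sum(row[k] for row in matrix) for k in range(n)]  # TP + FP
    support = [sum(row) for row in matrix]  # actual examples per class
    per_class = [tp / p if p else 0.0 for tp, p in zip(true_pos, predicted)]
    macro = sum(per_class) / n
    weighted = sum(p * s for p, s in zip(per_class, support)) / sum(support)
    micro = sum(true_pos) / sum(predicted)
    return per_class, macro, weighted, micro


labels = ["cat", "dog", "fox"]
y_true = ["cat"] * 5 + ["dog"] * 3 + ["fox"] * 2
y_pred = ["cat", "cat", "cat", "cat", "dog",  # one cat mistaken for a dog
          "dog", "dog", "fox",                # one dog mistaken for a fox
          "dog", "fox"]                       # one fox mistaken for a dog

matrix = confusion_matrix(y_true, y_pred, labels)
per_class, macro, weighted, micro = averaged_precision(matrix)
for label, row in zip(labels, matrix):
    print(label, row)
print(f"macro={macro:.3f} weighted={weighted:.3f} micro={micro:.3f}")
```

With imbalanced classes the three averages diverge, which is why reports usually state which one was used.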
Related evaluation tools build on the same counts. The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate across decision thresholds, and the area under it summarises ranking quality. Because many classifiers output a probability or score rather than a hard label, moving the threshold shifts entries between the matrix cells, letting teams tune the balance between false positives and false negatives to suit the application.
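The effect of moving the threshold can be seen by recomputing the matrix at several cut-offs. In this sketch the scores, the labels, the three thresholds, and the function name `counts_at_threshold` are all hypothetical, and an example is predicted positive when its score is greater than or equal to the threshold. Each threshold yields one (false positive rate, true positive rate) pair, which is a single point on the ROC curve.

```python
def counts_at_threshold(y_true, scores, threshold):
    """Return (TP, FP, TN, FN) when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Hypothetical scores for four positive and four negative examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

roc_points = {}
for threshold in (0.25, 0.5, 0.75):
    tp, fp, tn, fn = counts_at_threshold(y_true, scores, threshold)
    tpr = tp / (tp + fn)  # true positive rate (recall)
    fpr = fp / (fp + tn)  # false positive rate (1 - specificity)
    roc_points[threshold] = (fpr, tpr)
    print(f"threshold={threshold}: TP={tp} FP={fp} TN={tn} FN={fn} "
          f"TPR={tpr:.2f} FPR={fpr:.2f}")
```

Raising the threshold in this example removes false positives at the cost of additional false negatives, which is the trade-off the paragraph above describes.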
The confusion matrix is a routine but important tool for Malaysian organisations that must justify automated decisions to regulators. Banks and insurers supervised by Bank Negara Malaysia, and firms operating under the fintech frameworks of the Securities Commission Malaysia, are expected to demonstrate that credit-scoring and fraud-detection models perform reliably and do not systematically disadvantage groups of customers. Reporting precision, recall, and per-segment confusion matrices provides the evidence auditors and risk committees require, and supports the responsible-AI expectations set out in national guidance.
In healthcare, models used by Ministry of Health facilities for triage or diagnostic support are assessed on sensitivity and specificity, quantities read directly from the confusion matrix, because the cost of a missed diagnosis differs sharply from that of a false alarm. Similar scrutiny applies to public-safety and immigration applications, where false positives carry real consequences for individuals.
The emphasis on transparent evaluation aligns with Malaysia's National Guidelines on AI Governance and Ethics and the oversight interests of the National AI Office and CyberSecurity Malaysia. Training programmes funded by HRD Corp and delivered through MDEC-linked initiatives teach data scientists to move beyond accuracy toward threshold-aware, class-balanced evaluation, helping Malaysian firms deploy classifiers that are both effective and defensible.