Precision and Recall
Precision and recall are two complementary metrics used to evaluate classification models, measuring respectively the correctness of positive predictions and the completeness with which actual positives are identified.
Precision and recall are paired evaluation metrics for classification and information-retrieval systems. Precision answers the question of how many of the items the model labelled positive are actually positive, while recall answers how many of the truly positive items the model managed to find. Because the two capture different kinds of error, they are almost always reported together, and improving one often comes at the expense of the other.
Definitions and the confusion matrix
Both metrics are computed from the four cells of a confusion matrix: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). Precision is defined as TP / (TP + FP), the fraction of positive predictions that are correct. Recall, also called sensitivity or the true positive rate, is TP / (TP + FN), the fraction of actual positives that were retrieved.
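The two definitions above can be sketched in a few lines of Python. The function name `precision_recall` and the example counts are illustrative assumptions. Returning 0.0 when a denominator is zero is a convention chosen for this sketch, not part of the definitions.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from confusion-matrix counts.

    A metric whose denominator is zero is reported as 0.0 here; that is a
    convention of this sketch, since the ratio itself is undefined.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
p, r = precision_recall(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.800 recall=0.667
```

True negatives do not appear in either formula, which is why neither metric is inflated by a large negative class.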
A model that predicts positive very rarely, only when extremely confident, will tend to have high precision but low recall, missing many genuine cases. A model that predicts positive liberally will have high recall but low precision, raising many false alarms. The right balance depends entirely on the costs attached to each type of error.
Why accuracy is not enough
Plain accuracy, the proportion of all predictions that are correct, can be deeply misleading when classes are imbalanced. If only one transaction in ten thousand is fraudulent, a model that labels every transaction as legitimate achieves 99.99 percent accuracy while detecting no fraud at all. Precision and recall expose this failure immediately: recall on the fraud class is zero, and precision is undefined because the model makes no positive predictions. For this reason, imbalanced problems in fraud detection, medical screening and anomaly detection are usually evaluated with precision and recall rather than accuracy.
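The fraud example can be checked directly. The sketch below builds the hypothetical data set described above (10,000 transactions, one of them fraudulent) and a degenerate model that predicts "legitimate" for everything. The variable names are illustrative.

```python
# 10,000 transactions, exactly one fraudulent (label 1), the rest legitimate (label 0).
y_true = [1] + [0] * 9_999
# A degenerate model that labels every transaction as legitimate.
y_pred = [0] * 10_000

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, y in pairs if t == 1 and y == 1)
fp = sum(1 for t, y in pairs if t == 0 and y == 1)
fn = sum(1 for t, y in pairs if t == 1 and y == 0)
tn = sum(1 for t, y in pairs if t == 0 and y == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
# No positive predictions were made, so precision is 0/0; report it as None.
precision = tp / (tp + fp) if (tp + fp) else None

print(f"accuracy={accuracy:.4f} recall={recall:.1f} precision={precision}")
# accuracy=0.9999 recall=0.0 precision=None
```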
Combining the two
To summarise both numbers in a single figure, practitioners use the F1 score, the harmonic mean of precision and recall, given by F1 = 2 * (precision * recall) / (precision + recall). The harmonic mean penalises large disparities, so a high F1 requires both metrics to be reasonably high. Where one concern dominates, the more general F-beta score, F-beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall), treats recall as beta times as important as precision: beta greater than 1 favours recall, beta less than 1 favours precision, and beta equal to 1 recovers F1.
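As a minimal sketch, the F-beta score can be written as one function, with F1 as the special case beta = 1. The function name `f_beta`, the example values and the zero-denominator convention are assumptions made for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention for this sketch when both metrics are zero
    return (1 + b2) * precision * recall / denom


# Hypothetical model with high precision but mediocre recall.
precision, recall = 0.9, 0.5
print(f"arithmetic mean = {(precision + recall) / 2:.3f}")      # 0.700
print(f"F1   = {f_beta(precision, recall):.3f}")                # 0.643
print(f"F2   = {f_beta(precision, recall, beta=2.0):.3f}")      # 0.549
print(f"F0.5 = {f_beta(precision, recall, beta=0.5):.3f}")      # 0.776
```

F1 falls below the arithmetic mean because the harmonic mean is pulled towards the weaker of the two metrics. F2 drops further because it puts more weight on the low recall.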
Because most classifiers output a continuous score rather than a hard label, the decision threshold can be tuned to trade precision against recall. Sweeping the threshold produces a precision-recall curve, and the area under it provides a threshold-independent measure of quality that is particularly informative for imbalanced data.
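The threshold sweep can be sketched in plain Python. The toy labels and scores below are invented for illustration, and predicting positive when `score >= threshold` is an assumed convention. The area is estimated with a step-wise sum often called average precision, which is one of several ways to summarise the curve.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score.

    An item is predicted positive when its score is >= the threshold.
    Thresholds are visited from highest to lowest, so recall never decreases.
    """
    total_pos = sum(y_true)
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        precision = tp / (tp + fp)  # tp + fp >= 1 because thr is an observed score
        recall = tp / total_pos if total_pos else 0.0
        points.append((thr, precision, recall))
    return points


def average_precision(points):
    """Step-wise area estimate: sum of (recall increase) * precision."""
    ap, prev_recall = 0.0, 0.0
    for _, precision, recall in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy data: 1 = actual positive, 0 = actual negative.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

points = precision_recall_points(y_true, scores)
for thr, p, r in points:
    print(f"threshold={thr:.1f} precision={p:.3f} recall={r:.3f}")
print(f"average precision={average_precision(points):.3f}")
```

On this toy data, lowering the threshold from 0.9 to 0.4 raises recall from one third to 1.0 while precision falls from 1.0 to 0.75, which is the trade-off described above.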
| Scenario | Priority | Rationale |
| --- | --- | --- |
| Medical screening | Recall | Missing a true case is costly |
| Spam filtering | Precision | Blocking a real email is costly |
| Search ranking | Both, via F1 | Relevance and completeness matter |
| Fraud detection | Balanced, tuned | Errors carry asymmetric cost |
Use in information retrieval
The concepts originate in information retrieval, where precision measures the relevance of returned documents and recall measures coverage of all relevant documents. Modern search and recommendation systems, including those built on vector databases and semantic search, continue to report precision at k and recall at k, the same two metrics computed over only the top k ranked results, to describe ranking quality.
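A minimal sketch of precision at k and recall at k follows. The document identifiers and function names are hypothetical. Dividing by k even when fewer than k results are returned is one common convention, assumed here.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        return 0.0  # convention for this sketch when nothing is relevant
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical ranking returned by a search system, best match first.
ranked = ["d3", "d7", "d1", "d9", "d4"]
# Hypothetical ground truth: the documents actually relevant to the query.
relevant = {"d1", "d3", "d5"}

for k in (1, 3, 5):
    print(f"k={k} precision@k={precision_at_k(ranked, relevant, k):.3f} "
          f"recall@k={recall_at_k(ranked, relevant, k):.3f}")
```

Here "d5" is never retrieved, so recall at k cannot exceed two thirds however large k becomes, while precision at k falls as more non-relevant results enter the top k.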