AIWiki

TF-IDF

TF-IDF is a statistical weighting scheme that measures how important a word is to a document within a collection, widely used in information retrieval and text mining.

4 min read · Last updated July 2026 · Foundations

TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic that reflects how important a word is to a particular document within a larger collection, or corpus. It is one of the most widely used methods for converting text into numeric features and has long served as a backbone of information retrieval, text mining, and document classification. The core intuition is that a word carries more descriptive value for a document when it appears often in that document but rarely across the collection as a whole.

The two components

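TF-IDF is the product of two separate measures that complement each other: one rewards terms that are frequent within a document, the other penalises terms that are frequent across the corpus.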

Term frequency (TF)

Term frequency measures how often a term appears within a single document, typically as the ratio of the count of the term to the total number of terms in that document. A word that appears many times in a document is assumed to be more relevant to its content. Raw counts are often normalised or dampened, for example by taking a logarithm, so that a term appearing a hundred times is not treated as a hundred times more important than one appearing once.
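A minimal sketch of the two variants described above, assuming documents arrive as pre-tokenised lists of strings. The function names `term_frequency` and `log_dampened_tf` are illustrative, and `1 + log(count)` is one common dampening choice rather than the only one.

```python
import math
from collections import Counter

def term_frequency(tokens):
    """Relative frequency: count of each term / total terms in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def log_dampened_tf(tokens):
    """Dampened variant (assumed form): 1 + ln(count) for each term present."""
    counts = Counter(tokens)
    return {term: 1.0 + math.log(count) for term, count in counts.items()}

doc = "the cat sat on the mat".split()
tf = term_frequency(doc)
# tf["the"] is 2/6 and tf["cat"] is 1/6
```

With the dampened variant, a term that occurs 100 times scores 1 + ln(100), roughly 5.6, rather than 100 times the score of a term that occurs once.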

Inverse document frequency (IDF)

Inverse document frequency measures how rare a term is across the whole corpus. Words that appear in almost every document, such as common function words, receive low IDF scores, while words that appear in only a few documents receive high scores. IDF is usually computed as the logarithm of the total number of documents divided by the number of documents containing the term. This component downweights ubiquitous words that carry little distinguishing information.
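The computation described above can be sketched as follows, assuming a small in-memory corpus of tokenised documents. The name `inverse_document_frequency` and the toy corpus are illustrative, and the unsmoothed natural-log form is used; libraries often add smoothing terms that are omitted here.

```python
import math

def inverse_document_frequency(corpus):
    """IDF per term: ln(N / df), where df counts the documents containing the term."""
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        # set() ensures each document is counted at most once per term
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
idf = inverse_document_frequency(corpus)
# idf["the"] is 0.0 because "the" occurs in all three documents
```

In this toy corpus, "bird" (one document) receives a higher score than "cat" (two documents), which in turn scores higher than "the" (every document).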

The final TF-IDF weight is obtained by multiplying the two: weight = tf * idf. A term achieves a high weight only when it is frequent in the current document and uncommon elsewhere, which is precisely the profile of a good keyword.
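As a rough end-to-end illustration of weight = tf * idf, the sketch below uses relative term frequency and an unsmoothed natural-log IDF. The function name `tfidf` and the three-document corpus are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    n_docs = len(corpus)
    # Document frequency: number of documents in which each term occurs
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    weights = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
weights = tfidf(corpus)

# Terms of the first document, highest weight first
ranked = sorted(weights[0].items(), key=lambda item: item[1], reverse=True)
```

In the first document, "the" is the most frequent term but ends up with weight 0 because it appears in every document, while "mat", which appears nowhere else, outweighs "cat", which also appears in the second document.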

Applications

Because TF-IDF turns free text into fixed-length numeric vectors, with one dimension per vocabulary term, it enables a wide range of downstream tasks. In search and information retrieval, documents are ranked by the similarity, commonly cosine similarity, between their TF-IDF vectors and the query's vector. In keyword extraction, the highest-weighted terms in a document summarise its topic. As a feature extractor, TF-IDF vectors feed classical machine learning classifiers for tasks such as spam filtering, sentiment analysis, and topic categorisation.

| Use case | Role of TF-IDF |
| --- | --- |
| Search ranking | Score document relevance to a query |
| Keyword extraction | Surface a document's most distinctive terms |
| Document classification | Provide numeric features for classifiers |
| Clustering | Represent documents for similarity grouping |
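The search-ranking use case can be sketched with sparse TF-IDF vectors and cosine similarity. This is a simplified illustration using only the standard library, not how any particular search engine works; the helper names `build_idf`, `vectorize`, `cosine`, and `rank` are hypothetical, and query terms that never occur in the corpus are simply ignored.

```python
import math
from collections import Counter

def build_idf(corpus):
    """Unsmoothed IDF: ln(N / df) for every term seen in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def vectorize(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    return {t: (c / total) * idf[t] for t, c in counts.items()}

def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, corpus):
    """Return (score, document index) pairs, best match first."""
    idf = build_idf(corpus)
    q_vec = vectorize(query, idf)
    scores = [(cosine(q_vec, vectorize(doc, idf)), i) for i, doc in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[0], reverse=True)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
ranking = rank("bird sang".split(), corpus)
best_index = ranking[0][1]  # index 2, the only document mentioning "bird" or "sang"
```

The other two documents share no weighted term with the query, so their similarity is 0.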

Relationship to modern methods

TF-IDF is a sparse, term-matching representation: it counts exact words and cannot recognise that "car" and "automobile" are related. Modern systems increasingly use dense embeddings and transformer models that capture semantic meaning, and probabilistic ranking functions such as BM25 refine the same underlying ideas with better length normalisation. Nonetheless, TF-IDF remains valuable. It is fast, interpretable, requires no model training beyond gathering corpus statistics, and often serves as a strong baseline or as the lexical half of hybrid search systems that combine keyword and semantic retrieval.
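To make the comparison with BM25 concrete, here is a hedged sketch of one widely used Okapi BM25 formulation. The function name `bm25_scores` is illustrative, the defaults `k1=1.5` and `b=0.75` are conventional choices rather than values taken from this article, and the IDF uses the non-negative `ln(1 + (N - df + 0.5) / (df + 0.5))` variant; other implementations differ in these details.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document against the query with a common BM25 variant."""
    if not corpus:
        return []
    n_docs = len(corpus)
    # Fall back to 1.0 if every document is empty, to avoid dividing by zero
    avg_len = sum(len(doc) for doc in corpus) / n_docs or 1.0
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        # Length normalisation: documents longer than average are penalised
        length_norm = 1.0 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            tf = counts[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            # Saturating term frequency: extra occurrences add less and less
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
scores = bm25_scores("bird sang".split(), corpus)
```

The `length_norm` factor is where BM25 departs most visibly from plain TF-IDF: for the same term count, a shorter document scores higher than a longer one.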
