
BERT

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer-based language model developed by Google that reads text bidirectionally to understand word context in natural language tasks.

Last updated June 2026

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained natural language processing (NLP) model developed by Google AI Language and introduced in October 2018. It marked a significant advance in machine language understanding because each token's representation is conditioned on the words to both its left and its right at every layer, rather than on a single left-to-right or right-to-left pass as in earlier models. BERT's architecture and its pre-train-then-fine-tune methodology became an influential template for later language models, particularly encoder-based ones.

Architecture

BERT is built on the Transformer encoder architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. Unlike GPT, which uses the decoder portion of the Transformer, BERT uses only the encoder stack. This makes BERT particularly suited to tasks requiring understanding the full context of a sentence rather than generating new text.
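The difference between an encoder-only and a decoder-only model shows up most directly in the attention mask. The following is a minimal sketch in plain Python, not code from either model; the function names are hypothetical and chosen for illustration.

```python
def full_attention_mask(n):
    """Encoder-style (BERT-like) mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_attention_mask(n):
    """Decoder-style (GPT-like) mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "bank", "of", "the", "river"]
for name, mask in [("encoder", full_attention_mask(len(tokens))),
                   ("decoder", causal_attention_mask(len(tokens)))]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>5} -> {row}")
```

Under the encoder mask, the representation of "bank" can draw on "river" later in the sentence; under the causal mask it cannot, which is why the decoder form suits generation and the encoder form suits understanding tasks.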

The original BERT was released in two sizes. BERT-Base contains 12 transformer encoder layers, 12 attention heads, and 110 million parameters. BERT-Large contains 24 layers, 16 attention heads, and 340 million parameters. Both variants take as input a sequence of tokens and output a contextualised embedding for every token in the sequence.
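The quoted parameter counts can be roughly reproduced with simple arithmetic. The sketch below assumes details that are not stated above: hidden sizes of 768 (Base) and 1024 (Large), a feed-forward width of four times the hidden size, a 30,522-entry WordPiece vocabulary (the uncased English release), 512 positions, two segment types, and a pooling layer over the first token. The function name is hypothetical.

```python
def bert_encoder_params(layers, hidden, vocab=30522, max_pos=512, segments=2):
    """Approximate parameter count of a BERT-style encoder (assumed dimensions)."""
    ffn = 4 * hidden                      # feed-forward inner width
    layer_norm = 2 * hidden               # scale and bias vectors
    # Token, position and segment embedding tables, plus one LayerNorm.
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden), plus LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Dense layer applied to the first token's output.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


print(f"BERT-Base : {bert_encoder_params(12, 768):,}")    # 109,482,240
print(f"BERT-Large: {bert_encoder_params(24, 1024):,}")   # 335,141,888
```

The results, about 109.5 million and 335.1 million, are consistent with the rounded figures of 110 million and 340 million. The number of attention heads does not appear in the formula because the heads split the hidden dimension rather than adding to it.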

Position embeddings and segment embeddings are added to the token embeddings, so the model retains information about word order and about which sentence of a pair each token belongs to. A special classification token, [CLS], is prepended to every input sequence, and a separator token, [SEP], marks the end of each sentence so that paired sentences can be told apart. The final hidden state of the [CLS] token is commonly used as the sequence-level representation for classification tasks.
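The input layout can be sketched in a few lines of Python. This is an illustration only: real BERT splits text into WordPiece subwords, whereas the example below uses pre-split words, and the function name is hypothetical. The segment ids shown are the indices BERT uses to look up a per-sentence embedding alongside the token and position embeddings.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Lay out one or two token lists in BERT's [CLS] A [SEP] B [SEP] format."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)          # first sentence, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # second sentence and its [SEP]
    position_ids = list(range(len(tokens)))  # index into the position embedding table
    return tokens, segment_ids, position_ids


tokens, segments, positions = build_bert_input(
    ["how", "are", "you"], ["fine", "thanks"]
)
for row in zip(positions, segments, tokens):
    print(row)
```

Each printed row gives the three indices whose embeddings are summed to form the model's input vector at that position.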

Training Methodology

BERT is trained using two unsupervised pre-training objectives: Masked Language Modelling (MLM) and Next Sentence Prediction (NSP).

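In Masked Language Modelling, a random subset of input tokens, approximately 15 percent, is selected for prediction, and the model is trained to recover the original tokens from the surrounding context. Of the selected tokens, 80 percent are replaced with a special [MASK] token, 10 percent are replaced with a random token, and 10 percent are left unchanged, which reduces the mismatch between pre-training and fine-tuning, where [MASK] never appears. This objective forces the model to learn representations that incorporate both left and right context simultaneously, which is the key innovation distinguishing BERT from the unidirectional GPT-1 and from ELMo, which concatenates separately trained left-to-right and right-to-left models.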
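The masking step can be sketched as follows. The BERT paper describes selecting about 15 percent of tokens and then replacing 80 percent of those with [MASK], 10 percent with a random token, and leaving 10 percent unchanged. The code below is a simplified illustration of that recipe, not the original implementation: it selects each token independently with probability 0.15 rather than picking a fixed number per sequence, and the function name and toy vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_inputs, labels); labels are None where no prediction is made."""
    rng = rng or random.Random()
    special = {"[CLS]", "[SEP]"}             # never selected for prediction
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in special or rng.random() >= mask_prob:
            continue
        labels[i] = token                    # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"             # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)    # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return inputs, labels


vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
inputs, labels = mask_tokens(sentence, vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

The training loss is computed only at positions where the label is not None, so the model is never rewarded for simply copying unselected tokens.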

In Next Sentence Prediction, the model is given pairs of sentences and trained to predict whether the second sentence follows the first in the original text. This task was designed to help BERT understand inter-sentence relationships, which is relevant for tasks such as question answering and natural language inference.
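Training pairs for this objective can be built along the following lines. In the BERT paper, half of the pairs use the true next sentence (labelled IsNext) and half use a random sentence from the corpus (labelled NotNext). The sketch below is an illustration under simplifying assumptions: documents are lists of sentence strings, negatives are always drawn from a different document, at least two documents are supplied, and the function name is hypothetical.

```python
import random


def make_nsp_pairs(documents, rng=None):
    """Build (sentence_a, sentence_b, label) triples for Next Sentence Prediction."""
    rng = rng or random.Random()
    pairs = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the sentence that really follows.
                pairs.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # Negative example: a random sentence from another document.
                others = [d for j, d in enumerate(documents) if j != doc_index]
                other_doc = rng.choice(others)
                pairs.append((doc[i], rng.choice(other_doc), "NotNext"))
    return pairs


documents = [
    ["The river flooded.", "Roads were closed.", "Schools shut for a day."],
    ["The recipe needs flour.", "Add two eggs."],
]
for pair in make_nsp_pairs(documents, rng=random.Random(0)):
    print(pair)
```

Each pair would then be laid out as [CLS] A [SEP] B [SEP], with the label predicted from the [CLS] output.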

BERT was pre-trained on the BookCorpus (800 million words) and English Wikipedia (2.5 billion words), totalling roughly 3.3 billion words.

Fine-Tuning

One of BERT's defining contributions was demonstrating that a single pre-trained model could be fine-tuned with minimal task-specific modifications to achieve state-of-the-art performance across a wide range of NLP benchmarks. Fine-tuning typically adds a small output layer on top of the pre-trained BERT encoder and trains the combined model on labelled data for the target task.
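The "small output layer" is typically a single linear layer followed by a softmax over the task's classes. The sketch below shows that head in plain Python on a toy four-dimensional vector standing in for the [CLS] output; the function names, weights, and dimensions are hypothetical, and a real setup would use a deep learning framework and backpropagate the loss through the whole encoder.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_embedding, weights, bias):
    """Linear layer (one weight row per class) followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, cls_embedding)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def cross_entropy(probs, label):
    """Loss minimised during fine-tuning for a single labelled example."""
    return -math.log(probs[label])


cls_embedding = [0.2, -1.0, 0.5, 0.3]        # stand-in for the encoder's [CLS] output
weights = [[0.1, 0.4, -0.2, 0.0],            # class 0: negative sentiment
           [-0.3, -0.5, 0.6, 0.2]]           # class 1: positive sentiment
bias = [0.0, 0.0]

probs = classify(cls_embedding, weights, bias)
print("class probabilities:", probs)
print("loss if true label is 1:", cross_entropy(probs, 1))
```

Because only the weight matrix and bias are new, the task-specific part of the model is tiny compared with the pre-trained encoder.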

Tasks BERT has been fine-tuned for include sentiment analysis, named entity recognition, question answering (including the Stanford Question Answering Dataset, SQuAD), text classification, natural language inference, and semantic textual similarity. On the GLUE benchmark, a suite of NLP evaluation tasks, BERT raised the overall score to 80.5 percent at release, an absolute improvement of 7.7 points over the previous best result.

Variants and Descendants

The success of BERT prompted a large family of derivative models. RoBERTa (Robustly Optimized BERT Pretraining Approach), developed by Facebook AI Research, removed the NSP objective and trained for longer on more data with larger batches, achieving improved performance. DistilBERT, produced via knowledge distillation, retains approximately 97 percent of BERT's performance while being 40 percent smaller and 60 percent faster at inference. ALBERT (A Lite BERT) introduced parameter sharing across layers to reduce model size without proportional performance loss.
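The effect of cross-layer parameter sharing on model size can be illustrated with a short calculation. The sketch below uses BERT-Base dimensions (12 layers, hidden size 768, feed-forward width of four times the hidden size) as an assumption; it does not model ALBERT's actual configuration or its other changes, such as its factorised embedding table, and the function names are hypothetical.

```python
def encoder_layer_params(hidden):
    """Parameters in one Transformer encoder layer (weights, biases, LayerNorms)."""
    ffn = 4 * hidden
    layer_norm = 2 * hidden
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    return attention + feed_forward


def stack_params(layers, hidden, share_across_layers=False):
    """Parameters in the whole encoder stack, with or without cross-layer sharing."""
    per_layer = encoder_layer_params(hidden)
    # With sharing, every layer reuses the same weights, so they are stored once.
    return per_layer if share_across_layers else layers * per_layer


print(f"separate weights per layer: {stack_params(12, 768):,}")        # 85,054,464
print(f"one shared set of weights : {stack_params(12, 768, True):,}")  # 7,087,872
```

Sharing cuts the stored encoder weights by a factor equal to the number of layers, but the shared layer is still applied once per layer position, so the computation per token does not shrink in the same way.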

Domain-specific variants include BioBERT for biomedical text, LegalBERT for legal documents, and FinBERT for financial text. Multilingual variants include mBERT, which supports over 100 languages from a single pre-trained checkpoint.

Impact and Legacy

BERT fundamentally changed the NLP research landscape. Prior to BERT, NLP systems typically relied on task-specific architectures with limited transfer across domains. BERT demonstrated that a large general-purpose pre-trained encoder, fine-tuned on small task-specific datasets, could outperform bespoke models trained from scratch on large task-specific datasets.

Google began using BERT in Google Search in October 2019, describing it as the biggest leap forward in the past five years and one of the biggest in the history of Search. The model improved understanding of natural language queries, particularly for longer, conversational searches where prepositions and word order carry significant meaning.

As of 2025, BERT-family models remain widely used for natural language understanding tasks in production systems, even as generative LLMs have become dominant for text generation. BERT's encoder-only design makes it computationally efficient for classification and semantic embedding applications at scale.

References

  1. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019.
  2. Liu, Y. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
  3. Sanh, V. et al. (2019). DistilBERT, a distilled version of BERT. arXiv:1910.01108.
  4. Nayel, H. and Sharf, A. (2025). BERT applications in natural language processing: a review. Artificial Intelligence Review. Springer Nature.
  5. Google AI Blog. (2018). Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing. Google.