Text Summarisation
Text summarisation is the natural language processing task of producing a shorter version of one or more source documents while preserving the most important information. It is one of the oldest problems in NLP, with research dating to Luhn's 1958 work on automatic abstracting at IBM, and remains a benchmark capability for modern large language models.
Extractive summarisation
Extractive summarisation selects existing sentences or spans from the source document and concatenates them into a summary. Classical approaches score sentences using frequency-based heuristics such as TF-IDF, graph-based methods such as TextRank and LexRank, or supervised classifiers trained to predict whether each sentence should be included. Modern extractive systems use transformer encoders such as BERT to score sentences in context, often combined with sequence labelling or pointer networks. Because all output text comes verbatim from the input, extractive summaries are unlikely to hallucinate, but they often read as choppy and miss high-level themes.
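The frequency-based family described above can be illustrated with a minimal sketch. It scores each sentence by the average document frequency of its non-stopword tokens and returns the top-k sentences in their original order. The stopword list, the regex sentence splitter, and the function names are simplifying assumptions made for this illustration, not a reference implementation of TF-IDF, TextRank, or any published system.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "on", "for", "with", "as", "are", "was"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stopwords removed."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def extractive_summary(text, k=2):
    """Return the k highest-scoring sentences, kept in source order."""
    sentences = split_sentences(text)
    # Term frequency over the whole document.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(ranked))


if __name__ == "__main__":
    doc = "Cats sleep a lot. Cats and dogs are pets. The weather was mild today."
    print(extractive_summary(doc, k=2))
```

Because the output is assembled only from sentences in the input, every sentence in the summary can be traced back verbatim to the source, which is the property that makes extractive output resistant to hallucination.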
Abstractive summarisation
Abstractive summarisation generates novel text that paraphrases and restructures the source. Early neural approaches used encoder-decoder recurrent neural networks with attention mechanisms. Pretrained encoder-decoder transformers such as BART, T5, and Pegasus, fine-tuned on summarisation datasets, set the state of the art in the late 2010s and early 2020s. Pegasus introduced gap-sentence generation, a pretraining objective designed specifically for summarisation in which whole sentences are removed from a document and the model learns to regenerate them. Large general-purpose language models including GPT-4, Claude, Gemini, and Llama now perform abstractive summarisation in zero-shot or few-shot settings, often matching or exceeding the quality of fine-tuned specialised models, particularly for long documents handled through long-context architectures or retrieval-augmented generation.
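To make the gap-sentence idea concrete, the sketch below builds a single (source, target) training pair from a list of sentences. It is a simplified illustration rather than the Pegasus implementation: sentence importance is approximated with a plain unigram F1 against the rest of the document, and the mask string, the gap ratio, and the function names are assumptions chosen for readability.

```python
import re
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over unigram overlap between two token lists."""
    c, r = Counter(candidate), Counter(reference)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def make_gap_sentence_example(sentences, gap_ratio=0.3, mask_token="<mask>"):
    """Mask the sentences most similar to the rest of the document.

    Returns (source, target): the document with the chosen sentences replaced
    by mask_token, and the removed sentences joined as the generation target.
    """
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    scores = []
    for i, toks in enumerate(tokens):
        # Compare each sentence against all the other sentences combined.
        rest = [w for j, t in enumerate(tokens) if j != i for w in t]
        scores.append(unigram_f1(toks, rest))

    n_gaps = max(1, round(gap_ratio * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:n_gaps])

    source = " ".join(mask_token if i in chosen else s
                      for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(chosen))
    return source, target


if __name__ == "__main__":
    sents = ["The cat sat on the mat.",
             "A cat sat on a mat.",
             "Birds fly south in winter."]
    src, tgt = make_gap_sentence_example(sents, gap_ratio=0.6)
    print(src)
    print(tgt)
```

The intuition is that sentences which overlap heavily with the remainder of a document behave like a pseudo-summary of it, so learning to regenerate them resembles the downstream summarisation task.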
Hybrid and structured approaches
Hybrid systems combine extractive selection with abstractive rewriting, either through pipeline architectures or end-to-end models such as bottom-up summarisers. Structured approaches produce summaries aligned to a schema — for example, news bullet points, executive summaries, medical discharge notes, or legal briefs — improving downstream usability and supporting evaluation against templates.
Evaluation
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) remains the most widely reported metric, measuring n-gram overlap between system and reference summaries. ROUGE-1 and ROUGE-2 count overlapping unigrams and bigrams respectively, while ROUGE-L is based on the longest common subsequence shared by the two texts. Newer metrics including BERTScore, BLEURT, and learned-reward metrics are often reported to correlate better with human judgement. Faithfulness, meaning whether a summary's claims are supported by the source, is increasingly evaluated using natural language inference models and dedicated factual-consistency checkers and benchmarks such as FactCC and SummaC. Human evaluation along fluency, informativeness, and faithfulness axes remains the gold standard for production systems.
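The overlap computations behind ROUGE-N and ROUGE-L can be sketched in a few lines. This is a minimal illustration assuming lowercased whitespace tokenisation and a single reference; it omits the stemming, stopword handling, and multi-reference aggregation that the official ROUGE package and common reimplementations provide, so its scores will not match theirs exactly.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _prf(overlap, cand_total, ref_total):
    """Precision, recall and F1 from an overlap count and the two totals."""
    p = overlap / cand_total if cand_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f}


def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap between candidate and reference."""
    c = _ngrams(candidate.lower().split(), n)
    r = _ngrams(reference.lower().split(), n)
    overlap = sum((c & r).values())  # each n-gram counted at most min(c, r) times
    return _prf(overlap, sum(c.values()), sum(r.values()))


def _lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """LCS-based overlap, which rewards in-order matches without requiring adjacency."""
    c, r = candidate.lower().split(), reference.lower().split()
    return _prf(_lcs_length(c, r), len(c), len(r))


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    cand = "the cat lay on the mat"
    print(rouge_n(cand, ref, 1))
    print(rouge_n(cand, ref, 2))
    print(rouge_l(cand, ref))
```

In the example pair, one substituted word leaves five of six unigrams matching but breaks two of five bigrams, which shows why ROUGE-2 is the stricter of the two n-gram variants.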
Common challenges
Hallucination, in which a generated summary asserts facts not present in the source, is the central challenge for abstractive systems. Long-document summarisation strains context windows and dilutes attention. Multi-document summarisation must reconcile conflicting information across sources. Domain-specific summarisation — for example, medical literature or legal opinions — requires terminology coverage and respect for safety-critical accuracy. Low-resource languages and code-mixed text present additional difficulties addressed through cross-lingual transfer and multilingual pretraining.
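One common workaround for limited context windows is to split a long document into overlapping chunks, summarise each chunk, and then summarise the concatenated partial summaries. The sketch below shows that pattern under stated assumptions: chunk sizes are measured in whitespace-separated words rather than model tokens, and `summarise` is a hypothetical callable supplied by the caller (for example, a wrapper around whatever model is in use), not a real library function.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if max_words <= overlap:
        raise ValueError("max_words must be larger than overlap")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def summarise_long(text, summarise, max_words=200, overlap=20):
    """Two-stage summarisation: summarise each chunk, then the joined partials.

    `summarise` is any function mapping a string to a shorter string.
    """
    chunks = chunk_words(text, max_words, overlap)
    if len(chunks) <= 1:
        return summarise(text)
    partials = [summarise(chunk) for chunk in chunks]
    return summarise(" ".join(partials))


if __name__ == "__main__":
    # Stand-in summariser for demonstration only: keeps the first two words.
    first_two = lambda t: " ".join(t.split()[:2])
    text = " ".join(str(i) for i in range(10))
    print(chunk_words(text, max_words=4, overlap=1))
    print(summarise_long(text, first_two, max_words=4, overlap=1))
```

This sketch applies only one level of reduction; if the joined partial summaries are still too long for the model, the same function could be applied again to them. Each stage can also introduce or compound unsupported claims, so chunked pipelines do not remove the need for faithfulness checks.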
Applications
Text summarisation is deployed in news aggregators, search snippets, meeting transcription tools, legal e-discovery, scientific literature search, contact-centre call summarisation, and clinical note generation. It is also a core component of retrieval-augmented generation pipelines, where retrieved passages are summarised or compressed before being passed to a downstream model.
References
- Luhn, H. P. (1958). The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development.
- Lewis, M. et al. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation. ACL.
- Zhang, J. et al. (2020). PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. ICML.
- Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out (ACL Workshop).