Natural Language Generation
Natural Language Generation (NLG) is a subfield of artificial intelligence concerned with automatically producing human-readable text from structured data, semantic representations, or other machine-readable inputs. Together with Natural Language Understanding (NLU), it is one of the two principal components of Natural Language Processing (NLP). Whereas NLU interprets human language, NLG produces it, enabling machines to communicate information in fluent, contextually appropriate prose.
Historical Background
Early NLG systems emerged in the 1970s from work in computational linguistics and knowledge representation. These rule-based systems, such as MUMBLE (McDonald, 1983) and FUF/SURGE (Elhadad, 1992), relied on hand-crafted grammars and templates to assemble sentences from structured knowledge bases. The outputs were grammatically correct but required extensive manual engineering for each domain.
Statistical NLG emerged in the 2000s, borrowing techniques from machine translation and corpus linguistics. Systems learned surface realisations from aligned data, reducing the need for hand-crafted rules while improving coverage. However, statistical methods struggled with long-range coherence and factual accuracy.
The advent of deep learning, and particularly the Transformer architecture introduced by Vaswani et al. in 2017, marked a turning point. Pre-trained language models such as GPT-2 (2019), T5 (2019), and GPT-3 (2020) demonstrated that large-scale neural models trained on internet-scale text could generate fluent, diverse, and contextually rich text with minimal task-specific fine-tuning.
Core Pipeline
Classical NLG systems decompose generation into a pipeline of discrete stages:
- Content determination selects which information from the input to express, filtering out irrelevant facts and ranking the rest by importance.
- Discourse planning orders the selected content into a coherent narrative structure, establishing relationships such as cause-effect, contrast, or elaboration between propositions.
- Sentence aggregation groups related propositions into single sentences to avoid choppy, list-like output.
- Lexicalisation chooses the specific words and phrases to express each proposition, drawing on domain vocabulary and stylistic constraints.
- Referring expression generation decides how to refer to entities (by name, pronoun, or definite description) to maintain clarity while avoiding repetition.
- Surface realisation converts the abstract sentence plan into grammatical text, handling morphology, agreement, punctuation, and word order.
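As a rough illustration, the classical stages above can be sketched as a toy template-based pipeline. The `Fact` record, the importance scores, the function names, and the sentence template are all hypothetical, chosen for this sketch rather than taken from any of the systems mentioned. Lexicalisation, referring expression generation, and surface realisation are folded into one template step for brevity.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    entity: str      # who or what the fact is about
    attribute: str   # e.g. "revenue"
    value: str       # already formatted for display
    importance: int  # higher means more newsworthy (hypothetical score)


def determine_content(facts, min_importance=2):
    """Content determination: keep only facts that are important enough."""
    return [f for f in facts if f.importance >= min_importance]


def plan_discourse(facts):
    """Discourse planning: order facts from most to least important."""
    return sorted(facts, key=lambda f: -f.importance)


def aggregate(facts):
    """Sentence aggregation: group consecutive facts about the same entity."""
    groups = []
    for fact in facts:
        if groups and groups[-1][0].entity == fact.entity:
            groups[-1].append(fact)
        else:
            groups.append([fact])
    return groups


def realise(groups):
    """Lexicalisation, referring expressions and surface realisation via one template."""
    sentences = []
    mentioned = set()
    for group in groups:
        entity = group[0].entity
        # Referring expression: full name on first mention, pronoun afterwards.
        subject = "It" if entity in mentioned else entity
        mentioned.add(entity)
        clauses = [f"{f.attribute} of {f.value}" for f in group]
        if len(clauses) > 1:
            body = ", ".join(clauses[:-1]) + " and " + clauses[-1]
        else:
            body = clauses[0]
        sentences.append(f"{subject} reported {body}.")
    return " ".join(sentences)


def generate(facts):
    """Run the stages in order, each consuming the previous stage's output."""
    return realise(aggregate(plan_discourse(determine_content(facts))))


facts = [
    Fact("Acme Corp", "revenue", "$12.4 million", 3),
    Fact("Acme Corp", "net profit", "$1.1 million", 2),
    Fact("Acme Corp", "office plant count", "17", 0),
]
print(generate(facts))
# Acme Corp reported revenue of $12.4 million and net profit of $1.1 million.
```

The low-importance fact is dropped during content determination, and the two remaining facts are merged into one sentence by aggregation. Every new domain would need its own templates and importance rules, which is the manual engineering cost associated with rule-based systems.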
Modern neural NLG models collapse several of these stages into a single end-to-end learned function: content determination, discourse planning, and surface realisation are performed implicitly by one network as it generates the output token by token, rather than by separate hand-built modules.
Neural Approaches
Contemporary NLG is dominated by large language models (LLMs) based on the Transformer decoder architecture. These models are pre-trained on massive text corpora using a next-token prediction objective, learning statistical regularities of language at scale. Fine-tuning or prompting then adapts the model to specific generation tasks.
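The next-token prediction objective can be illustrated with a deliberately tiny stand-in. The sketch below uses a bigram count model, not a Transformer, and the three-sentence corpus, the `<s>` and `</s>` boundary markers, and the function names are all assumptions made for illustration. It shows only the general idea: estimate a distribution over the next token given the context, then generate autoregressively by feeding each chosen token back in.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, prev):
    """Estimate P(next | prev) by normalising the observed counts."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}


def generate_greedy(counts, max_tokens=20):
    """Autoregressive decoding: repeatedly pick the most probable next token."""
    token, output = "<s>", []
    for _ in range(max_tokens):
        dist = next_token_distribution(counts, token)
        token = max(dist, key=dist.get)
        if token == "</s>":
            break
        output.append(token)
    return " ".join(output)


corpus = [
    "the model generates text",
    "the model predicts tokens",
    "a model generates text",
]
model = train_bigram(corpus)
print(next_token_distribution(model, "model"))  # "generates" gets 2/3, "predicts" 1/3
print(generate_greedy(model))                   # the model generates text
```

A large language model replaces the count table with a neural network conditioned on a much longer context, and greedy selection is often replaced by sampling, but the generate-one-token-then-repeat loop has the same shape.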
Sequence-to-sequence models with encoder-decoder architectures — such as BART and T5 — are widely used for conditional generation tasks where the output depends on a specific input, such as summarisation, translation, or data-to-text generation. The encoder processes the source input and the decoder generates the target text token by token, attending to relevant parts of the encoded representation.
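The way a decoder attends to the encoded input can be sketched with scaled dot-product attention for a single decoding step. This is a simplified illustration under stated assumptions: the vectors are made-up two-dimensional values, the encoder states serve as both keys and values, and the learned projection matrices and multiple heads used in real models are omitted.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, encoder_states):
    """Attend from one decoder query vector over all encoder states."""
    scale = math.sqrt(len(query))
    # Similarity of the query to each encoder state (scaled dot product).
    scores = [sum(q * k for q, k in zip(query, state)) / scale
              for state in encoder_states]
    weights = softmax(scores)
    # Context vector: weighted average of the encoder states.
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(len(query))]
    return weights, context


# Hypothetical encoder outputs for a three-token source sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_query = [2.0, 0.0]

weights, context = cross_attention(decoder_query, encoder_states)
print(weights)  # the first and third source positions receive the most weight
print(context)
```

The weights indicate which source positions the decoder draws on when producing the next target token. In a trained model they are recomputed at every decoding step.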
Instruction-tuned models such as GPT-4 and Claude respond to natural language prompts that specify the desired output format, style, and content constraints, making NLG accessible without specialised training pipelines.
Applications
NLG powers a wide range of commercial and research applications.
Automated journalism uses NLG to generate templated news articles from structured data such as financial earnings reports, sports scores, and weather forecasts. Companies such as Automated Insights and Narrative Science operate platforms that produce millions of such articles each week for wire services and corporate clients.
Business intelligence tools use NLG to convert data visualisations and dashboard metrics into executive summaries in plain English, making insights accessible to non-technical stakeholders. Chatbots and virtual assistants rely on NLG to formulate responses that are grammatically natural and tonally appropriate to the conversation context.
Clinical documentation in healthcare leverages NLG to generate discharge summaries, radiology reports, and patient letters from structured electronic health record data, reducing clinician documentation burden. Code generation, exemplified by systems such as GitHub Copilot, treats programming languages as a generation target, producing functional code from natural language specifications.
Evaluation
Evaluating NLG outputs is a persistent research challenge. Automatic metrics such as BLEU, ROUGE, and METEOR measure surface-level overlap between generated and reference text, but correlate imperfectly with human judgements of fluency, coherence, and factual accuracy. Newer metrics such as BERTScore compute semantic similarity in embedding space, partially addressing the limitation of n-gram overlap. Human evaluation remains the gold standard, assessing dimensions such as fluency, adequacy, and overall quality through crowd-sourced or expert annotation.
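To make the notion of surface-level overlap concrete, the sketch below computes the two quantities that underlie BLEU and ROUGE-N: clipped n-gram precision and n-gram recall against a single reference. It is a simplified illustration with hypothetical function names and example sentences. Full BLEU additionally combines several n-gram orders and applies a brevity penalty, and established implementations should be preferred for reported results.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=1):
    """BLEU-style modified precision: share of candidate n-grams found in the reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


def ngram_recall(candidate, reference, n=1):
    """ROUGE-N-style recall: share of reference n-grams found in the candidate."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

print(clipped_precision(candidate, reference, n=1))  # 5 of 6 unigrams match
print(clipped_precision(candidate, reference, n=2))  # 3 of 5 bigrams match
print(ngram_recall(candidate, reference, n=1))       # 5 of 6 reference unigrams covered
```

Because only exact token matches count, a paraphrase that preserves the meaning but shares few words with the reference would score poorly, which is the limitation that embedding-based metrics try to address.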
Factual consistency — ensuring that generated text accurately reflects the source input — has emerged as a critical dimension, particularly in healthcare and legal applications. Hallucination, where models generate plausible-sounding but incorrect content, remains an active area of research.
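One narrow aspect of factual consistency in data-to-text settings can be checked mechanically: whether every number in the generated text also appears in the source record. The sketch below is a crude heuristic written for illustration, not an established metric. The record fields, example sentences, and function name are hypothetical, and the check cannot detect errors that involve no numbers.

```python
import re

# Matches integers and simple decimals such as "54" or "12.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(generated_text, source_record):
    """Return numbers in the text that do not appear among the source values."""
    source_numbers = set()
    for value in source_record.values():
        source_numbers.update(NUMBER.findall(str(value)))
    return [n for n in NUMBER.findall(generated_text) if n not in source_numbers]


# Hypothetical structured input and two candidate outputs.
record = {"patient_age": 54, "systolic_bp": 128, "length_of_stay_days": 3}
faithful = "The 54-year-old patient was discharged after 3 days."
hallucinated = "The 54-year-old patient was discharged after 5 days."

print(unsupported_numbers(faithful, record))      # []
print(unsupported_numbers(hallucinated, record))  # ['5']
```

A check like this only flags candidates for review. It says nothing about whether a number is attached to the right entity or whether non-numeric claims are supported by the source.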
Challenges
Despite remarkable progress, NLG faces several open challenges. Controlling factual accuracy requires models to ground outputs in verifiable sources rather than in learned statistical associations. Maintaining long-document coherence (keeping a narrative internally consistent across hundreds of sentences) is difficult for autoregressive models with fixed context windows. Style control (generating text that matches a specified author voice, reading level, or cultural register) remains imprecise. Multilingual NLG for low-resource languages is constrained by the scarcity of training data.
References
- Gatt, A., & Krahmer, E. (2018). Survey of the State of the Art in Natural Language Generation: Core Tasks, Applications and Evaluation. Journal of Artificial Intelligence Research, 61, 65-170.
- Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.
- Brown, T., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33.
- Reiter, E., & Dale, R. (2000). Building Natural Language Generation Systems. Cambridge University Press.
- MDEC. (2024). MyDigital Blueprint: AI as a Foundation. Malaysia Digital Economy Corporation.