
Question Answering

Question answering is the natural language processing task of producing accurate answers to questions posed in natural language, often using information retrieval, reading comprehension, or large language models.

5 min read · Last updated May 2026 · Applications

Question answering (QA) is the natural language processing task of producing an accurate answer to a question expressed in natural language. Formulations range from short-span answer extraction over a single paragraph to multi-hop reasoning across many documents, structured query answering over databases, and conversational answering grounded in private corpora.

Subtypes

QA is commonly classified along two dimensions. The first is source format: textual QA reads passages or documents, knowledge-base QA queries structured triples, table QA reasons over tabular data, and visual QA reads images alongside text. The second is answer form: extractive QA returns a span from the source, multiple-choice QA selects an option, abstractive QA generates a free-form answer, and yes/no QA returns a binary judgement.

Open-domain QA answers questions over a large external corpus and typically combines retrieval with reading comprehension. Closed-book QA forces the model to answer from parametric knowledge alone, with no retrieval at inference time. Conversational QA maintains dialogue state and resolves references to earlier turns.

Reading comprehension and SQuAD

Extractive reading comprehension was popularised by the Stanford Question Answering Dataset (SQuAD), introduced in 2016. SQuAD pairs Wikipedia paragraphs with crowdsourced questions and answer spans. Subsequent datasets extended the task: SQuAD 2.0 added unanswerable questions, Natural Questions used real Google search queries, TriviaQA drew on trivia questions, HotpotQA targeted multi-hop reasoning, and DROP targeted discrete reasoning such as arithmetic and counting. By 2019, Transformer encoders such as BERT, RoBERTa, and ALBERT, fine-tuned to predict answer spans, had matched or exceeded the reported human baseline on SQuAD.
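As an illustration of span extraction, the sketch below picks the answer span that maximises the sum of a start score and an end score, which is a common way to decode extractive readers. The function name best_span, the max_len limit, and the score values are assumptions made up for this example; in a fine-tuned encoder the scores would come from the model's output layer.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_len: int = 15) -> Tuple[int, int]:
    """Return (start, end) token indices with the highest combined score.

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Hypothetical per-token scores for the passage tokens below.
tokens = ["SQuAD", "was", "introduced", "in", "2016", "."]
start_scores = [0.1, 0.0, 0.2, 0.3, 2.5, 0.0]
end_scores = [0.0, 0.1, 0.1, 0.2, 2.8, 0.4]

s, e = best_span(start_scores, end_scores)
answer = " ".join(tokens[s:e + 1])
print(answer)
```

The exhaustive search is quadratic in passage length, which is acceptable for paragraph-sized inputs; the max_len cap simply rules out implausibly long answers.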

Open-domain and retrieval-augmented QA

Open-domain QA decomposes into a retriever and a reader. Dense Passage Retrieval and downstream variants encode questions and passages into a shared vector space, with nearest-neighbour search returning relevant passages. The retrieved context is then passed to a reader, which may extract an answer span or generate an answer. Retrieval-augmented generation (RAG), introduced by Facebook AI Research in 2020, combined a dense retriever with a sequence-to-sequence generator that can be fine-tuned end to end. Production QA systems now commonly use RAG-style architectures backed by vector databases such as Pinecone, Weaviate, Qdrant, or Chroma.
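The retriever half of this pipeline can be sketched in a few lines. The example below is a toy under stated assumptions: embed is a normalised bag-of-words vector standing in for a learned dense encoder, the passages are invented, and retrieve performs brute-force nearest-neighbour search rather than the approximate search a vector database would use. None of the names correspond to a real library API.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy stand-in for a learned encoder: L2-normalised bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def dot(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Inner product of two sparse vectors (cosine similarity, since both are normalised)."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def retrieve(question: str, passages: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k passages most similar to the question, highest score first."""
    q_vec = embed(question)
    scored = [(dot(q_vec, embed(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_reader_input(question: str, contexts: List[str]) -> str:
    """Concatenate retrieved passages and the question for a downstream reader."""
    context_block = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"


passages = [
    "SQuAD pairs Wikipedia paragraphs with crowdsourced questions",
    "Dense retrieval encodes questions and passages into a shared vector space",
    "Exact match compares predicted spans to reference spans",
]
question = "how are questions and passages encoded for retrieval"

top = retrieve(question, passages, k=2)
reader_input = build_reader_input(question, [p for _, p in top])
print(reader_input)
```

In a real system the passage vectors would be computed once and stored in an index, so that only the question is encoded at query time.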

Large language models as QA systems

Large language models perform QA in zero-shot or few-shot settings without task-specific fine-tuning, drawing on parametric knowledge acquired during pretraining. Chain-of-thought prompting improves reasoning-heavy QA by eliciting intermediate steps. Tool use and function calling extend QA to live data sources, calculators, and code execution. Hybrid systems pair an LLM with retrieval, structured knowledge graphs, or specialised tools to balance recall, factuality, and freshness.
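Two of the techniques above, few-shot chain-of-thought prompting and tool use, can be sketched without committing to any particular model API. In the example below, build_few_shot_prompt, calculator, TOOLS, and run_tool_call are hypothetical helpers, and the dictionary passed to run_tool_call is an assumed shape for a model-emitted tool request, not the format of any specific provider.

```python
import ast
import operator
from typing import Callable, Dict, List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str, str]], question: str) -> str:
    """Format (question, reasoning, answer) demonstrations followed by the new question."""
    parts = [f"Q: {q}\nReasoning: {r}\nA: {a}" for q, r, a in examples]
    # Ending on "Reasoning:" invites the model to write intermediate steps first.
    parts.append(f"Q: {question}\nReasoning:")
    return "\n\n".join(parts)


_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression: str) -> float:
    """Evaluate +, -, *, / over numeric literals without using eval()."""
    def ev(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)


TOOLS: Dict[str, Callable[[str], float]] = {"calculator": calculator}


def run_tool_call(call: dict) -> float:
    """Dispatch an assumed tool request of the form {"name": ..., "arguments": {...}}."""
    return TOOLS[call["name"]](call["arguments"]["expression"])


examples = [("What is 2 + 3?", "Add 2 and 3 to get 5.", "5")]
prompt = build_few_shot_prompt(examples, "How many years separate 2016 and 2020?")
result = run_tool_call({"name": "calculator", "arguments": {"expression": "2020 - 2016"}})
print(prompt)
print(result)
```

Routing arithmetic to a tool instead of asking the model to compute it is one way such systems trade parametric recall for exactness.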

Evaluation

Extractive QA is typically scored with exact-match and token-level F1 against reference spans. Multiple-choice QA uses accuracy. Generative QA requires more nuanced evaluation: ROUGE and BLEU capture surface similarity, while learned metrics, natural language inference for entailment, and human ratings assess faithfulness and helpfulness. Benchmark suites such as MMLU, BIG-Bench, and GPQA probe broader knowledge and reasoning, while domain-specific benchmarks such as MedQA, BioASQ, and LegalBench evaluate professional QA.
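The two extractive metrics can be written down directly. The sketch below assumes a SQuAD-style normalisation (lower-casing, stripping punctuation and the articles a, an, and the, and collapsing whitespace); other benchmarks may normalise differently, and the function names are chosen for this example.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lower-case, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalised strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalisation."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("in Paris France", "Paris"))
```

When a question has several reference answers, a common convention is to take the maximum score over the references for each prediction and then average across questions.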

Common challenges

Hallucination, retrieval failure, multi-hop reasoning, temporal reasoning, ambiguity resolution, and adversarial robustness are persistent challenges. Faithfulness, meaning that every answer is supported by the retrieved evidence, is a central design objective for enterprise systems. Long-context, multilingual, and low-resource QA remain active research areas.

Applications

QA underpins consumer search experiences such as Google's AI Overviews and Bing's chat search, enterprise knowledge assistants over internal documentation, customer support bots, medical decision support, legal research tools, e-discovery, education platforms, and government service portals.

References

  1. Rajpurkar, P. et al. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. EMNLP.
  2. Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP.
  3. Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
  4. Bank Negara Malaysia. (2024). Discussion Paper on Use of Artificial Intelligence in Financial Services. bnm.gov.my.