AI Benchmarking

The systematic evaluation of AI systems using standardised datasets, tasks, and metrics to measure capability, compare models, and track progress across research and deployment contexts.

6 min read | Last updated June 2026 | Infrastructure

AI benchmarking is the practice of evaluating artificial intelligence systems using standardised datasets, tasks, and scoring metrics to measure their capabilities, compare them against other systems, and track the progress of the field over time. Benchmarks provide the empirical foundation for claims about AI capability, enable researchers and practitioners to select appropriate models for specific applications, and help safety researchers identify failure modes and limitations.

As AI systems have grown in scale and generality, benchmarking has evolved from narrow task-specific tests (accuracy on a particular dataset) to broad multi-task suites that probe reasoning, factual knowledge, coding ability, safety, and alignment. As of 2025, frontier model benchmarking encompasses hundreds of distinct evaluations, and the benchmark ecosystem itself has become a subject of methodological scrutiny.

Purpose and Importance

Benchmarks serve several distinct functions in the AI ecosystem. For research, they provide common points of comparison across papers and organisations, enabling the field to judge whether new techniques represent genuine improvements or merely dataset-specific optimisations. For model selection, practitioners compare scores on the benchmarks closest to their own workload when choosing a model for a specific application. For safety and alignment, benchmarks measure model behaviour on harmful content generation, instruction following, truthfulness, and consistency. For procurement and governance, enterprises and governments increasingly require benchmark disclosures when acquiring AI systems.

General Language Understanding

MMLU (Massive Multitask Language Understanding, Hendrycks et al., 2020) evaluates models across 57 subjects including mathematics, law, medicine, history, and computer science, drawn from academic examinations. MMLU became the most widely cited general knowledge benchmark for LLMs. MMLU-Pro (Wang et al., 2024) is a harder variant that uses 10-choice questions instead of 4; model accuracy drops by 16 to 33 percent relative to the original benchmark, which addresses the ceiling effects seen on MMLU.
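A multiple-choice benchmark of this kind is typically scored as plain accuracy, often broken down by subject. The following is a minimal sketch of that bookkeeping, not the official MMLU harness: the record keys (`subject`, `prediction`, `answer`) and both function names are hypothetical. It also shows why moving from 4 to 10 options matters: the random-guess baseline falls from 25 percent to 10 percent.

```python
from collections import defaultdict


def score_multiple_choice(records):
    """Return (overall accuracy, per-subject accuracy).

    Each record is a dict with the hypothetical keys
    'subject', 'prediction', and 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["subject"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["subject"]] += 1
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_subject


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_options


# Toy data, invented for illustration only.
records = [
    {"subject": "law", "prediction": "B", "answer": "B"},
    {"subject": "law", "prediction": "A", "answer": "C"},
    {"subject": "medicine", "prediction": "D", "answer": "D"},
    {"subject": "medicine", "prediction": "D", "answer": "D"},
]
overall, per_subject = score_multiple_choice(records)
print(overall, per_subject)
print(chance_accuracy(4), chance_accuracy(10))
```

Reporting the per-subject breakdown alongside the overall figure makes it easier to see whether a high score is spread evenly or carried by a few subjects.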

BIG-Bench (Srivastava et al., 2022) is a collaborative benchmark comprising over 200 tasks designed to probe capabilities beyond standard language understanding. BIG-Bench Hard selects the 23 most challenging tasks for evaluation.

HELM (Holistic Evaluation of Language Models, Liang et al., 2022) from Stanford assesses models across multiple scenarios and metrics simultaneously, including accuracy, calibration, robustness, fairness, and efficiency, providing a multi-dimensional profile rather than a single score.
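Calibration, one of the dimensions listed above, asks whether a model's stated confidence matches how often it is right. A common way to summarise it is expected calibration error (ECE). The sketch below uses equal-width confidence bins, which is one standard formulation and is not claimed to be HELM's exact implementation; the function name and the bin count of 10 are assumptions.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Equal-width-bin ECE.

    confidences: floats in [0, 1], the model's confidence per prediction.
    correct: booleans, whether each prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Fully confident but right only half the time: large calibration gap.
overconfident = expected_calibration_error([1.0] * 4, [True, False, True, False])
# 90 percent confident and right 9 times out of 10: gap close to zero.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
print(overconfident, well_calibrated)
```

Two models with identical accuracy can differ sharply on this number, which is the kind of distinction a single-score leaderboard hides.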

Coding and Mathematical Benchmarks

HumanEval (Chen et al., 2021) evaluates code generation through 164 Python programming problems, testing functional correctness using unit tests. It remains the most widely cited coding benchmark. MBPP (Mostly Basic Python Programs) and LiveCodeBench, which sources problems from competitive programming contests to avoid contamination, provide complementary evaluations.
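Functional-correctness results on HumanEval are usually reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. The article does not name the metric, so treat this as supplementary context. The sketch below follows the unbiased estimator described in Chen et al. (2021), where n samples are drawn per problem and c of them pass; the function name is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all unit tests
    k: sample budget being evaluated (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset contains no passing sample.
    # math.comb returns 0 when fewer than k failing samples exist.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


# Illustrative numbers only: 200 samples, 20 of which pass.
print(pass_at_k(200, 20, 1))
print(pass_at_k(200, 20, 10))
```

The benchmark-level score is the mean of this quantity over all problems. Drawing n larger than k and applying the estimator gives a lower-variance figure than sampling exactly k solutions once.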

GSM8K (Grade School Math, Cobbe et al., 2021) contains 8,500 grade-school mathematics problems requiring multi-step reasoning. MATH (Hendrycks et al., 2021) is a harder dataset of competition-level mathematics.
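Scoring free-form mathematical answers requires extracting a final answer before comparing it with the reference. GSM8K reference solutions end with a line of the form `#### <answer>`. The sketch below assumes the model's answer is the last number in its output, which is a common heuristic rather than an official rule; all function names are hypothetical.

```python
import re

# Integers or decimals, optionally negative, with optional thousands commas.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalise(text):
    """Strip whitespace, thousands separators, and currency signs."""
    return text.strip().replace(",", "").replace("$", "")


def reference_answer(solution):
    """Read the value after the final '####' marker of a reference solution."""
    return normalise(solution.split("####")[-1])


def last_number(text):
    """Return the last number appearing in the text, or None if there is none."""
    matches = NUMBER.findall(text)
    return normalise(matches[-1]) if matches else None


def is_correct(model_output, solution):
    """Compare the extracted prediction with the reference numerically."""
    prediction = last_number(model_output)
    if prediction is None:
        return False
    return float(prediction) == float(reference_answer(solution))


solution = "She sold 48 / 2 = 24 clips in May.\n48 + 24 = 72 clips in total.\n#### 72"
print(is_correct("Adding both months gives 48 + 24 = 72.", solution))
print(is_correct("The total is 70 clips.", solution))
```

Extraction rules like this one can change reported scores on their own, so comparisons between models are only meaningful when the same rule is applied to each.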

AIME (American Invitational Mathematics Examination) problems have become the standard frontier benchmark for mathematical reasoning as of 2025. Qwen3.5-plus scored 91.3 percent on AIME 2026, and GPT-5.3 Codex achieved 94 percent on AIME 2025, reflecting the dramatic capability gains of reasoning models trained with reinforcement learning.

Reasoning, Knowledge, and Safety

ARC (AI2 Reasoning Challenge) consists of science questions divided into an Easy set and a harder Challenge set. HellaSwag evaluates commonsense natural language inference. TruthfulQA measures the tendency of models to generate truthful rather than plausible-sounding false answers. WinoGrande tests commonsense reasoning via pronoun resolution.

Humanity's Last Exam (HLE, 2025) is a crowdsourced benchmark of over 3,000 expert-level questions across diverse academic disciplines, designed to be unsolvable by generalist AI and targeting the frontier of model capability. Safety-focused benchmarks such as HarmBench, SORRY-Bench, and WildGuard probe model behaviour on harmful content generation and instruction following in adversarial settings.

Multimodal Benchmarks

MMBench, MMMU (Massive Multidiscipline Multimodal Understanding), and MMStar evaluate vision-language models across image understanding, chart interpretation, and scientific reasoning. DocVQA and TextVQA test reading comprehension within document images. Video-MME assesses understanding of video content.

Benchmark Contamination and Limitations

A critical methodological concern is benchmark contamination: if a model's training data contains the questions or answers from a benchmark, its reported performance overstates true capability. As benchmarks are published, their problems gradually appear in internet crawls and subsequent training sets, inflating scores over time. Dynamic or continually updated benchmarks, such as LiveCodeBench and Chatbot Arena (which uses real user preferences rather than fixed test sets), partially address this problem.
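One simple heuristic for spotting contamination is to measure how many word n-grams from a benchmark item also occur verbatim in a training document. The sketch below is illustrative only: the function names, the whitespace tokenisation, and the window size are assumptions, and real audits work at corpus scale with more careful normalisation.

```python
def ngrams(text, n):
    """Set of lower-cased word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, corpus_document, n=8):
    """Fraction of the item's n-grams that appear verbatim in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item shorter than n tokens: nothing to compare.
        return 0.0
    shared = item_grams & ngrams(corpus_document, n)
    return len(shared) / len(item_grams)


# Invented example texts.
question = "What is the boiling point of water at sea level in degrees Celsius"
leaked_page = "Quiz dump: " + question + " Answer: 100"
unrelated_page = "The committee met on Tuesday to discuss the annual budget for road repairs"

print(overlap_fraction(question, leaked_page))
print(overlap_fraction(question, unrelated_page))
```

A high overlap flags an item for closer inspection; a low overlap does not prove the item is clean, because paraphrased or translated copies escape exact matching.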

Goodhart's Law applies acutely to AI benchmarking: as a benchmark becomes a target for optimisation, it ceases to be a good measure of the underlying capability it was designed to probe. Benchmark scores also fail to capture deployment-relevant properties such as latency, cost, context length handling, and safety behaviour across diverse user populations.

References

  1. Hendrycks, D., et al. (2020). Measuring massive multitask language understanding. arXiv:2009.03300.
  2. Chen, M., et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374. OpenAI.
  3. Liang, P., et al. (2022). Holistic evaluation of language models. arXiv:2211.09110. Stanford.
  4. Wang, Y., et al. (2024). MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. NeurIPS 2024.
  5. Analytics Vidhya. (2026). Guide to AI benchmarks: MMLU, HumanEval, and more explained. analyticsvidhya.com.