What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

AI Benchmarking

The systematic evaluation of AI systems using standardised datasets, tasks, and metrics to measure capability, compare models, and track progress across research and deployment contexts.

6 min read · Last updated June 2026 · Infrastructure

AI benchmarking is the practice of evaluating artificial intelligence systems using standardised datasets, tasks, and scoring metrics to measure their capabilities, compare them against other systems, and track the progress of the field over time. Benchmarks provide the empirical foundation for claims about AI capability, enable researchers and practitioners to select appropriate models for specific applications, and help safety researchers identify failure modes and limitations.

As AI systems have grown in scale and generality, benchmarking has evolved from narrow task-specific tests (accuracy on a particular dataset) to broad multi-task suites that probe reasoning, factual knowledge, coding ability, safety, and alignment. By 2025, frontier model benchmarking encompassed hundreds of distinct evaluations, and the benchmark ecosystem itself had become a subject of methodological scrutiny.

Purpose and Importance

Benchmarks serve several distinct functions in the AI ecosystem. For research, benchmarks provide common points of comparison across papers and organisations, enabling the field to measure whether new techniques represent genuine improvements or merely dataset-specific optimisations. For model selection, practitioners use benchmarks to choose models for specific applications. For safety and alignment, benchmarks measure model behaviour on harmful content generation, instruction following, truthfulness, and consistency. For procurement and governance, enterprises and governments increasingly require benchmark disclosures when acquiring AI systems.

General Language Understanding

MMLU (Massive Multitask Language Understanding, Hendrycks et al., 2020) evaluates models across 57 subjects including mathematics, law, medicine, history, and computer science, drawn from academic examinations. MMLU became the most widely cited general knowledge benchmark for LLMs. MMLU-Pro (Wang et al., 2024) is a harder variant with 10-choice questions instead of 4, causing a 16-33 percent accuracy drop compared to the original benchmark and addressing ceiling effects.
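A minimal sketch of how a multiple-choice benchmark of this kind can be scored, assuming each item has a gold option index and the model returns a predicted index. The function names and toy data are hypothetical and are not taken from the MMLU or MMLU-Pro tooling. The second function shows one reason the 10-option format is harder: uniform random guessing drops from 25 percent to 10 percent expected accuracy.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of items where the predicted option index equals the gold index."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("at least one item is required")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return 1.0 / num_options


# Toy data (made up for illustration): 5 items, 4 answered correctly.
gold_answers = [0, 2, 1, 3, 2]
model_answers = [0, 2, 1, 0, 2]
print(accuracy(model_answers, gold_answers))    # 0.8
print(chance_accuracy(4), chance_accuracy(10))  # 0.25 0.1
```

Reported scores are most informative when read against the chance baseline for the format, which is why a 4-option score and a 10-option score are not directly comparable.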

BIG-Bench (Srivastava et al., 2022) is a collaborative benchmark comprising over 200 tasks designed to probe capabilities beyond standard language understanding. BIG-Bench Hard (BBH) narrows this to a subset of 23 particularly challenging tasks.

HELM (Holistic Evaluation of Language Models, Liang et al., 2022) from Stanford assesses models across multiple scenarios and metrics simultaneously, including accuracy, calibration, robustness, fairness, and efficiency, providing a multi-dimensional profile rather than a single score.
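To make the calibration dimension concrete, the sketch below computes expected calibration error (ECE), one common way to quantify how well a model's stated confidence matches its accuracy. This is an illustration under assumptions, not HELM's own implementation: the function name, the equal-width binning, and the default of 10 bins are choices made here.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], num_bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        acc = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - acc)
    return ece


# A model that says "90% sure" but is right only half the time is overconfident.
print(expected_calibration_error([0.9, 0.9], [True, False]))
```

A value near 0 means confidence tracks accuracy closely. Two models with identical accuracy can differ substantially on this measure, which is the motivation for reporting a profile rather than a single score.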

Coding and Mathematical Benchmarks

HumanEval (Chen et al., 2021) evaluates code generation through 164 Python programming problems, testing functional correctness using unit tests. It remains the most widely cited coding benchmark. MBPP (Mostly Basic Python Programs) and LiveCodeBench — which sources problems from competitive programming contests to avoid contamination — provide complementary evaluations.
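Functional correctness is usually summarised with the pass@k metric introduced by Chen et al. (2021): generate n samples per problem, count the c samples that pass every unit test, and estimate the probability that at least one of k samples would pass. The sketch below shows that estimator alongside a deliberately simplified test runner. The names `passes_tests` and `pass_at_k` and the toy problem are hypothetical, and a real harness would run untrusted code in a sandbox with timeouts rather than calling `exec` directly.

```python
from math import comb


def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Run a candidate solution, then its unit tests, in a fresh namespace.

    Simplified for illustration only: no sandboxing and no timeout.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)
        exec(test_source, namespace)
    except Exception:
        # Any error, including a failed assert, counts as a failed sample.
        return False
    return True


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, so pass@k is 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy problem (made up): two sampled completions, one correct and one not.
tests = "assert add(2, 3) == 5\n"
samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",
]
num_passing = sum(passes_tests(s, tests) for s in samples)
print(num_passing, pass_at_k(n=len(samples), c=num_passing, k=1))
```

With k = 1 the estimator reduces to the fraction of samples that pass, so pass@1 is directly comparable to ordinary accuracy.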

GSM8K (Grade School Math, Cobbe et al., 2021) contains 8,500 grade-school mathematics problems requiring multi-step reasoning. MATH (Hendrycks et al., 2021) is a harder dataset of competition-level mathematics.
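Scoring on this kind of benchmark typically comes down to comparing the model's final numeric answer with the reference. GSM8K reference solutions end with a line of the form `#### <answer>`. The sketch below assumes that format and uses a simple heuristic (take the last number in the model's output), which is a choice made here for illustration and not an official scoring script. All function names are hypothetical.

```python
import re
from typing import Optional

# Optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Return the text after the final '####' marker in a reference solution."""
    return solution.split("####")[-1].strip()


def last_number(text: str) -> Optional[str]:
    """Return the last number in the text with thousands separators removed."""
    matches = _NUMBER.findall(text)
    return matches[-1].replace(",", "") if matches else None


def is_correct(model_output: str, solution: str) -> bool:
    """Compare the last number in the model output with the reference answer."""
    predicted = last_number(model_output)
    expected = last_number(reference_answer(solution))
    if predicted is None or expected is None:
        return False
    return float(predicted) == float(expected)


# Toy item (made up) in the GSM8K-style reference format.
reference = "She sells 9 eggs at $2 each. 9 * 2 = 18\n#### 18"
print(is_correct("9 eggs times $2 is 18 dollars.", reference))  # True
print(is_correct("I am not sure.", reference))                   # False
```

Answer extraction is a frequent source of score differences between evaluation harnesses, so reported results are easiest to compare when the extraction rule is stated alongside the number.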

AIME (American Invitational Mathematics Examination) problems have become the standard frontier benchmark for mathematical reasoning as of 2025. Qwen3.5-plus scored 91.3 percent on AIME 2026, and GPT-5.3 Codex achieved 94 percent on AIME 2025, reflecting the dramatic capability gains of reasoning models trained with reinforcement learning.

Reasoning, Knowledge, and Safety

ARC (AI2 Reasoning Challenge) consists of science questions divided into an Easy set and a harder Challenge set. HellaSwag evaluates commonsense natural language inference. TruthfulQA measures the tendency of models to generate truthful rather than plausible-sounding false answers. WinoGrande tests commonsense reasoning via pronoun resolution.

Humanity's Last Exam (HLE, 2025) is a crowdsourced benchmark of over 3,000 expert-level questions across diverse academic disciplines, designed to be unsolvable by generalist AI and targeting the frontier of model capability. Safety-focused benchmarks such as HarmBench, SORRY-Bench, and WildGuard probe model behaviour on harmful content generation and instruction following in adversarial settings.

Multimodal Benchmarks

MMBench, MMMU (Massive Multidiscipline Multimodal Understanding), and MMStar evaluate vision-language models across image understanding, chart interpretation, and scientific reasoning. DocVQA and TextVQA test reading comprehension within document images. Video-MME assesses understanding of video content.

Benchmark Contamination and Limitations

A critical methodological concern is benchmark contamination: if a model's training data contains the questions or answers from a benchmark, its reported performance overstates true capability. As benchmarks are published, their problems gradually appear in internet crawls and subsequent training sets, inflating scores over time. Dynamic or continually updated benchmarks — such as LiveCodeBench and Chatbot Arena (which uses real user preferences rather than fixed test sets) — partially address this problem.
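One simple way to screen for contamination is to check how many of a benchmark item's word n-grams also appear in the training corpus. The sketch below is a minimal illustration under assumptions: whitespace tokenisation, lowercasing, and the default window of 8 words are choices made here, and the function names are hypothetical. Production checks usually work over far larger corpora with hashing or indexing.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams from whitespace-tokenised text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Share of the item's n-grams that also occur in any corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n words, so there is nothing to compare.
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy data (made up): the first document quotes the question verbatim.
question = "what is the capital city of malaysia"
crawl = [
    "Quiz: What is the capital city of Malaysia ? Answer below",
    "A recipe for nasi lemak with sambal and fried anchovies",
]
print(overlap_ratio(question, crawl, n=3))      # 1.0
print(overlap_ratio(question, crawl[1:], n=3))  # 0.0
```

A high ratio flags an item for manual review rather than proving contamination, and a low ratio does not rule out paraphrased or translated copies of the item.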

Goodhart's Law applies acutely to AI benchmarking: as a benchmark becomes a target for optimisation, it ceases to be a good measure of the underlying capability it was designed to probe. Benchmark scores also fail to capture deployment-relevant properties such as latency, cost, context length handling, and safety behaviour across diverse user populations.

Malaysian Context — Benchmark Literacy and AI Procurement

AI benchmarking is becoming relevant to Malaysian enterprises and government agencies as they evaluate and procure AI systems. The National AI Office Malaysia and MDEC have begun developing guidelines for responsible AI procurement, with benchmark transparency — requiring vendors to disclose evaluation results on standard tests — as an emerging expectation for public sector AI deployments.

Malaysian universities, including Universiti Malaya (UM), Universiti Teknologi Malaysia (UTM), and Universiti Putra Malaysia (UPM), publish research evaluating model performance on benchmarks relevant to Bahasa Malaysia and Southeast Asian contexts, including translation quality on Malay-English pairs and factual accuracy about Malaysian geography, law, and culture. Existing global benchmarks such as MMLU do not adequately test knowledge relevant to Malaysian legal, regulatory, and cultural contexts, motivating the development of localised evaluation sets.

MDEC's AI for Industry programme, which targets adoption in manufacturing, logistics, and financial services, is beginning to incorporate benchmark requirements into AI vendor qualification criteria. Vendors seeking to deploy AI solutions in regulated sectors — banking, healthcare, insurance — are expected to provide evidence of performance on relevant capability and safety benchmarks as part of their compliance documentation submitted to BNM, SC, and the Ministry of Health.

The ASEAN AI governance context is also relevant: the ASEAN Guide on AI Governance and Ethics (2024) encourages member states to develop shared evaluation frameworks for AI systems deployed across the region. Malaysia's participation in these multilateral discussions, through the National AI Office and MDEC, positions the country to influence regional benchmark standards that reflect Southeast Asian linguistic, cultural, and regulatory diversity.

For Malaysian AI startups seeking to commercialise models or AI-powered products, credible benchmark results are increasingly important signals for investors, enterprise buyers, and government procurement officers. Organisations such as Cradle Fund and MDEC's Malaysia Tech Entrepreneur Programme (MTEP) are beginning to ask portfolio companies to report standard benchmark performance as part of technology due diligence.

References

Hendrycks, D., et al. (2020). Measuring massive multitask language understanding. arXiv:2009.03300.
Chen, M., et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374. OpenAI.
Liang, P., et al. (2022). Holistic evaluation of language models. arXiv:2211.09110. Stanford.
Wang, Y., et al. (2024). MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. NeurIPS 2024.
Analytics Vidhya. (2026). Guide to AI benchmarks: MMLU, HumanEval, and more explained. analyticsvidhya.com.

Tags: AI benchmark, MMLU, HumanEval, LLM evaluation, model evaluation

Type	Evaluation methodology
Key benchmarks	MMLU, HumanEval, GSM8K, HELM, BIG-Bench, AIME
Maintained by	Academic labs, Hugging Face, EleutherAI, NIST
Key use	Model comparison, capability tracking, safety evaluation
Related	Large language models, reasoning models, AI safety, model serving
