What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

Instruction Tuning

Instruction tuning is a supervised fine-tuning technique that trains large language models on datasets of instruction-response pairs, enabling models to follow natural language directions and generalise to unseen tasks in a zero-shot or few-shot setting.

7 min read | Last updated June 2026 | Foundations

Instruction tuning (also called instruction fine-tuning, and often referred to simply as supervised fine-tuning, or SFT) is a technique for adapting pre-trained large language models to follow natural language instructions. A pre-trained language model, trained on vast amounts of internet text to predict the next token, does not automatically follow instructions: it is more likely to continue a prompt in the style of its training data than to answer a question or complete a directed task. Instruction tuning addresses this by further training the model on a curated dataset of (instruction, output) pairs spanning a wide range of tasks, such as question answering, summarisation, translation, code generation, and classification, so that the model learns to interpret directives and respond to them appropriately. Instruction-tuned models often generalise to instructions they did not see during training, a property known as zero-shot task generalisation.

Background

The importance of instruction following as a distinct capability became clear in 2021-2022, as researchers observed that even very large pre-trained language models, when given a directive such as "Summarise the following article", would sometimes generate additional similar-looking articles rather than a summary. The model was treating the directive as text to be continued rather than as a task to be carried out. This motivated several research directions that converged on instruction tuning as a practical solution.

Early foundational work includes FLAN (Finetuned Language Net), from Google, which fine-tuned a 137-billion-parameter language model on a mixture of more than 60 NLP benchmark datasets reframed as natural language instruction templates. FLAN demonstrated that this multi-task instruction fine-tuning substantially improved zero-shot performance on task types held out from training, establishing the paradigm. InstructGPT, from OpenAI, combined supervised fine-tuning on human-written demonstrations with reinforcement learning from human feedback (RLHF). In human evaluations, outputs from the 1.3-billion-parameter InstructGPT model were preferred over outputs from the 175-billion-parameter GPT-3 model, which had not undergone instruction tuning.

Training Process

Instruction tuning follows a standard supervised learning setup. The training data consists of examples where each input is a natural language instruction (optionally combined with context such as a document to summarise or a code snippet to debug) and each output is the desired response. The model is trained to maximise the log-likelihood of the target response given the instruction-plus-context input, using the same cross-entropy loss used in standard language model pre-training.
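The objective described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a training implementation: the function names (log_softmax, sft_loss), the toy logits, and the loss_mask convention (1 for response tokens, 0 for instruction and context tokens) are assumptions introduced here to show that only the response tokens contribute to the loss.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sft_loss(logits_per_position, target_ids, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1.

    logits_per_position[t] holds the model's scores for the token at position t,
    target_ids[t] is the token that actually appears there, and loss_mask[t]
    is 1 for response tokens and 0 for instruction/context tokens.
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_position, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    if count == 0:
        raise ValueError("loss_mask selects no response tokens")
    return total / count

# Toy example: vocabulary of 3 tokens, 2 instruction positions, 2 response positions.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
targets = [0, 1, 2, 0]
mask = [0, 0, 1, 1]
loss = sft_loss(logits, targets, mask)
```

Because the first two positions are masked out, changing the logits at those positions leaves the loss unchanged. This is one way of expressing "the log-likelihood of the target response given the instruction" as a per-token cross-entropy.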

Data quality and diversity are among the most important factors in instruction tuning. High-quality instruction datasets cover diverse task types, domains, and instruction phrasings, preventing the model from learning to respond only to a narrow style of prompt. Diversity in instruction complexity and length also matters: models tuned only on simple instructions may struggle with complex multi-step directives. Several studies have reported that a smaller set of high-quality, diverse examples can produce better instruction-following behaviour than a larger set of low-quality or repetitive examples.
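As a rough illustration of the curation concerns above, the sketch below removes trivially repeated instructions and counts examples per task type. The record fields ("task", "instruction", "output") and the helper names are hypothetical, and exact matching on normalised text is only a simple stand-in for the near-duplicate detection a real pipeline would likely need.

```python
import re
from collections import Counter

def normalise(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())

def deduplicate(examples):
    """Keep the first example seen for each normalised instruction."""
    seen, kept = set(), []
    for ex in examples:
        key = normalise(ex["instruction"])
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

def task_counts(examples):
    """Count examples per task type to reveal over-represented tasks."""
    return Counter(ex["task"] for ex in examples)

examples = [
    {"task": "summarisation", "instruction": "Summarise the following article.", "output": "A short summary."},
    {"task": "summarisation", "instruction": "summarise the following article", "output": "Another summary."},
    {"task": "translation", "instruction": "Translate this sentence into Malay.", "output": "Ayat contoh."},
]
kept = deduplicate(examples)
counts = task_counts(kept)
```

Here the two summarisation instructions differ only in case and punctuation, so one is dropped, leaving one example per task type.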

Data Sources and Curation

Instruction tuning datasets have been assembled through several approaches. Human-written demonstrations from domain experts or crowdworkers (as used in InstructGPT) provide high quality but are expensive to produce at scale. Self-instruct methods use the model itself or a more capable teacher model to generate instruction-response pairs from a small seed set, dramatically reducing annotation cost. Stanford Alpaca used this approach with a stronger teacher model, and WizardLM extended it by using a language model to rewrite instructions into progressively more complex variants (Evol-Instruct). Vicuna took a different route, fine-tuning on user-shared conversations with ChatGPT. Flan-style datasets convert existing NLP benchmarks into instruction format by writing natural language templates for each task. By 2025, large-scale community-curated instruction datasets such as OpenHermes, Tulu-3, and various Alpaca-format corpora were publicly available and widely used for fine-tuning open-weight models.
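The Flan-style conversion mentioned above can be illustrated with a small sketch. The templates, label names, and record fields below are invented for illustration and are not taken from the FLAN collection. The point is only that one labelled benchmark record can yield several instruction-response pairs with different phrasings.

```python
# Hypothetical templates for a binary sentiment classification benchmark.
TEMPLATES = [
    "Is the sentiment of the following review positive or negative?\n\n{text}",
    "Review: {text}\n\nClassify this review as positive or negative.",
]
LABEL_NAMES = {0: "negative", 1: "positive"}

def to_instruction_pairs(record):
    """Render one labelled record through every template."""
    return [
        {
            "instruction": template.format(text=record["text"]),
            "output": LABEL_NAMES[record["label"]],
        }
        for template in TEMPLATES
    ]

record = {"text": "The battery lasts all day.", "label": 1}
pairs = to_instruction_pairs(record)
```

Using several templates per task is one way to obtain the variety in instruction phrasing that helps a model avoid overfitting to a single prompt style.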

Relation to RLHF and DPO

Instruction tuning is typically the first stage of a multi-stage alignment pipeline. After instruction tuning produces a capable instruction-following model (the SFT model), a second alignment stage, using RLHF or Direct Preference Optimization (DPO), refines the model's responses to match human preferences more closely in terms of helpfulness, harmlessness, and honesty. The SFT model serves as the reference model for DPO training and as the starting policy for RLHF. Instruction tuning and preference alignment are therefore complementary: instruction tuning instils task-following capability, while preference alignment shapes the style, tone, and safety of responses.
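To make the role of the SFT model as a DPO reference concrete, the sketch below evaluates the standard DPO loss for a single preference pair. The inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under the frozen SFT reference. The function name, the example numbers, and beta = 0.1 are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given response log-probabilities.

    The margin measures how much more the policy favours the chosen response
    over the rejected one, relative to the frozen SFT reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed in a numerically stable way.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# At the start of DPO training the policy is a copy of the SFT model,
# so the margin is zero and the loss equals log(2).
initial = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After the policy has shifted probability towards the chosen response,
# the loss is lower.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

The reference terms keep the policy from being rewarded merely for assigning high probability to both responses. Only its movement relative to the SFT model counts.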

Impact

Instruction tuning transformed large language models from raw text continuation engines into practical assistants. ChatGPT, Claude, Gemini, and virtually every other modern conversational AI system rely on some form of instruction tuning as a foundational step. The technique has also been applied to domain-specific models in medical, legal, coding, and scientific domains to instil both instruction following and domain-specific knowledge simultaneously. Parameter-efficient fine-tuning methods such as LoRA and QLoRA have substantially lowered the hardware requirements, in some cases to a single consumer-grade GPU, enabling organisations to fine-tune large models on their own instruction datasets with modest GPU resources.
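A short calculation shows why LoRA reduces the resources needed. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in instead. The layer size (4096 x 4096) and rank (8) below are illustrative values chosen here, not figures from any particular model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora): trainable parameters for one weight matrix
    under full fine-tuning versus a rank-`rank` LoRA adapter."""
    full = d_out * d_in              # every entry of the weight matrix is trained
    lora = rank * (d_out + d_in)     # only the two low-rank factors are trained
    return full, lora

full, lora = lora_param_counts(4096, 4096, 8)
ratio = lora / full
# full == 16777216, lora == 65536, ratio == 0.00390625 (1/256)
```

Under these assumed dimensions the adapter trains about 0.4% of the parameters of the original matrix, which reduces the memory needed for gradients and optimiser state accordingly.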

Malaysian Context — Instruction Tuning for Bahasa Malaysia and Local Domains

Instruction tuning is directly applicable to the challenge of building AI assistants that work fluently in Bahasa Malaysia and respond appropriately to Malaysian regulatory and cultural contexts. While major commercially available models such as GPT-4, Claude, and Gemini have been instruction-tuned on English-dominant datasets, their instruction-following capabilities in Bahasa Malaysia, Malaysian English, and local dialects are uneven. Malaysian AI practitioners and researchers looking to improve model performance in these languages commonly apply additional instruction tuning on locally curated datasets.

Initiatives under the MyDigital Blueprint and MDEC's AI development programmes have encouraged the creation of Bahasa Malaysia NLP datasets, some of which can be reformulated into instruction tuning format. Government agencies that are exploring AI assistants for public service delivery, such as MAMPU (Malaysian Administrative Modernisation and Management Planning Unit), have a direct interest in instruction-tuned models that understand Malaysian administrative context, legislation, and public service procedures.

Malaysian financial institutions including Maybank, CIMB, and Bank Islam have explored instruction tuning of LLMs on banking domain datasets to produce financial AI assistants that follow product-specific and regulatory-specific instructions consistently. These fine-tuned models are subject to BNM's model risk management guidelines, which require validation and monitoring of model behaviour before and during deployment.

HRD Corp-funded training programmes at Malaysian AI bootcamps and universities increasingly cover instruction tuning as a practical fine-tuning skill, reflecting demand from companies looking to adapt open-weight models such as Llama and Mistral to Malaysian use cases. Multimedia University (MMU) and Universiti Teknologi Malaysia have published research on instruction tuning of multilingual models for Southeast Asian languages, contributing to the regional pool of instruction-tuned models covering Bahasa Malaysia and related languages.

References

Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2022). Finetuned Language Models are Zero-Shot Learners. Proceedings of ICLR 2022. arXiv:2109.01652.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems 35 (NeurIPS 2022). arXiv:2203.02155.
Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., & Hajishirzi, H. (2023). Self-Instruct: Aligning Language Models with Self-Generated Instructions. Proceedings of ACL 2023. arXiv:2212.10560.
IBM. (2024). What Is Instruction Tuning? IBM Think. https://www.ibm.com/think/topics/instruction-tuning

Tags: fine-tuning, nlp, alignment, supervised-learning

Also known as	Supervised fine-tuning (SFT), instruction fine-tuning
Type	Supervised training technique
Key models	FLAN, InstructGPT, Alpaca, Vicuna, Tulu
Training data	Instruction-response pairs across diverse tasks
Key capability	Zero-shot generalisation to unseen tasks
Related	RLHF, Fine-tuning, Direct Preference Optimization, Few-shot Learning
