What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

Parameter-Efficient Fine-Tuning

A family of techniques that adapts a pretrained language or vision model to a downstream task by training only a small fraction of its parameters, dramatically reducing compute, memory, and storage requirements compared to full fine-tuning.

5 min read | Last updated May 2026 | Infrastructure

Parameter-Efficient Fine-Tuning (PEFT) is a collection of techniques that adapts a large pretrained model to a specific downstream task or domain by training only a small subset of new or existing parameters, while keeping the vast majority of the base model frozen. PEFT methods have become the dominant approach to customising large language models, vision models, and multimodal systems because they cut the memory and storage costs of fine-tuning by an order of magnitude or more while preserving most of the quality of a full fine-tune.

Motivation

Full fine-tuning updates every parameter of the base model. For modern language models with tens or hundreds of billions of parameters, this requires hundreds of gigabytes of GPU memory to hold the weights, gradients, optimiser states, and activations. It also produces a complete copy of the model for every task, which makes storing and serving many task-specific models expensive. PEFT methods address all three problems: they cut training memory by 10x to 20x, they shrink the stored artefact for each adapted model (often to megabytes instead of hundreds of gigabytes), and they make it possible to serve many task-specific variants from a single shared base.

Core methods

LoRA (Low-Rank Adaptation)

LoRA is the most widely deployed PEFT method. It rests on the hypothesis, supported empirically by Hu et al. (2021), that the weight update produced during fine-tuning has a low intrinsic rank. Instead of updating the original weight matrix W (of shape d x k) directly, LoRA learns two small matrices, B (d x r) and A (r x k), such that the update is approximated as delta_W = B * A, where the rank r is much smaller than both d and k. The original weights are frozen, and only A and B are trained. After training, the product can be merged back into the base weights (W + delta_W), so inference incurs no additional latency.
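The low-rank update and the merge step can be illustrated with a minimal NumPy sketch. It follows the notation of Hu et al. (2021), where B is d x r and A is r x k, and it uses that paper's alpha / r scaling and zero initialisation of B. The matrix sizes, the random seed, and the "pretend training" step are assumptions chosen only for illustration; no real model or training loop is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4      # illustrative sizes; r is the LoRA rank
alpha = 8                # LoRA scaling hyperparameter
scale = alpha / r

W = rng.normal(size=(d, k))               # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))   # trainable, random initialisation
B = np.zeros((d, r))                      # trainable, zero init so delta_W starts at 0


def lora_forward(x, W, A, B, scale):
    """Apply the base weight plus the low-rank update without forming B @ A."""
    return x @ W.T + scale * (x @ A.T) @ B.T


# Stand-in for training: move B away from zero (assumption, not real training).
B = rng.normal(scale=0.01, size=(d, r))

x = rng.normal(size=(5, k))               # batch of 5 input vectors
y_adapter = lora_forward(x, W, A, B, scale)

# Merge the update into the base weight for deployment.
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_adapter, y_merged))   # merged and unmerged paths agree
print(np.linalg.matrix_rank(B @ A))       # rank of the update is at most r
```

Keeping A and B separate during training means gradients are only needed for r * (d + k) values, while merging afterwards restores a single dense matrix of the original shape.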

LoRA typically reduces trainable parameter count by 90% or more, frequently to under 1% of the original model. QLoRA combines LoRA with 4-bit quantisation of the frozen base weights; Dettmers et al. (2023) report fine-tuning a 65-billion-parameter model on a single 48 GB GPU, and smaller models fit on consumer GPUs.
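The size of the reduction follows from simple arithmetic: a d x k matrix has d * k parameters, while its LoRA factors have r * (d + k). The sketch below works this out for a hypothetical 4096 x 4096 projection matrix; the dimensions and the ranks tried are illustrative assumptions, not figures from a specific model.

```python
def lora_trainable_fraction(d: int, k: int, r: int) -> float:
    """Fraction of a d x k weight matrix's parameter count that LoRA trains at rank r."""
    full = d * k
    lora = r * (d + k)   # B is d x r, A is r x k
    return lora / full


for rank in (4, 8, 16, 64):
    frac = lora_trainable_fraction(4096, 4096, rank)
    print(f"rank {rank:>2}: {frac:.2%} of the matrix's parameters")
```

The fraction for a whole model is usually lower still, because LoRA is normally attached to only some of the weight matrices.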

Adapters

Adapter modules were among the earliest PEFT techniques. They insert small bottleneck feed-forward modules inside each transformer layer, typically after the attention and feed-forward sub-layers, and only these modules are trained. Because adapters sit in the forward path and cannot be merged into the base weights, they add a small inference-time cost. They have largely been superseded by LoRA but remain useful in multi-task settings.
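A bottleneck adapter can be sketched in a few lines of NumPy: project down, apply a non-linearity, project back up, and add the result to the input. The sizes, the ReLU activation, and the exact zero initialisation of the up-projection are simplifying assumptions (Houlsby et al. describe a near-identity initialisation); the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, bottleneck = 32, 4    # illustrative sizes

# Trainable adapter parameters; the surrounding transformer layer stays frozen.
W_down = rng.normal(scale=0.02, size=(hidden, bottleneck))
b_down = np.zeros(bottleneck)
W_up = np.zeros((bottleneck, hidden))   # zero init: the adapter starts as the identity
b_up = np.zeros(hidden)


def adapter(h):
    """Bottleneck adapter with a residual connection: h + up(relu(down(h)))."""
    z = np.maximum(h @ W_down + b_down, 0.0)
    return h + z @ W_up + b_up


h = rng.normal(size=(3, hidden))        # output of a frozen sub-layer
out = adapter(h)
adapter_params = W_down.size + b_down.size + W_up.size + b_up.size
print(out.shape, adapter_params)
```

Unlike a LoRA update, this extra computation stays in the forward pass after training, which is the source of the inference-time cost.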

Prompt tuning and prefix tuning

Prompt tuning prepends a sequence of trainable continuous vectors, called soft prompts, to the input embeddings and trains only those vectors. Prefix tuning applies the same idea at every transformer layer by prepending trainable vectors to each layer's attention keys and values. These methods are extremely parameter-efficient (often tens of thousands of parameters or fewer) but typically require larger base models to achieve competitive quality.
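The mechanics of prompt tuning amount to concatenating a small trainable matrix in front of the frozen token embeddings. The NumPy sketch below shows only that step; the vocabulary size, embedding width, prompt length, and token ids are made-up values for illustration, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, n_soft = 100, 16, 5    # illustrative sizes

embedding = rng.normal(size=(vocab, dim))                 # frozen embedding table
soft_prompt = rng.normal(scale=0.5, size=(n_soft, dim))   # the only trainable tensor


def embed_with_soft_prompt(token_ids):
    """Look up frozen token embeddings and prepend the soft prompt vectors."""
    tokens = embedding[token_ids]                          # (seq_len, dim)
    return np.concatenate([soft_prompt, tokens], axis=0)  # (n_soft + seq_len, dim)


inputs = embed_with_soft_prompt(np.array([3, 14, 15, 92]))
print(inputs.shape)        # sequence grows by n_soft positions
print(soft_prompt.size)    # trainable parameter count: n_soft * dim
```

The trainable parameter count depends only on the prompt length and the embedding width, not on the size of the base model.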

IA3, DoRA, NOLA, and newer variants

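A growing family of methods refines the LoRA idea. IA3 rescales activations with learned vectors rather than updating weights. DoRA decomposes each pretrained weight matrix into a magnitude component and a direction component, trains the magnitude directly, and applies a LoRA-style update to the direction. NOLA expresses the low-rank matrices as combinations of fixed random bases to further shrink the parameter footprint. Each trades off expressiveness, training stability, and parameter count differently.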
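The difference between rescaling activations and decomposing weights can be seen in a small NumPy sketch. It is a simplification under stated assumptions: the IA3-style function rescales the output of a single generic linear layer (the method itself targets keys, values, and intermediate feed-forward activations), and the DoRA-style function uses a column-wise norm with the magnitude initialised from the pretrained weight. All sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 12, 2                      # illustrative sizes
W0 = rng.normal(size=(d, k))             # frozen pretrained weight

# IA3-style: a learned vector rescales activations elementwise.
l_scale = np.ones(d)                     # trainable; ones leave the layer unchanged


def ia3_forward(x):
    return (x @ W0.T) * l_scale


# DoRA-style: trainable magnitude m plus a LoRA update on the direction.
m = np.linalg.norm(W0, axis=0, keepdims=True)   # one magnitude per column, shape (1, k)
A = rng.normal(scale=0.01, size=(r, k))         # trainable
B = np.zeros((d, r))                            # trainable, zero init


def dora_weight(m, A, B):
    """Recombine magnitude and the normalised, LoRA-updated direction."""
    V = W0 + B @ A
    return m * V / np.linalg.norm(V, axis=0, keepdims=True)


x = rng.normal(size=(4, k))
print(np.allclose(ia3_forward(x), x @ W0.T))    # identity at initialisation
print(np.allclose(dora_weight(m, A, B), W0))    # reproduces W0 at initialisation
```

Both variants start out exactly equal to the frozen model, so training begins from the pretrained behaviour; they differ in which quantities are then allowed to move.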

Performance and trade-offs

In most benchmarks, well-tuned LoRA recovers between 90% and 100% of the quality of a full fine-tune on the same data. Quality gaps appear most often when the task domain is far from the pretraining distribution or when the rank is set too low. A practical recipe for production PEFT is to start with LoRA rank 8 or 16, attach the LoRA matrices to the attention and projection layers, and increase the rank only if validation metrics demand it.

Ecosystem and tooling

The Hugging Face PEFT library is the de facto open-source toolkit for parameter-efficient fine-tuning and implements LoRA, QLoRA, IA3, prompt tuning, prefix tuning, and several newer methods. Production training stacks such as Axolotl, Unsloth, and Together Fine-Tuning wrap PEFT methods with curated recipes and distributed training support. Cloud providers including AWS, Azure, Google Cloud, and Together AI offer managed LoRA fine-tuning as a service.

Malaysian Context — Sovereign and Sectoral Model Adaptation

PEFT is particularly relevant in Malaysia because it makes it practical to adapt frontier open-weight models to local needs without the capital expenditure of training from scratch. Malaysian organisations have used LoRA and QLoRA to adapt models such as Llama, Mistral, and Qwen for Bahasa Malaysia language tasks, Malay-English code-mixed conversation, and domain-specific corpora drawn from Malaysian regulatory texts.

The MIMOS Berhad national research and development centre, together with universities including Universiti Malaya, Universiti Sains Malaysia, and Universiti Teknologi Malaysia, has explored PEFT for adapting open models on locally hosted GPU clusters. The MyDigital Blueprint and the MDEC AI agenda encourage this kind of low-cost adaptation as a path to sovereign AI capability for Bahasa Malaysia and other local languages including Tamil, Mandarin, and Iban.

Sectoral adaptation is also gaining momentum. Maybank, CIMB, and several other Bank Negara Malaysia-regulated institutions have explored LoRA fine-tuning of open models on internal compliance, fraud, and customer-service corpora, retaining the base weights on shared infrastructure while keeping task-specific adapters within the bank's data perimeter — a deployment pattern that aligns well with the Personal Data Protection Act 2010 (PDPA) and BNM RMiT requirements.
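The shared-base pattern described above can be sketched as one frozen weight matrix plus a dictionary of small per-task adapters that are selected at request time. This is an illustrative NumPy toy, not a description of any institution's system: the task names, sizes, and random adapter values are hypothetical, and a real deployment would apply the same idea across every adapted layer of a model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 32, 4                       # illustrative sizes
W_base = rng.normal(size=(d, k))          # single frozen base weight, shared by all tasks


def new_adapter():
    """Create a stand-in for a trained LoRA adapter (random values, not real training)."""
    return {"A": rng.normal(scale=0.01, size=(r, k)),
            "B": rng.normal(scale=0.01, size=(d, r))}


# Hypothetical task names; each adapter could be stored and governed separately.
adapters = {"compliance": new_adapter(), "fraud": new_adapter()}


def forward(x, task=None):
    """Run the shared base, then add the selected task's low-rank update if any."""
    y = x @ W_base.T
    if task is not None:
        ad = adapters[task]
        y = y + (x @ ad["A"].T) @ ad["B"].T
    return y


x = rng.normal(size=(2, k))
base_params = W_base.size
adapter_params = sum(a.size for a in adapters["fraud"].values())
print(base_params, adapter_params)
```

Because the base weights never change, adding a new task means storing and loading only another pair of small matrices.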

HRD Corp has approved several PEFT-related training programmes for Malaysian engineers, and Malaysian service providers including Penang-based AITG Sdn Bhd integrate LoRA workflows into their Teragrid Ai Platform for client-specific model adaptation.

References

Hu, E., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314.
Houlsby, N., et al. (2019). Parameter-Efficient Transfer Learning for NLP. ICML 2019.
Hugging Face. (2024). PEFT Library Documentation. github.com/huggingface/peft.
MIMOS Berhad. (2024). National AI Capability Reports. MIMOS.

Tags: peft, lora, fine-tuning, adapters, efficiency

Type: Model adaptation technique
Key methods: LoRA, QLoRA, adapters, prefix tuning
Typical reduction: 90-99% fewer trainable parameters
Memory savings: 10-20x vs full fine-tuning
Quality retained: Typically 90-100% of full fine-tune
