Parameter-Efficient Fine-Tuning

A family of techniques that adapts a pretrained language or vision model to a downstream task by training only a small fraction of its parameters, dramatically reducing compute, memory, and storage requirements compared to full fine-tuning.

5 min read | Last updated May 2026 | Infrastructure

Parameter-Efficient Fine-Tuning (PEFT) is a collection of techniques that adapt a large pretrained model to a specific downstream task or domain by training only a small set of new or existing parameters, while keeping the vast majority of the base model frozen. PEFT methods have become the dominant approach to customising large language models, vision models, and multimodal systems because they sharply reduce the memory and storage costs of fine-tuning, and to a lesser extent the compute cost, while preserving most of the quality of a full fine-tune.

Motivation

Full fine-tuning updates every parameter of the base model. For modern language models with tens or hundreds of billions of parameters, this requires hundreds of gigabytes of GPU memory just to hold the weights, gradients, optimiser states, and activations, and it produces a complete copy of the model for every task. PEFT methods address three resulting problems. They cut training memory, because gradients and optimiser states are kept only for the small trainable set (the LoRA paper reports roughly a 3x reduction for GPT-3 175B, and the saving grows when the frozen weights are also quantised). They make storage for adapted models trivial, often megabytes instead of hundreds of gigabytes. And they make it possible to host many task-specific variants on top of a single shared base.
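As a rough back-of-the-envelope sketch of where the memory goes, the snippet below assumes mixed-precision training with Adam: 2 bytes per weight, 2 bytes per gradient, and 12 bytes of optimiser state (fp32 master weights plus two moment estimates) per trainable parameter. These byte counts, the 7-billion-parameter model, and the 0.5% trainable fraction are assumptions chosen for illustration, not figures from this article, and activation memory is ignored.

```python
GIB = 1024 ** 3

def training_memory_gib(total_params, trainable_params,
                        weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Estimate weights + gradients + optimiser state, ignoring activations."""
    weights = total_params * weight_bytes        # every weight must be resident
    grads = trainable_params * grad_bytes        # gradients only for trainable params
    optim = trainable_params * optim_bytes       # optimiser state only for trainable params
    return (weights + grads + optim) / GIB

total = 7_000_000_000                            # assumed model size
full = training_memory_gib(total, trainable_params=total)
peft = training_memory_gib(total, trainable_params=total // 200)  # assumed 0.5% trainable
print(f"full fine-tune: {full:.1f} GiB, PEFT: {peft:.1f} GiB")
```

Under these assumptions the frozen weights dominate the PEFT footprint, which is why quantising the base model yields a further saving.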

Core methods

LoRA (Low-Rank Adaptation)

LoRA is the most widely deployed PEFT method. It is based on the hypothesis, supported empirically, that the weight update produced during fine-tuning has a low intrinsic rank. Instead of updating the original d x k weight matrix W directly, LoRA learns two small matrices, B (d x r) and A (r x k), such that the update is approximated as delta_W = B * A, where the rank r is much smaller than min(d, k). The original weights are frozen, and only A and B are trained. After training, the product can be merged back into the base weights (W + B * A), so the adapted model has no inference-time overhead.
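The merge property can be checked numerically. The following is a minimal pure-Python sketch, not the reference implementation: it uses the shapes from the LoRA paper (B is d x r, A is r x k; which factor is called A or B is only a naming convention), a row-vector input, toy dimensions, and random values for both factors so that the comparison is non-trivial. The helper names are hypothetical.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, k, r = 6, 5, 2            # toy sizes; real layers have d and k in the thousands

W = rand_matrix(d, k, rng)   # frozen pretrained weight (d x k)
B = rand_matrix(d, r, rng)   # trainable factor (d x r); the paper initialises it to zero
A = rand_matrix(r, k, rng)   # trainable factor (r x k)
x = rand_matrix(1, d, rng)   # one input as a row vector

# Training-time forward pass: frozen path plus low-rank path.
y_adapter = add(matmul(x, W), matmul(matmul(x, B), A))

# After training: fold the update into the base weight once.
W_merged = add(W, matmul(B, A))
y_merged = matmul(x, W_merged)

max_diff = max(abs(a - b) for a, b in zip(y_adapter[0], y_merged[0]))
trainable = r * (d + k)      # values in B and A, versus d * k in W
```

The two outputs agree up to floating-point rounding, which is why a merged LoRA model runs at the same cost as the base model.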

LoRA typically reduces the trainable parameter count by well over 90%, frequently to under 1% of the original model. QLoRA combines LoRA with 4-bit quantisation of the frozen base weights; its authors report fine-tuning a 65-billion-parameter model on a single 48 GB GPU.
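The arithmetic behind these figures is simple enough to sketch. The example below assumes a hypothetical 7-billion-parameter transformer with hidden size 4096 and 32 layers, LoRA at rank 8 on two square projection matrices per layer, and a 65-billion-parameter model for the quantisation estimate. All of these numbers are illustrative assumptions, and the memory figure covers weights only.

```python
def lora_trainable_params(d_in, d_out, rank, n_matrices):
    """Each adapted d_in x d_out matrix gains rank * (d_in + d_out) trainable values."""
    return n_matrices * rank * (d_in + d_out)

def weight_memory_gb(n_params, bits_per_weight):
    """Storage for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

hidden, layers, rank = 4096, 32, 8           # assumed architecture and rank
base_params = 7_000_000_000                  # assumed base model size
trainable = lora_trainable_params(hidden, hidden, rank, n_matrices=2 * layers)
fraction = trainable / base_params

fp16_gb = weight_memory_gb(65_000_000_000, 16)
int4_gb = weight_memory_gb(65_000_000_000, 4)
print(f"trainable: {trainable:,} ({fraction:.4%} of the base model)")
print(f"65B weights: {fp16_gb:.1f} GB at 16-bit, {int4_gb:.1f} GB at 4-bit")
```

The 4-bit figure leaves headroom for activations and adapter state on a 48 GB device, while the 16-bit weights alone would not fit.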

Adapters

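Adapter modules were among the earliest PEFT techniques. They insert small bottleneck feedforward modules inside each transformer layer, typically after the attention and feedforward sublayers. Each module consists of a down-projection to a small dimension, a nonlinearity, an up-projection, and a residual connection. Only the adapters are trained. Because the extra modules cannot be merged into the base weights, adapters add a small inference-time cost. They have largely been superseded by LoRA, but they remain useful in multi-task settings.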
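A bottleneck adapter can be sketched in a few lines. The version below is an illustrative assumption rather than any particular library's module: it uses ReLU as the nonlinearity, omits biases, and initialises the up-projection to exactly zero so the adapter starts as the identity (the original adapter paper uses a near-identity initialisation). All names and sizes are hypothetical.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def relu(v):
    return [max(0.0, x) for x in v]

def adapter(h, W_down, W_up):
    """Bottleneck adapter: h + W_up @ relu(W_down @ h)."""
    z = relu(matvec(W_down, h))                  # project d -> m, then nonlinearity
    update = matvec(W_up, z)                     # project m -> d
    return [a + b for a, b in zip(h, update)]    # residual connection

rng = random.Random(0)
d, m = 8, 2                                      # hidden size and bottleneck size (toy values)
W_down = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(m)]
W_up = [[0.0] * m for _ in range(d)]             # zero init: the adapter starts as identity

h = [rng.gauss(0, 1) for _ in range(d)]
out = adapter(h, W_down, W_up)
adapter_params = 2 * d * m                       # biases omitted in this sketch
```

Unlike a LoRA update, the nonlinearity prevents this module from being folded into a neighbouring weight matrix, which is the source of the inference-time cost.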

Prompt tuning and prefix tuning

Prompt tuning prepends a sequence of trainable continuous vectors, called soft prompts, to the input embeddings and trains only those vectors. Prefix tuning extends the idea by prepending trainable vectors to the keys and values of every transformer layer. These methods are extremely parameter-efficient (for prompt tuning the trainable count is just the prompt length times the embedding dimension, often tens of thousands of parameters) but typically require larger base models to achieve competitive quality.
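The mechanics of prompt tuning amount to a concatenation in embedding space. The sketch below uses toy sizes and hypothetical names; in a real setup the embedding table would come from the pretrained model and only the soft prompt would receive gradients.

```python
import random

rng = random.Random(0)
d_model, n_soft, vocab = 16, 4, 50           # toy sizes (assumed)

# Frozen embedding table standing in for the pretrained model's embeddings.
embedding = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab)]
# The only trainable tensor: n_soft continuous vectors.
soft_prompt = [[rng.gauss(0, 0.02) for _ in range(d_model)] for _ in range(n_soft)]

def build_inputs(token_ids):
    """Prepend the soft prompt to the embedded tokens."""
    token_embeds = [embedding[t] for t in token_ids]
    return soft_prompt + token_embeds

inputs = build_inputs([3, 17, 42])           # sequence length grows by n_soft
trainable = n_soft * d_model                 # parameters trained in this toy setup
realistic = 20 * 4096                        # e.g. 20 soft tokens at embedding size 4096
```

The soft prompt consumes context length at inference time, which is one practical cost of the approach.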

IA3, DoRA, NOLA, and newer variants

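A growing family of methods refines the LoRA idea or offers alternatives to it. IA3 learns vectors that rescale intermediate activations (keys, values, and feedforward activations) rather than updating weights. DoRA decomposes each pretrained weight into a magnitude and a direction and applies a LoRA-style low-rank update to the direction. NOLA re-expresses the LoRA factors as linear combinations of fixed random basis matrices, so that only the mixing coefficients are trained, further shrinking the parameter footprint. Each trades off expressiveness, training stability, and parameter count differently.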
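Two of these ideas are easy to illustrate on toy values. The sketch below is an assumption-laden simplification: the IA3-style function only shows the element-wise rescaling, and the DoRA-style function only shows the column-wise magnitude and direction split, using the pretrained weight itself as the direction (in DoRA the direction would also include a low-rank update). Function names are hypothetical.

```python
import math

def ia3_rescale(activations, scale):
    """IA3-style element-wise rescaling of an activation vector."""
    return [s * a for s, a in zip(scale, activations)]

def column_norms(W):
    """L2 norm of each column of a matrix stored as a list of rows."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*W)]

def dora_compose(magnitude, V):
    """Rebuild a weight as magnitude * V / ||V||, column by column."""
    norms = column_norms(V)
    return [[m * v / n for m, v, n in zip(magnitude, row, norms)] for row in V]

h = [0.5, -1.0, 2.0]
scale = [1.0, 1.0, 1.0]                  # ones at initialisation leave activations unchanged
h_scaled = ia3_rescale(h, scale)

W0 = [[3.0, 1.0], [4.0, 2.0]]            # toy pretrained weight
magnitude = column_norms(W0)             # trainable magnitudes start at the column norms
W_rebuilt = dora_compose(magnitude, W0)  # reproduces W0 before any training
```

Both constructions start as exact no-ops, so training begins from the behaviour of the pretrained model.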

Performance and trade-offs

In published comparisons, well-tuned LoRA often comes close to, and sometimes matches, the quality of a full fine-tune on the same data. Quality gaps appear most often when the task domain is far from the pretraining distribution or when the rank is set too low. A common starting recipe for production PEFT is to use LoRA rank 8 or 16, apply it to the attention projection matrices (and optionally the feedforward projections), and increase the rank only if validation metrics demand it.
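The "increase rank only if validation metrics demand it" step can be written as a small search loop. This is a sketch under stated assumptions: `train_and_eval` is a hypothetical callback that trains at a given rank and returns a validation score (higher is better), the candidate ranks and the `min_gain` threshold are arbitrary, and the scores in the usage example are made up purely to exercise the loop.

```python
def choose_rank(train_and_eval, ranks=(8, 16, 32, 64), min_gain=0.005):
    """Walk up the candidate ranks and stop once the gain falls below min_gain."""
    best_rank = ranks[0]
    best_score = train_and_eval(best_rank)
    for rank in ranks[1:]:
        score = train_and_eval(rank)
        if score - best_score < min_gain:
            break                         # the extra capacity did not pay off
        best_rank, best_score = rank, score
    return best_rank, best_score

# Stand-in for real training runs: invented scores for illustration only.
fake_scores = {8: 0.810, 16: 0.842, 32: 0.844, 64: 0.845}
rank, score = choose_rank(fake_scores.get)
```

With these invented scores the loop settles on rank 16, because moving to rank 32 improves the score by less than the threshold.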

Ecosystem and tooling

The Hugging Face PEFT library is the de facto open-source toolkit for parameter-efficient fine-tuning. It implements LoRA, IA3, prompt tuning, prefix tuning, and several newer methods, and supports QLoRA-style training on quantised base models. Production training stacks such as Axolotl, Unsloth, and Together Fine-Tuning wrap PEFT methods with curated recipes and distributed training support. Cloud providers including AWS, Azure, Google Cloud, and Together AI offer managed LoRA fine-tuning as a service.

References

  1. Hu, E., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
  2. Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314.
  3. Houlsby, N., et al. (2019). Parameter-Efficient Transfer Learning for NLP. ICML 2019.
  4. Hugging Face. (2024). PEFT Library Documentation. github.com/huggingface/peft.
  5. MIMOS Berhad. (2024). National AI Capability Reports. MIMOS.