LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning technique that adapts large pre-trained models by injecting small trainable low-rank matrices into transformer layers, drastically reducing the number of trainable parameters, typically with little or no loss in task performance.
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that allows large pre-trained language models — and, increasingly, diffusion models and vision models — to be adapted for specific tasks by training only a small number of additional parameters while keeping the original model weights frozen. Introduced by Edward Hu, Yelong Shen, and colleagues at Microsoft Research in 2021, LoRA has become one of the most widely deployed fine-tuning techniques in the industry, enabling practitioners to customise foundation models at a fraction of the computational and financial cost of full fine-tuning.
The Problem LoRA Solves
Full fine-tuning of a large language model requires updating every weight in the model on task-specific data. For models with tens or hundreds of billions of parameters, such as GPT-3 175B, Llama 3, or Mistral Large, this demands GPU clusters with hundreds of gigabytes of memory (for weights, gradients, and optimiser state) and days or weeks of training time. Each fine-tuned model also occupies the same storage footprint as the original, making it impractical to maintain separate fine-tuned versions for multiple downstream tasks.
LoRA addresses this by building on the hypothesis that the weight updates required for task adaptation have low intrinsic rank. That is, although each weight matrix in a transformer is large, potentially thousands of dimensions on each side, the meaningful change to that matrix during adaptation can be approximated by the product of two much smaller matrices whose shared inner dimension, the rank, is far smaller than either side of the original matrix.
Mathematical Foundation
Given a weight matrix W ∈ ℝ^(d×k) in the pre-trained model, LoRA learns two small matrices: A ∈ ℝ^(r×k) and B ∈ ℝ^(d×r), where r is the rank hyperparameter and r ≪ min(d, k). During the forward pass, the effective weight used is W + BA, where ΔW = BA parameterises the weight update as a matrix of rank at most r; for an input x, the layer output is Wx + BAx. The original paper additionally scales the update by a constant factor α/r, which is omitted here for simplicity. The matrices A and B are initialised so that BA = 0 at the start of training (A is randomly initialised; B is zero-initialised), ensuring the model begins fine-tuning from the pre-trained baseline.
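The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names (matvec, lora_forward), the toy dimensions, and the standard deviation used for A are assumptions made for the example, and any scaling of the update is left out.

```python
import random

def matvec(M, v):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x):
    """Compute (W + BA) x as W x + B (A x), without ever forming BA."""
    base = matvec(W, x)                # frozen pre-trained path
    update = matvec(B, matvec(A, x))   # low-rank path through r dimensions
    return [b + u for b, u in zip(base, update)]

random.seed(0)
d, k, r = 6, 5, 2  # toy sizes chosen for illustration only

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # d x k, frozen
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # r x k, random init
B = [[0.0] * r for _ in range(d)]                                   # d x r, zero init
x = [random.gauss(0, 1) for _ in range(k)]

# Because B starts at zero, BA = 0 and the adapted layer reproduces the
# pre-trained layer exactly before any training step is taken.
assert lora_forward(W, A, B, x) == matvec(W, x)
```

Computing B(Ax) instead of (BA)x keeps the extra work proportional to r(d + k) per input rather than d·k.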
Only A and B are trained; W remains frozen throughout. The rank r controls the capacity of the adaptation — higher rank allows the model to learn more complex task-specific transformations at the cost of more trainable parameters. In practice, r values of 4 to 64 are common, and the ratio of trainable to total parameters is typically 0.01%–1%.
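To make the effect of r concrete, the arithmetic for a single weight matrix can be worked through as follows. The function name lora_param_counts and the 4096 × 4096 layer size are hypothetical choices for illustration, not values taken from any particular model.

```python
def lora_param_counts(d, k, r):
    """Return (frozen parameters in W, trainable parameters in A and B)."""
    full = d * k        # entries in W (d x k)
    lora = r * (d + k)  # entries in B (d x r) plus entries in A (r x k)
    return full, lora

d, k = 4096, 4096  # hypothetical layer size
for r in (4, 16, 64):
    full, lora = lora_param_counts(d, k, r)
    share = 100 * lora / full
    print(f"r={r:>2}: {lora:,} trainable vs {full:,} frozen ({share:.2f}%)")
```

These are per-matrix ratios. The ratio for a whole model is usually lower, because adapters are often attached to only a subset of its weight matrices.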
Efficiency Gains
The reduction in trainable parameters is substantial. Compared to full fine-tuning of GPT-3 175B using the Adam optimiser, LoRA reduces the number of trainable parameters by approximately 10,000 times and GPU memory requirements by roughly three times. Because the LoRA matrices A and B can be merged directly back into the original weight matrix W after training — by computing W' = W + BA — there is zero additional inference latency. This contrasts with adapter-based methods that insert additional computations into the forward pass.
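The merge step can be checked numerically. The sketch below uses plain Python lists with toy dimensions and random values standing in for trained adapters; the helper names (matmul, matvec, merge_lora) are assumptions for the example. It shows that a layer using the merged matrix W' = W + BA gives the same output as the unmerged two-path computation.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def merge_lora(W, A, B):
    """Fold the low-rank update into the base weights: W' = W + BA."""
    BA = matmul(B, A)  # (d x r) times (r x k) gives d x k
    return [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]

random.seed(1)
d, k, r = 6, 5, 2  # toy sizes chosen for illustration only
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]  # stands in for a trained A
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # stands in for a trained B
x = [random.gauss(0, 1) for _ in range(k)]

W_merged = merge_lora(W, A, B)

unmerged = [b + u for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
merged = matvec(W_merged, x)  # a single matrix-vector product, as in the base model

# The two computations agree up to floating-point rounding.
assert all(abs(m - u) < 1e-9 for m, u in zip(merged, unmerged))
```

After merging, the layer has the same shape and cost as the original, which is why no latency is added. The unmerged form remains useful when several adapters need to be swapped over one shared base model.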
QLoRA: Combining Quantisation and LoRA
QLoRA, introduced by Tim Dettmers and colleagues in 2023, combines LoRA with 4-bit quantisation of the base model weights. The base model is loaded in a compressed 4-bit format (using 4-bit NormalFloat, or NF4, a quantisation scheme designed for normally distributed weights), dramatically reducing memory requirements, while LoRA adapters are trained in higher precision. QLoRA makes it feasible to fine-tune models with tens of billions of parameters on a single GPU, broadening access to custom model development. The authors report fine-tuning a 65B parameter model on a single 48 GB GPU, a task that would previously have required a multi-GPU cluster.
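A rough back-of-the-envelope calculation shows why 4-bit storage matters at this scale. The sketch below counts only the memory for the base weights themselves; the function name weight_memory_gb is a hypothetical helper, and the estimate deliberately ignores quantisation constants, adapter weights, activations, and optimiser state.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate storage for model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 65_000_000_000  # a 65B parameter model

for bits in (16, 4):
    gb = weight_memory_gb(n_params, bits)
    print(f"{bits:>2}-bit weights: {gb:.1f} GB")
```

Under these simplifying assumptions the weights need about 130 GB at 16 bits but about 32.5 GB at 4 bits, which leaves headroom on a 48 GB device for the adapters and training state.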
Application to Diffusion and Vision Models
LoRA was initially developed for language models but has been widely adopted for fine-tuning text-to-image diffusion models such as Stable Diffusion. In this context, LoRA adapters encode the visual style, subject, or composition of a small set of reference images, allowing the base model to generate new images consistent with the learned concept without modifying the base model itself. Platforms such as Civitai host thousands of community-trained LoRA adapters for Stable Diffusion, illustrating the breadth of the technique's adoption.
Variants and Extensions
DoRA (Weight-Decomposed Low-Rank Adaptation) decomposes each pre-trained weight matrix into magnitude and direction components, trains the magnitude directly while applying a LoRA update to the direction, and is reported to achieve better performance than LoRA at the same rank. AdaLoRA adaptively allocates the rank budget across different weight matrices based on their estimated importance, concentrating capacity where it matters most. LoftQ initialises the LoRA matrices to compensate for the quantisation error of the base model, improving convergence when LoRA is combined with quantisation.
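The magnitude and direction split used by DoRA can be illustrated on a tiny matrix. The sketch below follows the formulation as commonly stated for DoRA, W' = m · (W0 + BA) / ‖W0 + BA‖, with the norm taken per column and m initialised to the column norms of W0. The function names, the 2 × 2 example, and the use of a precomputed update matrix in place of BA are assumptions made for illustration.

```python
import math

def column_norms(M):
    """Euclidean norm of each column of a matrix stored as a list of rows."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*M)]

def dora_weight(W0, delta, m):
    """Rescale each column of (W0 + delta) to unit norm, then to magnitude m[j]."""
    V = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, delta)]
    norms = column_norms(V)
    return [[m[j] * v / norms[j] for j, v in enumerate(row)] for row in V]

W0 = [[3.0, 0.0],
      [4.0, 2.0]]
m = column_norms(W0)                  # magnitudes start at the column norms of W0
zero_update = [[0.0, 0.0], [0.0, 0.0]]

# With a zero low-rank update, the recomposed weight equals the original one.
W_init = dora_weight(W0, zero_update, m)
assert all(abs(a - b) < 1e-12
           for row_a, row_b in zip(W_init, W0)
           for a, b in zip(row_a, row_b))
```

In this picture the low-rank update changes only the direction of each column, while the vector m is a separate trainable parameter controlling its length.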
References
- Hu, E. J., Shen, Y., Wallis, P., et al. (2021). LoRA: Low-rank adaptation of large language models. arXiv:2106.09685.
- Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. arXiv:2305.14314.
- Liu, S.-Y., Wang, C.-Y., Yin, H., et al. (2024). DoRA: Weight-decomposed low-rank adaptation. ICML 2024.
- Zhang, Q., Chen, M., Bukharin, A., et al. (2023). AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. ICLR 2023.
- IBM. (2025). What is LoRA (Low-Rank Adaptation)? IBM Think. https://www.ibm.com/think/topics/lora
- Raschka, S. (2023). Practical tips for finetuning LLMs using LoRA. Ahead of AI Newsletter.