Fine-Tuning with LoRA

Sam Naji, Joseph Tekriti
October 21, 2023

Rapid progress in neural-network research and large language models (LLMs) has heightened business interest in AI applications that generate real value. Researchers apply a range of machine learning methodologies, both generative and non-generative, to text problems such as classification, summarization, sequence-to-sequence tasks, and controlled text generation. Organizations can rely on third-party application programming interfaces (APIs), but by fine-tuning models on proprietary data they can obtain domain-specific, relevant results. This enables cost-efficient, self-hosted solutions that can be deployed across many environments while maintaining a high level of security. When choosing a fine-tuning method, it is essential to prioritize optimal resource use and low cost. In this article, we discuss low-rank adaptation (LoRA) and quantization, two of the best-known and most successful of these parameter-efficient approaches.

Remarkable Journey of Large Language Models in 2023

In just a few months of 2023, there were major advances in both the building and the adaptation of large language models.

  1. Meta's LLaMA: In late February, Meta introduced its large language model, LLaMA. The architecture was published openly, but the model weights were kept restricted.
  2. Unexpected Turn: Not long after, in early March, the LLaMA weights were leaked on 4chan. Meta did not pursue strict legal action; instead, it adjusted its license agreement to allow research use while limiting commercial use.
  3. Rapid Adaptations: The open-source community wasted no time in harnessing LLaMA's potential. By mid-March, enthusiasts had LLaMA running on a Raspberry Pi, and shortly after, the model was being fine-tuned on consumer-grade hardware, notably a single RTX 4090 graphics card. This wave led to the LoRA method being used to produce the Alpaca-style models.
  4. Breakthroughs: At the end of March, the llama.cpp project made the model run on MacBook CPUs. This was not just a compatibility fix; the model could now generate tokens quickly and efficiently.
  5. Competition and Comparisons: Vicuna was released in the same month as Bard and performed comparably, yet cost only about $300 to fine-tune, a result achieved with the help of the LoRA method.
  6. Open Source Progress: The release of GPT4All, which aimed to build a complete ecosystem around large language models, was a significant step forward. Notably, by the end of March Cerebras had released a fully open-source GPT-3-class model, and unlike LLaMA's weights, it was licensed for commercial use.
  7. Proprietary Models: Bloomberg was not far behind, introducing BloombergGPT. It relied on a closed model and dataset, but its release was still a notable milestone for domain-specific LLMs.
  8. Innovative Fine-tuning: April saw the emergence of Koala, a 13-billion-parameter model. Impressively, it was fine-tuned on a small sample of dialogue data for roughly $100 in compute, and users often found its output hard to distinguish from ChatGPT's.
  9. New Entrants: The Open Assistant model arrived in mid-April, together with a dataset designed for reinforcement learning from human feedback, the same style of training used for ChatGPT, and its results were competitive with ChatGPT's.

The speed with which large language models evolved in such a short time shows the strength of the open-source community, how far the technology has come, and how much AI can do.

LLM Timeline 2023

Low-Rank Adaptation (LoRA)

Low-Rank Adaptation of Large Language Models (LoRA) is based on a key observation: the difference between the weights optimized for a specific task and the initial pre-trained weights often exhibits "low intrinsic rank", meaning it can be well approximated by a low-rank matrix. LoRA's main idea is to make training cheaper by shrinking the number of trainable parameters; in the original paper's GPT-3 experiments, it cuts them by a factor of roughly 10,000.


  • Memory Efficiency: One of the biggest problems with training neural networks, especially large ones, is memory consumption. LoRA addresses this by reducing GPU memory requirements by a factor of three or more.
  • Performance: Remarkably, LoRA keeps or even improves the model's quality despite using far less memory. Fine-tuning with LoRA can sometimes outperform fine-tuning the whole model.
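To make the parameter savings concrete, here is a back-of-the-envelope calculation in plain Python; the dimensions are illustrative, not taken from any particular model.

```python
# Trainable parameters for one d x d weight matrix:
# full fine-tuning updates all d*d entries, while LoRA trains
# only two thin matrices, B (d x r) and A (r x d).
d = 4096   # hidden dimension (illustrative)
r = 8      # LoRA rank, chosen much smaller than d

full_params = d * d        # full fine-tuning
lora_params = 2 * d * r    # B and A together

print(full_params)                  # 16777216
print(lora_params)                  # 65536
print(full_params // lora_params)   # 256x fewer trainable parameters
```

With a rank of 8 and a hidden size of 4096, LoRA trains 256 times fewer parameters for this one matrix; applying larger models and smaller ranks pushes the whole-model savings toward the paper's 10,000x figure.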

How does it work?

Delving deeper, the methodology behind LoRA is rooted in matrix factorization. Ordinarily, fine-tuning updates the full pre-trained weight matrices in place. LoRA instead applies "rank factorization": the original weight matrix is kept frozen during fine-tuning, and two small supplementary matrices, denoted A and B, are trained instead. Their product serves as a low-rank decomposition of the fine-tuned weight update. Consider the following illustration from the original LoRA paper:

Figure from the original LoRA paper

This figure from the paper shows the tensor operations for a single weight matrix in the model; A and B are small matrices of rank r. The input x, of dimension d, is processed simultaneously by the frozen pre-trained weights and by LoRA's fine-tuned low-rank decomposition matrices, and the two outputs are summed.

By choosing r << d and freezing the original "Pretrained Weights" at the beginning of training, we can greatly reduce the optimizer's memory footprint and the checkpoint size compared with adjusting all of the parameters. Any dense layer in the model architecture is suitable for this strategy. Numerous techniques that expand upon LoRA have been introduced since the original LoRA paper was published. Parameter-efficient techniques like LoRA also simplify model deployment, especially when managing multiple specialized models; this grows more important as specialized LLMs for particular tasks become common.
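The decomposition above can be sketched in a few lines of plain Python. The tiny dimensions (d = 4, r = 1) and the weight values are purely illustrative; the point is that the frozen matrix W and the trainable pair B, A both process the input x, and their outputs are summed:

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

d, r = 4, 1                       # r << d is the key assumption
W = [[1 if i == j else 0 for j in range(d)] for i in range(d)]  # frozen pretrained weights (identity here)
A = [[1, 0, 0, 0]]                # trainable, r x d
B = [[2], [0], [0], [0]]          # trainable, d x r

x = [1, 2, 3, 4]
# Output is W x plus the low-rank update B (A x).
h = [w + ba for w, ba in zip(matvec(W, x), matvec(B, matvec(A, x)))]
print(h)  # [3, 2, 3, 4]
```

Only A and B (2 * d * r values) receive gradient updates; W stays exactly as it was pre-trained, which is why checkpoints of LoRA adapters are so small.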


Quantization

The field of machine learning continually seeks ways to optimize model operations without compromising performance. Quantization is a major technique in this quest, emphasizing efficient memory management and faster inference. It represents floating-point vectors, typically stored as 16-bit values, in lower-bit integer formats; for instance, 16-bit data can be converted to 8-bit integers.


  • Memory Efficiency: The core advantage of quantization is its pronounced memory savings. By converting high-bit floating-point vectors to low-bit integer representations, the memory footprint diminishes significantly.
  • Faster Inference: Leveraging quantized vectors accelerates the model inference process, ensuring quicker responses during real-time operations.

Here's a conceptual breakdown:

  • Take the original 16-bit floating-point vector values.
  • Scale these values by a quantization factor and round them to integers, yielding, for example, an "8-bit quantized vector".
  • Store the quantization factor alongside the quantized vector.
  • When the original values are needed ("de-quantization"), divide the values in the quantized vector by the quantization factor, approximating the initial floating-point values.
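The round trip described above can be sketched in plain Python. The example values and the symmetric max-abs scaling scheme are illustrative choices; real libraries use more refined schemes, often per tensor or per channel:

```python
values = [0.5, -1.0, 0.25]          # original floating-point values

# Quantize: scale so the largest magnitude maps to the int8 extreme (127).
factor = 127 / max(abs(v) for v in values)
quantized = [round(v * factor) for v in values]   # 8-bit integers
print(quantized)       # [64, -127, 32]

# De-quantize: divide by the same factor to approximate the originals.
recovered = [q / factor for q in quantized]
print(recovered[1])    # -1.0 exactly; the others carry small rounding error
```

The storage win is immediate: each 16-bit value becomes an 8-bit integer plus a single shared factor, and the error introduced is bounded by half a quantization step.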

The empirical evidence is compelling. Evaluating a 7-billion-parameter model shows that, as various quantization techniques are applied, model size shrinks almost linearly with bit width. Meanwhile "perplexity", which gauges how well a model predicts (and thus compresses) language, remains largely consistent across these quantization levels. Lower perplexity values indicate better performance; 8-bit quantization yields a perplexity essentially identical to the 16-bit version, while 4-bit is not far off.


QLoRA

In QLoRA, the pretrained model is loaded into GPU memory as quantized 4-bit weights (rather than the 8- or 16-bit weights commonly used with standard LoRA), making it a more memory-efficient variant of LoRA without sacrificing performance. Here, we will concentrate on testing this strategy, contrasting it with other approaches where appropriate, and determining the QLoRA hyperparameter values that obtain top performance with little training time.

Hugging Face's SFTTrainer class in the TRL library is the most recent example of this kind of high-level abstraction. All that is required for QLoRA is the following:

  1. Load the 4-bit quantized model into GPU memory using bitsandbytes.
  2. Define the LoRA configuration, using the prior discussion as a guide.
  3. Create Hugging Face Dataset objects from the prepared instruction-following data, split into a train and a test set.
  4. Define the training hyperparameters, such as the number of epochs and batch size, in a training-arguments object; these will be held constant.
  5. Pass all of these to an instance of SFTTrainer.
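The five steps above might look like the following sketch. The model name, hyperparameter values, and target modules are illustrative assumptions, and the transformers/PEFT/TRL APIs shown here have changed across versions, so treat this as an outline rather than a drop-in script:

```python
def build_qlora_trainer(train_ds, eval_ds):
    """Outline of a QLoRA fine-tuning setup with TRL's SFTTrainer.

    Imports are kept local so the sketch can be read without the
    heavyweight libraries installed; nothing touches the GPU until
    trainer.train() is called.
    """
    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig, TrainingArguments)
    from peft import LoraConfig
    from trl import SFTTrainer

    base = "meta-llama/Llama-2-7b-hf"   # illustrative base model

    # Step 1: load the base model in 4-bit via bitsandbytes.
    bnb = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(
        base, quantization_config=bnb, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(base)

    # Step 2: the LoRA configuration (rank and targets are illustrative).
    peft_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                          target_modules=["q_proj", "v_proj"],
                          task_type="CAUSAL_LM")

    # Step 4: training hyperparameters, held constant across runs.
    args = TrainingArguments(output_dir="qlora-out", num_train_epochs=3,
                             per_device_train_batch_size=4,
                             gradient_accumulation_steps=4,
                             learning_rate=2e-4)

    # Steps 3 and 5: the train/test Dataset objects arrive as arguments,
    # and everything is handed to SFTTrainer.
    return SFTTrainer(model=model, tokenizer=tokenizer,
                      train_dataset=train_ds, eval_dataset=eval_ds,
                      peft_config=peft_cfg, args=args)
```

Calling `build_qlora_trainer(train_ds, eval_ds).train()` would then start the fine-tuning run.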

You can clone the repository for the source files.


In essence, the tandem of Low-Rank Adaptation (LoRA) and quantization now allows us to fine-tune mammoth language models and run inference on a more resource-efficient scale than ever before. Applied properly, LoRA is a potent fine-tuning technique that can produce excellent outcomes. The quality of the optimized model's output may hinge on the rank value and on which layers of the neural network architecture are targeted during adaptation. QLoRA reduces the memory needed for adaptation while keeping quality at a high standard.
