
Novel QLoRA Approach Unlocks AI Fine-Tuning on Consumer GPUs

Researchers have unveiled QLoRA, a novel and highly efficient 4-bit method for fine-tuning AI models that can run on a single professional or consumer GPU. This dramatic increase in efficiency opens up new pathways for low-cost AI development.

A novel approach to fine-tune LLMs, named QLoRA, opens up a world of possibilities for improving AI language models. Photo illustration: Artisana

🧠 Stay Ahead of the Curve

  • Researchers have unveiled QLoRA, a novel and highly efficient 4-bit method for fine-tuning AI models that is comparable in performance to costly 16-bit approaches.

  • QLoRA dramatically reduces memory use and cost, making advanced AI fine-tuning accessible on a single professional or consumer GPU.

  • The release of QLoRA could further democratize AI development, foster open-source advancements, and lead to widespread personalized AI models, even on mobile phones.

By Michael Zhang

May 24, 2023

A breakthrough in artificial intelligence (AI) fine-tuning is on the horizon as researchers have released QLoRA, a novel method for refining AI models on devices as compact as consumer GPUs without compromising performance compared to classic 16-bit fine-tuning methods. Fine-tuning is typically conducted on powerful, costly hardware; removing that requirement may further democratize the development of personalized chatbots and drive further advances in the field of open-source language models.

Unpacking the Methodology

QLoRA, short for Quantized Low-Rank Adaptation, improves upon the existing LoRA approach, a popular, cost-effective way to enhance AI models or modify their behavior. Instead of re-training models from scratch, a significant cost and drain on resources, LoRA allows researchers to refine existing pretrained AI models for better performance.
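To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of the low-rank adaptation concept: the pretrained weight matrix stays frozen, and only two small adapter matrices are trained. Class and parameter names are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pretrained linear layer with a small trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # pretrained weights stay frozen
        # Two small matrices: a down-projection and an up-projection.
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)     # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen pretrained projection plus a small trainable correction.
        return self.base(x) + self.scale * self.lora_up(self.lora_down(x))
```

Because only the two adapter matrices receive gradient updates, the number of trainable parameters is a tiny fraction of the full model, which is what makes LoRA-style fine-tuning so much cheaper than full retraining.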

Since Meta's LLaMA language model debuted in open-source format, researchers have used LoRA to improve the base model, leading to the development of advanced offshoots such as Vicuna and Alpaca. Google AI engineer Luke Sernau, the author of a notable memo that said neither Google nor OpenAI had any moats with their AI models, identified this rise of LoRA fine-tuning as a pivotal force enabling open-source models to rival their closed-source counterparts.

QLoRA represents a further leap in efficiency, delivering similar performance to existing techniques at a fraction of the cost by introducing several noteworthy innovations (sketched in code after this list):

  • A 4-bit NormalFloat data type significantly reduces memory usage while maintaining better precision than conventional 4-bit data types such as 4-bit integers and 4-bit floats.

  • A double quantization method further optimizes memory usage by quantizing the quantization constants themselves.

  • Paged Optimizers enhance memory management, effectively smoothing out spikes typically associated with processing longer data sequences.
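For readers who want a feel for how the first two pieces fit together, the following is a simplified, illustrative Python sketch of blockwise 4-bit quantization and double quantization. The evenly spaced levels, block sizes, and function names below are assumptions for illustration; the actual NF4 levels in the paper are derived from the quantiles of a normal distribution.

```python
import torch

# 16 illustrative quantization levels; the real NF4 levels are derived from
# the quantiles of a standard normal distribution rather than evenly spaced.
LEVELS_4BIT = torch.linspace(-1.0, 1.0, 16)

def quantize_blockwise(weights: torch.Tensor, block_size: int = 64):
    """Quantize a flat weight tensor to 4-bit indices with one absmax scale per block."""
    blocks = weights.reshape(-1, block_size)
    absmax = blocks.abs().max(dim=1, keepdim=True).values   # per-block scaling constant
    normalized = blocks / absmax                             # values now lie in [-1, 1]
    # Map each normalized value to the index of its nearest quantization level.
    indices = (normalized.unsqueeze(-1) - LEVELS_4BIT).abs().argmin(dim=-1)
    return indices.to(torch.uint8), absmax.squeeze(1)

def double_quantize(absmax: torch.Tensor, block_size: int = 256):
    """Quantize the 32-bit scaling constants themselves, saving additional memory."""
    blocks = absmax.reshape(-1, block_size)
    outer_scale = blocks.abs().max(dim=1, keepdim=True).values
    quantized_constants = torch.round(blocks / outer_scale * 127).to(torch.int8)
    return quantized_constants, outer_scale.squeeze(1)

def dequantize_blockwise(indices: torch.Tensor, absmax: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original weights for computation."""
    return (LEVELS_4BIT[indices.long()] * absmax.unsqueeze(1)).reshape(-1)
```

In QLoRA, the frozen base model is stored in this compressed 4-bit form and dequantized on the fly during forward passes, while the small LoRA adapters remain in higher precision and are the only weights being trained.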

In their investigation, the researchers employed QLoRA to train a thousand models with varying parameter sizes, instruction tuning data sets, and model architectures. Among these, they handpicked a few standout models, collectively named Guanaco, to showcase in their research report. A live demo of one of the Guanaco language models is available here.

To compare chatbot performance, the various AI models were evaluated in a tournament-style benchmarking contest graded by both GPT-4 and human annotators. This approach of using GPT-4 to judge which response is preferred was notably used by the researchers behind the Vicuna language model, but the QLoRA researchers also note that the Vicuna benchmark method has weaknesses that are important to understand.

Vast Efficiency Gains with High Performance

The crowning achievement of QLoRA is its ability to dramatically shrink the memory footprint during model fine-tuning. A 65-billion parameter chatbot, fine-tuned with QLoRA, matched the performance of a 16-bit fine-tuned counterpart but required substantially less memory—just 48GB of GPU memory compared to the latter's whopping 780GB. As a result, QLoRA paves the way for fine-tuning even the largest public models on single professional GPUs, such as Nvidia’s A100.

Additionally, the researchers discovered that fine-tuning with QLoRA is possible on consumer GPUs. They successfully trained a 33-billion parameter model on a 24GB consumer GPU in under 12 hours, all without any loss of performance compared to the 16-bit baseline.

When benchmarked with GPT-4 and human-evaluated scoring, their fine-tuned Guanaco 65B model achieved 99.8% of the performance of ChatGPT (GPT-3.5) on the Vicuna benchmark. Remarkably, the Guanaco 33B model, trained on a 24GB consumer GPU, scored 97.8% of ChatGPT's performance level, an impressive result given the modest hardware and training time.

The researchers fully acknowledge that the Vicuna benchmark is useful but also flawed. In asking both GPT-4 and human evaluators to pick preferred answers from language models, they found that GPT-4 would assign higher scores to whichever system appeared first in the prompt it evaluated.

Agreement between evaluators also varied. Human annotators, and even the researchers themselves, disagreed on preferred responses, and the highest-quality models generated the most disagreement on preferred answers across all grading methods. Nonetheless, absent a more robust method of comparing language model performance, the researchers determined that the Vicuna benchmark was still a valuable approach.

When human raters and GPT-4 judged model outputs across the 80 prompts in the Vicuna benchmark, both largely preferred Guanaco’s 65B and 33B models to ChatGPT-3.5 Turbo, though the edge was small (a 10-point difference in Elo corresponds to roughly a 1.5% difference in win rate).
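For readers unfamiliar with Elo, here is a small, self-contained sketch of the standard Elo expected-score formula and rating update used in tournament-style comparisons; the K-factor and starting ratings below are generic assumptions, not necessarily the exact values used in the paper.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after a single head-to-head comparison."""
    expected_a = elo_expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# A 10-point rating edge translates to only a ~51.4% expected win rate,
# i.e. roughly a 1.5-point advantage over an even matchup.
print(round(elo_expected_score(1010, 1000), 3))  # 0.514
```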

QLoRA's performance across human and GPT-4 evaluation in various benchmarks, conducted in a tournament-style contest. Photo credit: arXiv

To enable readers to test their own skills judging ChatGPT-3.5 responses versus Guanaco-generated responses, the researchers have made this interactive app available to all.

Implications of QLoRA for the Future of LLMs

The advent of QLoRA could fuel a surge in the development of cutting-edge language models as researchers and consumers no longer need expensive computing resources to produce new models. More importantly, QLoRA's vastly efficient memory requirements could make it possible to fine-tune models on mobile devices—a prospect currently unrealized.

In a milestone moment for AI, the researchers believe QLoRA's vastly more efficient memory requirements could enable mobile phones to fine-tune models. While 7-billion parameter models can already run on phones, fine-tuning them on-device has not been possible to date. Personalized models that improve over time, all handled locally with the benefits of data privacy, could now be within reach: the researchers estimate that an iPhone 12 could fine-tune 3 million tokens per night.

Open-source advocates should have much to cheer for. In his leaked Google memo, AI engineer Luke Sernau wrote that “being able to personalize a language model in a few hours on consumer hardware is a big deal, particularly for aspirations that involve incorporating new and diverse knowledge in near real-time.” That future may be arriving mere months after Sernau’s memo.

In a testament to the rapid strides of the open-source community, QLoRA's 4-bit quantization is now available in Hugging Face transformers. This feature allows models to be loaded in 4-bit with a single-line change in the loading code.
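As a rough illustration of what that looks like in practice, the snippet below loads a model in 4-bit NF4 with double quantization through the transformers and bitsandbytes integration; the model identifier is a placeholder, and readers should consult the current Hugging Face documentation for the exact API.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit NormalFloat quantization with double quantization,
# the scheme introduced by QLoRA and exposed through bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",            # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```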

Embracing the potential of this innovation, Andrej Karpathy, OpenAI lead researcher and former head of Tesla AI, succinctly encapsulated the sentiment of the AI community: “Wow, very nice ‘full-stack’ release.”
