Fine-Tune LLaMA 3 with QLoRA on a Single GPU
Large Language Models like LLaMA 3 have revolutionized what's possible with AI. But fine-tuning a model of this size seems impossible without a cluster of A100 GPUs, right? Wrong.
With QLoRA (Quantized Low-Rank Adaptation), you can fine-tune LLaMA 3 8B on a single GPU with as little as 24GB of VRAM. This guide walks you through the entire process.
What You'll Learn
- How QLoRA reduces memory requirements by 75%+
- Dataset preparation for instruction fine-tuning
- Training configuration and hyperparameter tuning
- Evaluation and benchmarking your fine-tuned model
- Deploying with vLLM for production inference
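Before diving in, it's worth seeing where the 75%+ memory saving in the list above comes from: fp16 stores each weight in 2 bytes, while NF4 quantization stores each in roughly half a byte (4 bits), plus a small overhead for quantization constants. A quick back-of-envelope check:

```python
# Rough weight-memory comparison for an 8B-parameter model.
# These are approximations; real usage adds activations, optimizer
# state, KV cache, and per-block quantization constants.
params = 8e9

fp16_gb = params * 2 / 1e9    # 2 bytes per weight
nf4_gb = params * 0.5 / 1e9   # 4 bits per weight

reduction = 1 - nf4_gb / fp16_gb
print(f"fp16: {fp16_gb:.0f} GB, nf4: {nf4_gb:.0f} GB, saved: {reduction:.0%}")
# → fp16: 16 GB, nf4: 4 GB, saved: 75%
```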
Prerequisites
- Python 3.10+ with PyTorch 2.x
- A GPU with 24GB+ VRAM (RTX 4090, A5000, or Colab A100)
- Hugging Face account with LLaMA 3 access
Step 1: Install Dependencies
pip install torch transformers peft bitsandbytes
pip install trl datasets accelerate
pip install flash-attn --no-build-isolation
Step 2: Load Model in 4-bit Quantization
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

This loads the 8B model using only ~6GB of VRAM instead of the usual 16GB+.
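That ~6GB figure can be sanity-checked by hand. bitsandbytes quantizes only the linear layers; the input embeddings and the (untied) LM head stay in bf16, and together they account for roughly a billion parameters given LLaMA 3's 128k-token vocabulary. The dimensions below are approximations based on the published Llama-3-8B config, not values read back from the loaded model:

```python
# Back-of-envelope VRAM estimate for 4-bit LLaMA 3 8B.
vocab, hidden = 128_256, 4_096
embed_params = vocab * hidden      # input embeddings, kept in bf16
lm_head_params = vocab * hidden    # output head (untied in LLaMA 3), bf16
linear_params = 8.0e9 - embed_params - lm_head_params  # quantized to NF4

bf16_gb = (embed_params + lm_head_params) * 2 / 1e9    # 2 bytes per weight
nf4_gb = linear_params * 0.5 / 1e9                     # ~4 bits per weight
total = bf16_gb + nf4_gb
print(f"~{total:.1f} GB of weights")  # → ~5.6 GB of weights
```

Quantization constants and CUDA overhead push the real number up toward the ~6GB you'll see in `nvidia-smi`.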
Step 3: Configure LoRA
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 (about 0.17% of the 8B total)

Want hands-on help with this? Book a 1-on-1 session and we'll fine-tune a model together on your dataset.
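The trainable-parameter count above can be reproduced by hand: each LoRA adapter on a `d_in → d_out` projection adds `r × (d_in + d_out)` parameters. The dimensions below are assumptions taken from the public Llama-3-8B config (32 layers, hidden size 4096, and 1024-dim k/v projections due to grouped-query attention):

```python
# Reproduce print_trainable_parameters() for the LoRA config above.
r, layers, hidden, kv_dim = 16, 32, 4096, 1024

per_layer = (
    r * (hidden + hidden)    # q_proj: 4096 -> 4096
    + r * (hidden + kv_dim)  # k_proj: 4096 -> 1024 (grouped-query attention)
    + r * (hidden + kv_dim)  # v_proj: 4096 -> 1024
    + r * (hidden + hidden)  # o_proj: 4096 -> 4096
)
total = per_layer * layers
print(total)  # → 13631488
```

This is also why LoRA is so cheap to train: the optimizer only tracks these ~13.6M adapter weights, while the quantized base model stays frozen.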
This is just the beginning — the full article covers training, evaluation, merging adapters, and deploying with vLLM.
Ready for the Full Deep Dive?
This article is part of the LLM Fine-Tuning Workshop course, which includes 18 lessons, Colab notebooks, and 14 hours of content.
Need Help Fine-Tuning Your Model?
Book a session and I'll help you fine-tune on your specific dataset.
Book a Session →