
Fine-tuning a large language model on your own data can transform a generic foundation model into a domain expert. This step-by-step guide walks you through the process using a cloud GPU instance.
Marcus Reid
Senior DevOps Engineer · LightYear Cloud
Foundation models like Llama 3, Mistral, and Qwen are trained on trillions of tokens of general-purpose text. They are impressively capable out of the box, but they do not know your company's products, your codebase, your legal domain, or your customers' specific terminology. Fine-tuning bridges that gap — adapting a pre-trained model to your specific task using a relatively small dataset and a fraction of the compute required for pre-training.
This guide walks through the practical steps to fine-tune an open-source LLM on a cloud GPU instance, from choosing the right hardware to evaluating your results.
There are two broad approaches to fine-tuning. Full fine-tuning updates all of the model's weights during training. It produces the best results but requires storing and updating billions of parameters — plus their gradients and optimiser states — which demands substantial GPU memory, typically 3–4x the model's inference footprint.
Parameter-Efficient Fine-Tuning (PEFT) methods — most notably LoRA (Low-Rank Adaptation) and QLoRA (Quantised LoRA) — freeze the original model weights and train only a small set of adapter parameters. QLoRA in particular makes it possible to fine-tune a 7B parameter model on a single GPU with 24 GB of VRAM, and a 13B model on a 40–48 GB GPU. For most practical fine-tuning tasks, QLoRA is the recommended starting point.
GPU VRAM is the primary constraint in fine-tuning. As a rule of thumb, a model requires approximately 2 GB of VRAM per billion parameters for inference in FP16, and 3–4x that for full fine-tuning. QLoRA reduces this dramatically through 4-bit quantisation. The table below gives practical guidance:
| Model Size | Full Fine-Tune VRAM | QLoRA VRAM | Recommended GPU |
|---|---|---|---|
| 7B | ~56 GB | ~6–10 GB | A16 (16 GB) or A40 (48 GB) |
| 13B | ~104 GB | ~12–18 GB | A40 (48 GB) |
| 30B | ~240 GB | ~20–28 GB | A100 40 GB or A40 |
| 70B | ~560 GB | ~48–56 GB | A100 80 GB (×2 recommended) |
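The rule of thumb above can be turned into a quick back-of-the-envelope estimator. This is a sketch, not a sizing tool — the QLoRA overhead term is an assumption covering adapters, activations, and CUDA overhead, and real usage varies with sequence length and batch size:

```python
def estimate_vram_gb(params_b: float) -> dict:
    """Rough VRAM estimates (GB) from the rule of thumb above:
    ~2 GB per billion parameters for FP16 inference, ~4x that for
    full fine-tuning, and ~0.5 GB per billion for 4-bit weights
    plus a flat overhead allowance for QLoRA (an assumption)."""
    inference = 2.0 * params_b
    full_finetune = 4 * inference
    qlora = 0.5 * params_b + 6.0  # 4-bit weights + adapter/activation overhead
    return {"inference": inference, "full": full_finetune, "qlora": qlora}

for size in (7, 13, 70):
    est = estimate_vram_gb(size)
    print(f"{size}B: inference ~{est['inference']:.0f} GB, "
          f"full ~{est['full']:.0f} GB, QLoRA ~{est['qlora']:.0f} GB")
```

For a 7B model this reproduces the table's ~56 GB full fine-tune and lands inside the ~6–10 GB QLoRA band; very large models trend toward the upper end of the table's QLoRA column because activation memory grows with model width.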
Step 1: Provision a GPU instance. Deploy a cloud GPU instance with at least 24 GB of VRAM — an A40 or A100 is ideal for a 7B–13B model. Choose Ubuntu 22.04 with CUDA pre-installed if your provider offers it, or install the CUDA toolkit manually after provisioning.
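Once the instance is up, it is worth sanity-checking the driver and toolkit before installing anything else. A minimal check, assuming a standard Ubuntu + NVIDIA driver image:

```shell
# Confirm the GPU is visible and note the driver/CUDA version in the header
nvidia-smi

# If you installed the CUDA toolkit manually, confirm nvcc is on the PATH
nvcc --version
```

If `nvidia-smi` fails, the driver is missing or mismatched — fix that before proceeding, since PyTorch will silently fall back to CPU otherwise.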
Step 2: Install dependencies. You will need PyTorch with CUDA support, the Hugging Face transformers and datasets libraries, peft for LoRA adapters, and bitsandbytes for 4-bit quantisation. A single pip install command handles all of these.
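A single command along these lines covers the stack (package names are current as of writing; you may want to pin versions for reproducibility, and `accelerate` is included because the Hugging Face Trainer uses it under the hood):

```shell
pip install torch transformers datasets peft bitsandbytes trl accelerate
```

On a CUDA-preinstalled image the default `torch` wheel usually ships with CUDA support; if not, follow the install selector on pytorch.org for the wheel matching your CUDA version.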
Step 3: Prepare your dataset. Fine-tuning data should be formatted as instruction-response pairs (for instruction-tuned models) or as plain text (for continued pre-training). A dataset of 1,000–10,000 high-quality examples is typically sufficient for domain adaptation. Format it as a Hugging Face Dataset object for compatibility with the training pipeline.
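A minimal sketch of the instruction-response format, using made-up example records and a deliberately simple prompt template (in a real run, prefer the base model's own chat template):

```python
import json

# Hypothetical domain examples in instruction-response form; replace
# these with your own data.
examples = [
    {"instruction": "Summarise the refund policy for enterprise customers.",
     "response": "Enterprise customers may request a refund within 30 days."},
    {"instruction": "What does error code 402 mean in the billing API?",
     "response": "The account has exceeded its monthly GPU quota."},
]

def to_prompt(ex: dict) -> str:
    """Render one pair into a simple instruction-tuning template."""
    return (f"### Instruction:\n{ex['instruction']}\n\n"
            f"### Response:\n{ex['response']}")

# Write JSONL with a single "text" column -- a format that
# datasets.load_dataset("json", ...) ingests directly.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps({"text": to_prompt(ex)}) + "\n")
```

From here, `datasets.load_dataset("json", data_files="train.jsonl")` (or `Dataset.from_list` for in-memory data) gives you the Dataset object the training pipeline expects.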
Step 4: Configure QLoRA. Load the base model in 4-bit precision using BitsAndBytesConfig, then wrap it with a LoRA configuration targeting the attention projection layers (q_proj, v_proj). A rank of 16–64 and an alpha of 32–128 are common starting points.
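The configuration described above looks roughly like this. The model name is illustrative, and the rank/alpha values are just one point in the ranges given; this is a sketch of the `transformers` + `peft` API, not a tuned recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL = "meta-llama/Meta-Llama-3-8B"  # illustrative; use your base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=32,                                 # rank: 16-64 is a common range
    lora_alpha=64,                        # often set to 1-2x the rank
    target_modules=["q_proj", "v_proj"],  # attention projections, per above
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are typically <1% of params
```

Some practitioners also target the remaining projection layers (`k_proj`, `o_proj`, and the MLP projections) for a small quality gain at the cost of more trainable parameters.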
Step 5: Train. Use the Hugging Face Trainer or the trl library's SFTTrainer for supervised fine-tuning. Monitor training loss and validation loss — if validation loss starts increasing while training loss continues to fall, you are overfitting and should reduce the number of epochs or increase regularisation.
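A training sketch with `trl`'s SFTTrainer, continuing from the `model` above and the JSONL file from step 3. The exact argument names vary somewhat between `trl` releases, so treat the hyperparameters here as assumed starting points rather than a recipe:

```python
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Hold out 10% for validation so overfitting is visible in eval_loss.
dataset = load_dataset("json", data_files="train.jsonl", split="train")
split = dataset.train_test_split(test_size=0.1)

trainer = SFTTrainer(
    model=model,                      # the PEFT-wrapped model from step 4
    train_dataset=split["train"],
    eval_dataset=split["test"],
    args=SFTConfig(
        output_dir="out",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size of 16
        learning_rate=2e-4,
        eval_strategy="steps",          # log eval_loss alongside train loss
        eval_steps=50,
        logging_steps=10,
    ),
)
trainer.train()
```

Watch the logged `eval_loss`: if it climbs while the training loss keeps falling, stop early or reduce the epoch count.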
Step 6: Merge and export. After training, merge the LoRA adapter weights back into the base model and save in the Hugging Face format. The resulting model can be served with vLLM, Ollama, or any other inference framework.
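The merge step, sketched with `peft`'s helper class (the checkpoint path matches the `output_dir` used during training; recent `peft` versions dequantise a 4-bit base model automatically when merging):

```python
from peft import AutoPeftModelForCausalLM

# Load the trained adapter checkpoint together with its base model,
# then fold the adapter weights into the base weights.
model = AutoPeftModelForCausalLM.from_pretrained("out", torch_dtype="auto")
merged = model.merge_and_unload()     # returns a plain transformers model

merged.save_pretrained("merged-model")  # standard HF format, ready to serve
# Also save the tokenizer alongside it, e.g.:
# AutoTokenizer.from_pretrained(<base model id>).save_pretrained("merged-model")
```

The `merged-model` directory can then be pointed at directly by vLLM (`vllm serve ./merged-model`) or imported into other inference frameworks.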
A QLoRA fine-tuning run on a 7B model with 5,000 training examples typically completes in 1–3 hours on an A40 GPU. At LightYear's hourly GPU pricing, a complete fine-tuning job for a 7B model costs less than the price of a cup of coffee. Even a 70B model fine-tuning run on a pair of A100s is typically achievable within a single day's budget for most teams.
The key advantage of cloud GPU fine-tuning over local hardware is the ability to run multiple experiments in parallel — trying different hyperparameters, datasets, or base models simultaneously — and only paying for the compute you actually use.
On-demand NVIDIA GPU servers billed by the hour. No contracts, no minimum spend. Spin up in under 60 seconds.