GPU Cloud · 8 min read · March 15, 2026

How to Fine-Tune an LLM on a Cloud GPU: A Practical Guide

Fine-tuning a large language model on your own data can transform a generic foundation model into a domain expert. This step-by-step guide walks you through the process using a cloud GPU instance.


Marcus Reid

Senior DevOps Engineer · LightYear Cloud

Foundation models like Llama 3, Mistral, and Qwen are trained on trillions of tokens of general-purpose text. They are impressively capable out of the box, but they do not know your company's products, your codebase, your legal domain, or your customers' specific terminology. Fine-tuning bridges that gap — adapting a pre-trained model to your specific task using a relatively small dataset and a fraction of the compute required for pre-training.

This guide walks through the practical steps to fine-tune an open-source LLM on a cloud GPU instance, from choosing the right hardware to evaluating your results.

Understanding Fine-Tuning: Full vs. Parameter-Efficient

There are two broad approaches to fine-tuning. Full fine-tuning updates all of the model's weights during training. It produces the best results but requires storing and updating gradients and optimiser states for billions of parameters, which demands substantial GPU memory — typically 3–4x the model's inference footprint.

[Diagram: pre-training on a large general-purpose corpus vs. fine-tuning on a small domain dataset]

Parameter-Efficient Fine-Tuning (PEFT) methods — most notably LoRA (Low-Rank Adaptation) and QLoRA (Quantised LoRA) — freeze the original model weights and train only a small set of adapter parameters. QLoRA in particular makes it possible to fine-tune a 7B parameter model on a single GPU with 24 GB of VRAM, and a 13B model on a 40–48 GB GPU. For most practical fine-tuning tasks, QLoRA is the recommended starting point.
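To make the adapter savings concrete, here is a quick back-of-the-envelope calculation. It is a sketch using an illustrative 4096×4096 projection matrix (typical of 7B-class models) and a rank of 16; real layer shapes vary by architecture.

```python
# Parameter count for one LoRA-adapted projection layer.
# Illustrative dimensions only: d = k = 4096, rank r = 16.
d, k, r = 4096, 4096, 16

full_params = d * k        # frozen base weight: ~16.8M parameters
lora_params = r * (d + k)  # trainable adapter: B (d x r) plus A (r x k)

print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"trainable fraction: {lora_params / full_params:.2%}")
# -> full: 16,777,216  lora: 131,072  trainable fraction: 0.78%
```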

Choosing the Right GPU for Your Model Size

GPU VRAM is the primary constraint in fine-tuning. As a rule of thumb, a model requires approximately 2 GB of VRAM per billion parameters for inference in FP16, and 3–4x that for full fine-tuning. QLoRA reduces this dramatically through 4-bit quantisation. The table below gives practical guidance:

Model Size    Full Fine-Tune VRAM    QLoRA VRAM    Recommended GPU
7B            ~56 GB                 ~6–10 GB      A16 (16 GB) or A40 (48 GB)
13B           ~104 GB                ~12–18 GB     A40 (48 GB)
30B           ~240 GB                ~20–28 GB     A100 40 GB or A40 (48 GB)
70B           ~560 GB                ~48–56 GB     A100 80 GB (×2 recommended)
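If you want to sanity-check a configuration before provisioning, the rules of thumb above are easy to encode. This is a rough planning sketch, not a measurement: actual usage varies with sequence length, batch size, and optimiser choice, and the QLoRA constants are simply fitted to the table.

```python
# Back-of-the-envelope VRAM estimates from the rules of thumb above.
def estimate_vram_gb(params_billions: float) -> dict:
    inference_fp16 = 2.0 * params_billions   # ~2 GB per billion parameters
    full_finetune = 4.0 * inference_fp16     # gradients + optimiser states
    qlora = 0.75 * params_billions + 2.0     # 4-bit weights + adapter overhead (rough fit)
    return {"inference": inference_fp16, "full": full_finetune, "qlora": qlora}

for size_b in (7, 13, 30, 70):
    est = estimate_vram_gb(size_b)
    print(f"{size_b}B: inference ~{est['inference']:.0f} GB, "
          f"full fine-tune ~{est['full']:.0f} GB, QLoRA ~{est['qlora']:.0f} GB")
```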

Step-by-Step: Fine-Tuning Llama 3 with QLoRA

Step 1: Provision a GPU instance. Deploy a cloud GPU instance with at least 24 GB of VRAM — an A40 or A100 is ideal for a 7B–13B model. Choose Ubuntu 22.04 with CUDA pre-installed if your provider offers it, or install the CUDA toolkit manually after provisioning.

Step 2: Install dependencies. You will need PyTorch with CUDA support, the Hugging Face transformers and datasets libraries, peft for LoRA adapters, and bitsandbytes for 4-bit quantisation. A single pip install command handles all of these.
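The snippet below shows the typical install command (package names as published on PyPI; pin versions for reproducibility) and a quick check that PyTorch can actually see the GPU before you start a long job:

```python
# Install the training stack first (run in the shell):
#   pip install torch transformers datasets peft bitsandbytes trl accelerate
#
# Then verify that PyTorch can see the GPU:
import torch

assert torch.cuda.is_available(), "CUDA not visible - check drivers and toolkit"
print(torch.cuda.get_device_name(0))
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB VRAM")
```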

Step 3: Prepare your dataset. Fine-tuning data should be formatted as instruction-response pairs (for instruction-tuned models) or as plain text (for continued pre-training). A dataset of 1,000–10,000 high-quality examples is typically sufficient for domain adaptation. Format it as a Hugging Face Dataset object for compatibility with the training pipeline.
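A minimal sketch of the dataset step, using hypothetical instruction-response records and a simple prompt template (match the template to your base model's expected instruction format):

```python
from datasets import Dataset

# Hypothetical records; in practice you would load 1,000-10,000 of these
# from your own files or database.
records = [
    {"instruction": "Summarise this support ticket.", "response": "The customer reports..."},
    {"instruction": "Explain our refund policy.", "response": "Refunds are issued when..."},
]

# Collapse each pair into a single "text" field for the training pipeline.
def to_text(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Response:\n{example['response']}"}

dataset = Dataset.from_list(records).map(to_text)
split = dataset.train_test_split(test_size=0.5)  # demo split; use ~0.1 on a real dataset
train_ds, eval_ds = split["train"], split["test"]
```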

Step 4: Configure QLoRA. Load the base model in 4-bit precision using BitsAndBytesConfig, then wrap it with a LoRA configuration targeting the attention projection layers (q_proj, v_proj). A rank of 16–64 and an alpha of 32–128 are common starting points.
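Here is one way to wire that up with transformers and peft. The model ID is an assumption (Llama 3 weights are gated; substitute whatever base model you are licensed to use), and the LoRA values are the starting points mentioned above, not tuned results:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed ID; check the model card

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                   # rank: 16-64 is a common range
    lora_alpha=32,                          # alpha: 32-128 is a common range
    target_modules=["q_proj", "v_proj"],    # attention projection layers
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # confirms only adapters are trainable
```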


Step 5: Train. Use the Hugging Face Trainer or the trl library's SFTTrainer for supervised fine-tuning. Monitor training loss and validation loss — if validation loss starts increasing while training loss continues to fall, you are overfitting and should reduce the number of epochs or increase regularisation.
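A minimal training sketch with trl's SFTTrainer, assuming a recent trl release (the SFTTrainer API has changed between versions, so check the docs for yours); the hyperparameters are illustrative starting points:

```python
from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    output_dir="llama3-qlora",         # hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch size of 16
    learning_rate=2e-4,                # a common QLoRA starting point
    logging_steps=10,
    eval_strategy="steps",             # "evaluation_strategy" on older transformers
    eval_steps=50,                     # watch eval_loss for overfitting
)

trainer = SFTTrainer(
    model=model,                       # the PEFT-wrapped model from Step 4
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()
trainer.save_model("llama3-qlora/final")  # saves the LoRA adapter weights
```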

Step 6: Merge and export. After training, merge the LoRA adapter weights back into the base model and save the result in the Hugging Face format. The merged model can be served with vLLM, Ollama, or any other inference framework.
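One caveat worth knowing: adapters trained with QLoRA are normally merged into an unquantised copy of the base weights, not the 4-bit model. A sketch of the merge step, with hypothetical paths:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the base model in half precision, then fold the adapter in.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "llama3-qlora/final").merge_and_unload()

merged.save_pretrained("llama3-finetuned")  # standard Hugging Face format
AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B").save_pretrained("llama3-finetuned")
```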

Cost Estimate for a Typical Fine-Tuning Run

A QLoRA fine-tuning run on a 7B model with 5,000 training examples typically completes in 1–3 hours on an A40 GPU. At LightYear's hourly GPU pricing, a complete fine-tuning job for a 7B model costs less than a cup of coffee. Even a 70B fine-tuning run on a pair of A100s typically fits within a single day's compute budget for most teams.

The key advantage of cloud GPU fine-tuning over local hardware is the ability to run multiple experiments in parallel — trying different hyperparameters, datasets, or base models simultaneously — and only paying for the compute you actually use.

GPU Cloud — NVIDIA A16, A40, A100, L40S

Deploy a GPU instance on LightYear

On-demand NVIDIA GPU servers billed by the hour. No contracts, no minimum spend. Spin up in under 60 seconds.
