
Optimise GPU Memory Usage for Deep Learning

Techniques to reduce GPU memory consumption and fit larger models or batch sizes on your GPU.

advanced
10 min read
LightYear Team
Updated April 24, 2026
gpu · memory · pytorch · optimisation

Why GPU Memory Matters

GPU memory (VRAM) is the primary constraint in deep learning: model weights, gradients, optimizer states, and activations all compete for it. Exhausting it triggers a CUDA out-of-memory error and crashes your training run. Each technique below targets one or more of those consumers.

Profiling Memory Usage

PYTHON
import torch

# Check current allocation
print(torch.cuda.memory_allocated() / 1e9, "GB allocated")
print(torch.cuda.memory_reserved() / 1e9, "GB reserved")

# Detailed human-readable report (memory_summary returns a string)
print(torch.cuda.memory_summary(device=0, abbreviated=False))
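
To find the high-water mark of a single training step, PyTorch also tracks peak allocation. A minimal sketch; train_step is a hypothetical stand-in for one forward/backward/optimizer pass:

PYTHON
# Reset the peak counter, run one step, then read the high-water mark
torch.cuda.reset_peak_memory_stats()
train_step()  # hypothetical: one forward/backward/optimizer step
print(torch.cuda.max_memory_allocated() / 1e9, "GB peak")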

Technique 1 — Mixed Precision Training (AMP)

PYTHON
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # scales the loss to avoid FP16 gradient underflow

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with autocast():
        output = model(inputs)
        loss = criterion(output, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales gradients, then steps
    scaler.update()

Roughly halves activation memory (activations are kept in FP16) with minimal accuracy impact.
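
On GPUs with native bfloat16 support (Ampere and newer), a common variant is bf16 autocast: bf16 has the same exponent range as FP32, so gradient scaling is unnecessary. A sketch, reusing the model, optimizer, and dataloader from above:

PYTHON
for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(inputs)
        loss = criterion(output, targets)
    loss.backward()   # no GradScaler needed with bf16
    optimizer.step()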

Technique 2 — Gradient Checkpointing

PYTHON
from torch.utils.checkpoint import checkpoint_sequential

# Recompute activations during the backward pass instead of storing them.
# model.layers must be an nn.Sequential (or list of modules) to split into segments.
output = checkpoint_sequential(model.layers, segments=4, input=x, use_reentrant=False)

Reduces activation memory at the cost of ~30% slower training.
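
Hugging Face Transformers models are not plain nn.Sequential stacks, but they expose the same technique through a built-in toggle:

PYTHON
# Enable activation recomputation on any Hugging Face PreTrainedModel
model.gradient_checkpointing_enable()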

Technique 3 — Gradient Accumulation

PYTHON
accumulation_steps = 4  # effective batch size = per-step batch size x 4
optimizer.zero_grad()

for i, (inputs, targets) in enumerate(dataloader):
    output = model(inputs)
    loss = criterion(output, targets) / accumulation_steps  # scale so accumulated gradients average correctly
    loss.backward()
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Simulates a larger batch size without increasing memory.
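
Gradient accumulation composes cleanly with AMP from Technique 1. A sketch combining the two, reusing scaler, optimizer, and accumulation_steps from above:

PYTHON
for i, (inputs, targets) in enumerate(dataloader):
    with autocast():
        output = model(inputs)
        loss = criterion(output, targets) / accumulation_steps
    scaler.scale(loss).backward()  # gradients accumulate across iterations

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()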

Technique 4 — 4-bit Quantisation (QLoRA)

PYTHON
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)

Shrinks a 70B-parameter model's weights from ~140 GB in FP16 to ~35 GB in 4-bit.
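
The LoRA half of QLoRA attaches small trainable adapters to the frozen 4-bit base model, typically via the peft library. A sketch; the rank and target module names below are illustrative and depend on the architecture:

PYTHON
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)  # freeze base weights for k-bit training
lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections (illustrative)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters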

Technique 5 — Offload to CPU/NVMe

PYTHON
# DeepSpeed ZeRO-Infinity: stage 3 partitioning with optimizer states
# offloaded to CPU RAM and parameters paged out to NVMe
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "nvme", "nvme_path": "/mnt/nvme"},
    }
}
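
The config dict is handed to DeepSpeed when wrapping the model; a minimal sketch, assuming model and ds_config from above:

PYTHON
import deepspeed

# DeepSpeed returns an engine that manages partitioning and offload per ds_config
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)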

Summary

Technique              | Memory Saving               | Speed Impact
AMP (FP16)             | ~50%                        | ~20% faster
Gradient Checkpointing | ~40%                        | ~30% slower
Gradient Accumulation  | Constant (vs. larger batch) | Neutral
4-bit QLoRA            | ~75%                        | ~15% slower
CPU Offload            | ~80%                        | ~50% slower

