
Not all GPUs are equal — and choosing the wrong one can cost you thousands in wasted compute. This guide breaks down the H100, A100, and L40S by workload type so you can match the right GPU to your training, fine-tuning, or inference job.
Marcus Reid
Senior DevOps Engineer · LightYear Cloud
The cloud GPU market has expanded dramatically over the past two years. Where developers once had a handful of options, they now face a matrix of GPU models, memory configurations, and pricing tiers that can be genuinely difficult to navigate. The three GPUs that dominate serious AI workloads in 2026 are NVIDIA's H100, A100, and L40S — each with distinct strengths, and each suited to different types of work. Choosing the wrong one does not just affect performance; it directly affects your compute bill.
This guide breaks down each GPU by architecture, memory, and real-world performance characteristics, then maps them to the four most common AI workload types: large model training, fine-tuning, batch inference, and real-time inference. By the end, you will have a clear framework for making the right choice for your specific use case.
NVIDIA H100 (Hopper architecture) is the current flagship for AI compute. Available in SXM5 (80 GB HBM3, 3.35 TB/s memory bandwidth) and PCIe (80 GB HBM2e, 2 TB/s) variants, the H100 introduced the Transformer Engine, which dynamically switches between FP8 and FP16 precision to accelerate the transformer layers that underpin large language models. FP8 training support allows the H100 to deliver up to 4x the throughput of the A100 on transformer workloads, while NVLink 4.0 enables tight multi-GPU coupling for models that span multiple cards. The H100 is the most expensive option, but for workloads that can fully utilise its capabilities, it is also the most cost-efficient per unit of useful compute.
NVIDIA A100 (Ampere architecture) remains a workhorse for AI in 2026. Available in 40 GB and 80 GB HBM2e configurations, the A100 offers 2 TB/s memory bandwidth and strong FP16/BF16 performance. It lacks the H100's Transformer Engine and FP8 support, which means it is slower on pure transformer training, but it is a highly capable and well-understood GPU with mature software support across every major framework. The A100 80 GB is particularly well-suited to workloads that require large memory capacity but do not need the absolute peak throughput of the H100.
NVIDIA L40S (Ada Lovelace architecture) occupies a different position in the market. With 48 GB of GDDR6 memory (rather than HBM), the L40S has lower memory bandwidth than either the H100 or A100, but it offers strong FP32 and FP16 performance at a significantly lower price point. The L40S also includes dedicated hardware for ray tracing and video encoding, making it versatile beyond pure AI workloads. For inference and fine-tuning tasks that fit within 48 GB of VRAM, the L40S often delivers the best cost-per-token or cost-per-inference available.
Training a large language model from scratch — whether a 7B, 13B, 70B, or larger parameter model — is the most demanding AI workload in terms of both compute and memory. These jobs run for days or weeks, require multiple GPUs working in tight coordination, and are extremely sensitive to memory bandwidth and inter-GPU communication speed.
Recommendation: H100 SXM5. The Transformer Engine's FP8 training support delivers a step-change in throughput for transformer architectures. NVLink 4.0 provides 900 GB/s bidirectional bandwidth between GPUs in a multi-GPU node, which is critical for the all-reduce operations that dominate distributed training. For large model training, the H100's higher cost per hour is offset by significantly shorter training runs — a job that takes 7 days on A100s may complete in 3–4 days on H100s.
The A100 80 GB is a viable alternative if H100 availability is limited or budget is constrained. The L40S is not recommended for large model training due to its lower memory bandwidth and the absence of NVLink in most cloud configurations.
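The claim that a faster, pricier GPU can still be cheaper overall is easy to verify with back-of-envelope arithmetic. The sketch below uses purely hypothetical hourly rates and the 7-day vs 3.5-day run times mentioned above; actual cloud pricing varies by provider and region.

```python
# Back-of-envelope training cost comparison.
# Rates below are illustrative placeholders, not real LightYear pricing.
H100_RATE = 3.50  # $/GPU-hour (hypothetical)
A100_RATE = 1.80  # $/GPU-hour (hypothetical)

def job_cost(rate_per_hour: float, days: float, num_gpus: int) -> float:
    """Total cost of a multi-GPU training run."""
    return rate_per_hour * days * 24 * num_gpus

# Same training job: 7 days on 8x A100, ~2x faster on 8x H100.
a100_cost = job_cost(A100_RATE, days=7.0, num_gpus=8)
h100_cost = job_cost(H100_RATE, days=3.5, num_gpus=8)

print(f"A100 x8, 7.0 days: ${a100_cost:,.0f}")
print(f"H100 x8, 3.5 days: ${h100_cost:,.0f}")
```

With these assumed numbers the H100 run comes out slightly cheaper in total despite costing nearly twice as much per hour, and it returns the trained model in half the time.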
Fine-tuning — adapting a pre-trained model to a specific domain or task using a smaller dataset — is far more common than training from scratch. Techniques like LoRA (Low-Rank Adaptation) and QLoRA have made it possible to fine-tune 7B and 13B parameter models on a single GPU with 24–48 GB of VRAM, and 70B models on two to four GPUs.
Recommendation: A100 40 GB or L40S 48 GB. For most fine-tuning jobs, the A100 40 GB or L40S 48 GB offers the best balance of memory capacity and cost. A QLoRA fine-tune of a 13B model fits comfortably in 24 GB; a 70B model with 4-bit quantisation fits in 48 GB. The L40S is particularly attractive here: its 48 GB of VRAM is 8 GB more than the A100 40 GB offers, and its lower hourly cost means fine-tuning jobs are cheaper to run. The H100 is overkill for most fine-tuning workloads unless you are running very large batches or need the fastest possible iteration cycle.
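The VRAM figures above follow from simple arithmetic: a quantised model needs roughly (parameters x bits / 8) bytes for weights, plus headroom for activations and LoRA adapters. The sketch below uses a crude 1.3x overhead factor; this is a rough heuristic for capacity planning, not a precise accounting, and real usage depends on sequence length, batch size, and framework.

```python
def quantized_model_gb(params_billion: float, bits: int,
                       overhead_factor: float = 1.3) -> float:
    """Rough VRAM needed to fine-tune a quantized model.

    overhead_factor is a crude allowance for activations, LoRA adapter
    weights, and optimizer state -- an assumption, not a measured value.
    """
    weight_gb = params_billion * bits / 8  # 1B params at 8 bits = 1 GB
    return weight_gb * overhead_factor

# 13B at 4-bit: ~6.5 GB of weights, ~8.5 GB with overhead -> fits in 24 GB
print(round(quantized_model_gb(13, 4), 1))
# 70B at 4-bit: ~35 GB of weights, ~45.5 GB with overhead -> tight in 48 GB
print(round(quantized_model_gb(70, 4), 1))
```

This is why the 70B QLoRA case sits right at the edge of the L40S's 48 GB: there is room for the weights and adapters, but little slack for long sequences or large batches.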
Batch inference — processing large volumes of requests offline, such as generating embeddings for a document corpus, running classification over a dataset, or producing outputs for a queue of jobs — is throughput-bound. The goal is to maximise the number of tokens or samples processed per hour at the lowest cost.
Recommendation: H100 PCIe or A100 80 GB. For batch inference on large models (30B+ parameters), the H100's FP8 inference support and high memory bandwidth deliver the best throughput. For models in the 7B–13B range, the A100 80 GB or L40S 48 GB are more cost-effective — the model fits in a single GPU and the lower hourly rate more than compensates for the throughput difference. Frameworks like vLLM and TensorRT-LLM have excellent support for all three GPUs and will automatically use the most efficient precision available on the hardware.
Real-time inference — serving a model to end users with sub-second response times — has different requirements from batch processing. Latency is the primary constraint, not throughput. The GPU must respond quickly to individual requests, which means memory bandwidth and the ability to handle small batch sizes efficiently are more important than raw throughput at large batch sizes.
Recommendation: L40S 48 GB for cost-sensitive deployments; H100 for premium latency requirements. The L40S delivers excellent latency for 7B–13B models at a price point that makes it practical to run multiple replicas for load balancing. For applications where latency is critical and budget is less constrained — such as real-time coding assistants or customer-facing AI products — the H100's higher memory bandwidth translates directly into lower time-to-first-token. The A100 sits between the two in both performance and cost.
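Running multiple cheaper replicas versus fewer premium GPUs is itself a cost calculation. The sketch below sizes a fleet for a target peak load; the per-replica request rates and hourly prices are hypothetical placeholders you would replace with your own load-test numbers.

```python
import math

def fleet_cost(peak_rps: float, replica_rps: float,
               hourly_rate: float) -> tuple[int, float]:
    """Replicas needed to cover peak load, and the fleet's hourly cost."""
    replicas = math.ceil(peak_rps / replica_rps)
    return replicas, replicas * hourly_rate

# Hypothetical: serving a 13B model at 40 requests/sec peak.
# Per-replica capacity and prices are assumptions, not measurements.
print(fleet_cost(40, 9, 1.20))   # L40S fleet
print(fleet_cost(40, 18, 3.50))  # H100 fleet
```

With these numbers the L40S fleet needs more replicas but still costs less per hour, which is the pattern behind the cost-sensitive recommendation above; the H100 fleet buys lower time-to-first-token per request at a premium.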
To summarise the recommendations above: use the H100 SXM5 for large model training from scratch, where its Transformer Engine and NVLink bandwidth deliver the fastest training runs and lowest total cost despite the higher hourly rate. Use the A100 80 GB for large batch inference and memory-intensive workloads that require more than 48 GB of VRAM. Use the L40S 48 GB for fine-tuning, real-time inference on mid-size models, and any workload where cost efficiency is the primary concern — it consistently delivers the best cost-per-inference for 7B–13B models.
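The decision framework above is simple enough to encode directly. The helper below is a deliberate simplification of this guide's rules (real decisions also weigh availability and current pricing), and the 4-bit sizing heuristic inside it is an assumption, not a hard limit.

```python
def recommend_gpu(workload: str, model_size_b: float = 13) -> str:
    """Map a workload type to this guide's GPU recommendation.

    A simplification of the rules in the text: availability and
    live pricing should also factor into a real decision.
    """
    if workload == "training":
        return "H100 SXM5"
    if workload == "fine-tuning":
        # Crude 4-bit sizing: 0.5 bytes/param plus ~30% overhead.
        return "L40S 48GB" if model_size_b * 0.5 * 1.3 <= 48 else "A100 80GB"
    if workload == "batch-inference":
        return "H100 PCIe" if model_size_b >= 30 else "A100 80GB or L40S 48GB"
    if workload == "realtime-inference":
        return "L40S 48GB"  # or H100 for premium latency requirements
    raise ValueError(f"unknown workload: {workload}")

print(recommend_gpu("training"))
print(recommend_gpu("fine-tuning", model_size_b=70))
```

Even a 70B QLoRA fine-tune lands on the L40S under this heuristic, which matches the sizing discussion earlier in the guide.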
One final consideration: availability. H100 instances remain in high demand and are not always immediately available in every region. If your timeline is tight, the A100 and L40S are more reliably available and can be provisioned on demand. LightYear's Cloud GPU platform provides real-time availability across all three GPU types, with hourly billing so you only pay for the compute you use.
Deploy your first server in under 60 seconds. No contracts, pay only for what you use.
Get Started Free