Most GPU workloads run at 30-50% utilization. That means you're paying for two to three times the compute you actually use. Profiling shows you exactly where the waste is, and in most cases a few targeted fixes can push utilization above 80%. For a workload that starts near 40%, that roughly halves the GPU bill.
Here's how to find and fix the most common utilization killers.
Step 1: Measure Baseline Utilization
Before optimizing, know where you stand:
# Quick snapshot
nvidia-smi
# Continuous monitoring (log every second; timeout stops it after 5 minutes)
timeout 300 nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used,power.draw --format=csv -l 1 > gpu_log.csv
Healthy numbers to target:
- GPU utilization: 80-95% during active processing
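- Memory utilization: 70-90% of capacity, measured as memory.used against the card's total memory (unused memory is wasted money). Note that utilization.memory reports how busy the memory controller is, not how full the memory is.
- Power draw: near the TDP during compute (400W for A100 SXM, 300W for A100 80GB PCIe, 700W for H100 SXM)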
If GPU utilization hovers under 50%, the GPU is waiting for something — data, CPU preprocessing, network I/O, or Python itself.
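To turn the log from the monitoring command into a baseline number, a short script can summarize it. This is a minimal sketch: it assumes the CSV keeps nvidia-smi's default header names with units (such as `utilization.gpu [%]`), and the function name `summarize_gpu_log` and its `low_threshold` argument are illustrative choices, not part of any tool.

```python
import csv
import io


def _number(cell: str) -> float:
    """Strip the unit suffix nvidia-smi appends, e.g. '45 %' -> 45.0."""
    return float(cell.strip().split()[0])


def summarize_gpu_log(csv_text: str, low_threshold: float = 50.0) -> dict:
    """Return mean GPU utilization and the share of samples below a threshold."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = next(reader)
    # Find the column by prefix so the unit suffix in the header does not matter.
    util_col = next(i for i, name in enumerate(header)
                    if name.strip().startswith("utilization.gpu"))
    samples = [_number(row[util_col]) for row in reader if row]
    low = sum(1 for s in samples if s < low_threshold)
    return {
        "samples": len(samples),
        "mean_util": sum(samples) / len(samples),
        "share_below_threshold": low / len(samples),
    }
```

Calling `summarize_gpu_log(open("gpu_log.csv").read())` on the file written above gives the mean utilization and the fraction of seconds spent under 50%. The second number is often more telling than the mean, because a job that alternates between 95% and 0% looks fine on average.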
Step 2: Find the Bottleneck
Use PyTorch Profiler to see exactly what's happening:
import torch
from torch.profiler import profile, ProfilerActivity, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # Skip 1 step, warm up for 1, then record 3
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./prof"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(train_loader):
        if step >= 10:
            break
        train_step(model, optimizer, batch)
        prof.step()  # Advance the profiler schedule

# Print top time consumers
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
The output shows time spent per operation on both CPU and GPU. Common revelations:
- DataLoader __next__ eating 40% of total time → data pipeline is the bottleneck
- aten::copy_ dominating → too much CPU-GPU data transfer
- Gaps between CUDA kernels → CPU overhead between operations (Python/framework overhead)
- nccl::allReduce taking 20%+ → communication bottleneck in distributed training
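The rules of thumb above can be written down as a small checker. The sketch below is hypothetical: `diagnose` is not a PyTorch function, it expects a dictionary of operation names and time totals that you copy from the profiler table yourself, and it matches names by substring because exact operator names vary between PyTorch versions. Gaps between kernels do not show up in the table at all, so that case still needs the trace view.

```python
def diagnose(op_times_ms: dict[str, float]) -> list[str]:
    """Map per-operation time totals to the likely bottlenecks listed above."""
    total = sum(op_times_ms.values())
    if total <= 0:
        return []

    def share(substring: str) -> float:
        # Fraction of total time spent in ops whose name contains the substring.
        matched = sum(t for name, t in op_times_ms.items()
                      if substring.lower() in name.lower())
        return matched / total

    findings = []
    if share("dataloader") >= 0.40:
        findings.append("data pipeline")
    # "Dominating" is read here as: the single most expensive operation.
    top_op = max(op_times_ms, key=op_times_ms.get)
    if top_op.lower().startswith("aten::copy_"):
        findings.append("CPU-GPU transfer")
    if share("nccl") >= 0.20:
        findings.append("distributed communication")
    return findings
```

The 40% and 20% thresholds come straight from the list; treat them as starting points rather than hard limits.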
Step 3: Fix the Common Culprits
Data loading is too slow (most frequent cause):
The GPU finishes processing a batch and sits idle waiting for the next one. Fix with:
from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # Match CPU cores
    pin_memory=True,          # Faster CPU→GPU transfer
    prefetch_factor=3,        # Each worker buffers 3 batches ahead
    persistent_workers=True,  # Don't respawn workers each epoch
)
For maximum throughput, pre-process data into binary formats (WebDataset, Arrow, TFRecord) so workers spend time reading, not parsing.
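To check whether a DataLoader change actually helped, it is useful to measure how much of each step is spent waiting for data. The helper below is a sketch under stated assumptions: `data_wait_fraction` and its parameters are made-up names, `loader` is any iterable of batches, and `step_fn` is your own training step. The `clock` argument exists only so the timing source can be swapped out.

```python
import time
from typing import Callable, Iterable


def data_wait_fraction(loader: Iterable, step_fn: Callable, max_steps: int = 50,
                       clock: Callable[[], float] = time.perf_counter) -> float:
    """Fraction of wall-clock time spent blocked on the next batch."""
    wait = compute = 0.0
    iterator = iter(loader)
    for _ in range(max_steps):
        t0 = clock()
        try:
            batch = next(iterator)  # Time blocked on the data pipeline
        except StopIteration:
            break
        t1 = clock()
        step_fn(batch)              # Time spent in the training step
        t2 = clock()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total > 0 else 0.0
```

Because CUDA kernels launch asynchronously, `step_fn` should end with `torch.cuda.synchronize()` while you are measuring, otherwise GPU time leaks into the next wait interval. A result near zero means the loader keeps up; a large fraction points at the data pipeline.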
Batch size is too small:
Tiny batches underutilize GPU parallelism. Each CUDA kernel launch has fixed overhead — larger batches amortize it.
Rule of thumb: increase batch size until GPU memory is 80-90% full. If the batch size you need for convergence does not fit in memory, use gradient accumulation to reach it with smaller micro-batches:
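accumulation_steps = 4
for i, batch in enumerate(dataloader):
    # Assumes model(batch) returns the loss; scale it so the summed gradients average out
    loss = model(batch) / accumulation_steps
    loss.backward()  # Gradients accumulate across micro-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()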
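The division by accumulation_steps is what makes the accumulated gradient match the gradient of one large batch. The toy example below illustrates this in plain Python for a one-parameter model with a mean-squared-error loss. It is a sketch, not framework code: the function names are invented for illustration, and it assumes equally sized micro-batches and a loss that averages over the batch.

```python
def mse_grad(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list[float], ys: list[float],
                     accumulation_steps: int) -> float:
    """Sum micro-batch gradients, each scaled by 1/accumulation_steps."""
    micro = len(xs) // accumulation_steps
    grad = 0.0
    for i in range(accumulation_steps):
        chunk = slice(i * micro, (i + 1) * micro)
        grad += mse_grad(w, xs[chunk], ys[chunk]) / accumulation_steps
    return grad


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
full = mse_grad(0.5, xs, ys)                    # One batch of 8
accumulated = accumulated_grad(0.5, xs, ys, 4)  # Four micro-batches of 2
```

Here `full` and `accumulated` agree up to floating-point rounding. Without the scaling, the accumulated gradient would be four times too large, which acts like an unintended learning-rate increase.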
CPU-GPU synchronization stalls:
Calling .item(), .cpu(), or print(tensor) during training forces a CUDA synchronization — the CPU waits for all GPU operations to complete. This can stall the pipeline for milliseconds per call.
- Move logging to every N steps, not every step
- Use loss.detach() instead of loss.item() where possible
- Never call .cpu() on a tensor during the training loop
torch.compile not enabled (free 20-40% speedup):
PyTorch 2.0+ can fuse operations and reduce kernel launch overhead:
model = torch.compile(model, mode="reduce-overhead")
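The first few iterations are slow while the model compiles, but steady-state throughput can improve by 20-40% for transformer models. The "reduce-overhead" mode relies on CUDA graphs, so it can use extra memory and works best when input shapes stay fixed.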
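Because compilation makes the first iterations slow, a fair comparison has to drop them before comparing step times. The helper below is a minimal sketch of that: `steady_state_speedup` is an invented name, the inputs are per-step durations in seconds that you record yourself, and the default of five warm-up steps is an assumption to adjust for your model.

```python
from statistics import median


def steady_state_speedup(eager_times: list[float], compiled_times: list[float],
                         warmup: int = 5) -> float:
    """Ratio of median step time, ignoring the first `warmup` steps of each run."""
    eager = median(eager_times[warmup:])
    compiled = median(compiled_times[warmup:])
    return eager / compiled
```

A return value of 1.25 means the compiled model finishes a step in 80% of the eager time. The median is used instead of the mean so that a stray recompilation or a slow batch does not skew the result.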
Quick Reference: Utilization Targets by Workload
| Workload | Expected GPU util | If below, suspect... |
|---|---|---|
| LLM training | 85-95% | Data loading, small batch, communication |
| LLM inference (batched) | 70-90% | Low request volume, poor batching |
| Image generation | 80-90% | Preprocessing, small batch |
| Fine-tuning (LoRA) | 60-80% | Forward pass is small, optimizer dominant |
| Embedding generation | 85-95% | Batch too small, CPU bottleneck |
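For automated checks, the table can be encoded as a lookup. The sketch below copies the ranges and suspects from the table; the names `TARGETS` and `suspects` are illustrative, and the rule that only values below the lower bound get flagged is an assumption about how you would want to use it.

```python
# (low %, high %, things to suspect when utilization is below the low bound)
TARGETS = {
    "LLM training": (85, 95, ["data loading", "small batch", "communication"]),
    "LLM inference (batched)": (70, 90, ["low request volume", "poor batching"]),
    "Image generation": (80, 90, ["preprocessing", "small batch"]),
    "Fine-tuning (LoRA)": (60, 80, ["forward pass is small", "optimizer dominant"]),
    "Embedding generation": (85, 95, ["batch too small", "CPU bottleneck"]),
}


def suspects(workload: str, measured_util: float) -> list[str]:
    """Return the likely causes if utilization is below the expected range."""
    low, _high, causes = TARGETS[workload]
    return causes if measured_util < low else []
```

For example, `suspects("LLM training", 62)` returns the three causes from the first row, while `suspects("Fine-tuning (LoRA)", 65)` returns an empty list because 65% is inside the expected range.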
Monitoring in Production
For ongoing optimization (not just one-time profiling), set up lightweight continuous monitoring:
# Log GPU stats to your monitoring system every 60 seconds
import subprocess, time

QUERY = "index,utilization.gpu,memory.used,memory.total,power.draw"

while True:
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    # nvidia-smi prints one CSV line per GPU, so handle multi-GPU hosts
    for line in result.stdout.strip().splitlines():
        index, gpu_util, mem_used, mem_total, power = line.split(", ")
        # log_metrics is a placeholder: send to Prometheus, Datadog, or your logging system
        log_metrics(gpu=int(index), gpu_util=float(gpu_util), mem_used=float(mem_used),
                    mem_total=float(mem_total), power=float(power))
    time.sleep(60)
Track utilization over days, not minutes. Patterns emerge — maybe your GPUs idle every night because your training pipeline has a daily preprocessing step that blocks.
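One way to surface such daily patterns is to group the logged samples by hour of day. The function below is a sketch under assumptions: `idle_hours` is an invented name, the samples are `(datetime, utilization)` pairs you have already pulled from your metrics store, and the 20% threshold for "idle" is arbitrary.

```python
from collections import defaultdict
from datetime import datetime


def idle_hours(samples: list[tuple[datetime, float]],
               threshold: float = 20.0) -> list[int]:
    """Return hours of the day (0-23) whose mean utilization is below threshold."""
    by_hour = defaultdict(list)
    for timestamp, util in samples:
        by_hour[timestamp.hour].append(util)
    return sorted(hour for hour, values in by_hour.items()
                  if sum(values) / len(values) < threshold)
```

Feeding it a week of samples and getting back something like the early-morning hours suggests a recurring blocker, such as a nightly preprocessing job, rather than random noise.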
Maximize GPU ROI on io.net: per-second billing means optimized utilization directly reduces your bill.
