FAQ: How Do I Profile and Optimize GPU Utilization?

Many GPU workloads run at only 30-50% utilization, which means you're paying for two to three times the compute you actually use. Profiling shows where the waste is, and in many cases a few targeted fixes can push utilization above 80%. Going from 40% to 80% roughly halves the GPU time, and the bill, for the same work.
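To make the arithmetic concrete, here is a minimal sketch of the cost per useful GPU-hour. It assumes the bill scales linearly with idle time, and the hourly price is a made-up number, not a quote from any provider.

```python
def cost_per_useful_gpu_hour(hourly_price, utilization):
    """Price paid per hour of actual GPU work, given utilization in (0, 1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_price / utilization

price = 2.00  # hypothetical $/GPU-hour, for illustration only
for util in (0.40, 0.80):
    cost = cost_per_useful_gpu_hour(price, util)
    print(f"{util:.0%} utilization: ${cost:.2f} per useful GPU-hour")
```

Doubling utilization halves the effective price, which is where the "halve your bill" figure comes from.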

Here's how to find and fix the most common utilization killers.

Step 1: Measure Baseline Utilization

Before optimizing, know where you stand:

# Quick snapshot
nvidia-smi

# Continuous monitoring (log every second; timeout stops it after 5 minutes)
timeout 300 nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used,power.draw --format=csv -l 1 > gpu_log.csv

Healthy numbers to target:
- GPU utilization: 80-95% during active processing
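- Memory usage (memory.used as a share of total memory): 70-90% (unused memory is wasted money). Note that utilization.memory in nvidia-smi reports how busy the memory bus is, not how full the memory is.
- Power draw: near the board's power limit (300W for A100 80GB PCIe, 400W for A100 SXM, 700W for H100 SXM) during compute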

If GPU utilization hovers under 50%, the GPU is waiting for something — data, CPU preprocessing, network I/O, or Python itself.
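Once you have a log, a few lines of Python can turn it into a baseline. The sketch below assumes the CSV layout that nvidia-smi writes with --format=csv (a header row with units in brackets, values with unit suffixes). The sample rows are hand-written for illustration, and the 50% threshold is simply the figure used above.

```python
import csv
import io

def _number(field):
    """Strip nvidia-smi unit suffixes such as ' %', ' MiB', ' W' and return a float."""
    return float(field.strip().split()[0])

def summarize_gpu_log(csv_text, low_threshold=50.0):
    """Return sample count, mean GPU utilization, and the share of samples below low_threshold."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [h.strip() for h in next(reader)]
    # Match by prefix because the header carries the unit, e.g. "utilization.gpu [%]".
    util_col = next(i for i, h in enumerate(header) if h.startswith("utilization.gpu"))
    utils = [_number(row[util_col]) for row in reader if row]
    mean_util = sum(utils) / len(utils)
    low_share = sum(u < low_threshold for u in utils) / len(utils)
    return {"samples": len(utils), "mean_util": mean_util, "low_share": low_share}

# Hand-written sample rows, not real measurements.
sample = """timestamp, utilization.gpu [%], utilization.memory [%], memory.used [MiB], power.draw [W]
2024/01/01 12:00:00.000, 35 %, 10 %, 20480 MiB, 140.00 W
2024/01/01 12:00:01.000, 90 %, 40 %, 20480 MiB, 290.00 W
2024/01/01 12:00:02.000, 40 %, 12 %, 20480 MiB, 150.00 W
2024/01/01 12:00:03.000, 95 %, 45 %, 20480 MiB, 300.00 W
"""
print(summarize_gpu_log(sample))
```

For a real log, pass open("gpu_log.csv").read() instead of the sample string. The share of low samples is often more telling than the mean, because a job that alternates between 95% and 0% has a very different problem from one that sits at a steady 50%.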

Step 2: Find the Bottleneck

Use PyTorch Profiler to see exactly what's happening:

import torch
from torch.profiler import profile, ProfilerActivity, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./prof"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(train_loader):
        if step >= 10:
            break
        train_step(model, optimizer, batch)
        prof.step()

# Print top time consumers
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

The output shows time spent per operation on both CPU and GPU. Common revelations:

- DataLoader __next__ eating 40% of total time → data pipeline is the bottleneck
- aten::copy_ dominating → too much CPU-GPU data transfer
- Gaps between CUDA kernels → CPU overhead between operations (Python/framework overhead)
- nccl::allReduce taking 20%+ → communication bottleneck in distributed training
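If you want a quick check of the first symptom without opening a trace, you can time the loader and the step separately. This is a rough sketch, not a replacement for the profiler: split_step_time, slow_loader and fake_step are illustrative names, and the stand-ins use time.sleep in place of real data loading and training.

```python
import time

def split_step_time(loader, step_fn, sync_fn=None, max_steps=50):
    """Return (seconds spent waiting on the loader, seconds spent in step_fn)."""
    wait_s = 0.0
    compute_s = 0.0
    it = iter(loader)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)
        if sync_fn is not None:
            # For GPU work, pass torch.cuda.synchronize so queued kernels are counted.
            sync_fn()
        t2 = time.perf_counter()
        wait_s += t1 - t0
        compute_s += t2 - t1
    return wait_s, compute_s

# Stand-ins for a real DataLoader and training step.
def slow_loader(n=5, delay=0.02):
    for i in range(n):
        time.sleep(delay)  # pretend to read and decode a batch
        yield i

def fake_step(batch):
    time.sleep(0.005)  # pretend to run forward and backward

wait_s, compute_s = split_step_time(slow_loader(), fake_step)
print(f"loader wait: {wait_s / (wait_s + compute_s):.0%} of step time")
```

If the loader share is large, the fixes under "Data loading is too slow" are the place to start. Without a synchronize call, GPU work is only queued when step_fn returns, so the compute number will look smaller than it really is.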

Step 3: Fix the Common Culprits

Data loading is too slow (most frequent cause):

The GPU finishes processing a batch and sits idle waiting for the next one. Fix with:

from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # Start near the number of available CPU cores, then tune
    pin_memory=True,          # Faster CPU→GPU transfer
    prefetch_factor=3,        # Each worker keeps 3 batches ready (3 x 8 = 24 in flight)
    persistent_workers=True,  # Don't respawn workers each epoch
)

For maximum throughput, pre-process data into binary formats (WebDataset, Arrow, TFRecord) so workers spend time reading, not parsing.
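The right num_workers depends on how many cores your process can actually use, which may be fewer than the machine has (for example under taskset or a cpuset). The helper below is one possible starting heuristic, not a PyTorch API: suggest_num_workers and its reserved_cores parameter are assumptions for illustration, and you should still benchmark a few values.

```python
import os

def usable_cpu_count():
    """Number of CPU cores this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

def suggest_num_workers(gpus_on_node=1, reserved_cores=1):
    """Split usable cores across the GPUs on this node, keeping a few for the main process."""
    cores = usable_cpu_count()
    per_gpu = (cores - reserved_cores) // max(gpus_on_node, 1)
    return max(per_gpu, 1)

print("usable cores:", usable_cpu_count())
print("suggested num_workers per GPU:", suggest_num_workers(gpus_on_node=1))
```

Dividing by the GPU count matters when each GPU has its own training process and its own DataLoader, since all of them compete for the same cores.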

Batch size is too small:

Tiny batches underutilize GPU parallelism. Each CUDA kernel launch has fixed overhead — larger batches amortize it.

Rule of thumb: increase batch size until GPU memory is 80-90% full. If the batch size you want for convergence is larger than what fits in memory, use gradient accumulation to build it up from several smaller batches:

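accumulation_steps = 4  # effective batch = batch_size * accumulation_steps
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    # Assumes model(batch) returns a scalar loss; scale it so the
    # accumulated gradient matches one large batch
    loss = model(batch) / accumulation_steps
    loss.backward()  # gradients add up across the small batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()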
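The bookkeeping behind gradient accumulation is simple multiplication, sketched here. The world_size parameter (number of data-parallel GPUs) is an assumption for the distributed case; leave it at 1 for a single GPU.

```python
def effective_batch_size(micro_batch, accumulation_steps, world_size=1):
    """Samples contributing to each optimizer step."""
    return micro_batch * accumulation_steps * world_size

def accumulation_steps_for(target_batch, micro_batch, world_size=1):
    """Smallest number of accumulation steps that reaches at least target_batch."""
    per_step = micro_batch * world_size
    return -(-target_batch // per_step)  # ceiling division

print(effective_batch_size(micro_batch=64, accumulation_steps=4))
print(accumulation_steps_for(target_batch=512, micro_batch=64, world_size=2))
```

With batch_size=64 and accumulation_steps=4 as in the loop above, each optimizer step sees 256 samples on one GPU.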

CPU-GPU synchronization stalls:

Calling .item(), .cpu(), or print(tensor) during training forces a CUDA synchronization — the CPU waits for all GPU operations to complete. This can stall the pipeline for milliseconds per call.

- Move logging to every N steps, not every step
- Keep a running total with loss.detach() on the GPU and call .item() only when you actually log
- Avoid calling .cpu() on tensors inside the training loop
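One way to apply these rules is a small tracker that accumulates the loss and converts it to a Python number only every N steps. This is a sketch under assumptions: RunningLoss is an illustrative name, and the demo uses plain floats so it runs anywhere. With PyTorch you would pass loss.detach(), in which case the addition stays on the GPU and float() is the only synchronization point.

```python
class RunningLoss:
    """Accumulate losses and report their mean once every log_every updates."""

    def __init__(self, log_every=100):
        self.log_every = log_every
        self.total = 0.0
        self.count = 0

    def update(self, loss):
        # With a tensor, this addition is queued on the GPU and does not block the CPU.
        self.total = self.total + loss
        self.count += 1
        if self.count % self.log_every != 0:
            return None
        # float() is the single point where the CPU waits for the GPU.
        mean = float(self.total) / self.log_every
        self.total = 0.0
        return mean

tracker = RunningLoss(log_every=2)
for step, loss in enumerate([1.0, 3.0, 5.0, 7.0], start=1):
    mean = tracker.update(loss)
    if mean is not None:
        print(f"step {step}: mean loss {mean}")
```

The trade-off is that you see the loss less often, which is usually fine for training runs that last thousands of steps.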

torch.compile not enabled (often a 20-40% speedup for one line of code):

PyTorch 2.0+ can fuse operations and reduce kernel launch overhead:

model = torch.compile(model, mode="reduce-overhead")

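The first few iterations are slow (compilation), but steady-state throughput often improves by 20-40% for transformer models. The actual gain depends on the model and hardware, and the "reduce-overhead" mode relies on CUDA graphs, so it works best when input shapes stay the same from step to step.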
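Because of that slow start, measure the speedup only after warmup. The helper below is a generic timing sketch: steady_state_seconds is an illustrative name, and fake_compiled_step simulates a slow first call with time.sleep rather than running a real compiled model.

```python
import time

def steady_state_seconds(step_fn, warmup=3, iters=10, sync_fn=None):
    """Average seconds per call of step_fn, excluding the first `warmup` calls."""
    for _ in range(warmup):
        step_fn()  # compilation and caching happen here
    if sync_fn is not None:
        sync_fn()  # e.g. torch.cuda.synchronize, so warmup work is finished
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    if sync_fn is not None:
        sync_fn()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / iters

# Stand-in for a compiled training step: the first call is much slower.
calls = {"n": 0}
def fake_compiled_step():
    calls["n"] += 1
    time.sleep(0.05 if calls["n"] == 1 else 0.002)

per_iter = steady_state_seconds(fake_compiled_step, warmup=2, iters=5)
print(f"steady-state: {per_iter * 1000:.1f} ms per step")
```

Run it once with the eager model and once with the compiled one, using the same batch, and compare the two numbers.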

Quick Reference: Utilization Targets by Workload

Workload	Expected GPU util	If below, suspect...
LLM training	85-95%	Data loading, small batch, communication
LLM inference (batched)	70-90%	Low request volume, poor batching
Image generation	80-90%	Preprocessing, small batch
Fine-tuning (LoRA)	60-80%	Small batch, small low-rank adapter kernels
Embedding generation	85-95%	Batch too small, CPU bottleneck
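If you want to alert on these targets, the ranges can live in a small lookup. The sketch below copies the expected ranges from the table; UTIL_TARGETS and below_target are illustrative names, and the ranges are rules of thumb rather than guarantees.

```python
# Expected GPU utilization ranges (percent), taken from the table above.
UTIL_TARGETS = {
    "LLM training": (85, 95),
    "LLM inference (batched)": (70, 90),
    "Image generation": (80, 90),
    "Fine-tuning (LoRA)": (60, 80),
    "Embedding generation": (85, 95),
}

def below_target(workload, measured_util):
    """True if measured utilization is under the low end of the expected range."""
    low, _high = UTIL_TARGETS[workload]
    return measured_util < low

for workload in ("LLM training", "Fine-tuning (LoRA)"):
    print(workload, "below target at 70%:", below_target(workload, 70))
```

The same 70% reading is a warning sign for LLM training but normal for LoRA fine-tuning, which is why a single global threshold tends to mislead.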

Monitoring in Production

For ongoing optimization (not just one-time profiling), set up lightweight continuous monitoring:

# Log GPU stats to your monitoring system every 60 seconds
import subprocess, time

QUERY = "index,utilization.gpu,memory.used,memory.total,power.draw"

while True:
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True
    )
    # nvidia-smi prints one line per GPU, so parse each line separately
    for line in result.stdout.strip().splitlines():
        index, gpu_util, mem_used, mem_total, power = line.split(", ")
        # log_metrics is a placeholder: send to Prometheus, Datadog, or your logging system
        log_metrics(gpu=int(index), gpu_util=float(gpu_util), mem_used=float(mem_used),
                    mem_total=float(mem_total), power=float(power))
    time.sleep(60)

Track utilization over days, not minutes. Patterns emerge — maybe your GPUs idle every night because your training pipeline has a daily preprocessing step that blocks.
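To surface that kind of pattern, group the samples by hour of day and look for hours that are consistently quiet. This is a minimal sketch: low_utilization_hours is an illustrative name, the 20% threshold is an arbitrary assumption, and the data is synthetic (two days with a dip at 02:00 and 03:00).

```python
from collections import defaultdict
from datetime import datetime

def low_utilization_hours(samples, threshold=20.0):
    """samples: iterable of (datetime, gpu_util_percent).
    Return {hour_of_day: mean_util} for hours whose mean is below threshold."""
    buckets = defaultdict(list)
    for ts, util in samples:
        buckets[ts.hour].append(util)
    means = {hour: sum(v) / len(v) for hour, v in buckets.items()}
    return {hour: m for hour, m in sorted(means.items()) if m < threshold}

# Synthetic data: busy all day except a nightly dip.
samples = [
    (datetime(2024, 1, day, hour), 5.0 if hour in (2, 3) else 90.0)
    for day in (1, 2)
    for hour in range(24)
]
print(low_utilization_hours(samples))
```

In practice you would feed it the values your monitoring loop has been logging. A dip that shows up at the same hours every day usually points to a scheduled job rather than a problem with the model.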

Maximize GPU ROI on io.net: per-second billing means optimized utilization directly reduces your bill.