Most GPU clusters run at 30-50% utilization. That means half or more of your compute budget pays for idle silicon. For a team spending $50,000 per month on GPU cloud, that is $25,000-$35,000 wasted every month.
The root causes are predictable: batch jobs that reserve GPUs but do not saturate them, inference servers sized for peak traffic that mostly run at trough, training scripts with I/O bottlenecks that starve the GPU of data, and development clusters that nobody remembers to shut down over weekends.
Fixing GPU utilization delivers immediate, measurable cost savings. On io.net, where H100 80GB GPUs cost approximately $2.49/hr, every percentage point of improved utilization translates directly into lower monthly bills. Going from 40% to 80% utilization effectively halves your cost per useful computation.
Measuring GPU Utilization Correctly
Understanding nvidia-smi Metrics
The nvidia-smi utility reports several metrics that are frequently misunderstood.
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw --format=csv -l 5
| Metric | What It Measures | What It Misses |
|---|---|---|
| GPU Utilization (%) | Time with at least one kernel executing | Kernel efficiency, SM occupancy |
| Memory Utilization (%) | Time memory controller is active | Actual allocation vs reservation |
| Memory Used | VRAM currently allocated | Whether allocated memory is actively used |
| Power Draw | Current consumption | Efficiency of the computation |
A GPU at 100% utilization could be running inefficient kernels on a fraction of its streaming multiprocessors. A GPU at 60% utilization with optimized kernels might do more useful work.
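For logging or alerting, the same query can be parsed in a script. The sketch below assumes the command is run with `--format=csv,noheader,nounits`, so each line holds bare numbers in query order. `GpuSample` and `parse_nvidia_smi_csv` are illustrative names, not part of any NVIDIA tooling, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    gpu_util_pct: float
    mem_util_pct: float
    mem_used_mib: float
    mem_total_mib: float
    temperature_c: float
    power_w: float

    @property
    def mem_used_fraction(self) -> float:
        """Share of VRAM currently allocated (not necessarily in active use)."""
        return self.mem_used_mib / self.mem_total_mib


def parse_nvidia_smi_csv(text: str) -> list[GpuSample]:
    """Parse one line per GPU, with fields in the order of the --query-gpu list."""
    samples = []
    for line in text.strip().splitlines():
        fields = [float(part.strip()) for part in line.split(",")]
        samples.append(GpuSample(*fields))
    return samples


# Made-up output for two GPUs.
sample_output = """\
97, 41, 61440, 81559, 64, 612.30
12, 3, 20480, 81559, 41, 118.75
"""
for index, sample in enumerate(parse_nvidia_smi_csv(sample_output)):
    print(f"GPU {index}: util={sample.gpu_util_pct:.0f}% vram={sample.mem_used_fraction:.0%}")
```

Fields that the driver cannot report (shown as `[N/A]`) would make `float()` raise here, so a production parser would need to handle them explicitly.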
Better Metrics for AI Workloads
| Workload | Correct Metric | Good | Excellent |
|---|---|---|---|
| LLM Training | Tokens/sec/GPU | >3,000 (70B, BF16) | >4,500 |
| LLM Inference | Requests/sec or tokens/sec | >50 tok/s per user | >100 |
| Image Training | Images/sec/GPU | >500 (ResNet-50) | >1,000 |
| Scientific | Achieved TFLOPS | >50% of peak | >70% |
Setting Up Monitoring
Deploy the NVIDIA DCGM Exporter for Prometheus:
# Install DCGM exporter
docker run -d --gpus all --name dcgm-exporter \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
# Scrape from Prometheus
# Add to prometheus.yml:
# - job_name: 'dcgm'
#   static_configs:
#     - targets: ['localhost:9400']
Build a Grafana dashboard showing:
- GPU utilization over time (per GPU)
- Memory usage vs. capacity
- Power draw vs. TDP
- Throughput metrics (tokens/sec, images/sec)
- Estimated hourly cost (active GPUs x rate)
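Before a dashboard exists, the exporter's output can be checked by hand. The sketch below parses the Prometheus text format that the exporter serves on port 9400 and picks out the `DCGM_FI_DEV_GPU_UTIL` gauge. The sample text is made up, the exact label set varies by exporter version, and the 5% idle threshold and function names are assumptions for illustration.

```python
import re

# Matches lines such as: DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-..."} 92
METRIC_LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
GPU_LABEL = re.compile(r'(?:^|,)gpu="([^"]+)"')


def gpu_utilization(metrics_text):
    """Map GPU index (as a string) to its DCGM_FI_DEV_GPU_UTIL value in percent."""
    result = {}
    for line in metrics_text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match is None or match.group(1) != "DCGM_FI_DEV_GPU_UTIL":
            continue  # comments, blank lines, and other metrics
        label = GPU_LABEL.search(match.group(2))
        if label:
            result[label.group(1)] = float(match.group(3))
    return result


def hourly_cost(utilization, rate=2.49, idle_below_pct=5.0):
    """Return (total hourly cost, hourly cost of GPUs below the idle threshold)."""
    idle_gpus = sum(1 for value in utilization.values() if value < idle_below_pct)
    return len(utilization) * rate, idle_gpus * rate


# Made-up scrape output for two GPUs.
sample_metrics = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa",device="nvidia0"} 92
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb",device="nvidia1"} 3
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-aaaa",device="nvidia0"} 61440
"""
utilization = gpu_utilization(sample_metrics)
total, idle = hourly_cost(utilization)
print(f"{utilization} total=${total:.2f}/hr idle=${idle:.2f}/hr")
```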
Common Problems and Fixes
Problem 1: Data Loading Bottleneck
Symptom: GPU utilization oscillates between 0% and 100%.
Cause: Data pipeline cannot feed the GPU fast enough.
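# Fix: Optimize PyTorch DataLoader
from torch.utils.data import DataLoader

# `dataset` is your existing torch.utils.data.Dataset instance.
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # Start near the number of available CPU cores, then tune
    pin_memory=True,          # Page-locked host memory for faster host-to-GPU copies
    prefetch_factor=4,        # Batches loaded in advance per worker
    persistent_workers=True,  # Keep workers alive between epochs
)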
Additional fixes:
- Use NVMe SSDs (not HDD) for training data
- Pre-tokenize data before training
- Use NVIDIA DALI for GPU-accelerated preprocessing
- Memory-map large datasets
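To confirm that data loading is the bottleneck before changing anything, it helps to measure how much of each step is spent waiting for the next batch. The helper below is a minimal, framework-agnostic sketch: `data_wait_fraction` and `slow_loader` are hypothetical names, and the sleeps stand in for real I/O and compute.

```python
import time


def data_wait_fraction(batches, train_step):
    """Return the share of wall-clock time spent waiting for the next batch."""
    wait = 0.0
    compute = 0.0
    iterator = iter(batches)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)  # blocks while the pipeline produces a batch
        except StopIteration:
            break
        fetched = time.perf_counter()
        train_step(batch)
        finished = time.perf_counter()
        wait += fetched - start
        compute += finished - fetched
    total = wait + compute
    return wait / total if total > 0 else 0.0


def slow_loader(num_batches, delay_seconds):
    """Simulated pipeline: each batch takes delay_seconds to produce."""
    for index in range(num_batches):
        time.sleep(delay_seconds)
        yield index


fraction = data_wait_fraction(slow_loader(5, 0.05), lambda batch: time.sleep(0.005))
print(f"{fraction:.0%} of step time spent waiting for data")
```

With a real PyTorch loop, CUDA kernels launch asynchronously, so `train_step` should end with `torch.cuda.synchronize()` for the compute time to be meaningful. A wait fraction that stays high after tuning the DataLoader points at storage or preprocessing rather than worker count.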
Problem 2: Inference Over-Provisioning
Symptom: GPU utilization consistently below 30%.
Cause: Servers sized for peak, running at trough.
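# Fix: Kubernetes HPA auto-scaling
# Requires a custom metrics adapter (e.g. Prometheus Adapter) that exposes
# `gpu_utilization` as a per-pod metric. The names under metadata and
# scaleTargetRef are placeholders for your own deployment.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa          # placeholder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server     # placeholder
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"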
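To reason about how this autoscaler will respond, the core scaling rule documented for the HPA can be reproduced in a few lines: desired replicas are the current count scaled by the ratio of observed to target metric, rounded up. The sketch below is a simplification that leaves out the controller's tolerance band and stabilization windows; the function name is illustrative.

```python
import math


def desired_replicas(current_replicas, current_avg, target_avg=70.0,
                     min_replicas=1, max_replicas=8):
    """Core HPA rule: ceil(current * observed / target), clamped to the bounds."""
    raw = math.ceil(current_replicas * current_avg / target_avg)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, 35))  # traffic trough: 4 replicas at 35% scale down to 2
print(desired_replicas(2, 95))  # traffic peak: 2 replicas at 95% scale up to 3
print(desired_replicas(8, 90))  # already at the maximum, stays at 8
```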
Problem 3: Small Batch Sizes
Symptom: GPU memory only 40-60% used during training.
Cause: Batch size too small.
Fix: Increase the per-GPU batch size until memory is 85-95% utilized. If you already use gradient accumulation, reduce the number of accumulation steps in proportion so the effective batch size, and therefore convergence behavior, stays the same.
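The bookkeeping here is simple: effective batch size equals per-GPU batch size times GPU count times accumulation steps. The sketch below computes the accumulation steps needed to hold the effective batch constant as the per-GPU batch grows; the function name and the example numbers are illustrative.

```python
def accumulation_steps(effective_batch, per_gpu_batch, num_gpus):
    """Steps needed so per_gpu_batch * num_gpus * steps == effective_batch."""
    samples_per_step = per_gpu_batch * num_gpus
    if effective_batch % samples_per_step != 0:
        raise ValueError(
            f"effective batch {effective_batch} is not a multiple of "
            f"{samples_per_step} samples per step"
        )
    return effective_batch // samples_per_step


# Same effective batch of 512 on 8 GPUs, before and after raising the per-GPU batch.
print(accumulation_steps(512, per_gpu_batch=8, num_gpus=8))   # 8 steps
print(accumulation_steps(512, per_gpu_batch=32, num_gpus=8))  # 2 steps
```

Fewer accumulation steps with larger micro-batches means fewer, larger kernel launches per optimizer update, which is where the utilization gain comes from.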
Problem 4: Forgotten Development Clusters
Symptom: Clusters running 24/7 with <10% utilization.
Cause: Nobody shut them down after the experiment.
Fix: Implement automatic shutdown for idle clusters:
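# Auto-terminate idle io.net clusters
# `client` stands in for whatever cluster-management client you use; the
# method and attribute names below are illustrative, so check them against
# the actual SDK before relying on this.
import time

def monitor_and_cleanup(client, idle_threshold_minutes=60):
    """Terminate clusters that have been below 5% GPU utilization for too long."""
    for cluster in client.list_clusters():
        if cluster.gpu_utilization < 5:
            # last_active is assumed to be a Unix timestamp in seconds
            idle_seconds = time.time() - cluster.last_active
            if idle_seconds > idle_threshold_minutes * 60:
                print(f"Terminating idle cluster: {cluster.name}")
                client.terminate_cluster(cluster.id)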
Cost Impact of Utilization Improvements
| Utilization Change | Effective Capacity | Monthly Savings (8x H100) |
|---|---|---|
| 30% -> 60% | 2x | $7,171 (half the GPUs) |
| 40% -> 70% | 1.75x | $6,147 |
| 50% -> 80% | 1.6x | $5,378 |
Savings assume $2.49/hr per H100, 720 hours per month, and a GPU count reduced to deliver the same amount of useful work.
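The arithmetic behind a table like this can be reproduced for your own cluster. The sketch below assumes a 720-hour month, the $2.49/hr H100 rate quoted in this article, and that the workload stays constant so the GPU count can shrink in proportion to the utilization gain; the function name is illustrative.

```python
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours


def monthly_savings(current_util, target_util, num_gpus=8, rate=2.49,
                    hours=HOURS_PER_MONTH):
    """Savings from shrinking the fleet to do the same work at higher utilization."""
    baseline_cost = num_gpus * rate * hours
    fraction_of_gpus_needed = current_util / target_util
    return baseline_cost * (1 - fraction_of_gpus_needed)


savings = monthly_savings(0.30, 0.60)
print(f"30% -> 60%: ${savings:,.0f} per month")  # $7,171
```

In practice GPUs come in whole units, so the realized savings step down to the nearest achievable cluster size.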
Quick Wins Ranked by Impact
| Optimization | Effort | Utilization Gain | Cost Savings |
|---|---|---|---|
| Shut down idle dev instances | 5 minutes | Eliminates waste | 10-30% |
| Enable persistent DataLoader workers | 10 minutes | +10-20% training | 5-15% |
| Switch to vLLM from HF generate | 1 hour | +30-50% inference | 15-30% |
| Auto-scale inference replicas | 1 day | Match traffic shape | 20-40% |
| Quantize inference model (INT4) | 2 hours | 2-4x throughput/GPU | 50-75% |
Advanced: Model FLOPS Utilization (MFU)
MFU measures what fraction of theoretical peak FLOPS is used for model computation. It is the gold standard for training efficiency.
MFU = (tokens/sec x FLOPs_per_token) / (num_GPUs x peak_FLOPS_per_GPU)
For Llama 70B on 8x H100:
tokens/sec = 4,000 (aggregate across all 8 GPUs)
FLOPs_per_token = 6 x 70e9 = 420 GFLOP (forward + backward)
peak per GPU = 989 TFLOPS (BF16, dense; the 1,979 TFLOPS datasheet figure assumes structured sparsity)
MFU = (4000 x 420e9) / (8 x 989e12) = 0.212 = 21.2%
That sits just above the bottom band, so there is room to improve. In practice:
- MFU of 30-45% is good for multi-node training
- MFU above 50% is excellent (requires careful optimization)
- MFU of 10-20% indicates significant optimization opportunity
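The MFU formula is easy to script so it can be tracked alongside tokens/sec. The sketch below uses the common 6 x parameters approximation for FLOPs per token, which ignores attention FLOPs, and assumes the H100's dense BF16 peak of roughly 989 TFLOPS (the 1,979 TFLOPS datasheet figure includes structured sparsity, which standard training does not use). Function names are illustrative.

```python
H100_BF16_DENSE_FLOPS = 989e12  # assumption: dense peak, without sparsity


def mfu(tokens_per_sec, num_params, num_gpus, peak_flops_per_gpu):
    """Model FLOPS utilization; tokens_per_sec is the aggregate across all GPUs."""
    flops_per_token = 6 * num_params  # forward + backward approximation
    achieved_flops = tokens_per_sec * flops_per_token
    return achieved_flops / (num_gpus * peak_flops_per_gpu)


def tokens_per_sec_for(target_mfu, num_params, num_gpus, peak_flops_per_gpu):
    """Aggregate throughput needed to reach a target MFU."""
    return target_mfu * num_gpus * peak_flops_per_gpu / (6 * num_params)


value = mfu(4000, 70e9, 8, H100_BF16_DENSE_FLOPS)
print(f"MFU = {value:.1%}")  # MFU = 21.2%

needed = tokens_per_sec_for(0.40, 70e9, 8, H100_BF16_DENSE_FLOPS)
print(f"40% MFU needs about {needed:,.0f} tokens/sec across 8 GPUs")
```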
Improving MFU
- Increase batch size (more compute per communication)
- Use Flash Attention (reduces memory overhead)
- Overlap communication with computation (gradient all-reduce during backward)
- Use BF16 or FP8 (match tensor core precision)
- Optimize activation checkpointing (trade compute for memory efficiently)
Operational Best Practices
Weekly Utilization Review Checklist
- Identify GPU instances below 30% average utilization
- Find zombie clusters that nobody is using
- Check inference auto-scaling effectiveness
- Review training efficiency (tokens/sec vs. expected)
- Calculate total cost waste from underutilization
- Set action items for next week
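The first and fifth checklist items can be automated once utilization samples are exported from monitoring. The sketch below flags instances under a 30% average and estimates what their idle time costs; the instance names and samples are hypothetical, and the rate is the $2.49/hr H100 price quoted in this article.

```python
from statistics import mean

H100_RATE_USD = 2.49  # per GPU-hour
HOURS_PER_WEEK = 168


def flag_underutilized(samples_by_instance, threshold_pct=30.0):
    """Return (name, average_pct) pairs below the threshold, worst first."""
    flagged = []
    for name, samples in samples_by_instance.items():
        if not samples:
            continue  # no data: investigate separately rather than guess
        average = mean(samples)
        if average < threshold_pct:
            flagged.append((name, average))
    return sorted(flagged, key=lambda item: item[1])


def weekly_idle_cost(average_pct, num_gpus, rate=H100_RATE_USD):
    """Spend attributable to the idle share of the week."""
    return num_gpus * rate * HOURS_PER_WEEK * (1 - average_pct / 100)


# Hypothetical utilization samples (percent) for three single-GPU instances.
week = {
    "train-a": [88, 91, 79],
    "dev-b": [4, 0, 9],
    "infer-c": [35, 22, 27],
}
for name, average in flag_underutilized(week):
    cost = weekly_idle_cost(average, num_gpus=1)
    print(f"{name}: {average:.1f}% average, about ${cost:,.0f} idle spend this week")
```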

GPU Budget Allocation
Monthly Budget: 10,000 H100-hours ($24,900 on io.net)
- Production inference: 5,000 hrs (target 75% util)
- Training: 3,000 hrs (target 85% util)
- Dev/experimentation: 1,500 hrs (target 50% util)
- Burst buffer: 500 hrs
Frequently Asked Questions
What is a good GPU utilization target?
Training: 75-90%. Inference with auto-scaling: 60-80% average. Below 50% sustained means over-provisioned or bottlenecked.
How do I know if data loading is the problem?
Watch nvidia-smi for oscillation between 0% and 100%. If GPU drops to 0% regularly, data loading is starving it. Also check CPU utilization.
Does higher utilization damage GPUs?
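No. Data center GPUs are designed to run under sustained full load, and thermal throttling protects the hardware automatically. Higher utilization is better as long as it reflects useful work rather than inefficient kernels, so confirm it against throughput.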
How often should I check utilization?
Continuously via monitoring for production. Weekly aggregate review for optimization. Immediately after configuration changes.
What monitoring tools does io.net provide?
io.net's dashboard shows real-time GPU utilization, memory usage, and cluster status. For deeper monitoring, deploy Prometheus + Grafana on your cluster.
What is the fastest way to improve utilization?
Shut down idle clusters. This takes 5 minutes and can save 10-30% of your monthly GPU spend immediately.
Conclusion
GPU utilization optimization is the highest-ROI infrastructure investment most AI teams can make. The techniques are straightforward (monitoring, right-sizing, auto-scaling, eliminating bottlenecks), and the savings are immediate.
On io.net, every utilization improvement translates directly to lower costs. No wasted reservations, no annual commitments, no sunk hardware costs. You pay for what you use, period.
Optimize your GPU spend on io.net. Sign up and start monitoring utilization today.