Most GPU clusters run at 30-50% utilization. That means half or more of your compute budget pays for idle silicon. For a team spending $50,000 per month on GPU cloud, that is $25,000-$35,000 wasted every month.

The root causes are predictable: batch jobs that reserve GPUs but do not saturate them, inference servers sized for peak traffic that mostly run at trough, training scripts with I/O bottlenecks that starve the GPU of data, and development clusters that nobody remembers to shut down over weekends.

Fixing GPU utilization delivers immediate, measurable cost savings. On io.net, where H100 80GB GPUs cost approximately $2.49/hr, every percentage point of improved utilization translates directly into lower monthly bills. Going from 40% to 80% utilization effectively halves your cost per useful computation.
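
The halving claim follows from dividing the hourly rate by utilization. A minimal sketch, assuming "utilization" means the fraction of paid GPU time spent on useful work (the function name is illustrative):

```python
def cost_per_useful_gpu_hour(hourly_rate: float, utilization: float) -> float:
    """Effective price of one hour of useful GPU work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


H100_RATE = 2.49  # USD per GPU-hour, the approximate price quoted above

if __name__ == "__main__":
    for util in (0.40, 0.80):
        cost = cost_per_useful_gpu_hour(H100_RATE, util)
        print(f"{util:.0%} utilization: ${cost:.2f} per useful GPU-hour")
```

Doubling utilization halves the effective price because the same bill is spread over twice as much useful work.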

Measuring GPU Utilization Correctly

Understanding nvidia-smi Metrics

The nvidia-smi utility reports several metrics that are frequently misunderstood.

nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw --format=csv -l 5

| Metric | What It Measures | What It Misses |
|---|---|---|
| GPU Utilization (%) | Time with at least one kernel executing | Kernel efficiency, SM occupancy |
| Memory Utilization (%) | Time memory controller is active | Actual allocation vs reservation |
| Memory Used | VRAM currently allocated | Whether allocated memory is actively used |
| Power Draw | Current consumption | Efficiency of the computation |

A GPU at 100% utilization could be running inefficient kernels on a fraction of its streaming multiprocessors. A GPU at 60% utilization with optimized kernels might do more useful work.
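
Because the raw counters can mislead on their own, it helps to log them next to a throughput metric. The sketch below parses nvidia-smi output into dictionaries. It assumes the `csv,noheader,nounits` format variant (rather than the plain `csv` shown above) so that every field is a bare number; the function names are illustrative, and fields reported as `[N/A]` are not handled.

```python
import subprocess

# Subset of the query fields used in the command above
FIELDS = ["utilization.gpu", "utilization.memory", "memory.used", "memory.total"]


def parse_nvidia_smi_csv(text: str) -> list[dict[str, float]]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [float(v) for v in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def sample_gpus() -> list[dict[str, float]]:
    """Run nvidia-smi once and parse the result (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_nvidia_smi_csv(result.stdout)
```

Calling `sample_gpus()` on a GPU host returns one dictionary per device, which can be written to the same log as tokens/sec or images/sec.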

Better Metrics for AI Workloads

| Workload | Correct Metric | Good | Excellent |
|---|---|---|---|
| LLM Training | Tokens/sec/GPU | >3,000 (70B, BF16) | >4,500 |
| LLM Inference | Requests/sec or tokens/sec | >50 tok/s per user | >100 |
| Image Training | Images/sec/GPU | >500 (ResNet-50) | >1,000 |
| Scientific | Achieved TFLOPS | >50% of peak | >70% |
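
Throughput figures like these come from counting work items over wall-clock time and dividing by GPU count. A minimal sketch, assuming you can count tokens or images per step yourself; `ThroughputMeter` is an illustrative helper, not part of any library:

```python
import time
from typing import Callable


class ThroughputMeter:
    """Tracks items processed per second per GPU since construction."""

    def __init__(self, num_gpus: int, clock: Callable[[], float] = time.perf_counter):
        if num_gpus <= 0:
            raise ValueError("num_gpus must be positive")
        self.num_gpus = num_gpus
        self._clock = clock
        self._start = clock()
        self._items = 0

    def update(self, n_items: int) -> None:
        """Record n_items (tokens, images, requests) finished across all GPUs."""
        self._items += n_items

    def per_gpu_rate(self) -> float:
        """Items per second per GPU since the meter was created."""
        elapsed = self._clock() - self._start
        return self._items / elapsed / self.num_gpus if elapsed > 0 else 0.0
```

Call `update()` once per training step with the global batch's token or image count, then log `per_gpu_rate()` periodically. The clock is injectable so the helper can be tested without real waiting.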

Setting Up Monitoring

Deploy the NVIDIA DCGM Exporter for Prometheus:

# Install DCGM exporter
docker run -d --gpus all --name dcgm-exporter \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04

# Scrape from Prometheus
# Add to prometheus.yml (under scrape_configs):
#   - job_name: 'dcgm'
#     static_configs:
#       - targets: ['localhost:9400']

Build a Grafana dashboard showing:

- GPU utilization over time (per GPU)
- Memory usage vs. capacity
- Power draw vs. TDP
- Throughput metrics (tokens/sec, images/sec)
- Estimated hourly cost (active GPUs x rate)

Common Problems and Fixes

Problem 1: Data Loading Bottleneck

Symptom: GPU utilization oscillates between 0% and 100%.
Cause: Data pipeline cannot feed the GPU fast enough.

# Fix: Optimize PyTorch DataLoader
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # Match CPU cores
    pin_memory=True,          # Faster GPU transfer
    prefetch_factor=4,        # Prefetch ahead
    persistent_workers=True,  # Avoid worker respawn
)

Additional fixes:

- Use NVMe SSDs (not HDD) for training data
- Pre-tokenize data before training
- Use NVIDIA DALI for GPU-accelerated preprocessing
- Memory-map large datasets
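
Before applying these fixes, it is worth confirming that the training loop really is waiting on data. One rough way is to time how long each `next()` call on the loader blocks. The sketch below is framework-agnostic; `DataWaitTimer` is an illustrative name, and it measures only host-side waiting, so treat the result as an approximation.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


class DataWaitTimer:
    """Wraps an iterable and accumulates the time spent blocked in next()."""

    def __init__(self, iterable: Iterable[T],
                 clock: Callable[[], float] = time.perf_counter):
        self._iterable = iterable
        self._clock = clock
        self.wait_seconds = 0.0

    def __iter__(self) -> Iterator[T]:
        it = iter(self._iterable)
        while True:
            start = self._clock()
            try:
                batch = next(it)  # blocks while the pipeline produces a batch
            except StopIteration:
                return
            self.wait_seconds += self._clock() - start
            yield batch


if __name__ == "__main__":
    def slow_batches(n: int, delay: float):
        """Stand-in for a slow data pipeline."""
        for i in range(n):
            time.sleep(delay)
            yield i

    timed = DataWaitTimer(slow_batches(5, 0.01))
    t0 = time.perf_counter()
    for _ in timed:
        pass  # the training step would run here
    total = time.perf_counter() - t0
    print(f"Fraction of loop time spent waiting on data: {timed.wait_seconds / total:.0%}")
```

Wrapping the real loader the same way (`for batch in DataWaitTimer(loader): ...`) gives a wait fraction for an actual run; a large fraction points at the data pipeline rather than the model.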

Problem 2: Inference Over-Provisioning

Symptom: GPU utilization consistently below 30%.
Cause: Servers sized for peak, running at trough.

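# Fix: Kubernetes HPA auto-scaling
# Requires a custom-metrics adapter (e.g., Prometheus Adapter) that exposes a
# per-pod gpu_utilization metric. The resource names below are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"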
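
To see how this target drives scaling, the sketch below applies the replica formula from the Kubernetes HPA documentation, desired = ceil(current replicas x current metric / target), clamped to the min/max bounds. It is a simplification: the real controller also applies a tolerance band and stabilization windows, which are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float = 70.0,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Simplified HPA scaling rule, clamped to the configured bounds."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(desired_replicas(4, 35.0))  # trough: average GPU utilization at 35%
    print(desired_replicas(2, 95.0))  # peak: average GPU utilization at 95%
```

With four replicas averaging 35% the rule scales in, and with two replicas averaging 95% it scales out, which is what keeps average utilization near the 70% target.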

Problem 3: Small Batch Sizes

Symptom: GPU memory only 40-60% used during training.
Cause: Batch size too small.

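Fix: Increase the per-GPU batch size until memory is 85-95% utilized. To keep convergence behavior unchanged, hold the effective batch size constant: if you use gradient accumulation, reduce the number of accumulation steps in proportion to the larger per-GPU batch.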
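
The relationship between per-GPU batch size and accumulation steps is simple arithmetic. A minimal sketch, assuming data-parallel training where the effective batch is per-GPU batch x GPU count x accumulation steps (the batch sizes in the example are hypothetical):

```python
def accumulation_steps(effective_batch: int, per_gpu_batch: int, num_gpus: int) -> int:
    """Accumulation steps needed to reach a fixed effective batch size."""
    per_step = per_gpu_batch * num_gpus
    if per_step <= 0 or effective_batch % per_step != 0:
        raise ValueError("effective_batch must be a positive multiple of per_gpu_batch * num_gpus")
    return effective_batch // per_step


if __name__ == "__main__":
    # Hypothetical run: effective batch of 1,024 samples on 8 GPUs
    print(accumulation_steps(1024, per_gpu_batch=16, num_gpus=8))  # small batch
    print(accumulation_steps(1024, per_gpu_batch=64, num_gpus=8))  # larger batch
```

Raising the per-GPU batch from 16 to 64 cuts the accumulation steps from 8 to 2 while the optimizer still sees 1,024 samples per update.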

Problem 4: Forgotten Development Clusters

Symptom: Clusters running 24/7 with <10% utilization.
Cause: Nobody shut them down after the experiment.

Fix: Implement automatic shutdown for idle clusters:

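# Auto-terminate idle io.net clusters
# `client` stands in for your cluster-management API wrapper; the method and
# attribute names below are illustrative.
import time

def monitor_and_cleanup(client, idle_threshold_minutes=60):
    """Terminate clusters that have been idle longer than the threshold."""
    for cluster in client.list_clusters():
        if cluster.gpu_utilization < 5:  # percent
            # last_active is assumed to be a Unix timestamp in seconds
            idle_seconds = time.time() - cluster.last_active
            if idle_seconds > idle_threshold_minutes * 60:
                print(f"Terminating idle cluster: {cluster.name}")
                client.terminate_cluster(cluster.id)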

Cost Impact of Utilization Improvements

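| Utilization Change | Effective Capacity | 8x H100 Monthly Savings |
|---|---|---|
| 30% -> 60% | 2x | $7,171 (halve GPU count) |
| 40% -> 70% | 1.75x | $6,147 |
| 50% -> 80% | 1.6x | $5,378 |

Savings assume $2.49/hr per GPU, a 720-hour month, and the same useful work done on proportionally fewer GPUs.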
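
The savings column can be reproduced with a few lines of arithmetic. This sketch assumes a 720-hour month, the $2.49/hr H100 rate, and that the same useful work runs on proportionally fewer GPUs. Fractional GPU counts are allowed here, so real savings depend on how many whole GPUs you can actually release.

```python
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours


def monthly_savings(num_gpus: int, hourly_rate: float,
                    current_util: float, target_util: float) -> float:
    """Monthly savings from doing the same work at a higher utilization."""
    if not 0 < current_util <= target_util <= 1:
        raise ValueError("require 0 < current_util <= target_util <= 1")
    monthly_cost = num_gpus * hourly_rate * HOURS_PER_MONTH
    # Only current_util / target_util of the fleet is still needed
    return monthly_cost * (1 - current_util / target_util)


if __name__ == "__main__":
    for current, target in [(0.30, 0.60), (0.40, 0.70), (0.50, 0.80)]:
        saved = monthly_savings(8, 2.49, current, target)
        print(f"{current:.0%} -> {target:.0%}: ${saved:,.0f} saved per month")
```

The 30% -> 60% case reproduces the $7,171 figure: half of the 8-GPU fleet's monthly bill of about $14,342.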

Quick Wins Ranked by Impact

| Optimization | Effort | Utilization Gain | Cost Savings |
|---|---|---|---|
| Shut down idle dev instances | 5 minutes | Eliminates waste | 10-30% |
| Enable persistent DataLoader workers | 10 minutes | +10-20% training | 5-15% |
| Switch to vLLM from HF generate | 1 hour | +30-50% inference | 15-30% |
| Auto-scale inference replicas | 1 day | Match traffic shape | 20-40% |
| Quantize inference model (INT4) | 2 hours | 2-4x throughput/GPU | 50-75% |

Advanced: Model FLOPS Utilization (MFU)

MFU measures what fraction of theoretical peak FLOPS is used for model computation. It is the gold standard for training efficiency.

MFU = (tokens/sec x FLOPs_per_token) / (num_GPUs x peak_FLOPS_per_GPU)

For Llama 70B on 8x H100:
tokens/sec = 4,000 (across all 8 GPUs)
FLOPs_per_token = 6 x 70e9 = 420 GFLOP (forward + backward)
peak per GPU = 989 TFLOPS (BF16 dense; the 1,979 TFLOPS datasheet figure assumes structured sparsity)

MFU = (4000 x 420e9) / (8 x 989e12) = 0.212 = 21.2%

That leaves clear room for improvement. In practice:

- MFU of 30-45% is good for multi-node training
- MFU above 50% is excellent (requires careful optimization)
- MFU of 10-20% indicates significant optimization opportunity
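
The MFU formula is easy to wrap in a helper for logging during training. This sketch uses the common 6 x parameters approximation for FLOPs per token and assumes a dense BF16 peak of roughly 989 TFLOPS per H100 SXM (the 1,979 TFLOPS datasheet figure includes structured sparsity). Pass a different peak for other hardware or precisions.

```python
H100_BF16_DENSE_PEAK = 989e12  # FLOPS per GPU; assumed dense (non-sparse) figure


def model_flops_utilization(tokens_per_sec: float, num_params: float,
                            num_gpus: int,
                            peak_flops_per_gpu: float = H100_BF16_DENSE_PEAK) -> float:
    """MFU from aggregate tokens/sec across all GPUs."""
    flops_per_token = 6 * num_params  # forward + backward approximation
    achieved = tokens_per_sec * flops_per_token
    return achieved / (num_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    mfu = model_flops_utilization(tokens_per_sec=4000, num_params=70e9, num_gpus=8)
    print(f"MFU: {mfu:.1%}")
```

Logging this value per run makes it straightforward to see whether a configuration change moved training toward the 30-45% range.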

Improving MFU

  1. Increase batch size (more compute per communication)
  2. Use Flash Attention (reduces memory overhead)
  3. Overlap communication with computation (gradient all-reduce during backward)
  4. Use BF16 or FP8 (match tensor core precision)
  5. Optimize activation checkpointing (trade compute for memory efficiently)

Operational Best Practices

Weekly Utilization Review Checklist

  1. Identify GPU instances below 30% average utilization
  2. Find zombie clusters that nobody is using
  3. Check inference auto-scaling effectiveness
  4. Review training efficiency (tokens/sec vs. expected)
  5. Calculate total cost waste from underutilization
  6. Set action items for next week
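
The first item on the checklist can be automated once utilization samples are exported. A minimal sketch, assuming you already have per-instance utilization percentages for the week (for example, pulled from Prometheus); the instance names and the function name are hypothetical:

```python
from statistics import mean


def underutilized(samples_by_instance: dict[str, list[float]],
                  threshold_pct: float = 30.0) -> list[str]:
    """Return instances whose average utilization is below the threshold."""
    flagged = [
        name for name, samples in samples_by_instance.items()
        if samples and mean(samples) < threshold_pct
    ]
    return sorted(flagged)


if __name__ == "__main__":
    week = {
        "train-a": [85.0, 90.0, 88.0],
        "dev-b": [2.0, 0.0, 5.0],
        "infer-c": [25.0, 35.0, 28.0],
    }
    print(underutilized(week))
```

Instances with no samples are skipped rather than flagged, so a missing exporter does not trigger a false alarm.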

GPU Budget Allocation

Monthly Budget: 10,000 H100-hours ($24,900 on io.net)

- Production inference: 5,000 hrs (target 75% util)
- Training: 3,000 hrs (target 85% util)
- Dev/experimentation: 1,500 hrs (target 50% util)
- Burst buffer: 500 hrs
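
A short script can sanity-check an allocation like this one: that the buckets add up to the budget, what each costs, and how many useful GPU-hours the targets imply. It uses the figures above; leaving the burst buffer out of the useful-hours estimate is an assumption, since it has no utilization target.

```python
H100_RATE = 2.49  # USD per GPU-hour

# bucket -> (allocated hours, target utilization or None)
ALLOCATION = {
    "production_inference": (5_000, 0.75),
    "training": (3_000, 0.85),
    "dev_experimentation": (1_500, 0.50),
    "burst_buffer": (500, None),
}


def total_hours(allocation: dict) -> int:
    """Sum of allocated GPU-hours across all buckets."""
    return sum(hours for hours, _ in allocation.values())


def bucket_costs(allocation: dict, rate: float) -> dict[str, float]:
    """Dollar cost of each bucket at the given hourly rate."""
    return {name: hours * rate for name, (hours, _) in allocation.items()}


def useful_hours(allocation: dict) -> float:
    """GPU-hours of useful work if every bucket hits its target."""
    return sum(hours * util for hours, util in allocation.values() if util is not None)


if __name__ == "__main__":
    print(f"Total: {total_hours(ALLOCATION):,} hrs")
    for name, cost in bucket_costs(ALLOCATION, H100_RATE).items():
        print(f"{name}: ${cost:,.0f}")
    print(f"Useful hours at target: {useful_hours(ALLOCATION):,.0f}")
```

Dividing the total cost by the useful hours gives a blended cost per useful GPU-hour that can be tracked month over month.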

Frequently Asked Questions

What is a good GPU utilization target?

Training: 75-90%. Inference with auto-scaling: 60-80% average. Below 50% sustained means over-provisioned or bottlenecked.

How do I know if data loading is the problem?

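Watch nvidia-smi for oscillation between 0% and 100%. If the GPU drops to 0% regularly, data loading is starving it. Also check CPU utilization: if the data-loading workers keep the CPU cores saturated while the GPU idles, preprocessing is the bottleneck.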

Does higher utilization damage GPUs?

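No. Data-center GPUs are designed to run at full load continuously, and thermal throttling protects the hardware automatically. Higher utilization is better as long as it reflects useful work; as noted above, 100% utilization alone does not prove the kernels are efficient.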

How often should I check utilization?

Continuously via monitoring for production. Weekly aggregate review for optimization. Immediately after configuration changes.

What monitoring tools does io.net provide?

io.net's dashboard shows real-time GPU utilization, memory usage, and cluster status. For deeper monitoring, deploy Prometheus + Grafana on your cluster.

What is the fastest way to improve utilization?

Shut down idle clusters. This takes 5 minutes and can save 10-30% of your monthly GPU spend immediately.

Conclusion

GPU utilization optimization is the highest-ROI infrastructure investment most AI teams can make. The techniques are straightforward (monitoring, right-sizing, auto-scaling, eliminating bottlenecks) and the savings are immediate.

On io.net, every utilization improvement translates directly to lower costs. No wasted reservations, no annual commitments, no sunk hardware costs. You pay for what you use, period.


Optimize your GPU spend on io.net. Sign up and start monitoring utilization today.