FAQ: How Do I Debug GPU Workloads in the Cloud?

Debugging on a cloud GPU is different from debugging locally. You can't just pop open a GUI profiler or restart your machine. SSH sessions time out, CUDA errors are cryptic, and OOM kills happen without warning. After years of watching teams burn GPU hours chasing ghosts, we've found that these debugging workflows actually save time and money.

The Most Common GPU Problems (and Fast Fixes)

1. CUDA Out of Memory (OOM)

The single most frequent GPU error. The traceback will say something like CUDA out of memory. Tried to allocate 2.00 GiB.

First, figure out what's eating your memory:

import torch
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(torch.cuda.memory_summary())

Usual culprits:
- Batch size too large → reduce it
- Accumulating loss or output tensors in a Python list during training → each one keeps its computation graph alive; store loss.item() for scalar losses instead
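- Not calling optimizer.zero_grad() → gradients from every step keep summing into .grad (mainly a correctness bug rather than a memory leak, but worth ruling out while you are inspecting the loop)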
- Evaluation loop not wrapped in torch.no_grad() → stores activations for backward pass that'll never happen
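
The fixes above can be sketched as one small loop. This is a minimal illustration, not a drop-in trainer: the toy model, the random data, and the helper names train_one_epoch and evaluate are placeholders invented for the example. It falls back to CPU, so it can be tried without a GPU.

```python
import torch
from torch import nn


def train_one_epoch(model, batches, optimizer, loss_fn):
    """Return the mean loss as a Python float, keeping no graph alive."""
    model.train()
    total = 0.0
    for x, y in batches:
        optimizer.zero_grad(set_to_none=True)  # drop gradients from the previous step
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        total += loss.item()  # copy the scalar out so the graph can be freed
    return total / len(batches)


@torch.no_grad()  # no activations are kept for a backward pass
def evaluate(model, batches, loss_fn):
    model.eval()
    return sum(loss_fn(model(x), y).item() for x, y in batches) / len(batches)


# Hypothetical toy setup: 3 batches of random data
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [
    (torch.randn(4, 8, device=device), torch.randn(4, 1, device=device))
    for _ in range(3)
]
train_loss = train_one_epoch(model, batches, optimizer, nn.MSELoss())
eval_loss = evaluate(model, batches, nn.MSELoss())
print(f"train={train_loss:.4f} eval={eval_loss:.4f}")
```

If memory still climbs from step to step with a loop shaped like this, the leak is probably outside the loop body (for example, in logging or metric code that holds on to tensors).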

2. Training loss is NaN or Inf

Almost always a numerical stability issue. Debug checklist:
- Check for zeros in your input data (division by zero in normalization)
- Reduce learning rate by 10x
- If using FP16, switch to BF16 or lower the GradScaler initial scale (init_scale), since a scale that is too large overflows FP16 gradients
- Add gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
- Inspect specific layers after loss.backward():

for name, p in model.named_parameters():
    if p.grad is not None and torch.isnan(p.grad).any():
        print(f"NaN gradient in {name}")
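
Several items from the checklist can be combined into a guarded update step. The sketch below is one possible arrangement, not a standard recipe: guarded_step and nonfinite_grad_names are hypothetical helpers, and skipping the update on a bad loss is a design choice you may or may not want. The toy example reproduces the division-by-zero case with a zero standard deviation.

```python
import torch
from torch import nn


def nonfinite_grad_names(model):
    """Names of parameters whose gradient contains NaN or Inf."""
    return [
        name
        for name, p in model.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all()
    ]


def guarded_step(model, optimizer, loss, max_norm=1.0):
    """Backprop, clip, and step; skip the update if anything is non-finite."""
    optimizer.zero_grad(set_to_none=True)
    if not torch.isfinite(loss):
        return False, []
    loss.backward()
    bad = nonfinite_grad_names(model)
    if bad:
        return False, bad
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return True, []


# Toy demonstration
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(2, 4)

ok, _ = guarded_step(model, optimizer, model(x).pow(2).mean())

std = torch.zeros(1)  # zero std, as with a constant feature column
bad_loss = (model(x) / std).mean()  # division by zero makes the loss non-finite
stepped, _ = guarded_step(model, optimizer, bad_loss)
print(ok, stepped)
```

Returning the offending parameter names makes it easier to log which layer went bad first, which is usually more informative than the NaN loss itself.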

3. GPU utilization stuck at 0% or very low

Your GPU is idle because something else is the bottleneck:

# Check GPU utilization
nvidia-smi --query-gpu=utilization.gpu --format=csv -l 1

If utilization is under 20%, the bottleneck is likely:
- Data loading — increase num_workers in DataLoader, enable pin_memory=True
- CPU preprocessing — move preprocessing to GPU or pre-process offline
- Network I/O — downloading data during training instead of pre-loading
- Python GIL — use torch.multiprocessing instead of threading
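
To tell a data-loading bottleneck from a slow training step, it can help to time the two separately. This is a rough sketch using only the standard library; time_data_vs_step and slow_loader are made-up names, and the sleeping generator merely stands in for a real DataLoader.

```python
import time


def time_data_vs_step(batches, step_fn, max_steps=20):
    """Return (seconds spent waiting for data, seconds spent inside step_fn)."""
    data_s = step_s = 0.0
    it = iter(batches)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks while the loader produces a batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)
        t2 = time.perf_counter()
        data_s += t1 - t0
        step_s += t2 - t1
    return data_s, step_s


def slow_loader(n, delay=0.02):
    """Stand-in for a loader that is slow to decode or augment on the CPU."""
    for i in range(n):
        time.sleep(delay)
        yield i


data_s, step_s = time_data_vs_step(slow_loader(5), lambda batch: None)
print(f"data wait: {data_s:.3f}s, step: {step_s:.3f}s")
```

With a real GPU step, end step_fn with torch.cuda.synchronize(); CUDA kernels launch asynchronously, so without it the step time is under-reported and the wait shows up in the wrong column.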

4. Multi-GPU training hangs

Usually an NCCL communication deadlock. Most common causes:
- One GPU crashes silently (check dmesg for hardware errors)
- Network interface mismatch — set NCCL_SOCKET_IFNAME=eth0 (or whatever your interface is)
- Firewall blocking NCCL ports — io.net clusters have these open, but custom setups might not
- Asymmetric execution — one rank took a different code path (if/else based on input that differs across ranks)

Debug with: NCCL_DEBUG=INFO torchrun --nproc_per_node=8 train.py
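
If you launch jobs from a Python wrapper rather than the shell, the same settings can be passed through the environment. A minimal sketch, assuming a hypothetical helper named nccl_debug_command; it only builds the command and environment, so it runs without torchrun installed.

```python
import os


def nccl_debug_command(script, nproc_per_node, iface=None, base_env=None):
    """Build a torchrun command and an environment with NCCL logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    if iface:
        env["NCCL_SOCKET_IFNAME"] = iface  # pin NCCL to one network interface
    cmd = ["torchrun", f"--nproc_per_node={nproc_per_node}", script]
    return cmd, env


# base_env={} keeps the example output small; omit it for a real launch so
# PATH and the rest of the current environment are inherited.
cmd, env = nccl_debug_command("train.py", 8, iface="eth0", base_env={})
print(" ".join(cmd))
print(env)
# To launch for real: subprocess.run(cmd, env=env, check=True)
```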

Essential Monitoring Commands

Keep a second terminal open with these running:

# Real-time GPU monitoring (updates every second)
watch -n 1 nvidia-smi

# Detailed GPU process list
nvidia-smi pmon -s u -d 1

# System memory and CPU
htop

# Disk I/O (data loading bottlenecks)
iostat -x 1

# Network throughput (multi-node training)
iftop -i eth0
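
When you want these numbers in a log file rather than a terminal, nvidia-smi's CSV output is easy to parse. The sketch below is an illustration under assumptions: parse_gpu_csv and sample_gpus are made-up helper names, and the sample string is invented data in the format produced by --format=csv,noheader,nounits, so the example runs on a machine without a GPU.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        util, used, total = (float(field) for field in line.split(","))
        rows.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return rows


def sample_gpus():
    """Query the local GPUs once (requires nvidia-smi on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


# Invented two-GPU sample, parsed without calling nvidia-smi
sample = "97, 30512, 81920\n3, 1024, 81920\n"
for i, gpu in enumerate(parse_gpu_csv(sample)):
    flag = "  <- under 20%, likely starved" if gpu["util_pct"] < 20 else ""
    print(
        f"GPU {i}: {gpu['util_pct']:.0f}% util, "
        f"{gpu['mem_used_mib']:.0f}/{gpu['mem_total_mib']:.0f} MiB{flag}"
    )
```

Calling sample_gpus() in a loop with a short sleep and appending the rows to a file gives a utilization history you can line up against training logs.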

Remote Debugging Setup

For interactive debugging on cloud GPUs, set up VS Code Remote SSH or use debugpy:

# Add to your training script for remote attach
import debugpy
debugpy.listen(("127.0.0.1", 5678))  # loopback only; reached through the SSH tunnel below
print("Waiting for debugger attach...")
debugpy.wait_for_client()

Then forward port 5678 from the cloud GPU to your local machine:

ssh -L 5678:localhost:5678 user@gpu-instance

Attach VS Code's debugger to localhost:5678. Now you can set breakpoints, inspect tensors, and step through code running on a cloud GPU.
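
In practice you rarely want every run, or every rank of a distributed job, to block waiting for a debugger. One way to handle that is an opt-in switch, sketched here under assumptions: DEBUG_ATTACH is a hypothetical environment variable invented for this example, RANK is the variable torchrun sets for each process, and the helper names are made up.

```python
import os


def should_wait_for_debugger(env):
    """Attach only when explicitly requested, and only on rank 0."""
    requested = env.get("DEBUG_ATTACH", "0") == "1"  # hypothetical opt-in flag
    rank = int(env.get("RANK", "0"))  # set by torchrun; 0 for single-process runs
    return requested and rank == 0


def maybe_wait_for_debugger(env=None, port=5678):
    """Block until a debugger attaches, if this process opted in."""
    env = os.environ if env is None else env
    if not should_wait_for_debugger(env):
        return False
    import debugpy  # imported lazily so normal runs don't need it installed

    debugpy.listen(("127.0.0.1", port))  # loopback only; use the SSH tunnel to reach it
    print(f"Waiting for debugger attach on port {port}...")
    debugpy.wait_for_client()
    return True


maybe_wait_for_debugger()  # no-op unless DEBUG_ATTACH=1 is set
```

Restricting the listener to rank 0 also avoids several processes on the same machine trying to bind the same port.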

Profiling Before Debugging

Sometimes the problem isn't a bug — it's a performance problem masquerading as a bug ("training seems stuck" when really it's just slow). Profile first:

# PyTorch profiler: writes a trace file to ./profiler_logs
# (model, dataloader, and train_step come from your training script)
import torch
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA
    ],
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./profiler_logs'),
    record_shapes=True,
    with_stack=True
) as prof:
    for step, batch in enumerate(dataloader):
        if step >= 5:  # Profile 5 steps
            break
        train_step(model, batch)
        prof.step()

Open the trace in TensorBoard (this needs the torch-tb-profiler plugin) or load the generated .json trace file in Chrome (chrome://tracing) to see exactly where time is spent.
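
For a quick look without opening a trace viewer, the profiler can also print a per-operator summary. A small self-contained sketch, assuming a toy model and random input as placeholders; it profiles on CPU and adds CUDA activity only when a GPU is present.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile

# Placeholder model and input, just to give the profiler something to record
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
x = torch.randn(32, 64)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):
        model(x).sum().backward()

# Aggregate events by operator name and show the ten most expensive
summary = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
print(summary)
```

On a GPU run you would typically sort by a CUDA or device time column instead; the exact sort key name has changed between PyTorch releases, so check the profiler documentation for your version.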

When You're Really Stuck

If none of the above helps, these steps almost always unblock you:

- Reproduce on a single GPU: if the bug disappears, it's a distributed training issue
- Reduce to the minimum reproducing example: strip out data augmentation, use dummy data, reduce model size
- Check NVIDIA driver and CUDA version compatibility: nvidia-smi shows the driver version and the newest CUDA version that driver supports, nvcc --version shows the installed CUDA toolkit, and torch.version.cuda shows the CUDA version PyTorch was built against. Mismatches cause subtle, maddening bugs
- Update PyTorch: many CUDA bugs are fixed in newer releases
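
The version check can be scripted. The sketch below parses the two tools' output and applies a deliberately conservative rule: the toolkit should not be newer than the CUDA version that nvidia-smi reports for the driver. The helper names are invented, the sample strings are made-up text in the format the tools print, and the rule ignores CUDA's minor-version compatibility, so treat a False result as a prompt to investigate rather than proof of a problem.

```python
import re


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output."""
    m = re.search(r"release (\d+)\.(\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def parse_smi_cuda_version(text):
    """Extract (major, minor) from the `nvidia-smi` header line."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def toolkit_fits_driver(toolkit, driver_max):
    """Conservative check: the toolkit is not newer than the driver's CUDA version."""
    return toolkit <= driver_max


# Invented sample lines, so this runs on a machine without CUDA installed
nvcc_out = "Cuda compilation tools, release 12.4, V12.4.131"
smi_out = "| NVIDIA-SMI 535.129.03   Driver Version: 535.129.03   CUDA Version: 12.2 |"

toolkit = parse_nvcc_release(nvcc_out)
driver_max = parse_smi_cuda_version(smi_out)
print(toolkit, driver_max, toolkit_fits_driver(toolkit, driver_max))
```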
