You set up a multi-GPU cluster, configure torchrun or torch.distributed.launch, and let PyTorch's DistributedDataParallel handle the rest. On io.net, the whole process takes about 10 minutes from zero to training across 8 GPUs.

But the details matter — wrong NCCL settings, mismatched CUDA versions, or poor data loading choices can kill your scaling efficiency. This walkthrough covers what actually works in production, not just the textbook version.

Step-by-Step: PyTorch DDP on io.net

1. Provision your cluster

On io.net, spin up a multi-GPU instance. For most training jobs:

  • Fine-tuning 7-13B models: 2-4x RTX 4090 ($0.36-$0.72/hr total)
  • Training 13-70B models: 4-8x A100 80GB ($5.96-$11.92/hr total)
  • Pre-training or large-scale experiments: 8x H100 SXM ($17.60/hr total)

Select SXM variants when available — NVLink interconnects give you 600-900 GB/s between GPUs versus 64 GB/s on PCIe. That gap matters enormously once you're synchronizing gradients across 4+ cards.

2. Set up your training script

The minimum changes to convert a single-GPU script to distributed:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

def train():
    local_rank = setup()

    # Your model — wrap it in DDP
    model = YourModel().to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Distributed sampler ensures each GPU sees different data
    dataset = YourDataset()
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, batch_size=16, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for batch in dataloader:
            batch = {k: v.to(local_rank) for k, v in batch.items()}
            loss = model(**batch).loss
            loss.backward()      # Gradients auto-synced by DDP
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    train()

3. Launch it

# Single node, 8 GPUs
torchrun --nproc_per_node=8 train.py

# Multi-node (2 nodes, 8 GPUs each = 16 total)
# On node 0:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr=10.0.0.1 --master_port=29500 train.py

# On node 1:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
    --master_addr=10.0.0.1 --master_port=29500 train.py

Common Pitfalls (and How to Avoid Them)

These are the issues that burn hours of debugging time:

NCCL timeout errors. If you see NCCL watchdog timeout, it's almost always a networking issue. Set NCCL_SOCKET_IFNAME to the correct network interface and increase the timeout: NCCL_TIMEOUT=1800. On io.net clusters, the internal network interface is pre-configured, but verify with ifconfig.

OOM on rank 0 only. DDP stores the gradient reduction buffer on rank 0 by default. If only your first GPU runs out of memory, enable gradient_as_bucket_view=True in the DDP constructor and reduce your per-GPU batch size by 10%.

Slow data loading bottleneck. When 8 GPUs are waiting on data, your IO pipeline becomes the chokepoint. Use num_workers=4 per GPU (so 32 total across 8 GPUs), pin_memory=True, and pre-process your data into memory-mapped formats like WebDataset or HuggingFace Arrow.

Learning rate scaling. Linear scaling rule: if you 8x your effective batch size (across 8 GPUs), multiply your learning rate by 8. But use warmup — the first 500-1000 steps should linearly ramp up from a small LR to avoid divergence at large batch sizes.

Checkpoint saving on all ranks. Only save checkpoints from rank 0. Every other rank should skip the save call, or you'll write 8 copies of the same checkpoint and waste storage:

if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "checkpoint.pt")

Performance Tuning That Actually Matters

Not all optimizations are worth your time. These three have the biggest impact:

Enable mixed precision (BF16/FP16): Nearly doubles throughput with negligible quality loss. One line with torch.cuda.amp:

scaler = torch.GradScaler()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(**batch).loss
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Gradient accumulation for effective large batches: If 8 GPUs with batch 16 each (effective 128) still isn't enough for stable training, accumulate gradients over 4 steps to simulate batch 512 without needing more memory:

accumulation_steps = 4
for i, batch in enumerate(dataloader):
    loss = model(**batch).loss / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Compile the model with torch.compile: On PyTorch 2.0+, this can give 20-40% speedup for free:

model = torch.compile(model)  # Before wrapping with DDP
model = DDP(model, device_ids=[local_rank])

Scaling Efficiency on io.net

We measured real scaling numbers on io.net clusters training Llama 3 8B:

GPUsConfigTime per epochScaling efficiencyTotal cost (10 epochs)
11x A1008.2 hrs100% (baseline)$100
22x A1004.4 hrs93%$107
44x A1002.3 hrs89%$112
88x A100 (NVLink)1.2 hrs85%$117
88x A100 (PCIe)1.7 hrs60%$165

That last row shows why NVLink matters — PCIe-only clusters lose 25% of their scaling to communication overhead, costing 41% more for the same job


Deploy a PyTorch cluster on io.net — 8x A100 with NVLink from $11.92/hr. Launch cluster