Data parallelism splits batches across GPUs, achieving near-linear speedup (8 GPUs = 7-8x faster with 85-95% efficiency). Model parallelism splits model layers across GPUs for models too large for single GPU. Example: Training LLaMA 2 70B takes 7 days on 1x A100 → 1 day on 8x A100 (85% scaling efficiency). Communication overhead (gradient sync, activation passing) limits scaling beyond 16-32 GPUs without specialized interconnects (NVLink, InfiniBand).

Two Types of Distributed Training

1. Data Parallelism (Most Common)

Split the training batch across GPUs. Each GPU processes a subset, then gradients are synchronized.

  • How it works: 8 GPUs × 4 samples/GPU = 32 samples/batch total
  • Speedup: Near-linear (8 GPUs = 7-8x faster)
  • Best for: Models that fit on single GPU, large datasets

2. Model Parallelism

Split model layers across GPUs. GPU1 computes layers 1-10, GPU2 computes layers 11-20.

  • How it works: Activations pass GPU1 → GPU2 → GPU3 (pipeline)
  • Speedup: Sub-linear (8 GPUs = 4-6x faster, due to sequential dependencies)
  • Best for: Models >70B parameters that don't fit on single GPU

Scaling Efficiency: Real-World Benchmarks

GPU CountIdeal SpeedupActual Speedup (Data Parallel)Efficiency
2 GPUs2.0x1.85x92%
4 GPUs4.0x3.5x87%
8 GPUs8.0x6.8x85%
16 GPUs16.0x12.8x80%

Why not 100% efficiency? Communication overhead. Synchronizing gradients across 8 GPUs takes time (1-2 sec per batch with NVLink, 5-10 sec with PCIe).

Real-World Example: LLaMA 2 70B Training

  • 1x A100 80GB: 7 days to train on 100B tokens
  • 8x A100 80GB: 1 day (85% efficiency = 5.95x speedup vs ideal 7x)
  • Cost comparison: 1 GPU × 168 hrs × $2/hr = $336. 8 GPUs × 24 hrs × $2/hr = $384. You pay 14% more for 7x faster results.

Frameworks for Distributed Training

  • PyTorch DDP: Data parallelism, easy setup, 85-90% efficiency
  • DeepSpeed: Both data + model parallelism, ZeRO optimizer (reduces memory)
  • FSDP: Fully Sharded Data Parallel (PyTorch), 90%+ efficiency for 100+ GPUs
  • Ray Train: Distributed training with autoscaling, handles multi-node clusters

Key insight: Distributed training isn't about saving money—it's about saving time. 8 GPUs for 1 day costs the same as 1 GPU for 7 days, but you get results 7 days earlier.

Deploy Distributed Training Cluster

Launch 2-100 GPU clusters on io.net in 60 seconds. PyTorch DDP, DeepSpeed, and Ray pre-configured.

Launch Multi-GPU Cluster →