Data parallelism splits batches across GPUs, achieving near-linear speedup (8 GPUs = 7-8x faster with 85-95% efficiency). Model parallelism splits model layers across GPUs for models too large for single GPU. Example: Training LLaMA 2 70B takes 7 days on 1x A100 → 1 day on 8x A100 (85% scaling efficiency). Communication overhead (gradient sync, activation passing) limits scaling beyond 16-32 GPUs without specialized interconnects (NVLink, InfiniBand).
Two Types of Distributed Training
1. Data Parallelism (Most Common)
Split the training batch across GPUs. Each GPU processes a subset, then gradients are synchronized.
- How it works: 8 GPUs × 4 samples/GPU = 32 samples/batch total
- Speedup: Near-linear (8 GPUs = 7-8x faster)
- Best for: Models that fit on single GPU, large datasets
2. Model Parallelism
Split model layers across GPUs. GPU1 computes layers 1-10, GPU2 computes layers 11-20.
- How it works: Activations pass GPU1 → GPU2 → GPU3 (pipeline)
- Speedup: Sub-linear (8 GPUs = 4-6x faster, due to sequential dependencies)
- Best for: Models >70B parameters that don't fit on single GPU
Scaling Efficiency: Real-World Benchmarks
| GPU Count | Ideal Speedup | Actual Speedup (Data Parallel) | Efficiency |
|---|---|---|---|
| 2 GPUs | 2.0x | 1.85x | 92% |
| 4 GPUs | 4.0x | 3.5x | 87% |
| 8 GPUs | 8.0x | 6.8x | 85% |
| 16 GPUs | 16.0x | 12.8x | 80% |
Why not 100% efficiency? Communication overhead. Synchronizing gradients across 8 GPUs takes time (1-2 sec per batch with NVLink, 5-10 sec with PCIe).
Real-World Example: LLaMA 2 70B Training
- 1x A100 80GB: 7 days to train on 100B tokens
- 8x A100 80GB: 1 day (85% efficiency = 5.95x speedup vs ideal 7x)
- Cost comparison: 1 GPU × 168 hrs × $2/hr = $336. 8 GPUs × 24 hrs × $2/hr = $384. You pay 14% more for 7x faster results.
Frameworks for Distributed Training
- PyTorch DDP: Data parallelism, easy setup, 85-90% efficiency
- DeepSpeed: Both data + model parallelism, ZeRO optimizer (reduces memory)
- FSDP: Fully Sharded Data Parallel (PyTorch), 90%+ efficiency for 100+ GPUs
- Ray Train: Distributed training with autoscaling, handles multi-node clusters
Key insight: Distributed training isn't about saving money—it's about saving time. 8 GPUs for 1 day costs the same as 1 GPU for 7 days, but you get results 7 days earlier.
Deploy Distributed Training Cluster
Launch 2-100 GPU clusters on io.net in 60 seconds. PyTorch DDP, DeepSpeed, and Ray pre-configured.
