Quick Answer
Yes, io.net fully supports multi-GPU training with configurations from 2 to 100+ GPUs across single-node and multi-node setups. You can provision GPU clusters with NVLink (600 GB/s), NVSwitch (900 GB/s), or InfiniBand (200-400 Gbps) interconnects for distributed training frameworks including PyTorch FSDP, DeepSpeed, Ray, and Horovod. An 8x H100 cluster costs $17.60/hr on io.net versus $55.84/hr on AWS (68% savings), with pre-configured environments that support data parallel, model parallel, and pipeline parallel training patterns for LLMs up to 100B+ parameters.
Multi-GPU Training Configurations
io.net offers flexible cluster configurations for every training scale:
| Configuration | GPUs | Interconnect | Bandwidth | Use Case | Cost/hr (H100) | Cost/hr (A100) |
|---|---|---|---|---|---|---|
| Single-node 2-GPU | 2 | NVLink | 600 GB/s | Small model parallel | $4.40 | $2.98 |
| Single-node 4-GPU | 4 | NVLink | 600 GB/s | Medium model training | $8.80 | $5.96 |
| Single-node 8-GPU | 8 | NVSwitch | 900 GB/s | Large model training (70B) | $17.60 | $11.92 |
| Multi-node 16-GPU | 16 | InfiniBand | 200 Gbps | Very large models (100B+) | $35.20 | $23.84 |
| Multi-node 32-GPU | 32 | InfiniBand | 400 Gbps | Pre-training from scratch | $70.40 | $47.68 |
| Multi-node 64-GPU | 64 | InfiniBand | 400 Gbps | Foundation model training | $140.80 | $95.36 |
Compare to AWS (8x H100 cluster):
- io.net: $17.60/hr
- AWS p5.48xlarge: $55.84/hr
- Savings: $38.24/hr (68% cheaper) = $27,533/month for continuous training
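The savings figures above can be reproduced with a few lines of arithmetic. The sketch below takes the two hourly rates from the comparison and assumes a 720-hour month (30 days of continuous use); the function name is illustrative.

```python
def compare_hourly_rates(cheaper: float, pricier: float, hours: float = 720.0):
    """Return (savings per hour, savings as a fraction of the pricier rate, savings over `hours`)."""
    per_hour = pricier - cheaper
    fraction = per_hour / pricier
    return per_hour, fraction, per_hour * hours

# Rates from the comparison above; 720 hours is an assumed "continuous" month.
per_hour, fraction, per_month = compare_hourly_rates(17.60, 55.84)
print(f"${per_hour:.2f}/hr saved ({fraction:.0%}), ${per_month:,.0f} per 720-hour month")
```

Changing `hours` gives the savings for a shorter run, for example a one-week fine-tuning job.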
Supported Distributed Training Frameworks
io.net provides pre-configured support for all major distributed training frameworks:
PyTorch Distributed:
# Launch 8-GPU cluster
io launch --gpu H100 --count 8 --network nvswitch
# Fully Sharded Data Parallel (FSDP) for 70B models
torchrun --nproc_per_node=8 train.py \
--model meta-llama/Llama-3-70B \
--strategy fsdp \
--precision bf16-mixed
# Output: 8x H100 training at 3.2 samples/sec (batch size 32)
# Cost: $17.60/hr
# 100K samples in ~8.7 hours = $153 total
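The time and cost estimate in the comments above follows directly from throughput. A minimal sketch, assuming throughput stays constant for the whole run and ignoring startup, checkpointing, and evaluation time (the function name is illustrative):

```python
def estimate_run(samples: int, samples_per_sec: float, cluster_rate_per_hr: float):
    """Return (hours, cost) for a run at constant throughput."""
    hours = samples / samples_per_sec / 3600
    return hours, hours * cluster_rate_per_hr

# Figures quoted above: 100K samples, 3.2 samples/sec, $17.60/hr for 8x H100.
hours, cost = estimate_run(100_000, 3.2, 17.60)
print(f"{hours:.1f} hours, ${cost:.0f}")
```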
DeepSpeed:
# ZeRO-3 for memory-efficient training of 100B+ models
deepspeed --num_gpus=8 train.py \
--deepspeed ds_config_zero3.json \
--model meta-llama/Llama-3-70B \
--per_device_train_batch_size 2 \
--gradient_checkpointing \
--zero_stage 3
# Enables training 70B model with <40GB memory per GPU
# Fitting 175B models on 8x A100 80GB additionally requires CPU/NVMe offload (ZeRO-Offload / ZeRO-Infinity)
Ray Train:
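# Distributed training with fault tolerance
from ray import train
from ray.train.torch import TorchTrainer

def train_loop(config):
    # Per-worker training code goes here (runs once on each of the 8 workers)
    ...

trainer = TorchTrainer(
    train_loop_per_worker=train_loop,
    scaling_config=train.ScalingConfig(
        num_workers=8,
        use_gpu=True,
        resources_per_worker={"GPU": 1},
    ),
    # Retry the run up to 3 times if a worker fails
    run_config=train.RunConfig(failure_config=train.FailureConfig(max_failures=3)),
)
trainer.fit()
# Recovery resumes from the latest checkpoint that train_loop reported via train.report()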
Horovod:
# MPI-based distributed training
horovodrun -np 8 -H localhost:8 python train.py \
--batch-size 256 \
--learning-rate 0.001
# Efficient gradient all-reduce for image models
Distributed Training Patterns Explained
Choose the right parallelism strategy based on your model size:
1. Data Parallel (DP)
- How it works: Replicate the entire model on each GPU and split each data batch across the GPUs
- Best for: Models that fit on a single GPU (roughly <7B parameters for full training; somewhat larger with LoRA)
- Scaling efficiency: ~90% of linear scaling up to 8 GPUs
- Example: Llama 3 8B LoRA fine-tuning on 8x RTX 4090
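# PyTorch DistributedDataParallel (assumes torchrun set LOCAL_RANK and the process group is initialized)
import os
from torch.nn.parallel import DistributedDataParallel as DDP
local_rank = int(os.environ["LOCAL_RANK"])
model = DDP(model.to(local_rank), device_ids=[local_rank])
# Each GPU trains on a different data batch
# Gradients are averaged across GPUs during the backward pass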
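To see why averaging per-GPU gradients is equivalent to training on the full batch, the following toy sketch simulates data parallelism in plain Python (no GPUs or PyTorch involved). It assumes a one-parameter linear model with squared-error loss and equal-sized shards; `shard_gradient`, `all_reduce_mean`, and the sample data are made up for illustration.

```python
def shard_gradient(w: float, shard: list[tuple[float, float]]) -> float:
    """Mean gradient of (w*x - y)^2 with respect to w over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values: list[float]) -> float:
    """Stand-in for the all-reduce step: every worker ends up with the mean."""
    return sum(values) / len(values)

# Illustrative data and starting weight (assumptions, not from a real run).
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5
num_workers = 2
shards = [data[i::num_workers] for i in range(num_workers)]  # equal-sized shards

local_grads = [shard_gradient(w, s) for s in shards]  # one gradient per simulated GPU
synced_grad = all_reduce_mean(local_grads)            # what every GPU applies
full_batch_grad = shard_gradient(w, data)             # single-GPU reference
```

With equal-sized shards, `synced_grad` matches `full_batch_grad` up to floating-point rounding, which is why data parallelism changes speed but not the optimization step.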
2. Fully Sharded Data Parallel (FSDP)
- How it works: Shard model parameters, gradients, and optimizer states across GPUs
- Best for: Models too large for single GPU (13B-70B parameters)
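- Memory savings: up to ~8x lower per-GPU memory for parameters, gradients, and optimizer states on 8 GPUs vs. standard DP (activations are not sharded)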
- Example: Llama 3 70B training on 8x H100
import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
# Wrap every submodule above 100M parameters in its own FSDP unit
auto_wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=100_000_000)
model = FSDP(model, auto_wrap_policy=auto_wrap_policy)
# On 8 GPUs, each GPU holds 1/8 of the model parameters
# Needed parameters are gathered on demand during forward/backward
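A rough way to reason about the sharding benefit is to count bytes of model state per parameter. The sketch below assumes mixed-precision training with Adam, commonly estimated at 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for fp32 master weights and the two Adam moments), and it ignores activations and temporary buffers. The function name and the 13B example are illustrative.

```python
def model_state_gb_per_gpu(num_params: float, num_gpus: int, sharded: bool,
                           bytes_per_param: int = 16) -> float:
    """Approximate per-GPU memory (GB) for weights, gradients, and optimizer state."""
    total_gb = num_params * bytes_per_param / 1e9
    # Standard DP keeps a full copy on every GPU; full sharding splits it evenly.
    return total_gb / num_gpus if sharded else total_gb

# Assumption: a 13B-parameter model on 8 GPUs.
replicated = model_state_gb_per_gpu(13e9, 8, sharded=False)
sharded = model_state_gb_per_gpu(13e9, 8, sharded=True)
print(f"DP: {replicated:.0f} GB per GPU, FSDP: {sharded:.0f} GB per GPU")
```

Actual usage will be higher once activations and communication buffers are included, so treat the result as a lower bound.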
3. Pipeline Parallel
- How it works: Split the model's layers into stages on different GPUs and feed micro-batches through the stages as a pipeline
- Best for: Very deep models (100+ layers), where many micro-batches keep every stage busy
- Efficiency: 70-85% GPU utilization (vs. 95%+ for FSDP)
- Example: GPT-3 175B training
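# Note: torch.distributed.pipeline.sync is deprecated in recent PyTorch releases in favor of torch.distributed.pipelining
from torch.distributed.pipeline.sync import Pipe
# Pipe expects an nn.Sequential whose layers are already placed on their target GPUs
# Split model: layers 0-50 on GPU 0-3, layers 51-100 on GPU 4-7
model = Pipe(model, chunks=16)
# Each batch is split into 16 micro-batches whose forward/backward passes overlap across GPUs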
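The utilization gap comes from the pipeline "bubble": stages sit idle while the pipeline fills and drains. For a GPipe-style schedule with p stages and m micro-batches, the idle fraction is commonly modeled as (p - 1) / (m + p - 1). The sketch below applies that idealized formula, assuming equal-cost stages and no communication overhead; the function name is illustrative.

```python
def pipeline_utilization(num_stages: int, num_microbatches: int) -> float:
    """Ideal fraction of time each stage is busy under a GPipe-style schedule."""
    bubble = (num_stages - 1) / (num_microbatches + num_stages - 1)
    return 1.0 - bubble

# chunks=16 as in the example above; the stage counts are assumptions.
for stages in (2, 4, 8):
    print(stages, f"{pipeline_utilization(stages, 16):.0%}")
```

With 16 micro-batches this model gives roughly 84% for 4 stages and 70% for 8 stages, in line with the 70-85% range quoted above. Raising the number of micro-batches shrinks the bubble.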
4. 3D Parallelism (Data + Model + Pipeline)
- How it works: Combine all three parallelism types
- Best for: Extreme-scale training (1T+ parameters)
- Example: Training models larger than GPT-4
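# Illustrative 3D-parallel layout (the *_parallel_size entries are not literal DeepSpeed config keys; tensor and pipeline degrees are normally set through the training launcher)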
{
"train_batch_size": 2048,
"data_parallel_size": 8,
"model_parallel_size": 4,
"pipeline_parallel_size": 2
}
# 64 total GPUs = 8 (DP) × 4 (MP) × 2 (PP)
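To make the 8 × 4 × 2 layout concrete, the sketch below maps a global rank to its (data, pipeline, model) coordinates. The ordering used here (model-parallel ranks adjacent, then pipeline, then data) is one common convention and an assumption; real frameworks define their own rank ordering.

```python
def rank_to_coords(rank: int, dp: int, pp: int, mp: int) -> tuple[int, int, int]:
    """Map a global rank to (dp_rank, pp_rank, mp_rank) under the assumed ordering."""
    if not 0 <= rank < dp * pp * mp:
        raise ValueError("rank out of range")
    mp_rank = rank % mp               # fastest-varying: model-parallel neighbors
    pp_rank = (rank // mp) % pp       # then pipeline stage
    dp_rank = rank // (mp * pp)       # slowest-varying: data-parallel replica
    return dp_rank, pp_rank, mp_rank

DP, MP, PP = 8, 4, 2
world_size = DP * MP * PP
coords = [rank_to_coords(r, DP, PP, MP) for r in range(world_size)]
print(world_size, coords[0], coords[-1])
```

Keeping model-parallel ranks adjacent places the most communication-heavy group on GPUs within the same node, where NVLink/NVSwitch bandwidth is available.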
Real-World Multi-GPU Training Benchmarks
Performance and cost for common training scenarios:
Llama 3 8B LoRA Fine-Tuning (10K samples):
| Setup | Time | Cost | Samples/sec | Scaling Efficiency |
|-------|------|------|-------------|-------------------|
| 1x A100 | 6h | $7.20 | 0.46 | 100% (baseline) |
| 2x A100 (DP) | 3.2h | $7.66 | 0.87 | 94% |
| 4x A100 (DP) | 1.7h | $8.14 | 1.63 | 89% |
| 8x A100 (DP) | 1.0h | $9.54 | 2.78 | 75% |
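Takeaway: Returns diminish beyond 4 GPUs for small models: 8 GPUs cut wall-clock time to 1 hour but cost about 33% more than a single GPU ($9.54 vs. $7.20). If turnaround time is not critical, a single GPU with a larger batch size is the cheapest option.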
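Scaling efficiency in the table is the single-GPU time divided by (number of GPUs × multi-GPU time). A minimal sketch using the rounded times from the table (the function name is illustrative):

```python
def scaling_efficiency(t1_hours: float, n_gpus: int, tn_hours: float) -> float:
    """Speedup achieved divided by the ideal (linear) speedup."""
    return t1_hours / (n_gpus * tn_hours)

# Times from the table above: 6h on 1 GPU, then 3.2h, 1.7h, 1.0h on 2, 4, 8 GPUs.
runs = {2: 3.2, 4: 1.7, 8: 1.0}
efficiency = {n: scaling_efficiency(6.0, n, t) for n, t in runs.items()}
for n, e in efficiency.items():
    print(f"{n} GPUs: {e:.0%}")
```

Small differences from the table's percentages come from the rounded times used as input.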
Llama 3 70B Full Fine-Tuning (50K samples):
| Setup | Time | Cost | Samples/sec | Memory/GPU |
|-------|------|------|-------------|------------|
| 1x H100 | Cannot fit | N/A | N/A | >80GB |
| 4x H100 (FSDP) | 36h | $317 | 0.39 | 52GB |
| 8x H100 (FSDP) | 18h | $317 | 0.77 | 38GB |
| 8x H100 (DeepSpeed ZeRO-3) | 20h | $352 | 0.69 | 28GB |
Takeaway: 8 GPUs is the sweet spot for 70B models: FSDP on 8x H100 halves training time versus 4x H100 at the same total cost ($317), with lower memory per GPU.
Stable Diffusion XL Training (100K images, 50K steps):
| Setup | Time | Cost | Images/sec | Scaling Efficiency |
|-------|------|------|------------|-------------------|
| 1x RTX 4090 | 96h | $17.28 | 1.04 | 100% |
| 2x RTX 4090 (DP) | 50h | $18.00 | 2.00 | 96% |
| 4x RTX 4090 (DP) | 26h | $18.72 | 3.85 | 93% |
| 8x RTX 4090 (DP) | 14h | $20.16 | 7.14 | 86% |
Takeaway: Excellent scaling for image models. 4-8 GPUs significantly reduce training time.
Network Interconnect Performance
The interconnect between GPUs determines distributed training efficiency:
NVLink (2-way GPU connection):
- Bandwidth: 600 GB/s bidirectional
- Latency: <1 microsecond
- Topology: Point-to-point or ring
- Best for: 2-4 GPU training (small model parallel)
- Scaling: 85-95% efficiency for data parallel
- Cost premium: +$0.10/hr per GPU on io.net
NVSwitch (8-GPU fully connected):
- Bandwidth: 900 GB/s all-to-all
- Latency: <1 microsecond
- Topology: Full mesh (every GPU connected to every other GPU)
- Best for: 8-GPU single-node training (70B models)
- Scaling: 90-95% efficiency for FSDP
- Cost premium: +$0.25/hr per GPU on io.net
InfiniBand (multi-node clusters):
- Bandwidth: 200-400 Gbps per link (25-50 GB/s; note the unit is gigabits, not gigabytes)
- Latency: ~2 microseconds
- Topology: Fat-tree or dragonfly
- Best for: 16-128 GPU training (100B+ models)
- Scaling: 75-90% efficiency for 3D parallelism
- Cost premium: +$0.50/hr per GPU on io.net
Comparison Table:
| Workload | 8x GPU Bandwidth Needed | NVLink | NVSwitch | InfiniBand |
|---|---|---|---|---|
| Data Parallel (small model) | ~50 GB/s | ✅ Good | ✅ Excellent | ✅ Excellent |
| FSDP (70B model) | ~200 GB/s | ⚠️ Bottleneck | ✅ Excellent | ✅ Good |
| Pipeline Parallel | ~100 GB/s | ✅ Good | ✅ Excellent | ✅ Excellent |
| 3D Parallel (multi-node) | ~300 GB/s | ❌ Not suitable | ❌ Single-node only | ✅ Required |
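When comparing interconnects, it helps to convert everything to the same unit and estimate gradient synchronization time. The sketch below uses the standard ring all-reduce cost model, in which each GPU sends and receives 2(N-1)/N times the buffer size, and ignores latency and overlap with compute. It treats the quoted figures as usable per-GPU bandwidth, which is optimistic. The 16 GB gradient buffer (an 8B-parameter model in bf16) and the function names are illustrative assumptions.

```python
def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8.0

def ring_allreduce_seconds(buffer_gb: float, num_gpus: int, bandwidth_gb_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce over `buffer_gb` of gradients."""
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * buffer_gb
    return traffic_gb / bandwidth_gb_per_s

grad_gb = 16.0  # assumption: 8B parameters x 2 bytes (bf16)
links = [("NVLink", 600.0), ("NVSwitch", 900.0), ("InfiniBand 400 Gbps", gbps_to_gb_per_s(400))]
for name, bandwidth in links:
    print(f"{name}: {ring_allreduce_seconds(grad_gb, 8, bandwidth) * 1000:.0f} ms")
```

The order-of-magnitude gap between intra-node links and a single InfiniBand link is why communication-heavy parallelism is usually kept inside a node.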
How to Launch Multi-GPU Clusters on io.net
Step 1: Provision GPU Cluster
# 8-GPU NVSwitch cluster for Llama 3 70B
io launch --gpu H100 --count 8 --network nvswitch --disk 1TB --name llama-training
# 16-GPU multi-node InfiniBand cluster
io launch --gpu A100 --count 16 --network infiniband --disk 2TB --multinode
# Check cluster status
io list
# Output shows GPU IDs, network topology, health status
Step 2: Configure Distributed Environment
# SSH into head node
io ssh llama-training-head
# Verify all GPUs detected
nvidia-smi
# Should show 8 GPUs
# Check inter-GPU bandwidth
nvidia-smi topo -m
# Verifies NVSwitch connectivity
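The same check can be scripted so that a job fails fast when a GPU is missing. The sketch below counts devices in the output of `nvidia-smi -L`, which prints one line per GPU starting with `GPU <index>:`. The expected count of 8 and the helper names are assumptions for this cluster.

```python
import subprocess

def count_gpus(listing: str) -> int:
    """Count device lines in `nvidia-smi -L` output."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))

def assert_gpu_count(expected: int = 8) -> None:
    """Raise if the node does not expose the expected number of GPUs."""
    listing = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    ).stdout
    found = count_gpus(listing)
    if found != expected:
        raise RuntimeError(f"expected {expected} GPUs, found {found}")
```

Calling `assert_gpu_count()` at the top of a launch script stops the job before any billable training time is spent on a degraded node.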
Step 3: Launch Training Job
# PyTorch distributed training (FSDP)
torchrun \
--nnodes=1 \
--nproc_per_node=8 \
--rdzv_backend=c10d \
--rdzv_endpoint=localhost:29500 \
train_fsdp.py \
--model meta-llama/Llama-3-70B \
--dataset custom_dataset \
--batch_size 4 \
--epochs 3
# DeepSpeed training (ZeRO-3)
deepspeed --num_gpus=8 train.py --deepspeed ds_config.json
# Ray distributed training (with autoscaling)
python train_ray.py --num-workers 8
Step 4: Monitor Training
# Built-in monitoring dashboard
io dashboard llama-training
# Shows:
# - GPU utilization per GPU (target: >90%)
# - Inter-GPU communication bandwidth
# - Training throughput (samples/sec)
# - Cost tracker (real-time spend)
# - ETA to completion
Multi-Node Training for 100B+ Models
For extreme-scale training, io.net supports multi-node clusters:
Example: 64-GPU Cluster for 175B Model
# Provision 8 nodes × 8 GPUs each
io launch --gpu H100 --count 64 --nodes 8 --network infiniband
# Configure distributed training (Megatron-LM); run on every node
# Data-parallel size is derived, not passed: 64 GPUs / (TP 4 × PP 2) = 8
torchrun \
--nproc_per_node=8 \
--nnodes=8 \
--node_rank=$NODE_RANK \
--master_addr=$MASTER_ADDR \
--master_port=6000 \
pretrain_gpt.py \
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 2 \
--num-layers 96 \
--hidden-size 12288 \
--num-attention-heads 96 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--train-iters 500000 \
--distributed-backend nccl
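Before launching a run of this size, it is worth sanity-checking the arguments. The sketch below verifies that the parallel degrees divide the GPU count and estimates the parameter count with the common transformer approximation 12 × layers × hidden² (embeddings ignored). The helper name is illustrative and not part of Megatron-LM.

```python
def check_layout(world_size: int, tp: int, pp: int,
                 num_layers: int, hidden: int, heads: int) -> tuple[int, int]:
    """Validate a tensor/pipeline layout and return (data-parallel size, approx. parameters)."""
    assert world_size % (tp * pp) == 0, "TP x PP must divide the GPU count"
    assert num_layers % pp == 0, "layers must split evenly across pipeline stages"
    assert hidden % heads == 0 and heads % tp == 0, "heads must split evenly across TP ranks"
    dp = world_size // (tp * pp)
    approx_params = 12 * num_layers * hidden ** 2  # approximation, ignores embeddings
    return dp, approx_params

# Values from the launch command above.
dp, params = check_layout(64, tp=4, pp=2, num_layers=96, hidden=12288, heads=96)
print(f"data-parallel size {dp}, ~{params / 1e9:.0f}B parameters")
```

For these arguments the estimate is about 174B parameters, consistent with the 175B target, and the derived data-parallel size is 8.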
64x H100 Cluster Economics:
- Cost: $140.80/hr (vs. $446.72/hr for 8x AWS p5.48xlarge = 68% savings)
- Pre-training 175B model: ~2 weeks (336 hours) = $47,309 on io.net vs. $150,098 on AWS
- Savings: $102,789 per training run
Autoscaling Multi-GPU Clusters
io.net supports dynamic GPU scaling:
# Ray Train scaling configuration
# Note: ScalingConfig takes a worker count; it has no `scaling_policy` argument.
# The thresholds below are an illustrative policy for an external controller.
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

autoscaling_policy = {
    "min_workers": 2,
    "max_workers": 16,
    "scale_up_threshold": 0.9,    # Add GPUs if utilization >90%
    "scale_down_threshold": 0.5,  # Remove GPUs if utilization <50%
}
scaling_config = ScalingConfig(
    num_workers=4,  # Start with 4 GPUs
    use_gpu=True,
)
trainer = TorchTrainer(train_loop, scaling_config=scaling_config)
trainer.fit()
Autoscaling Benefits:
- Start small (2-4 GPUs) for experimentation
- Scale up (8-16 GPUs) when training production model
- Scale down during checkpointing/evaluation phases
- Cost savings: 30-40% vs. running 16 GPUs continuously
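The threshold logic behind such a policy can be sketched in a few lines. This is a hypothetical controller, not an io.net or Ray API: it doubles or halves the worker count based on average GPU utilization and clamps the result to the configured bounds.

```python
def next_worker_count(current: int, utilization: float,
                      min_workers: int = 2, max_workers: int = 16,
                      scale_up: float = 0.9, scale_down: float = 0.5) -> int:
    """Pick the next worker count from average GPU utilization (0.0-1.0)."""
    if utilization > scale_up:
        target = current * 2      # busy: double the workers
    elif utilization < scale_down:
        target = current // 2     # idle: halve the workers
    else:
        target = current          # within the band: no change
    return max(min_workers, min(max_workers, target))

print(next_worker_count(4, 0.95), next_worker_count(4, 0.30), next_worker_count(4, 0.70))
```

In practice a resize means checkpointing, restarting the job with the new world size, and resuming, so a controller like this would typically act only at checkpoint boundaries.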
Common Multi-GPU Training Issues and Solutions
Issue 1: Low GPU Utilization (<70%)
- Cause: Data loading bottleneck
- Solution: Increase DataLoader workers (num_workers=8), use faster storage (NVMe SSD), pre-cache dataset in RAM
Issue 2: Out of Memory (OOM) Errors
- Cause: Model + activations + optimizer states exceed GPU memory
- Solution: Enable gradient checkpointing, reduce batch size, use DeepSpeed ZeRO-3, or add more GPUs with FSDP
Issue 3: Poor Scaling Efficiency (<70%)
- Cause: Gradient synchronization overhead
- Solution: Increase local batch size, use gradient accumulation, enable gradient compression, upgrade to NVSwitch interconnect
Issue 4: Communication Timeouts
- Cause: Slow network between GPUs
- Solution: Verify NVLink/InfiniBand connectivity, increase NCCL timeout, check for network congestion
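For Issue 3, gradient accumulation trades more local compute per synchronization for fewer all-reduce steps (with PyTorch DDP this requires wrapping the non-final micro-steps in `no_sync()`). The sketch below computes how many accumulation steps are needed to reach a target global batch size; the function name and example numbers are illustrative.

```python
import math

def accumulation_steps(global_batch: int, per_device_batch: int, num_gpus: int) -> int:
    """Micro-steps per optimizer step needed to reach at least `global_batch` samples."""
    return math.ceil(global_batch / (per_device_batch * num_gpus))

# Assumption: target global batch of 256 with 4 samples per GPU on 8 GPUs.
steps = accumulation_steps(global_batch=256, per_device_batch=4, num_gpus=8)
effective_batch = steps * 4 * 8
print(steps, effective_batch)
```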
Related Questions
How many GPUs do I need to train Llama 3 70B?
Minimum 4x H100 80GB or 8x A100 80GB using FSDP. You can technically fit 70B on 2x H100 with DeepSpeed ZeRO-3 + CPU offloading, but training is 3-4x slower. For optimal performance and reasonable training time (3-5 days for full fine-tune), use 8x H100 ($17.60/hr). For LoRA fine-tuning, 1-2x H100 is sufficient ($2.20-4.40/hr).
Can I mix different GPU types in one cluster?
Not recommended. Mixed GPU clusters (e.g., 4x H100 + 4x A100) create imbalanced workloads where faster GPUs wait for slower ones, reducing overall efficiency to the slowest GPU's speed. io.net requires homogeneous clusters (all same GPU type) for distributed training. You can run separate jobs on different GPU types in parallel.
What's the maximum number of GPUs I can use?
io.net supports up to 128 GPUs in a single distributed training job (16 nodes × 8 GPUs). For most workloads, scaling beyond 64 GPUs has diminishing returns due to communication overhead. Pre-training foundation models (100B+ parameters) benefit from 64-128 GPUs. Fine-tuning rarely needs more than 8-16 GPUs.
Do I pay for inter-GPU network traffic?
No. All GPU-to-GPU communication (NVLink, NVSwitch, InfiniBand) is included in the GPU hourly rate. You only pay a small premium for high-bandwidth interconnects (+$0.10-0.50/hr per GPU). There are no data transfer fees between GPUs in the same cluster. Egress to the internet is free for the first 1TB/month.
How does multi-GPU training on io.net compare to AWS/Azure?
io.net is 65-70% cheaper with comparable performance. An 8x H100 cluster costs $17.60/hr on io.net vs. $55.84/hr on AWS p5.48xlarge. Both use NVIDIA NVSwitch with 900 GB/s interconnect. io.net's decentralized model provides the same high-bandwidth networking (sourced from Tier 1 data centers) at decentralized pricing.
Start Multi-GPU Training on io.net
Train large models 68% cheaper than AWS:
- 8x H100 cluster at $17.60/hr - Perfect for Llama 3 70B
- NVSwitch/InfiniBand networking - 900 GB/s bandwidth
- Pre-configured FSDP, DeepSpeed, Ray - Start training in minutes
- Autoscaling - Scale from 2 to 100+ GPUs dynamically
Launch multi-GPU cluster → or view cluster pricing →
Last updated: April 2026 | Multi-GPU benchmarks based on standard training configurations with optimized NCCL settings
