FAQ: Does io.net Support Multi-GPU Training?

Quick Answer

Yes, io.net fully supports multi-GPU training with configurations from 2 to 100+ GPUs across single-node and multi-node setups. You can provision GPU clusters with NVLink (600 GB/s), NVSwitch (900 GB/s), or InfiniBand (200-400 Gbps) interconnects for distributed training frameworks including PyTorch FSDP, DeepSpeed, Ray, and Horovod. An 8x H100 cluster costs $17.60/hr on io.net versus $55.84/hr on AWS (68% savings), with pre-configured environments that support data parallel, model parallel, and pipeline parallel training patterns for LLMs up to 100B+ parameters.

Multi-GPU Training Configurations

io.net offers flexible cluster configurations for every training scale:

| Configuration | GPUs | Interconnect | Bandwidth | Use Case | Cost/hr (H100) | Cost/hr (A100) |
|---------------|------|--------------|-----------|----------|----------------|----------------|
| Single-node 2-GPU | 2 | NVLink | 600 GB/s | Small model parallel | $4.40 | $2.98 |
| Single-node 4-GPU | 4 | NVLink | 600 GB/s | Medium model training | $8.80 | $5.96 |
| Single-node 8-GPU | 8 | NVSwitch | 900 GB/s | Large model training (70B) | $17.60 | $11.92 |
| Multi-node 16-GPU | 16 | InfiniBand | 200 Gbps | Very large models (100B+) | $35.20 | $23.84 |
| Multi-node 32-GPU | 32 | InfiniBand | 400 Gbps | Pre-training from scratch | $70.40 | $47.68 |
| Multi-node 64-GPU | 64 | InfiniBand | 400 Gbps | Foundation model training | $140.80 | $95.36 |

Compare to AWS (8x H100 cluster):
- io.net: $17.60/hr
- AWS p5.48xlarge: $55.84/hr
- Savings: $38.24/hr (68% cheaper) = $27,533/month for continuous training
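
The savings figures above follow from simple arithmetic on the two hourly rates. A minimal sketch, assuming a 30-day month of continuous use (the function name `savings` is illustrative, not part of any io.net tooling):

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month of continuous training


def savings(io_rate: float, aws_rate: float, hours: float = HOURS_PER_MONTH):
    """Return (hourly saving, percent saving, total saving over `hours`)."""
    hourly = aws_rate - io_rate
    percent = 100.0 * hourly / aws_rate
    return hourly, percent, hourly * hours


# Rates quoted above for an 8x H100 cluster
hourly, percent, monthly = savings(io_rate=17.60, aws_rate=55.84)
print(f"${hourly:.2f}/hr saved ({percent:.0f}%), ${monthly:,.0f} per 30-day month")
```

Swapping in other rates from the configuration table gives the equivalent comparison for other cluster sizes.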

Supported Distributed Training Frameworks

io.net provides pre-configured support for all major distributed training frameworks:

PyTorch Distributed:

# Launch 8-GPU cluster
io launch --gpu H100 --count 8 --network nvswitch

# Fully Sharded Data Parallel (FSDP) for 70B models
torchrun --nproc_per_node=8 train.py \
  --model meta-llama/Llama-3-70B \
  --strategy fsdp \
  --precision bf16-mixed

# Output: 8x H100 training at 3.2 samples/sec (batch size 32)
# Cost: $17.60/hr
# 100K samples in ~8.7 hours = $153 total
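
The time and cost in the comments above come from dividing the sample count by throughput and multiplying by the hourly rate. A small sketch of that estimate, where `run_estimate` is a hypothetical helper and the throughput is the figure quoted above rather than a measured value:

```python
def run_estimate(num_samples: int, samples_per_sec: float, rate_per_hour: float):
    """Return (hours, cost) for a run at a steady throughput."""
    hours = num_samples / samples_per_sec / 3600
    return hours, hours * rate_per_hour


# 100K samples at 3.2 samples/sec on an 8x H100 cluster at $17.60/hr
hours, cost = run_estimate(100_000, 3.2, 17.60)
print(f"{hours:.1f} h, ${cost:.0f}")
```

Real runs also spend time on checkpointing, evaluation, and warm-up, so treat the result as a lower bound.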

DeepSpeed:

# ZeRO-3 for memory-efficient training of 100B+ models
deepspeed --num_gpus=8 train.py \
  --deepspeed ds_config_zero3.json \
  --model meta-llama/Llama-3-70B \
  --per_device_train_batch_size 2 \
  --gradient_checkpointing \
  --zero_stage 3

# Enables training 70B model with <40GB memory per GPU
# 175B-class models on 8x A100 80GB additionally need CPU/NVMe offloading (ZeRO-Infinity) or more GPUs

Ray Train:

# Distributed training with fault tolerance and autoscaling
from ray import train
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop,
    scaling_config=train.ScalingConfig(
        num_workers=8,
        use_gpu=True,
        resources_per_worker={"GPU": 1}
    )
)
trainer.fit()

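# Ray can restart failed workers and resume from the latest reported checkpoint
# when a failure policy (FailureConfig) and checkpoint reporting are configured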

Horovod:

# MPI-based distributed training
horovodrun -np 8 -H localhost:8 python train.py \
  --batch-size 256 \
  --learning-rate 0.001

# Efficient gradient all-reduce for image models

Distributed Training Patterns Explained

Choose the right parallelism strategy based on your model size:

1. Data Parallel (DP)
- How it works: Replicate the entire model on each GPU and split each data batch across them
- Best for: Models that fit on a single GPU (roughly 8B parameters or fewer, depending on GPU memory and precision)
- Scaling efficiency: ~90% linear scaling up to 8 GPUs
- Example: Llama 3 8B LoRA fine-tuning on 8x RTX 4090

# PyTorch DistributedDataParallel
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
# Each GPU trains on different data batch
# Gradients averaged across GPUs after backward pass
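
To see why averaging per-GPU gradients is equivalent to training on the full batch, here is a framework-free sketch. It simulates four "GPUs" in plain Python with a one-parameter model; the strided sharding is an assumption chosen to mirror how a non-shuffled distributed sampler typically splits data:

```python
def shard_indices(num_samples: int, world_size: int, rank: int) -> list:
    """Strided split: rank r gets samples r, r + world_size, ..."""
    return list(range(rank, num_samples, world_size))


def grad_mse(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # toy data: y = 2x
w = 0.5                     # current weight, identical on every replica
world_size = 4

# Each simulated GPU computes a gradient on its own shard
local_grads = []
for rank in range(world_size):
    idx = shard_indices(len(xs), world_size, rank)
    local_grads.append(grad_mse(w, [xs[i] for i in idx], [ys[i] for i in idx]))

averaged = sum(local_grads) / world_size  # what a mean all-reduce produces
full_batch = grad_mse(w, xs, ys)          # single-GPU gradient on all data
print(averaged, full_batch)
```

The two values agree because every shard has the same size; with uneven shards the average would need to be weighted.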

2. Fully Sharded Data Parallel (FSDP)
- How it works: Shard model parameters, gradients, and optimizer states across GPUs
- Best for: Models too large for single GPU (13B-70B parameters)
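- Memory savings: up to ~8x less parameter, gradient, and optimizer-state memory per GPU on 8 GPUs vs. standard DP (activation memory is not sharded)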
- Example: Llama 3 70B training on 8x H100

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

auto_wrap_policy = size_based_auto_wrap_policy
model = FSDP(model, auto_wrap_policy=auto_wrap_policy)
# Each GPU holds 1/8 of model parameters
# Dynamically gather needed parameters during forward/backward
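
A rough way to reason about the memory effect of sharding is to count bytes of model state per parameter. The sketch below assumes mixed-precision Adam at 16 bytes per parameter (2 for weights, 2 for gradients, 12 for optimizer state) and ignores activations, buffers, and fragmentation, so it is an estimate rather than a sizing guarantee. The helper `model_state_gb` and the 7.5B example size are illustrative assumptions:

```python
def model_state_gb(params: float, num_gpus: int, stage: int) -> float:
    """Estimated per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    stage 0: plain data parallel (everything replicated)
    stage 1: optimizer state sharded
    stage 2: optimizer state + gradients sharded
    stage 3: optimizer state + gradients + weights sharded (FSDP full shard)
    """
    weights = 2.0 * params   # bf16/fp16 weights
    grads = 2.0 * params     # bf16/fp16 gradients
    optim = 12.0 * params    # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        weights /= num_gpus
    return (weights + grads + optim) / 1e9


for stage in range(4):
    print(stage, round(model_state_gb(7.5e9, num_gpus=8, stage=stage), 1))
```

With full sharding the per-GPU model state shrinks by the number of GPUs, which is where the 8x figure for an 8-GPU node comes from.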

3. Pipeline Parallel
- How it works: Split model layers across GPUs, process micro-batches in pipeline
- Best for: Very deep models (100+ layers), maximize GPU utilization
- Efficiency: 70-85% GPU utilization (vs. 95%+ for FSDP)
- Example: GPT-3 175B training

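# Note: torch.distributed.pipeline.sync is deprecated in favor of torch.distributed.pipelining
# and may be unavailable in newer PyTorch releases; this snippet targets older versions.
from torch.distributed.pipeline.sync import Pipe
# model must be an nn.Sequential whose stages are already placed on their target GPUs
# (for example, layers 0-49 on the first stage and layers 50-99 on the second)
model = Pipe(model, chunks=16)  # split each mini-batch into 16 micro-batches
# Micro-batches overlap forward/backward across GPUs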
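
The utilization gap for pipeline parallelism comes from the "bubble" while the pipeline fills and drains. For a simple GPipe-style schedule the idle fraction is (stages - 1) / (micro_batches + stages - 1). The sketch below applies that formula; the stage counts are illustrative assumptions, not io.net measurements:

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of a GPipe-style schedule with equal-cost stages."""
    return (stages - 1) / (micro_batches + stages - 1)


for stages in (2, 4, 8):
    idle = bubble_fraction(stages, micro_batches=16)
    print(f"{stages} stages, 16 micro-batches: {100 * (1 - idle):.1f}% busy")
```

Increasing the number of micro-batches (the `chunks` argument above) shrinks the bubble, at the cost of smaller per-step matrix multiplications.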

4. 3D Parallelism (Data + Model + Pipeline)
- How it works: Combine all three parallelism types
- Best for: Extreme-scale training (1T+ parameters)
- Example: Trillion-parameter-scale pre-training across many nodes

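# Illustrative 3D parallelism layout (not literal DeepSpeed JSON keys; in practice the
# tensor- and pipeline-parallel sizes are usually set through launcher or framework arguments)
{
  "train_batch_size": 2048,
  "data_parallel_size": 8,
  "model_parallel_size": 4,
  "pipeline_parallel_size": 2
}
# 64 total GPUs = 8 (DP) × 4 (MP) × 2 (PP)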
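
To make the 8 × 4 × 2 decomposition concrete, the sketch below maps each global rank to its position along the three axes. The ordering (tensor/model parallel varying fastest so it stays inside a node, then data, then pipeline) is one common convention and is an assumption here; frameworks may order the axes differently:

```python
def rank_to_coords(rank: int, dp: int, mp: int, pp: int):
    """Map a global rank to (dp_rank, mp_rank, pp_rank)."""
    world_size = dp * mp * pp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world size {world_size}")
    mp_rank = rank % mp            # fastest-varying axis: neighbours share a node
    dp_rank = (rank // mp) % dp
    pp_rank = rank // (mp * dp)    # slowest-varying axis
    return dp_rank, mp_rank, pp_rank


coords = [rank_to_coords(r, dp=8, mp=4, pp=2) for r in range(64)]
print(coords[0], coords[5], coords[63])
```

Every one of the 64 ranks gets a unique coordinate, which is a quick sanity check that the three sizes multiply to the cluster size.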

Real-World Multi-GPU Training Benchmarks

Performance and cost for common training scenarios:

Llama 3 8B LoRA Fine-Tuning (10K samples):
| Setup | Time | Cost | Samples/sec | Scaling Efficiency |
|-------|------|------|-------------|-------------------|
| 1x A100 | 6h | $7.20 | 0.46 | 100% (baseline) |
| 2x A100 (DP) | 3.2h | $7.66 | 0.87 | 94% |
| 4x A100 (DP) | 1.7h | $8.14 | 1.63 | 89% |
| 8x A100 (DP) | 1.0h | $9.54 | 2.78 | 75% |

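Takeaway: For small models, scaling efficiency drops from 89% at 4 GPUs to 75% at 8 GPUs, and total cost rises with every GPU added. If cost matters more than wall-clock time, use a single GPU with a larger batch size.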
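
The scaling-efficiency column can be reproduced from the run times: efficiency is the single-GPU time divided by (GPU count × multi-GPU time). A minimal sketch using the times from the LoRA table above; small differences from the table come from rounding in the reported hours:

```python
def scaling_efficiency(t1_hours: float, tn_hours: float, n_gpus: int) -> float:
    """Fraction of ideal linear speedup achieved on n_gpus."""
    return t1_hours / (n_gpus * tn_hours)


baseline = 6.0  # hours on 1x A100
for n_gpus, hours in [(2, 3.2), (4, 1.7), (8, 1.0)]:
    eff = scaling_efficiency(baseline, hours, n_gpus)
    print(f"{n_gpus}x A100: {100 * eff:.1f}% efficiency")
```

The same function works for the other benchmark tables by substituting their baseline and multi-GPU times.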

Llama 3 70B Full Fine-Tuning (50K samples):
| Setup | Time | Cost | Samples/sec | Memory/GPU |
|-------|------|------|-------------|------------|
| 1x H100 | Cannot fit | N/A | N/A | >80GB |
| 4x H100 (FSDP) | 36h | $317 | 0.39 | 52GB |
| 8x H100 (FSDP) | 18h | $317 | 0.77 | 38GB |
| 8x H100 (DeepSpeed ZeRO-3) | 20h | $352 | 0.69 | 28GB |

Takeaway: 8 GPUs is the sweet spot for 70B models: FSDP on 8x H100 costs the same as on 4x H100 ($317) but finishes in half the time and uses less memory per GPU.

Stable Diffusion XL Training (100K images, 50K steps):
| Setup | Time | Cost | Images/sec | Scaling Efficiency |
|-------|------|------|------------|-------------------|
| 1x RTX 4090 | 96h | $17.28 | 1.04 | 100% |
| 2x RTX 4090 (DP) | 50h | $18.00 | 2.00 | 96% |
| 4x RTX 4090 (DP) | 26h | $18.72 | 3.85 | 93% |
| 8x RTX 4090 (DP) | 14h | $20.16 | 7.14 | 86% |

Takeaway: Excellent scaling for image models. 4-8 GPUs significantly reduce training time.

Network Interconnect Performance

The interconnect between GPUs determines distributed training efficiency:

NVLink (2-way GPU connection):
- Bandwidth: 600 GB/s bidirectional
- Latency: <1 microsecond
- Topology: Point-to-point or ring
- Best for: 2-4 GPU training (small model parallel)
- Scaling: 85-95% efficiency for data parallel
- Cost premium: +$0.10/hr per GPU on io.net

NVSwitch (8-GPU fully connected):
- Bandwidth: 900 GB/s all-to-all
- Latency: <1 microsecond
- Topology: Full mesh (every GPU connected to every other GPU)
- Best for: 8-GPU single-node training (70B models)
- Scaling: 90-95% efficiency for FSDP
- Cost premium: +$0.25/hr per GPU on io.net

InfiniBand (multi-node clusters):
- Bandwidth: 200-400 Gbps per link (25-50 GB/s; note gigabits, not gigabytes)
- Latency: ~2 microseconds
- Topology: Fat-tree or dragonfly
- Best for: 16-128 GPU training (100B+ models)
- Scaling: 75-90% efficiency for 3D parallelism
- Cost premium: +$0.50/hr per GPU on io.net

Comparison Table:

| Workload | 8x GPU Bandwidth Needed | NVLink | NVSwitch | InfiniBand |
|----------|-------------------------|--------|----------|------------|
| Data Parallel (small model) | ~50 GB/s | ✅ Good | ✅ Excellent | ✅ Excellent |
| FSDP (70B model) | ~200 GB/s | ⚠️ Bottleneck | ✅ Excellent | ✅ Good |
| Pipeline Parallel | ~100 GB/s | ✅ Good | ✅ Excellent | ✅ Excellent |
| 3D Parallel (multi-node) | ~300 GB/s | ❌ Not suitable | ❌ Single-node only | ✅ Required |
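
To relate these bandwidth figures to training time, a common back-of-the-envelope model is the ring all-reduce, which moves about 2 × (N - 1) / N times the gradient payload through each GPU's link. The sketch below is a bandwidth-only estimate under stated assumptions: it ignores latency and compute overlap, treats the quoted figures as fully usable per-GPU bandwidth, and uses a hypothetical 16 GB payload (bf16 gradients for an 8B-parameter model):

```python
def gbps_to_gb_per_sec(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


def ring_allreduce_seconds(payload_gb: float, n_gpus: int, bandwidth_gb_s: float) -> float:
    """Bandwidth-only time for one ring all-reduce of `payload_gb`."""
    return 2 * (n_gpus - 1) / n_gpus * payload_gb / bandwidth_gb_s


payload_gb = 16.0  # assumption: 8e9 parameters x 2 bytes of bf16 gradients
links = {
    "NVLink 600 GB/s": 600.0,
    "NVSwitch 900 GB/s": 900.0,
    "InfiniBand 400 Gbps": gbps_to_gb_per_sec(400),
}
for name, bandwidth in links.items():
    seconds = ring_allreduce_seconds(payload_gb, 8, bandwidth)
    print(f"{name}: {seconds:.3f} s per gradient sync")
```

The order-of-magnitude gap between intra-node and inter-node links is why multi-node jobs usually keep the most communication-heavy parallelism inside a node.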

How to Launch Multi-GPU Clusters on io.net

Step 1: Provision GPU Cluster

# 8-GPU NVSwitch cluster for Llama 3 70B
io launch --gpu H100 --count 8 --network nvswitch --disk 1TB --name llama-training

# 16-GPU multi-node InfiniBand cluster
io launch --gpu A100 --count 16 --network infiniband --disk 2TB --multinode

# Check cluster status
io list
# Output shows GPU IDs, network topology, health status

Step 2: Configure Distributed Environment

# SSH into head node
io ssh llama-training-head

# Verify all GPUs detected
nvidia-smi
# Should show 8 GPUs

# Check inter-GPU topology (this prints the connection matrix, not measured bandwidth)
nvidia-smi topo -m
# NV# entries between GPU pairs indicate NVLink/NVSwitch connectivity
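
The GPU-count check can also be scripted so a job fails fast when a device is missing. A minimal sketch that parses the output of `nvidia-smi -L`, which prints one `GPU <index>: ...` line per device; the helper names and the expected count of 8 are assumptions for this example:

```python
import shutil
import subprocess


def count_gpus(listing: str) -> int:
    """Count devices in `nvidia-smi -L` output (one 'GPU N: ...' line each)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def detected_gpus() -> int:
    """Return the number of GPUs visible on this machine, or 0 without nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return count_gpus(result.stdout)


if __name__ == "__main__":
    expected = 8  # assumption: single-node 8-GPU cluster
    found = detected_gpus()
    print(f"expected {expected} GPUs, found {found}")
```

Running this on the head node before launching a job catches a missing or unhealthy GPU earlier than a failed rendezvous would.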

Step 3: Launch Training Job

# PyTorch distributed training (FSDP)
torchrun \
  --nnodes=1 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=localhost:29500 \
  train_fsdp.py \
    --model meta-llama/Llama-3-70B \
    --dataset custom_dataset \
    --batch_size 4 \
    --epochs 3

# DeepSpeed training (ZeRO-3)
deepspeed --num_gpus=8 train.py --deepspeed ds_config.json

# Ray distributed training (with autoscaling)
python train_ray.py --num-workers 8
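
Inside the training script, each process started by `torchrun` learns its identity from the environment variables `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`. A small sketch of reading them, with defaults so the same script also runs as a single process; the `DistEnv` dataclass and `read_dist_env` helper are illustrative names, not part of PyTorch:

```python
import os
from dataclasses import dataclass


@dataclass
class DistEnv:
    rank: int        # global process index across all nodes
    local_rank: int  # process index on this node, used to pick the GPU
    world_size: int  # total number of processes


def read_dist_env(env=os.environ) -> DistEnv:
    """Read the variables torchrun sets, falling back to single-process values."""
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


dist_env = read_dist_env()
print(f"rank {dist_env.rank} of {dist_env.world_size}, local GPU {dist_env.local_rank}")
```

A typical script would then select its device from `local_rank` before initializing the process group.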

Step 4: Monitor Training

# Built-in monitoring dashboard
io dashboard llama-training

# Shows:
# - GPU utilization per GPU (target: >90%)
# - Inter-GPU communication bandwidth
# - Training throughput (samples/sec)
# - Cost tracker (real-time spend)
# - ETA to completion

Multi-Node Training for 100B+ Models

For extreme-scale training, io.net supports multi-node clusters:

Example: 64-GPU Cluster for 175B Model

# Provision 8 nodes × 8 GPUs each
io launch --gpu H100 --count 64 --nodes 8 --network infiniband

# Configure distributed training (Megatron-LM); run once per node
# Data-parallel size is derived from the world size: 64 GPUs / (TP 4 × PP 2) = 8
torchrun \
  --nproc_per_node=8 \
  --nnodes=8 \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=6000 \
  pretrain_gpt.py \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 2 \
    --num-layers 96 \
    --hidden-size 12288 \
    --num-attention-heads 96 \
    --seq-length 2048 \
    --max-position-embeddings 2048 \
    --train-iters 500000 \
    --distributed-backend nccl

64x H100 Cluster Economics:
- Cost: $140.80/hr (vs. $446.72/hr for 8x AWS p5.48xlarge = 68% savings)
- Pre-training 175B model: ~2 weeks (336 hours) = $47,309 on io.net vs. $150,098 on AWS
- Savings: $102,789 per training run

Autoscaling Multi-GPU Clusters

io.net supports dynamic GPU scaling:

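# Ray Train configuration with an illustrative autoscaling policy
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Hypothetical policy thresholds. ScalingConfig itself has no scaling_policy argument,
# so these values must be enforced by cluster-level autoscaling or an external controller.
scaling_policy = {
    "min_workers": 2,
    "max_workers": 16,
    "scale_up_threshold": 0.9,    # Add GPUs if utilization >90%
    "scale_down_threshold": 0.5,  # Remove GPUs if utilization <50%
}

def train_loop(config=None):
    pass  # placeholder: per-worker training code goes here

scaling_config = ScalingConfig(num_workers=4, use_gpu=True)  # Start with 4 GPUs

trainer = TorchTrainer(train_loop, scaling_config=scaling_config)
trainer.fit()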

Autoscaling Benefits:
- Start small (2-4 GPUs) for experimentation
- Scale up (8-16 GPUs) when training production model
- Scale down during checkpointing/evaluation phases
- Cost savings: 30-40% vs. running 16 GPUs continuously
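
The threshold behaviour described above can be expressed as a small decision function. This is a sketch of one possible policy, not io.net's or Ray's implementation: the doubling/halving step and the function name `target_workers` are assumptions, while the bounds and thresholds mirror the values used in this section:

```python
def target_workers(
    current: int,
    utilization: float,
    min_workers: int = 2,
    max_workers: int = 16,
    scale_up_threshold: float = 0.9,
    scale_down_threshold: float = 0.5,
) -> int:
    """Return the worker count to move to, given average GPU utilization (0-1)."""
    if utilization > scale_up_threshold:
        proposed = current * 2   # assumption: grow by doubling
    elif utilization < scale_down_threshold:
        proposed = current // 2  # assumption: shrink by halving
    else:
        proposed = current       # inside the dead band: leave unchanged
    return max(min_workers, min(max_workers, proposed))


print(target_workers(4, 0.95), target_workers(8, 0.70), target_workers(4, 0.30))
```

The gap between the two thresholds acts as a dead band, which keeps the cluster from oscillating when utilization hovers near a single cutoff.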

Common Multi-GPU Training Issues and Solutions

Issue 1: Low GPU Utilization (<70%)
- Cause: Data loading bottleneck
- Solution: Increase DataLoader workers (num_workers=8), use faster storage (NVMe SSD), pre-cache dataset in RAM

Issue 2: Out of Memory (OOM) Errors
- Cause: Model + activations + optimizer states exceed GPU memory
- Solution: Enable gradient checkpointing, reduce batch size, use DeepSpeed ZeRO-3, or add more GPUs with FSDP

Issue 3: Poor Scaling Efficiency (<70%)
- Cause: Gradient synchronization overhead
- Solution: Increase local batch size, use gradient accumulation, enable gradient compression, upgrade to NVSwitch interconnect

Issue 4: Communication Timeouts
- Cause: Slow network between GPUs
- Solution: Verify NVLink/InfiniBand connectivity, increase NCCL timeout, check for network congestion
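
Several of the fixes above (reducing batch size for OOM, gradient accumulation for poor scaling) trade per-device batch size against accumulation steps while keeping the global batch fixed. A minimal sketch of that bookkeeping; `accumulation_steps` is a hypothetical helper and the numbers are examples only:

```python
def accumulation_steps(global_batch: int, per_device_batch: int, num_gpus: int) -> int:
    """Gradient-accumulation steps needed to reach `global_batch` per optimizer step."""
    per_step = per_device_batch * num_gpus
    if global_batch % per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not a multiple of {per_step} "
            "(per-device batch x GPUs)"
        )
    return global_batch // per_step


# Halving the per-device batch to avoid OOM doubles the accumulation steps
print(accumulation_steps(256, per_device_batch=4, num_gpus=8))
print(accumulation_steps(256, per_device_batch=2, num_gpus=8))
```

Because gradients are synchronized once per optimizer step rather than once per micro-batch (when the framework skips intermediate syncs), more accumulation also reduces communication relative to compute.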

How many GPUs do I need to train Llama 3 70B?

Minimum 4x H100 80GB or 8x A100 80GB using FSDP. You can technically fit 70B on 2x H100 with DeepSpeed ZeRO-3 + CPU offloading, but training is 3-4x slower. For optimal performance and reasonable training time (3-5 days for full fine-tune), use 8x H100 ($17.60/hr). For LoRA fine-tuning, 1-2x H100 is sufficient ($2.20-4.40/hr).

Can I mix different GPU types in one cluster?

Not recommended. Mixed GPU clusters (e.g., 4x H100 + 4x A100) create imbalanced workloads where faster GPUs wait for slower ones, reducing overall efficiency to the slowest GPU's speed. io.net requires homogeneous clusters (all same GPU type) for distributed training. You can run separate jobs on different GPU types in parallel.

What's the maximum number of GPUs I can use?

io.net supports up to 128 GPUs in a single distributed training job (16 nodes × 8 GPUs). For most workloads, scaling beyond 64 GPUs has diminishing returns due to communication overhead. Pre-training foundation models (100B+ parameters) benefit from 64-128 GPUs. Fine-tuning rarely needs more than 8-16 GPUs.

Do I pay for inter-GPU network traffic?

No. All GPU-to-GPU communication (NVLink, NVSwitch, InfiniBand) is included in the GPU hourly rate. You only pay a small premium for high-bandwidth interconnects (+$0.10-0.50/hr per GPU). There are no data transfer fees between GPUs in the same cluster. Egress to the internet is free for the first 1TB/month.

How does multi-GPU training on io.net compare to AWS/Azure?

io.net is 65-70% cheaper with comparable performance. An 8x H100 cluster costs $17.60/hr on io.net vs. $55.84/hr on AWS p5.48xlarge. Both use NVIDIA NVSwitch with 900 GB/s interconnect. io.net's decentralized model provides the same high-bandwidth networking (sourced from Tier 1 data centers) at decentralized pricing.

Start Multi-GPU Training on io.net

Train large models 68% cheaper than AWS:
- 8x H100 cluster at $17.60/hr - Perfect for Llama 3 70B
- NVSwitch (900 GB/s) and InfiniBand (200-400 Gbps) networking
- Pre-configured FSDP, DeepSpeed, Ray - Start training in minutes
- Autoscaling - Scale from 2 to 100+ GPUs dynamically


Last updated: April 2026 | Multi-GPU benchmarks based on standard training configurations with optimized NCCL settings