Deploy H100 GPU clusters in 5 minutes. No waitlists. No datacenter complexity. No procurement headaches.
Traditional cloud providers make GPU access painful: AWS p5 instances have 6-12 week waitlists, GCP requires sales calls for quota increases, and Azure demands enterprise agreements. On-premises deployment costs $500K+ and takes months.
io.net eliminates these barriers. Our decentralized GPU network aggregates 200,000+ GPUs from independent datacenters worldwide, offering instant access to H100, A100, and RTX 4090 clusters at 70% lower cost than hyperscalers.
This guide takes you from account creation to running production distributed training workloads in under 30 minutes. You'll deploy your first single GPU instance, scale to an 8-GPU NVLink cluster, and learn best practices for production deployments.
Prerequisites and Account Setup
What You'll Need
Before starting, ensure you have:
- Terminal access: Linux, macOS, or Windows WSL
- SSH client: OpenSSH or PuTTY
- Basic Docker knowledge: Understanding images, containers, and registries
- Payment method: Credit card or cryptocurrency (USDC, SOL supported)
- Programming environment: Python 3.8+ recommended for testing
Optional but helpful:
- Git (for cloning example repositories)
- NVIDIA GPU drivers on local machine (for testing scripts locally first)
Create io.net Account
Step 1: Navigate to https://cloud.io.net/signup
Step 2: Register with email or connect wallet
- Email signup: Verify via confirmation link
- Wallet signup: Connect MetaMask, Phantom, or compatible Web3 wallet
Step 3: Complete KYC (for credit card payments)
- Upload government ID
- Verification typically completes in 10-30 minutes
Step 4: Add payment method
- Credit card: Instant activation
- Crypto: Deposit USDC or SOL to your account wallet (minimum $50 recommended)
Step 5: Get your API key
- Navigate to Account → API Keys
- Click "Generate New Key"
- Copy and save securely (displayed only once)
- Set permissions: Read, Write, Deploy (full access for first key)
Install io.net CLI
The CLI is the fastest way to deploy and manage GPU clusters. Installation takes under 1 minute.
On Linux/macOS:
curl -fsSL https://downloads.io.net/cli/install.sh | bash
On Windows (WSL):
curl -fsSL https://downloads.io.net/cli/install.sh | bash
Verify installation:
io-cli version
# Output: io-cli v2.4.1
Alternative: Install via package managers
Homebrew (macOS):
brew install io-net/tap/io-cli
APT (Debian/Ubuntu):
curl -fsSL https://downloads.io.net/keys/apt.gpg | sudo gpg --dearmor -o /usr/share/keyrings/ionet.gpg
echo "deb [signed-by=/usr/share/keyrings/ionet.gpg] https://downloads.io.net/apt stable main" | sudo tee /etc/apt/sources.list.d/ionet.list
sudo apt update && sudo apt install io-cli
Configure Authentication
Authenticate the CLI with your API key:
io-cli auth login --api-key YOUR_API_KEY_HERE
Success output:
✓ Authentication successful
✓ Logged in as: [email protected]
✓ Credits available: $100.00
Set default region (optional but recommended):
io-cli config set-region us-east
Available regions:
- us-east (North Virginia - fastest for US East Coast)
- us-west (Oregon - fastest for US West Coast)
- eu-west (Ireland - fastest for Europe)
- asia-pacific (Singapore - fastest for Asia)
Verify configuration:
io-cli config show
You're now ready to deploy GPUs.
Deploying Your First GPU Instance
Let's deploy a single H100 GPU running PyTorch. This entire process takes under 5 minutes.
List Available GPUs
Check real-time GPU availability:
io-cli gpu list --available
Output:
GPU TYPE COUNT REGIONS PRICE (USD/hr)
h100-sxm 487 us-east, eu-west $2.49
h100-pcie 823 us-east, us-west $1.99
a100-80gb 1247 all regions $1.39
a100-40gb 892 us-east, asia $0.99
rtx-4090 3421 all regions $0.49
The COUNT column shows currently available GPUs across all regions. io.net's decentralized model means capacity is rarely constrained (unlike AWS where p5 instances are perpetually sold out).
Deploy Single H100 GPU
Deploy an H100 PCIe GPU with PyTorch pre-installed:
io-cli deploy create \
--gpu-type h100-pcie \
--gpu-count 1 \
--image pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime \
--name my-first-gpu \
--region us-east
Flags explained:
- --gpu-type: Hardware type (h100-pcie, h100-sxm, a100-80gb, etc.)
- --gpu-count: Number of GPUs (1 for this example)
- --image: Docker image to run (PyTorch official image from Docker Hub)
- --name: Human-readable identifier for this deployment
- --region: Geographic region (affects latency and pricing)
Deployment output:
⠿ Creating deployment my-first-gpu
⠿ Allocating 1x H100 PCIe GPU in us-east
✓ GPU allocated: gpu-8x7k2m
⠿ Pulling image pytorch/pytorch:2.2.0-cuda12.1
✓ Container started
✓ SSH server ready
Deployment ID: dep-9f83jd
SSH Access: ssh [email protected]
Cost: $1.99/hour
Deployment typically completes in 2-3 minutes (most time spent pulling Docker image; subsequent deploys with same image are faster due to caching).
SSH into Instance
Connect to your GPU instance:
io-cli ssh my-first-gpu
Or use standard SSH:
ssh [email protected]
The CLI automatically manages SSH keys for you (stored in ~/.io-net/ssh/). For manual SSH, add your public key in the web console under Account → SSH Keys.
Run Test Workload
Verify GPU is accessible and working:
import torch
import sys
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"GPU name: {torch.cuda.get_device_name(0)}")
print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
# Quick compute test
x = torch.randn(10000, 10000).cuda()
y = torch.matmul(x, x)
print(f"Matrix multiplication successful: {y.shape}")
Expected output:
PyTorch version: 2.2.0
CUDA available: True
CUDA version: 12.1
GPU count: 1
GPU name: NVIDIA H100 PCIe
GPU memory: 80.00 GB
Matrix multiplication successful: torch.Size([10000, 10000])
Congratulations! You've deployed your first GPU on io.net.
Stop or Delete Instance
When finished:
Pause (stop paying but preserve state):
io-cli deploy pause my-first-gpu
Restart:
io-cli deploy resume my-first-gpu
Delete (permanent):
io-cli deploy delete my-first-gpu
io.net bills per second rather than rounding up to the hour, so you pay only for actual usage time. Pausing stops GPU billing immediately; attached volumes continue to accrue storage charges.
Scaling to Multi-GPU Clusters
Single GPUs are great for experimentation, but production training demands multi-GPU clusters. io.net supports clusters from 2 to 1,000+ GPUs.
Deploy 8-GPU NVLink Cluster
For large language model training, deploy an 8-GPU H100 SXM cluster with NVLink interconnect:
io-cli deploy create \
--gpu-type h100-sxm \
--gpu-count 8 \
--interconnect nvlink \
--image nvcr.io/nvidia/pytorch:24.03-py3 \
--name llama-training-cluster \
--region us-east
Key differences from single-GPU:
- --gpu-type h100-sxm: SXM variant has NVLink support (900 GB/s GPU-to-GPU bandwidth)
- --gpu-count 8: Full 8-GPU node (common for LLM training)
- --interconnect nvlink: Enable NVLink mesh (critical for multi-GPU performance)
Deployment time: 3-5 minutes (longer due to NVLink initialization and multi-GPU provisioning)
Verify NVLink Connectivity
SSH into the cluster and check NVLink topology:
io-cli ssh llama-training-cluster
nvidia-smi topo -m
Expected output (abbreviated):
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18
...
NV18 means the two GPUs communicate over 18 bonded NVLink links (NVLink 4.0, 900 GB/s total bidirectional bandwidth per GPU). On 8-GPU H100 SXM nodes these links are routed through NVSwitch, so every GPU can reach every other GPU at full NVLink bandwidth.
If you see PHB or SYS instead of NV18, NVLink is not active. Verify that you requested h100-sxm (not h100-pcie) and included the --interconnect nvlink flag.
Multi-Node Clusters (64 GPUs)
For massive-scale training (GPT-3 class models, 100B+ parameters), deploy multi-node clusters.
Create cluster configuration file (cluster-config.yaml):
name: large-scale-training
gpu_type: h100-sxm
total_gpus: 64
nodes:
  - node_count: 8 # 8 nodes
    gpus_per_node: 8 # 8 GPUs each = 64 total
    interconnect: nvlink # NVLink within each node
network: infiniband # InfiniBand between nodes (400 Gbps)
region: us-east
image: nvcr.io/nvidia/pytorch:24.03-py3
volumes:
  - name: training-data
    size: 10TB
    mount: /data
  - name: checkpoints
    size: 5TB
    mount: /checkpoints
env:
  NCCL_DEBUG: INFO
  NCCL_IB_DISABLE: 0 # Enable InfiniBand for NCCL
  NCCL_SOCKET_IFNAME: ib0
Deploy cluster:
io-cli deploy create --config cluster-config.yaml
Cost calculation:
- 64x H100 SXM @ $2.49/hour = $159.36/hour
- 10 days of training: 240 hours × $159.36 = $38,246
- AWS equivalent (p5.48xlarge): 8x $98.32 = $786.56/hour for 64 GPUs = $188,774 for 10 days
- Savings: $150,528 (79.7% cheaper)
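The arithmetic above can be reproduced with a short script. This is a sketch that uses only the prices quoted in this guide ($2.49 per H100 SXM GPU-hour and $98.32 per 8-GPU p5.48xlarge hour); the function name is illustrative, and real prices should be checked before budgeting.

```python
def cluster_cost(gpus: int, hours: float, price_per_gpu_hour: float) -> float:
    """Total cost of running `gpus` GPUs for `hours` hours."""
    return gpus * hours * price_per_gpu_hour

HOURS = 10 * 24  # 10 days of training

ionet = cluster_cost(64, HOURS, 2.49)
aws = cluster_cost(64, HOURS, 98.32 / 8)  # p5.48xlarge price split across its 8 GPUs

savings = aws - ionet
print(f"io.net:  ${ionet:,.2f}")
print(f"AWS:     ${aws:,.2f}")
print(f"Savings: ${savings:,.2f} ({savings / aws:.1%})")
```

Changing HOURS or the GPU count shows how quickly the absolute difference grows with job length.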
io.net's decentralized model aggregates underutilized datacenter capacity, passing massive cost savings to users.
Distributed Training Setup
Multi-GPU clusters require distributed training frameworks. io.net supports PyTorch DDP, DeepSpeed, Megatron-LM, and Horovod out of the box.
PyTorch Distributed Data Parallel (DDP)
Initialize distributed training in your script:
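import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    # LOCAL_RANK and related env vars are set automatically by the launcher (torchrun)
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    return local_rank

def cleanup_distributed():
    dist.destroy_process_group()

local_rank = setup_distributed()

# Wrap model (YourModel, dataloader, and optimizer are placeholders for your own code)
model = YourModel().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# Training loop
for batch in dataloader:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss = model(batch)
    loss.backward()
    optimizer.step()

cleanup_distributed()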
Launch distributed training:
io-cli exec llama-training-cluster \
"torchrun \
--nproc_per_node=8 \
--nnodes=1 \
--node_rank=0 \
--master_addr=localhost \
--master_port=29500 \
train.py"
For multi-node training (64 GPUs across 8 nodes), target the large-scale-training cluster defined above:
io-cli exec large-scale-training \
"torchrun \
--nproc_per_node=8 \
--nnodes=8 \
--master_addr=$(io-cli cluster info large-scale-training --get master-ip) \
--master_port=29500 \
train.py"
io.net automatically configures NCCL (NVIDIA Collective Communications Library) for optimal GPU-to-GPU communication.
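torchrun tells each worker its position in the job through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. The sketch below reads them with single-process defaults, so the same script also runs without a launcher. The DistEnv and read_dist_env names are illustrative, not part of PyTorch or io.net.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class DistEnv:
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node (selects the GPU)
    world_size: int  # total number of processes in the job

def read_dist_env(env=None) -> DistEnv:
    """Read the variables torchrun sets; fall back to a single-process job."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )

if __name__ == "__main__":
    info = read_dist_env()
    print(f"process {info.rank} of {info.world_size} using GPU {info.local_rank}")
```

Logging these values at startup is a cheap way to confirm that a 64-GPU job really launched 64 workers.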
DeepSpeed Configuration
For training models that don't fit in single-GPU memory (70B+ parameter models), use DeepSpeed ZeRO.
Create DeepSpeed config (ds_config.json):
{
"train_batch_size": 256,
"train_micro_batch_size_per_gpu": 4,
"gradient_accumulation_steps": 8,
"optimizer": {
"type": "AdamW",
"params": {
"lr": 3e-4,
"betas": [0.9, 0.95],
"eps": 1e-8,
"weight_decay": 0.1
}
},
"fp16": {
"enabled": false
},
"bf16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "none"
},
"offload_param": {
"device": "none"
},
"overlap_comm": true,
"contiguous_gradients": true,
"reduce_bucket_size": 5e8,
"stage3_prefetch_bucket_size": 5e8,
"stage3_param_persistence_threshold": 1e6
},
"steps_per_print": 100,
"wall_clock_breakdown": false
}
Launch DeepSpeed training:
io-cli exec llama-training-cluster \
"deepspeed --num_gpus=8 train.py --deepspeed_config ds_config.json"
DeepSpeed ZeRO-3 shards model parameters, gradients, and optimizer states across all GPUs, enabling training of 70B+ parameter models that wouldn't fit on a single GPU.
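DeepSpeed expects train_batch_size to equal train_micro_batch_size_per_gpu × gradient_accumulation_steps × the number of GPUs, and refuses to start otherwise. A minimal sketch of checking this before launching, using the three values from ds_config.json above; the check_batch_config helper is hypothetical, not a DeepSpeed API.

```python
import json

def check_batch_config(config: dict, world_size: int) -> int:
    """Return the effective global batch size, or raise if the config is inconsistent."""
    micro = config["train_micro_batch_size_per_gpu"]
    accum = config["gradient_accumulation_steps"]
    expected = micro * accum * world_size
    if config["train_batch_size"] != expected:
        raise ValueError(
            f"train_batch_size={config['train_batch_size']} but "
            f"{micro} x {accum} x {world_size} GPUs = {expected}"
        )
    return expected

config = json.loads("""
{"train_batch_size": 256,
 "train_micro_batch_size_per_gpu": 4,
 "gradient_accumulation_steps": 8}
""")
print(check_batch_config(config, world_size=8))  # 4 x 8 x 8 = 256
```

The config is consistent for one 8-GPU node. On the 64-GPU cluster the same micro-batch and accumulation settings imply a global batch of 2048, so one of the three values has to change.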
Verify Training Performance
Monitor GPU utilization during training:
io-cli monitor llama-training-cluster --gpu-stats
Healthy training metrics:
- GPU utilization: 90-98% (indicates GPUs actively computing)
- GPU memory: 70-90% (efficient use without OOM risk)
- NVLink utilization: 40-80% (high communication for large models)
- Power draw: 650-700W per GPU (maxed out = good)
If GPU utilization is below 80%, diagnose bottlenecks:
- Low utilization + low NVLink traffic: Data loading bottleneck (speed up data pipeline)
- Low utilization + high NVLink traffic: Communication bottleneck (reduce gradient sync frequency)
- High memory + low compute: Model architecture inefficiency (check for unnecessary copies)
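The diagnosis rules above can be sketched as a small function. The 80% GPU utilization cutoff comes from the text; the 40% NVLink and 90% memory cutoffs are assumptions taken from the edges of the healthy ranges listed earlier, so tune them for your workload.

```python
def diagnose(gpu_util: float, nvlink_util: float, mem_util: float) -> str:
    """Map utilization percentages (0-100) to the likely bottleneck."""
    if gpu_util >= 80:
        return "healthy"
    if mem_util >= 90:  # assumed cutoff for "high memory"
        return "model inefficiency: check for unnecessary copies"
    if nvlink_util < 40:  # assumed cutoff for "low NVLink traffic"
        return "data loading bottleneck: speed up the data pipeline"
    return "communication bottleneck: reduce gradient sync frequency"

print(diagnose(gpu_util=55, nvlink_util=12, mem_util=60))
```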
Cost Optimization Best Practices
io.net is already 70-80% cheaper than AWS/GCP, but you can reduce costs further with smart configurations.
1. Use Spot Instances for Fault-Tolerant Workloads
Spot instances offer 40-60% discounts for interruptible workloads:
io-cli deploy create \
--gpu-type h100-pcie \
--gpu-count 8 \
--spot \
--max-price 1.20 \
--checkpoint-interval 3600 \
--name spot-training
How spot works:
- You set a maximum price per GPU-hour (--max-price)
- io.net allocates GPUs at current spot price (fluctuates based on supply/demand)
- If spot price exceeds your max, instance is terminated with 5-minute warning
- --checkpoint-interval auto-saves model every N seconds (resume after interruption)
Best for:
- Long-running training jobs with frequent checkpointing
- Hyperparameter sweeps (each run is independent)
- Inference workloads with retry logic
Avoid for:
- Time-critical deadlines (spot may terminate mid-training)
- Jobs without checkpointing support
Savings example:
- Regular H100 PCIe: $1.99/hour
- Spot H100 PCIe (avg): $0.89/hour
- 55% savings
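A minimal sketch of the spot rule and the savings figure above. The price series is hypothetical, and the function names are illustrative rather than part of io-cli.

```python
def spot_decision(spot_price: float, max_price: float) -> str:
    """Keep running while the spot price is at or below the bid; otherwise terminate."""
    return "run" if spot_price <= max_price else "terminate"

def savings_pct(on_demand: float, spot: float) -> float:
    """Percentage saved by paying `spot` instead of `on_demand`."""
    return (on_demand - spot) / on_demand * 100

prices = [0.85, 0.89, 1.10, 1.25]  # hypothetical hourly spot prices
for p in prices:
    print(f"${p:.2f}: {spot_decision(p, max_price=1.20)}")
print(f"average savings: {savings_pct(1.99, 0.89):.0f}%")
```

Only the last price exceeds the $1.20 bid, so that is the point at which the 5-minute termination warning would arrive.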
2. Auto-Shutdown Idle Clusters
Prevent forgotten instances from burning budget:
io-cli deploy update llama-training-cluster \
--idle-timeout 600 # Shut down after 10 minutes of <5% GPU utilization
This is essential for development/experimentation where you might SSH in, start a job, and forget to terminate afterward.
Idle detection: Cluster is considered idle if GPU utilization < 5% for duration of timeout period.
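The idle rule can be sketched as a check over recent utilization samples. This is an illustration of the stated rule (below 5% for the whole timeout window), not io.net's actual implementation; the sample format is an assumption.

```python
def is_idle(samples, timeout_s: int, threshold_pct: float = 5.0) -> bool:
    """samples: list of (timestamp_seconds, gpu_util_pct), oldest first.

    Idle means every sample in the last `timeout_s` seconds is below the
    threshold, and the samples span the full timeout window.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    if latest - samples[0][0] < timeout_s:
        return False  # not enough history yet
    window = [u for t, u in samples if t >= latest - timeout_s]
    return all(u < threshold_pct for u in window)

quiet = [(t, 1.0) for t in range(0, 601, 60)]  # 10 minutes at 1% utilization
print(is_idle(quiet, timeout_s=600))
```

A single busy sample inside the window resets the decision, which is why a short burst of work keeps the cluster alive.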
3. Right-Size GPU Type for Workload
Don't overpay for performance you don't need:
| Workload | Optimal GPU | Price | Why |
|---|---|---|---|
| LLaMA 70B training (8+ GPUs) | H100 SXM | $2.49/hr | Needs NVLink for multi-GPU |
| LLaMA 13B fine-tuning (1-2 GPUs) | H100 PCIe | $1.99/hr | No NVLink benefit, save 20% |
| Stable Diffusion training | A100 80GB | $1.39/hr | Sufficient compute, save 30% |
| BERT/GPT-2 training | A100 40GB | $0.99/hr | Fits in 40GB, save 50% |
| Inference serving | RTX 4090 | $0.49/hr | Inference doesn't need datacenter GPU, save 75% |
Cost impact: Using an A100 80GB ($1.39/hr) instead of an H100 SXM ($2.49/hr) for workloads that don't benefit from Hopper's FP8 Tensor Cores saves 44%.
4. Regional Pricing Arbitrage
GPU prices vary by region based on local datacenter costs:
| Region | H100 SXM | H100 PCIe | A100 80GB |
|---|---|---|---|
| us-east | $2.49/hr | $1.99/hr | $1.39/hr |
| us-west | $2.49/hr | $1.99/hr | $1.39/hr |
| eu-west | $2.65/hr (+6%) | $2.12/hr (+7%) | $1.49/hr (+7%) |
| asia-pacific | $2.79/hr (+12%) | $2.23/hr (+12%) | $1.59/hr (+14%) |
If latency isn't critical, deploy in the lowest-cost region:
io-cli deploy create --region us-east # us-east and us-west share the lowest H100 price
For inference serving global users, deploy in multiple regions (users hit nearest region, reducing latency):
# Multi-region inference deployment
deployments:
  - region: us-east
    gpus: 4
  - region: eu-west
    gpus: 4
  - region: asia-pacific
    gpus: 4
load_balancer: geo-routing # Route users to nearest region
5. Use Persistent Volumes Wisely
Persistent storage costs $0.10/GB/month. For large datasets, this adds up:
- 10TB dataset: $1,024/month
- Downloading from S3 each training run: $92/TB (egress) + time
Optimization:
- Store datasets in io.net volumes (faster access, no egress fees)
- Delete volumes when not actively training (re-upload for next run if infrequent)
- Use snapshot backups for long-term storage ($0.05/GB/month, 50% cheaper)
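To decide between keeping a volume and re-downloading, the figures above can be compared directly. This sketch uses the quoted prices ($0.10 and $0.05 per GB-month, $92 per TB of S3 egress) and assumes 1 TB = 1,024 GB, which is what the $1,024 figure implies.

```python
GB_PER_TB = 1024

def monthly_storage_cost(size_tb: float, price_per_gb_month: float) -> float:
    """Monthly cost of storing `size_tb` terabytes at a per-GB monthly price."""
    return size_tb * GB_PER_TB * price_per_gb_month

volume = monthly_storage_cost(10, 0.10)    # live persistent volume
snapshot = monthly_storage_cost(10, 0.05)  # snapshot backup
redownload = 10 * 92                       # S3 egress for one 10TB download

print(f"volume:      ${volume:,.0f}/month")
print(f"snapshot:    ${snapshot:,.0f}/month")
print(f"re-download: ${redownload:,.0f} per training run")
```

At these prices a live volume costs slightly more than one re-download per month, so it pays off once you train on the dataset more than about once a month.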
Projected Cost Savings
Example: 3-month LLaMA 70B training project
| Resource | io.net Optimized | io.net Standard | AWS |
|---|---|---|---|
| Training (8x H100 SXM, 20 days) | $9,552 (spot) | $15,920 | $78,643 |
| Experimentation (4x A100, 60 days) | $8,006 | $13,344 | $42,336 |
| Inference (8x RTX 4090, 90 days) | $8,467 | $8,467 | $34,560 (A100 equiv) |
| Storage (10TB, 90 days) | $3,072 | $3,072 | $7,680 (EBS) |
| Total | $29,097 | $40,803 | $163,219 |
Optimized io.net config saves $134,122 (82%) vs AWS.
Monitoring and Management
Production deployments require observability. io.net provides built-in monitoring for GPUs, costs, and workloads.
Real-Time GPU Utilization
Monitor GPU metrics in real-time:
io-cli monitor llama-training-cluster --gpu-stats --refresh 5
Output (updates every 5 seconds):
GPU UTIL MEM TEMP POWER NVLINK
0 97% 74GB/80 68°C 685W 620 GB/s
1 96% 73GB/80 69°C 690W 615 GB/s
2 98% 75GB/80 67°C 680W 625 GB/s
...
Alerts: Set up alerts for anomalies:
io-cli monitor alert create \
--cluster llama-training-cluster \
--condition "gpu_util < 50 for 10min" \
--action slack-webhook \
--webhook https://hooks.slack.com/services/YOUR/WEBHOOK
Cost Tracking
Track spending in real-time:
io-cli billing usage --cluster llama-training-cluster
Output:
Cluster: llama-training-cluster
Runtime: 47h 23m
GPU hours: 379.1 (8 GPUs × 47.4h)
Cost: $944.37
Projected monthly: $14,320
Budget alerts:
io-cli billing alert create \
--threshold 1000 \
--period daily \
--email [email protected]
Alert triggers if daily spending exceeds $1,000 (useful for catching runaway jobs).
Logs and Debugging
Access container logs:
io-cli logs llama-training-cluster --tail 100 --follow
Filter logs:
io-cli logs llama-training-cluster --grep "ERROR" --since 1h
Download logs for analysis:
io-cli logs llama-training-cluster --download logs.txt
For distributed training debugging, NCCL logs are crucial:
io-cli exec llama-training-cluster "cat /tmp/nccl_debug.log"
Look for NCCL errors like "Network unreachable" (indicates inter-node networking issue) or "Topology detection failed" (NVLink misconfiguration).
Common Errors and Troubleshooting
Error: "Insufficient GPU capacity in region us-east"
Cause: Temporary capacity constraint (rare on io.net, but possible).
Solutions:
- Try different region:
io-cli deploy create --region us-west # Or eu-west, asia-pacific
- Wait for capacity (queue request):
io-cli deploy create --wait --timeout 3600 # Wait up to 1 hour
- Use different GPU type:
io-cli deploy create --gpu-type a100-80gb # More availability
io.net's decentralized model means capacity constraints are rare (200K+ GPUs across 200+ datacenters), unlike AWS where p5 instances are perpetually sold out.
Error: "NCCL initialization failed"
Cause: Multi-GPU distributed training can't establish communication between GPUs.
Common reasons:
- Missing NVLink interconnect (for SXM GPUs):
# Verify you requested NVLink
io-cli deploy show llama-training-cluster | grep interconnect
# Should show: interconnect: nvlink
Fix: Redeploy with --interconnect nvlink flag.
- Firewall blocking NCCL ports:
# Check if NCCL can bind to ports
io-cli exec llama-training-cluster "netstat -tuln | grep 29500"
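Fix: Ensure the security group allows inbound traffic between nodes on ports 29400-29600, which covers the torchrun rendezvous default (29400) and the --master_port used above (29500). NCCL also opens dynamically chosen ports for its own socket traffic, so traffic between cluster nodes should generally be unrestricted.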
- Wrong NCCL backend:
# Ensure using NCCL backend (not gloo or mpi)
dist.init_process_group(backend='nccl') # Correct for NVIDIA GPUs
Debugging: Enable NCCL debug logging:
export NCCL_DEBUG=INFO
python train.py
Check logs for specific error (e.g., "Network xyz not found" indicates network interface naming issue).
Error: "CUDA out of memory (OOM)"
Cause: Model + optimizer state + activations exceed 80GB GPU memory.
Solutions:
- Reduce batch size:
# Instead of batch_size=32
batch_size = 16 # Or 8, 4, etc.
- Enable gradient checkpointing (trade compute for memory):
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class MyModel(nn.Module):
    def forward(self, x):
        # Recompute layer1 activations in the backward pass instead of storing them
        return checkpoint(self.layer1, x, use_reentrant=False)
Reduces memory usage by 30-50% at cost of 20-30% slower training.
- Use DeepSpeed ZeRO-3 (shards model across GPUs):
{
  "zero_optimization": {
    "stage": 3
  }
}
Stage 3 shards parameters, gradients, and optimizer states across GPUs, so on an 8-GPU node the model states can be up to roughly 8x larger than what fits in single-GPU memory.
- Use mixed precision training (FP16/BF16 instead of FP32):
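from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for batch in dataloader:
    optimizer.zero_grad()
    with autocast():  # Run the forward pass in FP16 where it is numerically safe
        loss = model(batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
Reduces activation memory by up to about half (FP16/BF16 values are half the size of FP32). Under autocast the weights and optimizer state remain in FP32, so total savings are smaller than 50%.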
Error: "SSH connection timeout"
Cause: Instance is still initializing (pulling Docker image, starting SSH server).
Solution: Wait 2-3 minutes and retry. First deployment with new Docker image takes longer (image download). Subsequent deployments with cached image are faster (<1 min).
Check status:
io-cli deploy status llama-training-cluster
Wait for status: Running and SSH: Ready before connecting.
If timeout persists after 5 minutes, check security groups:
io-cli deploy show llama-training-cluster --security-group
Ensure port 22 (SSH) is open for inbound traffic from your IP.
Error: "Deployment failed: Payment method declined"
Cause: Credit card declined or insufficient credits.
Solutions:
- Check billing:
io-cli billing status
- Add credits:
io-cli billing add-credits --amount 100 # Add $100
- Update payment method (if card expired):
io-cli billing payment-method update
io.net requires minimum $10 credits for first deployment. Afterward, billing is automatic (charged after usage, not pre-paid).
Advanced Configurations
Custom Docker Images
Use your own Docker images with pre-installed dependencies:
Build custom image:
FROM nvcr.io/nvidia/pytorch:24.03-py3
RUN pip install transformers accelerate datasets
COPY ./my-training-code /workspace
WORKDIR /workspace
Push to registry:
docker build -t yourregistry.io/custom-pytorch:latest .
docker push yourregistry.io/custom-pytorch:latest
Deploy with custom image:
io-cli deploy create \
--image yourregistry.io/custom-pytorch:latest \
--gpu-type h100-pcie \
--gpu-count 8
io.net supports:
- Docker Hub (public and private with credentials)
- NVIDIA NGC Registry
- Google Container Registry (GCR)
- Amazon ECR
- Azure Container Registry
- Self-hosted registries
Private registry authentication:
io-cli deploy create \
--image yourregistry.io/private-image:latest \
--registry-auth username:password
Persistent Storage
Attach persistent volumes for datasets and checkpoints:
Create volume:
io-cli volume create \
--name training-data \
--size 1TB \
--region us-east
Attach to deployment:
io-cli deploy create \
--gpu-type h100-sxm \
--gpu-count 8 \
--attach-volume training-data:/data \
--name my-cluster
Volume is mounted at /data inside container. Data persists across deployments (stop/start cluster, data remains).
Upload data to volume:
# Option 1: Upload from local machine
io-cli volume upload training-data ./local-dataset/ /data/
# Option 2: Download from S3
io-cli exec my-cluster "aws s3 cp s3://my-bucket/dataset /data/ --recursive"
# Option 3: Use io.net's transfer service (faster for large datasets)
io-cli volume import training-data s3://my-bucket/dataset
Snapshots (for backups):
io-cli volume snapshot create training-data --name backup-2026-04-24
Restore from snapshot:
io-cli volume create --from-snapshot backup-2026-04-24 --name restored-data
Multi-Region Deployments
Deploy inference serving across multiple regions for low latency globally:
Configuration file (multi-region-inference.yaml):
deployments:
  - name: inference-us
    region: us-east
    gpu_type: rtx-4090
    gpu_count: 4
    image: myregistry/llm-serve:latest
  - name: inference-eu
    region: eu-west
    gpu_type: rtx-4090
    gpu_count: 4
    image: myregistry/llm-serve:latest
  - name: inference-asia
    region: asia-pacific
    gpu_type: rtx-4090
    gpu_count: 2
    image: myregistry/llm-serve:latest
load_balancer:
  enabled: true
  routing: geo # Route users to nearest region
  health_check: /health
  fallback: us-east # If region unavailable
Deploy:
io-cli deploy create --config multi-region-inference.yaml
io.net provisions instances in all three regions and configures geo-routing load balancer automatically.
Access endpoint:
https://multi-region-inference.io.net/v1/completions
Users in US hit inference-us, European users hit inference-eu, etc. Reduces latency by 100-300ms vs single-region deployment.

Production Best Practices
1. Use Configuration Files (Not CLI Args)
For reproducible deployments, store configs in Git:
cluster-config.yaml:
name: production-training
gpu_type: h100-sxm
gpu_count: 64
nodes: 8
gpus_per_node: 8
interconnect: nvlink
network: infiniband
image: myregistry/llm-training:v1.2.3
volumes:
  - name: datasets
    mount: /data
  - name: checkpoints
    mount: /checkpoints
env:
  WANDB_API_KEY: ${WANDB_API_KEY}
  HF_TOKEN: ${HF_TOKEN}
tags:
  team: research
  project: llama3-finetune
  cost-center: ml-training
Deploy:
io-cli deploy create --config cluster-config.yaml
Version control cluster-config.yaml — easy to reproduce deployments, audit changes, and roll back to previous configs.
2. Tag Resources for Cost Attribution
Attribute GPU costs to teams, projects, or customers:
io-cli deploy create \
--tags team=research,project=llama3,env=prod \
--gpu-type h100-sxm \
--gpu-count 8
Cost report by tag:
io-cli billing usage --group-by team
Output:
Team GPU Hours Cost
research 1,247 $3,105
eng 892 $1,769
data-sci 456 $1,138
Essential for chargeback models (allocating cloud costs to internal teams/projects).
3. Set Budget Alerts
Prevent budget overruns:
io-cli billing alert create \
--threshold 5000 \
--period monthly \
--action email \
--email [email protected],[email protected]
Alert triggers if monthly spending exceeds $5,000. Adjust threshold based on budget.
Per-cluster budgets:
io-cli billing alert create \
--cluster production-training \
--threshold 500 \
--period daily
4. Enable Auto-Scaling for Inference
Handle variable load without overpaying:
Auto-scaling config:
name: inference-cluster
gpu_type: rtx-4090
autoscaling:
  enabled: true
  min_gpus: 2
  max_gpus: 16
  target_utilization: 70%
  scale_up_threshold: 80%
  scale_down_threshold: 40%
  cooldown: 300 # Wait 5 min before scaling again
How it works:
- If GPU utilization > 80% for 2 minutes → add GPUs (up to max_gpus)
- If GPU utilization < 40% for 5 minutes → remove GPUs (down to min_gpus)
- Ensures 70% average utilization (efficient cost vs latency tradeoff)
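The scaling rules can be sketched as a pure function. The thresholds and limits come from the config above; how many GPUs to add or remove is not specified there, so this sketch assumes proportional sizing toward the 70% target, and it leaves the 2-minute, 5-minute, and cooldown timing to the caller.

```python
import math

def desired_gpus(current: int, util_pct: float, *, min_gpus: int = 2,
                 max_gpus: int = 16, scale_up: float = 80.0,
                 scale_down: float = 40.0, target: float = 70.0) -> int:
    """Return the GPU count to run next, given sustained utilization."""
    if scale_down <= util_pct <= scale_up:
        return current  # inside the dead band: do nothing
    # Assumed policy: size the pool so the same load sits at the target utilization.
    needed = math.ceil(current * util_pct / target)
    return max(min_gpus, min(max_gpus, needed))

print(desired_gpus(4, 90))  # overloaded: grow the pool
print(desired_gpus(8, 20))  # underused: shrink the pool
```

The gap between the 40% and 80% thresholds acts as a dead band that keeps the pool from oscillating around a single cutoff.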
Cost impact:
- Without auto-scaling: 16 GPUs × 24h × 30 days × $0.49 = $5,645/month
- With auto-scaling (avg 6 GPUs): 6 × 24 × 30 × $0.49 = $2,116/month
- Savings: $3,529 (62%)
5. Implement Health Checks
Ensure failed deployments are automatically replaced:
health_check:
  enabled: true
  endpoint: /health # HTTP endpoint that returns 200 if healthy
  interval: 30 # Check every 30 seconds
  timeout: 5 # Fail if endpoint doesn't respond in 5s
  unhealthy_threshold: 3 # Mark unhealthy after 3 consecutive failures
  auto_replace: true # Automatically replace unhealthy instances
If instance fails health check (GPU crash, CUDA error, OOM), io.net automatically terminates and replaces with new instance.
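The consecutive-failure rule behind unhealthy_threshold can be sketched as a small counter. This illustrates the logic only; the HealthTracker class is hypothetical, and the actual probing (HTTP request, 5-second timeout) is left to the caller.

```python
class HealthTracker:
    """Counts consecutive failed probes against an unhealthy threshold."""

    def __init__(self, unhealthy_threshold: int = 3):
        self.threshold = unhealthy_threshold
        self.failures = 0

    def record(self, status_code) -> bool:
        """Record one probe result; return True when the instance should be replaced.

        `status_code` is the HTTP status, or None if the probe timed out.
        """
        if status_code == 200:
            self.failures = 0  # any success resets the streak
        else:
            self.failures += 1
        return self.failures >= self.threshold

tracker = HealthTracker()
print([tracker.record(s) for s in [500, None, 200, 500, 500, None]])
```

Requiring consecutive failures means one slow response during a heavy batch does not trigger a replacement.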
6. Use Canary Deployments for Updates
When updating model versions, avoid downtime with canary releases:
deployments:
  - name: inference-v1
    gpus: 8
    weight: 90 # 90% of traffic
    image: myregistry/model:v1.2
  - name: inference-v2
    gpus: 2
    weight: 10 # 10% of traffic (canary)
    image: myregistry/model:v1.3
load_balancer:
  enabled: true
  routing: weighted
Process:
- Deploy v1.3 with 10% traffic weight (canary)
- Monitor error rates, latency, quality metrics
- If metrics look good, gradually increase v1.3 weight (10% → 25% → 50% → 100%)
- Retire v1.2 once v1.3 is stable at 100%
Reduces risk of bad deployments taking down production.
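Weighted routing amounts to picking a backend with probability proportional to its weight. A minimal sketch using the 90/10 split from the config above; the pick_backend function is illustrative and says nothing about how io.net's load balancer is implemented.

```python
import random

def pick_backend(weights: dict, rng: random.Random) -> str:
    """Choose a deployment name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

weights = {"inference-v1": 90, "inference-v2": 10}
rng = random.Random(0)  # seeded only to make the demo repeatable
counts = {name: 0 for name in weights}
for _ in range(10_000):
    counts[pick_backend(weights, rng)] += 1
print(counts)  # close to a 9,000 / 1,000 split
```

Promoting the canary is then just a matter of changing the two weights, for example to 75 and 25.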
Frequently Asked Questions
How long does deployment take?
Single GPU: 2-3 minutes (mostly Docker image pull time)
8-GPU cluster: 3-5 minutes (includes NVLink initialization)
64-GPU multi-node: 7-10 minutes (includes InfiniBand network setup)
Subsequent deployments with cached Docker images: 30-60 seconds.
For comparison, the AWS p5 waitlists cited at the start of this guide run 6-12 weeks, so provisioning here takes minutes rather than weeks.
Can I pause/resume clusters to save money?
Yes:
io-cli deploy pause my-cluster # Stop billing immediately
io-cli deploy resume my-cluster # Resume from exact state
Paused state:
- No GPU charges (only storage charges for attached volumes)
- All data in memory is lost (disk data persists)
- Resume time: 60-90 seconds
Use case: Pause overnight (save 16 hours × $159/hour = $2,544/day for 64-GPU cluster).
What happens if a GPU fails mid-training?
io.net's fault tolerance:
- Automatic detection: Health monitors detect GPU failure (CUDA error, hardware fault)
- Notification: Alert sent to your configured webhook/email
- Replacement: New GPU provisioned automatically (if auto_replace: true in config)
- Checkpoint recovery: Resume training from last checkpoint
How to enable:
fault_tolerance:
  auto_replace: true
  checkpoint_interval: 3600 # Save checkpoint every hour
  checkpoint_path: /checkpoints
Best practice: Always enable checkpointing for long-running training jobs. Even without hardware failures, checkpointing protects against OOM errors, software bugs, and accidental termination.
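Checkpoints only help if a crash during saving cannot corrupt them. A minimal sketch of an atomic save-and-resume pattern, using JSON as a stand-in for real model state (a training script would typically call torch.save instead); the function names are illustrative.

```python
import json
import os
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp, path)  # atomic when source and target share a filesystem

def load_checkpoint(path: str, default: dict) -> dict:
    """Return the saved state, or `default` when no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)
```

On startup, call load_checkpoint with a default such as {"step": 0} and continue from the returned step, so a replaced GPU resumes work instead of restarting it.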
Can I mix GPU types in one cluster?
Not recommended. Distributed training frameworks assume homogeneous hardware. Mixing creates:
- Bottlenecks (slowest GPU becomes bottleneck)
- Load imbalance (some GPUs finish before others → wasted compute)
- Debugging complexity
Better approach: Run separate clusters for each workload:
- H100 SXM cluster: Large model training
- H100 PCIe cluster: Inference
- A100 cluster: Experimentation
How do I transfer large datasets to the cluster?
Option 1: Upload from local machine (for <100GB):
io-cli volume upload my-volume ./local-data/ /data/
Option 2: Download from S3 (fastest for large datasets):
io-cli exec my-cluster "aws s3 cp s3://bucket/data /data/ --recursive"
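io.net does not charge for inbound transfer, but AWS bills its standard internet egress rate for data leaving S3 (only transfers from S3 to EC2 in the same region are free).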
Option 3: Use io.net's transfer service (for multi-TB datasets):
io-cli volume import my-volume s3://bucket/data --parallel 32
Parallelizes download across 32 threads — 10x faster than single-threaded aws s3 cp.
Option 4: Peer with existing cloud storage:
volumes:
  - type: s3
    bucket: my-bucket
    region: us-east-1
    mount: /data
    cache: true # Cache in local SSD for faster access
Transparently mounts S3 bucket as filesystem. Data is fetched on-demand (lazy loading).
What's the minimum billing increment?
Per-second billing — you pay for exactly the time GPUs are running.
Example:
- Start cluster at 10:00:00
- Stop at 10:37:42
- Billed for: 37 minutes 42 seconds = 0.628 hours
- Cost: 8 GPUs × $2.49/hour × 0.628 = $12.51
Under hourly billing at the same rate, stopping at 10:37:42 would still cost a full hour ($19.92). Per-second billing saves $7.41 (37%) on this session alone.
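The effect of the billing increment can be checked with a few lines. This sketch uses the session from the example (the calendar date is arbitrary) and rounds usage up to the increment; the billed_cost name is illustrative.

```python
import math
from datetime import datetime

def billed_cost(start, stop, gpus, price_per_gpu_hour, increment_s=1):
    """Cost when usage is rounded up to a multiple of `increment_s` seconds."""
    seconds = (stop - start).total_seconds()
    billed = math.ceil(seconds / increment_s) * increment_s
    return gpus * price_per_gpu_hour * billed / 3600

start = datetime(2026, 1, 1, 10, 0, 0)   # arbitrary date
stop = datetime(2026, 1, 1, 10, 37, 42)

per_second = billed_cost(start, stop, 8, 2.49)
per_hour = billed_cost(start, stop, 8, 2.49, increment_s=3600)
print(f"per-second: ${per_second:.2f}")
print(f"per-hour:   ${per_hour:.2f}")
```

Computed on exact seconds the session comes to $12.52; the $12.51 figure above results from rounding the duration to 0.628 hours first.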
Can I use Kubernetes instead of CLI?
Yes. io.net supports Kubernetes deployments:
# Install io.net Kubernetes provider
io-cli k8s install
# Deploy cluster via kubectl
kubectl apply -f cluster-manifest.yaml
Sample Kubernetes manifest:
apiVersion: io.net/v1
kind: GPUCluster
metadata:
  name: training-cluster
spec:
  gpuType: h100-sxm
  gpuCount: 8
  image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
  interconnect: nvlink
Kubernetes integration is useful for teams already using K8s for orchestration.
How do I get support for cluster issues?
Support channels:
- Documentation: https://docs.io.net
- Community Discord: https://discord.gg/ionet (response time: <30 min)
- Support tickets: [email protected] (response SLA: 4 hours for production issues)
- Enterprise support: Dedicated Slack channel for enterprise customers
When filing support ticket, include:
- Cluster ID (io-cli deploy show <name> --id)
- Error logs (io-cli logs <name> --download)
- Steps to reproduce
- Expected vs actual behavior
Conclusion
io.net transforms GPU access from a months-long procurement nightmare into a 5-minute API call.
What you learned:
- Deploy single GPUs and massive multi-node clusters instantly
- Configure distributed training with PyTorch DDP and DeepSpeed
- Optimize costs with spot instances, auto-shutdown, and right-sized GPU selection
- Monitor and manage production workloads
- Troubleshoot common deployment errors
io.net advantages over traditional cloud:
- ✅ Instant availability: No waitlists (vs 6-12 weeks on AWS)
- ✅ 70-80% cheaper: $2.49/hour for H100 (vs $12.29 on AWS)
- ✅ Per-second billing: Pay exactly for usage (no rounding up to the hour)
- ✅ 200K+ GPUs: Scale from 1 to 1,000+ GPUs in minutes
- ✅ Decentralized: No vendor lock-in, no single point of failure
Next Steps
1. Create account: https://cloud.io.net/signup (2 minutes)
2. Deploy first GPU:
io-cli deploy create --gpu-type h100-pcie --gpu-count 1
3. Scale to production cluster:
io-cli deploy create --config cluster-config.yaml
4. Join community: https://discord.gg/ionet (ask questions, share learnings)
Get started now — deploy your first GPU cluster in under 5 minutes.