io.net GPU Cluster Setup: Complete Deployment Guide for AI Workloads

Deploy H100 GPU clusters in 5 minutes. No waitlists. No datacenter complexity. No procurement headaches.

Traditional cloud providers make GPU access painful: AWS p5 instances have 6-12 week waitlists, GCP requires sales calls for quota increases, and Azure demands enterprise agreements. On-premises deployment costs $500K+ and takes months.

io.net eliminates these barriers. Our decentralized GPU network aggregates 200,000+ GPUs from independent datacenters worldwide, offering instant access to H100, A100, and RTX 4090 clusters at 70% lower cost than hyperscalers.

This guide takes you from account creation to running production distributed training workloads in under 30 minutes. You'll deploy your first single GPU instance, scale to an 8-GPU NVLink cluster, and learn best practices for production deployments.

Prerequisites and Account Setup

What You'll Need

Before starting, ensure you have:

Terminal access: Linux, macOS, or Windows WSL
SSH client: OpenSSH or PuTTY
Basic Docker knowledge: Understanding images, containers, and registries
Payment method: Credit card or cryptocurrency (USDC, SOL supported)
Programming environment: Python 3.8+ recommended for testing

Optional but helpful:

Git (for cloning example repositories)
NVIDIA GPU drivers on local machine (for testing scripts locally first)

Create io.net Account

Step 1: Navigate to https://cloud.io.net/signup

Step 2: Register with email or connect wallet

Email signup: Verify via confirmation link
Wallet signup: Connect MetaMask, Phantom, or compatible Web3 wallet

Step 3: Complete KYC (for credit card payments)

Upload government ID
Verification typically completes in 10-30 minutes

Step 4: Add payment method

Credit card: Instant activation
Crypto: Deposit USDC or SOL to your account wallet (minimum $50 recommended)

Step 5: Get your API key

Navigate to Account → API Keys
Click "Generate New Key"
Copy and save securely (displayed only once)
Set permissions: Read, Write, Deploy (full access for first key)

Install io.net CLI

The CLI is the fastest way to deploy and manage GPU clusters. Installation takes under 1 minute.

On Linux, macOS, or Windows (WSL):

curl -fsSL https://downloads.io.net/cli/install.sh | bash

Verify installation:

io-cli version
# Output: io-cli v2.4.1

Alternative: Install via package managers

Homebrew (macOS):

brew install io-net/tap/io-cli

APT (Debian/Ubuntu):

curl -fsSL https://downloads.io.net/keys/apt.gpg | sudo gpg --dearmor -o /usr/share/keyrings/ionet.gpg
echo "deb [signed-by=/usr/share/keyrings/ionet.gpg] https://downloads.io.net/apt stable main" | sudo tee /etc/apt/sources.list.d/ionet.list
sudo apt update && sudo apt install io-cli

Configure Authentication

Authenticate the CLI with your API key:

io-cli auth login --api-key YOUR_API_KEY_HERE

Success output:

✓ Authentication successful
✓ Logged in as: [email protected]
✓ Credits available: $100.00

Set default region (optional but recommended):

io-cli config set-region us-east

Available regions:

us-east (North Virginia - fastest for US East Coast)
us-west (Oregon - fastest for US West Coast)
eu-west (Ireland - fastest for Europe)
asia-pacific (Singapore - fastest for Asia)

Verify configuration:

io-cli config show

You're now ready to deploy GPUs.

Deploying Your First GPU Instance

Let's deploy a single H100 GPU running PyTorch. This entire process takes under 5 minutes.

List Available GPUs

Check real-time GPU availability:

io-cli gpu list --available

Output:

GPU TYPE       COUNT  REGIONS          PRICE (USD/hr)
h100-sxm       487    us-east, eu-west     $2.49
h100-pcie      823    us-east, us-west     $1.99
a100-80gb      1247   all regions          $1.39
a100-40gb      892    us-east, asia        $0.99
rtx-4090       3421   all regions          $0.49

The COUNT column shows currently available GPUs across all regions. io.net's decentralized model means capacity is rarely constrained (unlike AWS where p5 instances are perpetually sold out).

Deploy Single H100 GPU

Deploy an H100 PCIe GPU with PyTorch pre-installed:

io-cli deploy create \
  --gpu-type h100-pcie \
  --gpu-count 1 \
  --image pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime \
  --name my-first-gpu \
  --region us-east

Flags explained:

--gpu-type: Hardware type (h100-pcie, h100-sxm, a100-80gb, etc.)
--gpu-count: Number of GPUs (1 for this example)
--image: Docker image to run (PyTorch official image from Docker Hub)
--name: Human-readable identifier for this deployment
--region: Geographic region (affects latency and pricing)

Deployment output:

⠿ Creating deployment my-first-gpu
⠿ Allocating 1x H100 PCIe GPU in us-east
✓ GPU allocated: gpu-8x7k2m
⠿ Pulling image pytorch/pytorch:2.2.0-cuda12.1
✓ Container started
✓ SSH server ready

Deployment ID: dep-9f83jd
SSH Access: ssh [email protected]
Cost: $1.99/hour

Deployment typically completes in 2-3 minutes. Most of that time is spent pulling the Docker image, so subsequent deploys with the same image are faster thanks to caching.

SSH into Instance

Connect to your GPU instance:

io-cli ssh my-first-gpu

Or use standard SSH:

ssh [email protected]

The CLI automatically manages SSH keys for you (stored in ~/.io-net/ssh/). For manual SSH, add your public key in the web console under Account → SSH Keys.

Run Test Workload

Verify that the GPU is accessible and working by running the following Python script on the instance:

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"GPU name: {torch.cuda.get_device_name(0)}")
print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# Quick compute test: allocate directly on the GPU and multiply
x = torch.randn(10000, 10000, device="cuda")
y = torch.matmul(x, x)
torch.cuda.synchronize()  # CUDA kernels run asynchronously; wait for the result
print(f"Matrix multiplication successful: {y.shape}")

Expected output:

PyTorch version: 2.2.0
CUDA available: True
CUDA version: 12.1
GPU count: 1
GPU name: NVIDIA H100 PCIe
GPU memory: ~85 GB (an 80 GiB-class card reported in units of 10^9 bytes; the exact value varies)
Matrix multiplication successful: torch.Size([10000, 10000])

Congratulations! You've deployed your first GPU on io.net.

Stop or Delete Instance

When finished:

Pause (stops GPU billing and preserves disk state; in-memory state is lost):

io-cli deploy pause my-first-gpu

Restart:

io-cli deploy resume my-first-gpu

Delete (permanent):

io-cli deploy delete my-first-gpu

io.net charges per second, so you're billed only for actual usage time. Pausing stops GPU billing immediately; attached volumes continue to incur storage charges.

Scaling to Multi-GPU Clusters

Single GPUs are great for experimentation, but production training demands multi-GPU clusters. io.net supports clusters from 2 to 1,000+ GPUs.

Deploy 8-GPU NVLink Cluster

For large language model training, deploy an 8-GPU H100 SXM cluster with NVLink interconnect:

io-cli deploy create \
  --gpu-type h100-sxm \
  --gpu-count 8 \
  --interconnect nvlink \
  --image nvcr.io/nvidia/pytorch:24.03-py3 \
  --name llama-training-cluster \
  --region us-east

Key differences from single-GPU:

--gpu-type h100-sxm: SXM variant has NVLink support (900 GB/s GPU-to-GPU bandwidth)
--gpu-count 8: Full 8-GPU node (common for LLM training)
--interconnect nvlink: Enable NVLink mesh (critical for multi-GPU performance)

Deployment time: 3-5 minutes (longer due to NVLink initialization and multi-GPU provisioning)

Verify NVLink Connectivity

SSH into the cluster and check NVLink topology:

io-cli ssh llama-training-cluster
nvidia-smi topo -m

Expected output (abbreviated):

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18
...

NV18 means the two GPUs are connected by 18 NVLink links (NVLink 4.0, 900 GB/s of total bidirectional bandwidth per GPU). On 8-GPU H100 SXM nodes these links run through NVSwitch, so every GPU can reach every other GPU at full NVLink speed.

If you see PHB or SYS instead of NV18, NVLink is not active. Verify that you requested h100-sxm (not h100-pcie) and included the --interconnect nvlink flag.
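The check described above can be automated. The following is a minimal sketch that parses plain-text `nvidia-smi topo -m` output and lists GPU pairs that are not connected over NVLink. The function name `nvlink_gaps` and the sample matrix are illustrative, and the parser assumes the output contains no terminal color codes.

```python
import re
from typing import List, Tuple

def nvlink_gaps(topo_text: str) -> List[Tuple[str, str]]:
    """Return (source, destination) GPU pairs whose link type is not NV*."""
    rows = [ln.split() for ln in topo_text.strip().splitlines() if ln.strip()]
    # Header cells such as GPU0, GPU1, ...; ignore NIC and affinity columns.
    header = [c for c in rows[0] if re.fullmatch(r"GPU\d+", c)]
    gaps = []
    for row in rows[1:]:
        if not re.fullmatch(r"GPU\d+", row[0]):
            continue  # skip NIC rows and the legend
        src, cells = row[0], row[1 : 1 + len(header)]
        for dst, link in zip(header, cells):
            if src != dst and not link.startswith("NV"):
                gaps.append((src, dst))
    return gaps

# Hypothetical 3-GPU sample in which GPU0 and GPU2 fall back to PCIe (PHB).
sample = """
        GPU0    GPU1    GPU2
GPU0     X      NV18    PHB
GPU1    NV18     X      NV18
GPU2    PHB     NV18     X
"""
print(nvlink_gaps(sample))  # [('GPU0', 'GPU2'), ('GPU2', 'GPU0')]
```

An empty list means every GPU pair reported an NVLink connection.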

Multi-Node Clusters (64 GPUs)

For massive-scale training (GPT-3 class models, 100B+ parameters), deploy multi-node clusters.

Create cluster configuration file (cluster-config.yaml):

name: large-scale-training
gpu_type: h100-sxm
total_gpus: 64
nodes:
  - node_count: 8  # 8 nodes
    gpus_per_node: 8  # 8 GPUs each = 64 total
    interconnect: nvlink  # NVLink within each node
    network: infiniband  # InfiniBand between nodes (400 Gbps)
    region: us-east

image: nvcr.io/nvidia/pytorch:24.03-py3
volumes:
  - name: training-data
    size: 10TB
    mount: /data
  - name: checkpoints
    size: 5TB
    mount: /checkpoints

env:
  NCCL_DEBUG: INFO
  NCCL_IB_DISABLE: 0  # Enable InfiniBand for NCCL
  NCCL_SOCKET_IFNAME: ib0

Deploy cluster:

io-cli deploy create --config cluster-config.yaml

Cost calculation:

64x H100 SXM @ $2.49/hour = $159.36/hour
10 days of training: 240 hours × $159.36 = $38,246
AWS equivalent (p5.48xlarge): 8x $98.32 = $786.56/hour for 64 GPUs = $188,774 for 10 days
Savings: $150,528 (79.7% cheaper)
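The arithmetic above can be reproduced in a few lines of Python. This is only a sketch: the function name `cluster_cost` is illustrative, and the rates are the ones quoted in this guide ($2.49 per GPU-hour on io.net, $98.32 per hour for an 8-GPU p5.48xlarge).

```python
def cluster_cost(gpus: int, rate_per_gpu_hour: float, hours: float) -> float:
    """Total cost of running `gpus` GPUs for `hours` at a per-GPU hourly rate."""
    return gpus * rate_per_gpu_hour * hours

HOURS = 10 * 24  # 10 days of training

ionet = cluster_cost(64, 2.49, HOURS)
# p5.48xlarge is priced per 8-GPU instance, so convert to a per-GPU rate first.
aws = cluster_cost(64, 98.32 / 8, HOURS)

savings = aws - ionet
savings_pct = 100 * savings / aws

print(f"io.net: ${ionet:,.0f}  AWS: ${aws:,.0f}")
print(f"Savings: ${savings:,.0f} ({savings_pct:.1f}%)")
```

Changing `HOURS` or the GPU count gives the same comparison for other job sizes.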

io.net's decentralized model aggregates underutilized datacenter capacity, passing massive cost savings to users.

Distributed Training Setup

Multi-GPU clusters require distributed training frameworks. io.net supports PyTorch DDP, DeepSpeed, Megatron-LM, and Horovod out of the box.

PyTorch Distributed Data Parallel (DDP)

Initialize distributed training in your script:

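import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    # torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    return local_rank

def cleanup_distributed():
    dist.destroy_process_group()

local_rank = setup_distributed()

# Wrap model (YourModel, dataloader, and optimizer are placeholders for your own code)
model = YourModel().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# Training loop
for batch in dataloader:
    optimizer.zero_grad()  # Clear gradients from the previous step
    loss = model(batch)
    loss.backward()  # DDP averages gradients across GPUs during backward
    optimizer.step()

cleanup_distributed()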
Launch distributed training:

io-cli exec llama-training-cluster \
  "torchrun \
   --nproc_per_node=8 \
   --nnodes=1 \
   --node_rank=0 \
   --master_addr=localhost \
   --master_port=29500 \
   train.py"

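For multi-node training (64 GPUs across 8 nodes), the same command must run on every node of the multi-node cluster deployed earlier (large-scale-training). With the c10d rendezvous backend, torchrun assigns node ranks automatically, so no per-node --node_rank is needed:

io-cli exec large-scale-training \
  "torchrun \
   --nproc_per_node=8 \
   --nnodes=8 \
   --rdzv_backend=c10d \
   --rdzv_endpoint=$(io-cli cluster info large-scale-training --get master-ip):29500 \
   train.py"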

io.net automatically configures NCCL (NVIDIA Collective Communications Library) for optimal GPU-to-GPU communication.
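To make the launch parameters concrete, the sketch below shows how the world size and each process's global rank follow from `--nnodes` and `--nproc_per_node`. The helper name `global_rank` is illustrative. The formula matches how torchrun numbers processes once node ranks are known, although with elastic rendezvous the node order itself is not guaranteed to follow host order.

```python
def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global rank of a worker, given its node and its GPU index on that node."""
    return node_rank * nproc_per_node + local_rank

NNODES = 8          # --nnodes
NPROC_PER_NODE = 8  # --nproc_per_node (one process per GPU)

world_size = NNODES * NPROC_PER_NODE
print(f"world size: {world_size}")

# Print the rank range owned by each node.
for node in range(NNODES):
    first = global_rank(node, 0, NPROC_PER_NODE)
    last = global_rank(node, NPROC_PER_NODE - 1, NPROC_PER_NODE)
    print(f"node {node}: ranks {first}-{last}")
```

Inside `train.py`, the same values are available as the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables.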

DeepSpeed Configuration

For training models that don't fit in single-GPU memory (70B+ parameter models), use DeepSpeed ZeRO.

Create DeepSpeed config (ds_config.json):

{
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 3e-4,
      "betas": [0.9, 0.95],
      "eps": 1e-8,
      "weight_decay": 0.1
    }
  },
  "fp16": {
    "enabled": false
  },
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "none"
    },
    "offload_param": {
      "device": "none"
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6
  },
  "steps_per_print": 100,
  "wall_clock_breakdown": false
}

Launch DeepSpeed training:

io-cli exec llama-training-cluster \
  "deepspeed --num_gpus=8 train.py --deepspeed_config ds_config.json"

DeepSpeed ZeRO-3 shards model parameters, gradients, and optimizer states across all GPUs, enabling training of 70B+ parameter models that wouldn't fit on a single GPU.
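DeepSpeed requires `train_batch_size` to equal `train_micro_batch_size_per_gpu` × `gradient_accumulation_steps` × the number of GPUs. The config above satisfies this for 8 GPUs (4 × 8 × 8 = 256). A small pre-flight check, sketched here with an illustrative function name, can catch a mismatch before a job is launched on a different cluster size.

```python
import json

def check_batch_config(config: dict, world_size: int) -> None:
    """Raise ValueError if the batch settings are inconsistent for world_size GPUs."""
    expected = (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * world_size
    )
    if config["train_batch_size"] != expected:
        raise ValueError(
            f"train_batch_size is {config['train_batch_size']}, "
            f"but micro batch x accumulation x GPUs = {expected}"
        )

# The batch-related fields from ds_config.json above.
config = json.loads(
    '{"train_batch_size": 256,'
    ' "train_micro_batch_size_per_gpu": 4,'
    ' "gradient_accumulation_steps": 8}'
)

check_batch_config(config, world_size=8)  # passes: 4 * 8 * 8 == 256
print("config is consistent for 8 GPUs")
```

Reusing the same file on the 64-GPU cluster would fail this check (4 × 8 × 64 = 2048), so one of the three values has to change when scaling out.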

Verify Training Performance

Monitor GPU utilization during training:

io-cli monitor llama-training-cluster --gpu-stats

Healthy training metrics:

GPU utilization: 90-98% (indicates GPUs actively computing)
GPU memory: 70-90% (efficient use without OOM risk)
NVLink utilization: 40-80% (high communication for large models)
Power draw: 650-700W per GPU (maxed out = good)

If GPU utilization is below 80%, diagnose bottlenecks:

Low utilization + low NVLink traffic: Data loading bottleneck (speed up data pipeline)
Low utilization + high NVLink traffic: Communication bottleneck (reduce gradient sync frequency)
High memory + low compute: Model architecture inefficiency (check for unnecessary copies)
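The diagnosis rules above can be written as a small decision function. This is a sketch only: the 80% GPU threshold comes from this guide, while the cutoffs for "high" NVLink traffic and "high" memory are assumptions taken from the edges of the healthy ranges listed earlier.

```python
def diagnose(gpu_util: float, nvlink_util: float, mem_util: float) -> str:
    """Map utilization percentages to the likely bottleneck."""
    low_gpu = 80      # from the guide: investigate below 80% utilization
    high_nvlink = 40  # assumption: lower edge of the healthy NVLink range
    high_mem = 90     # assumption: upper edge of the healthy memory range

    if gpu_util >= low_gpu:
        return "healthy"
    if nvlink_util >= high_nvlink:
        return "communication bottleneck: reduce gradient sync frequency"
    if mem_util >= high_mem:
        return "memory-heavy, compute-light: check for unnecessary copies"
    return "data loading bottleneck: speed up the data pipeline"

print(diagnose(gpu_util=55, nvlink_util=10, mem_util=60))
print(diagnose(gpu_util=60, nvlink_util=75, mem_util=70))
```

The order of the checks is a design choice: communication is tested first because heavy NVLink traffic with idle compute is the most specific signal of the three.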

Cost Optimization Best Practices

io.net is already 70-80% cheaper than AWS/GCP, but you can reduce costs further with smart configurations.

1. Use Spot Instances for Fault-Tolerant Workloads

Spot instances offer 40-60% discounts for interruptible workloads:

io-cli deploy create \
  --gpu-type h100-pcie \
  --gpu-count 8 \
  --spot \
  --max-price 1.20 \
  --checkpoint-interval 3600 \
  --name spot-training

How spot works:

You set a maximum price per GPU-hour (--max-price)
io.net allocates GPUs at current spot price (fluctuates based on supply/demand)
If spot price exceeds your max, instance is terminated with 5-minute warning
--checkpoint-interval auto-saves model every N seconds (resume after interruption)

Best for:

Long-running training jobs with frequent checkpointing
Hyperparameter sweeps (each run is independent)
Inference workloads with retry logic

Avoid for:

Time-critical deadlines (spot may terminate mid-training)
Jobs without checkpointing support

Savings example:

Regular H100 PCIe: $1.99/hour
Spot H100 PCIe (avg): $0.89/hour
55% savings
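The savings figure and the termination rule can both be checked with a short script. This is a sketch under stated assumptions: the function names are illustrative, and the hourly price series is made up to show when a $1.20 maximum would trigger an interruption.

```python
from typing import Optional, Sequence

def savings_pct(on_demand: float, spot: float) -> float:
    """Percentage saved by paying the spot price instead of on-demand."""
    return 100 * (on_demand - spot) / on_demand

def first_interruption(spot_prices: Sequence[float], max_price: float) -> Optional[int]:
    """Index of the first period whose spot price exceeds max_price, or None."""
    for period, price in enumerate(spot_prices):
        if price > max_price:
            return period
    return None

print(f"{savings_pct(1.99, 0.89):.0f}% savings")  # 55% savings

# Hypothetical hourly spot prices for an H100 PCIe.
prices = [0.85, 0.95, 1.25, 0.90]
print(first_interruption(prices, max_price=1.20))  # 2
```

A higher `--max-price` lowers the chance of interruption at the cost of a higher worst-case hourly rate.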

2. Auto-Shutdown Idle Clusters

Prevent forgotten instances from burning budget:

io-cli deploy update llama-training-cluster \
  --idle-timeout 600  # Shut down after 10 minutes of <5% GPU utilization

This is essential for development/experimentation where you might SSH in, start a job, and forget to terminate afterward.

Idle detection: Cluster is considered idle if GPU utilization < 5% for duration of timeout period.
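The idle rule can be sketched as follows. The sampling format and the function name are assumptions for illustration; the 5% threshold and 600-second timeout come from the example above.

```python
from typing import List, Tuple

def is_idle(samples: List[Tuple[float, float]], now: float,
            timeout_s: float = 600, threshold_pct: float = 5.0) -> bool:
    """samples holds (timestamp_seconds, gpu_util_pct) pairs, oldest first."""
    if not samples or now - samples[0][0] < timeout_s:
        return False  # not enough history to cover the whole timeout window
    window = [util for ts, util in samples if ts >= now - timeout_s]
    return bool(window) and all(util < threshold_pct for util in window)

# One sample per minute for 10 minutes, all at 2% utilization.
quiet = [(t, 2.0) for t in range(0, 601, 60)]
print(is_idle(quiet, now=600))  # True

# A single burst of activity in the window resets the verdict.
busy = [(t, 50.0 if t == 300 else 2.0) for t in range(0, 601, 60)]
print(is_idle(busy, now=600))  # False
```

Requiring a full window of history avoids shutting down a cluster that has only just started.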

3. Right-Size GPU Type for Workload

Don't overpay for performance you don't need:

Workload	Optimal GPU	Price	Why
LLaMA 70B training (8+ GPUs)	H100 SXM	$2.49/hr	Needs NVLink for multi-GPU
LLaMA 13B fine-tuning (1-2 GPUs)	H100 PCIe	$1.99/hr	No NVLink benefit, save 20%
Stable Diffusion training	A100 80GB	$1.39/hr	Sufficient compute, save 30%
BERT/GPT-2 training	A100 40GB	$0.99/hr	Fits in 40GB, save 50%
Inference serving	RTX 4090	$0.49/hr	Inference doesn't need datacenter GPU, save 75%

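Cost impact: Using an A100 80GB ($1.39/hr) instead of an H100 SXM ($2.49/hr) for workloads that don't benefit from Hopper's FP8 Tensor Cores saves 44%. The savings in the table for the A100 and RTX 4090 rows are measured against the H100 PCIe price ($1.99/hr).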
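The percentages in this section depend on which GPU is used as the baseline. A quick sketch using the hourly prices listed in this guide makes the baselines explicit; the helper name `savings_vs` is illustrative.

```python
PRICES = {  # USD per GPU-hour, from the availability listing in this guide
    "h100-sxm": 2.49,
    "h100-pcie": 1.99,
    "a100-80gb": 1.39,
    "a100-40gb": 0.99,
    "rtx-4090": 0.49,
}

def savings_vs(baseline: str, choice: str) -> float:
    """Percentage saved by renting `choice` instead of `baseline`."""
    return 100 * (PRICES[baseline] - PRICES[choice]) / PRICES[baseline]

for gpu in PRICES:
    print(
        f"{gpu:10s} vs h100-sxm: {savings_vs('h100-sxm', gpu):3.0f}%   "
        f"vs h100-pcie: {savings_vs('h100-pcie', gpu):3.0f}%"
    )
```

Negative values simply mean the chosen GPU costs more than the baseline.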

4. Regional Pricing Arbitrage

GPU prices vary by region based on local datacenter costs:

Region	H100 SXM	H100 PCIe	A100 80GB
us-east	$2.49/hr	$1.99/hr	$1.39/hr
us-west	$2.49/hr	$1.99/hr	$1.39/hr
eu-west	$2.65/hr (+6%)	$2.12/hr (+7%)	$1.49/hr (+7%)
asia-pacific	$2.79/hr (+12%)	$2.23/hr (+12%)	$1.59/hr (+14%)

If latency isn't critical, deploy in lowest-cost region:

io-cli deploy create --region us-east  # us-east and us-west are the cheapest regions for H100

For inference serving global users, deploy in multiple regions (users hit nearest region, reducing latency):

# Multi-region inference deployment
deployments:
  - region: us-east
    gpus: 4
  - region: eu-west
    gpus: 4
  - region: asia-pacific
    gpus: 4
load_balancer: geo-routing  # Route users to nearest region

5. Use Persistent Volumes Wisely

Persistent storage costs $0.10/GB/month. For large datasets, this adds up:

10TB dataset: $1,024/month
Downloading from S3 each training run: $92/TB (egress) + time

Optimization:

Store datasets in io.net volumes (faster access, no egress fees)
Delete volumes when not actively training (re-upload for next run if infrequent)
Use snapshot backups for long-term storage ($0.05/GB/month, 50% cheaper)
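Whether to keep a volume or delete it and re-download depends on how often you train. The sketch below compares the monthly cost of a 10TB volume against S3 egress per download, using the rates quoted above and 1TB = 1,024GB; the function names are illustrative.

```python
GB_PER_TB = 1024

def monthly_volume_cost(size_tb: float, rate_per_gb: float = 0.10) -> float:
    """Monthly cost of a persistent volume."""
    return size_tb * GB_PER_TB * rate_per_gb

def egress_cost(size_tb: float, rate_per_tb: float = 92.0) -> float:
    """One-time S3 egress cost to download the dataset once."""
    return size_tb * rate_per_tb

volume = monthly_volume_cost(10)          # live volume
snapshot = monthly_volume_cost(10, 0.05)  # snapshot storage
download = egress_cost(10)                # per training run

break_even_runs = volume / download
print(f"volume ${volume:,.0f}/mo, snapshot ${snapshot:,.0f}/mo, download ${download:,.0f}/run")
print(f"break-even: {break_even_runs:.2f} runs per month")
```

With these rates, keeping the volume is cheaper once the dataset would otherwise be downloaded more than about once a month.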

Projected Cost Savings

Example: 3-month LLaMA 70B training project

Resource	io.net Optimized	io.net Standard	AWS
Training (8x H100 SXM, 20 days)	$9,552 (spot)	$15,920	$78,643
Experimentation (4x A100, 60 days)	$8,006	$13,344	$42,336
Inference (8x RTX 4090, 90 days)	$8,467	$8,467	$34,560 (A100 equiv)
Storage (10TB, 90 days)	$3,072	$3,072	$7,680 (EBS)
Total	$29,097	$40,803	$163,219

Optimized io.net config saves $134,122 (82%) vs AWS.

Monitoring and Management

Production deployments require observability. io.net provides built-in monitoring for GPUs, costs, and workloads.

Real-Time GPU Utilization

Monitor GPU metrics in real-time:

io-cli monitor llama-training-cluster --gpu-stats --refresh 5

Output (updates every 5 seconds):

GPU  UTIL   MEM      TEMP   POWER   NVLINK
0    97%    74GB/80  68°C   685W    620 GB/s
1    96%    73GB/80  69°C   690W    615 GB/s
2    98%    75GB/80  67°C   680W    625 GB/s
...

Alerts: Set up alerts for anomalies:

io-cli monitor alert create \
  --cluster llama-training-cluster \
  --condition "gpu_util < 50 for 10min" \
  --action slack-webhook \
  --webhook https://hooks.slack.com/services/YOUR/WEBHOOK

Cost Tracking

Track spending in real-time:

io-cli billing usage --cluster llama-training-cluster

Output:

Cluster: llama-training-cluster
Runtime: 47h 23m
GPU hours: 379.1 (8 GPUs × 47.4h)
Cost: $944.37
Projected monthly: $14,320

Budget alerts:

io-cli billing alert create \
  --threshold 1000 \
  --period daily \
  --email [email protected]

Alert triggers if daily spending exceeds $1,000 (useful for catching runaway jobs).

Logs and Debugging

Access container logs:

io-cli logs llama-training-cluster --tail 100 --follow

Filter logs:

io-cli logs llama-training-cluster --grep "ERROR" --since 1h

Download logs for analysis:

io-cli logs llama-training-cluster --download logs.txt

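For distributed training debugging, NCCL logs are crucial. NCCL writes debug output to stdout by default, so the command below assumes you set NCCL_DEBUG_FILE=/tmp/nccl_debug.log alongside NCCL_DEBUG=INFO: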

io-cli exec llama-training-cluster "cat /tmp/nccl_debug.log"

Look for NCCL errors like "Network unreachable" (indicates inter-node networking issue) or "Topology detection failed" (NVLink misconfiguration).

Common Errors and Troubleshooting

Error: "Insufficient GPU capacity in region us-east"

Cause: Temporary capacity constraint (rare on io.net, but possible).

Solutions:

Try different region:

io-cli deploy create --region us-west  # Or eu-west, asia-pacific

Wait for capacity (queue request):

io-cli deploy create --wait --timeout 3600  # Wait up to 1 hour

Use different GPU type:

io-cli deploy create --gpu-type a100-80gb  # More availability

io.net's decentralized model means capacity constraints are rare (200K+ GPUs across 200+ datacenters), unlike AWS where p5 instances are perpetually sold out.

Error: "NCCL initialization failed"

Cause: Multi-GPU distributed training can't establish communication between GPUs.

Common reasons:

Missing NVLink interconnect (for SXM GPUs):

# Verify you requested NVLink
io-cli deploy show llama-training-cluster | grep interconnect
# Should show: interconnect: nvlink

Fix: Redeploy with --interconnect nvlink flag.

Firewall blocking NCCL ports:

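# Check whether the rendezvous port is listening
io-cli exec llama-training-cluster "netstat -tuln | grep 29500"

Fix: Ensure the security group allows inbound traffic on ports 29400-29600, which covers the rendezvous port used in the examples above (29500). Note that 29500 is the PyTorch launcher's default port rather than an NCCL setting; NCCL opens additional dynamically assigned ports, so nodes within the cluster should be able to reach each other freely.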

Wrong NCCL backend:

# Ensure using NCCL backend (not gloo or mpi)
dist.init_process_group(backend='nccl')  # Correct for NVIDIA GPUs

Debugging: Enable NCCL debug logging:

export NCCL_DEBUG=INFO
python train.py

Check logs for specific error (e.g., "Network xyz not found" indicates network interface naming issue).

Error: "CUDA out of memory (OOM)"

Cause: Model + optimizer state + activations exceed 80GB GPU memory.

Solutions:

Reduce batch size:

# Instead of batch_size=32
batch_size = 16  # Or 8, 4, etc.

Enable gradient checkpointing (trade compute for memory):

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class MyModel(nn.Module):
    def forward(self, x):
        # Recompute layer1 activations in the backward pass instead of storing them
        return checkpoint(self.layer1, x, use_reentrant=False)

Typically reduces memory usage by 30-50% at the cost of 20-30% slower training.

Use DeepSpeed ZeRO-3 (shards parameters, gradients, and optimizer states across GPUs):

{
  "zero_optimization": {
    "stage": 3
  }
}

On an 8-GPU node, this allows model states roughly 8x larger than would fit in a single GPU's memory.

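Use mixed precision training (FP16/BF16 instead of FP32):

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # Loss scaling guards FP16 gradients against underflow
for batch in dataloader:
    optimizer.zero_grad()
    with autocast():  # Run the forward pass in FP16 where it is numerically safe
        loss = model(batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Can reduce activation memory by up to about 50%, since FP16 values are half the size of FP32. Master weights and optimizer states usually stay in FP32, so the total saving is smaller.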
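To judge which of these remedies is enough, it helps to estimate the memory taken by model states alone. The sketch below uses the common rule of thumb for mixed-precision Adam training: 2 bytes per parameter for 16-bit weights, 2 for gradients, and 12 for FP32 optimizer states, with each ZeRO stage sharding one more of these across GPUs. Activations and framework overhead are excluded, and the function name is illustrative.

```python
def model_state_gb(params_billion: float, num_gpus: int, zero_stage: int) -> float:
    """Approximate per-GPU memory (GB) for weights, gradients, and optimizer states."""
    n = params_billion * 1e9
    weights, grads, optim = 2 * n, 2 * n, 12 * n  # bytes
    if zero_stage >= 1:
        optim /= num_gpus    # ZeRO-1 shards optimizer states
    if zero_stage >= 2:
        grads /= num_gpus    # ZeRO-2 also shards gradients
    if zero_stage >= 3:
        weights /= num_gpus  # ZeRO-3 also shards parameters
    return (weights + grads + optim) / 1e9

for params, gpus in [(13, 8), (70, 8), (70, 64)]:
    per_gpu = model_state_gb(params, gpus, zero_stage=3)
    print(f"{params}B params on {gpus} GPUs with ZeRO-3: {per_gpu:.1f} GB per GPU")
```

By this estimate, a 70B model needs about 140 GB per GPU on 8 GPUs, more than an 80GB card holds, so it calls for more GPUs or offloading; on 64 GPUs the same model needs about 17.5 GB per GPU before activations.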

Error: "SSH connection timeout"

Cause: Instance is still initializing (pulling Docker image, starting SSH server).

Solution: Wait 2-3 minutes and retry. First deployment with new Docker image takes longer (image download). Subsequent deployments with cached image are faster (<1 min).

Check status:

io-cli deploy status llama-training-cluster

Wait for status: Running and SSH: Ready before connecting.

If timeout persists after 5 minutes, check security groups:

io-cli deploy show llama-training-cluster --security-group

Ensure port 22 (SSH) is open for inbound traffic from your IP.

Error: "Deployment failed: Payment method declined"

Cause: Credit card declined or insufficient credits.

Solutions:

Check billing:

io-cli billing status

Add credits:

io-cli billing add-credits --amount 100  # Add $100

Update payment method (if card expired):

io-cli billing payment-method update

io.net requires minimum $10 credits for first deployment. Afterward, billing is automatic (charged after usage, not pre-paid).

Advanced Configurations

Custom Docker Images

Use your own Docker images with pre-installed dependencies:

Build custom image:

FROM nvcr.io/nvidia/pytorch:24.03-py3
RUN pip install transformers accelerate datasets
COPY ./my-training-code /workspace
WORKDIR /workspace

Push to registry:

docker build -t yourregistry.io/custom-pytorch:latest .
docker push yourregistry.io/custom-pytorch:latest

Deploy with custom image:

io-cli deploy create \
  --image yourregistry.io/custom-pytorch:latest \
  --gpu-type h100-pcie \
  --gpu-count 8

io.net supports:

Docker Hub (public and private with credentials)
NVIDIA NGC Registry
Google Container Registry (GCR)
Amazon ECR
Azure Container Registry
Self-hosted registries

Private registry authentication:

io-cli deploy create \
  --image yourregistry.io/private-image:latest \
  --registry-auth username:password

Persistent Storage

Attach persistent volumes for datasets and checkpoints:

Create volume:

io-cli volume create \
  --name training-data \
  --size 1TB \
  --region us-east

Attach to deployment:

io-cli deploy create \
  --gpu-type h100-sxm \
  --gpu-count 8 \
  --attach-volume training-data:/data \
  --name my-cluster

Volume is mounted at /data inside container. Data persists across deployments (stop/start cluster, data remains).

Upload data to volume:

# Option 1: Upload from local machine
io-cli volume upload training-data ./local-dataset/ /data/

# Option 2: Download from S3
io-cli exec my-cluster "aws s3 cp s3://my-bucket/dataset /data/ --recursive"

# Option 3: Use io.net's transfer service (faster for large datasets)
io-cli volume import training-data s3://my-bucket/dataset

Snapshots (for backups):

io-cli volume snapshot create training-data --name backup-2026-04-24

Restore from snapshot:

io-cli volume create --from-snapshot backup-2026-04-24 --name restored-data

Multi-Region Deployments

Deploy inference serving across multiple regions for low latency globally:

Configuration file (multi-region-inference.yaml):

deployments:
  - name: inference-us
    region: us-east
    gpu_type: rtx-4090
    gpu_count: 4
    image: myregistry/llm-serve:latest
  - name: inference-eu
    region: eu-west
    gpu_type: rtx-4090
    gpu_count: 4
    image: myregistry/llm-serve:latest
  - name: inference-asia
    region: asia-pacific
    gpu_type: rtx-4090
    gpu_count: 2
    image: myregistry/llm-serve:latest

load_balancer:
  enabled: true
  routing: geo  # Route users to nearest region
  health_check: /health
  fallback: us-east  # If region unavailable

Deploy:

io-cli deploy create --config multi-region-inference.yaml

io.net provisions instances in all three regions and configures geo-routing load balancer automatically.

Access endpoint:

https://multi-region-inference.io.net/v1/completions

Users in US hit inference-us, European users hit inference-eu, etc. Reduces latency by 100-300ms vs single-region deployment.

Production Best Practices

1. Use Configuration Files (Not CLI Args)

For reproducible deployments, store configs in Git:

cluster-config.yaml:

name: production-training
gpu_type: h100-sxm
gpu_count: 64
nodes: 8
gpus_per_node: 8
interconnect: nvlink
network: infiniband
image: myregistry/llm-training:v1.2.3
volumes:
  - name: datasets
    mount: /data
  - name: checkpoints
    mount: /checkpoints
env:
  WANDB_API_KEY: ${WANDB_API_KEY}
  HF_TOKEN: ${HF_TOKEN}
tags:
  team: research
  project: llama3-finetune
  cost-center: ml-training

Deploy:

io-cli deploy create --config cluster-config.yaml

Version control cluster-config.yaml — easy to reproduce deployments, audit changes, and roll back to previous configs.

2. Tag Resources for Cost Attribution

Attribute GPU costs to teams, projects, or customers:

io-cli deploy create \
  --tags team=research,project=llama3,env=prod \
  --gpu-type h100-sxm \
  --gpu-count 8

Cost report by tag:

io-cli billing usage --group-by team

Output:

Team       GPU Hours   Cost
research   1,247       $3,105
eng        892         $1,769
data-sci   456         $1,138

Essential for chargeback models (allocating cloud costs to internal teams/projects).

3. Set Budget Alerts

Prevent budget overruns:

io-cli billing alert create \
  --threshold 5000 \
  --period monthly \
  --action email \
  --email [email protected],[email protected]

Alert triggers if monthly spending exceeds $5,000. Adjust threshold based on budget.

Per-cluster budgets:

io-cli billing alert create \
  --cluster production-training \
  --threshold 500 \
  --period daily

4. Enable Auto-Scaling for Inference

Handle variable load without overpaying:

Auto-scaling config:

name: inference-cluster
gpu_type: rtx-4090
autoscaling:
  enabled: true
  min_gpus: 2
  max_gpus: 16
  target_utilization: 70%
  scale_up_threshold: 80%
  scale_down_threshold: 40%
  cooldown: 300  # Wait 5 min before scaling again

How it works:

If GPU utilization > 80% for 2 minutes → add GPUs (up to max_gpus)
If GPU utilization < 40% for 5 minutes → remove GPUs (down to min_gpus)
Targets roughly 70% average utilization (a reasonable tradeoff between cost and latency)
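The scaling rules above amount to a small decision function. This sketch assumes the caller passes in utilization that has already been sustained for the required period (2 minutes to scale up, 5 minutes to scale down), and the one-GPU step size is an assumption since the guide does not state how many GPUs are added at a time.

```python
def next_gpu_count(util_pct: float, current: int, min_gpus: int = 2,
                   max_gpus: int = 16, scale_up: float = 80,
                   scale_down: float = 40, step: int = 1) -> int:
    """Return the GPU count after applying the scale-up and scale-down thresholds."""
    if util_pct > scale_up:
        return min(current + step, max_gpus)
    if util_pct < scale_down:
        return max(current - step, min_gpus)
    return current  # within the target band: no change

print(next_gpu_count(85, current=6))   # 7
print(next_gpu_count(30, current=6))   # 5
print(next_gpu_count(70, current=6))   # 6
print(next_gpu_count(95, current=16))  # 16, already at max_gpus
```

The gap between the two thresholds prevents the cluster from oscillating when utilization hovers near a single cutoff.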

Cost impact:

Without auto-scaling: 16 GPUs × 24h × 30 days × $0.49 = $5,645/month
With auto-scaling (avg 6 GPUs): 6 × 24 × 30 × $0.49 = $2,116/month
Savings: $3,529 (62%)

5. Implement Health Checks

Ensure failed deployments are automatically replaced:

health_check:
  enabled: true
  endpoint: /health  # HTTP endpoint that returns 200 if healthy
  interval: 30  # Check every 30 seconds
  timeout: 5  # Fail if endpoint doesn't respond in 5s
  unhealthy_threshold: 3  # Mark unhealthy after 3 consecutive failures
  auto_replace: true  # Automatically replace unhealthy instances

If an instance fails its health check (GPU crash, CUDA error, OOM), io.net automatically terminates it and provisions a replacement.
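The effect of `unhealthy_threshold` is easiest to see as a counter of consecutive failures. The class below is an illustrative sketch of that logic, not io.net's implementation.

```python
class HealthTracker:
    """Tracks consecutive failed probes for one instance."""

    def __init__(self, unhealthy_threshold: int = 3):
        self.unhealthy_threshold = unhealthy_threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True if the instance should be replaced."""
        if healthy:
            self.consecutive_failures = 0  # any success resets the count
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.unhealthy_threshold

tracker = HealthTracker()
probes = [True, False, False, True, False, False, False]
print([tracker.record(p) for p in probes])
# [False, False, False, False, False, False, True]
```

With a 30-second interval and a threshold of 3, an instance is replaced about 90 seconds after it stops responding, while a single slow response is ignored.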

6. Use Canary Deployments for Updates

When updating model versions, avoid downtime with canary releases:

deployments:
  - name: inference-v1
    gpus: 8
    weight: 90  # 90% of traffic
    image: myregistry/model:v1.2
  - name: inference-v2
    gpus: 2
    weight: 10  # 10% of traffic (canary)
    image: myregistry/model:v1.3

load_balancer:
  enabled: true
  routing: weighted

Process:

Deploy v1.3 with 10% traffic weight (canary)
Monitor error rates, latency, quality metrics
If metrics look good, gradually increase v1.3 weight (10% → 25% → 50% → 100%)
Retire v1.2 once v1.3 is stable at 100%

Reduces risk of bad deployments taking down production.
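Weighted routing can be illustrated with a simple simulation. This sketch picks a backend per request in proportion to its weight and then walks through the rollout schedule listed above; the function name and the fixed random seed are choices made for the example.

```python
import random
from collections import Counter

def pick_backend(weights: dict, rng: random.Random) -> str:
    """Choose a deployment name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

weights = {"inference-v1": 90, "inference-v2": 10}
rng = random.Random(0)  # seeded so the simulation is repeatable
counts = Counter(pick_backend(weights, rng) for _ in range(10_000))
canary_share = counts["inference-v2"] / 10_000
print(f"canary share of traffic: {canary_share:.1%}")

# Gradual rollout: shift weight to v2 as metrics stay healthy.
for canary_pct in [10, 25, 50, 100]:
    print({"inference-v1": 100 - canary_pct, "inference-v2": canary_pct})
```

Because routing is random per request, the observed canary share fluctuates around 10% rather than matching it exactly.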

Frequently Asked Questions

How long does deployment take?

Single GPU: 2-3 minutes (mostly Docker image pull time)

8-GPU cluster: 3-5 minutes (includes NVLink initialization)

64-GPU multi-node: 7-10 minutes (includes InfiniBand network setup)

Subsequent deployments with cached Docker images: 30-60 seconds.

By comparison, the AWS p5 waitlists cited earlier in this guide run 6-12 weeks.

Can I pause/resume clusters to save money?

Yes:

io-cli deploy pause my-cluster  # Stop GPU billing immediately
io-cli deploy resume my-cluster  # Resume with disk state intact (in-memory state is lost)

Paused state:

No GPU charges (only storage charges for attached volumes)
All data in memory is lost (disk data persists)
Resume time: 60-90 seconds

Use case: Pause overnight (save 16 hours × $159/hour = $2,544/day for 64-GPU cluster).

What happens if a GPU fails mid-training?

io.net's fault tolerance:

Automatic detection: Health monitors detect GPU failure (CUDA error, hardware fault)
Notification: Alert sent to your configured webhook/email
Replacement: New GPU provisioned automatically (if auto_replace: true in config)
Checkpoint recovery: Resume training from last checkpoint

How to enable:

fault_tolerance:
  auto_replace: true
  checkpoint_interval: 3600  # Save checkpoint every hour
  checkpoint_path: /checkpoints

Best practice: Always enable checkpointing for long-running training jobs. Even without hardware failures, checkpointing protects against OOM errors, software bugs, and accidental termination.

Can I mix GPU types in one cluster?

Not recommended. Distributed training frameworks assume homogeneous hardware. Mixing creates:

Bottlenecks (slowest GPU becomes bottleneck)
Load imbalance (some GPUs finish before others → wasted compute)
Debugging complexity

Better approach: Run separate clusters for each workload:

H100 SXM cluster: Large model training
H100 PCIe cluster: Inference
A100 cluster: Experimentation

How do I transfer large datasets to the cluster?

Option 1: Upload from local machine (for <100GB):

io-cli volume upload my-volume ./local-data/ /data/

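Option 2: Download from S3 (for data already stored in S3):

io-cli exec my-cluster "aws s3 cp s3://bucket/data /data/ --recursive"

AWS charges standard S3 egress fees for transfers out to io.net (about $92/TB, as noted in the storage section), unlike transfers from S3 to AWS compute in the same region, which are free.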

Option 3: Use io.net's transfer service (for multi-TB datasets):

io-cli volume import my-volume s3://bucket/data --parallel 32

Parallelizes download across 32 threads — 10x faster than single-threaded aws s3 cp.

Option 4: Peer with existing cloud storage:

volumes:
  - type: s3
    bucket: my-bucket
    region: us-east-1
    mount: /data
    cache: true  # Cache in local SSD for faster access

Transparently mounts S3 bucket as filesystem. Data is fetched on-demand (lazy loading).

What's the minimum billing increment?

Per-second billing — you pay for exactly the time GPUs are running.

Example:

Start cluster at 10:00:00
Stop at 10:37:42
Billed for: 37 minutes 42 seconds = 0.628 hours
Cost: 8 GPUs × $2.49/hour × 0.628 = $12.51

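Under per-hour billing with a one-hour minimum, the same session would be charged as a full hour ($19.92 at the same rate), so per-second billing saves $7.41 (37%) here. Note that AWS EC2 on-demand Linux instances are also billed per second, with a 60-second minimum.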
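The session cost can also be computed directly from timestamps. In this sketch the function names and the calendar date are arbitrary; the rate and GPU count are the ones from the example. Using the exact duration gives $12.52, one cent more than the figure above, which rounds the duration to 0.628 hours first.

```python
import math
from datetime import datetime

def per_second_cost(start: datetime, stop: datetime,
                    gpus: int, rate_per_gpu_hour: float) -> float:
    """Cost when every second of runtime is billed."""
    seconds = (stop - start).total_seconds()
    return gpus * rate_per_gpu_hour * seconds / 3600

def hourly_rounded_cost(start: datetime, stop: datetime,
                        gpus: int, rate_per_gpu_hour: float) -> float:
    """Cost when runtime is rounded up to whole hours."""
    hours = math.ceil((stop - start).total_seconds() / 3600)
    return gpus * rate_per_gpu_hour * hours

start = datetime(2026, 1, 1, 10, 0, 0)
stop = datetime(2026, 1, 1, 10, 37, 42)

exact = per_second_cost(start, stop, gpus=8, rate_per_gpu_hour=2.49)
rounded = hourly_rounded_cost(start, stop, gpus=8, rate_per_gpu_hour=2.49)
print(f"per-second: ${exact:.2f}   rounded up to the hour: ${rounded:.2f}")
```

The gap between the two figures shrinks as sessions get longer, so per-second billing matters most for short, frequent jobs.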

Can I use Kubernetes instead of CLI?

Yes. io.net supports Kubernetes deployments:

# Install io.net Kubernetes provider
io-cli k8s install

# Deploy cluster via kubectl
kubectl apply -f cluster-manifest.yaml

Sample Kubernetes manifest:

apiVersion: io.net/v1
kind: GPUCluster
metadata:
  name: training-cluster
spec:
  gpuType: h100-sxm
  gpuCount: 8
  image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
  interconnect: nvlink

Kubernetes integration is useful for teams already using K8s for orchestration.

How do I get support for cluster issues?

Support channels:

Documentation: https://docs.io.net
Community Discord: https://discord.gg/ionet (response time: <30 min)
Support tickets: [email protected] (response SLA: 4 hours for production issues)
Enterprise support: Dedicated Slack channel for enterprise customers

When filing a support ticket, include:

Cluster ID (io-cli deploy show <name> --id)
Error logs (io-cli logs <name> --download)
Steps to reproduce
Expected vs actual behavior

Conclusion

io.net transforms GPU access from a months-long procurement nightmare into a 5-minute API call.

What you learned:

Deploy single GPUs and massive multi-node clusters instantly
Configure distributed training with PyTorch DDP and DeepSpeed
Optimize costs with spot instances, auto-shutdown, and right-sized GPU selection
Monitor and manage production workloads
Troubleshoot common deployment errors

io.net advantages over traditional cloud:

✅ Instant availability: No waitlists (vs 6-12 weeks on AWS)
✅ 70-80% cheaper: $2.49/hour for H100 (vs $12.29 on AWS)
✅ Per-second billing: Pay exactly for usage
✅ 200K+ GPUs: Scale from 1 to 1,000+ GPUs in minutes
✅ Decentralized: No vendor lock-in, no single point of failure

Next Steps

1. Create account: https://cloud.io.net/signup (2 minutes)

2. Deploy first GPU:

io-cli deploy create --gpu-type h100-pcie --gpu-count 1

3. Scale to production cluster:

io-cli deploy create --config cluster-config.yaml

4. Join community: https://discord.gg/ionet (ask questions, share learnings)

Get started now — deploy your first GPU cluster in under 5 minutes.