Ray has become the default framework for distributed machine learning. If you're training large models, running hyperparameter sweeps, or serving inference at scale, you're probably using Ray — or you should be.
The bottleneck isn't the framework. It's the infrastructure. Spinning up a Ray cluster on AWS means navigating instance types, availability zones, networking rules, and a bill that scales faster than your model's loss drops. A single 8xH100 instance on AWS costs $27.50/hr. Run a 4-node cluster for a weekend training job and you're looking at a $5,280 invoice before you've even evaluated the checkpoint.
io.net changes this equation. As a decentralized GPU cloud with 320,000+ GPUs across 130+ countries, io.net offers native Ray cluster support — not as an afterthought, but as a first-class deployment target. Clusters deploy in under 2 minutes, H100s start at $2.10/hr, and you can scale to hundreds of GPUs without calling a sales team.
This guide walks you through everything: what Ray is, why io.net is built for it, step-by-step deployment, real code examples for common workloads, and a detailed cost comparison against AWS and Anyscale.
What Is Ray? A Quick Primer
Ray is an open-source framework from Anyscale (originally UC Berkeley's RISELab) that makes it simple to scale Python workloads from a single machine to a cluster of hundreds. It's used in production by OpenAI, Uber, Spotify, Shopify, and Instacart.
Ray consists of Ray Core plus four libraries built on top of it, which together cover the full ML lifecycle:
Ray Core — The foundation. A general-purpose distributed computing API that lets you parallelize any Python function with a single decorator. You write @ray.remote, and Ray handles scheduling, serialization, and fault recovery across machines.
import ray

# build_model, config, and data_shards are placeholders for your own code
@ray.remote(num_gpus=1)  # reserve one GPU per task
def train_shard(data_shard, model_config):
    # Runs on whichever node has a free GPU
    model = build_model(model_config)
    return model.fit(data_shard)

# Launch one training task per shard, in parallel
futures = [train_shard.remote(shard, config) for shard in data_shards]
results = ray.get(futures)
Ray Train — Distributed training for PyTorch, TensorFlow, and HuggingFace. Handles data parallelism, model parallelism, and mixed strategies. Integrates with DeepSpeed and FSDP out of the box.
Ray Tune — Hyperparameter tuning at scale. Run hundreds of trials in parallel across your cluster with search algorithms and schedulers such as Bayesian optimization, HyperBand, and Population Based Training (PBT). Schedulers stop weak trials early, so GPU hours go to the promising configurations instead of an exhaustive grid.
Ray Serve — Model serving with autoscaling. Deploy models as HTTP endpoints with dynamic batching, multi-model composition, and traffic splitting for A/B tests. Handles the transition from training to production on the same infrastructure.
Ray Data — Distributed data processing for ML. Load, transform, and feed data to training and inference pipelines without leaving the Ray ecosystem. Think of it as a bridge between your data lake and your GPU cluster.
Together, these libraries mean you can go from raw data to deployed model on a single Ray cluster — no separate Spark jobs, no Kubernetes YAML, no infrastructure handoffs between teams.
Why io.net for Ray Clusters
There are several places you can run Ray clusters in the cloud. Here's why io.net stands out for GPU-intensive Ray workloads.
Native Ray Support — Not Bolted On
io.net treats Ray as a first-class deployment type alongside Kubernetes, containers, and bare metal. When you select "Ray Cluster" in io.cloud, the platform handles:
- Ray head node initialization with the correct ports and dashboard configuration
- Worker node auto-discovery and registration to the head node
- GPU resource labeling so Ray's scheduler can place tasks on the right hardware
- Networking between nodes across the decentralized network
- Ray Dashboard exposure for monitoring
You don't need to SSH into machines and run ray start commands. You don't write Ansible playbooks. The cluster comes up ready to accept jobs.
Sub-2-Minute Cluster Deployment
On AWS, standing up a Ray cluster means launching EC2 instances, configuring security groups, installing Ray, setting environment variables, and connecting workers to the head node. Even with Ray's own cluster launcher (ray up) automating those steps, this typically takes 8-15 minutes. With io.net, cluster deployment takes under 2 minutes from clicking "Deploy" to having a Ray Dashboard URL in your browser.
This speed matters for iterative workflows. When you're experimenting with different cluster sizes or GPU types, waiting 15 minutes per reconfiguration kills your velocity. On io.net, you can tear down a cluster and spin up a different one in the time it takes to refill your coffee.
70% Cheaper Than Hyperscalers
io.net's decentralized model aggregates GPU supply from data centers, enterprise idle capacity, and compute providers across 130+ countries. This marketplace competition drives prices down:
| GPU | io.net | AWS | Savings |
|---|---|---|---|
| H100 SXM 80GB | $2.10-3.50/hr | $6.88/hr | 49-69% |
| A100 80GB | $1.20-2.00/hr | $5.12/hr* | 61-77% |
| RTX 4090 | $0.20-0.35/hr | N/A | — |
*AWS A100 pricing derived from p4d.24xlarge (8-GPU instance).
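For Ray workloads, where you're often running multi-node clusters for hours or days, the savings compound fast. For example, a 48-hour training run on a 4-node cluster with 8 H100s in total (384 GPU-hours) comes to about $800 on io.net at the $2.10/hr rate, versus about $2,640 on AWS at $6.88/hr.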
Scale to Hundreds of GPUs
io.net's network spans 320,000+ GPUs. When you need to scale a Ray cluster from 8 GPUs to 64, or from 64 to 256, the supply is there. No waitlists, no instance quotas, no "capacity unavailable in your region" errors. The decentralized architecture means GPU availability is global by default.
Auto-Scaling and Fault Tolerance
io.net supports Ray's native autoscaler. Define minimum and maximum worker counts, and the cluster scales based on your workload's resource demands. If a worker node goes down — which is more common on decentralized infrastructure than in a single data center — Ray's built-in fault tolerance handles task re-execution on surviving nodes. Combined with proper checkpointing (covered below), your training jobs survive node failures without restarting from scratch.
Step-by-Step: Deploy a Ray Cluster on io.net
Here's the complete workflow from sign-up to submitting your first distributed job.
Step 1: Sign Up and Select Your GPUs
Create an account at cloud.io.net. Navigate to the GPU marketplace and select your hardware. For Ray clusters, consider your workload:
| Workload | Recommended GPU | Why |
|---|---|---|
| Large model training (>30B params) | H100 SXM 80GB | NVLink, highest memory bandwidth |
| Mid-size training (7-30B params) | A100 80GB | Best price-performance for most training |
| Fine-tuning & LoRA | A100 40GB or RTX 4090 | Sufficient VRAM, lowest cost |
| Inference serving | A100 or L40S | Good throughput per dollar |
| Hyperparameter tuning | Mix of A100 + RTX 4090 | Trials don't need top-tier GPUs |
Step 2: Configure Your Cluster
In io.cloud, select Ray Cluster as the deployment type. Configure the following:
Head Node:
- GPU: 1x A100 80GB (or match your worker GPUs)
- vCPUs: 16+
- RAM: 64GB+
- Role: Runs the Ray GCS (Global Control Store), autoscaler, and dashboard
Worker Nodes:
- GPU: Your selected training GPUs (e.g., 4x A100 80GB workers)
- vCPUs: 16+ per worker
- RAM: 64GB+ per worker
- Count: Start with your target, enable autoscaling for elasticity
Cluster Configuration (YAML preview):
cluster_name: distributed-training-cluster
head_node:
  gpu_type: A100_80GB
  gpu_count: 1
  vcpus: 16
  ram_gb: 64
  disk_gb: 200
worker_nodes:
  gpu_type: A100_80GB
  gpu_count: 1          # GPUs per worker
  vcpus: 16
  ram_gb: 64
  disk_gb: 200
  min_workers: 4
  max_workers: 8        # Autoscaling ceiling
ray_config:
  ray_version: "2.44.0"
  dashboard_port: 8265
  object_store_memory: 20000000000  # 20GB
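As a quick sanity check on this configuration, the totals Ray should report once all nodes join can be estimated from the node specs. The sketch below is plain Python; the dictionary simply mirrors the YAML preview, and `expected_resources` and `object_store_fraction` are hypothetical helpers, not part of Ray or the io.net tooling.

```python
# Hypothetical helpers: estimate cluster totals from the YAML preview values.
CLUSTER = {
    "head_node": {"gpu_count": 1, "vcpus": 16, "ram_gb": 64},
    "worker_nodes": {"gpu_count": 1, "vcpus": 16, "ram_gb": 64,
                     "min_workers": 4, "max_workers": 8},
    "ray_config": {"object_store_memory": 20_000_000_000},
}

def expected_resources(cluster, num_workers):
    """Sum CPUs and GPUs across the head node and `num_workers` workers."""
    worker = cluster["worker_nodes"]
    if not worker["min_workers"] <= num_workers <= worker["max_workers"]:
        raise ValueError("num_workers is outside the autoscaling bounds")
    head = cluster["head_node"]
    return {
        "CPU": head["vcpus"] + num_workers * worker["vcpus"],
        "GPU": head["gpu_count"] + num_workers * worker["gpu_count"],
    }

def object_store_fraction(cluster):
    """Share of head-node RAM reserved for the object store (decimal GB)."""
    ram_bytes = cluster["head_node"]["ram_gb"] * 1_000_000_000
    return cluster["ray_config"]["object_store_memory"] / ram_bytes

print(expected_resources(CLUSTER, 4))            # {'CPU': 80, 'GPU': 5}
print(round(object_store_fraction(CLUSTER), 2))  # 0.31
```

With the minimum of 4 workers this gives 80 CPUs and 5 GPUs, the same totals that ray.cluster_resources() prints in Step 5. The 20GB object store is about 31% of a 64GB node, close to Ray's 30% default.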
Step 3: Deploy via io.cloud Dashboard or CLI
Option A: Dashboard (recommended for first-time setup)
Click Deploy. io.net provisions the head node first, initializes Ray, then brings up worker nodes that auto-register with the head. You'll see each node's status transition from "provisioning" to "running" to "ray-connected" in the dashboard.
Option B: CLI
# Install the io.net CLI
pip install ionet-cli
# Authenticate
ionet auth login
# Deploy from config file
ionet cluster create --config cluster.yaml --type ray
# Check status
ionet cluster status distributed-training-cluster
Output:
Cluster: distributed-training-cluster
Status: RUNNING
Head Node: 10.0.1.100 (A100 80GB) — Ray GCS active
Workers: 4/4 connected
worker-0: 10.0.1.101 (A100 80GB) — ready
worker-1: 10.0.1.102 (A100 80GB) — ready
worker-2: 10.0.1.103 (A100 80GB) — ready
worker-3: 10.0.1.104 (A100 80GB) — ready
Ray Dashboard: https://your-cluster-id.ray.cloud.io.net:8265
Deploy Time: 1m 42s
Step 4: Connect to the Ray Dashboard
io.net exposes the Ray Dashboard via a secure URL provided after deployment. The dashboard gives you real-time visibility into:
- Cluster utilization: GPU and CPU usage per node
- Job status: Running, pending, and completed jobs
- Actor/task view: Which Ray tasks are executing on which nodes
- Logs: Centralized log aggregation across all workers
- Metrics: GPU memory, object store usage, task throughput
Access it directly in your browser. No port-forwarding or SSH tunnels needed — io.net handles the secure proxy.
Step 5: Submit Your First Job
Connect to the cluster from your local machine and submit a distributed job:
import ray

# Connect to the io.net Ray cluster
ray.init("ray://your-cluster-id.ray.cloud.io.net:10001")

# Verify cluster resources
print(ray.cluster_resources())
# {'CPU': 80.0, 'GPU': 5.0, 'memory': 343597383680, ...}

# Run a simple distributed task
@ray.remote(num_gpus=1)
def gpu_task(task_id):
    import torch
    device = torch.device("cuda")
    # Create a tensor on GPU to verify access
    x = torch.randn(1000, 1000, device=device)
    result = torch.mm(x, x.T)
    return f"Task {task_id} completed on {torch.cuda.get_device_name(0)}"

# Run 5 tasks in parallel, one on each GPU
futures = [gpu_task.remote(i) for i in range(5)]
results = ray.get(futures)
for r in results:
    print(r)
# Task 0 completed on NVIDIA A100-SXM4-80GB
# Task 1 completed on NVIDIA A100-SXM4-80GB
# Task 2 completed on NVIDIA A100-SXM4-80GB
# Task 3 completed on NVIDIA A100-SXM4-80GB
# Task 4 completed on NVIDIA A100-SXM4-80GB
Alternatively, submit a job script via the Ray Jobs API:
# Submit a training job to the cluster
# (the Jobs API talks to the dashboard address, not the ray:// client port)
ray job submit \
  --address "https://your-cluster-id.ray.cloud.io.net:8265" \
  --working-dir ./my_project \
  --runtime-env-json '{"pip": ["torch>=2.2", "transformers", "datasets"]}' \
  -- python train.py --epochs 10 --batch-size 64

Common Ray Workloads on io.net
Once your cluster is running, here are the workloads teams typically deploy, with a code pattern for each that you can adapt. Helper names such as load_training_data and train_model stand in for your own code.
Distributed Training with Ray Train + PyTorch
The most common Ray cluster use case. Ray Train wraps PyTorch's DistributedDataParallel (DDP) and handles process group initialization, gradient synchronization, and checkpoint management across nodes.
import os
import tempfile

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

import ray.train as train
from ray.train import Checkpoint, CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer, prepare_data_loader, prepare_model


def train_loop_per_worker(config):
    # Model: prepare_model moves it to the GPU and wraps it with DDP
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=12,
    )
    model = prepare_model(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])

    # Data loader: prepare_data_loader shards it across workers.
    # load_training_data is a placeholder for your own dataset code.
    dataset = load_training_data(config["data_path"])
    dataloader = DataLoader(dataset, batch_size=config["batch_size"])
    dataloader = prepare_data_loader(dataloader)

    for epoch in range(config["epochs"]):
        total_loss = 0.0
        for batch in dataloader:
            optimizer.zero_grad()
            output = model(batch["input"])
            loss = nn.functional.cross_entropy(output, batch["target"])
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        metrics = {"loss": total_loss / len(dataloader), "epoch": epoch}

        # Report metrics every epoch; attach a saved checkpoint every 5 epochs.
        # With a custom training loop, checkpoint frequency is set here.
        if (epoch + 1) % 5 == 0:
            with tempfile.TemporaryDirectory() as tmp:
                torch.save(model.state_dict(), os.path.join(tmp, "model.pt"))
                train.report(metrics, checkpoint=Checkpoint.from_directory(tmp))
        else:
            train.report(metrics)


trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={
        "lr": 1e-4,
        "batch_size": 32,
        "epochs": 50,
        "data_path": "/data/training",
    },
    scaling_config=ScalingConfig(
        num_workers=4,  # Use all 4 GPU workers
        use_gpu=True,
        resources_per_worker={"GPU": 1},
    ),
    run_config=RunConfig(
        name="transformer-training",
        checkpoint_config=CheckpointConfig(num_to_keep=3),  # Keep last 3 checkpoints
    ),
)
result = trainer.fit()
print(f"Final loss: {result.metrics['loss']:.4f}")
print(f"Latest checkpoint: {result.checkpoint.path}")
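Because each worker runs its own copy of the training loop on a shard of the data, the batch size in train_loop_config is per worker, not global. A small sketch of the arithmetic, assuming the data is split evenly across workers; the 100,000-sample dataset size is a made-up figure for illustration and `ddp_batch_math` is a hypothetical helper:

```python
import math

def ddp_batch_math(num_samples, per_worker_batch, num_workers):
    """Global batch size and optimizer steps per epoch under data parallelism."""
    global_batch = per_worker_batch * num_workers
    # Each worker sees roughly num_samples / num_workers samples per epoch.
    samples_per_worker = math.ceil(num_samples / num_workers)
    steps_per_epoch = math.ceil(samples_per_worker / per_worker_batch)
    return global_batch, steps_per_epoch

global_batch, steps = ddp_batch_math(100_000, per_worker_batch=32, num_workers=4)
print(global_batch, steps)  # 128 782
```

With 4 workers and a per-worker batch of 32, every optimizer step averages gradients over 128 samples, so a learning rate tuned on a single GPU may need adjusting.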
Hyperparameter Tuning with Ray Tune
Run dozens or hundreds of training configurations in parallel. Ray Tune handles trial scheduling, early stopping of bad runs, and result aggregation.
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch

search_space = {
    "lr": tune.loguniform(1e-5, 1e-2),
    "batch_size": tune.choice([16, 32, 64, 128]),
    "num_layers": tune.randint(4, 24),
    "d_model": tune.choice([256, 512, 768, 1024]),
    "dropout": tune.uniform(0.0, 0.3),
    "warmup_steps": tune.randint(100, 2000),
}

# ASHA scheduler kills underperforming trials early
scheduler = ASHAScheduler(
    max_t=50,          # Max epochs
    grace_period=5,    # Minimum epochs before early stopping
    reduction_factor=3,
)

# train_model is your own trainable; it must report "val_loss" each epoch
tuner = tune.Tuner(
    tune.with_resources(train_model, {"gpu": 1}),
    param_space=search_space,
    tune_config=tune.TuneConfig(
        metric="val_loss",
        mode="min",
        num_samples=64,              # 64 total trials
        scheduler=scheduler,
        search_alg=OptunaSearch(),   # Bayesian optimization
        max_concurrent_trials=5,     # Use all 5 GPUs simultaneously
    ),
    run_config=tune.RunConfig(
        name="hparam-sweep",
        storage_path="/results/tune",
    ),
)
results = tuner.fit()
best = results.get_best_result()
print(f"Best config: {best.config}")
print(f"Best val_loss: {best.metrics['val_loss']:.4f}")
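To get a feel for what the scheduler settings above imply, the sketch below computes where the early-stopping checkpoints (rungs) fall and how many epochs an idealized sweep would consume. It assumes rungs sit at grace_period × reduction_factor^k and that exactly the top 1/reduction_factor of trials survive each rung. Ray Tune's ASHA is asynchronous, so a real sweep will not match this budget exactly; both function names are hypothetical.

```python
import math

def asha_milestones(grace_period, max_t, reduction_factor):
    """Epochs at which trials are compared and the weakest are stopped."""
    n_rungs = int(math.log(max_t / grace_period) / math.log(reduction_factor) + 1)
    return [grace_period * reduction_factor ** k for k in range(n_rungs)]

def idealized_epoch_budget(num_trials, grace_period, max_t, reduction_factor):
    """Total epochs if only the top 1/reduction_factor survive each rung."""
    total, alive, prev = 0, num_trials, 0
    for milestone in asha_milestones(grace_period, max_t, reduction_factor):
        total += alive * (milestone - prev)  # everyone alive trains to this rung
        prev = milestone
        alive //= reduction_factor           # keep the best fraction
    return total + alive * (max_t - prev)    # survivors finish the run

print(asha_milestones(5, 50, 3))                     # [5, 15, 45]
print(idealized_epoch_budget(64, 5, 50, 3), 64 * 50)  # 750 3200
```

Under these assumptions the 64-trial sweep needs about 750 epochs instead of 3,200, which is why early stopping matters for the cost of a sweep.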
Model Serving with Ray Serve
Deploy trained models as auto-scaling HTTP endpoints. Ray Serve handles batching, request queuing, and replica management.
from ray import serve
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,  # Scale up to 4 GPUs under load
        "target_ongoing_requests": 5,
    },
    max_ongoing_requests=10,
)
class LLMDeployment:
    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="cuda",
        )
        self.model.eval()

    async def __call__(self, request) -> dict:
        data = await request.json()
        prompt = data["prompt"]
        max_tokens = data.get("max_tokens", 256)
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=data.get("temperature", 0.7),
                do_sample=True,
            )
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return {"generated_text": response}


# Deploy: HTTP host and port are set through serve.start, then the app is run
serve.start(http_options={"host": "0.0.0.0", "port": 8000})
app = LLMDeployment.bind(model_name="meta-llama/Llama-3.1-8B-Instruct")
serve.run(app)
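The autoscaling_config values above can be read as a simple rule: keep enough replicas that each one handles about target_ongoing_requests at a time, within the min/max bounds. The sketch below is a simplified model of that rule, not Ray Serve's actual policy, which also applies smoothing and scale-up/scale-down delays; `desired_replicas` is a hypothetical helper.

```python
import math

def desired_replicas(total_ongoing_requests, target_ongoing_requests,
                     min_replicas, max_replicas):
    """Replicas needed so each handles about `target_ongoing_requests`."""
    raw = math.ceil(total_ongoing_requests / target_ongoing_requests)
    return max(min_replicas, min(max_replicas, raw))

# Settings from the deployment above: target 5, between 1 and 4 replicas.
loads = (0, 5, 12, 23, 100)
print([desired_replicas(n, 5, 1, 4) for n in loads])  # [1, 1, 3, 4, 4]
```

Once load exceeds 20 concurrent requests, the deployment is pinned at max_replicas and extra requests queue, so max_replicas is the setting to revisit if latency climbs under load.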
Data Processing with Ray Data
Preprocess large datasets in parallel across your cluster before feeding them into training.
import ray

# Read a large dataset; Ray handles partitioning across nodes
ds = ray.data.read_parquet("s3://my-bucket/training-data/")

# Tokenization is CPU work, so these tasks run on CPU cores and leave GPUs free
def tokenize_batch(batch):
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
    tokens = tokenizer(
        list(batch["text"]),
        padding="max_length",
        truncation=True,
        max_length=2048,
        return_tensors="np",
    )
    batch["input_ids"] = tokens["input_ids"]
    batch["attention_mask"] = tokens["attention_mask"]
    return batch

processed = ds.map_batches(
    tokenize_batch,
    batch_size=256,
    batch_format="numpy",  # dict of NumPy arrays; holds the 2-D token arrays
)

# Write processed data back (to storage every node can reach), or pipe directly to Ray Train
processed.write_parquet("/data/tokenized/")
print(f"Tokenized {ds.count()} examples across the cluster")
Scaling Best Practices for Ray on io.net
Running Ray on a decentralized GPU cloud requires some adjustments compared to a single data center. These practices ensure reliability and performance.
Head Node vs Worker Node Sizing
The head node runs Ray's control plane: the Global Control Store (GCS), the autoscaler, the dashboard, and job scheduling. It doesn't need the heaviest GPU, but it needs reliable compute.
Recommended head node specs:
- 1x GPU (for dashboard metrics and light tasks)
- 16+ vCPUs (GCS and scheduler are CPU-bound)
- 64GB+ RAM (object store metadata)
- 200GB+ disk (logs, checkpoints, dashboard data)
Worker nodes are simpler: maximize GPU power, keep CPU/RAM sufficient for data loading (16+ vCPUs, 64GB+ RAM per GPU). If you're doing data-parallel training, every worker should have identical GPU hardware to avoid stragglers.
Checkpointing for Fault Tolerance
On decentralized infrastructure, node availability is probabilistic rather than guaranteed. Design for interruption:
from ray.train import CheckpointConfig, FailureConfig, RunConfig

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        # Keep multiple checkpoints in case one corrupts
        num_to_keep=5,
    ),
    # Use persistent storage, not local disk
    storage_path="s3://my-bucket/ray-checkpoints/",
    # Enable failure handling
    failure_config=FailureConfig(
        max_failures=3,  # Auto-restart up to 3 times on worker loss
    ),
)
# Save frequently: with a custom training loop, frequency is set in the loop
# by passing checkpoint=... to ray.train.report every 2 epochs or every N steps.
Key principles:
- Checkpoint to remote storage (S3, GCS, or a persistent volume), not to the worker's local disk. If the worker goes down, local checkpoints go with it.
- Checkpoint frequently. On hyperscaler instances, you might checkpoint every 10 epochs. On decentralized infra, checkpoint every 1-2 epochs or every 500 steps.
- Set max_failures >= 2. This lets Ray Train automatically recover from a worker node loss, provision a replacement, and resume from the latest checkpoint.
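The save-often, resume-from-latest pattern behind these principles can be sketched without any framework. The example below is a library-agnostic illustration, not Ray Train's checkpoint mechanism: a local directory stands in for remote storage, the "training step" is a stand-in, and all function names are hypothetical.

```python
import json
import os
import tempfile

def save_checkpoint(ckpt_dir, step, state):
    """Write the checkpoint to a temp file, then rename it into place."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"step_{step:08d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, final_path)  # a reader never sees a half-written file
    return final_path

def load_latest(ckpt_dir):
    """Return the newest complete checkpoint, or None on a fresh start."""
    if not os.path.isdir(ckpt_dir):
        return None
    names = sorted(n for n in os.listdir(ckpt_dir) if n.endswith(".json"))
    if not names:
        return None
    with open(os.path.join(ckpt_dir, names[-1])) as f:
        return json.load(f)

def train(ckpt_dir, total_steps, every):
    """Toy loop: resume from the latest checkpoint, save every `every` steps."""
    latest = load_latest(ckpt_dir)
    step = latest["step"] if latest else 0
    while step < total_steps:
        step += 1  # stand-in for one real training step
        if step % every == 0:
            save_checkpoint(ckpt_dir, step, {"loss": 1.0 / step})
    return step

with tempfile.TemporaryDirectory() as d:
    train(d, total_steps=1200, every=500)  # pretend the node dies at step 1200
    print(load_latest(d)["step"])          # 1000
```

A failure at step 1,200 costs 200 steps of rework here. With a 5,000-step interval the same failure would have meant starting from scratch, which is the argument for short intervals on nodes that can disappear.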
Autoscaling Configuration
io.net supports Ray's built-in autoscaler. Configure it to match your workload pattern:
# Autoscaling policy in cluster config
autoscaling:
  enabled: true
  min_workers: 2             # Always keep 2 workers warm
  max_workers: 16            # Scale up to 16 under heavy load
  idle_timeout_minutes: 10   # Scale down after 10 min idle
  upscaling_speed: 2.0       # In Ray's autoscaler: pending nodes may reach 2x the running node count
  # Resource thresholds
  target_utilization: 0.8    # Scale up when GPU usage > 80%
For hyperparameter tuning, set max_workers equal to your max concurrent trials. For training, keep a fixed worker count (autoscaling during distributed training introduces complexity with gradient synchronization). For serving, autoscaling is essential — set thresholds based on request latency and queue depth.
Cost Comparison: Ray on io.net vs AWS vs Anyscale
Here's what a real Ray cluster workload costs across three platforms. We'll model three common scenarios.
Scenario 1: 4-Node Training Cluster (48 hours)
4 workers with A100 80GB each, 1 head node.
| Component | io.net | AWS (p4d) | Anyscale |
|---|---|---|---|
| Head node (A100 80GB) | $1.40/hr | $5.12/hr* | $3.50/hr |
| Worker nodes (4x A100 80GB) | $5.60/hr | $20.48/hr* | $14.00/hr |
| Total compute/hr | $7.00/hr | $25.60/hr | $17.50/hr |
| 48-hour total | $336 | $1,229 | $840 |
| Data egress (500GB) | $0 | $45 | Included |
| Final bill | $336 | $1,274 | $840 |
*AWS per-GPU cost derived from p4d.24xlarge ($40.96/hr for 8 A100s = $5.12/GPU/hr).
io.net savings: 74% vs AWS, 60% vs Anyscale.
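The table's arithmetic is easy to rerun with your own rates and durations. The sketch below reproduces Scenario 1 from the per-GPU prices listed above; `cluster_bill` and `savings_pct` are hypothetical helper names.

```python
def cluster_bill(gpu_hourly_rate, num_gpus, hours, egress_cost=0.0):
    """Total bill: per-GPU hourly rate x GPUs x hours, plus any egress fee."""
    return gpu_hourly_rate * num_gpus * hours + egress_cost

def savings_pct(cheaper, pricier):
    """Percentage saved by paying `cheaper` instead of `pricier`."""
    return round(100 * (1 - cheaper / pricier))

# Scenario 1: 5 A100s (1 head + 4 workers) for 48 hours.
ionet = cluster_bill(1.40, num_gpus=5, hours=48)
aws = cluster_bill(5.12, num_gpus=5, hours=48, egress_cost=45)
anyscale = cluster_bill(3.50, num_gpus=5, hours=48)

print(round(ionet), round(aws), round(anyscale))              # 336 1274 840
print(savings_pct(ionet, aws), savings_pct(ionet, anyscale))  # 74 60
```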
Scenario 2: Hyperparameter Tuning Sweep (8 hours, 32 trials)
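8 GPUs running 32 trials with ASHA early stopping inside an 8-hour window (64 GPU-hours at most). Early termination cuts actual GPU time to roughly 60% of that budget, which works out to a little over an hour per trial on average.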
| Component | io.net | AWS | Anyscale |
|---|---|---|---|
| 8x A100 80GB x 8 hours | $76.80 | $327.68 | $224.00 |
| Effective cost (early stopping cuts GPU time to ~60%) | ~$45 | ~$195 | ~$134 |
| Estimated bill | ~$45 | ~$195 | ~$134 |
For teams running about three sweeps a week (roughly 13 a month), that's about $600/month on io.net vs about $2,600/month on AWS.
Scenario 3: Ray Serve Inference (Monthly)
2 replicas with autoscaling to 8 replicas, average 4 active. A100 80GB.
| Component | io.net | AWS | Anyscale |
|---|---|---|---|
| Average hourly cost (4 GPUs) | $5.60 | $20.48 | $14.00 |
| Monthly (730 hours) | $4,088 | $14,950 | $10,220 |
For always-on serving, the monthly savings come to about $10,900 versus AWS and about $6,100 versus Anyscale.
Troubleshooting
Workers not connecting to head node
Symptom: Worker nodes show "running" in io.net dashboard but don't appear in Ray Dashboard.
Fix: Verify the Ray port (default 6379) is accessible between nodes. In io.cloud, check that the cluster networking is set to "internal mesh" mode. If workers were added after initial deployment, they may need the updated head node address:
# On the worker (if SSH access is available)
ray stop
ray start --address="HEAD_NODE_IP:6379"
GPU not visible to Ray
Symptom: ray.cluster_resources() shows GPU: 0 even though GPUs are provisioned.
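Fix: Ensure the NVIDIA drivers are loaded on the node. Ray detects GPUs when it starts, so if the driver or CUDA was installed after Ray started, restart Ray on the affected nodes. To see what the driver reports on the node where a task lands, run a task that does not request a GPU (a num_gpus=1 task would stay pending on a cluster that advertises none):
# Quick check from a Ray task
import subprocess
import ray

@ray.remote
def check_gpu():
    # nvidia-smi -L lists the GPUs the driver can see on this node
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return out.stdout or out.stderr

print(ray.get(check_gpu.remote()))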
Out of object store memory
Symptom: ObjectStoreFullError or tasks stuck in pending state.
Fix: Increase the object store allocation in your cluster config. By default, Ray uses 30% of system memory. For data-heavy workloads, raise object_store_memory under ray_config in the cluster YAML. On nodes you start by hand, the equivalent is the --object-store-memory flag:
ray start --address="HEAD_NODE_IP:6379" --object-store-memory=40000000000  # 40GB
Also check for objects that stay referenced longer than needed. Large objects returned from tasks live in the object store until every ObjectRef pointing to them goes out of scope, so delete references you no longer need (del object_ref).
Training job hangs after worker failure
Symptom: Distributed training freezes when one worker drops.
Fix: Enable automatic restarts with failure handling, so the run resumes after a worker is lost:
from ray.train import FailureConfig, RunConfig

run_config = RunConfig(
    failure_config=FailureConfig(max_failures=3),
)
And ensure you're checkpointing to remote storage (not local disk) so the replacement worker can resume from the latest state.
Slow data loading across nodes
Symptom: GPU utilization is low, workers waiting for data.
Fix: Use Ray Data for distributed data loading instead of a single-node DataLoader. Alternatively, stage your data to each worker's local disk before training:
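# Pre-stage data to local disk on each worker
# (run_training is a placeholder for your own training entry point)
@ray.remote(num_gpus=1)
def stage_data_and_train(data_url, config):
    import subprocess
    # check=True raises if the sync fails, instead of training on partial data
    subprocess.run(["aws", "s3", "sync", data_url, "/local/data/"], check=True)
    return run_training("/local/data/", config)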
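If each worker only reads its own slice of the dataset, staging can be cheaper still: every worker syncs just the files it will consume. The sketch below shows one simple way to split a file list by worker rank. It assumes a setup where workers read disjoint shards; if your loader samples from the full dataset on every worker, each node still needs a full copy. The file names and `shard_for_worker` are hypothetical.

```python
def shard_for_worker(paths, worker_rank, num_workers):
    """Round-robin assignment: worker k takes items k, k + n, k + 2n, ..."""
    if not 0 <= worker_rank < num_workers:
        raise ValueError("worker_rank must be in [0, num_workers)")
    # Sort first so every worker derives the same assignment independently.
    return sorted(paths)[worker_rank::num_workers]

files = [f"part-{i:05d}.parquet" for i in range(10)]  # hypothetical file names
for rank in range(4):
    print(rank, shard_for_worker(files, rank, 4))
```

Every file lands on exactly one worker, and shard sizes differ by at most one file, which keeps per-worker staging time roughly even.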
Frequently Asked Questions
What Ray version does io.net support?
io.net supports Ray 2.x (2.44.0 as of early 2026). The platform pre-installs Ray on cluster nodes, and you can pin a different version with the ray_version field in your cluster YAML.
Can I use Ray with PyTorch FSDP or DeepSpeed on io.net?
Yes. Ray Train integrates with both PyTorch Fully Sharded Data Parallel (FSDP) and DeepSpeed. Since io.net provisions standard NVIDIA GPU instances with CUDA and NCCL, these distributed training strategies work as they do elsewhere. For FSDP, select the strategy inside the training loop when you wrap the model:
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer, prepare_model

def train_func(config):
    model = build_model(config)  # placeholder for your own model code
    # Wrap with FSDP instead of the default DDP
    model = prepare_model(model, parallel_strategy="fsdp")
    ...  # training loop as usual

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    torch_config=TorchConfig(backend="nccl"),
)
How do I persist checkpoints and results?
Use remote storage (S3-compatible or GCS) as your checkpoint destination. io.net worker local disks are ephemeral. Configure your RunConfig with a remote storage_path:
from ray.train import CheckpointConfig, RunConfig

run_config = RunConfig(
    storage_path="s3://my-bucket/ray-results/",
    checkpoint_config=CheckpointConfig(num_to_keep=5),
)
This ensures checkpoints survive node restarts and cluster teardowns.
Is there GPU-to-GPU interconnect (NVLink) on io.net?
Multi-GPU nodes on io.net that use H100 SXM or A100 SXM hardware have NVLink within the node, just like any other cloud provider. Cross-node communication uses standard networking (TCP/RDMA where available). For workloads requiring heavy all-reduce operations across nodes, the cross-node bandwidth is the bottleneck regardless of provider — focus on minimizing communication with gradient compression or large batch sizes.
Can I mix GPU types in a single Ray cluster?
Ray supports heterogeneous clusters, and you can configure this on io.net by specifying different GPU types for different worker groups. This is useful for workloads like hyperparameter tuning (where individual trials don't need top-tier GPUs) or pipelines where preprocessing runs on cheaper GPUs and training runs on data-center GPUs. The example assumes each worker group advertises a custom resource named after its GPU type:
@ray.remote(num_gpus=1, resources={"A100": 1})
def training_task(data):
    ...

@ray.remote(num_gpus=1, resources={"RTX4090": 1})
def preprocessing_task(raw_data):
    ...
How does billing work for autoscaling clusters?
io.net bills per-minute per node. When the autoscaler adds workers, billing starts when the node is provisioned. When idle workers are removed after the idle_timeout, billing stops. You only pay for compute you're using. There are no minimum commitments or reservation fees for on-demand clusters.
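Per-minute billing makes the cost of a burst easy to estimate. The sketch below is an illustration only: it assumes partial minutes are rounded up (a guess at the billing rule, not something io.net has stated), the node runtimes are invented, and the $1.40/hr rate is the A100 figure from the cost comparison. `autoscaling_bill` is a hypothetical helper.

```python
import math

def autoscaling_bill(node_minutes, hourly_rate):
    """Per-minute billing: each node's runtime is rounded up to whole minutes."""
    billed_minutes = sum(math.ceil(m) for m in node_minutes)
    return billed_minutes * hourly_rate / 60

# Hypothetical run: 2 baseline workers for 3 hours, plus 2 extra workers
# that the autoscaler kept for 42.5 minutes each.
minutes = [180, 180, 42.5, 42.5]
print(round(autoscaling_bill(minutes, hourly_rate=1.40), 2))  # 10.41
```

Keeping all four workers up for the full 3 hours would cost $16.80 at the same rate, so in this example the burst workers add about $2 instead of about $8.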
Conclusion
Ray is the framework that makes distributed ML practical. io.net is the infrastructure that makes it affordable.
Setting up a Ray cluster on io.net takes under 2 minutes, costs roughly 50-75% less than AWS in the comparisons above, and gives you access to 320,000+ GPUs without capacity constraints. Whether you're running distributed training with Ray Train, sweeping hyperparameters with Ray Tune, deploying models with Ray Serve, or processing data with Ray Data, the workflow is the same: configure, deploy, connect, submit.
The combination of Ray's mature distributed computing abstractions and io.net's decentralized GPU supply means you can run ML workloads at scales that would be prohibitively expensive on traditional cloud providers. Stop waiting for GPU quota approvals. Stop overpaying for infrastructure that sits idle between training runs.
Start building on io.net:
- Create your io.net account
- Deploy a Ray cluster in under 2 minutes
- Submit your first distributed training job
Your models won't train themselves — but they shouldn't bankrupt you either.