Ray has become the default framework for distributed machine learning. If you're training large models, running hyperparameter sweeps, or serving inference at scale, you're probably using Ray — or you should be.

The bottleneck isn't the framework. It's the infrastructure. Spinning up a Ray cluster on AWS means navigating instance types, availability zones, networking rules, and a bill that scales faster than your model's loss drops. At the $6.88 per-GPU H100 rate used in the pricing comparison in this guide, a single 8xH100 instance on AWS costs about $55/hr. Run a 4-node cluster for a 48-hour weekend training job and you're looking at an invoice of roughly $10,570 before you've even evaluated the checkpoint.

io.net changes this equation. As a decentralized GPU cloud with 320,000+ GPUs across 130+ countries, io.net offers native Ray cluster support — not as an afterthought, but as a first-class deployment target. Clusters deploy in under 2 minutes, H100s start at $2.10/hr, and you can scale to hundreds of GPUs without calling a sales team.

This guide walks you through everything: what Ray is, why io.net is built for it, step-by-step deployment, real code examples for common workloads, and a detailed cost comparison against AWS and Anyscale.

What Is Ray? A Quick Primer

Ray is an open-source framework from Anyscale (originally UC Berkeley's RISELab) that makes it simple to scale Python workloads from a single machine to a cluster of hundreds. It's used in production by OpenAI, Uber, Spotify, Shopify, and Instacart.

Ray consists of Ray Core plus four libraries built on top of it, which together cover the full ML lifecycle:

Ray Core — The foundation. A general-purpose distributed computing API that lets you parallelize any Python function with a single decorator. You write @ray.remote, and Ray handles scheduling, serialization, and fault recovery across machines.

import ray

@ray.remote
def train_shard(data_shard, model_config):
    # This runs on any available GPU in the cluster
    model = build_model(model_config)
    return model.fit(data_shard)

# Launch 16 training shards in parallel
futures = [train_shard.remote(shard, config) for shard in data_shards]
results = ray.get(futures)

Ray Train — Distributed training for PyTorch, TensorFlow, and HuggingFace. Handles data parallelism, model parallelism, and mixed strategies. Integrates with DeepSpeed and FSDP out of the box.

Ray Tune — Hyperparameter tuning at scale. Run hundreds of trials in parallel across your cluster with state-of-the-art search algorithms (Bayesian optimization, HyperBand, PBT). One cluster, one command, every combination explored.

Ray Serve — Model serving with autoscaling. Deploy models as HTTP endpoints with dynamic batching, multi-model composition, and traffic splitting for A/B tests. Handles the transition from training to production on the same infrastructure.

Ray Data — Distributed data processing for ML. Load, transform, and feed data to training and inference pipelines without leaving the Ray ecosystem. Think of it as a bridge between your data lake and your GPU cluster.

Together, these libraries mean you can go from raw data to deployed model on a single Ray cluster — no separate Spark jobs, no Kubernetes YAML, no infrastructure handoffs between teams.

Why io.net for Ray Clusters

There are several places you can run Ray clusters in the cloud. Here's why io.net stands out for GPU-intensive Ray workloads.

Native Ray Support — Not Bolted On

io.net treats Ray as a first-class deployment type alongside Kubernetes, containers, and bare metal. When you select "Ray Cluster" in io.cloud, the platform handles:

  • Ray head node initialization with the correct ports and dashboard configuration
  • Worker node auto-discovery and registration to the head node
  • GPU resource labeling so Ray's scheduler can place tasks on the right hardware
  • Networking between nodes across the decentralized network
  • Ray Dashboard exposure for monitoring

You don't need to SSH into machines and run ray start commands. You don't write Ansible playbooks. The cluster comes up ready to accept jobs.

Sub-2-Minute Cluster Deployment

On AWS, standing up a Ray cluster means launching EC2 instances, configuring security groups, installing Ray, setting environment variables, and connecting workers to the head node. Even with the Ray cluster launcher (ray up), this takes 8-15 minutes. With io.net, cluster deployment takes under 2 minutes, measured from clicking "Deploy" to having a Ray Dashboard URL in your browser.

This speed matters for iterative workflows. When you're experimenting with different cluster sizes or GPU types, waiting 15 minutes per reconfiguration kills your velocity. On io.net, you can tear down a cluster and spin up a different one in the time it takes to refill your coffee.

70% Cheaper Than Hyperscalers

io.net's decentralized model aggregates GPU supply from data centers, enterprise idle capacity, and compute providers across 130+ countries. This marketplace competition drives prices down:

GPU           | io.net        | AWS       | Savings
H100 SXM 80GB | $2.10-3.50/hr | $6.88/hr  | 49-69%
A100 80GB     | $1.20-2.00/hr | $5.12/hr* | 61-77%
RTX 4090      | $0.20-0.35/hr | N/A       | N/A

*AWS A100 pricing derived from p4d.24xlarge (8-GPU instance).

For Ray workloads, where you're often running multi-node clusters for hours or days, the savings compound fast. An 8-GPU H100 cluster for a 48-hour training run costs ~$800 on io.net (at $2.10/hr per GPU) vs ~$2,640 on AWS (at $6.88/hr per GPU).

Scale to Hundreds of GPUs

io.net's network spans 320,000+ GPUs. When you need to scale a Ray cluster from 8 GPUs to 64, or from 64 to 256, the supply is there. No waitlists, no instance quotas, no "capacity unavailable in your region" errors. The decentralized architecture means GPU availability is global by default.

Auto-Scaling and Fault Tolerance

io.net supports Ray's native autoscaler. Define minimum and maximum worker counts, and the cluster scales based on your workload's resource demands. If a worker node goes down — which is more common on decentralized infrastructure than in a single data center — Ray's built-in fault tolerance handles task re-execution on surviving nodes. Combined with proper checkpointing (covered below), your training jobs survive node failures without restarting from scratch.

Step-by-Step: Deploy a Ray Cluster on io.net

Here's the complete workflow from sign-up to submitting your first distributed job.

Step 1: Sign Up and Select Your GPUs

Create an account at cloud.io.net. Navigate to the GPU marketplace and select your hardware. For Ray clusters, consider your workload:

Workload                           | Recommended GPU        | Why
Large model training (>30B params) | H100 SXM 80GB          | NVLink, highest memory bandwidth
Mid-size training (7-30B params)   | A100 80GB              | Best price-performance for most training
Fine-tuning & LoRA                 | A100 40GB or RTX 4090  | Sufficient VRAM, lowest cost
Inference serving                  | A100 or L40S           | Good throughput per dollar
Hyperparameter tuning              | Mix of A100 + RTX 4090 | Trials don't need top-tier GPUs

Step 2: Configure Your Cluster

In io.cloud, select Ray Cluster as the deployment type. Configure the following:

Head Node:

  • GPU: 1x A100 80GB (or match your worker GPUs)
  • vCPUs: 16+
  • RAM: 64GB+
  • Role: Runs the Ray GCS (Global Control Store), autoscaler, and dashboard

Worker Nodes:

  • GPU: Your selected training GPUs (e.g., 4x A100 80GB workers)
  • vCPUs: 16+ per worker
  • RAM: 64GB+ per worker
  • Count: Start with your target, enable autoscaling for elasticity

Cluster Configuration (YAML preview):

cluster_name: distributed-training-cluster

head_node:
  gpu_type: A100_80GB
  gpu_count: 1
  vcpus: 16
  ram_gb: 64
  disk_gb: 200

worker_nodes:
  gpu_type: A100_80GB
  gpu_count: 1       # GPUs per worker
  vcpus: 16
  ram_gb: 64
  disk_gb: 200
  min_workers: 4
  max_workers: 8      # Autoscaling ceiling

ray_config:
  ray_version: "2.44.0"
  dashboard_port: 8265
  object_store_memory: 20000000000  # 20GB

Step 3: Deploy via io.cloud Dashboard or CLI

Option A: Dashboard (recommended for first-time setup)

Click Deploy. io.net provisions the head node first, initializes Ray, then brings up worker nodes that auto-register with the head. You'll see each node's status transition from "provisioning" to "running" to "ray-connected" in the dashboard.

Option B: CLI

# Install the io.net CLI
pip install ionet-cli

# Authenticate
ionet auth login

# Deploy from config file
ionet cluster create --config cluster.yaml --type ray

# Check status
ionet cluster status distributed-training-cluster

Output:

Cluster: distributed-training-cluster
Status: RUNNING
Head Node: 10.0.1.100 (A100 80GB) — Ray GCS active
Workers: 4/4 connected
  worker-0: 10.0.1.101 (A100 80GB) — ready
  worker-1: 10.0.1.102 (A100 80GB) — ready
  worker-2: 10.0.1.103 (A100 80GB) — ready
  worker-3: 10.0.1.104 (A100 80GB) — ready
Ray Dashboard: https://your-cluster-id.ray.cloud.io.net:8265
Deploy Time: 1m 42s

Step 4: Connect to the Ray Dashboard

io.net exposes the Ray Dashboard via a secure URL provided after deployment. The dashboard gives you real-time visibility into:

  • Cluster utilization: GPU and CPU usage per node
  • Job status: Running, pending, and completed jobs
  • Actor/task view: Which Ray tasks are executing on which nodes
  • Logs: Centralized log aggregation across all workers
  • Metrics: GPU memory, object store usage, task throughput

Access it directly in your browser. No port-forwarding or SSH tunnels needed — io.net handles the secure proxy.

Step 5: Submit Your First Job

Connect to the cluster from your local machine and submit a distributed job:

import ray

# Connect to the io.net Ray cluster
ray.init("ray://your-cluster-id.ray.cloud.io.net:10001")

# Verify cluster resources
print(ray.cluster_resources())
# {'CPU': 80.0, 'GPU': 5.0, 'memory': 343597383680, ...}

# Run a simple distributed task
@ray.remote(num_gpus=1)
def gpu_task(task_id):
    import torch
    device = torch.device("cuda")
    # Create a tensor on GPU to verify access
    x = torch.randn(1000, 1000, device=device)
    result = torch.mm(x, x.T)
    return f"Task {task_id} completed on {torch.cuda.get_device_name(0)}"

# Run 5 tasks in parallel — one on each GPU
futures = [gpu_task.remote(i) for i in range(5)]
results = ray.get(futures)

for r in results:
    print(r)
# Task 0 completed on NVIDIA A100-SXM4-80GB
# Task 1 completed on NVIDIA A100-SXM4-80GB
# Task 2 completed on NVIDIA A100-SXM4-80GB
# Task 3 completed on NVIDIA A100-SXM4-80GB
# Task 4 completed on NVIDIA A100-SXM4-80GB
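
The totals printed by ray.cluster_resources() can be checked against what the cluster config implies. The helper below is a minimal sketch of that check: expected_resources and missing_resources are illustrative names, not part of Ray or the io.net CLI, and the sketch assumes a homogeneous cluster whose head node contributes its CPUs and GPU to the pool. A live cluster may report less memory than the nominal figure, since Ray sets part of each node's RAM aside for the object store.

```python
GIB = 1024 ** 3


def expected_resources(num_workers, vcpus_per_node=16, ram_gb_per_node=64,
                       gpus_per_node=1, head_nodes=1):
    """Nominal cluster totals implied by a homogeneous cluster config."""
    nodes = head_nodes + num_workers
    return {
        "CPU": float(nodes * vcpus_per_node),
        "GPU": float(nodes * gpus_per_node),
        "memory": nodes * ram_gb_per_node * GIB,
    }


def missing_resources(expected, reported):
    """Return {name: (expected, reported)} for resources that fall short."""
    return {
        name: (want, reported.get(name, 0))
        for name, want in expected.items()
        if reported.get(name, 0) < want
    }


expected = expected_resources(num_workers=4)
print(expected)

# Simulated report from a cluster where one worker has not registered its GPU
print(missing_resources(expected, {"CPU": 80.0, "GPU": 4.0, "memory": 5 * 64 * GIB}))
```

With a live connection, pass ray.cluster_resources() as the second argument to missing_resources; an empty result means every node registered the resources the config asked for.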

Alternatively, submit a job script via the Ray Jobs API:

# Submit a training job to the cluster.
# The Jobs API uses the dashboard's HTTP(S) address, not the ray:// client port.
ray job submit \
  --address "https://your-cluster-id.ray.cloud.io.net:8265" \
  --working-dir ./my_project \
  --runtime-env-json '{"pip": ["torch>=2.2", "transformers", "datasets"]}' \
  -- python train.py --epochs 10 --batch-size 64

Common Ray Workloads on io.net

Once your cluster is running, here are the workloads teams typically deploy — with production-ready code patterns for each.

Distributed Training with Ray Train + PyTorch

The most common Ray cluster use case. Ray Train wraps PyTorch's DistributedDataParallel (DDP) and handles process group initialization, gradient synchronization, and checkpoint management across nodes.

import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig, RunConfig, CheckpointConfig
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_loop_per_worker(config):
    import ray.train as train
    from ray.train.torch import prepare_model, prepare_data_loader

    # Model — automatically wrapped with DDP
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=12
    )
    model = prepare_model(model)

    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])

    # Data loader — automatically sharded across workers
    dataset = load_training_data(config["data_path"])
    dataloader = DataLoader(dataset, batch_size=config["batch_size"])
    dataloader = prepare_data_loader(dataloader)

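    import os
    import tempfile

    for epoch in range(config["epochs"]):
        total_loss = 0.0
        for batch in dataloader:
            optimizer.zero_grad()
            output = model(batch["input"])
            loss = nn.functional.cross_entropy(output, batch["target"])
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        metrics = {"loss": total_loss / len(dataloader), "epoch": epoch}

        # Every worker must call report(); attach a checkpoint every 5 epochs
        if (epoch + 1) % 5 == 0:
            with tempfile.TemporaryDirectory() as tmpdir:
                # Unwrap DDP so the saved weights load into a plain model
                state = getattr(model, "module", model).state_dict()
                torch.save(state, os.path.join(tmpdir, "model.pt"))
                train.report(metrics, checkpoint=train.Checkpoint.from_directory(tmpdir))
        else:
            train.report(metrics)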

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={
        "lr": 1e-4,
        "batch_size": 32,
        "epochs": 50,
        "data_path": "/data/training"
    },
    scaling_config=ScalingConfig(
        num_workers=4,           # Use all 4 GPU workers
        use_gpu=True,
        resources_per_worker={"GPU": 1}
    ),
    run_config=RunConfig(
        name="transformer-training",
        checkpoint_config=CheckpointConfig(
            # Checkpoint timing is decided in the training loop;
            # checkpoint_frequency is not supported for custom training loops
            num_to_keep=3,       # Keep last 3 checkpoints
        )
    )
)

result = trainer.fit()
print(f"Final loss: {result.metrics['loss']:.4f}")
print(f"Best checkpoint: {result.best_checkpoints[0][0].path}")

Hyperparameter Tuning with Ray Tune

Run dozens or hundreds of training configurations in parallel. Ray Tune handles trial scheduling, early stopping of bad runs, and result aggregation.

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch

search_space = {
    "lr": tune.loguniform(1e-5, 1e-2),
    "batch_size": tune.choice([16, 32, 64, 128]),
    "num_layers": tune.randint(4, 24),
    "d_model": tune.choice([256, 512, 768, 1024]),
    "dropout": tune.uniform(0.0, 0.3),
    "warmup_steps": tune.randint(100, 2000),
}

# ASHA scheduler kills underperforming trials early
scheduler = ASHAScheduler(
    max_t=50,          # Max epochs
    grace_period=5,    # Minimum epochs before early stopping
    reduction_factor=3
)

tuner = tune.Tuner(
    tune.with_resources(train_model, {"gpu": 1}),
    param_space=search_space,
    tune_config=tune.TuneConfig(
        metric="val_loss",
        mode="min",
        num_samples=64,             # 64 total trials
        scheduler=scheduler,
        search_alg=OptunaSearch(),   # Bayesian optimization
        max_concurrent_trials=5     # Use all 5 GPUs simultaneously
    ),
    run_config=tune.RunConfig(
        name="hparam-sweep",
        storage_path="/results/tune"
    )
)

results = tuner.fit()
best = results.get_best_result()
print(f"Best config: {best.config}")
print(f"Best val_loss: {best.metrics['val_loss']:.4f}")
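
To see what the ASHAScheduler settings above imply, the sketch below derives the epochs at which trials are compared and roughly how many of the 64 trials survive each comparison. It is a simplified, synchronous approximation of successive halving written for illustration: asha_rungs and survivors_per_rung are hypothetical helpers, not Ray Tune functions, and because ASHA promotes trials asynchronously the real counts will vary from run to run.

```python
def asha_rungs(grace_period, max_t, reduction_factor):
    """Epoch milestones at which trials are compared and possibly stopped."""
    rungs = []
    milestone = grace_period
    while milestone <= max_t:
        rungs.append(milestone)
        milestone *= reduction_factor
    return rungs


def survivors_per_rung(num_trials, rungs, reduction_factor):
    """Approximate number of trials continuing past each rung (top 1/reduction_factor)."""
    counts = []
    remaining = num_trials
    for _ in rungs:
        remaining = max(1, remaining // reduction_factor)
        counts.append(remaining)
    return counts


rungs = asha_rungs(grace_period=5, max_t=50, reduction_factor=3)
print(rungs)                             # [5, 15, 45]
print(survivors_per_rung(64, rungs, 3))  # [21, 7, 2]
```

Under these assumptions about two thirds of the trials stop after only 5 epochs, which is where most of the GPU-hour savings in a sweep come from.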

Model Serving with Ray Serve

Deploy trained models as auto-scaling HTTP endpoints. Ray Serve handles batching, request queuing, and replica management.

from ray import serve
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,    # Scale up to 4 GPUs under load
        "target_ongoing_requests": 5,
    },
    max_ongoing_requests=10,
)
class LLMDeployment:
    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="cuda"
        )
        self.model.eval()

    async def __call__(self, request) -> dict:
        data = await request.json()
        prompt = data["prompt"]
        max_tokens = data.get("max_tokens", 256)

        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=data.get("temperature", 0.7),
                do_sample=True
            )
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return {"generated_text": response}

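# Deploy: HTTP host and port are set through serve.start(), not serve.run()
serve.start(http_options={"host": "0.0.0.0", "port": 8000})
app = LLMDeployment.bind(model_name="meta-llama/Llama-3.1-8B-Instruct")
serve.run(app)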
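
Once the deployment is up, a client only needs to POST the JSON fields that __call__ reads (prompt, max_tokens, temperature). The sketch below uses only the standard library; build_generate_request and generate are illustrative names, and the URL assumes the deployment is reachable at port 8000 on the default "/" route, which may differ behind io.net's proxy.

```python
import json
import urllib.request


def build_generate_request(prompt, url="http://localhost:8000/",
                           max_tokens=256, temperature=0.7):
    """Build the POST request expected by the deployment's __call__ method."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires the deployment to be running):
# print(generate("Explain Ray Serve in one sentence.", max_tokens=64)["generated_text"])
```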

Data Processing with Ray Data

Preprocess large datasets in parallel across your cluster before feeding them into training.

import ray

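# Read a large dataset; Ray handles partitioning across nodes
ds = ray.data.read_parquet("s3://my-bucket/training-data/")

# Tokenization is CPU-bound, so it runs on CPU workers and leaves GPUs free
class TokenizeBatch:
    def __init__(self):
        from transformers import AutoTokenizer
        # Load the tokenizer once per worker instead of once per batch
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
        if self.tokenizer.pad_token is None:
            # Padding needs a pad token; reuse EOS when the tokenizer has none
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def __call__(self, batch):
        tokens = self.tokenizer(
            list(batch["text"]),
            padding="max_length",
            truncation=True,
            max_length=2048,
            return_tensors="np"
        )
        batch["input_ids"] = tokens["input_ids"]
        batch["attention_mask"] = tokens["attention_mask"]
        return batch

processed = ds.map_batches(
    TokenizeBatch,
    batch_size=256,
    concurrency=8,         # Number of parallel tokenizer workers
    batch_format="numpy"   # Dict of NumPy arrays, so 2D token arrays fit in a column
)

# Write processed data back, or pipe directly to Ray Train
processed.write_parquet("/data/tokenized/")
# Count the input dataset; counting `processed` would rerun the whole pipeline
print(f"Processed {ds.count()} examples across the cluster")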

Scaling Best Practices for Ray on io.net

Running Ray on a decentralized GPU cloud requires some adjustments compared to a single data center. These practices ensure reliability and performance.

Head Node vs Worker Node Sizing

The head node runs Ray's control plane: the Global Control Store (GCS), the autoscaler, the dashboard, and job scheduling. It doesn't need the heaviest GPU, but it needs reliable compute.

Recommended head node specs:

  • 1x GPU (for dashboard metrics and light tasks)
  • 16+ vCPUs (GCS and scheduler are CPU-bound)
  • 64GB+ RAM (object store metadata)
  • 200GB+ disk (logs, checkpoints, dashboard data)

Worker nodes are simpler: maximize GPU power, keep CPU/RAM sufficient for data loading (16+ vCPUs, 64GB+ RAM per GPU). If you're doing data-parallel training, every worker should have identical GPU hardware to avoid stragglers.

Checkpointing for Fault Tolerance

On decentralized infrastructure, node availability is probabilistic rather than guaranteed. Design for interruption:

from ray.train import RunConfig, CheckpointConfig, FailureConfig

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        # Keep multiple checkpoints in case one corrupts
        num_to_keep=5,
    ),
    # Use persistent storage, not local disk
    storage_path="s3://my-bucket/ray-checkpoints/",
    # Enable failure handling
    failure_config=FailureConfig(
        max_failures=3,  # Auto-restart up to 3 times on worker loss
    ),
)

# With a custom training loop, save frequency is set where you call
# ray.train.report(..., checkpoint=...), for example every 2 epochs or every N steps

Key principles:

  • Checkpoint to remote storage (S3, GCS, or a persistent volume), not to the worker's local disk. If the worker goes down, local checkpoints go with it.
  • Checkpoint frequently. On hyperscaler instances, you might checkpoint every 10 epochs. On decentralized infra, checkpoint every 1-2 epochs or every 500 steps.
  • Set max_failures >= 2. This lets Ray Train automatically recover from a worker node loss, provision a replacement, and resume from the latest checkpoint.
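
The effect of checkpoint frequency on lost work can be illustrated with a small simulation that needs no cluster. This is a sketch under stated assumptions rather than Ray Train's actual recovery logic: run_with_restarts is a hypothetical function, failures are injected at fixed epochs, and a JSON file in a temporary directory stands in for remote checkpoint storage.

```python
import json
import os
import tempfile


def run_with_restarts(total_epochs, checkpoint_every, fail_at, max_failures, ckpt_dir):
    """Simulate a job that resumes from its latest checkpoint after a failure.

    fail_at is a set of epochs at which a simulated worker loss occurs once.
    Returns (epochs_executed, failures).
    """
    ckpt_path = os.path.join(ckpt_dir, "latest.json")
    pending_failures = set(fail_at)
    failures = 0
    epochs_executed = 0

    while True:
        # Resume after the last checkpointed epoch, or start from scratch
        start = 0
        if os.path.exists(ckpt_path):
            with open(ckpt_path) as f:
                start = json.load(f)["epoch"] + 1
        try:
            for epoch in range(start, total_epochs):
                if epoch in pending_failures:
                    pending_failures.discard(epoch)
                    raise RuntimeError(f"simulated worker loss at epoch {epoch}")
                epochs_executed += 1
                if (epoch + 1) % checkpoint_every == 0:
                    with open(ckpt_path, "w") as f:
                        json.dump({"epoch": epoch}, f)
            return epochs_executed, failures
        except RuntimeError:
            failures += 1
            if failures > max_failures:
                raise


for every in (2, 10):
    with tempfile.TemporaryDirectory() as ckpt_dir:
        executed, failures = run_with_restarts(10, every, {5}, 3, ckpt_dir)
        print(f"checkpoint_every={every}: ran {executed} epochs, {failures} restart(s)")
# checkpoint_every=2: ran 11 epochs, 1 restart(s)
# checkpoint_every=10: ran 15 epochs, 1 restart(s)
```

With a checkpoint every 2 epochs, a failure at epoch 5 costs one repeated epoch; with a checkpoint only at the end, the same failure repeats five.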

Autoscaling Configuration

io.net supports Ray's built-in autoscaler. Configure it to match your workload pattern:

# Autoscaling policy in cluster config
autoscaling:
  enabled: true
  min_workers: 2           # Always keep 2 workers warm
  max_workers: 16          # Scale up to 16 under heavy load
  idle_timeout_minutes: 10 # Scale down after 10 min idle
  upscaling_speed: 2.0     # In Ray's autoscaler: pending nodes may reach 2x the current node count

  # Resource thresholds
  target_utilization: 0.8  # Scale up when GPU usage > 80%

For hyperparameter tuning, set max_workers equal to your max concurrent trials. For training, keep a fixed worker count (autoscaling during distributed training introduces complexity with gradient synchronization). For serving, autoscaling is essential — set thresholds based on request latency and queue depth.
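
As a rough mental model, the autoscaler sizes the cluster to the GPU demand of running and queued work, clamped to the configured bounds. The function below is a deliberately simplified sketch of that idea, not Ray's or io.net's actual algorithm: target_worker_count and its parameters are hypothetical, and real autoscalers also account for CPU, memory, custom resources, idle timeouts, and upscaling speed.

```python
import math


def target_worker_count(running_gpu_demand, pending_gpu_demand,
                        gpus_per_worker=1, min_workers=2, max_workers=16):
    """Workers needed to satisfy GPU demand, clamped to [min_workers, max_workers]."""
    total_demand = running_gpu_demand + pending_gpu_demand
    needed = math.ceil(total_demand / gpus_per_worker)
    return max(min_workers, min(max_workers, needed))


print(target_worker_count(4, 6))    # 10
print(target_worker_count(0, 0))    # 2
print(target_worker_count(10, 40))  # 16
```

This is why setting max_workers to the maximum number of concurrent trials works well for tuning: each queued single-GPU trial adds one unit of demand until the ceiling is reached.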

Cost Comparison: Ray on io.net vs AWS vs Anyscale

Here's what a real Ray cluster workload costs across three platforms. We'll model three common scenarios.

Scenario 1: 4-Node Training Cluster (48 hours)

4 workers with A100 80GB each, 1 head node.

Component                   | io.net   | AWS (p4d)  | Anyscale
Head node (A100 80GB)       | $1.40/hr | $5.12/hr*  | $3.50/hr
Worker nodes (4x A100 80GB) | $5.60/hr | $20.48/hr* | $14.00/hr
Total compute/hr            | $7.00/hr | $25.60/hr  | $17.50/hr
48-hour total               | $336     | $1,229     | $840
Data egress (500GB)         | $0       | $45        | Included
Final bill                  | $336     | $1,274     | $840

*AWS per-GPU cost derived from p4d.24xlarge ($40.96/hr for 8 A100s = $5.12/GPU/hr).

io.net savings: 74% vs AWS, 60% vs Anyscale.
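
The Scenario 1 figures can be reproduced with a few lines of arithmetic, which also makes it easy to plug in your own cluster size and duration. This is a minimal sketch: cluster_cost and savings_pct are illustrative helpers, the per-GPU rates are the ones from the table above, and the $0.09/GB egress rate is simply what the table's $45 for 500GB implies.

```python
def cluster_cost(gpu_hourly_rate, num_gpus, hours, egress_gb=0, egress_rate_per_gb=0.0):
    """Compute cost plus optional data egress, in dollars."""
    return gpu_hourly_rate * num_gpus * hours + egress_gb * egress_rate_per_gb


def savings_pct(ours, theirs):
    """Percentage saved relative to the more expensive option, rounded."""
    return round(100 * (1 - ours / theirs))


# 1 head node + 4 workers = 5 A100 80GB GPUs for 48 hours
ionet = cluster_cost(1.40, 5, 48)
aws = cluster_cost(5.12, 5, 48, egress_gb=500, egress_rate_per_gb=0.09)
anyscale = cluster_cost(3.50, 5, 48)

print(f"io.net ${ionet:,.2f}, AWS ${aws:,.2f}, Anyscale ${anyscale:,.2f}")
# io.net $336.00, AWS $1,273.80, Anyscale $840.00
print(savings_pct(ionet, aws), savings_pct(ionet, anyscale))
# 74 60
```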

Scenario 2: Hyperparameter Tuning Sweep (8 hours, 32 trials)

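8 GPUs reserved for 8 hours, running 32 trials with ASHA early stopping. Because most trials are stopped early, the sweep uses roughly 60% of the reserved GPU-hours, which works out to about 1.2 GPU-hours per trial on average.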

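Component                                        | io.net  | AWS     | Anyscale
8x A100 80GB x 8 hours                           | $76.80* | $327.68 | $224.00
Effective cost (~60% of reserved GPU-hours used) | ~$45    | ~$195   | ~$134
Estimated bill                                   | ~$45    | ~$195   | ~$134

*io.net figure uses the low end of the A100 range ($1.20/GPU/hr).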

For teams running one sweep per week, that's roughly $180/month on io.net vs $780/month on AWS.

Scenario 3: Ray Serve Inference (Monthly)

2 replicas with autoscaling to 8 replicas, average 4 active. A100 80GB.

Component                    | io.net | AWS     | Anyscale
Average hourly cost (4 GPUs) | $5.60  | $20.48  | $14.00
Monthly (730 hours)          | $4,088 | $14,950 | $10,220

For always-on serving, the monthly savings versus AWS exceed $10,000.

Troubleshooting

Workers not connecting to head node

Symptom: Worker nodes show "running" in io.net dashboard but don't appear in Ray Dashboard.

Fix: Verify the Ray port (default 6379) is accessible between nodes. In io.cloud, check that the cluster networking is set to "internal mesh" mode. If workers were added after initial deployment, they may need the updated head node address:

# On the worker (if SSH access is available)
ray stop
ray start --address="HEAD_NODE_IP:6379"
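
Before restarting Ray, it can help to confirm that the head node's port is reachable from the worker at all. The check below is a small standard-library sketch: port_is_open is an illustrative helper, and HEAD_NODE_IP is the same placeholder used in the command above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


# Example, run on the worker node:
# print(port_is_open("HEAD_NODE_IP", 6379))
```

If this returns False while the head node is running, the problem is in the network path between nodes rather than in Ray itself.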

GPU not visible to Ray

Symptom: ray.cluster_resources() shows GPU: 0 even though GPUs are provisioned.

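Fix: Ensure the NVIDIA drivers are loaded. Ray detects GPUs at startup, so if the drivers or CUDA were installed after Ray started, restart Ray on the affected nodes. To check whether a node can see its GPUs at all, run nvidia-smi from a task that requests no GPU (a task with num_gpus=1 would stay pending while Ray reports zero GPUs):

import subprocess
import ray

# Quick check from a Ray task that needs no GPU resource to be scheduled
@ray.remote(num_cpus=1)
def check_gpu():
    # nvidia-smi talks to the driver directly, independent of CUDA_VISIBLE_DEVICES
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return out.returncode, out.stdout or out.stderr

print(ray.get(check_gpu.remote()))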

Out of object store memory

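Symptom: ObjectStoreFullError or tasks stuck in pending state.

Fix: Increase the object store allocation in your cluster config (the object_store_memory field under ray_config) and redeploy. By default, Ray uses 30% of system memory. The allocation is fixed when Ray starts on a node, so it can't be raised from a driver that connects to a running cluster. When you start Ray yourself, for example in local testing, pass it to ray.init:

ray.init(object_store_memory=40_000_000_000)  # 40GB; only valid when this call starts Ray

Also check for objects that are kept alive longer than needed. Large objects returned from tasks stay in the object store for as long as an ObjectRef points to them, so drop references you no longer need (for example with del object_ref) and Ray's reference counting will free the memory.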

Training job hangs after worker failure

Symptom: Distributed training freezes when one worker drops.

Fix: Allow Ray Train to restart the run automatically after a worker failure:

from ray.train import FailureConfig, RunConfig

run_config = RunConfig(
    failure_config=FailureConfig(max_failures=3)
)

And ensure you're checkpointing to remote storage (not local disk) so the replacement worker can resume from the latest state.

Slow data loading across nodes

Symptom: GPU utilization is low, workers waiting for data.

Fix: Use Ray Data for distributed data loading instead of a single-node DataLoader. Alternatively, stage your data to each worker's local disk before training:

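# Pre-stage data to local disk on each worker
@ray.remote(num_gpus=1)
def stage_data_and_train(data_url, config):
    import subprocess
    # check=True fails the task if the sync fails instead of training on partial data
    subprocess.run(["aws", "s3", "sync", data_url, "/local/data/"], check=True)
    # run_training is a placeholder for your own training entry point
    return run_training("/local/data/", config)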

Frequently Asked Questions

What Ray version does io.net support?

io.net supports Ray 2.x (2.44.0 as of early 2026). The platform pre-installs Ray on cluster nodes, and you can pin a different version with the ray_version field in your cluster YAML.

Can I use Ray with PyTorch FSDP or DeepSpeed on io.net?

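Yes. Ray Train integrates natively with both PyTorch Fully Sharded Data Parallel (FSDP) and DeepSpeed. Since io.net provisions standard NVIDIA GPU instances with CUDA and NCCL, all distributed training strategies work as expected. The sharding strategy is chosen inside the training function when the model is wrapped; TorchConfig only selects the process-group backend:

from torch.distributed.fsdp import ShardingStrategy
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer, prepare_model

def train_func(config):
    model = build_model(config)  # placeholder for your own model constructor
    # Wrap with FSDP instead of the default DDP
    model = prepare_model(
        model,
        parallel_strategy="fsdp",
        parallel_strategy_kwargs={"sharding_strategy": ShardingStrategy.FULL_SHARD},
    )
    ...  # training loop goes here

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    torch_config=TorchConfig(backend="nccl"),
)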

How do I persist checkpoints and results?

Use remote storage (S3-compatible or GCS) as your checkpoint destination. io.net worker local disks are ephemeral. Configure your RunConfig with a remote storage_path:

run_config = RunConfig(
    storage_path="s3://my-bucket/ray-results/",
    checkpoint_config=CheckpointConfig(num_to_keep=5)
)

This ensures checkpoints survive node restarts and cluster teardowns.

Does io.net support NVLink and fast interconnects?

Multi-GPU nodes on io.net that use H100 SXM or A100 SXM hardware have NVLink within the node, just like any other cloud provider. Cross-node communication uses standard networking (TCP/RDMA where available). For workloads requiring heavy all-reduce operations across nodes, the cross-node bandwidth is the bottleneck regardless of provider, so focus on minimizing communication with gradient compression or large batch sizes.
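
To get a feel for why cross-node bandwidth dominates, a back-of-envelope estimate of one gradient all-reduce is useful. The sketch below uses the standard ring all-reduce traffic volume of 2(N-1)/N times the payload per worker; ring_allreduce_seconds is an illustrative helper, the bandwidth values are hypothetical examples rather than io.net measurements, and the estimate ignores latency, protocol overhead, and overlap with computation.

```python
def ring_allreduce_seconds(num_params, bytes_per_param, num_workers, bandwidth_gbit_s):
    """Lower-bound time for one ring all-reduce of the full gradient."""
    if num_workers < 2:
        return 0.0
    payload_bytes = num_params * bytes_per_param
    # Each worker sends and receives 2 * (N - 1) / N of the payload
    traffic_bytes = 2 * (num_workers - 1) / num_workers * payload_bytes
    bandwidth_bytes_s = bandwidth_gbit_s * 1e9 / 8
    return traffic_bytes / bandwidth_bytes_s


# 7B parameters with fp16 gradients across 4 workers
for gbit in (10, 100):
    seconds = ring_allreduce_seconds(7e9, 2, 4, gbit)
    print(f"{gbit} Gbit/s: {seconds:.2f} s per all-reduce")
# 10 Gbit/s: 16.80 s per all-reduce
# 100 Gbit/s: 1.68 s per all-reduce
```

Larger batches or gradient accumulation reduce how often this cost is paid per sample, which is why they help on slower links.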

Can I mix GPU types in a single Ray cluster?

Ray supports heterogeneous clusters, and you can configure this on io.net by specifying different GPU types for different worker groups. This is useful for workloads like hyperparameter tuning (where individual trials don't need top-tier GPUs) or pipelines where preprocessing runs on cheaper GPUs and training runs on H100s:

@ray.remote(num_gpus=1, resources={"A100": 1})
def training_task(data):
    ...

@ray.remote(num_gpus=1, resources={"RTX4090": 1})
def preprocessing_task(raw_data):
    ...

How does billing work for autoscaling clusters?

io.net bills per-minute per node. When the autoscaler adds workers, billing starts when the node is provisioned. When idle workers are removed after the idle_timeout, billing stops. You only pay for compute you're using. There are no minimum commitments or reservation fees for on-demand clusters.
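
The per-minute model can be sketched as a small estimator for an autoscaling run. This is an illustration under assumptions, not io.net's billing code: autoscaling_bill is a hypothetical helper, each tuple is one node's lifetime in minutes, and rounding partial minutes up is an assumption made here for the example.

```python
import math


def autoscaling_bill(node_intervals_min, hourly_rate):
    """Estimate cost when each node is billed per started minute.

    node_intervals_min: list of (start_minute, end_minute), one per node lifetime.
    """
    billed_minutes = sum(math.ceil(end - start) for start, end in node_intervals_min)
    return billed_minutes * hourly_rate / 60


# Two workers kept warm for 2 hours, plus two burst workers for 30.5 minutes each
intervals = [(0, 120), (0, 120), (30, 60.5), (30, 60.5)]
print(f"${autoscaling_bill(intervals, hourly_rate=1.40):.2f}")
# $7.05
```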

Conclusion

Ray is the framework that makes distributed ML practical. io.net is the infrastructure that makes it affordable.

Setting up a Ray cluster on io.net takes under 2 minutes, costs 60-75% less than AWS, and gives you access to 320,000+ GPUs without capacity constraints. Whether you're running distributed training with Ray Train, sweeping hyperparameters with Ray Tune, deploying models with Ray Serve, or processing data with Ray Data, the workflow is the same: configure, deploy, connect, submit.

The combination of Ray's mature distributed computing abstractions and io.net's decentralized GPU supply means you can run ML workloads at scales that would be prohibitively expensive on traditional cloud providers. Stop waiting for GPU quota approvals. Stop overpaying for infrastructure that sits idle between training runs.

Start building on io.net:

  1. Create your io.net account
  2. Deploy a Ray cluster in under 2 minutes
  3. Submit your first distributed training job

Your models won't train themselves — but they shouldn't bankrupt you either.