Setting up a Ray cluster on io.net takes 5-10 minutes using the io.net CLI or pre-configured Ray templates. Deploy a head node, add GPU worker nodes (from 1 to 100+), and run distributed training, hyperparameter tuning, or inference serving. io.net handles cluster networking, GPU scheduling, and auto-scaling for you.
Ray on io.net supports the main Ray libraries (Ray Train, Ray Tune, Ray Serve, RLlib) and integrates with PyTorch, TensorFlow, Hugging Face, and XGBoost for distributed workloads. Costs are 60-70% lower than AWS or GCP, and GPUs are available instantly, with no capacity reservations or waitlists required.
Quick Start: Deploy Ray Cluster
Method 1: io.net CLI (Recommended)
# Install io.net CLI
pip install ionet-cli
io login
# Create Ray cluster with GPU workers
io ray create-cluster \
  --name ml-cluster \
  --head-cpu 8 --head-memory 32GB \
  --workers 4 \
  --worker-gpu A100 \
  --worker-count 4 \
  --worker-memory 80GB \
  --region us-west
# Returns cluster connection info in ~90 seconds:
# Head node: ray://xxx.ionet.cloud:10001
# Dashboard: https://xxx.ionet.cloud:8265
# Get connection string
io ray get-address ml-cluster
# RAY_ADDRESS=ray://xxx.ionet.cloud:10001
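Scripts that connect to the cluster can read the address from that output instead of hard-coding it. The following is a minimal sketch, assuming the `RAY_ADDRESS=ray://host:port` format shown above and the dashboard port 8265 from the cluster output; `parse_ray_address` is a hypothetical helper, not part of the io.net CLI or Ray.

```python
from urllib.parse import urlparse


def parse_ray_address(line: str) -> tuple[str, int]:
    """Split 'RAY_ADDRESS=ray://host:port' (or a bare ray:// URL) into (host, port)."""
    url = line.strip().removeprefix("RAY_ADDRESS=")
    parsed = urlparse(url)
    if parsed.scheme != "ray" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"not a ray:// address: {line!r}")
    return parsed.hostname, parsed.port


host, port = parse_ray_address("RAY_ADDRESS=ray://xxx.ionet.cloud:10001")
# Assumption: the dashboard is served from the same host on port 8265.
dashboard_url = f"https://{host}:8265"
```

The parsed host can then be passed to `ray.init(address=f"ray://{host}:{port}")` in the examples that follow.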
Method 2: Manual Deployment
# 1. Deploy head node
io deploy --image rayproject/ray:latest-gpu \
  --cpu 8 --memory 32GB \
  --port 10001 --port 8265 \
  --command "ray start --head --port=10001 --dashboard-host=0.0.0.0 --dashboard-port=8265" \
  --name ray-head
# Get head node address
HEAD_ADDRESS=$(io get-ip ray-head)
# 2. Deploy worker nodes (4x A100)
for i in {1..4}; do
  io deploy --image rayproject/ray:latest-gpu \
    --gpu A100 --memory 80GB \
    --command "ray start --address=$HEAD_ADDRESS:10001" \
    --name ray-worker-$i
done
# 3. Verify cluster
io exec ray-head -- ray status
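Beyond reading `ray status` by eye, a client can check programmatically that every worker joined. This sketch assumes the resource totals come from `ray.cluster_resources()`, which returns a dict such as `{"CPU": 8.0, "GPU": 4.0}`. The sample values and the `missing_resources` helper are illustrative only.

```python
def missing_resources(
    actual: dict[str, float], expected: dict[str, float]
) -> dict[str, float]:
    """Return the shortfall for each resource whose total is below the expected amount."""
    return {
        name: want - actual.get(name, 0.0)
        for name, want in expected.items()
        if actual.get(name, 0.0) < want
    }


# On a connected client this dict would come from ray.cluster_resources();
# the values below are made-up stand-ins where one of four workers has not joined yet.
sample = {"CPU": 8.0, "GPU": 3.0}
shortfall = missing_resources(sample, {"GPU": 4.0})
```

An empty result means the cluster has at least the expected resources. A non-empty one names what is still missing.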
Distributed Training with Ray Train
Example: Fine-tune Llama 3 8B with Ray Train
# train.py
import ray
from ray import train
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Connect to Ray cluster
ray.init(address="ray://xxx.ionet.cloud:10001")

def train_func(config):
    # Load model on each worker
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",
        torch_dtype=torch.float16,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
    # Training configuration
    training_args = TrainingArguments(
        output_dir="/workspace/output",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-5,
        logging_steps=10,
        save_steps=500
    )
    # Distributed training with Ray
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=config["dataset"],
        max_seq_length=2048
    )
    trainer.train()
    # Save model
    trainer.save_model("/workspace/output/final_model")

# Configure distributed training
scaling_config = ScalingConfig(
    num_workers=4,  # 4 GPUs
    use_gpu=True,
    resources_per_worker={"GPU": 1}
)
# Create TorchTrainer
# your_dataset is a placeholder for a dataset you have prepared beforehand
trainer = TorchTrainer(
    train_func,
    scaling_config=scaling_config,
    train_loop_config={"dataset": your_dataset}
)
# Run distributed training
result = trainer.fit()
Run on cluster:
# Submit training job to Ray cluster
python train.py
# Monitor progress in Ray dashboard
# https://xxx.ionet.cloud:8265
Cost: 4x A100 × 6 hours × $1.10/GPU-hour = $26.40 (vs. AWS: $96-120)
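The arithmetic behind that figure can be reproduced in a few lines. This is a sketch that assumes the $1.10 per A100-hour rate quoted in the cost section of this guide. The helper names are hypothetical.

```python
def training_cost(num_gpus: int, hours: float, rate_per_gpu_hour: float) -> float:
    """Total cost in dollars: GPUs x wall-clock hours x hourly rate per GPU."""
    return round(num_gpus * hours * rate_per_gpu_hour, 2)


def implied_rate(total_cost: float, num_gpus: int, hours: float) -> float:
    """Per-GPU hourly rate implied by a total bill."""
    return total_cost / (num_gpus * hours)


ionet_cost = training_cost(num_gpus=4, hours=6, rate_per_gpu_hour=1.10)
# The quoted AWS range of $96-120 for the same 24 GPU-hours implies these rates:
aws_low, aws_high = implied_rate(96, 4, 6), implied_rate(120, 4, 6)
```

Under these numbers the AWS comparison corresponds to roughly $4 to $5 per GPU-hour.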
Hyperparameter Tuning with Ray Tune
# tune_experiment.py
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

ray.init(address="ray://xxx.ionet.cloud:10001")

def train_model(config):
    # Load model; hyperparameters from config are applied in TrainingArguments
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=2
    )
    training_args = TrainingArguments(
        output_dir="/workspace/tune_output",
        learning_rate=config["lr"],
        per_device_train_batch_size=config["batch_size"],
        num_train_epochs=3,
        weight_decay=config["weight_decay"]
    )
    # train_dataset and eval_dataset are placeholders for tokenized datasets
    # prepared beforehand
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset
    )
    trainer.train()
    # trainer.train() does not return eval metrics, so evaluate explicitly
    eval_metrics = trainer.evaluate()
    # Report metrics to Ray Tune (keyword form shown; recent Ray releases
    # expect a dict of metrics instead)
    tune.report(eval_loss=eval_metrics["eval_loss"])

# Define search space
config = {
    "lr": tune.loguniform(1e-5, 1e-3),
    "batch_size": tune.choice([16, 32, 64]),
    "weight_decay": tune.uniform(0.0, 0.3)
}
# Configure scheduler (early stopping for bad trials)
scheduler = ASHAScheduler(
    metric="eval_loss",
    mode="min",
    max_t=10,
    grace_period=1,
    reduction_factor=2
)
# Run hyperparameter search
analysis = tune.run(
    train_model,
    config=config,
    num_samples=20,  # Try 20 configurations
    scheduler=scheduler,
    resources_per_trial={"gpu": 1},  # 1 GPU per trial
    verbose=1
)
# Get best configuration
best_config = analysis.get_best_config(metric="eval_loss", mode="min")
print(f"Best config: {best_config}")
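To see what the scheduler settings imply, it helps to list the iteration counts at which trials get compared. This sketch assumes the usual successive-halving rule, with rungs at `grace_period × reduction_factor^k` up to `max_t`. It is an illustration of that rule, not a copy of Ray's implementation, and `asha_rungs` is a hypothetical helper.

```python
import math


def asha_rungs(grace_period: int, reduction_factor: int, max_t: int) -> list[int]:
    """Iteration counts at which successive halving compares trials."""
    num_rungs = int(math.log(max_t / grace_period) / math.log(reduction_factor)) + 1
    return [grace_period * reduction_factor**k for k in range(num_rungs)]


# Same values as the ASHAScheduler configuration in tune_experiment.py.
rungs = asha_rungs(grace_period=1, reduction_factor=2, max_t=10)
```

With `reduction_factor=2`, roughly the better half of the trials reaching a rung continues past it. A trial can only be stopped at a rung if it reports a metric at that iteration.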
Run experiment:
python tune_experiment.py
# Ray Tune automatically distributes trials across 4 GPUs
# 20 trials complete in ~45 minutes (vs. 3 hours sequentially)
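The ~45 minute estimate follows from running trials in waves of four. This is a rough sketch, assuming equal-length trials, one GPU per trial, and no early stopping or startup overhead. `wall_clock_minutes` is a hypothetical helper.

```python
import math


def wall_clock_minutes(num_trials: int, minutes_per_trial: float, num_gpus: int) -> float:
    """Wall-clock time when trials run in waves of num_gpus at a time."""
    waves = math.ceil(num_trials / num_gpus)
    return waves * minutes_per_trial


minutes_per_trial = 180 / 20  # 3 hours sequential spread over 20 trials
parallel_minutes = wall_clock_minutes(20, minutes_per_trial, num_gpus=4)
```

Real runs land slightly above this ideal because of scheduling and model-loading overhead.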
Inference Serving with Ray Serve
# serve_llama.py
import ray
from ray import serve
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

ray.init(address="ray://xxx.ionet.cloud:10001")
serve.start()

@serve.deployment(
    num_replicas=4,  # 4 replicas across 4 GPUs
    ray_actor_options={"num_gpus": 1}
)
class LlamaModel:
    def __init__(self):
        self.model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Meta-Llama-3-8B-Instruct",
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

    async def __call__(self, request):
        data = await request.json()
        prompt = data["prompt"]
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = self.model.generate(**inputs, max_new_tokens=512)
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return {"response": response}

# Deploy model (Ray 2.x API; LlamaModel.deploy() was removed in recent releases).
# route_prefix keeps the /LlamaModel path used by the curl test.
serve.run(LlamaModel.bind(), route_prefix="/LlamaModel")
# Access at: http://xxx.ionet.cloud:8000
Test API:
curl -X POST http://xxx.ionet.cloud:8000/LlamaModel \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain machine learning in simple terms"}'
Performance: 4 replicas handle 400+ requests/minute at roughly 110 ms latency (see the inference serving benchmark below)
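The same request can be issued from Python using only the standard library. This sketch assumes the endpoint path `/LlamaModel` and the `{"prompt": ...}` / `{"response": ...}` JSON shapes from the example above. `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request


def build_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/LlamaModel",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt and return the 'response' field of the JSON reply."""
    request = build_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


req = build_request("http://xxx.ionet.cloud:8000", "Explain machine learning in simple terms")
```

Calling `generate(...)` requires a reachable deployment. Building the request separately makes the payload easy to inspect without one.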
Auto-Scaling Ray Cluster
Configure automatic scaling based on workload:
# cluster_config.yaml
cluster_name: autoscale-cluster
min_workers: 2
max_workers: 20
upscaling_speed: 1.0
downscaling_mode: aggressive
available_node_types:
  ray.worker:
    resources: {"GPU": 1}
    node_config:
      GPU: A100
      memory: 80GB
    min_workers: 2
    max_workers: 20
Deploy with auto-scaling:
io ray create-cluster \
  --config cluster_config.yaml \
  --autoscale \
  --min-workers 2 \
  --max-workers 20
# Cluster scales automatically:
# - Scale up: When Ray tasks wait for available GPUs
# - Scale down: When GPU utilization < 50% for 5+ minutes
# - New workers provisioned in <2 minutes
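The scaling rules listed above can be expressed as a small decision function. This is only an illustration of those rules, not io.net's actual autoscaler. The step sizes (one worker per waiting GPU task, one worker removed at a time) and the function name are assumptions.

```python
def desired_workers(
    current: int,
    pending_gpu_tasks: int,
    gpu_utilization: float,
    low_util_minutes: float,
    min_workers: int = 2,
    max_workers: int = 20,
) -> int:
    """Pick a target worker count from the scale-up and scale-down rules."""
    if pending_gpu_tasks > 0:
        # Scale up: tasks are waiting for a GPU (assumed one worker per task).
        target = current + pending_gpu_tasks
    elif gpu_utilization < 0.50 and low_util_minutes >= 5:
        # Scale down: utilization below 50% for at least 5 minutes.
        target = current - 1
    else:
        target = current
    # Never leave the configured [min_workers, max_workers] range.
    return max(min_workers, min(max_workers, target))
```

For example, `desired_workers(4, 3, 0.9, 0)` asks for three more workers, while a cluster already at `min_workers` stays put however idle it is.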
Distributed Data Processing with Ray
# data_pipeline.py
import ray
import pandas as pd

ray.init(address="ray://xxx.ionet.cloud:10001")

@ray.remote(num_gpus=1)
def process_batch(batch_data):
    # GPU-accelerated data processing
    import cudf  # GPU DataFrame library
    # Convert the pandas batch to a GPU DataFrame
    gdf = cudf.from_pandas(batch_data)
    # Apply transformations
    gdf['processed'] = gdf['text'].str.lower().str.strip()
    # Return to CPU
    return gdf.to_pandas()

# Load large dataset (100GB)
df = pd.read_parquet("s3://my-bucket/large_dataset.parquet")
# Split into batches of 10,000 rows
batches = [df.iloc[i:i+10000] for i in range(0, len(df), 10000)]
# Process in parallel across all GPUs
futures = [process_batch.remote(batch) for batch in batches]
results = ray.get(futures)
# Combine results
processed_df = pd.concat(results)
# Processing time: 8 minutes on 4x A100 (vs. 45 minutes on single GPU)
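The batching step decides how many Ray tasks are created, so it is worth checking in isolation. This sketch mirrors the 10,000-row slicing used above and shows that the final batch is shorter when the row count is not a multiple of the batch size. `batch_ranges` is a hypothetical helper.

```python
def batch_ranges(n_rows: int, batch_size: int) -> list[tuple[int, int]]:
    """Half-open (start, stop) row ranges covering n_rows in batch_size steps."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]


ranges = batch_ranges(25_000, 10_000)
num_tasks = len(ranges)  # one Ray task per batch
```

Each `(start, stop)` pair corresponds to one slice `df.iloc[start:stop]` and therefore to one `process_batch` task.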
Monitoring Ray Cluster
Ray Dashboard:
- Access at: https://xxx.ionet.cloud:8265
- View: Active tasks, resource utilization, worker status
- Metrics: GPU usage, memory, CPU, network
Command-Line Monitoring:
# Cluster status
io ray status ml-cluster
# GPU utilization across workers
io exec ray-head -- ray exec "nvidia-smi" --workers
# View logs
io logs ray-head
io logs ray-worker-1
# Resource availability
io exec ray-head -- ray status --resources
Custom Metrics:
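from ray.util.metrics import Counter, Histogram
# Define custom metrics
inference_counter = Counter("inference_requests", description="Total inference requests")
# Histogram requires explicit bucket boundaries; these millisecond values are illustrative
latency_histogram = Histogram(
    "inference_latency",
    description="Inference latency in ms",
    boundaries=[50, 100, 250, 500, 1000],
)
# Record metrics in your application
inference_counter.inc()
latency_histogram.observe(latency_ms)  # latency_ms: measured latency of one request
# View in Ray dashboard or Prometheus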
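A latency histogram groups observations into buckets, and the value fed to it has to be measured somewhere. This sketch times a call in milliseconds and shows which bucket a measurement would fall into. The boundary values, the upper-inclusive bucket convention, and both helper names are assumptions for illustration.

```python
import bisect
import time

BOUNDARIES_MS = [50, 100, 250, 500, 1000]  # illustrative bucket upper bounds


def bucket_index(latency_ms: float, boundaries: list[float] = BOUNDARIES_MS) -> int:
    """Index of the first boundary >= latency_ms; len(boundaries) means overflow."""
    return bisect.bisect_left(boundaries, latency_ms)


def timed_ms(fn, *args, **kwargs):
    """Call fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0


result, latency_ms = timed_ms(sum, range(1000))
```

The measured `latency_ms` is the kind of value a call like `latency_histogram.observe(...)` expects.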
Cost Optimization Strategies
1. Use Spot-Like Pricing (io.net on-demand vs. AWS spot):
io.net on-demand A100: $1.10/hour
AWS spot A100: $4.99-6.98/hour (still 4.5-6.3x more expensive)
io.net is already cheaper than AWS spot instances
2. Auto-Scaling:
# Scale down to 0 workers during idle periods
io ray configure ml-cluster --min-workers 0 --idle-timeout 5m
# Saves: $1.10/hour per idle worker
3. Right-Size GPU Selection:
Small models (<7B): RTX 4090 ($0.18/hr) vs. A100 ($1.10/hr) = about 84% savings
Large models (70B+): H100 ($1.49/hr) vs. 8x A100 ($8.80/hr) = 83% savings
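The savings percentages follow directly from the hourly prices. A short sketch using only the rates quoted above, where `savings_percent` is a hypothetical helper:

```python
def savings_percent(cheaper_rate: float, baseline_rate: float) -> float:
    """Percentage saved by paying cheaper_rate instead of baseline_rate."""
    return (1 - cheaper_rate / baseline_rate) * 100


small_model = savings_percent(0.18, 1.10)      # RTX 4090 vs. one A100
large_model = savings_percent(1.49, 8 * 1.10)  # one H100 vs. eight A100s
```

These are price comparisons only. Whether the cheaper GPU has enough memory and throughput for a given model has to be checked separately.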
Performance Benchmarks
Distributed Training (Llama 3 8B, 10K samples):
| Configuration | Time | Cost | Scaling Efficiency |
|---|---|---|---|
| 1x A100 | 6 hours | $6.60 | 100% |
| 2x A100 | 3.2 hours | $7.04 | 94% |
| 4x A100 | 1.7 hours | $7.48 | 88% |
| 8x A100 | 1.0 hours | $8.80 | 75% |
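The Scaling Efficiency column is consistent with the usual definition, single-GPU time divided by (GPU count × multi-GPU time). A sketch that recomputes it from the Time column, assuming that definition is the one the table uses:

```python
def scaling_efficiency(t1_hours: float, n_gpus: int, tn_hours: float) -> float:
    """Fraction of ideal linear speedup achieved on n_gpus."""
    return t1_hours / (n_gpus * tn_hours)


# GPU count -> training time in hours, taken from the table above.
times = {1: 6.0, 2: 3.2, 4: 1.7, 8: 1.0}
efficiency_pct = {
    n: round(scaling_efficiency(times[1], n, t) * 100) for n, t in times.items()
}
```

Cost rises slowly with GPU count because the loss in efficiency is the only thing that increases total GPU-hours.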
Hyperparameter Tuning (20 trials, BERT fine-tuning):
| Setup | Time | Cost |
|---|---|---|
| Sequential (1 GPU) | 180 minutes | $3.30 |
| Ray Tune (4 GPUs) | 48 minutes | $3.52 |
| Speedup | 3.75x | Minimal cost increase |
Inference Serving (Llama 3 8B, 10K requests):
| Setup | Throughput | Latency | Cost |
|---|---|---|---|
| Single GPU | 120 req/min | 450ms | $1.10/hr |
| Ray Serve (4 GPUs) | 480 req/min | 110ms | $4.40/hr |
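Throughput and hourly price combine into a cost per request, which makes the two rows easier to compare. This sketch assumes the replicas stay fully loaded for the whole hour. `cost_per_1k_requests` is a hypothetical helper.

```python
def cost_per_1k_requests(dollars_per_hour: float, requests_per_minute: float) -> float:
    """Dollars per 1,000 requests at sustained throughput."""
    requests_per_hour = requests_per_minute * 60
    return dollars_per_hour / requests_per_hour * 1000


single_gpu = cost_per_1k_requests(1.10, 120)
ray_serve_4 = cost_per_1k_requests(4.40, 480)
```

With the table's numbers both setups cost the same per request. The four-GPU deployment buys throughput and lower latency, not a lower unit cost.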
Deploy Ray clusters on io.net with instant GPU access and 70% cost savings vs. AWS.
