Setting up a Ray cluster on io.net takes 5-10 minutes with the io.net CLI or a pre-configured Ray template. Deploy a head node, add GPU worker nodes (from 1 to 100+), and run distributed training, hyperparameter tuning, or inference serving. io.net handles cluster networking, GPU scheduling, and auto-scaling.

Ray on io.net supports the core Ray libraries (Ray Train, Ray Tune, Ray Serve, RLlib) and integrates with PyTorch, TensorFlow, Hugging Face, and XGBoost for distributed workloads. Costs are 60-70% lower than AWS or GCP, and GPUs are available immediately, with no capacity reservations or waitlists.

Quick Start: Deploy Ray Cluster

Method 1: io.net CLI (Recommended)

# Install io.net CLI
pip install ionet-cli
io login

# Create Ray cluster with GPU workers
io ray create-cluster \
  --name ml-cluster \
  --head-cpu 8 --head-memory 32GB \
  --worker-gpu A100 \
  --worker-count 4 \
  --worker-memory 80GB \
  --region us-west

# Returns cluster connection info in ~90 seconds:
# Head node: ray://xxx.ionet.cloud:10001
# Dashboard: https://xxx.ionet.cloud:8265

# Get connection string
io ray get-address ml-cluster
# RAY_ADDRESS=ray://xxx.ionet.cloud:10001
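
The scripts later in this guide hard-code the ray:// address. As a minimal sketch, assuming the RAY_ADDRESS line printed above has been exported into the environment, the address could be read and validated instead. `resolve_ray_address` is a hypothetical helper for illustration, not part of Ray or the io.net CLI:

```python
import os
from urllib.parse import urlparse


def resolve_ray_address(env=None, default="ray://xxx.ionet.cloud:10001"):
    """Return (address, host, port) for a Ray Client address.

    Hypothetical helper: reads RAY_ADDRESS from the environment (or from the
    mapping passed in) and falls back to the placeholder used in this guide.
    """
    env = os.environ if env is None else env
    address = env.get("RAY_ADDRESS", default)
    parsed = urlparse(address)
    # A Ray Client address has the form ray://host:port
    if parsed.scheme != "ray" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected ray://host:port, got {address!r}")
    return address, parsed.hostname, parsed.port
```

A script would then call `ray.init(address=resolve_ray_address()[0])` rather than embedding the hostname.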

Method 2: Manual Deployment

# 1. Deploy head node
# 6379 is Ray's GCS port (workers join here); 10001 is the Ray Client port
# --block keeps ray in the foreground so the container stays up
io deploy --image rayproject/ray:latest-gpu \
  --cpu 8 --memory 32GB \
  --port 6379 --port 10001 --port 8265 \
  --command "ray start --head --port=6379 --ray-client-server-port=10001 --dashboard-host=0.0.0.0 --dashboard-port=8265 --block" \
  --name ray-head

# Get head node address
HEAD_ADDRESS=$(io get-ip ray-head)

# 2. Deploy worker nodes (4x A100)
for i in {1..4}; do
  io deploy --image rayproject/ray:latest-gpu \
    --gpu A100 --memory 80GB \
    --command "ray start --address=$HEAD_ADDRESS:6379 --block" \
    --name "ray-worker-$i"
done

# 3. Verify cluster
io exec ray-head -- ray status

Distributed Training with Ray Train

Example: Fine-tune Llama 3 8B with Ray Train

# train.py
import ray
from ray import train
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Connect to Ray cluster
ray.init(address="ray://xxx.ionet.cloud:10001")

def train_func(config):
    # Load model on each worker
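    # Ray Train gives each worker one GPU and the Trainer moves the model
    # onto it, so device_map="auto" is not needed here
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",
        torch_dtype=torch.float16
    )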

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

    # Training configuration
    training_args = TrainingArguments(
        output_dir="/workspace/output",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-5,
        logging_steps=10,
        save_steps=500
    )

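    # Build the Hugging Face trainer, then hand it to Ray Train
    from ray.train.huggingface.transformers import (
        RayTrainReportCallback,
        prepare_trainer,
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=config["dataset"],
        max_seq_length=2048
    )
    trainer.add_callback(RayTrainReportCallback())  # report metrics and checkpoints to Ray
    trainer = prepare_trainer(trainer)  # validate the trainer setup for Ray Train

    trainer.train()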

    # Save model
    trainer.save_model("/workspace/output/final_model")

# Configure distributed training
scaling_config = ScalingConfig(
    num_workers=4,  # 4 GPUs
    use_gpu=True,
    resources_per_worker={"GPU": 1}
)

# Create TorchTrainer
trainer = TorchTrainer(
    train_func,
    scaling_config=scaling_config,
    train_loop_config={"dataset": your_dataset}  # your_dataset: placeholder for a dataset you have already loaded
)

# Run distributed training
result = trainer.fit()
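
The settings above interact: each optimizer step consumes per_device_train_batch_size × gradient_accumulation_steps × num_workers samples. The sketch below works through that arithmetic. The 10K-sample dataset size is borrowed from the benchmark section and is an assumption here, and exact step rounding differs between library versions:

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_workers):
    """Samples consumed per optimizer step across all data-parallel workers."""
    return per_device_batch * grad_accum_steps * num_workers


def optimizer_steps(num_samples, global_batch, epochs=1):
    """Approximate optimizer steps; exact rounding varies by library version."""
    return math.ceil(num_samples / global_batch) * epochs


global_batch = effective_batch_size(4, 4, 4)                # values from train.py
total_steps = optimizer_steps(10_000, global_batch, epochs=3)
```

Under these assumptions the global batch is 64 and three epochs take about 471 optimizer steps, which is fewer than save_steps=500, so the periodic checkpoint would not fire and only the final save_model call would write the model.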

Run on cluster:

# Submit training job to Ray cluster
python train.py

# Monitor progress in Ray dashboard
# https://xxx.ionet.cloud:8265

Cost: 4x A100 × 6 hours × $1.10/GPU-hour = $26.40 (vs. AWS: $96-120)
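
The cost figure is GPUs × hours × hourly rate. A small sketch of that calculation, assuming the $1.10/hour A100 rate quoted in the pricing section of this guide (`cluster_cost` is an illustrative name):

```python
def cluster_cost(num_gpus, hours, rate_per_gpu_hour=1.10):
    """Total cost in dollars, rounded to cents."""
    return round(num_gpus * hours * rate_per_gpu_hour, 2)


training_cost = cluster_cost(num_gpus=4, hours=6)  # the 4x A100, 6 hour run above
```

The same function reproduces the cost column of the benchmark tables, for example `cluster_cost(8, 1.0)` for the 8x A100 row.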

Hyperparameter Tuning with Ray Tune

# tune_experiment.py
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

ray.init(address="ray://xxx.ionet.cloud:10001")

def train_model(config):
    # Load model with hyperparameters from config
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=2
    )

    training_args = TrainingArguments(
        output_dir="/workspace/tune_output",
        learning_rate=config["lr"],
        per_device_train_batch_size=config["batch_size"],
        num_train_epochs=3,
        weight_decay=config["weight_decay"]
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset
    )

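    trainer.train()

    # eval_loss comes from evaluate(); train() only returns training metrics
    eval_metrics = trainer.evaluate()

    # Report metrics to Ray Tune (recent Ray releases take a dict;
    # older releases used tune.report(eval_loss=...))
    tune.report({"eval_loss": eval_metrics["eval_loss"]})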

# Define search space
config = {
    "lr": tune.loguniform(1e-5, 1e-3),
    "batch_size": tune.choice([16, 32, 64]),
    "weight_decay": tune.uniform(0.0, 0.3)
}

# Configure scheduler (early stopping for bad trials)
scheduler = ASHAScheduler(
    metric="eval_loss",
    mode="min",
    max_t=10,
    grace_period=1,
    reduction_factor=2
)

# Run hyperparameter search
analysis = tune.run(
    train_model,
    config=config,
    num_samples=20,  # Try 20 configurations
    scheduler=scheduler,
    resources_per_trial={"gpu": 1},  # 1 GPU per trial
    verbose=1
)

# Get best configuration
best_config = analysis.get_best_config(metric="eval_loss", mode="min")
print(f"Best config: {best_config}")
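
The scheduler settings above (grace_period=1, max_t=10, reduction_factor=2) determine when trials are compared against each other. The following is a simplified sketch of the successive-halving milestones those settings imply. It is an illustration, not Ray's internal code, and `asha_milestones` is a hypothetical name:

```python
def asha_milestones(grace_period, max_t, reduction_factor):
    """Iterations at which trials are compared and the weakest are stopped."""
    if grace_period <= 0 or reduction_factor <= 1:
        raise ValueError("grace_period must be > 0 and reduction_factor > 1")
    milestones = []
    t = grace_period
    # Each milestone is reduction_factor times the previous one, up to max_t
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones


milestones = asha_milestones(grace_period=1, max_t=10, reduction_factor=2)
```

Early stopping only takes effect when a trial reports a metric at each iteration. A training function that reports once at the end, as in the script above, gives the scheduler a single data point per trial, so reporting after every epoch would be needed for the milestones to matter.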

Run experiment:

python tune_experiment.py

# Ray Tune automatically distributes trials across 4 GPUs
# 20 trials complete in ~45 minutes (vs. 3 hours sequentially)
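
The ~45 minute estimate follows from running trials in waves of four. A rough sketch, assuming every trial takes the same 9 minutes (180 minutes / 20 trials) and that no trial is stopped early:

```python
import math


def estimated_wall_clock(num_trials, minutes_per_trial, num_gpus, gpus_per_trial=1):
    """Wall-clock minutes when trials run in equal-length waves."""
    concurrent = num_gpus // gpus_per_trial
    if concurrent < 1:
        raise ValueError("not enough GPUs to run a single trial")
    waves = math.ceil(num_trials / concurrent)
    return waves * minutes_per_trial


sequential = estimated_wall_clock(20, 9, num_gpus=1)  # one trial at a time
parallel = estimated_wall_clock(20, 9, num_gpus=4)    # four trials at a time
```

This gives 180 minutes sequentially and 45 minutes on 4 GPUs. The benchmark table later in this guide reports 48 minutes, and the difference is plausibly scheduling and startup overhead, which this estimate ignores.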

Inference Serving with Ray Serve

# serve_llama.py
import ray
from ray import serve
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

ray.init(address="ray://xxx.ionet.cloud:10001")
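# Bind the HTTP proxy to all interfaces so it is reachable from outside the node
serve.start(http_options={"host": "0.0.0.0", "port": 8000})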

@serve.deployment(
    num_replicas=4,  # 4 replicas across 4 GPUs
    ray_actor_options={"num_gpus": 1}
)
class LlamaModel:
    def __init__(self):
        self.model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Meta-Llama-3-8B-Instruct",
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

    async def __call__(self, request):
        data = await request.json()
        prompt = data["prompt"]

        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = self.model.generate(**inputs, max_new_tokens=512)
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

        return {"response": response}

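# Deploy model (serve.run replaces the older .deploy() call, removed in Ray 2.x)
serve.run(LlamaModel.bind(), route_prefix="/LlamaModel")

# Access at: http://xxx.ionet.cloud:8000/LlamaModel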

Test API:

curl -X POST http://xxx.ionet.cloud:8000/LlamaModel \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain machine learning in simple terms"}'
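
The same request can be built from Python with the standard library. This is a sketch that mirrors the curl call above. The host is the same placeholder, and `build_generate_request` is a hypothetical helper name:

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://xxx.ionet.cloud:8000",
                           route="/LlamaModel"):
    """Build (but do not send) a POST request matching the curl example."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + route,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it against a live endpoint:
#   with urllib.request.urlopen(req, timeout=60) as resp:
#       print(json.load(resp)["response"])
req = build_generate_request("Explain machine learning in simple terms")
```

Building the request separately from sending it makes the payload easy to inspect before pointing it at a real cluster.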

Performance: 4 replicas handle 400+ requests/minute at roughly 110 ms latency

Auto-Scaling Ray Cluster

Configure automatic scaling based on workload:

# cluster_config.yaml
cluster_name: autoscale-cluster

min_workers: 2
max_workers: 20

upscaling_speed: 1.0
downscaling_mode: aggressive

available_node_types:
  ray.worker:
    resources: {"GPU": 1}
    node_config:
      GPU: A100
      memory: 80GB
    min_workers: 2
    max_workers: 20

Deploy with auto-scaling:

io ray create-cluster \
  --config cluster_config.yaml \
  --autoscale \
  --min-workers 2 \
  --max-workers 20

# Cluster scales automatically:
# - Scale up: When Ray tasks wait for available GPUs
# - Scale down: When GPU utilization < 50% for 5+ minutes
# - New workers provisioned in <2 minutes
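
The scaling rules in the comments above can be sketched as a pure function. This illustrates the stated thresholds only. It is not io.net's or Ray's autoscaler, and adding one worker per waiting GPU task and removing one worker at a time are assumptions:

```python
def desired_workers(current, pending_gpu_tasks, gpu_utilization, low_util_minutes,
                    min_workers=2, max_workers=20):
    """Return the target worker count under the rules listed above."""
    if pending_gpu_tasks > 0:
        # Scale up: tasks are waiting for GPUs (assumes 1 GPU per worker)
        target = current + pending_gpu_tasks
    elif gpu_utilization < 0.50 and low_util_minutes >= 5:
        # Scale down: utilization below 50% for at least 5 minutes
        target = current - 1
    else:
        target = current
    # Stay within the min_workers/max_workers bounds from cluster_config.yaml
    return max(min_workers, min(max_workers, target))
```

For example, a 4-worker cluster with 3 waiting tasks would grow to 7 workers, while an 18-worker cluster with 10 waiting tasks is capped at the max_workers value of 20.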

Distributed Data Processing with Ray

# data_pipeline.py
import ray
import pandas as pd

ray.init(address="ray://xxx.ionet.cloud:10001")

@ray.remote(num_gpus=1)
def process_batch(batch_data):
    # GPU-accelerated data processing
    import cudf  # GPU DataFrame library

    # Convert to GPU DataFrame
    gdf = cudf.DataFrame(batch_data)

    # Apply transformations
    gdf['processed'] = gdf['text'].str.lower().str.strip()

    # Return to CPU
    return gdf.to_pandas()

# Load large dataset (100GB)
df = pd.read_parquet("s3://my-bucket/large_dataset.parquet")

# Split into batches
batches = [df[i:i+10000] for i in range(0, len(df), 10000)]

# Process in parallel across all GPUs
futures = [process_batch.remote(batch) for batch in batches]
results = ray.get(futures)

# Combine results
processed_df = pd.concat(results)

# Processing time: 8 minutes on 4x A100 (vs. 45 minutes on single GPU)
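
The batch split in the pipeline above decides how many Ray tasks are created. A small sketch of the row ranges it produces, using the same 10,000-row batch size (`batch_bounds` is an illustrative name):

```python
def batch_bounds(num_rows, batch_size=10_000):
    """Yield (start, stop) row ranges; the final range may be shorter."""
    for start in range(0, num_rows, batch_size):
        yield start, min(start + batch_size, num_rows)


ranges = list(batch_bounds(25_000))  # three tasks, the last covering 5,000 rows
```

Each range becomes one `process_batch` task, so the batch size trades per-task overhead against how evenly work spreads across GPUs.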

Monitoring Ray Cluster

Ray Dashboard:
- Access at: https://xxx.ionet.cloud:8265
- View: Active tasks, resource utilization, worker status
- Metrics: GPU usage, memory, CPU, network

Command-Line Monitoring:

# Cluster status
io ray status ml-cluster

# GPU utilization on each worker
for i in {1..4}; do io exec ray-worker-$i -- nvidia-smi; done

# View logs
io logs ray-head
io logs ray-worker-1

# Resource availability (ray status lists total and used CPU, GPU, and memory)
io exec ray-head -- ray status

Custom Metrics:

from ray.util.metrics import Counter, Histogram

# Define custom metrics
inference_counter = Counter("inference_requests", "Total inference requests")
latency_histogram = Histogram(
    "inference_latency",
    "Inference latency in ms",
    boundaries=[10, 50, 100, 250, 500, 1000],  # bucket edges in ms (required)
)

# Record metrics in your application
inference_counter.inc()
latency_histogram.observe(latency_ms)

# View in Ray dashboard or Prometheus
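
The snippet above uses latency_ms without showing where it comes from. One way to produce it is a small timing context manager, sketched here with the standard library only. `record_latency_ms` is a hypothetical helper, and any callable that accepts a number can be passed as `observe`:

```python
import time
from contextlib import contextmanager


@contextmanager
def record_latency_ms(observe):
    """Time the wrapped block and pass elapsed milliseconds to `observe`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed requests are still timed
        observe((time.perf_counter() - start) * 1000.0)


samples = []
with record_latency_ms(samples.append):
    time.sleep(0.01)  # stand-in for the model call
```

In the application, `latency_histogram.observe` would be passed instead of `samples.append`, and the `with` block would wrap the inference call.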

Cost Optimization Strategies

1. On-Demand Pricing Below AWS Spot:

io.net on-demand A100: $1.10/hour
AWS spot A100: $4.99-6.98/hour (4.5-6.3x more expensive)
io.net on-demand is already cheaper than AWS spot instances

2. Auto-Scaling:

# Scale down to 0 workers during idle periods
io ray configure ml-cluster --min-workers 0 --idle-timeout 5m

# Saves: $1.10/hour per idle worker

3. Right-Size GPU Selection:

Small models (<7B): RTX 4090 ($0.18/hr) vs. A100 ($1.10/hr) = 84% savings
Large models (70B+): H100 ($1.49/hr) vs. 8x A100 ($8.80/hr) = 83% savings
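
The savings percentages follow from the hourly rates quoted above. A short sketch of the calculation (`savings_pct` is an illustrative name):

```python
def savings_pct(cheaper_rate, baseline_rate):
    """Percentage saved by paying cheaper_rate instead of baseline_rate."""
    return round((1 - cheaper_rate / baseline_rate) * 100, 1)


small_model = savings_pct(0.18, 1.10)  # RTX 4090 vs. A100
large_model = savings_pct(1.49, 8.80)  # H100 vs. 8x A100
```

This gives 83.6% for the small-model case and 83.1% for the large-model case. It compares price only and assumes the cheaper GPU has enough memory and throughput for the workload.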

Performance Benchmarks

Distributed Training (Llama 3 8B, 10K samples):

Configuration   Time        Cost    Scaling Efficiency
1x A100         6 hours     $6.60   100%
2x A100         3.2 hours   $7.04   94%
4x A100         1.7 hours   $7.48   88%
8x A100         1.0 hours   $8.80   75%
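
Scaling efficiency here is the measured speedup divided by the number of GPUs. A sketch that recomputes the column from the times in the table (`scaling_efficiency` is an illustrative name):

```python
def scaling_efficiency(single_gpu_hours, num_gpus, multi_gpu_hours):
    """Speedup over one GPU, as a percentage of ideal linear scaling."""
    speedup = single_gpu_hours / multi_gpu_hours
    return round(100 * speedup / num_gpus)


efficiencies = {n: scaling_efficiency(6, n, hours)
                for n, hours in [(1, 6), (2, 3.2), (4, 1.7), (8, 1.0)]}
```

Efficiency falls as GPUs are added because communication and synchronization take a growing share of each step, which is also why total cost rises slightly from $6.60 to $8.80.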

Hyperparameter Tuning (20 trials, BERT fine-tuning):

Setup                Time          Cost
Sequential (1 GPU)   180 minutes   $3.30
Ray Tune (4 GPUs)    48 minutes    $3.52
Speedup              3.75x         Minimal cost increase

Inference Serving (Llama 3 8B, 10K requests):

Setup                Throughput    Latency   Cost
Single GPU           120 req/min   450ms     $1.10/hr
Ray Serve (4 GPUs)   480 req/min   110ms     $4.40/hr
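
Dividing hourly cost by hourly throughput gives a cost per request, which makes the two rows directly comparable. A sketch using only the figures in the table, assuming the cluster runs at the stated throughput for the whole hour:

```python
def cost_per_1k_requests(req_per_min, dollars_per_hour):
    """Dollars per 1,000 requests at sustained throughput."""
    requests_per_hour = req_per_min * 60
    return round(dollars_per_hour / requests_per_hour * 1000, 4)


single_gpu = cost_per_1k_requests(120, 1.10)
ray_serve = cost_per_1k_requests(480, 4.40)
```

Both setups come to about $0.15 per 1,000 requests, so under these figures the 4-GPU deployment buys higher throughput and lower latency at the same unit cost.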

Deploy Ray clusters on io.net with instant GPU access and 60-70% cost savings vs. AWS.