Migrate from AWS to io.net: Cut GPU Costs 70% (Step-by-Step)

AWS GPU instances are 2-5x more expensive than alternatives that didn't exist three years ago. After January 2026's 15% price increase on P5e instances (announced on a Saturday, naturally), engineering teams are doing the math and realizing a migration they've been putting off could save them six figures annually.

This guide walks through a complete migration from AWS GPU instances to io.net's decentralized GPU cloud. No handwaving. Real code changes, step-by-step instructions, and an honest assessment of what moves easily and what takes work.

The migration itself typically takes 1-2 weeks for containerized workloads. Teams running SageMaker pipelines should plan for 3-4 weeks. Either way, the 50-70% cost reduction pays back the engineering time within the first billing cycle.

Why Teams Are Leaving AWS for GPU Compute

The exodus from AWS GPU instances isn't ideological. It's arithmetic.

The Pricing Gap Is Growing

GPU	AWS Instance	AWS $/hr	io.net $/hr	Savings
H100 SXM	p5.48xlarge (per GPU)	~$6.88	$2.10-$3.50	50-70%
A100 80GB	p4de.24xlarge (per GPU)	~$4.50	$1.20-$2.00	56-73%
A100 40GB	p4d.24xlarge (per GPU)	~$4.10	$0.90-$1.60	61-78%

AWS reduced P5 pricing by 45% in late 2025, then reversed course with a 15% increase in January 2026. The net effect: AWS H100 pricing remains 2-3x higher than decentralized alternatives.
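
The savings and price-ratio figures follow directly from the per-GPU prices in the table. A minimal sketch of that arithmetic, using only the table's numbers (the function names are illustrative):

```python
# Per-GPU hourly prices copied from the pricing table above (USD).
PRICES = {
    "H100 SXM": {"aws": 6.88, "ionet": (2.10, 3.50)},
    "A100 80GB": {"aws": 4.50, "ionet": (1.20, 2.00)},
    "A100 40GB": {"aws": 4.10, "ionet": (0.90, 1.60)},
}


def savings_range(aws_price, ionet_range):
    """Return (min, max) savings as fractions of the AWS price."""
    low, high = ionet_range
    return 1 - high / aws_price, 1 - low / aws_price


def price_ratio_range(aws_price, ionet_range):
    """Return (min, max) of how many times the io.net price AWS charges."""
    low, high = ionet_range
    return aws_price / high, aws_price / low


for gpu, p in PRICES.items():
    s_min, s_max = savings_range(p["aws"], p["ionet"])
    r_min, r_max = price_ratio_range(p["aws"], p["ionet"])
    print(f"{gpu}: {s_min:.0%}-{s_max:.0%} savings, "
          f"AWS costs {r_min:.1f}x-{r_max:.1f}x as much")
```

Swap in your own negotiated AWS rates to see where your team actually lands in these ranges.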

Egress Fees Add Up Fast

AWS charges $0.08-$0.12/GB for data leaving their network. For ML workloads, this is not negligible:

Downloading a fine-tuned 70B model checkpoint: 140GB x $0.09 = $12.60
Moving a training dataset to another provider: 2TB x $0.09 = $180
Monthly checkpoint syncs for a research team: easily $500-1,000/month

io.net charges zero egress fees. The data you generate is yours to move freely.
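
The egress examples above are simple per-GB multiplication. A small sketch, assuming a flat $0.09/GB rate (real AWS egress pricing is tiered, so treat this as an estimate):

```python
def egress_cost(size_gb, rate_per_gb=0.09):
    """Estimated USD cost of moving size_gb out of AWS at a flat per-GB rate."""
    return size_gb * rate_per_gb


def gb_within_budget(budget_usd, rate_per_gb=0.09):
    """How many GB of egress a given monthly budget covers."""
    return budget_usd / rate_per_gb


print(f"70B checkpoint (140 GB): ${egress_cost(140):.2f}")
print(f"Training dataset (2 TB): ${egress_cost(2000):.2f}")
print(f"$500/month covers about {gb_within_budget(500) / 1000:.1f} TB of egress")
```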

Reserved Instance Lock-in

AWS offers 1-year and 3-year reserved instances for 30-60% savings on GPU compute. The catch: you commit to paying whether you use the capacity or not. For ML teams whose GPU needs fluctuate month to month, this is a lose-lose. You either overpay on-demand rates or overcommit on reservations.

io.net operates on-demand with no commitments. Deploy for two hours, shut it down, pay for two hours.
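
To see why fluctuating demand makes reservations risky, it helps to compute the break-even point. The sketch below assumes a simplified model: a reservation charges the discounted hourly rate for every hour of the month, and upfront-payment options are ignored.

```python
HOURS_PER_MONTH = 720  # same 24/7 convention used elsewhere in this guide


def on_demand_monthly(rate_per_hour, hours_used):
    """On-demand: pay only for the hours actually used."""
    return rate_per_hour * hours_used


def reserved_monthly(rate_per_hour, discount):
    """Reserved: pay the discounted rate for every hour, used or not."""
    return rate_per_hour * (1 - discount) * HOURS_PER_MONTH


def break_even_utilization(discount):
    """Fraction of the month you must run for the reservation to pay off."""
    return 1 - discount


for discount in (0.30, 0.60):
    print(f"{discount:.0%} discount: break-even at "
          f"{break_even_utilization(discount):.0%} utilization")
```

Under this model, a 30% discount only pays off if the GPU is busy more than 70% of the time. Below that, on-demand is cheaper despite the higher hourly rate.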

GPU Availability

P5 instances (H100) on AWS require quota approvals that regularly take weeks. Some teams report being denied entirely. io.net has 320,000+ GPUs across 130+ countries. Clusters deploy in under 2 minutes, no approval process, no waitlists.

What Migrates Easily vs. What Needs Work

Before you start migrating, assess your workloads honestly.

Easy (1-3 days)

Docker containers. If your training job runs in a container, it runs on io.net. Change the deployment target, keep everything else.
PyTorch training scripts. Standard PyTorch code with torchrun or torch.distributed works out of the box on io.net GPU clusters.
Ray jobs. io.net supports Ray clusters natively. Your Ray scripts need minimal changes.
Inference endpoints. If you're serving models with vLLM, TGI, or a custom Flask/FastAPI server, the migration is straightforward.

Moderate (1-2 weeks)

SageMaker training jobs. SageMaker adds a proprietary wrapper around training. You'll refactor to standard PyTorch + Ray, which is less work than it sounds. (Detailed walkthrough below.)
Custom AMIs with GPU drivers. Replace with Docker images that include your CUDA/cuDNN dependencies.
CloudWatch-dependent monitoring. Swap for Prometheus/Grafana or Weights & Biases, which are platform-agnostic.

Hard (3-6 weeks)

Deeply integrated AWS pipelines. If your workflow chains S3 + Lambda + SageMaker + Step Functions, you're not just migrating GPU compute. You're refactoring an architecture. Consider a hybrid approach: move GPU workloads to io.net, keep orchestration on AWS.
HIPAA/PCI-DSS compliance workloads. io.net offers confidential computing and hardware validation, but if your compliance posture is built around AWS's specific certifications, factor in re-certification time.

Pre-Migration Assessment Checklist

Before writing any migration code, answer these:

[ ] What percentage of our GPU spend is training vs. inference?
[ ] Are workloads containerized today?
[ ] Which AWS services beyond EC2/SageMaker do our GPU jobs depend on?
[ ] How much data egress do we generate monthly?
[ ] Do we have compliance requirements tied to AWS certifications?
[ ] What's our current reserved instance commitment and when does it expire?

Step-by-Step Migration (7 Steps)

Step 1: Audit Current AWS GPU Usage

Start with a clear picture of what you're spending and where.

# Pull GPU instance usage from AWS Cost Explorer CLI
aws ce get-cost-and-usage \
  --time-period Start=2026-01-01,End=2026-04-01 \
  --granularity MONTHLY \
  --filter '{
    "Dimensions": {
      "Key": "INSTANCE_TYPE_FAMILY",
      "Values": ["p4d", "p5", "p5e", "g5", "g6"]
    }
  }' \
  --metrics "BlendedCost" "UsageQuantity" \
  --group-by Type=DIMENSION,Key=INSTANCE_TYPE

# List all running GPU instances right now
aws ec2 describe-instances \
  --filters "Name=instance-type,Values=p4d.*,p4de.*,p5.*,p5e.*,g5.*,g6.*" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId,Type:InstanceType,State:State.Name,Launch:LaunchTime}' \
  --output table

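Document the output. You need: total monthly GPU spend, instance types in use, average utilization per instance, and data transfer costs. Cost Explorer reports spend and usage hours, not GPU utilization, so take utilization from your monitoring stack. Data transfer is billed under separate usage types, so query it separately from the instance-type filter above.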
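
If you save the Cost Explorer response to a file (for example with `--output json > gpu_costs.json`), a short script can total spend per instance type and per month. This is a sketch that assumes the standard `ResultsByTime` / `Groups` response shape; the sample data is invented purely to show the structure.

```python
from collections import defaultdict


def summarize_gpu_spend(report):
    """Sum BlendedCost per instance type and per month from a Cost Explorer response."""
    by_type = defaultdict(float)
    by_month = defaultdict(float)
    for period in report["ResultsByTime"]:
        month = period["TimePeriod"]["Start"]
        for group in period.get("Groups", []):
            instance_type = group["Keys"][0]
            # Cost Explorer returns amounts as strings.
            cost = float(group["Metrics"]["BlendedCost"]["Amount"])
            by_type[instance_type] += cost
            by_month[month] += cost
    return dict(by_type), dict(by_month)


# Made-up sample in the response shape, for illustration only.
SAMPLE = {
    "ResultsByTime": [
        {"TimePeriod": {"Start": "2026-01-01", "End": "2026-02-01"},
         "Groups": [
             {"Keys": ["p4d.24xlarge"], "Metrics": {"BlendedCost": {"Amount": "100.0", "Unit": "USD"}}},
             {"Keys": ["g5.xlarge"], "Metrics": {"BlendedCost": {"Amount": "50.5", "Unit": "USD"}}},
         ]},
        {"TimePeriod": {"Start": "2026-02-01", "End": "2026-03-01"},
         "Groups": [
             {"Keys": ["p4d.24xlarge"], "Metrics": {"BlendedCost": {"Amount": "200.0", "Unit": "USD"}}},
         ]},
    ]
}

by_type, by_month = summarize_gpu_spend(SAMPLE)
for instance_type, cost in sorted(by_type.items(), key=lambda kv: -kv[1]):
    print(f"{instance_type:<16} ${cost:,.2f}")
```

To run it on real data, replace `SAMPLE` with `json.load(open("gpu_costs.json"))`.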

Step 2: Containerize Your Workloads

If your training scripts already run in Docker, skip ahead. If they're running on bare EC2 with custom AMIs, containerize them now.

# Example: PyTorch training container
FROM nvcr.io/nvidia/pytorch:24.03-py3

WORKDIR /workspace

# Install your dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy training code
COPY train.py .
COPY model/ ./model/
COPY configs/ ./configs/

# Default entrypoint
ENTRYPOINT ["torchrun", "--nproc_per_node=gpu", "train.py"]

# Build and test locally (if you have a GPU)
docker build -t my-training-job:latest .
docker run --gpus all my-training-job:latest --config configs/test.yaml

The key insight: a containerized workload is cloud-agnostic. Once it runs in Docker, it runs on any GPU provider. This step is the real migration.

Step 3: Set Up io.net and Deploy a Test Instance

Create an account at io.net and deploy a test GPU instance.

# Install the io.net CLI
pip install ionet-cli

# Authenticate
ionet auth login

# Browse available GPUs
ionet gpu list --type H100

# Deploy a single H100 instance for testing
ionet deploy create \
  --gpu-type H100 \
  --gpu-count 1 \
  --image nvcr.io/nvidia/pytorch:24.03-py3 \
  --name "migration-test"

Clusters deploy in under 2 minutes. Run a quick smoke test:

# SSH into the instance
ionet ssh migration-test

# Verify GPU access
nvidia-smi

# Run a quick PyTorch check
python3 -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPUs: {torch.cuda.device_count()}')"

Step 4: Port Your Training Scripts

For standard PyTorch, the code changes are minimal. The primary differences are environment-level, not code-level.

AWS (SageMaker-style):

# SageMaker launcher script: submits train.py as a managed training job
import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train.py',
    role='arn:aws:iam::012345678901:role/SageMakerRole',
    instance_type='ml.p4d.24xlarge',
    instance_count=1,
    framework_version='2.1',
    py_version='py310',
    hyperparameters={
        'epochs': 10,
        'batch_size': 32,
        'lr': 1e-4
    },
    output_path='s3://my-bucket/output'
)

estimator.fit({'training': 's3://my-bucket/data/train'})

io.net (standard PyTorch):

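# train.py: runs directly, no proprietary wrapper
import argparse
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Hyperparameters arrive as CLI flags instead of SageMaker's hyperparameters dict
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=1e-4)
    args = parser.parse_args()

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
    torch.cuda.set_device(local_rank)

    # YourModel, dataloader, optimizer, train_one_epoch, and save_checkpoint
    # are placeholders for your existing code
    model = YourModel().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Your training loop: identical to what SageMaker ran internally
    for epoch in range(args.epochs):
        train_one_epoch(model, dataloader, optimizer)
        if dist.get_rank() == 0:  # write checkpoints from one process only
            save_checkpoint(model.module, f"checkpoint_epoch_{epoch}.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()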

# Launch on io.net cluster
ionet run \
  --gpu-type A100 \
  --gpu-count 8 \
  --image my-training-job:latest \
  --cmd "torchrun --nproc_per_node=8 train.py --epochs 10 --batch_size 32 --lr 1e-4"

The training logic is identical. What changes is the orchestration layer: you replace SageMaker's Estimator with a direct torchrun invocation. Your actual model code, loss functions, and data loading are untouched.
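
SageMaker's script mode passes the `hyperparameters` dict to your entry point as `--name value` command-line flags, so the same dict maps directly onto a torchrun command line. A minimal sketch of that mapping follows. The helper name is illustrative, and it assumes train.py parses these flags (for example with argparse).

```python
import shlex


def torchrun_command(entry_point, hyperparameters, nproc_per_node):
    """Build a torchrun command line from a SageMaker-style hyperparameters dict."""
    parts = ["torchrun", f"--nproc_per_node={nproc_per_node}", entry_point]
    for name, value in hyperparameters.items():
        parts += [f"--{name}", str(value)]
    # shlex.join quotes any value containing spaces or shell metacharacters.
    return shlex.join(parts)


cmd = torchrun_command(
    "train.py",
    {"epochs": 10, "batch_size": 32, "lr": 1e-4},
    nproc_per_node=8,
)
print(cmd)
```

Keeping the hyperparameters in one dict means the same values can feed both the old SageMaker launcher and the new command during the parallel-run period.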

Step 5: Migrate Data

Move training data from S3 to your io.net instance. For large datasets, use parallel transfers.

# Option A: Direct transfer from S3 to io.net instance
# (Requires AWS CLI configured on the instance)
aws s3 sync s3://my-bucket/training-data/ /data/training/ --quiet

# Option B: Use rclone for multi-cloud transfers
rclone sync s3:my-bucket/training-data /data/training/ \
  --transfers 16 \
  --checkers 8 \
  --progress

# Option C: For datasets under 100GB, a simple wget/curl from a public URL
mkdir -p /data/training
wget -q https://my-data-url.com/dataset.tar.gz -O /data/dataset.tar.gz
tar -xzf /data/dataset.tar.gz -C /data/training/

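Remember: io.net has no egress fees. Once data is on the instance, you can freely download results, checkpoints, and trained models without additional charges. The initial pull out of S3 is still billed by AWS at its egress rate ($0.08-$0.12/GB), so budget for that one-time cost.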
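
For multi-terabyte transfers it is worth confirming that the copy is complete before training on it. The original steps do not include this. One possible approach is to build a SHA-256 manifest on each side and compare them. The function names below are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(source, copy):
    """Return (missing, mismatched) relative paths when checking copy against source."""
    missing = sorted(set(source) - set(copy))
    mismatched = sorted(k for k in source if k in copy and source[k] != copy[k])
    return missing, mismatched
```

Run `build_manifest` wherever the source data is readable, save the result as JSON, then run it again on `/data/training/` and pass both to `diff_manifests`. Two empty lists mean the copy matches.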

Step 6: Run Parallel Validation

Before cutting over, run the same training job on both platforms and compare results.

# validation_compare.py
import json

METRICS = ['final_loss', 'eval_accuracy', 'throughput_samples_sec']

def is_number(value):
    return isinstance(value, (int, float))

def compare_runs(aws_metrics_path, ionet_metrics_path):
    with open(aws_metrics_path) as f:
        aws = json.load(f)
    with open(ionet_metrics_path) as f:
        ionet = json.load(f)

    print("=== Training Validation Report ===")
    print(f"{'Metric':<25} {'AWS':>12} {'io.net':>12} {'Delta':>12}")
    print("-" * 65)

    for metric in METRICS:
        a_val = aws.get(metric, 'N/A')
        i_val = ionet.get(metric, 'N/A')
        # Only compute a percentage when both values are numeric and the baseline is nonzero
        if is_number(a_val) and is_number(i_val) and a_val != 0:
            delta = ((i_val - a_val) / a_val) * 100
            print(f"{metric:<25} {a_val:>12.4f} {i_val:>12.4f} {delta:>+11.2f}%")
        else:
            print(f"{metric:<25} {str(a_val):>12} {str(i_val):>12} {'---':>12}")

    # Check for divergence; a missing or zero baseline loss cannot be compared
    aws_loss = aws.get('final_loss')
    ionet_loss = ionet.get('final_loss')
    if not is_number(aws_loss) or not is_number(ionet_loss) or aws_loss == 0:
        print("\nWARNING: final_loss is missing or zero in one run. Cannot validate.")
        return

    loss_diff = abs(aws_loss - ionet_loss) / abs(aws_loss)
    if loss_diff < 0.02:
        print("\nVERDICT: Results within 2% tolerance. Safe to migrate.")
    else:
        print(f"\nWARNING: Loss divergence of {loss_diff:.1%}. Investigate before cutover.")

compare_runs("aws_results.json", "ionet_results.json")

What you're validating:

Final loss should match within 1-2% (floating-point differences between GPU runs are normal)
Eval accuracy should be equivalent
Throughput may differ slightly due to interconnect differences in multi-GPU setups
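
The comparison script reads one JSON file per platform with the keys `final_loss`, `eval_accuracy`, and `throughput_samples_sec`. A minimal sketch of a helper your training script could call at the end of a run to produce that file (the helper name is illustrative):

```python
import json


def write_metrics(path, final_loss, eval_accuracy, throughput_samples_sec):
    """Write the three metrics validation_compare.py expects as a JSON file."""
    metrics = {
        "final_loss": float(final_loss),
        "eval_accuracy": float(eval_accuracy),
        "throughput_samples_sec": float(throughput_samples_sec),
    }
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
    return metrics
```

Call it once per run, for example `write_metrics("ionet_results.json", loss, accuracy, samples / elapsed_seconds)`, using the same seed, batch size, and dataset on both platforms.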

Step 7: Cut Over and Decommission

Once validation passes:

# 1. Update your CI/CD pipeline to target io.net
# Replace AWS deployment commands with io.net equivalents

# 2. Point scheduled training jobs to io.net
# Example: cron job or Airflow DAG update
# OLD: aws sagemaker create-training-job ...
# NEW: ionet run --gpu-type H100 --gpu-count 8 ...

# 3. Terminate AWS GPU instances
aws ec2 terminate-instances --instance-ids i-0abc123def4567890 i-0def789abc0123456

# 4. Review active reserved instances (they cannot be cancelled; Standard RIs can be sold on the Reserved Instance Marketplace)
aws ec2 describe-reserved-instances \
  --filters "Name=state,Values=active" \
  --query 'ReservedInstances[?InstanceType.starts_with(@,`p`)].{ID:ReservedInstancesId,Type:InstanceType,End:End}'

# 5. List unattached EBS volumes to review for deletion (check S3 training buckets separately)
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query 'Volumes[].{ID:VolumeId,Size:Size,Created:CreateTime}' \
  --output table

Keep your AWS account active for non-GPU services (S3 storage, Lambda, etc.) if needed. Most teams adopt a hybrid approach: GPU compute on io.net, everything else stays on AWS until a full cloud migration makes sense.

SageMaker to io.net Migration

SageMaker is the most common migration path we see. Here's how each component maps.

SageMaker Training Jobs to Ray Clusters

SageMaker wraps standard PyTorch in a proprietary estimator. Unwrapping it is straightforward.

SageMaker (before):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train.py',
    instance_type='ml.p4d.24xlarge',
    instance_count=2,
    distribution={'torch_distributed': {'enabled': True}}
)
estimator.fit({'train': 's3://bucket/data'})

Ray on io.net (after):

import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

ray.init()

trainer = TorchTrainer(
    train_loop_per_worker=your_training_function,
    scaling_config=ScalingConfig(
        num_workers=16,       # 2 nodes x 8 GPUs
        use_gpu=True,
        resources_per_worker={"GPU": 1}
    ),
    datasets={"train": ray.data.read_parquet("/data/train/")}
)

result = trainer.fit()
print(f"Final loss: {result.metrics['loss']}")

The benefit: Ray is open source and not tied to any cloud. Your code runs on io.net, on-prem, or any other cloud that can host a Ray cluster. You're never locked in again.

SageMaker Endpoints to io.intelligence

If you're using SageMaker for model inference, io.intelligence offers an even simpler migration path: an OpenAI-compatible API with 25+ models.

SageMaker endpoint (before):

import json

import boto3

runtime = boto3.client('sagemaker-runtime')

response = runtime.invoke_endpoint(
    EndpointName='my-llm-endpoint',
    ContentType='application/json',
    Body=json.dumps({
        "inputs": "Explain quantum computing",
        "parameters": {"max_new_tokens": 256}
    })
)

result = json.loads(response['Body'].read())

io.intelligence (after):

from openai import OpenAI

client = OpenAI(
    base_url="https://api.intelligence.io.net/api/v1",
    api_key="your-ionet-api-key"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=256
)

print(response.choices[0].message.content)

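The after version uses the standard OpenAI SDK. If your code already talks to an OpenAI-compatible endpoint, the switch is two values: base_url and api_key. Coming from boto3, you also replace invoke_endpoint with chat.completions.create. Tools, frameworks, and libraries that accept a custom OpenAI-compatible base URL work with io.intelligence without further changes.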

API Migration for Inference Workloads

If your application calls AWS Bedrock or SageMaker endpoints for inference, the migration to io.intelligence is the simplest part of this guide.

AWS Bedrock (before):

import json

import boto3

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.invoke_model(
    modelId='meta.llama3-70b-instruct-v1:0',
    body=json.dumps({
        "prompt": "Summarize this document: ...",
        "max_gen_len": 512,
        "temperature": 0.7
    })
)

result = json.loads(response['body'].read())

io.intelligence (after):

from openai import OpenAI

client = OpenAI(
    base_url="https://api.intelligence.io.net/api/v1",
    api_key="your-ionet-api-key"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize this document: ..."}],
    max_tokens=512,
    temperature=0.7
)

Every AWS Bedrock model call becomes a standard OpenAI-compatible API call. No proprietary SDKs, no region-specific endpoints, no IAM role configuration.
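
If many call sites build SageMaker or Bedrock request bodies, a thin translation layer lets you migrate them without touching each one. The sketch below maps the two payload shapes shown above onto keyword arguments for `client.chat.completions.create`. The function names are illustrative, and it assumes the prompt is plain text. Prompts that embed model-specific chat template tokens would need those stripped first.

```python
def sagemaker_to_chat(payload, model):
    """Translate an {"inputs", "parameters"} payload into chat.completions kwargs."""
    params = payload.get("parameters", {})
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": payload["inputs"]}],
    }
    if "max_new_tokens" in params:
        kwargs["max_tokens"] = params["max_new_tokens"]
    if "temperature" in params:
        kwargs["temperature"] = params["temperature"]
    return kwargs


def bedrock_llama_to_chat(body, model):
    """Translate a Bedrock Llama body (prompt, max_gen_len, temperature) into chat kwargs."""
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": body["prompt"]}],
    }
    if "max_gen_len" in body:
        kwargs["max_tokens"] = body["max_gen_len"]
    if "temperature" in body:
        kwargs["temperature"] = body["temperature"]
    return kwargs
```

Usage is then `client.chat.completions.create(**bedrock_llama_to_chat(body, "meta-llama/Llama-3-70B-Instruct"))`, with `client` configured as in the examples above.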

Cost Savings Calculator

Here's what migration actually saves across three common configurations.

Scenario	AWS Setup	Monthly AWS Cost	Monthly io.net Cost	Annual Savings
Training cluster	8x A100 80GB, 24/7	$25,920	$8,640	$207,360
Inference server	1x H100, 24/7	$4,954	$1,512-$2,520	$29,208-$41,304
Burst training	8x H100, 200 hrs/mo	$11,008	$3,360-$5,600	$64,896-$91,776

How these were calculated:

Training cluster: AWS A100 80GB = 8 x $4.50/hr x 720 hrs = $25,920. io.net = 8 x $1.50/hr (an assumed rate within the $1.20-$2.00 range) x 720 hrs = $8,640.
Inference server: AWS P5 per-GPU = $6.88/hr x 720 hrs = $4,954. io.net H100 = $2.10-$3.50/hr x 720 hrs = $1,512-$2,520.
Burst training: AWS P5 = 8 x $6.88/hr x 200 hrs = $11,008. io.net = 8 x ($2.10-$3.50)/hr x 200 hrs = $3,360-$5,600.

These numbers exclude AWS egress fees, which add $500-$2,000/month for active ML teams. They also exclude AWS reserved instance discounts, which would require 1-3 year lock-in commitments.
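
The table can be reproduced, and adapted to your own GPU counts and hours, with a few lines of arithmetic. This sketch uses the per-GPU rates quoted above and rounds monthly costs to whole dollars, as the table does.

```python
def monthly_cost(gpus, rate_per_gpu_hour, hours):
    """Monthly cost in whole dollars for a given GPU count, hourly rate, and hours run."""
    return round(gpus * rate_per_gpu_hour * hours)


def annual_savings(aws_monthly, ionet_monthly):
    """Yearly difference between two monthly bills."""
    return (aws_monthly - ionet_monthly) * 12


# (name, GPU count, hours per month, AWS $/GPU-hr, (io.net low, io.net high) $/GPU-hr)
SCENARIOS = [
    ("Training cluster", 8, 720, 4.50, (1.50, 1.50)),
    ("Inference server", 1, 720, 6.88, (2.10, 3.50)),
    ("Burst training", 8, 200, 6.88, (2.10, 3.50)),
]

for name, gpus, hours, aws_rate, (low, high) in SCENARIOS:
    aws = monthly_cost(gpus, aws_rate, hours)
    best = monthly_cost(gpus, low, hours)
    worst = monthly_cost(gpus, high, hours)
    print(f"{name}: AWS ${aws:,}/mo, io.net ${best:,}-${worst:,}/mo, "
          f"saves ${annual_savings(aws, worst):,}-${annual_savings(aws, best):,}/yr")
```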

What You Lose (and What You Gain)

An honest migration assessment requires acknowledging tradeoffs.

What You Lose

AWS ecosystem integration. S3, Lambda, Step Functions, CloudWatch, IAM --- these are deeply integrated on AWS. On io.net, you'll use separate tools for each (which are often better individually but less tightly coupled).
IAM and fine-grained access control. AWS IAM is industry-leading for access management. io.net's access model is simpler, which means less granularity.
Native CloudWatch monitoring. You'll switch to Prometheus/Grafana, Datadog, or Weights & Biases. These are arguably better for ML workloads, but it's another migration.
Compliance certifications. If your organization requires SOC 2, HIPAA, or PCI-DSS specifically tied to AWS's certifications, this needs separate evaluation. io.net offers confidential computing and hardware validation, but the certification landscape differs.

What You Gain

50-70% cost reduction. The primary driver. For a team spending $30,000/month on AWS GPUs, that's $180,000-$252,000/year back in the budget.
No egress fees. Move data freely. Download checkpoints, share models, sync datasets without watching a fee meter.
Faster deployment. Clusters deploy in under 2 minutes. No quota approvals, no capacity reservations, no support tickets.
Global GPU availability. 320,000+ GPUs across 130+ countries. No single region is a bottleneck.
No lock-in. Containerized workloads on standard frameworks (PyTorch, Ray, Kubernetes) move freely between providers. You're never stuck.
Open standards. io.net uses Ray, Kubernetes, containers, and OpenAI-compatible APIs. Everything you build is portable.

Frequently Asked Questions

How long does a typical migration from AWS to io.net take?

For containerized workloads, expect 1-2 weeks including validation. SageMaker pipelines take 3-4 weeks because you're refactoring proprietary wrappers to standard PyTorch + Ray. Deeply integrated AWS architectures (S3 + Lambda + SageMaker + Step Functions) may take 4-6 weeks for the GPU compute portion alone.

Will my training results be identical on io.net?

Results should match within 1-2%. Minor floating-point differences between GPU runs are normal and expected --- they occur even between two identical AWS instances. Run parallel validation (Step 6) to confirm before cutting over. If loss diverges beyond 2%, investigate batch size, random seeds, and data loading order.

Can I keep using S3 for data storage while running compute on io.net?

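Yes. Many teams adopt a hybrid approach: S3 for persistent storage, io.net for GPU compute. Use aws s3 sync or rclone to transfer data to your io.net instance at the start of each job. Every pull out of S3 is billed at AWS egress rates ($0.08-$0.12/GB), so for large datasets it is cheaper to keep a copy on the io.net side than to re-download per job. Uploading results back to S3 incurs no transfer fees: io.net has no egress fees and AWS does not charge for inbound data transfer (standard S3 request and storage charges still apply).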

Does io.net support multi-node training?

Yes. io.net supports multi-GPU clusters via Ray, Kubernetes, and native torchrun distributed training. You can deploy clusters with 8, 16, 64, or more GPUs across multiple nodes. Clusters deploy in under 2 minutes, and inter-node communication uses high-speed interconnects.

What about uptime and reliability?

io.net's decentralized architecture means no single point of failure. If one node goes down, workloads can be rescheduled to available GPUs from the network's 320,000+ GPU pool across 130+ countries. For long-running training jobs, implement standard checkpointing (which you should be doing regardless of provider) to handle any interruptions.
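
Checkpointing only helps if a restarted job picks up where the last one stopped. A minimal sketch of the resume logic follows, assuming checkpoints are named `checkpoint_epoch_N.pt` as in the training example. The helper names are illustrative, and loading the weights themselves is left to your own code.

```python
import re
from pathlib import Path

CHECKPOINT_PATTERN = re.compile(r"^checkpoint_epoch_(\d+)\.pt$")


def latest_checkpoint(directory):
    """Return (path, epoch) for the highest-numbered checkpoint, or (None, -1) if none exist."""
    best_path, best_epoch = None, -1
    for path in Path(directory).glob("checkpoint_epoch_*.pt"):
        match = CHECKPOINT_PATTERN.match(path.name)
        # Compare epochs numerically so epoch 10 ranks above epoch 2.
        if match and int(match.group(1)) > best_epoch:
            best_path, best_epoch = path, int(match.group(1))
    return best_path, best_epoch


def epochs_to_run(directory, total_epochs):
    """Epochs still to train, starting after the last completed checkpoint."""
    _, last_epoch = latest_checkpoint(directory)
    return range(last_epoch + 1, total_epochs)
```

In the training loop, replace `range(total_epochs)` with `epochs_to_run(checkpoint_dir, total_epochs)` and load the file returned by `latest_checkpoint` before the first iteration.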

How do I handle secrets and environment variables?

io.net supports environment variable injection at deploy time and integrates with standard secrets management tools. Replace AWS Secrets Manager references with environment variables or mount secrets from your preferred vault (HashiCorp Vault, Doppler, etc.) during container startup.
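
Inside the container, it is safer to fail at startup than halfway through a training run when a secret turns out to be missing. The sketch below shows a fail-fast loader. The variable names in the usage note are hypothetical examples, not names io.net defines.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when required environment variables are unset or empty."""


def load_secrets(names, environ=None):
    """Return {name: value} for each required variable, or raise listing all that are missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise MissingSecretError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in names}
```

Call it once at the top of your entry point, for example `secrets = load_secrets(["WANDB_API_KEY", "S3_ACCESS_KEY"])`, so a misconfigured deployment stops before any GPU time is spent.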

Conclusion

Migrating from AWS GPU instances to io.net is not a rip-and-replace. It's a targeted move: take the most expensive line item on your cloud bill (GPU compute) and move it to a platform that charges 50-70% less for equivalent hardware.

The migration path is well-defined. Containerize your workloads. Deploy a test cluster on io.net. Validate results in parallel. Cut over when the numbers match. Keep AWS for what it does well (managed services, compliance, ecosystem) and use io.net for what it does better (affordable, available GPU compute with no egress fees).

For a team spending $25,000/month on AWS GPUs, the math is simple: 70% savings is $210,000/year. The 2-week migration pays for itself before the first io.net invoice arrives.

Ready to start? Deploy your first GPU cluster at io.net --- clusters are live in under 2 minutes, and you can run a side-by-side comparison with your AWS setup before committing to anything.