Yes. Migrating GPU workloads from AWS (EC2, SageMaker, EKS) to io.net is straightforward and takes 1-3 days for most deployments. The process involves containerizing your workload (if it isn't already), transferring data, updating endpoint URLs, and redeploying on io.net's infrastructure. Organizations typically save 50-70% on GPU costs (the worked examples below range from 64% to 73%) while maintaining comparable performance and reliability.
io.net supports the same GPU types as AWS (H100, A100, A10G), runs standard Docker containers, and provides equivalent networking and storage primitives. The migration path is designed for zero disruption: run workloads in parallel on both platforms during testing, then cut over when validated.
Migration Complexity by Workload Type
| Workload Type | Complexity | Migration Time | Key Considerations |
|---|---|---|---|
| Containerized ML training | Low | 1-2 hours | Direct port, minimal changes |
| Inference API (containerized) | Low | 2-4 hours | Update DNS, load test |
| SageMaker training jobs | Medium | 1-2 days | Convert to container, adapt dataset loading |
| EC2 instances (manual setup) | Medium | 2-3 days | Containerize environment, document dependencies |
| EKS GPU clusters | Medium-High | 3-5 days | Port Kubernetes manifests, test scaling |
| Batch processing pipelines | Low-Medium | 1-2 days | Adapt job scheduler, validate outputs |
Step-by-Step Migration Guide
Phase 1: Assessment (1-2 hours)
- Inventory AWS GPU usage:
# List all EC2 GPU instances
aws ec2 describe-instances \
--filters "Name=instance-type,Values=p*,g*" \
--query 'Reservations[].Instances[].[InstanceId,InstanceType,State.Name]'
# Get SageMaker training jobs (last 30 days)
# Note: `date -d` is GNU syntax; on macOS/BSD use `date -v-30d +%Y-%m-%d`
aws sagemaker list-training-jobs \
--max-results 100 \
--creation-time-after "$(date -d '30 days ago' +%Y-%m-%d)"
# Estimate monthly GPU costs
aws ce get-cost-and-usage \
--time-period Start=2026-03-01,End=2026-04-01 \
--granularity MONTHLY \
--filter file://gpu-filter.json \
--metrics BlendedCost
- Calculate potential savings:
AWS Cost Example (p4d.24xlarge with 8x A100):
- On-demand: $32.77/hour
- 12 hours/day × 30 days = 360 hours/month
- Monthly cost: $11,797
io.net Equivalent (8x A100):
- On-demand: $8.80/hour
- Same usage: 360 hours/month
- Monthly cost: $3,168
- Savings: $8,629/month (73%)
- Identify dependencies:
- [ ] AWS-specific services (S3, EBS, VPC, IAM)
- [ ] Custom AMIs or EC2 user data scripts
- [ ] Security groups and networking configurations
- [ ] Monitoring and logging (CloudWatch)
- [ ] Data sources (RDS, DynamoDB, S3)
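The Cost Explorer command in the inventory step reads `gpu-filter.json`, which isn't shown. A minimal sketch of generating it in Python follows. It assumes filtering on the `INSTANCE_TYPE` dimension is what you want; `build_gpu_filter` and `write_gpu_filter` are hypothetical helper names, and the instance types you pass in should come from your own inventory.

```python
import json


def build_gpu_filter(instance_types):
    """Return a Cost Explorer filter expression matching the given instance types."""
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "Dimensions": {
            "Key": "INSTANCE_TYPE",
            # De-duplicate and sort so the generated file is stable between runs.
            "Values": sorted(set(instance_types)),
        }
    }


def write_gpu_filter(instance_types, path="gpu-filter.json"):
    """Write the filter to disk in the form `--filter file://gpu-filter.json` expects."""
    with open(path, "w") as fh:
        json.dump(build_gpu_filter(instance_types), fh, indent=2)
```

Calling `write_gpu_filter(["p4d.24xlarge", "p3.2xlarge", "g5.2xlarge"])` would produce a file covering the instance types used as examples in this guide.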
Phase 2: Containerization (4-8 hours if needed)
If the workload isn't already containerized:
# Example: Convert EC2 PyTorch environment to Docker
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
# Set working directory first so the paths below are relative to it
WORKDIR /workspace
# Install dependencies before copying code so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY ./src ./src
COPY ./models ./models
# Entry point
CMD ["python", "src/train.py"]
Build and test locally:
docker build -t my-training-job:latest .
docker run --gpus all -it my-training-job:latest
Phase 3: Data Migration (varies by dataset size)
Option A: Direct Transfer (< 1TB)
# Upload to io.net volume from AWS S3
aws s3 cp s3://my-bucket/dataset.tar.gz - | \
io exec --instance my-gpu -- tar xzf - -C /data/
Option B: Parallel Transfer (1TB+)
# Use io.net S3-compatible storage
io storage create --name my-dataset --size 2TB
# Multi-threaded sync from AWS S3
# Quote the wildcard so s5cmd expands it, not the shell
s5cmd --numworkers 32 cp \
"s3://my-aws-bucket/*" \
https://storage.io.net/my-dataset/
Option C: Dataset Streaming
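# Stream from S3 during training (no migration needed)
import boto3
from torch.utils.data import IterableDataset

class S3Dataset(IterableDataset):
    def __init__(self, bucket, prefix):
        self.s3 = boto3.client('s3')
        self.bucket = bucket
        self.prefix = prefix

    def __iter__(self):
        # list_objects_v2 returns a dict and at most 1,000 keys per call,
        # so paginate and read the objects from each page's 'Contents' list
        paginator = self.s3.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=self.bucket, Prefix=self.prefix):
            for obj in page.get('Contents', []):
                data = self.s3.get_object(Bucket=self.bucket, Key=obj['Key'])
                # process() is your own function that decodes raw bytes into a sample
                yield process(data['Body'].read())

# Runs unchanged on io.net as long as the AWS credentials are available in the container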
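One caveat with streaming through an `IterableDataset`: when a `DataLoader` runs with `num_workers > 0`, every worker iterates the full dataset unless the keys are split between them, so samples get duplicated. A minimal sketch of that split in plain Python follows. `shard_keys` is a hypothetical helper; in a real dataset, `worker_id` and `num_workers` would come from `torch.utils.data.get_worker_info()` inside `__iter__`.

```python
from itertools import islice


def shard_keys(keys, worker_id, num_workers):
    """Return an iterator over every num_workers-th key, starting at worker_id."""
    if num_workers < 1 or not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must satisfy 0 <= worker_id < num_workers")
    # Round-robin split: worker 0 takes keys 0, n, 2n, ...; worker 1 takes 1, n+1, ...
    return islice(keys, worker_id, None, num_workers)
```

A round-robin split keeps the shards balanced without needing to know the total number of objects up front, which suits a paginated S3 listing.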
Phase 4: Deploy on io.net (1-2 hours)
# 1. Install io.net CLI
pip install ionet-cli
io login
# 2. Deploy training job
io deploy --image my-training-job:latest \
--gpu A100 --count 8 \
--memory 480GB \
--storage 1TB \
--env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
--name training-job
# 3. Monitor deployment
io logs --instance training-job --follow
# 4. Check GPU utilization
io exec --instance training-job -- nvidia-smi
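To check utilization programmatically rather than by eye, `nvidia-smi` can emit CSV. The sketch below parses the output of `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. How that output gets captured from the instance (for example through the `io exec` command above) is left as an assumption, and `parse_gpu_stats` and `idle_gpus` are hypothetical names.

```python
import csv
import io

FIELDS = ("index", "utilization_pct", "memory_used_mib", "memory_total_mib")


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU.

    Assumes every queried field is numeric; GPUs that report a field as
    unsupported would need extra handling.
    """
    stats = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        stats.append({name: int(value) for name, value in zip(FIELDS, row)})
    return stats


def idle_gpus(stats, threshold_pct=10):
    """Return the indices of GPUs whose utilization is below the threshold."""
    return [gpu["index"] for gpu in stats if gpu["utilization_pct"] < threshold_pct]
```

During a multi-GPU training run, a non-empty result from `idle_gpus` is a quick hint that some devices are not receiving work.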
Phase 5: Validation (2-4 hours)
# Run parallel test: AWS vs. io.net
# Compare:
# - Training throughput (samples/sec)
# - Final model accuracy
# - Total training time
# - Network latency to data sources
# Example validation script
python validate_migration.py \
--aws-model s3://aws-bucket/model.pth \
--ionet-model https://storage.io.net/my-dataset/model.pth \
--test-dataset s3://test-data/ \
--metrics accuracy,f1_score
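The `validate_migration.py` script above isn't shown. One way its core comparison could look is sketched below, assuming each run's metrics have already been computed into plain dictionaries. The function names and the 1% tolerance are illustrative choices, not io.net defaults.

```python
def compare_metrics(aws, ionet, rel_tol=0.01):
    """Return {metric: (aws_value, ionet_value, passed)} for metrics present in both runs."""
    report = {}
    for name in sorted(aws.keys() & ionet.keys()):
        a, b = aws[name], ionet[name]
        # Relative difference against the AWS baseline; absolute when the baseline is 0.
        diff = abs(a - b) / abs(a) if a else abs(a - b)
        report[name] = (a, b, diff <= rel_tol)
    return report


def migration_ok(report):
    """True only if at least one metric was compared and all comparisons passed."""
    return bool(report) and all(passed for _, _, passed in report.values())
```

Treating an empty report as a failure avoids declaring success when the two runs share no metric names.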
Phase 6: Cutover (1 hour)
# For inference workloads:
# 1. Deploy on io.net
io deploy --image my-api:latest --gpu A100 --port 8000
# 2. Update DNS to point to io.net endpoint
# Amazon Route 53 or Cloudflare
# Old: api.example.com → AWS Load Balancer
# New: api.example.com → io.net endpoint (xxx.ionet.cloud)
# 3. Monitor traffic and error rates
# 4. Gradually shift traffic (10% → 50% → 100%)
# 5. Decommission AWS resources after the validation period (7 days here; the checklist below allows 14 days of monitoring)
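The gradual shift in step 4 can be driven by a simple rule: advance to the next traffic weight only while the error rate observed on io.net stays under a threshold, and roll back otherwise. A minimal sketch follows. The 1% threshold and the function name are assumptions, and applying the weight (for example through weighted DNS records) is left out.

```python
ROLLOUT_STEPS = (10, 50, 100)  # percent of traffic sent to io.net, as in step 4


def next_weight(current, error_rate, max_error_rate=0.01, steps=ROLLOUT_STEPS):
    """Return the next traffic percentage for io.net, or 0 to roll back."""
    if error_rate > max_error_rate:
        return 0  # send all traffic back to AWS
    for step in steps:
        if step > current:
            return step
    return steps[-1]  # already fully shifted
```

Keeping the decision in a pure function makes it easy to test the rollout policy separately from whatever tool actually changes the routing.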
Common Migration Scenarios
Scenario 1: SageMaker Training Job → io.net
AWS SageMaker:
import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train.py',
    role='arn:aws:iam::xxx:role/SageMakerRole',
    instance_type='ml.p4d.24xlarge',
    instance_count=1,
    framework_version='2.0',
    py_version='py310'
)
estimator.fit('s3://my-bucket/data')
io.net equivalent:
# Containerize SageMaker script
docker build -t sagemaker-port:latest \
-f Dockerfile.sagemaker .
# Deploy on io.net
io deploy --image sagemaker-port:latest \
--gpu A100 --count 8 \
--env S3_BUCKET=my-bucket \
--env S3_PREFIX=data
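Inside a SageMaker container, a training script typically reads its input and output locations from environment variables such as `SM_CHANNEL_TRAINING` and `SM_MODEL_DIR`. When porting, the script needs fallbacks for when those are absent. The sketch below shows one way to resolve them, assuming the `S3_BUCKET` and `S3_PREFIX` variables from the deploy command above. The `/data` and `/workspace/output` defaults and the function name are hypothetical.

```python
import os


def resolve_paths(env=None):
    """Pick data/model locations that work both on SageMaker and in a plain container."""
    env = os.environ if env is None else env
    if "SM_CHANNEL_TRAINING" in env:
        # On SageMaker the channel has already been downloaded to local disk.
        data = env["SM_CHANNEL_TRAINING"]
    elif "S3_BUCKET" in env:
        prefix = env.get("S3_PREFIX", "").strip("/")
        data = f"s3://{env['S3_BUCKET']}/{prefix}".rstrip("/")
    else:
        data = "/data"  # hypothetical local default
    model_dir = env.get("SM_MODEL_DIR", "/workspace/output")
    return {"data": data, "model_dir": model_dir}
```

With this in place the same `train.py` can run under SageMaker during the parallel-testing period and in the ported container afterwards.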
Scenario 2: EKS GPU Cluster → io.net
AWS EKS manifest:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: pytorch
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 4
io.net equivalent:
# io.net has Kubernetes support
io k8s create-cluster --name my-cluster
# Apply same manifests
kubectl --context ionet apply -f gpu-pod.yaml
# Or use simplified CLI
io deploy --image pytorch/pytorch:latest --gpu A100 --count 4
Scenario 3: EC2 Inference API → io.net
AWS setup:
EC2 (p3.2xlarge with 1x V100) → ELB → Route 53
- Instance: $3.06/hour
- Load balancer: $16/month
- Data transfer: $0.09/GB
Monthly cost: ~$2,250 (730 hours)
io.net setup:
io deploy --image my-api:latest \
--gpu A100 --replicas 2 \
--autoscale min=1,max=5 \
--port 443 \
--domain api.example.com
# Built-in load balancing, auto-scaling, HTTPS
# Cost per A100 replica: $1.10/hour × 730 hours = $803/month
# Savings with one replica running: $1,447/month (64%); each additional running replica adds to the io.net cost
AWS-Specific Service Replacements
| AWS Service | io.net Equivalent | Notes |
|---|---|---|
| EC2 P/G instances | io.net GPU instances | Direct replacement |
| SageMaker Training | Containerized training | Convert to Docker |
| SageMaker Inference | io.net deployment + vLLM | API-compatible |
| S3 | S3 (access directly) or io.net storage | AWS creds still work |
| EBS volumes | io.net persistent storage | NVMe SSD, similar performance |
| VPC | io.net private networking | Isolated networks per deployment |
| CloudWatch | io.net dashboard + Prometheus | Metrics API available |
| IAM | io.net RBAC | Team-based access control |
| ELB | Built-in load balancing | Automatic with replicas |
Migration Checklist
- [ ] Audit current AWS GPU usage (instance types, hours, costs)
- [ ] Calculate io.net savings (use pricing calculator)
- [ ] Containerize workloads (if not already Docker-based)
- [ ] Identify data dependencies (S3, databases, APIs)
- [ ] Plan data migration (streaming vs. one-time transfer)
- [ ] Deploy test workload on io.net (validate performance)
- [ ] Run parallel for 7 days (compare metrics side-by-side)
- [ ] Update DNS/endpoints (cutover to io.net)
- [ ] Monitor for 14 days (ensure stability)
- [ ] Decommission AWS resources (terminate instances, clean up)
Performance Comparison: AWS vs. io.net
| Workload | AWS Config | AWS Cost | io.net Config | io.net Cost | io.net Performance vs. AWS |
|---|---|---|---|---|---|
| Llama 3 70B training | 8x A100 (p4d.24xlarge) | $32.77/hr | 8x A100 | $8.80/hr | <5% (comparable) |
| Stable Diffusion API | 1x A10G (g5.2xlarge) | $1.21/hr | 1x RTX 4090 | $0.18/hr | +15% (faster) |
| Batch inference | 4x V100 (p3.8xlarge) | $12.24/hr | 4x A100 | $4.40/hr | +80% (much faster) |
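
The hourly rates and performance differences in the table can be folded into one number: cost per unit of work. The sketch below treats the AWS configuration as throughput 1.0 and divides the io.net rate by the relative speedup. Reading "+80%" as 1.8× throughput and "<5% (comparable)" as 1.0× is an interpretation of the table, not a measured figure, and `savings_pct` is a hypothetical helper.

```python
def savings_pct(aws_rate, ionet_rate, speedup=1.0):
    """Percent saved per unit of work; speedup is io.net throughput relative to AWS."""
    if aws_rate <= 0 or speedup <= 0:
        raise ValueError("aws_rate and speedup must be positive")
    # A faster configuration finishes the same work in fewer billed hours.
    effective_ionet = ionet_rate / speedup
    return round(100 * (1 - effective_ionet / aws_rate), 1)


# (AWS $/hr, io.net $/hr, assumed relative throughput) taken from the table rows
rows = {
    "Llama 3 70B training": (32.77, 8.80, 1.0),
    "Stable Diffusion API": (1.21, 0.18, 1.15),
    "Batch inference": (12.24, 4.40, 1.8),
}
for name, (aws, ionet, speedup) in rows.items():
    print(f"{name}: {savings_pct(aws, ionet, speedup)}% lower cost per unit of work")
```

With `speedup` left at 1.0 the function reduces to the plain hourly-rate comparison used in the cost examples earlier in this guide.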
Ready to migrate? Start on io.net and see 50-70% cost savings immediately.
