Quick Answer
Yes. io.net is purpose-built for AI model training and supports the major deep learning frameworks, including PyTorch, TensorFlow, JAX, and HuggingFace Transformers. You can train everything from small vision models to 70B+ parameter LLMs on a single GPU or on distributed clusters of 100+ GPUs. With H100s at $2.20/hr (vs. $6.98/hr on AWS), training workloads cost about 68% less. The platform supports full fine-tuning, LoRA, and QLoRA, as well as distributed training frameworks such as DeepSpeed, FSDP, and Ray, and its pre-configured containers reduce setup time from hours to minutes.
What AI Training Workloads Run on io.net
io.net supports the full spectrum of AI training use cases:
Large Language Model Training:
- Full pre-training (7B-70B+ parameters)
- Fine-tuning with custom datasets
- LoRA and QLoRA efficient fine-tuning
- Multi-GPU distributed training with DeepSpeed, FSDP
- Instruction tuning and alignment (RLHF, DPO)
Computer Vision:
- Image classification and object detection (ResNet, YOLO, Vision Transformers)
- Semantic segmentation (U-Net, Mask R-CNN)
- Generative models (Stable Diffusion, GANs)
- Video understanding and generation
Audio and Speech:
- Speech recognition (Whisper, Wav2Vec)
- Text-to-speech synthesis
- Audio generation and music models
Multimodal Models:
- Vision-language models (CLIP, BLIP, LLaVA)
- Text-to-image generation
- Video captioning and understanding
Real Training Performance Benchmarks
Here's how different training workloads perform on io.net GPUs:
| Model | Task | GPU | Batch Size | Time to Train | Cost on io.net | Cost on AWS | Savings |
|---|---|---|---|---|---|---|---|
| Llama 3 8B | LoRA fine-tune (10K samples) | 1x A100 80GB | 4 | 6 hours | $7.20 | $24.60 | 71% |
| Llama 3 8B | Full fine-tune (50K samples) | 8x A100 80GB | 32 | 48 hours | $573 | $1,574 | 64% |
| Llama 3 70B | LoRA fine-tune (10K samples) | 4x H100 SXM | 2 | 12 hours | $106 | $335 | 68% |
| Stable Diffusion XL | Train from scratch (100K images) | 4x RTX 4090 | 64 | 72 hours | $52 | N/A | N/A |
| ResNet-50 | ImageNet training (1.2M images) | 8x RTX 4090 | 256 | 24 hours | $35 | N/A | N/A |
| Whisper Large | Fine-tune on custom audio | 2x L40S | 16 | 18 hours | $27 | $54 | 50% |
Benchmarks based on standard training configurations. Actual performance varies by hyperparameters and data pipeline efficiency.
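The per-run costs in the table follow from GPU count, hourly rate, and wall-clock hours. Here is a minimal sketch of that arithmetic, using the $2.20/hr io.net and $6.98/hr AWS H100 rates quoted earlier. The function names are illustrative and not part of any io.net tooling.

```python
def run_cost(num_gpus: int, rate_per_gpu_hour: float, hours: float) -> float:
    """Total cost of a run billed per GPU-hour."""
    return num_gpus * rate_per_gpu_hour * hours


def savings_pct(cost: float, baseline_cost: float) -> float:
    """Percentage saved relative to a baseline cost."""
    return (1 - cost / baseline_cost) * 100


# Llama 3 70B LoRA row: 4x H100 for 12 hours
ionet = run_cost(4, 2.20, 12)
aws = run_cost(4, 6.98, 12)
print(f"io.net ${ionet:.2f}, AWS ${aws:.2f}, savings {savings_pct(ionet, aws):.0f}%")
```

Rounded to whole dollars, this reproduces the $106 and $335 figures and the 68% savings shown for that row.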
Supported Training Frameworks and Tools
io.net provides pre-configured environments for all major AI frameworks:
Deep Learning Frameworks:
- PyTorch: Full support for PyTorch 2.0+ with torch.compile and FSDP
- TensorFlow: TensorFlow 2.x with XLA acceleration
- JAX: Optimized for large-scale training with pjit and SPMD
- HuggingFace: Transformers, Accelerate, PEFT, TRL pre-installed
Training Optimization:
- DeepSpeed: ZeRO stages 1-3 for memory-efficient training
- FSDP (Fully Sharded Data Parallel): PyTorch native distributed training
- Ray Train: Distributed training orchestration
- Horovod: Multi-GPU and multi-node training
- Flash Attention 2: 2-4x faster attention for transformers
Fine-Tuning Libraries:
- Axolotl: One-config fine-tuning for LLMs
- Unsloth: 2x faster LoRA training with reduced memory
- PEFT (Parameter-Efficient Fine-Tuning): LoRA, QLoRA, prefix tuning
- TRL (Transformer Reinforcement Learning): RLHF and DPO
Pre-configured Containers:
# Launch PyTorch training environment
io launch --gpu A100 --image pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
# Launch HuggingFace fine-tuning environment
io launch --gpu A100 --image huggingface/transformers-pytorch-gpu:latest
# Launch Axolotl for one-config LLM fine-tuning
io launch --gpu H100 --image winglian/axolotl:main-py3.11-cu121-2.2.1
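Once a container is running, it can be worth confirming which training libraries it actually ships before starting a long job. The sketch below is one way to do that with the Python standard library only. The package list is an assumption based on the libraries named in this section, and `installed_versions` is a hypothetical helper, not an io.net utility.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Libraries named in the lists above (PyPI distribution names)
expected = ["torch", "transformers", "accelerate", "peft", "trl", "deepspeed"]
for name, version in installed_versions(expected).items():
    print(f"{name:<14}{version or 'NOT INSTALLED'}")
```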
How to Train Your First Model on io.net
Step 1: Launch a GPU instance
# For LoRA fine-tuning a 7B model
io launch --gpu A100 --count 1 --disk 100GB
# For full fine-tuning a 70B model
io launch --gpu H100 --count 8 --disk 500GB --network nvlink
Step 2: Set up your training environment
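# Install dependencies
pip install torch transformers accelerate peft datasets wandb
# Load the base model and tokenizer
import torch  # required for torch.bfloat16 below
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
model_id = "meta-llama/Meta-Llama-3-8B"  # gated repo: accept the license on the Hub first
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Load your dataset (placeholder file name; replace with your own data)
dataset = load_dataset("json", data_files="train.jsonl")
# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # report how many weights LoRA will train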
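To see why this configuration is cheap to train, it helps to count the weights LoRA adds. Each adapted linear layer of shape d_in x d_out gains two low-rank matrices with r * (d_in + d_out) parameters in total. The sketch below applies that to the q_proj and v_proj targets above. It assumes the published Llama 3 8B shape (hidden size 4096, 32 layers, 8 key/value heads of dimension 128) and roughly 8.03B base parameters. These dimensions are assumptions stated here, not values read from the checkpoint.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Llama 3 8B dimensions (see the paragraph above)
hidden = 4096
num_layers = 32
kv_dim = 8 * 128  # grouped-query attention: v_proj maps 4096 -> 1024
r = 16

per_layer = lora_params(hidden, hidden, r) + lora_params(hidden, kv_dim, r)
trainable = per_layer * num_layers
base_params = 8.03e9  # approximate total parameter count

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of base model: {100 * trainable / base_params:.3f}%")
```

Adding more target modules, such as the remaining attention or MLP projections, raises the trainable count accordingly.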
Step 3: Run distributed training (for multi-GPU)
# Using Accelerate for distributed training
# (the flags after train.py are defined by your own training script)
accelerate launch --multi_gpu --num_processes 8 train.py \
  --model_name meta-llama/Meta-Llama-3-70B \
  --dataset custom_dataset \
  --batch_size 4 \
  --gradient_accumulation 8 \
  --learning_rate 2e-5 \
  --num_epochs 3
# Using DeepSpeed ZeRO-3 for memory efficiency
deepspeed --num_gpus=8 train.py \
  --deepspeed ds_config_zero3.json \
  --model_name meta-llama/Meta-Llama-3-70B \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing
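The DeepSpeed command above expects a ds_config_zero3.json file that is not shown. The sketch below writes one possible minimal version. The keys are standard DeepSpeed configuration options, but the specific values (bf16, a gradient accumulation of 8, no CPU offload) are assumptions to adapt, not settings taken from io.net.

```python
import json

num_gpus = 8     # matches --num_gpus=8
micro_batch = 1  # matches --per_device_train_batch_size 1
grad_accum = 8   # assumed; not specified in the command above

ds_config = {
    # DeepSpeed requires: train_batch_size = micro_batch * grad_accum * num_gpus
    "train_batch_size": micro_batch * grad_accum * num_gpus,
    "train_micro_batch_size_per_gpu": micro_batch,
    "gradient_accumulation_steps": grad_accum,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # shard parameters, gradients, and optimizer state
        "overlap_comm": True,  # overlap communication with computation
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_clipping": 1.0,
}

with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

With these numbers the effective global batch size is 64 sequences per optimizer step. If the three batch fields do not multiply out consistently, DeepSpeed rejects the configuration at startup.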
Step 4: Monitor training
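# Integrate with Weights & Biases for monitoring
import wandb
wandb.init(project="llama-finetuning")
# With the HuggingFace Trainer, set report_to="wandb" and metrics are logged automatically
# In a custom loop, log them yourself, e.g. wandb.log({"train/loss": loss})
# View GPU utilization in the io.net dashboard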
Multi-GPU and Distributed Training
io.net supports scaling from a single GPU to clusters of 100+ GPUs:
Cluster Configuration Options:
| Setup | Use Case | Network | GPUs | Cost Example |
|---|---|---|---|---|
| Single GPU | LoRA fine-tuning, small models | N/A | 1x A100 | $1.20/hr |
| 2-GPU NVLink | Medium model full fine-tune | NVLink | 2x A100 | $2.40/hr |
| 8-GPU Node | Large model training (70B) | NVLink/NVSwitch | 8x H100 | $17.60/hr |
| Multi-Node | Pre-training, massive datasets | InfiniBand | 64x H100 | $140.80/hr |
Distributed Training Patterns:
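# Distributed Data Parallel (DDP) - Replicate model on each GPU
# Best for: Models that fit on single GPU, large batch sizes
torchrun --nproc_per_node=8 train.py --distributed
# Fully Sharded Data Parallel (FSDP) - Shard model across GPUs
# Best for: Models too large for single GPU (30B+ parameters)
# Requires an initialized process group (e.g. a script launched with torchrun)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
model = FSDP(model, auto_wrap_policy=size_based_auto_wrap_policy)
# Pipeline Parallel - Split model layers across GPUs
# Best for: Extremely large models (100B+), maximize throughput
# Pipe expects an nn.Sequential whose stages are already placed on different GPUs;
# newer PyTorch releases replace this module with torch.distributed.pipelining
from torch.distributed.pipeline.sync import Pipe
model = Pipe(model, chunks=8)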
# DeepSpeed ZeRO - Memory-optimized distributed training
# Best for: Limited GPU memory, 70B+ models on consumer GPUs
deepspeed train.py --deepspeed ds_config.json
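A rough way to choose among these patterns is to estimate the per-GPU memory needed for model states. The sketch below uses the accounting from the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for optimizer states (fp32 weights, momentum, variance). It ignores activations, buffers, and fragmentation, so treat the results as lower bounds. The function name is illustrative.

```python
def model_state_gb(num_params: float, num_gpus: int, zero_stage: int = 0) -> float:
    """Per-GPU memory (GB) for weights, gradients, and Adam states under ZeRO."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if zero_stage >= 1:
        optim /= num_gpus  # stage 1 shards optimizer states
    if zero_stage >= 2:
        grads /= num_gpus  # stage 2 also shards gradients
    if zero_stage >= 3:
        weights /= num_gpus  # stage 3 also shards the weights
    return num_params * (weights + grads + optim) / 1e9


# Full fine-tuning an 8B-parameter model on 8 GPUs
for stage in range(4):
    print(f"ZeRO-{stage}: {model_state_gb(8e9, 8, stage):.1f} GB per GPU")
```

FSDP with full sharding corresponds to the ZeRO-3 case. Plain data parallelism corresponds to stage 0, where every GPU holds a complete copy of all model states.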
Cost Comparison: Training on io.net vs. Competitors
Llama 3 8B Fine-Tuning (50K examples, 3 epochs):
| Provider | GPU | Configuration | Time | Total Cost |
|---|---|---|---|---|
| io.net | 8x A100 80GB | FSDP | 48 hrs | $573 |
| AWS | 8x A100 40GB | p4d.24xlarge | 48 hrs | $1,574 |
| Azure | 8x A100 40GB | ND96asr_v4 | 48 hrs | $1,478 |
| CoreWeave | 8x A100 80GB | Reserved | 48 hrs | $849 |
| Savings vs. AWS | | | | 64% |
Stable Diffusion Training (100K images, 50K steps):
| Provider | GPU | Configuration | Time | Total Cost |
|---|---|---|---|---|
| io.net | 4x RTX 4090 | Data Parallel | 72 hrs | $52 |
| RunPod | 4x RTX 4090 | Spot | 72 hrs | $86 |
| Vast.ai | 4x RTX 4090 | Variable | 72 hrs | $72-120 |
| Lambda Labs | 4x RTX 4090 | (Sold out) | 72 hrs | N/A |
| Savings vs. RunPod | | | | 40% |
Why io.net is Optimized for AI Training
1. High-Bandwidth GPU Interconnects:
Multi-GPU training requires fast GPU-to-GPU communication. io.net clusters include:
- NVLink: 600 GB/s per GPU on A100 and 900 GB/s on H100 (vs. about 64 GB/s for PCIe 4.0 x16)
- NVSwitch: Full all-to-all connectivity for 8-GPU nodes
- InfiniBand: 200-400 Gbps for multi-node training
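To make the bandwidth gap concrete, the sketch below estimates the time for one gradient all-reduce using the standard ring all-reduce cost model, in which each GPU sends and receives 2(N-1)/N times the buffer size. It is an idealized, bandwidth-only estimate that treats the quoted link speeds as achievable throughput and ignores latency and overlap with compute. The 8B-parameter bf16 gradient buffer is an assumed example.

```python
def ring_allreduce_seconds(buffer_gb: float, num_gpus: int, link_gb_per_s: float) -> float:
    """Idealized ring all-reduce time: 2 * (N - 1) / N * size / bandwidth."""
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * buffer_gb
    return traffic_gb / link_gb_per_s


grad_gb = 8e9 * 2 / 1e9  # 8B parameters * 2 bytes (bf16) = 16 GB of gradients

for name, bandwidth in [("NVLink", 600), ("PCIe", 64)]:
    t = ring_allreduce_seconds(grad_gb, num_gpus=8, link_gb_per_s=bandwidth)
    print(f"{name:<7}{t * 1000:8.1f} ms per all-reduce")
```

Because the cost scales inversely with link bandwidth, the slower link adds the same factor to every synchronization step, which is why interconnect speed matters more as the model and GPU count grow.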
2. Fast Storage for Datasets:
Training performance bottlenecks often come from data loading, not GPU compute. io.net provides:
- NVMe SSD storage (6,000+ MB/s read speeds)
- Pre-cached common datasets (ImageNet, Common Crawl, The Pile)
- Direct S3/GCS integration for your custom data
3. Checkpoint and Resume:
Long training runs need fault tolerance. io.net supports:
- Automatic checkpointing every N steps
- Resume from last checkpoint on GPU failure
- Checkpoint storage included (no egress fees)
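The checkpoint-and-resume pattern described above can be sketched independently of any platform. The example below is a simplified illustration, not io.net's implementation. It uses a small JSON state as a stand-in so it runs without a GPU, whereas a real training script would save model and optimizer state (for example with torch.save). All function names are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic on the same filesystem


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path: str, total_steps: int, every_n: int, crash_at=None) -> dict:
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if step == crash_at:
            raise RuntimeError("simulated GPU failure")
        state = {"step": step, "loss": 1.0 / step}  # stand-in for a real training step
        if step % every_n == 0:
            save_checkpoint(path, state)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    train(ckpt, total_steps=20, every_n=5, crash_at=13)
except RuntimeError:
    pass
resumed_from = load_checkpoint(ckpt)["step"]  # last checkpoint before the failure
final = train(ckpt, total_steps=20, every_n=5)  # resumes after that step
print(f"resumed from step {resumed_from}, finished at step {final['step']}")
```

The work done between the last checkpoint and the failure is repeated after resuming, so the checkpoint interval is a trade-off between write overhead and recomputed steps.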
4. Experiment Tracking Integration:
Pre-integrated with Weights & Biases, TensorBoard, and MLflow for tracking:
- Training loss curves
- GPU utilization and memory
- Hyperparameter comparison
- Cost per experiment
5. Instant Scaling:
Start with 1 GPU for experimentation, scale to 8+ GPUs for production training:
- Add GPUs mid-run without restarting
- Auto-scaling based on queue depth
- Pay only for active training time
Common Training Scenarios and Recommendations
Scenario 1: Fine-tuning Llama 3 8B for chatbot
- Recommended GPU: 1x A100 80GB ($1.20/hr)
- Method: QLoRA (LoRA with r=16 on a 4-bit quantized base model)
- Training time: 4-6 hours on 10K examples
- Total cost: ~$5-7 per experiment
Scenario 2: Training custom Stable Diffusion model
- Recommended GPU: 2x RTX 4090 ($0.36/hr)
- Method: DreamBooth or fine-tuning
- Training time: 12-24 hours on 1K images
- Total cost: ~$4-9 per model
Scenario 3: Full fine-tune Llama 3 70B on proprietary data
- Recommended GPU: 8x H100 SXM ($17.60/hr)
- Method: FSDP + Flash Attention 2 + gradient checkpointing
- Training time: 3-5 days on 100K examples
- Total cost: ~$1,267-2,112 per run
Scenario 4: Pre-training 7B model from scratch
- Recommended GPU: 32x H100 SXM ($70.40/hr)
- Method: DeepSpeed ZeRO-3 + pipeline parallel
- Training time: 2-4 weeks on 300B tokens
- Total cost: ~$23,654-47,309 (vs. $80K+ on AWS)
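Training-time estimates like those in Scenario 4 can be sanity-checked with the common approximation that training a dense transformer takes about 6 * N * D floating-point operations for N parameters and D tokens. The sketch below applies it, assuming an H100 SXM dense BF16 peak of about 989 TFLOPS and a model FLOPs utilization (MFU) of 35%. The MFU in particular is an assumption that varies widely with the software stack and data pipeline.

```python
def training_days(params: float, tokens: float, num_gpus: int,
                  peak_flops: float, mfu: float) -> float:
    """Estimated wall-clock days using the 6 * N * D compute approximation."""
    total_flops = 6 * params * tokens
    sustained = num_gpus * peak_flops * mfu  # achieved FLOP/s across the cluster
    return total_flops / sustained / 86_400


# Scenario 4: 7B parameters, 300B tokens, 32x H100 at $70.40/hr
days = training_days(7e9, 300e9, num_gpus=32, peak_flops=989e12, mfu=0.35)
cost = days * 24 * 70.40
print(f"~{days:.1f} days, ~${cost:,.0f}")
```

Under these assumptions the estimate lands close to the lower end of the 2-4 week range. A lower MFU, restarts, or evaluation passes push it toward the upper end.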
Related Questions
How long does it take to train a Llama 3 model?
LoRA fine-tuning Llama 3 8B on 10K examples takes 4-6 hours on a single A100 80GB. Full fine-tuning the same model on 50K examples takes 48-72 hours on 8x A100. Full fine-tuning Llama 3 70B requires 8x H100 and takes 3-5 days. For reference, pre-training Llama 3 8B from scratch on 15 trillion tokens would cost roughly $2M in compute.
What's the difference between LoRA and full fine-tuning?
LoRA (Low-Rank Adaptation) fine-tunes only 0.1-1% of model parameters, reducing memory usage by 3-4x and training time by 50-70%. It costs $5-10 per experiment vs. $500-1000 for full fine-tuning. Use LoRA for most use cases (chatbots, domain adaptation, instruction following). Use full fine-tuning only when you need maximum model quality and have 50K+ high-quality training examples.
Can I pause and resume training jobs?
Yes. io.net supports checkpoint-based training where your model state is saved every N steps (configurable). If a GPU fails or you stop the job, you can resume from the last checkpoint without losing progress. For long training runs (72+ hours), enable automatic checkpointing every 500-1000 steps. Checkpoints are stored in persistent storage with no egress fees.
Do I need to manage GPU clusters myself?
No. io.net handles cluster orchestration automatically. When you request 8x H100 GPUs, the platform provisions a cluster with proper networking (NVLink/InfiniBand), configures distributed training frameworks, and handles GPU health monitoring. You just run your training script with standard distributed training commands (torchrun, accelerate launch, deepspeed). For advanced users, Kubernetes deployments are also supported.
What happens if my training job fails mid-run?
io.net automatically detects GPU failures and can either (1) migrate your job to a healthy GPU cluster or (2) resume from the last checkpoint on a new cluster. You're only charged for successful compute time. For fault tolerance, enable checkpointing in your training script and io.net will store checkpoints in persistent storage. Most training runs complete successfully, but for critical multi-day jobs, checkpoint every 1-2 hours.
Start Training on io.net Today
Get 68% cost savings on GPU training compared to AWS:
- GPUs available instantly - H100, A100, RTX 4090, L40S
- Pre-configured environments for PyTorch, TensorFlow, HuggingFace
- Distributed training with NVLink, NVSwitch, InfiniBand
- Per-second billing - pay only for active training time
Last updated: April 2026 | Training benchmarks based on standard configurations with io.net optimized containers
