Quick Answer

Yes. io.net is built for AI model training and supports the major deep learning frameworks, including PyTorch, TensorFlow, JAX, and HuggingFace Transformers. You can train everything from small vision models to 70B+ parameter LLMs on a single GPU or on distributed clusters of 100+ GPUs. With H100s at $2.20/hr (vs. $6.98/hr on AWS), io.net costs about 68% less for training workloads. The platform supports full fine-tuning, LoRA, and QLoRA, along with distributed training frameworks such as DeepSpeed, FSDP, and Ray, and its pre-configured containers reduce setup time from hours to minutes.

What AI Training Workloads Run on io.net

io.net supports the full spectrum of AI training use cases:

Large Language Model Training:
- Full pre-training (7B-70B+ parameters)
- Fine-tuning with custom datasets
- LoRA and QLoRA efficient fine-tuning
- Multi-GPU distributed training with DeepSpeed, FSDP
- Instruction tuning and alignment (RLHF, DPO)

Computer Vision:
- Image classification and object detection (ResNet, YOLO, Vision Transformers)
- Semantic segmentation (U-Net, Mask R-CNN)
- Generative models (Stable Diffusion, GANs)
- Video understanding and generation

Audio and Speech:
- Speech recognition (Whisper, Wav2Vec)
- Text-to-speech synthesis
- Audio generation and music models

Multimodal Models:
- Vision-language models (CLIP, BLIP, LLaVA)
- Text-to-image generation
- Video captioning and understanding

Real Training Performance Benchmarks

Here's how different training workloads perform on io.net GPUs:

| Model | Task | GPU | Batch Size | Time to Train | Cost on io.net | Cost on AWS | Savings |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3 8B | LoRA fine-tune (10K samples) | 1x A100 80GB | 4 | 6 hours | $7.20 | $24.60 | 71% |
| Llama 3 8B | Full fine-tune (50K samples) | 8x A100 80GB | 32 | 48 hours | $573 | $1,574 | 64% |
| Llama 3 70B | LoRA fine-tune (10K samples) | 4x H100 SXM | 2 | 12 hours | $106 | $335 | 68% |
| Stable Diffusion XL | Train from scratch (100K images) | 4x RTX 4090 | 64 | 72 hours | $52 | N/A | N/A |
| ResNet-50 | ImageNet training (1.2M images) | 8x RTX 4090 | 256 | 24 hours | $35 | N/A | N/A |
| Whisper Large | Fine-tune on custom audio | 2x L40S | 16 | 18 hours | $27 | $54 | 50% |

Benchmarks based on standard training configurations. Actual performance varies by hyperparameters and data pipeline efficiency.
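
The cost columns above follow from simple arithmetic: GPU count times hourly rate times hours. Here is a minimal Python sketch, assuming flat per-GPU-hour pricing and using the $2.20/hr H100 rate quoted in this article. The function names are illustrative and not part of any io.net tooling.

```python
def training_cost(gpu_count: int, rate_per_gpu_hour: float, hours: float) -> float:
    """Total cost in dollars, assuming flat per-GPU-hour pricing."""
    return gpu_count * rate_per_gpu_hour * hours


def savings_percent(cost: float, baseline_cost: float) -> float:
    """Percentage saved relative to a more expensive baseline."""
    return (1 - cost / baseline_cost) * 100


# Llama 3 70B LoRA row: 4x H100 at $2.20/hr for 12 hours
io_cost = training_cost(gpu_count=4, rate_per_gpu_hour=2.20, hours=12)
print(f"io.net cost: ${io_cost:.2f}")
print(f"Savings vs. $335 on AWS: {savings_percent(io_cost, 335):.0f}%")
```

The same two functions can be pointed at any other row by swapping in its GPU count, rate, and duration.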

Supported Training Frameworks and Tools

io.net provides pre-configured environments for all major AI frameworks:

Deep Learning Frameworks:
- PyTorch: Full support for PyTorch 2.0+ with compiled mode and FSDP
- TensorFlow: TensorFlow 2.x with XLA acceleration
- JAX: Optimized for large-scale training with pjit and SPMD
- HuggingFace: Transformers, Accelerate, PEFT, TRL pre-installed

Training Optimization:
- DeepSpeed: ZeRO stages 1-3 for memory-efficient training
- FSDP (Fully Sharded Data Parallel): PyTorch native distributed training
- Ray Train: Distributed training orchestration
- Horovod: Multi-GPU and multi-node training
- Flash Attention 2: 2-4x faster attention for transformers

Fine-Tuning Libraries:
- Axolotl: One-config fine-tuning for LLMs
- Unsloth: 2x faster LoRA training with reduced memory
- PEFT (Parameter-Efficient Fine-Tuning): LoRA, QLoRA, prefix tuning
- TRL (Transformer Reinforcement Learning): RLHF and DPO

Pre-configured Containers:

# Launch PyTorch training environment
io launch --gpu A100 --image pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime

# Launch HuggingFace fine-tuning environment
io launch --gpu A100 --image huggingface/transformers-pytorch-gpu:latest

# Launch Axolotl for one-config LLM fine-tuning
io launch --gpu H100 --image winglian/axolotl:main-py3.11-cu121-2.2.1

How to Train Your First Model on io.net

Step 1: Launch a GPU instance

# For LoRA fine-tuning a 7B model
io launch --gpu A100 --count 1 --disk 100GB

# For full fine-tuning a 70B model
io launch --gpu H100 --count 8 --disk 500GB --network nvlink

Step 2: Set up your training environment

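# Install dependencies (run in your shell)
pip install torch transformers accelerate peft datasets wandb

# Load your model and tokenizer (Python, e.g. at the top of train.py)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from peft import LoraConfig, get_peft_model

# Gated model: requires accepting the license on HuggingFace and logging in
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half the memory of fp32 weights
    device_map="auto"            # place layers on the available GPUs
)
# dataset = load_dataset("json", data_files="your_data.jsonl")  # placeholder path

# Configure LoRA: train small low-rank adapters on the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirm only the adapters are trainable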
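
To see why the LoRA configuration above is cheap to train, you can count the adapter parameters by hand. Each adapted linear layer with input size d_in and output size d_out gains two low-rank matrices holding r × (d_in + d_out) parameters in total. The sketch below assumes the published Llama 3 8B dimensions (hidden size 4096, 32 layers, 8 key-value heads of dimension 128, roughly 8.03B total parameters). These figures are not stated in this article, so treat them as assumptions.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Llama 3 8B dimensions (not taken from this article)
HIDDEN = 4096
NUM_LAYERS = 32
KV_DIM = 8 * 128          # grouped-query attention: 8 KV heads x head_dim 128
TOTAL_PARAMS = 8.03e9

R = 16  # matches r=16 in the LoraConfig above

per_layer = (
    lora_params(HIDDEN, HIDDEN, R)    # q_proj: 4096 -> 4096
    + lora_params(HIDDEN, KV_DIM, R)  # v_proj: 4096 -> 1024
)
trainable = per_layer * NUM_LAYERS
print(f"Trainable LoRA parameters: {trainable:,}")
print(f"Share of the full model: {trainable / TOTAL_PARAMS:.3%}")
```

Under these assumptions the adapters come to about 6.8M parameters, a little under 0.1% of the model. Adding more target modules or raising r increases that share.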

Step 3: Run distributed training (for multi-GPU)

# Using Accelerate for distributed training
# (flags after train.py are arguments defined by your own training script)
accelerate launch --multi_gpu --num_processes 8 train.py \
  --model_name meta-llama/Meta-Llama-3-70B \
  --dataset custom_dataset \
  --batch_size 4 \
  --gradient_accumulation 8 \
  --learning_rate 2e-5 \
  --num_epochs 3

# Using DeepSpeed ZeRO-3 for memory efficiency
deepspeed --num_gpus=8 train.py \
  --deepspeed ds_config_zero3.json \
  --model_name meta-llama/Meta-Llama-3-70B \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing
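
The Accelerate command above combines a per-GPU batch size, gradient accumulation, and 8 processes, so the batch the optimizer actually sees is the product of the three. Here is a small sketch of that arithmetic. It assumes `--batch_size` is per GPU, which depends on how your own train.py interprets the flag, and it borrows the 50K-example dataset size from the benchmark table purely as an illustration.

```python
import math


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Sequences contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


def optimizer_steps_per_epoch(num_examples: int, effective_batch: int) -> int:
    """Optimizer steps needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / effective_batch)


batch = effective_batch_size(per_device_batch=4, grad_accum_steps=8, num_gpus=8)
steps = optimizer_steps_per_epoch(num_examples=50_000, effective_batch=batch)
print(f"Effective batch size: {batch}")
print(f"Optimizer steps per epoch: {steps}")
```

Knowing the effective batch size matters when you change the GPU count: to keep training dynamics comparable, adjust gradient accumulation so the product stays the same.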

Step 4: Monitor training

# Integrate with Weights & Biases for monitoring
import wandb
wandb.init(project="llama-finetuning")

# With the HuggingFace Trainer, pass report_to="wandb" in TrainingArguments to log metrics automatically
# View GPU utilization in io.net dashboard

Multi-GPU and Distributed Training

io.net supports scaling from single GPU to 100+ GPU clusters:

Cluster Configuration Options:

| Setup | Use Case | Network | GPUs | Cost Example |
| --- | --- | --- | --- | --- |
| Single GPU | LoRA fine-tuning, small models | N/A | 1x A100 | $1.20/hr |
| 2-GPU NVLink | Medium model full fine-tune | NVLink | 2x A100 | $2.40/hr |
| 8-GPU Node | Large model training (70B) | NVLink/NVSwitch | 8x H100 | $17.60/hr |
| Multi-Node | Pre-training, massive datasets | InfiniBand | 64x H100 | $140.80/hr |

Distributed Training Patterns:

# Distributed Data Parallel (DDP) - Replicate model on each GPU
# Best for: Models that fit on single GPU, large batch sizes
torchrun --nproc_per_node=8 train.py --distributed

# Fully Sharded Data Parallel (FSDP) - Shard model across GPUs
# Best for: Models too large for single GPU (30B+ parameters)
# Run under torchrun, after torch.distributed.init_process_group()
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
model = FSDP(model, auto_wrap_policy=size_based_auto_wrap_policy)

# Pipeline Parallel - Split model layers across GPUs
# Best for: Extremely large models (100B+), maximize throughput
# Note: Pipe expects an nn.Sequential model, and this API is deprecated in
# recent PyTorch releases in favor of torch.distributed.pipelining
from torch.distributed.pipeline.sync import Pipe
model = Pipe(model, chunks=8)

# DeepSpeed ZeRO - Memory-optimized distributed training
# Best for: Limited GPU memory, 70B+ models on consumer GPUs
deepspeed train.py --deepspeed ds_config.json
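
The `ds_config.json` passed to DeepSpeed is not shown in this article. As a rough starting point, the sketch below builds a ZeRO stage 3 configuration in Python and serializes it to JSON. The specific values are assumptions to tune for your workload. The `"auto"` entries only work when DeepSpeed is driven through the HuggingFace Trainer integration, which fills them in from its own arguments.

```python
import json


def build_zero3_config() -> dict:
    """A minimal DeepSpeed ZeRO-3 config; values are illustrative, not tuned."""
    return {
        "bf16": {"enabled": True},  # train in bfloat16
        "zero_optimization": {
            "stage": 3,  # shard parameters, gradients, and optimizer states
            "overlap_comm": True,
            "contiguous_gradients": True,
            # gather full 16-bit weights on save so the checkpoint is loadable
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "gradient_clipping": 1.0,
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }


config_json = json.dumps(build_zero3_config(), indent=2)
print(config_json)  # save this text as ds_config.json
```

If you call DeepSpeed directly instead of through the Trainer, replace the `"auto"` values with explicit integers.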

Cost Comparison: Training on io.net vs. Competitors

Llama 3 8B Fine-Tuning (50K examples, 3 epochs):

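| Provider | GPU | Configuration | Time | Total Cost |
| --- | --- | --- | --- | --- |
| io.net | 8x A100 80GB | FSDP | 48 hrs | $573 |
| AWS | 8x A100 80GB | p4d.24xlarge | 48 hrs | $1,574 |
| Azure | 8x A100 80GB | ND96asr_v4 | 48 hrs | $1,478 |
| CoreWeave | 8x A100 80GB | Reserved | 48 hrs | $849 |

Savings vs. AWS: 64%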

Stable Diffusion Training (100K images, 50K steps):

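| Provider | GPU | Configuration | Time | Total Cost |
| --- | --- | --- | --- | --- |
| io.net | 4x RTX 4090 | Data Parallel | 72 hrs | $52 |
| RunPod | 4x RTX 4090 | Spot | 72 hrs | $86 |
| Vast.ai | 4x RTX 4090 | Variable | 72 hrs | $72-120 |
| Lambda Labs | 4x RTX 4090 | (Sold out) | 72 hrs | N/A |

Savings vs. RunPod: 40%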

Why io.net is Optimized for AI Training

1. High-Bandwidth GPU Interconnects:
Multi-GPU training requires fast GPU-to-GPU communication. io.net clusters include:
- NVLink: up to 600 GB/s per GPU on A100 and 900 GB/s on H100 SXM (vs. 64 GB/s for PCIe Gen4 x16)
- NVSwitch: Full all-to-all connectivity for 8-GPU nodes
- InfiniBand: 200-400 Gbps for multi-node training

2. Fast Storage for Datasets:
Training performance bottlenecks often come from data loading, not GPU compute. io.net provides:
- NVMe SSD storage (6,000+ MB/s read speeds)
- Pre-cached common datasets (ImageNet, Common Crawl, The Pile)
- Direct S3/GCS integration for your custom data

3. Checkpoint and Resume:
Long training runs need fault tolerance. io.net supports:
- Automatic checkpointing every N steps
- Resume from last checkpoint on GPU failure
- Checkpoint storage included (no egress fees)

4. Experiment Tracking Integration:
Pre-integrated with Weights & Biases, TensorBoard, and MLflow for tracking:
- Training loss curves
- GPU utilization and memory
- Hyperparameter comparison
- Cost per experiment

5. Instant Scaling:
Start with 1 GPU for experimentation, scale to 8+ GPUs for production training:
- Add GPUs mid-run without restarting
- Auto-scaling based on queue depth
- Pay only for active training time
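
To put the interconnect bandwidths from point 1 in perspective, a rough estimate of gradient synchronization time helps. In a ring all-reduce across N GPUs, each GPU sends and receives about 2(N-1)/N times the gradient size. The sketch below is a back-of-the-envelope model only. It assumes an 8B-parameter model with bf16 gradients (2 bytes per parameter), treats the quoted 600 GB/s and 64 GB/s figures as usable per-GPU bandwidth, and ignores latency and overlap with compute.

```python
def ring_allreduce_seconds(grad_bytes: float, num_gpus: int, bandwidth_bytes_per_s: float) -> float:
    """Approximate time for one ring all-reduce, bandwidth term only."""
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return traffic_per_gpu / bandwidth_bytes_per_s


GB = 1e9
grad_bytes = 8e9 * 2  # assumed: 8B parameters, 2 bytes each (bf16)

for name, bandwidth in [("NVLink", 600 * GB), ("PCIe", 64 * GB)]:
    seconds = ring_allreduce_seconds(grad_bytes, num_gpus=8, bandwidth_bytes_per_s=bandwidth)
    print(f"{name}: ~{seconds * 1000:.0f} ms per full gradient sync")
```

Because the sync happens on every optimizer step, the roughly 9x gap between the two links compounds over a run of many thousands of steps.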

Common Training Scenarios and Recommendations

Scenario 1: Fine-tuning Llama 3 8B for chatbot
Recommended GPU: 1x A100 80GB ($1.20/hr)
Method: QLoRA (LoRA with r=16 on a 4-bit quantized base model)
Training time: 4-6 hours on 10K examples
Total cost: ~$5-7 per experiment

Scenario 2: Training custom Stable Diffusion model
Recommended GPU: 2x RTX 4090 ($0.36/hr)
Method: DreamBooth or fine-tuning
Training time: 12-24 hours on 1K images
Total cost: ~$4-9 per model

Scenario 3: Full fine-tune Llama 3 70B on proprietary data
Recommended GPU: 8x H100 SXM ($17.60/hr)
Method: FSDP + Flash Attention 2 + gradient checkpointing
Training time: 3-5 days on 100K examples
Total cost: ~$1,267-2,112 per run

Scenario 4: Pre-training 7B model from scratch
Recommended GPU: 32x H100 SXM ($70.40/hr)
Method: DeepSpeed ZeRO-3 + pipeline parallel
Training time: 2-4 weeks on 300B tokens
Total cost: ~$23,654-47,309 (vs. $80K+ on AWS)
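
The Scenario 4 estimate can be sanity-checked with the common approximation that training a dense transformer takes about 6 × parameters × tokens floating-point operations. The sketch below assumes an H100 SXM dense BF16 peak of roughly 989 TFLOPS and a 30% model FLOPs utilization (MFU). Both numbers are assumptions rather than io.net measurements, and real utilization varies widely with the software stack and interconnect.

```python
def pretraining_days(params: float, tokens: float, num_gpus: int,
                     peak_flops_per_gpu: float, mfu: float) -> float:
    """Estimated wall-clock days using the 6 * N * D FLOPs approximation."""
    total_flops = 6 * params * tokens
    sustained_flops_per_s = num_gpus * peak_flops_per_gpu * mfu
    return total_flops / sustained_flops_per_s / 86_400


days = pretraining_days(
    params=7e9,                 # 7B model
    tokens=300e9,               # 300B tokens
    num_gpus=32,
    peak_flops_per_gpu=989e12,  # assumed H100 SXM dense BF16 peak
    mfu=0.30,                   # assumed utilization
)
cost = days * 24 * 70.40  # 32x H100 at $70.40/hr, from Scenario 4
print(f"Estimated duration: {days:.1f} days")
print(f"Estimated cost: ${cost:,.0f}")
```

Under these assumptions the estimate lands at a little over two weeks, near the low end of the Scenario 4 range. A lower MFU pushes it toward the upper end.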

How long does it take to train a Llama 3 model?

LoRA fine-tuning Llama 3 8B on 10K examples takes 4-6 hours on a single A100 80GB. Full fine-tuning the same model on 50K examples takes 48-72 hours on 8x A100. Full fine-tuning Llama 3 70B requires 8x H100 and takes 3-5 days. For reference, pre-training Llama 3 8B from scratch on 15 trillion tokens would cost roughly $2M in compute.

What's the difference between LoRA and full fine-tuning?

LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices that amount to only 0.1-1% of the model's parameters, reducing memory usage by 3-4x and training time by 50-70%. It costs $5-10 per experiment vs. $500-1000 for full fine-tuning. Use LoRA for most use cases (chatbots, domain adaptation, instruction following). Use full fine-tuning only when you need maximum model quality and have 50K+ high-quality training examples.

Can I pause and resume training jobs?

Yes. io.net supports checkpoint-based training where your model state is saved every N steps (configurable). If a GPU fails or you stop the job, you can resume from the last checkpoint without losing progress. For long training runs (72+ hours), enable automatic checkpointing every 500-1000 steps. Checkpoints are stored in persistent storage with no egress fees.
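
The checkpoint-and-resume pattern described above can be sketched in a few lines. This is an illustration of the general technique, not io.net's own mechanism. It stores a JSON file and uses a fake training step so it runs anywhere, whereas a real job would save model and optimizer state (for example with `torch.save`). All function names here are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def save_checkpoint(path: Path, step: int, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    tmp_path = path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps({"step": step, "state": state}))
    os.replace(tmp_path, path)


def load_checkpoint(path: Path) -> Tuple[int, dict]:
    """Return (step, state), or a fresh start if no checkpoint exists."""
    if not path.exists():
        return 0, {}
    data = json.loads(path.read_text())
    return data["step"], data["state"]


def train(path: Path, total_steps: int, checkpoint_every: int,
          fail_at: Optional[int] = None) -> int:
    step, state = load_checkpoint(path)  # resume if a checkpoint is present
    while step < total_steps:
        step += 1
        state = {"loss": 1.0 / step}  # stand-in for a real training step
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, state)
        if step == fail_at:
            raise RuntimeError("simulated GPU failure")
    return step


ckpt = Path(tempfile.mkdtemp()) / "checkpoint.json"
try:
    train(ckpt, total_steps=2000, checkpoint_every=500, fail_at=1200)
except RuntimeError:
    pass  # the job died at step 1200
resumed_from, _ = load_checkpoint(ckpt)
final_step = train(ckpt, total_steps=2000, checkpoint_every=500)
print(f"Resumed from step {resumed_from}, finished at step {final_step}")
```

The simulated failure at step 1200 loses only the 200 steps since the last save, which is the trade-off the checkpoint interval controls.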

Do I need to manage GPU clusters myself?

No. io.net handles cluster orchestration automatically. When you request 8x H100 GPUs, the platform provisions a cluster with proper networking (NVLink/InfiniBand), configures distributed training frameworks, and handles GPU health monitoring. You just run your training script with standard distributed training commands (torchrun, accelerate launch, deepspeed). For advanced users, Kubernetes deployments are also supported.
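
When a script is started with `torchrun`, each worker process receives its position in the job through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. Below is a minimal sketch of reading them. The fallback defaults for single-process runs and the helper name are assumptions for illustration.

```python
import os


def distributed_context() -> dict:
    """Read the rank information that torchrun exports to each worker process."""
    return {
        "rank": int(os.environ.get("RANK", 0)),              # global process index
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # GPU index on this node
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # total processes
    }


ctx = distributed_context()
is_main_process = ctx["rank"] == 0
if is_main_process:
    # Log or save checkpoints from one process only to avoid duplicates
    print(f"Running with {ctx['world_size']} process(es)")
```

Gating logging and checkpoint writes on rank 0 is a common convention that keeps 8 workers from writing the same file at once.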

What happens if my training job fails mid-run?

io.net automatically detects GPU failures and can either (1) migrate your job to a healthy GPU cluster or (2) resume from the last checkpoint on a new cluster. You're only charged for successful compute time. For fault tolerance, enable checkpointing in your training script and io.net will store checkpoints in persistent storage. Most training runs complete successfully, but for critical multi-day jobs, checkpoint every 1-2 hours.
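
Most training libraries take a checkpoint interval in steps rather than hours, so the 1-2 hour guideline needs converting. Here is a small sketch, assuming you have measured your average time per optimizer step. The 6-second figure is a made-up example, not a benchmark.

```python
def checkpoint_interval_steps(interval_hours: float, seconds_per_step: float) -> int:
    """Number of optimizer steps that fit in the desired wall-clock interval."""
    return max(1, round(interval_hours * 3600 / seconds_per_step))


# Hypothetical job averaging 6 seconds per step, checkpointing every 1.5 hours
save_steps = checkpoint_interval_steps(interval_hours=1.5, seconds_per_step=6.0)
print(f"Checkpoint every {save_steps} steps")
```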

Start Training on io.net Today

Get 68% cost savings on GPU training compared to AWS:
- GPUs available instantly - H100, A100, RTX 4090, L40S
- Pre-configured environments for PyTorch, TensorFlow, HuggingFace
- Distributed training with NVLink, NVSwitch, InfiniBand
- Per-second billing - pay only for active training time

Browse GPU inventory → or launch your first training job →


Last updated: April 2026 | Training benchmarks based on standard configurations with io.net optimized containers