If a GPU fails during your job, io.net automatically migrates your workload to a healthy GPU, typically within 30-60 seconds. Stateful workloads are restored from the latest checkpoint (if checkpointing is configured), and you receive a refund for the downtime. Automatic health monitoring detects failing GPUs early to limit the impact on your work.
Automatic Failover
- Detection: Health monitor identifies a failing GPU (overheating, memory errors, or unresponsiveness)
- Checkpoint: Current state is saved (if a checkpoint interval is configured)
- Migration: Workload is moved to a healthy GPU in the same region
- Restore: Job resumes from the checkpoint
- Refund: Downtime is credited to your account
Typical total downtime: 30-60 seconds
Enable Checkpointing
# Configure automatic checkpoints
io deploy --image pytorch/pytorch:latest \
  --gpu A100 \
  --checkpoint-interval 10m \
  --checkpoint-storage persistent-volume \
  --name resilient-job
# If GPU fails, job resumes from last checkpoint
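For the job to resume cleanly, the workload itself generally needs to load the newest checkpoint on startup. The following is a minimal sketch of that pattern, not io.net's implementation. It assumes the persistent volume is mounted at a directory passed in as `ckpt_dir`. The function names, the `step-*.json` file naming, and the toy training loop are all hypothetical. A real PyTorch job would typically serialize model and optimizer state instead of JSON.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write a checkpoint atomically so a failure mid-write leaves no partial file."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step-{step:08d}.json"
    # Write to a temporary file in the same directory, then rename into place.
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir: Path) -> Optional[dict]:
    """Return the newest complete checkpoint, or None when starting fresh."""
    if not ckpt_dir.is_dir():
        return None
    # Zero-padded step numbers make lexicographic order match numeric order.
    candidates = sorted(ckpt_dir.glob("step-*.json"))
    if not candidates:
        return None
    with candidates[-1].open() as f:
        return json.load(f)


def train(ckpt_dir: Path, total_steps: int, save_every: int) -> int:
    """Toy loop: resume from the last checkpoint, then continue to total_steps."""
    ckpt = load_latest_checkpoint(ckpt_dir)
    step = ckpt["step"] if ckpt else 0
    running_total = ckpt["state"]["running_total"] if ckpt else 0
    while step < total_steps:
        step += 1
        running_total += step  # stand-in for one real training step
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, {"running_total": running_total})
    return running_total
```

Calling `train(Path("/path/to/volume"), total_steps=1000, save_every=100)` after a restart would pick up from the last saved step rather than from step 0. Work done since the last checkpoint is repeated, so a shorter checkpoint interval reduces the work lost per failure at the cost of more frequent writes.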
Monitoring
View failure events in the dashboard:
- GPU failure reason (temperature, memory, network)
- Downtime duration
- Refund amount
Failure Rate
- GPU failure rate: <0.1% per day (fewer than 1 failure per 1,000 GPU-days)
- Auto-recovery success: 98%
- Manual intervention needed: <2%
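To see what these rates could mean for a specific job, the sketch below turns them into rough estimates. It is an illustration under stated assumptions, not an official calculator. It treats the 0.1% daily failure rate as an upper bound, takes 98% as the auto-recovery rate, and assumes failures are independent with a constant rate. The function names are hypothetical.

```python
def expected_failures(gpus: int, days: float, daily_failure_rate: float = 0.001) -> float:
    """Expected number of GPU failures over the job (GPU-days times daily rate)."""
    return gpus * days * daily_failure_rate


def prob_at_least_one_failure(gpus: int, days: float, daily_failure_rate: float = 0.001) -> float:
    """Probability that at least one GPU fails, assuming independent failures."""
    gpu_days = gpus * days
    return 1.0 - (1.0 - daily_failure_rate) ** gpu_days


def expected_manual_interventions(
    gpus: int,
    days: float,
    daily_failure_rate: float = 0.001,
    auto_recovery_rate: float = 0.98,
) -> float:
    """Expected number of failures that are not recovered automatically."""
    failures = expected_failures(gpus, days, daily_failure_rate)
    return failures * (1.0 - auto_recovery_rate)
```

For example, an 8-GPU job running for 7 days covers 56 GPU-days, so `expected_failures(8, 7)` gives at most about 0.056 failures, and `prob_at_least_one_failure(8, 7)` comes out to roughly 5%.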
Automatic failover: io.net handles GPU failures and typically recovers within 60 seconds.
