If a GPU fails during your job, your workload is automatically migrated to a healthy GPU, typically within 30-60 seconds. Stateful workloads are restored from the latest checkpoint (if checkpointing is configured), and you receive a refund for the downtime. Automatic health monitoring detects failing GPUs and triggers the migration, so no manual action is needed in most cases.

Automatic Failover

  1. Detection: Health monitor identifies a failing GPU (overheating, memory errors, unresponsiveness)
  2. Checkpoint: Current state saved (if checkpoint interval configured)
  3. Migration: Workload moved to healthy GPU in same region
  4. Restore: Job resumes from checkpoint
  5. Refund: Downtime credited to your account

Total downtime: typically 30-60 seconds
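
Downtime is only part of the cost of a failure: any work done since the last saved checkpoint has to be repeated after the restore. The sketch below estimates that setback. It assumes the failure-time checkpoint in step 2 could not be taken (for example, because the GPU was unresponsive), so the job falls back to the last periodic checkpoint. The function names and default values are illustrative, not part of any io.net API.

```python
def worst_case_setback_seconds(checkpoint_interval_s, downtime_s=60):
    """Downtime plus the most work that can be lost since the last periodic checkpoint."""
    return downtime_s + checkpoint_interval_s


def average_setback_seconds(checkpoint_interval_s, downtime_s=45):
    """Downtime plus half an interval, assuming failures fall uniformly within an interval."""
    return downtime_s + checkpoint_interval_s / 2


if __name__ == "__main__":
    interval = 10 * 60  # a 10-minute checkpoint interval, in seconds
    print("worst case:", worst_case_setback_seconds(interval), "seconds")
    print("average:   ", average_setback_seconds(interval), "seconds")
```

Under these assumptions, a shorter checkpoint interval reduces the repeated work per failure at the cost of more frequent checkpoint writes.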

Enable Checkpointing

# Configure automatic checkpoints
io deploy --image pytorch/pytorch:latest \
  --gpu A100 \
  --checkpoint-interval 10m \
  --checkpoint-storage persistent-volume \
  --name resilient-job

# If GPU fails, job resumes from last checkpoint
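
This page does not say whether checkpoints are captured by the platform or written by the application. If your job manages its own state, a resume-on-start pattern like the one below may help it pick up where it left off after a migration. This is a minimal sketch using only the Python standard library: the mount path `/mnt/checkpoints`, the `CHECKPOINT_DIR` environment variable, and the `state.json` file name are all assumptions, not documented io.net settings, and the training step is a placeholder.

```python
import json
import os
import tempfile

# Assumption: the persistent volume is mounted at this path inside the container.
# CHECKPOINT_DIR is a hypothetical environment variable, not a documented setting.
CHECKPOINT_DIR = os.environ.get("CHECKPOINT_DIR", "/mnt/checkpoints")
CHECKPOINT_FILE = "state.json"


def save_checkpoint(state, directory=CHECKPOINT_DIR):
    """Write state atomically so a failure mid-write never leaves a corrupt file."""
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(tmp_path, os.path.join(directory, CHECKPOINT_FILE))
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(directory=CHECKPOINT_DIR):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    path = os.path.join(directory, CHECKPOINT_FILE)
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(total_steps, save_every, directory=CHECKPOINT_DIR):
    """Run a placeholder training loop that resumes from the last saved step."""
    state = load_checkpoint(directory)
    for step in range(state["step"], total_steps):
        # ... one real training step would go here ...
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(state, directory)
    return state
```

Writing to a temporary file and renaming it means a GPU failure during a save leaves the previous checkpoint intact. A real PyTorch job would store model and optimizer state rather than a step counter, but the load-then-resume structure stays the same.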

Monitoring

View failure events in the dashboard:
- GPU failure reason (temperature, memory, network)
- Downtime duration
- Refund amount

Failure Rate

  • GPU failure rate: <0.1% per day (fewer than 1 failure per 1,000 GPU-days)
  • Auto-recovery success: 98%
  • Manual intervention needed: <2%
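
To put the failure rate in terms of a specific job, the figures above can be turned into a rough estimate. The sketch below treats the 0.1% per day figure as an upper bound and assumes failures are independent across GPUs and days, which is a simplification. The function names are illustrative.

```python
def expected_failures(num_gpus, days, daily_rate=0.001):
    """Upper bound on expected GPU failures, using the <0.1% per day figure."""
    return num_gpus * days * daily_rate


def prob_at_least_one_failure(num_gpus, days, daily_rate=0.001):
    """Probability of one or more failures, assuming independent GPU-days."""
    gpu_days = num_gpus * days
    return 1.0 - (1.0 - daily_rate) ** gpu_days


if __name__ == "__main__":
    # Example: an 8-GPU job running for 7 days (56 GPU-days).
    print(f"expected failures: {expected_failures(8, 7):.3f}")
    print(f"P(at least one):   {prob_at_least_one_failure(8, 7):.3f}")
```

For this example the chance of seeing at least one failure comes out at roughly 5%, which suggests checkpointing is worth enabling for multi-day, multi-GPU jobs.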

Automatic failover: io.net handles GPU failures, typically recovering within 30-60 seconds.