io.net delivers time-to-first-token (TTFT) latency of under 50ms for LLM inference workloads, comparable to centralized clouds. For distributed training, intra-cluster GPU communication latency is under 5ms, and NVLink provides under 2μs GPU-to-GPU latency on H100 SXM configurations.
Latency Benchmarks
| Workload | Latency | Configuration |
|---|---|---|
| LLM Inference (7B-13B) | 40-60ms | Single GPU (A100) |
| LLM Inference (70B) | 80-120ms | 4x A100 |
| Image Generation (SDXL) | 4-6s | RTX 4090 |
| GPU-to-GPU (training) | <2μs | H100 SXM NVLink |
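
To check figures like these against your own deployment, time-to-first-token can be measured on the client side. The following is a minimal sketch, not an io.net API: `measure_ttft_ms` and `fake_stream` are hypothetical names, and `fake_stream` only simulates a streaming response with a sleep. With a real endpoint, the clock should start just before the request is sent.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft_ms(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_ms, total_ms, token_count) for a token stream."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            # Timestamp of the first token is what TTFT measures.
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    return (first_token_at - start) * 1000.0, (end - start) * 1000.0, count


def fake_stream(first_token_delay_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming inference response."""
    time.sleep(first_token_delay_s)  # simulates network + prefill delay
    for i in range(n_tokens):
        yield f"tok{i}"


ttft_ms, total_ms, n = measure_ttft_ms(fake_stream(0.05, 8))
print(f"TTFT {ttft_ms:.1f} ms, total {total_ms:.1f} ms, {n} tokens")
```

A single measurement is noisy; in practice you would likely repeat the call and compare the median or a high percentile against the table.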
Regional Latency
From user to GPU:
- Same region: 10-20ms
- Cross-region (US): 35-50ms
- Cross-continent: 80-150ms
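
As a rough illustration of how the network figures above combine with the inference figures from the benchmark table, the sketch below adds the two ranges. It assumes that the table's LLM inference latencies are time-to-first-token, that user-to-GPU latency is paid once per request, and that the two simply add; the dictionary keys and `estimate_ttft_ms` are hypothetical names.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # (low_ms, high_ms)

# User-to-GPU latency ranges from the list above.
NETWORK_MS: Dict[str, Range] = {
    "same_region": (10, 20),
    "cross_region_us": (35, 50),
    "cross_continent": (80, 150),
}

# LLM inference ranges from the benchmark table (assumed to be TTFT).
INFERENCE_MS: Dict[str, Range] = {
    "7b-13b": (40, 60),
    "70b": (80, 120),
}


def estimate_ttft_ms(placement: str, model: str) -> Range:
    """Add network and inference ranges to estimate user-perceived TTFT."""
    net_lo, net_hi = NETWORK_MS[placement]
    inf_lo, inf_hi = INFERENCE_MS[model]
    return net_lo + inf_lo, net_hi + inf_hi


print(estimate_ttft_ms("same_region", "7b-13b"))   # (50, 80)
print(estimate_ttft_ms("cross_continent", "70b"))  # (160, 270)
```

Under these assumptions, a cross-continent hop can add more latency than the inference step itself for smaller models.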
Optimization: Deploy in the region nearest to your users for the lowest latency.
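
One way to decide which region is "nearest" is to probe each candidate from where your users are and pick the lowest median latency. This is a sketch under assumptions: the region names and sample values are made up for illustration, and `pick_nearest_region` is a hypothetical helper, not part of any io.net SDK.

```python
from statistics import median
from typing import Dict, List


def pick_nearest_region(samples_ms: Dict[str, List[float]]) -> str:
    """Return the region with the lowest median measured latency."""
    # Regions without any samples are ignored.
    medians = {
        region: median(values)
        for region, values in samples_ms.items()
        if values
    }
    if not medians:
        raise ValueError("no latency samples provided")
    return min(medians, key=medians.get)


# Illustrative round-trip samples in ms (made-up values, hypothetical regions).
probe_results = {
    "us-east": [14.0, 18.0, 12.0],
    "us-west": [41.0, 38.0, 47.0],
    "eu-west": [95.0, 110.0, 88.0],
}

print(pick_nearest_region(probe_results))  # us-east
```

The median is used rather than the mean so that a single slow probe does not change the choice.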
Low-latency GPU inference on io.net — <50ms TTFT, 50+ regions worldwide.
