io.net delivers a time-to-first-token (TTFT) under 50 ms for LLM inference workloads, comparable to centralized clouds. For distributed training, intra-cluster GPU communication latency is under 5 ms, and NVLink provides under 2 μs GPU-to-GPU latency on H100 SXM configurations.

Latency Benchmarks

| Workload | Latency | Configuration |
|---|---|---|
| LLM Inference (7B-13B) | 40-60 ms | Single GPU (A100) |
| LLM Inference (70B) | 80-120 ms | 4x A100 |
| Image Generation (SDXL) | 4-6 sec | RTX 4090 |
| GPU-to-GPU (training) | <2 μs | H100 SXM NVLink |
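
To check the LLM inference figures above against your own deployment, you can time how long a streaming response takes to yield its first token. The following is a minimal sketch, not an io.net API: `measure_ttft` and `fake_stream` are hypothetical helpers, and the sketch assumes your client exposes the response as a Python iterable of tokens or chunks.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Consume a token stream and return (ttft_ms, total_ms, n_tokens).

    ttft_ms is None when the stream yields no tokens at all.
    The timer starts immediately before the first token is requested.
    """
    start = clock()
    ttft_ms: Optional[float] = None
    n_tokens = 0
    for _ in stream:
        if ttft_ms is None:
            # First token arrived: record time-to-first-token once.
            ttft_ms = (clock() - start) * 1000.0
        n_tokens += 1
    total_ms = (clock() - start) * 1000.0
    return ttft_ms, total_ms, n_tokens


def fake_stream(first_token_delay_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming inference response."""
    time.sleep(first_token_delay_s)  # simulated queueing + prefill time
    for i in range(n_tokens):
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, n = measure_ttft(fake_stream(0.02, 5))
    print(f"TTFT: {ttft:.1f} ms, total: {total:.1f} ms, tokens: {n}")
```

In practice you would replace `fake_stream` with the iterator returned by your inference client. Measured from the client side, the result includes network latency in addition to model latency.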

Regional Latency

From user to GPU:
- Same region: 10-20ms
- Cross-region (US): 35-50ms
- Cross-continent: 80-150ms

Optimization: Deploy in the region nearest to your users for the lowest latency.
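
One way to apply this is to collect round-trip-time samples from your users to each candidate region and pick the region with the lowest median. The sketch below is illustrative only: `pick_nearest_region` is a hypothetical helper, and the region names and sample values are made up to fall within the ranges listed above, not measured results.

```python
from statistics import median
from typing import Dict, List


def pick_nearest_region(rtt_samples_ms: Dict[str, List[float]]) -> str:
    """Return the region with the lowest median round-trip time (ms).

    Regions with no samples are ignored. The median is used rather than
    the mean so that a single slow probe does not skew the choice.
    """
    medians = {
        region: median(samples)
        for region, samples in rtt_samples_ms.items()
        if samples
    }
    if not medians:
        raise ValueError("no RTT samples provided")
    return min(medians, key=lambda region: medians[region])


# Hypothetical region names and illustrative sample values (ms).
samples = {
    "us-east": [12.0, 15.0, 18.0],     # same region as the users
    "us-west": [38.0, 42.0, 47.0],     # cross-region (US)
    "eu-west": [95.0, 110.0, 130.0],   # cross-continent
}

best = pick_nearest_region(samples)
print(best)
```

How you gather the samples (for example, timed HTTPS requests from client devices) depends on your setup and is outside the scope of this sketch.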


Low-latency GPU inference on io.net — <50ms TTFT, 50+ regions worldwide.