Cold starts are the silent killer of GPU inference performance. Your model might generate 100 tokens/sec once it's running, but if it takes 45 seconds to load into GPU memory when a request arrives after idle time, your users are staring at a spinner. On serverless GPU platforms, cold starts can add 30-120 seconds of latency. On io.net the situation is better because persistent instances stay warm, but you still need to optimize model loading if you're auto-scaling or restarting containers.
Here's how to get cold start times down to the 10-20 second range for a typical 7B model.
Where the Time Goes
A typical cold start for a 7B model breaks down like:
| Step | Time (unoptimized) | Time (optimized) |
|---|---|---|
| Container startup | 5-15s | 2-3s |
| CUDA initialization | 3-5s | 3-5s (fixed) |
| Model download (if not cached) | 30-120s | 0s (pre-cached) |
| Model loading to GPU | 15-30s | 3-8s |
| Inference engine warmup | 5-10s | 1-3s |
| Total | 58-180s | 9-19s |
CUDA initialization is effectively a fixed cost that you can't speed up. Everything else is in your control.
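Before optimizing, it helps to measure where your own startup time goes, since the numbers in the table are typical rather than universal. Here is a minimal sketch of a per-phase timer. The `phase` and `report` helpers are hypothetical names, and the sleeps stand in for your real startup steps.

```python
import time
from contextlib import contextmanager


@contextmanager
def phase(name, sink):
    """Record the wall-clock duration of one startup phase into `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[name] = time.perf_counter() - start


def report(timings):
    """Return formatted lines, slowest phase first, followed by a total."""
    rows = sorted(timings.items(), key=lambda kv: kv[1], reverse=True)
    lines = [f"{name:<24}{secs:8.2f}s" for name, secs in rows]
    lines.append(f"{'total':<24}{sum(timings.values()):8.2f}s")
    return lines


if __name__ == "__main__":
    timings = {}
    # Stand-ins for the real steps; replace the sleeps with your own calls.
    with phase("model_load", timings):
        time.sleep(0.02)
    with phase("engine_warmup", timings):
        time.sleep(0.01)
    print("\n".join(report(timings)))
```

Wrapping each startup step in `phase(...)` and logging the report on every boot makes regressions visible.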
Optimization 1: Pre-Cache the Model Weights
The biggest win. Don't download model weights at startup.
On io.net, use persistent volumes to keep model files across restarts. Mount a volume at /models, download once, and every subsequent container start reads from local NVMe instead of pulling from HuggingFace or S3.
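# First run: download and cache (the repo is gated, so run `huggingface-cli login` first)
huggingface-cli download meta-llama/Meta-Llama-3-8B --local-dir /models/llama3-8b
# Every subsequent run: the files are already on the volume, so this step takes ~0 seconds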
If you're building a Docker image, bake the model into the image (for models < 10GB) or use a multi-stage build that copies the weights in from a stage or image that already holds them. For larger models (a 70B model in FP16 is about 140GB), a volume mount is the only practical option.
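The same download-once idea can live in your startup code. The sketch below assumes a completion marker file, which is an illustrative choice rather than an io.net or HuggingFace convention, and takes the downloader as a parameter so it runs without network access. In practice the downloader could wrap `huggingface_hub.snapshot_download(repo_id=..., local_dir=...)`, assuming that package is installed.

```python
import tempfile
from pathlib import Path


def ensure_cached(local_dir, download):
    """Run `download(target_dir)` only if the completion marker is missing.

    Returns True if a download happened, False on a cache hit.
    The marker name is hypothetical; any sentinel file works.
    """
    target = Path(local_dir)
    marker = target / ".download_complete"
    if marker.exists():
        return False  # cache hit: nothing to fetch
    target.mkdir(parents=True, exist_ok=True)
    download(target)
    marker.touch()  # written last, so an interrupted download is retried
    return True


if __name__ == "__main__":
    def fake_download(target):
        # Stand-in for a real fetch from HuggingFace or S3.
        (target / "model.safetensors").write_bytes(b"weights")

    with tempfile.TemporaryDirectory() as tmp:
        model_dir = Path(tmp) / "llama3-8b"
        print(ensure_cached(model_dir, fake_download))
        print(ensure_cached(model_dir, fake_download))
```

Writing the marker only after the download returns means a container killed mid-download doesn't leave a half-populated cache that later starts trust.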
Optimization 2: Use Safetensors Format
The old PyTorch .bin format requires unpickling, which is slow and CPU-bound. Safetensors files are memory-mapped, so tensors can be read straight from the file and moved to the GPU without a deserialization pass:
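import torch
from transformers import AutoModelForCausalLM

# Slow: force the pickled PyTorch .bin weights (only works if the repo ships .bin files)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    use_safetensors=False,
)
# Loading time: 22 seconds

# Fast: Safetensors (most HuggingFace models support this now)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    use_safetensors=True,
    torch_dtype=torch.float16,
    device_map="cuda",  # requires the accelerate package
)
# Loading time: 6 seconds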
The safetensors format is 3-4x faster to load. If your model only has .bin files, convert it once:
from safetensors.torch import save_model

# `model` is a model already loaded from the .bin weights
save_model(model, "model.safetensors")
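To see why safetensors avoids the unpickling cost, it helps to look at the file layout: an 8-byte little-endian header length, a JSON header listing each tensor's dtype, shape, and byte offsets, then the raw tensor bytes. The sketch below reads that layout with only the standard library. It illustrates the format and is not a replacement for the `safetensors` package; `write_demo_file` exists only to produce a tiny test file.

```python
import json
import mmap
import os
import struct
import tempfile


def read_header(path):
    """Return (header dict, byte offset where tensor data begins)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header, 8 + header_len


def tensor_bytes(path, name):
    """Memory-map the file and return the raw bytes of one tensor."""
    header, data_start = read_header(path)
    begin, end = header[name]["data_offsets"]
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing copies only this tensor's bytes out of the mapping.
            return mm[data_start + begin:data_start + end]


def write_demo_file(path):
    """Write a tiny one-tensor file by hand, for demonstration only."""
    data = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
    header = {"w": {"dtype": "F32", "shape": [2, 2],
                    "data_offsets": [0, len(data)]}}
    blob = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)) + blob + data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.safetensors")
        write_demo_file(path)
        header, _ = read_header(path)
        print(header["w"]["shape"], struct.unpack("<4f", tensor_bytes(path, "w")))
```

Because every tensor's location is known from the header, a loader can map the file and copy byte ranges directly, with no Python object reconstruction in between.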
Optimization 3: Warm Up the Inference Engine
vLLM, TensorRT-LLM, and other inference engines do one-time work on the first inference, such as compiling or selecting CUDA kernels and allocating buffers. This adds 3-10 seconds to the first request. Run a dummy inference during startup to absorb this cost:
import torch

# After model load, before serving traffic
model.generate(
    torch.tensor([[1, 2, 3]]).cuda(),  # Dummy input (arbitrary token IDs)
    max_new_tokens=1,
)
# First-call CUDA setup is now done for this process
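For TensorRT-LLM, engine compilation costs even more (30-60 seconds), but it produces a serialized engine file. Cache this file on your persistent volume so that subsequent loads skip compilation entirely. Engine files are tied to the GPU type and TensorRT version they were built with, so keep one per hardware configuration.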
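Warm-up only helps if traffic waits for it. One way to enforce that is a readiness gate that your health endpoint consults, so a load balancer doesn't route requests to an instance that is still warming up. This is a minimal sketch: `ReadinessGate` is a hypothetical helper, and the lambda stands in for a real dummy inference.

```python
import threading


class ReadinessGate:
    """Reports not-ready until a warm-up callable has finished."""

    def __init__(self):
        self._ready = threading.Event()

    def warm_up(self, run_dummy_inference):
        # If the dummy inference raises, the gate stays closed.
        run_dummy_inference()
        self._ready.set()

    def health_status(self):
        # 503 tells the load balancer to keep traffic away for now.
        return 200 if self._ready.is_set() else 503


if __name__ == "__main__":
    gate = ReadinessGate()
    print(gate.health_status())
    gate.warm_up(lambda: None)  # stand-in for the real dummy inference
    print(gate.health_status())
```

Returning the status from your `/health` handler means the first real request never pays the first-call cost.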
Optimization 4: Keep Instances Warm
The cheapest cold start is one that never happens. On io.net, you're billed per second, so keeping a small pool of warm instances running is often cheap compared to the latency penalty of cold starts:
- 1 warm RTX 4090 running 24/7: $129.60/month (equivalent to $0.18/hour over a 30-day month)
- Serves instant responses while the auto-scaler provisions additional capacity during traffic spikes
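The monthly figure is simple arithmetic, and it's worth rerunning with your own rate and pool size. In the sketch below, the $0.18/hour rate is not a quoted price. It's back-computed from the $129.60 figure over a 30-day month, so treat it as an assumption.

```python
# Assumption: hourly rate implied by $129.60 over a 30-day month.
HOURLY_RATE_USD = 0.18


def monthly_cost(hourly_rate, instances=1, days=30):
    """Cost of keeping `instances` running 24/7 for `days` days."""
    return round(hourly_rate * 24 * days * instances, 2)


if __name__ == "__main__":
    print(monthly_cost(HOURLY_RATE_USD))               # one warm instance
    print(monthly_cost(HOURLY_RATE_USD, instances=2))  # two warm instances
```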
For inference APIs with variable traffic, a common pattern is:
- Minimum 1-2 warm instances (always-on, absorbs baseline traffic)
- Auto-scale additional instances (2-3 minute cold start, acceptable for overflow)
- Scale down to minimum after 10 minutes of low traffic
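The pattern above can be sketched as a single decision function: scale up as soon as demand exceeds capacity, and scale down only after traffic has been low for the cooldown period. The per-instance capacity and instance ceiling are made-up values for illustration, and this isn't tied to any particular autoscaler API.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalePolicy:
    min_warm: int = 1                # always-on instances
    max_instances: int = 8           # hypothetical ceiling
    per_instance_rps: float = 10.0   # hypothetical capacity per instance
    cooldown_s: float = 600.0        # 10 minutes of low traffic


def desired_instances(policy, current, rps, low_traffic_for_s):
    """Return the target instance count for the observed request rate."""
    needed = math.ceil(rps / policy.per_instance_rps)
    needed = max(policy.min_warm, min(needed, policy.max_instances))
    if needed > current:
        return needed  # scale up immediately; new instances start cold
    if needed < current and low_traffic_for_s >= policy.cooldown_s:
        return needed  # scale down only after the cooldown
    return current


if __name__ == "__main__":
    policy = ScalePolicy()
    print(desired_instances(policy, current=1, rps=35.0, low_traffic_for_s=0))
    print(desired_instances(policy, current=4, rps=5.0, low_traffic_for_s=120))
    print(desired_instances(policy, current=4, rps=5.0, low_traffic_for_s=600))
```

The asymmetry is deliberate: scaling up late costs latency, while scaling down late only costs a few extra minutes of billing.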
Optimization 5: Use Smaller Models When Possible
Model loading time is roughly proportional to model size. A 7B model loads in 6 seconds; a 70B model takes 45-60 seconds. If your cold start budget is tight, consider:
- Quantized models (4-bit weights are roughly 25% of the FP16 file size, so roughly 25% of the loading time)
- Distilled models (smaller but trained to match larger model quality)
- Routing: use a fast small model for simple requests, upgrade to larger models only when complexity warrants it
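To make the size-to-load-time relationship concrete, here is a rough estimator. It assumes load time scales linearly with weight file size and calibrates the effective throughput from the 7B figure above (about 14GB in FP16 in 6 seconds). It ignores quantization metadata and any fixed overhead, so treat the results as ballpark numbers.

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate weight file size in GB, ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8


def estimated_load_s(size_gb, effective_gb_per_s):
    """Load time under the assumption that it scales linearly with size."""
    return size_gb / effective_gb_per_s


# Calibration (assumption): a 7B FP16 model, about 14 GB, loading in 6 s.
EFFECTIVE_GB_PER_S = weight_size_gb(7, 16) / 6.0

if __name__ == "__main__":
    for label, params, bits in [("7B FP16", 7, 16),
                                ("7B 4-bit", 7, 4),
                                ("70B FP16", 70, 16),
                                ("70B 4-bit", 70, 4)]:
        size = weight_size_gb(params, bits)
        secs = estimated_load_s(size, EFFECTIVE_GB_PER_S)
        print(f"{label:<10}{size:7.1f} GB{secs:7.1f} s")
```

Under these assumptions a 70B FP16 model lands at the top of the 45-60 second range quoted above, and 4-bit quantization cuts each estimate to a quarter.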
