FAQ: How Do I Reduce GPU Cold Start Times for Inference?

Cold starts are the silent killer of GPU inference performance. Your model might generate tokens at 100/sec once it's running, but if it takes 45 seconds to load into GPU memory when a new request arrives after idle time, your users are staring at a spinner. On serverless GPU platforms, cold starts can add 30-120 seconds of latency. On io.net, the situation is better — persistent instances stay warm — but you still need to optimize model loading if you're auto-scaling or restarting containers.

Here's how to get cold start times down from minutes to roughly 10-20 seconds for a typical 7B-class model.

Where the Time Goes

A typical cold start for a 7B model breaks down roughly as follows:

Step	Time (unoptimized)	Time (optimized)
Container startup	5-15s	2-3s
CUDA initialization	3-5s	3-5s (fixed)
Model download (if not cached)	30-120s	0s (pre-cached)
Model loading to GPU	15-30s	3-8s
Inference engine warmup	5-10s	1-3s
Total	58-180s	9-19s

CUDA initialization is fixed — you can't speed it up. Everything else is in your control.
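
As a quick check on the table, the totals are simply the sums of the per-step ranges. A minimal sketch that reproduces them (the dictionary layout and function name are illustrative; the numbers are copied from the table):

```python
# Per-step cold start ranges in seconds, copied from the table above:
# step -> ((unoptimized_min, unoptimized_max), (optimized_min, optimized_max))
STEPS = {
    "container_startup": ((5, 15), (2, 3)),
    "cuda_init": ((3, 5), (3, 5)),
    "model_download": ((30, 120), (0, 0)),
    "model_load_to_gpu": ((15, 30), (3, 8)),
    "engine_warmup": ((5, 10), (1, 3)),
}

def total_range(column):
    """Sum the (min, max) ranges for column 0 (unoptimized) or 1 (optimized)."""
    low = sum(ranges[column][0] for ranges in STEPS.values())
    high = sum(ranges[column][1] for ranges in STEPS.values())
    return low, high

print("unoptimized:", total_range(0))
print("optimized:  ", total_range(1))
```

Once the download is eliminated, the fixed CUDA initialization cost becomes a noticeable share of what remains, which is why the other steps are worth squeezing.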

Optimization 1: Pre-Cache the Model Weights

The biggest win. Don't download model weights at startup.

On io.net, use persistent volumes to keep model files across restarts. Mount a volume at /models, download once, and every subsequent container start reads from local NVMe instead of pulling from HuggingFace or S3.

# First run: download and cache
huggingface-cli download meta-llama/Meta-Llama-3-8B --local-dir /models/llama3-8b

# Every subsequent run: already there, ~0 seconds

If you're building a Docker image, bake the model into the image (for models < 10GB), for example with a multi-stage build that fetches the weights in an early stage and copies them into the final image. For larger models (a 70B model in FP16 is about 140GB), a volume mount is the only practical option.
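
The "download once, reuse on every start" idea can be sketched as a small startup check. This is a sketch under assumptions, not io.net tooling: the marker file name and the `ensure_cached` and `hf_download` helpers are hypothetical, and the real downloader assumes the `huggingface_hub` package is installed.

```python
import os
import tempfile  # only used by the demo at the bottom

MARKER = ".download_complete"  # hypothetical sentinel file name

def ensure_cached(model_dir, download):
    """Call download(model_dir) only if no earlier run finished successfully.

    Returns True if a download happened, False if the cache was already warm.
    """
    marker_path = os.path.join(model_dir, MARKER)
    if os.path.exists(marker_path):
        return False  # warm cache: nothing to fetch
    os.makedirs(model_dir, exist_ok=True)
    download(model_dir)
    # Write the marker last so an interrupted download is retried on next start.
    with open(marker_path, "w") as f:
        f.write("ok\n")
    return True

def hf_download(model_dir, repo_id="meta-llama/Meta-Llama-3-8B"):
    """Real downloader; needs the huggingface_hub package and network access."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose
    snapshot_download(repo_id=repo_id, local_dir=model_dir)

# Demo with a stand-in downloader so the sketch runs without network access.
def fake_download(path):
    with open(os.path.join(path, "weights.bin"), "wb") as f:
        f.write(b"\0" * 16)

with tempfile.TemporaryDirectory() as volume:
    target = os.path.join(volume, "llama3-8b")
    print(ensure_cached(target, fake_download))  # first start: downloads
    print(ensure_cached(target, fake_download))  # later starts: cache hit
```

In a real container entrypoint you would call `ensure_cached("/models/llama3-8b", hf_download)` before starting the server. Writing the marker last is a deliberate choice: a half-finished download is never mistaken for a warm cache.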

Optimization 2: Use Safetensors Format

The old PyTorch .bin format requires unpickling, which is slow and CPU-bound. Safetensors files are memory-mapped, so tensors can be read straight from disk with no unpickling step:

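import torch
from transformers import AutoModelForCausalLM

# Same dtype and device for both loads, so only the file format differs
common = dict(torch_dtype=torch.float16, device_map="cuda")  # device_map needs accelerate

# Slow: force the pickled PyTorch .bin weights (only works if the repo ships them)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", use_safetensors=False, **common
)
# Loading time: 22 seconds

# Fast: Safetensors (most HuggingFace models support this now)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", use_safetensors=True, **common
)
# Loading time: 6 seconds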

The safetensors format is 3-4x faster to load. If your model only has .bin files, convert it once:

from safetensors.torch import save_model

# `model` is the already-loaded PyTorch model; save_model also handles shared (tied) weights
save_model(model, "model.safetensors")
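
To see why safetensors loads quickly, it helps to look at the file layout: an 8-byte little-endian header length, a JSON header listing each tensor's dtype, shape, and byte offsets, then the raw tensor bytes. The sketch below reads only that header using the standard library. The function names are illustrative, and the tiny file it writes is a hand-made stand-in, not a real checkpoint.

```python
import json
import os
import struct
import tempfile

def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header

def tensor_bytes(header):
    """Total size in bytes of the raw tensor data described by the header."""
    return sum(e["data_offsets"][1] - e["data_offsets"][0] for e in header.values())

# Build a tiny stand-in file: one float16 tensor of shape [2, 2] (8 bytes of data).
demo_header = {"w": {"dtype": "F16", "shape": [2, 2], "data_offsets": [0, 8]}}
encoded = json.dumps(demo_header).encode("utf-8")
demo_path = os.path.join(tempfile.mkdtemp(), "demo.safetensors")
with open(demo_path, "wb") as f:
    f.write(struct.pack("<Q", len(encoded)) + encoded + b"\x00" * 8)

header = read_safetensors_header(demo_path)
print(header["w"]["shape"], tensor_bytes(header))
```

Because the header states exactly where each tensor's bytes live, a loader can map or copy them into place directly, whereas a .bin checkpoint has to be unpickled first.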

Optimization 3: Warm Up the Inference Engine

Inference stacks such as vLLM and TensorRT-LLM do one-time work on the first inference, such as compiling or selecting CUDA kernels and allocating GPU memory. This adds 3-10 seconds to the first request. Run a dummy inference during startup to absorb this cost:

# After model load, before serving traffic
model.generate(
    torch.tensor([[1, 2, 3]]).cuda(),  # Dummy input
    max_new_tokens=1
)
# CUDA kernels now compiled and cached

For TensorRT-LLM, the engine compilation is even more significant (30-60 seconds) but produces a serialized engine file. Cache this file on your persistent volume — subsequent loads skip compilation entirely.
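
The warm-up only helps if traffic is held back until it has finished. One way to arrange that, sketched here with a hypothetical `WarmupGate` helper (not part of any inference engine), is to report "not ready" to the load balancer until the dummy inference has run:

```python
import time

class WarmupGate:
    """Tracks whether the one-time warm-up has finished (hypothetical helper)."""

    def __init__(self, warmup):
        self._warmup = warmup       # callable that runs one dummy inference
        self.ready = False
        self.warmup_seconds = None

    def start(self):
        """Run the warm-up once, record how long it took, then mark ready."""
        if self.ready:
            return
        began = time.perf_counter()
        self._warmup()
        self.warmup_seconds = time.perf_counter() - began
        self.ready = True

    def health(self):
        """HTTP-style status for a readiness probe: 503 until warm, then 200."""
        return 200 if self.ready else 503

# Stand-in for model.generate(...) so the sketch runs without a GPU.
gate = WarmupGate(warmup=lambda: time.sleep(0.01))
print(gate.health())   # status before warm-up
gate.start()
print(gate.health(), f"warm-up took {gate.warmup_seconds:.3f}s")
```

In a real service, `warmup` would wrap the `model.generate(...)` call shown earlier, and `health()` would back the readiness endpoint your orchestrator polls. Logging `warmup_seconds` also gives you a per-deploy measurement of this step.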

Optimization 4: Keep Instances Warm

The cheapest cold start is one that never happens. On io.net, you're billed per second, so keeping a minimum pool of warm instances running costs very little compared to the latency penalty of cold starts:

1 warm RTX 4090 running 24/7: $129.60/month
Serves instant responses while auto-scaler provisions additional capacity during traffic spikes
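
The monthly figure above implies an hourly rate of about $0.18 on a 30-day month (129.60 / 720 hours). That rate is back-calculated from the number quoted here, not a published price, so treat it as an assumption in this small sketch:

```python
HOURS_PER_MONTH = 24 * 30       # 30-day month, as the $129.60 figure implies
ASSUMED_HOURLY_RATE = 0.18      # USD/hour, derived from 129.60 / 720 (assumption)

def monthly_cost(instances, hourly_rate=ASSUMED_HOURLY_RATE):
    """Cost in USD of keeping `instances` GPUs warm around the clock."""
    return round(instances * hourly_rate * HOURS_PER_MONTH, 2)

def cost_for_seconds(seconds, hourly_rate=ASSUMED_HOURLY_RATE):
    """Per-second billing: cost in USD of running one instance for `seconds`."""
    return seconds * hourly_rate / 3600

print(monthly_cost(1))   # one warm instance
print(monthly_cost(2))   # two warm instances
```

Plugging in your own rate and minimum pool size makes it easy to compare the always-on cost against the latency you would otherwise pay on every cold start.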

For inference APIs with variable traffic, a common pattern is:
- Minimum 1-2 warm instances (always-on, absorbs baseline traffic)
- Auto-scale additional instances (2-3 minute cold start, acceptable for overflow)
- Scale down to minimum after 10 minutes of low traffic
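
The pattern above can be expressed as a small decision function. This is a sketch only: the function name, the capacity figure, and the instance cap are illustrative assumptions, while the 600-second threshold mirrors the 10-minute rule in the list.

```python
def desired_instances(current, queued_requests, seconds_of_low_traffic,
                      min_warm=1, max_instances=8, per_instance_capacity=4,
                      scale_down_after=600):
    """Return how many instances the pool should have right now.

    All defaults are illustrative except scale_down_after=600, which
    mirrors the 10-minute rule described above.
    """
    # Instances needed for the current load, rounded up, never below the warm floor.
    needed = max(min_warm, -(-queued_requests // per_instance_capacity))
    needed = min(needed, max_instances)
    if needed > current:
        return needed       # scale up immediately; new instances pay the cold start
    if seconds_of_low_traffic >= scale_down_after:
        return needed       # quiet for 10 minutes: shrink toward the minimum
    return current          # otherwise hold steady to avoid flapping

print(desired_instances(current=1, queued_requests=20, seconds_of_low_traffic=0))
print(desired_instances(current=5, queued_requests=0, seconds_of_low_traffic=300))
print(desired_instances(current=5, queued_requests=0, seconds_of_low_traffic=600))
```

Scaling up eagerly and down lazily is the point of the design: overflow instances are slow to arrive, so it is cheaper to keep them a few extra minutes than to provision them again.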

Optimization 5: Use Smaller Models When Possible

Model loading time is roughly proportional to model size. A 7B model loads in 6 seconds; a 70B model takes 45-60 seconds. If your cold start budget is tight, consider:

- Quantized models (4-bit weights are roughly 25% of the FP16 file size, so roughly 25% of the loading time)
- Distilled models (smaller but trained to match larger model quality)
- Routing: use a fast small model for simple requests, upgrade to larger models only when complexity warrants it
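
To budget for a specific model, the "proportional to size" rule can be turned into a rough estimator. The sketch below scales linearly from the 7B figure quoted above, assuming FP16 weights for that reference. Real load times depend on disk and PCIe throughput, so treat the output as a first guess rather than a measurement.

```python
REFERENCE_PARAMS_B = 7      # billions of parameters (from the text)
REFERENCE_BITS = 16         # FP16 weights (assumed for the reference model)
REFERENCE_LOAD_S = 6.0      # seconds to load the 7B model (from the text)

def weights_gb(params_b, bits):
    """Approximate weight size in GB: parameters x bytes per parameter."""
    return params_b * bits / 8

def estimated_load_seconds(params_b, bits=16):
    """Scale the 7B/FP16 reference time linearly with weight size."""
    reference_gb = weights_gb(REFERENCE_PARAMS_B, REFERENCE_BITS)
    return REFERENCE_LOAD_S * weights_gb(params_b, bits) / reference_gb

for name, params_b, bits in [("7B FP16", 7, 16), ("70B FP16", 70, 16), ("70B 4-bit", 70, 4)]:
    size = weights_gb(params_b, bits)
    print(f"{name}: {size:.0f} GB, ~{estimated_load_seconds(params_b, bits):.0f}s")
```

The linear estimate for a 70B FP16 model lands at the upper end of the 45-60 second range quoted above, and the 4-bit variant comes out at a quarter of that.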

Minimize latency on io.net: persistent volumes, NVMe storage, sub-2-minute provisioning.