Time-to-First-Token (TTFT) is the most important latency metric for user-facing LLM applications. It measures the delay between submitting a prompt and receiving the first output token. Below 100ms, responses feel instantaneous. Above 500ms, they feel sluggish. Above 1 second, users question whether the application is working.
For agentic AI workflows with 10 sequential LLM calls, TTFT compounds: 300ms per call means 3 seconds of pure waiting. Reducing that to 100ms cuts wait time to 1 second, a transformative improvement.
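The compounding effect is plain arithmetic. The sketch below uses a hypothetical helper, `total_ttft_wait_ms`, and assumes every call in the chain has the same TTFT and that the calls run strictly one after another.

```python
def total_ttft_wait_ms(num_calls: int, ttft_ms: float) -> float:
    """Total time spent waiting on first tokens across sequential calls."""
    return num_calls * ttft_ms


# Figures from the text: 10 sequential calls at 300ms versus 100ms each.
slow = total_ttft_wait_ms(10, 300)
fast = total_ttft_wait_ms(10, 100)
print(f"{slow / 1000:.0f}s vs {fast / 1000:.0f}s, saving {(slow - fast) / 1000:.0f}s")
```

Real agent chains rarely have uniform TTFT per step, so in practice you would sum measured per-step values instead.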
This guide covers the complete TTFT optimization stack. With io.net's H100 GPUs at $2.49/hr and the techniques here, sub-100ms TTFT is achievable for most production models.
What Determines TTFT
TTFT = Network_latency + Queue_wait + Prefill_time + First_decode_step
| Phase | Duration | Primary Driver |
|---|---|---|
| Network | 5-200ms | Geographic distance |
| Queue wait | 0-500ms | Server load |
| Prefill | 10-500ms | Prompt length, GPU compute |
| First decode | 5-20ms | Model size, bandwidth |
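Prefill dominates for long prompts. It is compute-bound and grows roughly linearly with token count at typical prompt lengths; attention cost grows faster than linearly as prompts get very long.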
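To make the breakdown concrete, here is a minimal sketch that sums the four phases and reports which one dominates. The class name `TTFTBudget` and the sample values are illustrative assumptions picked from inside the ranges in the table, not measurements.

```python
from dataclasses import dataclass


@dataclass
class TTFTBudget:
    """Per-phase latency in milliseconds (illustrative values only)."""
    network_ms: float
    queue_wait_ms: float
    prefill_ms: float
    first_decode_ms: float

    def phases(self) -> dict:
        return {
            "network": self.network_ms,
            "queue_wait": self.queue_wait_ms,
            "prefill": self.prefill_ms,
            "first_decode": self.first_decode_ms,
        }

    def total_ms(self) -> float:
        # TTFT is the sum of the four phases.
        return sum(self.phases().values())

    def dominant_phase(self) -> str:
        phases = self.phases()
        return max(phases, key=phases.get)


# Hypothetical request with a long prompt on a lightly loaded server.
budget = TTFTBudget(network_ms=20, queue_wait_ms=10, prefill_ms=125, first_decode_ms=10)
print(budget.total_ms(), budget.dominant_phase())
```

Filling this in with your own measurements shows which phase to attack first: a network-dominated budget calls for a closer region, a prefill-dominated one for the techniques below.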
TTFT Benchmarks
| GPU | Model | Prompt Length | TTFT |
|---|---|---|---|
| H100 SXM | Llama 3.1 8B | 512 tokens | 12ms |
| H100 SXM | Llama 3.1 8B | 4K tokens | 45ms |
| H100 SXM | Llama 3.1 70B (TP=2) | 512 tokens | 38ms |
| H100 SXM | Llama 3.1 70B (TP=2) | 4K tokens | 125ms |
| A100 80GB | Llama 3.1 70B (TP=2) | 512 tokens | 75ms |
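One way to read the table is to fit a straight line through the two prompt lengths for each H100 model, which separates a fixed cost from a per-token cost. This is a rough sketch: the two-point fit, the helper name `fit_linear`, and reading "4K" as 4096 tokens are all assumptions, and real prefill cost is not exactly linear.

```python
def fit_linear(p1, p2):
    """Fit ttft = fixed_ms + per_token_ms * tokens through two (tokens, ms) points."""
    (t1, ms1), (t2, ms2) = p1, p2
    per_token_ms = (ms2 - ms1) / (t2 - t1)
    fixed_ms = ms1 - per_token_ms * t1
    return fixed_ms, per_token_ms


# H100 SXM rows from the table; "4K" is taken as 4096 tokens (an assumption).
ROWS = [
    ("Llama 3.1 8B", (512, 12), (4096, 45)),
    ("Llama 3.1 70B (TP=2)", (512, 38), (4096, 125)),
]

for name, short_prompt, long_prompt in ROWS:
    fixed_ms, per_token_ms = fit_linear(short_prompt, long_prompt)
    print(f"{name}: ~{fixed_ms:.1f}ms fixed + ~{per_token_ms * 1000:.1f}ms per 1K tokens")
```

The non-zero fixed term is why an 8x longer prompt raises TTFT by less than 8x in these rows.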
Optimization Techniques
1. Shorter Prompts
Every prompt token adds to prefill time:
- Compress system prompts to essentials
- Move static context to fine-tuning
- Use prefix caching for repeated prefixes
2. Prefix Caching
Cache KV states for shared system prompts:
```bash
# vLLM with prefix caching
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --enable-prefix-caching \
  --tensor-parallel-size 2
```
TTFT reduction: 40-70% for long system prompts on subsequent requests.
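The idea behind prefix caching can be shown with a toy model: split the prompt into fixed-size blocks, key each block by a hash of everything up to and including it, and skip prefill for leading blocks already seen. This is only an illustration under those assumptions; `ToyPrefixCache`, `block_hashes`, and the block size of 16 are hypothetical and do not describe vLLM's internals.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cached block (illustrative)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block together with all tokens before it."""
    hashes, running = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % block_size
    for i in range(0, full, block_size):
        running.update(repr(tokens[i:i + block_size]).encode())
        hashes.append(running.copy().hexdigest())
    return hashes


class ToyPrefixCache:
    def __init__(self):
        self.seen = set()

    def cached_tokens(self, tokens):
        """Return how many leading tokens hit the cache, then record this prompt."""
        hashes = block_hashes(tokens)
        hits = 0
        for h in hashes:
            if h not in self.seen:
                break  # a prefix match must be contiguous from the start
            hits += 1
        self.seen.update(hashes)
        return hits * BLOCK_SIZE


system_prompt = list(range(64))  # stand-in for a 64-token system prompt
cache = ToyPrefixCache()
print(cache.cached_tokens(system_prompt + [1000, 1001, 1002]))  # first request
print(cache.cached_tokens(system_prompt + [2000, 2001]))        # second request
```

Because each hash covers the whole prefix, a change early in the prompt invalidates every later block. That is why stable content such as the system prompt belongs at the front and per-request content at the end.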
3. Chunked Prefill
Process prefill in chunks interleaved with decode for other requests:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096 \
  --tensor-parallel-size 2
```
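A toy scheduler shows how a per-step token budget splits one long prompt into chunks while running requests keep decoding. The function `schedule_steps` and its decode-first policy are assumptions made for illustration, not a description of vLLM's scheduler.

```python
def schedule_steps(prompt_tokens, decoding_requests, max_num_batched_tokens=4096):
    """Yield (prefill_chunk, decode_tokens) per step for one incoming prompt.

    Each running request decodes one token per step; whatever is left of the
    token budget goes to the next chunk of the new prompt.
    """
    remaining = prompt_tokens
    while remaining > 0:
        decode_tokens = min(decoding_requests, max_num_batched_tokens)
        chunk = min(remaining, max_num_batched_tokens - decode_tokens)
        if chunk == 0:
            raise ValueError("token budget is fully consumed by decode requests")
        remaining -= chunk
        yield chunk, decode_tokens


# A 10,000-token prompt arriving while 96 requests are decoding.
for step, (chunk, decodes) in enumerate(schedule_steps(10_000, 96), start=1):
    print(f"step {step}: prefill {chunk} tokens, decode {decodes} tokens")
```

In this model, a smaller budget keeps running requests responsive but spreads the new prompt's prefill over more steps, so the setting is worth tuning against your own traffic.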
4. Quantization
Lower precision reduces prefill compute:
| Precision | TTFT (4K prompt, 70B) |
|---|---|
| FP16 | 125ms |
| FP8 (H100 native) | 70ms |
| INT4 (AWQ) | 50ms |
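The sketch below turns the table into relative reductions and adds the approximate weight footprint at each precision. It assumes exactly 70e9 parameters and ignores quantization scales, activations, and the KV cache; the helper names are hypothetical.

```python
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}
TTFT_MS = {"FP16": 125, "FP8": 70, "INT4": 50}  # 70B, 4K prompt, from the table above


def weight_gb(num_params: float, precision: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def ttft_reduction_pct(precision: str, baseline: str = "FP16") -> float:
    """Percentage TTFT reduction relative to the baseline precision."""
    return (1 - TTFT_MS[precision] / TTFT_MS[baseline]) * 100


for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_gb(70e9, precision):.0f} GB weights, "
          f"{ttft_reduction_pct(precision):.0f}% lower TTFT than FP16")
```

Lower precision can also affect output quality, so validate accuracy on your own workload before adopting it.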
5. Speculative Decoding
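A small draft model proposes several tokens at a time, and the large model verifies them in a single pass. The gain is mostly in the tokens that follow the first: the first token still waits for the large model's prefill, so treat this as a way to speed up the rest of the response rather than a direct TTFT fix.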
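```python
from vllm import LLM

# The 8B draft model proposes tokens; the 70B target model verifies them.
# Argument names vary by vLLM version (newer releases use `speculative_config`).
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
    tensor_parallel_size=2,  # matches the TP=2 setup used elsewhere in this guide
)
```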
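The draft-and-verify loop can be sketched with toy "models" that are plain functions from a token list to the next token. This is a greedy illustration only: `speculative_step` and the toy models are hypothetical, and a real engine verifies all draft positions in one batched forward pass rather than in a Python loop.

```python
def speculative_step(target_next, draft_next, context, k):
    """One greedy draft-and-verify step; returns the tokens accepted this step."""
    # The draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # The target model checks each proposed position in order.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all drafts accepted: target adds one more token
    return accepted


# Toy models: the target counts upward; the draft skips a number after multiples of 4.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] % 4 else ctx[-1] + 2

print(speculative_step(target, draft, [1], k=5))  # [2, 3, 4, 5]
```

Here one step yields four tokens for a single round of target verification. The output always matches what the target model alone would produce, so the draft model affects speed, not quality.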
6. Model Selection
Smaller models have dramatically lower TTFT:
| Model | TTFT (512 tokens, H100) | Quality |
|---|---|---|
| 8B | 12ms | Good for simple tasks |
| 34B | 25ms | Balanced |
| 70B | 38ms | High quality |
| 405B | 180ms | Maximum quality |
For agent workflows, use 8B for tool calls and 70B for reasoning.
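A routing rule like this can be a few lines of code. The sketch below is one possible shape: the step labels and the `pick_model` helper are hypothetical, and a real agent framework will have its own notion of step types.

```python
SMALL_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
LARGE_MODEL = "meta-llama/Llama-3.1-70B-Instruct"

# Hypothetical step labels for latency-sensitive, structured work.
LATENCY_SENSITIVE_STEPS = {"tool_call", "argument_extraction", "routing"}


def pick_model(step_type: str) -> str:
    """Route quick, structured steps to the 8B model and everything else to 70B."""
    return SMALL_MODEL if step_type in LATENCY_SENSITIVE_STEPS else LARGE_MODEL


plan = ["tool_call", "reasoning", "tool_call", "final_answer"]
for step in plan:
    print(f"{step:>14} -> {pick_model(step)}")
```

Defaulting unknown step types to the larger model is a deliberate choice here: a misrouted step then costs latency rather than answer quality.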
Hardware Impact
| GPU | Bandwidth | io.net Price | TTFT Rank |
|---|---|---|---|
| H100 SXM | 3.35 TB/s | $2.49/hr | Best |
| H100 PCIe | 2.0 TB/s | $2.29/hr | Good |
| A100 SXM | 2.0 TB/s | $1.89/hr | Adequate |
| L40S | 864 GB/s | $1.49/hr | Budget |
Tensor Parallelism Impact
| Config | TTFT (70B, 4K) | Cost/hr |
|---|---|---|
| 2x H100 (TP=2) | 42ms | $4.98 |
| 4x H100 (TP=4) | 28ms | $9.96 |
More GPUs reduce TTFT at increasing cost: going from TP=2 to TP=4 cuts TTFT by a third (42ms to 28ms) while doubling the hourly price. Find the balance for your SLA.
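Picking a configuration for an SLA amounts to finding the cheapest option that still meets the latency target. A minimal sketch using the two rows from the table follows; the `cheapest_config` helper is hypothetical.

```python
CONFIGS = [
    # (name, TTFT in ms for 70B at 4K tokens, cost per hour in USD), from the table above
    ("2x H100 (TP=2)", 42, 4.98),
    ("4x H100 (TP=4)", 28, 9.96),
]


def cheapest_config(sla_ms: float):
    """Return the lowest-cost config whose TTFT meets the SLA, or None."""
    candidates = [c for c in CONFIGS if c[1] <= sla_ms]
    return min(candidates, key=lambda c: c[2]) if candidates else None


print(cheapest_config(50))  # TP=2 is enough
print(cheapest_config(30))  # only TP=4 qualifies
print(cheapest_config(20))  # neither config qualifies
```

When nothing qualifies, adding GPUs is not the only lever. Shorter prompts, prefix caching, or a smaller model may close the gap for less.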
Architecture Patterns
Dedicated TTFT Endpoint
Separate endpoint optimized for latency:
- Small batch sizes
- Lower max_model_len
- Prefix caching enabled
- Chunked prefill enabled
Warm Connections
Maintain persistent HTTP/2 connections so each request skips the TCP and TLS handshakes (saves ~50ms per request).
Multi-Region for Network Latency
Deploy on io.net in the nearest region to users. Saves 50-150ms of network RTT.
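To choose a region empirically, you can time a TCP handshake to each candidate endpoint, which takes roughly one network round trip. The sketch below uses only the standard library. The helper names are hypothetical, and the hostnames in the usage note are placeholders, not real endpoints.

```python
import socket
import time


def tcp_connect_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Time one TCP handshake to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return (time.perf_counter() - start) * 1000


def nearest_endpoint(endpoints: dict) -> str:
    """Return the region name whose (host, port) connects fastest."""
    timings = {
        region: tcp_connect_ms(host, port)
        for region, (host, port) in endpoints.items()
    }
    return min(timings, key=timings.get)
```

Usage would look like `nearest_endpoint({"us-east": ("us-east.example.com", 443), "eu-west": ("eu-west.example.com", 443)})`. A single handshake is noisy, so for a real decision take the median of several probes per region, measured from where your users are.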
Measuring TTFT
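```python
import time

import requests


def measure_ttft(endpoint, prompt, n=100):
    """Print P50/P95/P99 time-to-first-token in milliseconds."""
    ttfts = []
    with requests.Session() as session:  # reuse one connection across requests
        for _ in range(n):
            start = time.perf_counter()
            # With max_tokens=1, the full response time approximates TTFT.
            resp = session.post(endpoint, json={"prompt": prompt, "max_tokens": 1})
            resp.raise_for_status()
            ttfts.append((time.perf_counter() - start) * 1000)
    ttfts.sort()
    print(f"P50: {ttfts[n // 2]:.0f}ms P95: {ttfts[int(n * 0.95)]:.0f}ms P99: {ttfts[int(n * 0.99)]:.0f}ms")
```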
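For streaming responses, TTFT is the time until the first chunk arrives rather than the time for the whole response. The sketch below times any lazily evaluated iterable of chunks. The `fake_stream` generator is a stand-in so the example runs without a server, and all helper names are hypothetical.

```python
import math
import time


def time_to_first_item_ms(stream) -> float:
    """Milliseconds until the iterable yields its first item."""
    start = time.perf_counter()
    for _ in stream:
        return (time.perf_counter() - start) * 1000
    raise ValueError("stream ended before producing any item")


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, with p in (0, 100]."""
    rank = max(1, math.ceil(len(sorted_values) * p / 100))
    return sorted_values[rank - 1]


def fake_stream(first_token_delay_s=0.02, tokens=3):
    """Stand-in for a streaming response; replace with real response chunks."""
    time.sleep(first_token_delay_s)
    for i in range(tokens):
        yield f"token-{i}"


samples = sorted(time_to_first_item_ms(fake_stream()) for _ in range(20))
print(f"P50: {percentile(samples, 50):.0f}ms  P95: {percentile(samples, 95):.0f}ms")
```

With a real client, make sure the request is sent inside the timed region. If your library sends it before returning the stream object, start the timer before that call instead.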

Frequently Asked Questions
What TTFT should I target?
Chat: <200ms. Agents: <100ms per step. Batch: irrelevant (optimize throughput).
Does quantization help?
Yes. INT4/FP8 quantization directly reduces prefill compute, improving TTFT by 40-60%.
How does prompt length affect TTFT?
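Roughly linearly once prefill dominates: doubling a long prompt about doubles TTFT. Short prompts carry fixed overhead, so the ratio is smaller there; in the benchmarks above, going from 512 to 4K tokens (8x) raises TTFT by about 3.3x to 3.8x.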
Which framework has the best TTFT?
TensorRT-LLM achieves 10-30% better TTFT than vLLM. vLLM is easier to deploy. Both work well on io.net.
Conclusion
TTFT optimization requires hardware selection (H100 on io.net), model optimization (quantization, prefix caching), and serving tuning (chunked prefill, tensor parallelism). Most teams can achieve sub-100ms TTFT for 8B-34B and sub-200ms for 70B models.
Deploy low-latency inference on io.net. Sign up and optimize your TTFT today.