Time-to-First-Token (TTFT) is the most important latency metric for user-facing LLM applications. It measures the delay between submitting a prompt and receiving the first output token. Below 100ms, responses feel instantaneous. Above 500ms, they feel sluggish. Above 1 second, users question whether the application is working.

For agentic AI workflows with 10 sequential LLM calls, TTFT compounds: 300ms per call means 3 seconds of pure waiting. Reducing that to 100ms cuts wait time to 1 second, a transformative improvement.
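The compounding is plain multiplication. A minimal sketch, assuming the calls run strictly one after another and each pays the same TTFT (the helper name `sequential_wait_ms` is made up for illustration):

```python
def sequential_wait_ms(num_calls: int, ttft_ms: float) -> float:
    """Total time spent waiting for first tokens across sequential LLM calls."""
    return num_calls * ttft_ms


baseline = sequential_wait_ms(10, 300)   # the 300ms-per-call case from the text
optimized = sequential_wait_ms(10, 100)  # the 100ms-per-call case from the text
print(f"baseline {baseline:.0f} ms, optimized {optimized:.0f} ms, "
      f"saved {baseline - optimized:.0f} ms per workflow run")
```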

This guide covers the complete TTFT optimization stack. With io.net's H100 GPUs at $2.49/hr and the techniques here, sub-100ms TTFT is achievable for most production models.

What Determines TTFT

TTFT = Network_latency + Queue_wait + Prefill_time + First_decode_step

| Phase | Duration | Primary Driver |
|---|---|---|
| Network | 5-200ms | Geographic distance |
| Queue wait | 0-500ms | Server load |
| Prefill | 10-500ms | Prompt length, GPU compute |
| First decode | 5-20ms | Model size, bandwidth |

Prefill dominates for long prompts. It is compute-bound, and its cost grows roughly linearly with token count (the attention component grows faster at very long context lengths).
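To see which phase to attack first, it can help to write the sum out. The sketch below assumes you already have per-phase measurements; the class name `TTFTBreakdown` and the sample numbers are illustrative placeholders chosen from within the ranges above, not measured values.

```python
from dataclasses import dataclass, asdict


@dataclass
class TTFTBreakdown:
    network_ms: float
    queue_ms: float
    prefill_ms: float
    first_decode_ms: float

    def total_ms(self) -> float:
        # TTFT = Network_latency + Queue_wait + Prefill_time + First_decode_step
        return sum(asdict(self).values())

    def dominant_phase(self) -> str:
        # Name of the field that contributes the most milliseconds.
        phases = asdict(self)
        return max(phases, key=phases.get)


# Illustrative numbers only.
sample = TTFTBreakdown(network_ms=40, queue_ms=10, prefill_ms=125, first_decode_ms=15)
print(sample.total_ms(), sample.dominant_phase())
```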

TTFT Benchmarks

| GPU | Model | Prompt Length | TTFT |
|---|---|---|---|
| H100 SXM | Llama 3.1 8B | 512 tokens | 12ms |
| H100 SXM | Llama 3.1 8B | 4K tokens | 45ms |
| H100 SXM | Llama 3.1 70B (TP=2) | 512 tokens | 38ms |
| H100 SXM | Llama 3.1 70B (TP=2) | 4K tokens | 125ms |
| A100 80GB | Llama 3.1 70B (TP=2) | 512 tokens | 75ms |
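Two rows for the same GPU and model are enough for a rough interpolation. The sketch below fits a straight line (fixed overhead plus a per-token cost) through the two Llama 3.1 8B rows. A two-point fit is crude and assumes TTFT is linear between the measured lengths, so treat the result as an estimate; `fit_ttft` and `predict_ttft_ms` are hypothetical helper names.

```python
def fit_ttft(point_a, point_b):
    """Fit TTFT = fixed_ms + per_token_ms * tokens through two (tokens, ttft_ms) points."""
    (n1, t1), (n2, t2) = point_a, point_b
    per_token_ms = (t2 - t1) / (n2 - n1)
    fixed_ms = t1 - per_token_ms * n1
    return fixed_ms, per_token_ms


def predict_ttft_ms(fixed_ms, per_token_ms, tokens):
    return fixed_ms + per_token_ms * tokens


# H100 SXM, Llama 3.1 8B rows from the table: 512 tokens -> 12ms, 4K tokens -> 45ms.
fixed_ms, per_token_ms = fit_ttft((512, 12), (4096, 45))
print(f"fixed overhead ~{fixed_ms:.1f} ms, ~{per_token_ms * 1000:.1f} ms per 1K tokens")
print(f"estimated TTFT at 2K tokens: {predict_ttft_ms(fixed_ms, per_token_ms, 2048):.0f} ms")
```

The non-zero fixed term is why an 8x longer prompt in the table does not cost 8x the TTFT.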

Optimization Techniques

1. Shorter Prompts

Every prompt token adds to prefill time:
- Compress system prompts to essentials
- Move static context to fine-tuning
- Use prefix caching for repeated prefixes

2. Prefix Caching

Cache KV states for shared system prompts:

# vLLM with prefix caching
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --enable-prefix-caching \
    --tensor-parallel-size 2

TTFT reduction: 40-70% for long system prompts on subsequent requests.
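To make the mechanism concrete, here is a toy model of block-level prefix caching. It is a simplified sketch and not vLLM's implementation: it assumes KV state is cached in fixed-size token blocks keyed by a hash of the whole prefix up to that block, and it only counts how many tokens would still need prefill. `ToyPrefixCache` and `BLOCK_SIZE` are illustrative names and values.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cached block (illustrative value)


class ToyPrefixCache:
    def __init__(self):
        self._cached_prefixes = set()

    def tokens_to_prefill(self, tokens):
        """Return how many tokens still need prefill, then cache this prompt's full blocks."""
        hit_tokens = 0
        still_matching = True
        running_hash = hashlib.sha256()
        full_block_end = len(tokens) - len(tokens) % BLOCK_SIZE
        for start in range(0, full_block_end, BLOCK_SIZE):
            block = tokens[start:start + BLOCK_SIZE]
            # The key covers every token up to and including this block,
            # so a hit means the entire prefix is identical.
            running_hash.update(repr(block).encode())
            key = running_hash.hexdigest()
            if still_matching and key in self._cached_prefixes:
                hit_tokens += BLOCK_SIZE
            else:
                still_matching = False
                self._cached_prefixes.add(key)
        return len(tokens) - hit_tokens


cache = ToyPrefixCache()
system_prompt = list(range(64))              # stand-in for a shared 64-token system prompt
first = cache.tokens_to_prefill(system_prompt + [1000, 1001, 1002])
second = cache.tokens_to_prefill(system_prompt + [2000] * 5)
print(first, second)
```

The second request only prefills its user-specific suffix. This is also why the shared prefix has to be byte-for-byte identical and placed at the start of the prompt: any change invalidates every block after it.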

3. Chunked Prefill

Process prefill in chunks interleaved with decode for other requests:

# vLLM with chunked prefill
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 4096
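A rough way to picture the token budget: the sketch below assumes each scheduler step first reserves one token slot per in-flight decode request and gives whatever is left of the budget to the new prompt's prefill. Real schedulers are more involved, so this is an illustration of the idea only, and `plan_chunked_prefill` is a hypothetical helper.

```python
def plan_chunked_prefill(prompt_tokens, max_num_batched_tokens, decode_requests):
    """Return the prefill chunk size used in each scheduler step for one new prompt."""
    budget = max_num_batched_tokens - decode_requests  # slots left after decodes
    if budget <= 0:
        raise ValueError("token budget is fully consumed by decode requests")
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(budget, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# A 10,000-token prompt arriving while 96 requests are decoding.
plan = plan_chunked_prefill(10_000, max_num_batched_tokens=4096, decode_requests=96)
print(plan)
```

The long prompt takes several steps to prefill, but the 96 decoding requests keep producing a token on every step instead of stalling behind it.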

4. Quantization

Lower precision reduces prefill compute:

| Precision | TTFT (4K prompt, 70B) |
|---|---|
| FP16 | 125ms |
| FP8 (H100 native) | 70ms |
| INT4 (AWQ) | 50ms |
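The relative gains and the weight footprint follow directly from the table and the bytes per parameter. The sketch below is back-of-the-envelope arithmetic: the footprint counts weights only and ignores quantization scales, KV cache, and activations. The helper names are illustrative.

```python
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}
TTFT_MS_70B_4K = {"FP16": 125, "FP8": 70, "INT4": 50}  # values from the table


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (weights only)."""
    return params_billion * BYTES_PER_PARAM[precision]


def ttft_reduction_pct(baseline_ms: float, new_ms: float) -> float:
    return 100 * (baseline_ms - new_ms) / baseline_ms


for precision, ttft in TTFT_MS_70B_4K.items():
    saved = ttft_reduction_pct(TTFT_MS_70B_4K["FP16"], ttft)
    print(f"{precision}: ~{weight_gb(70, precision):.0f} GB weights, "
          f"{ttft} ms TTFT ({saved:.0f}% lower than FP16)")
```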

5. Speculative Decoding

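A small draft model proposes several tokens ahead and the large model verifies them in a single forward pass. The large model still has to prefill the prompt before the first token is emitted, so the gain shows up mainly in the tokens that follow the first and in overall response latency: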

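from vllm import LLM

# Older vLLM releases take the draft model as keyword arguments, as shown here;
# newer releases expect a speculative_config dict instead, so check your version.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=2,  # 70B in FP16 does not fit on a single 80GB GPU
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,  # tokens the draft model proposes per step
)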
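The propose-then-verify loop can be sketched with toy "models" that map a token sequence to the next token. This is a simplified greedy variant for illustration, not vLLM's implementation; a real engine scores all proposed positions in one batched forward pass and may use probabilistic acceptance. `speculative_step`, `target_next`, and `draft_next` are hypothetical names.

```python
def speculative_step(target_next, draft_next, tokens, k):
    """One greedy speculative step; returns the tokens appended after `tokens`."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(tokens + proposal))

    # 2. The target model checks each proposed position in order.
    accepted = []
    for tok in proposal:
        expected = target_next(tokens + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target contributes one bonus token.
    accepted.append(target_next(tokens + accepted))
    return accepted


# Toy models: the target counts upward mod 10; the draft agrees except after a 3.
target_next = lambda seq: (seq[-1] + 1) % 10
draft_next = lambda seq: 0 if seq[-1] == 3 else (seq[-1] + 1) % 10

partial = speculative_step(target_next, draft_next, [0], k=5)
full = speculative_step(target_next, draft_next, [4], k=3)
print(partial, full)
```

In both cases the output is exactly what the target model alone would have produced; the draft model only changes how many target passes are needed to get there.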

6. Model Selection

Smaller models have dramatically lower TTFT:

| Model | TTFT (512 tokens, H100) | Quality |
|---|---|---|
| 8B | 12ms | Good for simple tasks |
| 34B | 25ms | Balanced |
| 70B | 38ms | High quality |
| 405B | 180ms | Maximum quality |

For agent workflows, use 8B for tool calls and 70B for reasoning.
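A minimal routing sketch for that split, using the 512-token TTFT figures from the table. The step labels (`"tool_call"`, `"reasoning"`) and helper names are assumptions made for illustration; a real agent framework would classify steps its own way.

```python
TTFT_MS = {"8B": 12, "70B": 38}  # 512-token TTFT on H100, from the table


def pick_model(step_kind: str) -> str:
    """Route tool calls to the small model and everything else to the large one."""
    return "8B" if step_kind == "tool_call" else "70B"


def workflow_ttft_ms(steps) -> int:
    return sum(TTFT_MS[pick_model(step)] for step in steps)


steps = ["reasoning", "tool_call", "tool_call", "tool_call", "reasoning"]
routed = workflow_ttft_ms(steps)
all_large = len(steps) * TTFT_MS["70B"]
print(f"routed: {routed} ms of TTFT, all-70B: {all_large} ms")
```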

Hardware Impact

| GPU | Memory Bandwidth | io.net Price | TTFT Rank |
|---|---|---|---|
| H100 SXM | 3.35 TB/s | $2.49/hr | Best |
| H100 PCIe | 2.0 TB/s | $2.29/hr | Good |
| A100 SXM | 2.0 TB/s | $1.89/hr | Adequate |
| L40S | 864 GB/s | $1.49/hr | Budget |
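Bandwidth matters most for the first decode step, which has to stream the model weights from GPU memory. A rough lower bound is weight bytes divided by memory bandwidth. The sketch assumes the weights are read once per step on a single GPU and ignores KV cache reads and kernel overhead, so real numbers will be higher; `decode_floor_ms` is a hypothetical helper.

```python
BANDWIDTH_GB_PER_S = {  # from the table
    "H100 SXM": 3350,
    "H100 PCIe": 2000,
    "A100 SXM": 2000,
    "L40S": 864,
}


def decode_floor_ms(params_billion: float, bytes_per_param: float, gpu: str) -> float:
    """Lower bound on one decode step: time to read all weights once."""
    weight_gb = params_billion * bytes_per_param
    return weight_gb / BANDWIDTH_GB_PER_S[gpu] * 1000


for gpu in BANDWIDTH_GB_PER_S:
    # 8B parameters in FP16 (2 bytes each) is about 16 GB of weights.
    print(f"{gpu}: >= {decode_floor_ms(8, 2, gpu):.1f} ms per decode step")
```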

Tensor Parallelism Impact

| Config | TTFT (70B, 4K) | Cost/hr |
|---|---|---|
| 2x H100 (TP=2) | 42ms | $4.98 |
| 4x H100 (TP=4) | 28ms | $9.96 |

More GPUs reduce TTFT at increasing cost. Find the balance for your SLA.
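One way to frame the trade-off is to pick the cheapest configuration that still meets the latency target. A small sketch using the two rows above; the function names are illustrative.

```python
CONFIGS = [  # (name, TTFT in ms for 70B at 4K, cost in $/hr), from the table
    ("2x H100 (TP=2)", 42, 4.98),
    ("4x H100 (TP=4)", 28, 9.96),
]


def cheapest_for_sla(sla_ms: float):
    """Return the cheapest config whose TTFT meets the SLA, or None if none does."""
    candidates = [cfg for cfg in CONFIGS if cfg[1] <= sla_ms]
    return min(candidates, key=lambda cfg: cfg[2]) if candidates else None


def cost_per_ms_saved(slow, fast) -> float:
    """Extra $/hr paid for each millisecond of TTFT removed."""
    return (fast[2] - slow[2]) / (slow[1] - fast[1])


print(cheapest_for_sla(50), cheapest_for_sla(30), cheapest_for_sla(20))
print(f"${cost_per_ms_saved(CONFIGS[0], CONFIGS[1]):.2f}/hr per ms saved going TP=2 -> TP=4")
```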

Architecture Patterns

Dedicated TTFT Endpoint

Separate endpoint optimized for latency:
- Small batch sizes
- Lower max_model_len
- Prefix caching enabled
- Chunked prefill enabled
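The checklist maps onto vLLM server flags roughly as follows. The sketch only builds the command line; the numeric values for `--max-num-seqs` and `--max-model-len` are placeholders to tune for your workload, and `latency_endpoint_args` is a hypothetical helper.

```python
def latency_endpoint_args(model: str, tensor_parallel_size: int = 2) -> list[str]:
    """Build a vLLM launch command for a latency-focused endpoint."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-num-seqs", "8",        # small batch size (placeholder value)
        "--max-model-len", "8192",    # lower context cap (placeholder value)
        "--enable-prefix-caching",
        "--enable-chunked-prefill",
    ]


args = latency_endpoint_args("meta-llama/Llama-3.1-70B-Instruct")
print(" ".join(args))
```

The list can be passed to `subprocess.run(args)` on a machine where vLLM is installed.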

Warm Connections

Maintain persistent HTTP/2 connections so that requests do not repeat the TCP and TLS handshakes (saves ~50ms per request).

Multi-Region for Network Latency

Deploy on io.net in the nearest region to users. Saves 50-150ms of network RTT.

Measuring TTFT

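import time
import requests

def measure_ttft(endpoint, prompt, n=100):
    """Print client-side P50/P95/P99 TTFT in ms over n streamed requests."""
    # Assumes the endpoint accepts "stream": True and sends tokens as they are generated.
    payload = {"prompt": prompt, "max_tokens": 1, "stream": True}
    ttfts = []
    with requests.Session() as session:  # reuse one warm connection
        for _ in range(n):
            start = time.perf_counter()
            with session.post(endpoint, json=payload, stream=True) as resp:
                resp.raise_for_status()
                lines = resp.iter_lines()
                next(lines, None)  # returns once the first streamed chunk arrives
                ttfts.append((time.perf_counter() - start) * 1000)
                for _chunk in lines:  # drain the rest so the connection can be reused
                    pass
    ttfts.sort()
    pick = lambda q: ttfts[min(n - 1, int(n * q))]  # nearest-rank percentile
    print(f"P50: {pick(0.50):.0f}ms P95: {pick(0.95):.0f}ms P99: {pick(0.99):.0f}ms")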

Frequently Asked Questions

What TTFT should I target?

Chat: <200ms. Agents: <100ms per step. Batch: irrelevant (optimize throughput).

Does quantization help?

Yes. INT4/FP8 quantization directly reduces prefill compute, improving TTFT by 40-60%.

How does prompt length affect TTFT?

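Roughly linearly once the prompt is long enough for prefill to dominate. Fixed per-request overhead flattens the curve at shorter lengths: in the benchmarks above, an 8x longer prompt (512 to 4K tokens) raises TTFT by about 3.3-3.8x, not 8x.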

Which framework has the best TTFT?

TensorRT-LLM achieves 10-30% better TTFT than vLLM. vLLM is easier to deploy. Both work well on io.net.

Conclusion

TTFT optimization requires hardware selection (H100 on io.net), model optimization (quantization, prefix caching), and serving tuning (chunked prefill, tensor parallelism). Most teams can achieve sub-100ms TTFT for 8B-34B and sub-200ms for 70B models.


Deploy low-latency inference on io.net. Sign up and optimize your TTFT today.