aiperf · public dataset study
We measured __N__ LLM requests: __NET_%__% of "slow AI" was client network, not the model
Across __N__ streamed LLM requests, __NET_%__% of perceived end-to-end latency was client-side network (RTT, TLS handshake, queue wait) rather than model inference. Measured via aiperf's 5-lane race, __BACKENDS__, __HARDWARE__, __WINDOW__.
Key findings
__NET_%__%
of perceived latency was client network, not inference
__INF_%__%
attributable to model inference (prefill + decode)
−__TPS_DROP_%__%
drop in sustained TPS as context grows from 8K to 128K tokens
The three numbers that define LLM responsiveness
- TTFT (Time to First Token)
- TTFT is the elapsed time between request submission and the arrival of the first streamed token. It is measured in milliseconds; a TTFT under __TTFT_MS__ms is interactive-grade for chat workloads. Unlike TPS, TTFT captures only what happens before generation starts (queue wait and prefill, plus the network round trip when measured from the client) and is unaffected by generation length.
- TPS (Tokens Per Second)
- TPS is the sustained generation throughput after the first token, computed as output tokens divided by generation time. It is measured in tokens/second and reflects the decode phase. Unlike TTFT, TPS degrades under context saturation and high concurrency as the KV cache competes for GPU memory bandwidth.
- ITL (Inter-Token Latency)
- ITL is the time gap between two consecutive streamed tokens. It is measured in milliseconds per token; its variance (jitter), not just its mean, determines whether streaming output feels smooth. Unlike average TPS, ITL exposes tail stalls caused by packet loss, batch preemption, or scheduler contention.
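As a rough illustration of how the three metrics relate, the sketch below derives them from one list of client-side token arrival times. It is not aiperf's implementation: the function name `stream_metrics` and its return keys are hypothetical. It also assumes TPS counts only the tokens after the first one, and it uses the population standard deviation of the gaps as the jitter measure.

```python
from statistics import mean, pstdev


def stream_metrics(t_submit, token_times):
    """Derive TTFT, TPS, and ITL statistics from client-side timestamps.

    t_submit:    time the request was sent, in seconds
    token_times: arrival time of each streamed token, in seconds, ascending
    (Hypothetical helper for illustration only.)
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")

    # TTFT: request submission to first streamed token.
    ttft_ms = (token_times[0] - t_submit) * 1000.0

    # ITL: gap between each pair of consecutive tokens.
    itl_ms = [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]

    # TPS: tokens generated after the first, divided by the time they took.
    # Assumption: the first token is excluded because TTFT already covers it.
    gen_time_s = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / gen_time_s if gen_time_s > 0 else None

    return {
        "ttft_ms": ttft_ms,
        "tps": tps,
        "itl_mean_ms": mean(itl_ms) if itl_ms else None,
        "itl_jitter_ms": pstdev(itl_ms) if itl_ms else None,
        "itl_max_ms": max(itl_ms) if itl_ms else None,
    }


if __name__ == "__main__":
    # Four tokens; the last one arrives after a visible stall.
    print(stream_metrics(0.0, [0.25, 0.5, 0.75, 1.5]))
```

In this example the mean ITL hides the final stall, while the maximum gap and the jitter value make it visible. That is why ITL is reported separately from average TPS.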
Serving-engine comparison
| Serving Engine | Continuous Batching | KV Cache Strategy | Quantization | Best For |
|---|---|---|---|---|
| vLLM | Token-level continuous | PagedAttention | AWQ, FP8 | High-concurrency chat |
| TensorRT-LLM | In-flight | Paged KV | FP8, INT8 | Lowest TTFT on NVIDIA |
| TGI | Continuous | Block-based | EETQ, GPTQ | HF ecosystem ops |
| Triton + vLLM | Ensemble | Backend-delegated | Backend | Mixed-model fleets |
Methodology
- Measurement technique: client-side timestamping of each streamed token over WebSocket, isolating RTT and TLS handshake from server-side prefill and decode.
- Sample: __N__ runs across __BACKENDS__.
- Hardware: __HARDWARE__.
- Window: __WINDOW__. Dataset last updated __ISO_DATE__.
- Limitations: client network conditions are self-reported by the measuring browser, and results are not controlled for ISP peering or geographic routing. Reruns under your own tuning are welcome; the dataset is published under CC BY 4.0.
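The network-versus-inference split reported in the key findings can be sketched as simple arithmetic over per-request timings. The grouping follows the definitions used in this study: RTT, TLS handshake, and queue wait count as client-side network, and prefill plus decode count as inference. The function name `latency_shares` and its parameters are hypothetical, and the sketch assumes the five components are already available in milliseconds and do not overlap.

```python
def latency_shares(rtt_ms, tls_ms, queue_ms, prefill_ms, decode_ms):
    """Split end-to-end latency into network and inference percentages.

    Hypothetical helper: assumes the five timings are non-overlapping
    and together make up the whole perceived latency.
    """
    network_ms = rtt_ms + tls_ms + queue_ms    # client-side network bucket
    inference_ms = prefill_ms + decode_ms      # model inference bucket
    total_ms = network_ms + inference_ms
    if total_ms <= 0:
        raise ValueError("total latency must be positive")

    return {
        "network_pct": 100.0 * network_ms / total_ms,
        "inference_pct": 100.0 * inference_ms / total_ms,
    }


if __name__ == "__main__":
    # Illustrative numbers only, not values from the dataset.
    print(latency_shares(rtt_ms=50, tls_ms=100, queue_ms=50,
                         prefill_ms=100, decode_ms=200))
```

Averaging these per-request shares gives a different figure than dividing summed network time by summed total time, so any rerun should state which aggregation it used.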