AIPerf · public dataset study

We measured __N__ LLM requests: __NET_%__% of "slow AI" was client network, not the model

Across __N__ streamed LLM requests, __NET_%__% of perceived end-to-end latency came from the client side (RTT, TLS handshake, queue wait) rather than from model inference. Measured with AIPerf's live speed test. Backends: __BACKENDS__. Hardware: __HARDWARE__. Window: __WINDOW__.

Download raw dataset (CSV · CC BY 4.0) · Methodology

Key findings

__NET_%__% of perceived latency was client network, not inference

__INF_%__% of perceived latency was attributable to model inference (prefill + decode)

−__TPS_DROP_%__% sustained TPS drop when context grows from 8K to 128K tokens

The three numbers that define LLM responsiveness

TTFT (Time to First Token): TTFT is the elapsed time between request submission and the arrival of the first streamed token, measured in milliseconds. Under __TTFT_MS__ ms is interactive-grade for chat workloads. Unlike TPS, TTFT captures queue wait and prefill and does not depend on generation length; when measured client-side, it also includes connection setup and network round-trip time.
TPS (Tokens Per Second): TPS is the sustained generation throughput after the first token, computed as output tokens divided by generation time. It is measured in tokens/second and reflects the decode phase. Unlike TTFT, TPS degrades with long contexts and high concurrency, because KV-cache reads compete for GPU memory bandwidth.
ITL (Inter-Token Latency): ITL is the time gap between two consecutive streamed tokens, measured in milliseconds per token. Its variance (jitter), not just its mean, determines whether streaming output feels smooth. Unlike average TPS, ITL exposes tail stalls caused by packet loss, batch preemption, or scheduler contention.
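To make the three definitions concrete, here is a minimal Python sketch that derives TTFT, TPS, and ITL statistics from client-side timestamps. The function name `stream_metrics`, its arguments, and the returned keys are illustrative assumptions and are not part of AIPerf. It also assumes TPS counts only the tokens that arrive after the first one, which is one reasonable reading of "output tokens divided by generation time".

```python
from statistics import mean, pstdev


def stream_metrics(request_sent_ms, token_arrivals_ms):
    """Derive TTFT, TPS and ITL stats from client-side timestamps (all in ms).

    request_sent_ms   -- when the request was submitted
    token_arrivals_ms -- arrival time of each streamed token, in order
    """
    if not token_arrivals_ms:
        raise ValueError("need at least one token arrival")

    # TTFT: request submission to first streamed token.
    ttft_ms = token_arrivals_ms[0] - request_sent_ms

    # ITL: gap between each pair of consecutive tokens.
    itl_ms = [b - a for a, b in zip(token_arrivals_ms, token_arrivals_ms[1:])]

    # TPS: tokens after the first, divided by generation time (assumption).
    generation_ms = token_arrivals_ms[-1] - token_arrivals_ms[0]
    tokens_after_first = len(token_arrivals_ms) - 1
    tps = tokens_after_first * 1000 / generation_ms if generation_ms > 0 else 0.0

    return {
        "ttft_ms": ttft_ms,
        "tps": tps,
        "itl_mean_ms": mean(itl_ms) if itl_ms else 0.0,
        # Jitter as population standard deviation of the gaps.
        "itl_jitter_ms": pstdev(itl_ms) if itl_ms else 0.0,
        "itl_max_ms": max(itl_ms) if itl_ms else 0.0,
    }


# Synthetic timestamps, made up for illustration only.
metrics = stream_metrics(0, [200, 250, 300, 400])
print(metrics)
```

In this synthetic run the last gap is twice as long as the others. The mean ITL hides that stall, while the jitter and maximum values surface it.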

Serving-engine comparison

Serving Engine	Continuous Batching	KV Cache Strategy	Quantization	Best For
vLLM	Token-level continuous	PagedAttention	AWQ, FP8	High-concurrency chat
TensorRT-LLM	In-flight	Paged KV	FP8, INT8	Lowest TTFT on NVIDIA
TGI	Continuous	Block-based	EETQ, GPTQ	HF ecosystem ops
Triton + vLLM	Ensemble	Backend-delegated	Backend-dependent	Mixed-model fleets

Methodology

Measurement technique: client-side timestamping of each streamed token, isolating RTT and TLS handshake from server-side prefill and decode.
Sample: __N__ runs across __BACKENDS__.
Hardware: __HARDWARE__.
Window: __WINDOW__. Dataset last updated __ISO_DATE__.
Limitations: client network conditions are self-reported by the measuring browser; results are not controlled for ISP peering or geographic routing. Reruns under your own tuning are welcome. The dataset is published under CC BY 4.0.
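As a rough illustration of how a network-versus-inference split could be computed from per-request timings, the sketch below sums the client-side components (RTT, TLS handshake, queue wait) and the inference components (prefill, decode) over all runs and reports each as a share of the total. The `RequestTiming` fields and the sum-then-divide aggregation are assumptions made for this example; the published CSV's column names and the study's exact aggregation may differ.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    # Hypothetical per-request fields, all in milliseconds.
    rtt_ms: float
    tls_ms: float
    queue_ms: float
    prefill_ms: float
    decode_ms: float


def latency_shares(runs):
    """Return (network_pct, inference_pct) of total latency across runs.

    Assumes the two buckets together account for all measured latency.
    """
    network = sum(r.rtt_ms + r.tls_ms + r.queue_ms for r in runs)
    inference = sum(r.prefill_ms + r.decode_ms for r in runs)
    total = network + inference
    if total <= 0:
        raise ValueError("total latency must be positive")
    return 100 * network / total, 100 * inference / total


# Synthetic timings, made up for illustration only (not dataset values).
runs = [
    RequestTiming(rtt_ms=40, tls_ms=60, queue_ms=100, prefill_ms=300, decode_ms=500),
    RequestTiming(rtt_ms=20, tls_ms=30, queue_ms=50, prefill_ms=400, decode_ms=500),
]
network_pct, inference_pct = latency_shares(runs)
print(f"network: {network_pct:.1f}%  inference: {inference_pct:.1f}%")
```

Summing before dividing weights long requests more heavily than short ones. Averaging per-request percentages instead would weight every request equally, and the two approaches can give noticeably different shares.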