aiperf · public dataset study

We measured __N__ LLM requests: __NET_%__% of "slow AI" was client network, not the model

Across __N__ streamed LLM requests, __NET_%__% of perceived end-to-end latency was client-side network (RTT, TLS handshake, queue wait) rather than model inference. Measured via aiperf's 5-lane race. Backends: __BACKENDS__. Hardware: __HARDWARE__. Window: __WINDOW__.

Key findings

  • __NET_%__% of perceived latency was client network, not inference
  • __INF_%__% was attributable to model inference (prefill + decode)
  • __TPS_DROP_%__% drop in sustained TPS when context grows from 8K to 128K tokens
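
As a rough illustration of how a TPS drop figure like the one above can be derived, the sketch below computes the relative drop between two sustained-TPS readings. The function name and the sample values are hypothetical and are not taken from the dataset.

```python
def tps_drop_pct(tps_short_ctx: float, tps_long_ctx: float) -> float:
    """Relative drop in sustained TPS, in percent, from a short-context
    baseline (e.g. 8K tokens) to a long-context run (e.g. 128K tokens)."""
    if tps_short_ctx <= 0:
        raise ValueError("baseline TPS must be positive")
    return 100.0 * (tps_short_ctx - tps_long_ctx) / tps_short_ctx


# Made-up readings for illustration only: 80 tok/s at 8K, 20 tok/s at 128K.
print(tps_drop_pct(80.0, 20.0))
```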

The three numbers that define LLM responsiveness

TTFT (Time to First Token)
TTFT is the elapsed time between request submission and the arrival of the first streamed token. It is measured in milliseconds; a TTFT under __TTFT_MS__ ms is interactive-grade for chat workloads. Unlike TPS, TTFT captures everything that happens before generation starts (connection setup, queue wait, and prefill) and is unaffected by generation length.
TPS (Tokens Per Second)
TPS is the sustained generation throughput after the first token, computed as output tokens divided by generation time. It is measured in tokens/second and reflects the decode phase. Unlike TTFT, TPS degrades under context saturation and high concurrency, because decode is typically bound by GPU memory bandwidth and KV cache reads grow with context length and batch size.
ITL (Inter-Token Latency)
ITL is the time gap between two consecutive streamed tokens. It is measured in milliseconds per token; its variance (jitter) — not just its mean — determines whether streaming output feels smooth. Unlike average TPS, ITL exposes tail stalls caused by packet loss, batch preemption, or scheduler contention.
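
The three definitions above can be sketched from a list of client-side token arrival timestamps. This is a minimal illustration, not aiperf's implementation: the function name `stream_metrics` is hypothetical, timestamps are assumed to be seconds from a monotonic clock, and TPS is assumed to count only tokens that arrive after the first one so that numerator and denominator cover the same interval.

```python
from statistics import mean, pstdev


def stream_metrics(t_submit: float, token_times: list[float]) -> dict:
    """Derive TTFT, TPS, and ITL statistics from arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("need at least one token timestamp")

    # TTFT: request submission to first streamed token, in milliseconds.
    ttft_ms = (token_times[0] - t_submit) * 1000.0

    # ITL: gap between each pair of consecutive tokens, in milliseconds.
    itl_ms = [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]

    # TPS: tokens after the first, divided by the time they took to arrive.
    gen_time_s = token_times[-1] - token_times[0]
    tps = len(itl_ms) / gen_time_s if gen_time_s > 0 else None

    return {
        "ttft_ms": ttft_ms,
        "tps": tps,
        "itl_mean_ms": mean(itl_ms) if itl_ms else None,
        # Jitter here is the population standard deviation of the gaps.
        "itl_jitter_ms": pstdev(itl_ms) if itl_ms else None,
        # The worst single gap exposes tail stalls that the mean hides.
        "itl_max_ms": max(itl_ms) if itl_ms else None,
    }
```

Reporting the maximum gap and the jitter next to the mean is a deliberate choice: a stream with one long stall can have the same average TPS as a smooth one.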

Serving-engine comparison

Serving Engine | Continuous Batching    | KV Cache Strategy | Quantization | Best For
vLLM           | Token-level continuous | PagedAttention    | AWQ, FP8     | High-concurrency chat
TensorRT-LLM   | In-flight              | Paged KV          | FP8, INT8    | Lowest TTFT on NVIDIA
TGI            | Continuous             | Block-based       | EETQ, GPTQ   | HF ecosystem ops
Triton + vLLM  | Ensemble               | Backend-delegated | Backend      | Mixed-model fleets

Methodology

  • Measurement technique: client-side timestamping of each streamed token over WebSocket, isolating RTT and TLS handshake from server-side prefill and decode.
  • Sample: __N__ requests across __BACKENDS__.
  • Hardware: __HARDWARE__.
  • Window: __WINDOW__. Dataset last updated __ISO_DATE__.
  • Limitations: client network conditions are self-reported by the measuring browser; results are not controlled for ISP peering or geographic routing. Reruns under your own tuning are welcome — the dataset is published under CC BY 4.0.
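
To make the attribution concrete, here is a minimal sketch of how network and inference shares could be computed once each request has been broken into phases. It follows the grouping used in this study (RTT, TLS handshake, and queue wait counted as client network; prefill and decode counted as inference). The `RequestTiming` fields, the function name, and the choice to aggregate by summing across requests are assumptions for illustration, not a description of aiperf's internals.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request phase durations, all in milliseconds."""
    rtt_ms: float
    tls_ms: float
    queue_ms: float
    prefill_ms: float
    decode_ms: float


def latency_shares(timings: list[RequestTiming]) -> dict:
    """Split total measured latency into network and inference percentages."""
    # Client network bucket: RTT + TLS handshake + queue wait.
    network = sum(t.rtt_ms + t.tls_ms + t.queue_ms for t in timings)
    # Inference bucket: prefill + decode.
    inference = sum(t.prefill_ms + t.decode_ms for t in timings)
    total = network + inference
    if total <= 0:
        raise ValueError("no latency recorded")
    return {
        "network_pct": 100.0 * network / total,
        "inference_pct": 100.0 * inference / total,
    }
```

Summing before dividing weights long requests more heavily than averaging per-request percentages would; either choice is defensible, and a rerun should state which one it uses.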