True Time-to-First-Token (TTFT)
- True Time-to-First-Token (TTFT) is a latency metric that quantifies the delay from an inference request to the first output token by summing scheduling, queuing, and prompt processing times.
- Architectural factors like KV cache management, dynamic token selection, and SLO-aware scheduling can reduce TTFT significantly, with improvements reported up to 69×.
- Evaluating TTFT alongside metrics such as TBT, TPOT, and fluidity-index is essential for a comprehensive assessment of system performance and user experience.
True Time-to-First-Token (TTFT) is a performance metric that quantifies the latency from the point an LLM inference request is issued until the system produces the first output token. TTFT is central to evaluating system responsiveness in both research and production settings, as it reflects the combined scheduling, queuing, and prompt processing delays prior to token streaming. While TTFT reliably indicates initial system reactivity, its interpretation and impact are significantly influenced by architectural choices, queuing strategies, resource scheduling, prompt length, and caching mechanisms. Recent advances, as documented in technical literature and system benchmarks, have clarified TTFT’s computational contributors and its nuanced role in holistic performance evaluation.
1. Formal Definition and Computational Contributors
TTFT is strictly defined as the sum of all latencies incurred between request arrival and the emission of the first generated token. Formally, the TTFT for a given request may be expressed as:

$$\mathrm{TTFT} = T_{\mathrm{sched}} + T_{\mathrm{queue}} + T_{\mathrm{prefill}}$$

where:
- $T_{\mathrm{sched}}$ is the time associated with resource allocation and scheduling,
- $T_{\mathrm{queue}}$ covers time spent waiting behind other inflight requests,
- $T_{\mathrm{prefill}}$ represents the computation time to process the prompt and generate/cache all Key-Value (KV) representations required for the first forward pass of the transformer.
In the context of prompt processing, TTFT is often heavily influenced by the quadratic time complexity of self-attention with respect to prompt length, as demonstrated in benchmarks and system evaluations (Agrawal et al., 9 Jul 2024, Xiong et al., 1 Oct 2024, Kumar et al., 22 Aug 2025). For models supporting context windows of variable length, TTFT can scale superlinearly unless mitigated by alternative scheduling or token selection algorithms.
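In practice, TTFT is usually measured externally by timing the gap between request submission and the arrival of the first streamed token. The following minimal Python sketch illustrates this; `stream_tokens` is a hypothetical streaming client callable, not an API from any of the cited systems.

```python
import time

def measure_ttft(stream_tokens, prompt):
    """Measure TTFT for a single request against a token-streaming endpoint.

    `stream_tokens` is any callable that returns an iterator of output tokens
    for the given prompt (a hypothetical interface; substitute your client
    library's streaming call). Scheduling, queuing, and prefill all happen
    between submission and the first yielded token.
    """
    t_request = time.perf_counter()            # request issued
    first_token_at = None
    tokens = []
    for tok in stream_tokens(prompt):          # blocks through sched + queue + prefill
        if first_token_at is None:
            first_token_at = time.perf_counter()
        tokens.append(tok)
    if first_token_at is None:
        raise RuntimeError("request produced no tokens")
    ttft = first_token_at - t_request          # seconds until the first output token
    return ttft, tokens
```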
2. TTFT Versus Other Latency Metrics
TTFT captures only the initial system latency, terminating at the emission of the first output token. It does not account for subsequent streaming performance or possible stalls in the autoregressive decode phase. This can result in incomplete or misleading assessments if TTFT is used as the sole user-experience metric.
Other standard latency metrics include:
- Time Between Tokens (TBT): Measures the delay between consecutive output tokens in a streaming context (Agrawal et al., 9 Jul 2024).
- Time Per Output Token (TPOT): Normalizes total decode time by the number of generated tokens, averaging system responsiveness but potentially hiding intermittent jitter or stalls.
- Fluidity-index: Introduces token-level deadlines to capture streaming consistency; TTFT serves here as the initial deadline offset (Agrawal et al., 9 Jul 2024).
A comparative summary appears below:
| Metric | Capture Scope | Sensitivity to Prompt Length | Streaming Quality Coverage |
|---|---|---|---|
| TTFT | Prefill → first token | High | None |
| TBT | Each decode token | Low | Per-token |
| TPOT | All output tokens | Moderate | Averages over total decode |
| Fluidity-index | Token-wise, with deadlines | High for the initial token, full for streaming | Yes |
TTFT is indispensable for quantifying the system’s initial responsiveness but must be considered in conjunction with streaming metrics for a holistic evaluation.
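As a concrete illustration of how these metrics relate, the sketch below derives TTFT, TBT, and TPOT from a single request's token-emission timestamps. The TPOT convention used (decode time divided by the number of decode steps) is one common choice, and the function name is illustrative.

```python
from statistics import mean

def latency_metrics(t_request, token_times):
    """Derive TTFT, TBT, and TPOT from one request's token-emission timeline.

    t_request   : timestamp (seconds) at which the request was issued
    token_times : emission timestamp of each output token, in order
    Both are assumed to come from the same monotonic clock.
    """
    ttft = token_times[0] - t_request                              # queuing + prefill delay
    tbt = [b - a for a, b in zip(token_times, token_times[1:])]    # gaps between consecutive tokens
    decode_time = token_times[-1] - token_times[0]
    tpot = decode_time / max(len(token_times) - 1, 1)              # mean time per output token
    return {
        "ttft": ttft,
        "tbt_mean": mean(tbt) if tbt else 0.0,
        "tbt_max": max(tbt, default=0.0),                          # worst inter-token stall
        "tpot": tpot,
    }
```

Note how a large `tbt_max` can coexist with a small TTFT and an unremarkable TPOT, which is exactly the kind of stall that TTFT alone cannot reveal.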
3. Architectural Factors Affecting TTFT
Several architectural and algorithmic choices affect TTFT:
- KV Cache Management: The method of KV cache allocation is critical. Layer-wise KV allocation and dynamic offloading (as in LayerKV; Xiong et al., 1 Oct 2024) significantly reduce queuing delays and resource contention, yielding TTFT improvements of up to 69×.
- Token Pruning and Selection: Algorithms such as LazyLLM’s dynamic token pruning (Fu et al., 19 Jul 2024), SlimInfer’s block-level dynamic pruning (Long et al., 8 Aug 2025), and FastKV’s Token-Selective Propagation (Jo et al., 3 Feb 2025) decrease prompt processing time by restricting computation to only those tokens with high predictive utility for the first output.
- Scheduling: Advanced batch construction and SLO-aware scheduling, as in SLAI (Bari et al., 1 Aug 2025), SCORPIO (2505.23022), and TaiChi (Wang et al., 4 Aug 2025), can reorder requests or partition compute resources in ways that reduce aggregate TTFT under heavy load, balancing TTFT against TPOT constraints (a minimal slack-ordering sketch follows this list).
- Mixture-of-Experts (MoE): MoE routing, as used in GPT-OSS-20B (Kumar et al., 22 Aug 2025), adds overhead to TTFT due to expert selection, increasing initial latency relative to dense models despite improved throughput and energy efficiency in subsequent token generation.
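To make the scheduling point concrete, the sketch below orders pending requests by their TTFT-SLO slack and packs a token-budgeted prefill batch. It is a minimal illustration of the slack-ordering idea, not the actual algorithm of SLAI, SCORPIO, or TaiChi; the linear prefill-time estimate and parameter names are assumptions.

```python
import heapq
import time
from dataclasses import dataclass

@dataclass
class PendingRequest:
    arrival: float        # monotonic timestamp when the request arrived
    prompt_len: int       # number of prompt tokens to prefill
    ttft_slo: float       # TTFT budget for this request, in seconds

def build_prefill_batch(pending, prefill_rate_tok_per_s, max_batch_tokens):
    """Greedy, slack-ordered prefill batch construction (illustrative only)."""
    now = time.monotonic()
    scored = []
    for i, r in enumerate(pending):
        est_prefill = r.prompt_len / prefill_rate_tok_per_s          # crude linear estimate
        slack = (r.arrival + r.ttft_slo) - (now + est_prefill)       # headroom before SLO violation
        heapq.heappush(scored, (slack, i, r))                        # i breaks ties between equal slacks

    batch, budget = [], max_batch_tokens
    while scored and budget > 0:
        _, _, r = heapq.heappop(scored)                              # most urgent request first
        if r.prompt_len <= budget:                                   # skip requests that no longer fit
            batch.append(r)
            budget -= r.prompt_len
    return batch
```

Capping the batch by a token budget keeps the prefill step's duration predictable, which in turn bounds the queuing delay seen by subsequent requests.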
4. Algorithmic and System-Level Optimizations
Recent research details various approaches for TTFT optimization:
- Position-Independent Caching (PIC): EPIC (Hu et al., 20 Oct 2024) uses chunked KV cache modularity and selective recomputation at chunk boundaries, reducing TTFT by up to 8× via an O(kN) linking operation.
- KV Prediction: Employing a small auxiliary model to approximate base-model KV caches enables dramatic reduction of prompt processing cost, scaling roughly as $O(N \cdot C_{\mathrm{aux}})$ for large prompt length $N$, versus $O(N \cdot C_{\mathrm{base}})$ in full base-model inference, where the per-token costs satisfy $C_{\mathrm{aux}} \ll C_{\mathrm{base}}$ (Horton et al., 10 Oct 2024).
- Speculative Prefill: Utilizes a lightweight speculator for token importance estimation and selection, yielding up to 7.66× TTFT speedup without training overhead (Liu et al., 5 Feb 2025); a schematic sketch of this select-then-prefill pattern follows the list.
- Asynchronous KV Swapping: SlimInfer (Long et al., 8 Aug 2025) exploits dynamic block pruning and overlapped CUDA stream operations for concurrent cache transfer and computation, reducing both TTFT and memory footprint.
- Cache Competition Mitigation: Systems such as CacheOPT (Shen et al., 17 Mar 2025) use confidence-weighted output length estimation, demand-based cache allocation, proactive reserve strategies, and adaptive preemption to reduce tail TTFT and increase SLO-adherent throughput.
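The token-selection approaches above share a common skeleton: score prompt tokens by predicted utility, keep only the top fraction for prefill, and thereby shrink the dominant TTFT term. A minimal sketch of that skeleton follows; in the cited systems the importance scores come from a speculator model or attention statistics, while here they are simply an input array, and the function is illustrative rather than any paper's implementation.

```python
import numpy as np

def prune_prompt_for_prefill(token_ids, importance_scores, keep_ratio=0.5):
    """Keep only the highest-importance prompt tokens for the prefill pass.

    Pruning shortens the prefill sequence, the dominant TTFT term for long
    prompts, at the cost of possibly discarding useful context.
    """
    n_keep = max(1, int(len(token_ids) * keep_ratio))
    keep_idx = np.argsort(importance_scores)[-n_keep:]  # indices of the top-k scores
    keep_idx = np.sort(keep_idx)                        # restore original token order
    return [token_ids[i] for i in keep_idx]
```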
5. Impact of TTFT on User Experience and System SLOs
TTFT is a primary determinant of perceived system responsiveness, especially in interactive applications such as chatbots, IDE assistants, and streaming services. Tail TTFT latencies (e.g., p95/p99) are strongly correlated with degraded user experience and reduced system goodput. SLO-oriented schedulers explicitly prioritize requests facing imminent TTFT deadline violations, use predictive models to estimate prefill completion, and may reject unattainable requests to avoid backlog-induced delays (2505.23022, Sun et al., 17 Feb 2025).
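A minimal sketch of that admission-control idea: predict a request's TTFT from the prefill work already queued ahead of it plus its own prefill time, and reject it up front if the prediction already exceeds the SLO. The linear prefill model and parameter names are simplifying assumptions.

```python
def admit_request(prompt_len, queued_prefill_tokens, prefill_rate_tok_per_s, ttft_slo_s):
    """Decide whether a newly arrived request can still meet its TTFT SLO.

    Predicted TTFT = time to drain the prefill work already queued ahead of
    this request, plus this request's own prefill time. Rejecting hopeless
    requests early avoids adding to the backlog that would otherwise inflate
    the tail TTFT of everything behind them.
    """
    queue_delay = queued_prefill_tokens / prefill_rate_tok_per_s
    own_prefill = prompt_len / prefill_rate_tok_per_s
    predicted_ttft = queue_delay + own_prefill
    return predicted_ttft <= ttft_slo_s
```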
In collaborative device-server settings such as DiSCo (Sun et al., 17 Feb 2025), TTFT is managed via cost-aware global policies and dynamic migration between endpoints, using real-world measurements to decide between device-side and server-side execution. TTFT reductions of 6% to 78% are reported, together with cost savings and QoE improvements.
6. Limitations and Nuanced Interpretation
TTFT is, by construction, insensitive to later decode stalls and to variability in streaming consistency. Fixed TTFT SLOs are often impractical for long prompts. TTFT also depends strongly on prompt length, architectural internals (MoE routing), and system configuration (chunk sizes, accelerator bandwidth, cache partitioning). It should therefore be interpreted within its operational context, ideally accompanied by complementary metrics such as fluidity-index, TPOT, and empirical p50/p95 percentiles.
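For the percentile-based reporting recommended above, a small helper such as the following (illustrative, using NumPy) summarizes a set of measured TTFT samples:

```python
import numpy as np

def ttft_percentiles(ttft_samples_s):
    """Summarize a distribution of measured TTFT samples (seconds).

    Reporting p50 alongside p95/p99 exposes tail behaviour that a single
    mean would hide, which is what makes percentile reporting useful for
    interpreting TTFT in its operational context.
    """
    samples = np.asarray(ttft_samples_s, dtype=float)
    return {f"p{q}": float(np.percentile(samples, q)) for q in (50, 95, 99)}
```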
Findings from the deployment-centric paper of GPT-OSS-20B (Kumar et al., 22 Aug 2025) highlight the trade-off between initial TTFT and sustained throughput: higher TTFT is tolerated when overall system efficiency (as measured by tokens per second and energy per token) compensates for initial delay. Thus, TTFT evaluation is most informative when used in conjunction with system-level efficiency, streaming consistency, and energy metrics.
7. Future Directions and Open Questions
Emerging research seeks further TTFT reduction via:
- Finer-grained token selection and compression methods (Horton et al., 10 Oct 2024, Jo et al., 3 Feb 2025).
- Multi-endpoint collaboration and cost-aware adaptive routing (Sun et al., 17 Feb 2025).
- Unified aggregation–disaggregation architectures (TaiChi) that allow real-time adjustment of prefill/decode priorities to optimize goodput under heterogeneous SLO regimes.
- Dynamic reconfiguration of model internals (e.g., active vs. inactive parameter sets in MoE) and integration with upcoming NPU-level scheduling (Chen et al., 1 Aug 2025).
A persistent challenge is the robust prediction of TTFT for diverse prompt lengths and models under real-world concurrent loads. Ongoing work combines analytic modeling, neural predictors, and adaptive batch construction, aiming for predictable, low-latency TTFT and user-perceived QoE.
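As a toy example of the analytic-modeling direction, one can fit a quadratic model of TTFT versus prompt length to a handful of measurements and use it for extrapolation. The sketch below does exactly that; it is an assumption-laden illustration, not the predictor of any cited system.

```python
import numpy as np

def fit_ttft_model(prompt_lengths, measured_ttfts):
    """Fit a toy analytic TTFT model: ttft(L) ≈ c2*L^2 + c1*L + c0.

    The quadratic term reflects self-attention's prefill cost growth with
    prompt length L; the constant term absorbs fixed scheduling overhead
    under light load.
    """
    coeffs = np.polyfit(prompt_lengths, measured_ttfts, deg=2)
    return np.poly1d(coeffs)  # callable: predicted TTFT in seconds for a given L

# Example usage (made-up measurements):
#   model = fit_ttft_model([512, 1024, 2048, 4096], [0.08, 0.15, 0.31, 0.70])
#   model(8192)  # extrapolated TTFT estimate for an 8K-token prompt
```

Real systems must additionally account for concurrency-dependent queuing, which is where neural predictors and adaptive batch construction come in.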
True Time-to-First-Token, as an LLM inference metric, remains vital for assessing system responsiveness but should be understood as part of a broader constellation of latency, throughput, and user-experience measurements. Its computation, optimization, and limitations are active areas of research, with techniques spanning token-level algorithms, resource scheduling, cache design, and integrated service-layer adaptation.