Time to First Token (TTFT) in LLM Inference
- TTFT is a performance metric that quantifies the latency from the generation request to the first output token, incorporating both scheduling and prompt processing delays.
- It reflects the combined impact of system load, prompt length, and hardware constraints on LLM inference, highlighting key challenges in scaling for real-time applications.
- Optimization strategies such as dynamic token pruning and auxiliary cache prediction reduce TTFT, while fine-grained evaluation tools such as the fluidity-index capture overall service responsiveness beyond the first token.
Time to First Token (TTFT) is a critical performance metric in LLM inference, defined as the latency from the arrival of a generation request to the issuance of the first output token. TTFT encompasses both prompt scheduling delay and the computational cost of processing the prompt (often termed the prefill stage), and is closely scrutinized in both academic research and real-world deployments due to its direct impact on perceived responsiveness in interactive applications such as chat and live translation. However, recent work has revealed that TTFT, though widely reported, does not fully capture the real-time quality of service (QoS) experienced by end users unless paired with fine-grained, temporally aware analysis of token streaming and broader systems considerations.
1. Definition, Components, and Limitations of TTFT
TTFT is the total latency observed from the moment a user’s request is submitted until the LLM generates its first output token. This spans two distinct intervals:
- Scheduling delay: Time spent by the system before prompt processing begins, including resource allocation and queuing.
- Prompt processing time: The computational cost of executing the full forward pass on the prompt (prefill), typically scaling quadratically with prompt length.
Mathematically, for a given query, $\text{TTFT} = T_{\text{sched}} + T_{\text{prefill}}$, where $T_{\text{prefill}}$ depends on both the model architecture and prompt length.
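As a concrete client-side illustration, the minimal sketch below (written against a hypothetical `token_stream` iterable that yields output tokens as they arrive, not any specific serving API) shows how TTFT and the subsequent inter-token gaps would be measured:

```python
import time
from typing import Iterable, List, Optional, Tuple

def measure_ttft(token_stream: Iterable[str]) -> Tuple[Optional[float], List[float]]:
    """Measure TTFT and inter-token gaps for a streaming generation request.

    `token_stream` is any iterable that yields output tokens as they arrive
    (e.g., a thin wrapper around a streaming inference client); it is a
    placeholder for this sketch, not a specific library call.
    """
    t_request = time.perf_counter()    # request submission time
    ttft = None
    gaps: List[float] = []
    t_prev = t_request

    for _ in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - t_request     # scheduling delay + prefill time
        else:
            gaps.append(now - t_prev)  # time-between-tokens (TBT) samples
        t_prev = now

    return ttft, gaps
```

Measured this way, TTFT folds scheduling delay and prefill together; separating the two requires server-side instrumentation.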
A key limitation is that TTFT, evaluated as a static scalar, conflates the two sources of latency and fails to distinguish whether system load (e.g., GPU memory contention, scheduling policy) or inherent model cost dominates. For instance, a normalized TTFT (divided naively by prompt length) fails to separate scheduling latency from computational scaling. As such, using a static TTFT target for Quality of Service (QoS) is only practical for tightly bounded prompt lengths or homogeneous workloads. This motivates the development of more nuanced performance evaluation approaches (Agrawal et al., 9 Jul 2024).
2. TTFT in the Broader Metric Ecosystem
TTFT is one among several latency metrics:
| Metric | Definition | Sensitivity to Prompt Length / System Load |
|---|---|---|
| TTFT | Time from request arrival to first output token | High (both prompt length and system factors) |
| TBT | Time-Between-Tokens: latency between successive output tokens | Low to moderate (primarily system factors, less prompt size) |
| TPOT | Time-Per-Output-Token: aggregate per-token generation latency | Primarily system architecture and contention |
| Fluidity-index | Fraction of tokens that meet pre-designated generation deadlines | Highly granular, user-experience-aligned |
Evaluation frameworks such as Metron (Etalon) caution against relying solely on TTFT or any other single latency metric to assess user experience (Agrawal et al., 9 Jul 2024). While TTFT captures initial responsiveness, it gives no insight into consistency (jitter, stalls) during subsequent streaming; conversely, metrics such as the fluidity-index provide a deadline-based analysis (see §4) that better reflects smooth user-facing streaming performance.
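To make the distinctions in the table concrete, the following sketch (assuming per-token arrival times have already been collected, e.g. as in the measurement sketch above) derives TTFT, mean TBT, and a TPOT figure from raw timestamps; exact TPOT conventions vary across serving stacks, so the formula here is one common choice rather than a standard definition:

```python
from statistics import mean
from typing import Dict, List

def latency_metrics(t_request: float, token_times: List[float]) -> Dict[str, float]:
    """Derive TTFT, mean TBT, and TPOT from absolute token arrival times.

    t_request   : wall-clock time the request was submitted
    token_times : wall-clock arrival time of each output token, in order
                  (assumed non-empty for this sketch)
    """
    ttft = token_times[0] - t_request
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tbt_mean = mean(gaps) if gaps else 0.0
    # TPOT is often reported as decode time per token, i.e. excluding prefill;
    # conventions differ across serving frameworks.
    tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    return {"ttft": ttft, "tbt_mean": tbt_mean, "tpot": tpot}
```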
3. System Design Strategies for TTFT Optimization
Numerous system-level approaches target TTFT minimization by reducing either scheduling or prefill computational delays:
- Dynamic token pruning (e.g., LazyLLM (Fu et al., 19 Jul 2024), FastKV (Jo et al., 3 Feb 2025), SlimInfer (Long et al., 8 Aug 2025), SpecPrefill (Liu et al., 5 Feb 2025)) reduces prefill compute by aggressively dropping context tokens deemed unimportant for next-token prediction (see the sketches following this list). LazyLLM, for instance, computes attention-driven importance scores at each transformer layer,
$$s_i^{(l)} = \frac{1}{H}\sum_{h=1}^{H} A^{(l)}_{h,\,i},$$
where $A^{(l)}_{h,\,i}$ is the attention that head $h$ in layer $l$ assigns to prompt token $i$ from the final query position. Tokens below a dynamic top-$k$ percentile threshold are pruned, yielding 2.34× TTFT speedup with negligible accuracy loss (Fu et al., 19 Jul 2024).
- Auxiliary cache prediction (KV Prediction (Horton et al., 10 Oct 2024)) replaces full-prompt processing with a small auxiliary transformer that quickly approximates the base model's KV cache for the prompt. Each base model layer's cache is projected from a corresponding auxiliary layer,
$$\hat{K}^{(l)}_{\text{base}} = W_K^{(l)}\,K^{(m(l))}_{\text{aux}}, \qquad \hat{V}^{(l)}_{\text{base}} = W_V^{(l)}\,V^{(m(l))}_{\text{aux}},$$
where $m(l)$ maps base layer $l$ to an auxiliary layer and $W_K^{(l)}, W_V^{(l)}$ are learned projections (see also the projection sketch after this list). This method achieves FLOP- and hardware-level TTFT reductions (e.g., batch size 8: 5.59 s → 3.17 s) without full-model execution during prefill.
- Cache management (LayerKV (Xiong et al., 1 Oct 2024), CacheOPT (Shen et al., 17 Mar 2025)) addresses memory contention by partitioning, offloading, and reusing KV caches at a finer granularity (layer- or length-aware). LayerKV introduces proactive offloading of non-critical layers to CPU memory, releasing GPU resources for new prefill demands, yielding up to 69× average TTFT reduction.
- Serving architecture tuning (TaiChi (Wang et al., 4 Aug 2025), layered prefill (Lee et al., 9 Oct 2025), TokenFlow (Chen et al., 3 Oct 2025), SCORPIO (2505.23022)) focuses on dynamic resource allocation and scheduling, for instance by aggregating prefill and decode on "P-heavy" instances to minimize TTFT, or by partitioning layers vertically (rather than splitting tokens) to cut redundant MoE expert weight loading, yielding up to 70% TTFT reduction (Lee et al., 9 Oct 2025).
- Device-server cooperation (DiSCo (Sun et al., 17 Feb 2025)) adaptively routes requests and migrates generation between endpoints to balance the predictable, linear scaling of on-device TTFT against the unpredictable TTFT distribution on the server side.
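To make the pruning pattern concrete, the sketch below illustrates attention-score-based token selection during prefill in the spirit of LazyLLM's layer-wise importance scores; it is a simplified illustration with assumed tensor shapes and an assumed `keep_ratio` hyperparameter, not the authors' implementation:

```python
import torch

def prune_tokens_by_attention(hidden: torch.Tensor,
                              attn_probs: torch.Tensor,
                              keep_ratio: float = 0.5):
    """Keep only the most important prompt tokens for subsequent layers.

    hidden     : (batch, seq_len, d_model) hidden states entering the next layer
    attn_probs : (batch, n_heads, seq_len, seq_len) attention probabilities
                 produced by the current layer's softmax
    keep_ratio : fraction of prompt tokens to retain (assumed hyperparameter)
    """
    # Importance of each key token = attention it receives from the final
    # query position, averaged over heads (LazyLLM-style scoring).
    importance = attn_probs[:, :, -1, :].mean(dim=1)            # (batch, seq_len)

    k = max(1, int(keep_ratio * hidden.shape[1]))
    keep_idx = importance.topk(k, dim=-1).indices.sort(dim=-1).values

    # Gather the retained tokens; later layers process only this subset,
    # which shrinks the quadratic prefill cost.
    batch_idx = torch.arange(hidden.shape[0]).unsqueeze(-1)
    return hidden[batch_idx, keep_idx], keep_idx
```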
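Similarly, the auxiliary-cache idea can be sketched as a per-layer learned map from an auxiliary model's KV cache to the base model's; the module below is schematic, with assumed dimensions and a simple linear projection standing in for whatever form the released KV Prediction predictor actually takes:

```python
import torch
import torch.nn as nn

class KVPredictor(nn.Module):
    """Predict one base-model layer's KV cache from an auxiliary model's cache."""

    def __init__(self, d_aux: int, d_base: int):
        super().__init__()
        # One learned projection per cache component (assumed linear here).
        self.proj_k = nn.Linear(d_aux, d_base, bias=False)
        self.proj_v = nn.Linear(d_aux, d_base, bias=False)

    def forward(self, k_aux: torch.Tensor, v_aux: torch.Tensor):
        # k_aux, v_aux: (batch, seq_len, d_aux) cache of the mapped auxiliary layer.
        return self.proj_k(k_aux), self.proj_v(v_aux)

# Usage idea: run the cheap auxiliary model over the prompt once, then predict
# the base model's cache for every layer instead of running the full prefill.
```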
4. Advanced Temporal Analysis: Fluidity-Index and Beyond
To address the limitations of TTFT alone, the fluidity-index evaluates token streaming more holistically. The deadline for the $i$-th output token is set as
$$d_i = \tau_{\text{prefill}} + i\,\tau_{\text{decode}},$$
where $\tau_{\text{prefill}}$ is the prefill TTFT target and $\tau_{\text{decode}}$ the per-token target (TBT). The index reports the fraction of tokens that meet these rolling deadlines; deadline misses "reset" subsequent token deadlines, accurately modeling user-perceived stalls (Agrawal et al., 9 Jul 2024).
If $t_i$ is the generation time of token $i$, and $s$ denotes the slack accumulated from early arrivals, deadline fulfillment is checked as:
- If $t_i \le \tau_{\text{decode}} + s$: the token is on time; update the slack, $s \leftarrow s + \tau_{\text{decode}} - t_i$.
- Else: compute the number of misses as $\left\lceil (t_i - s)/\tau_{\text{decode}} \right\rceil$ and reset $s$ to zero.
This metric, especially when used in conjunction with TTFT, enables frameworks such as Metron/Etalon to define a fluid token generation rate for service-level objectives, ensuring both quick system startup and a consistently smooth user experience that remains robust to short bursts of underlying jitter.
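A minimal sketch of the deadline-with-slack bookkeeping described above is given below; it follows the reconstruction in this section rather than the Metron/Etalon reference implementation, and the per-token target is an assumed SLO value:

```python
from typing import List

def fluidity_index(token_gaps: List[float], tbt_target: float) -> float:
    """Fraction of decode tokens that meet their rolling deadlines.

    token_gaps : observed time between consecutive output tokens (seconds)
    tbt_target : per-token deadline budget (seconds), e.g. an assumed 0.05 s SLO
    """
    slack = 0.0
    met = 0
    for gap in token_gaps:
        if gap <= tbt_target + slack:
            # On time: bank the unused budget as slack for later tokens.
            slack += tbt_target - gap
            met += 1
        else:
            # Deadline miss: subsequent deadlines are reset, so drop the slack.
            slack = 0.0
    return met / len(token_gaps) if token_gaps else 1.0
```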
5. Impact of Hardware, Model, and Serving System Choices
TTFT is highly sensitive to hardware allocation, GPU/CPU memory contention, and serving stack design:
- Prompt length growth: TTFT scales quadratically with prompt length in vanilla transformers, magnifying its impact for retrieval-augmented or few-shot prompting settings.
- Layerwise offloading (LayerKV, CacheOPT) mitigates queuing delays by asynchronously offloading KV data, allowing new prefill requests to commence with minimal wait times (Xiong et al., 1 Oct 2024, Shen et al., 17 Mar 2025).
- Scheduling policy: Techniques such as least-deadline-first (SCORPIO) or buffer-aware, preemptive scheduling (TokenFlow) directly reorder or suspend request processing to ensure time-critical first tokens are delivered preferentially (2505.23022, Chen et al., 3 Oct 2025); a generic deadline-first sketch follows this list.
- Disaggregation/aggregation hybridization (TaiChi) allows systems to adaptively assign prefill tasks to instances optimized for TTFT under evolving workload SLOs (Wang et al., 4 Aug 2025).
- KV cache competition management: Accurate output length prediction (using confidence-based adaptive padding derived from Hoeffding's inequality) improves cache allocation and minimizes preemption costs, resulting in up to 2.83× lower tail TTFT (Shen et al., 17 Mar 2025); a schematic sketch of such padding also follows this list.
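As an illustration of deadline-driven admission, the following sketch implements a textbook least-deadline-first queue; it conveys the general idea behind deadline-aware schedulers such as SCORPIO and is not that system's actual scheduler:

```python
import heapq
import time
from dataclasses import dataclass, field
from typing import Any

@dataclass(order=True)
class Request:
    deadline: float                      # absolute TTFT deadline; earliest served first
    payload: Any = field(compare=False)  # prompt, sampling params, etc.

class DeadlineFirstQueue:
    """Admit prefill work in least-deadline-first order."""

    def __init__(self):
        self._heap = []

    def submit(self, payload: Any, ttft_slo: float) -> None:
        # Deadline = arrival time + per-request TTFT SLO.
        heapq.heappush(self._heap, Request(time.monotonic() + ttft_slo, payload))

    def next_request(self) -> Any:
        # The request whose first token is due soonest runs its prefill next.
        return heapq.heappop(self._heap).payload
```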
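The confidence-based padding idea can likewise be sketched with a Hoeffding-style bound that inflates a predicted output length so the true length is covered with high probability; the predictor interface, error history, and `delta` parameter here are assumptions for the example, not CacheOPT's estimator:

```python
import math
from typing import List

def padded_length_estimate(predicted_len: int,
                           recent_errors: List[float],
                           max_output_len: int,
                           delta: float = 0.05) -> int:
    """Pad a predicted output length with a Hoeffding-style confidence margin.

    recent_errors  : observed (actual - predicted) length errors on recent requests
    max_output_len : hard cap on generation length, used as the error range bound
    delta          : allowed probability that the padded estimate is still too small
    """
    n = max(len(recent_errors), 1)
    mean_err = sum(recent_errors) / n
    # Hoeffding bound on the deviation of the empirical mean error for
    # range-bounded errors: P(true mean > empirical mean + margin) <= delta.
    margin = max_output_len * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    padding = max(0.0, mean_err + margin)
    return min(max_output_len, predicted_len + math.ceil(padding))
```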
6. Role of Attention Sinks and Model Inductive Biases
Recent theoretical and empirical work has uncovered that LLMs often concentrate a major fraction of attention on the first token (the “attention sink”), which functions as a control mechanism to prevent over-mixing of representations (rank and representational collapse) in deep or long-context transformers (Barbero et al., 3 Apr 2025). This behavior stabilizes information propagation through the network, especially for long contexts. While not directly altering TTFT, this learned bias facilitates robust first token handling by restricting perturbation propagation—a foundation that many cache-saving and pruning techniques build upon.
7. Broader Implications and Future Directions
TTFT is indispensable for real-time LLM-powered applications, but as prompt lengths grow and model/user concurrency increases, it is only one dimension of a multi-faceted performance landscape. Future research will likely revolve around:
- Joint TTFT–TBT–throughput optimization (e.g., leveraging deadline-aware metrics, fluidity-index, and SLO-driven scheduling).
- Further development of device-edge-server cooperative serving schemes to blend predictable on-device TTFT scaling with volatile server batching effects (Sun et al., 17 Feb 2025).
- Dynamic, context-aware token selection and cache management, possibly via hybrid or learned schemes that adjust propagation length and cache granularity on the fly (Jo et al., 3 Feb 2025, Long et al., 8 Aug 2025).
- Universally applicable, fine-grained performance metrics (e.g., fluidity-index) that align more closely with ultimate user experience in diverse network, device, and workload scenarios.
In conclusion, TTFT remains the canonical measure for initial model responsiveness but must be contextualized within a holistic temporal framework (such as fluidity-index) and understood as a system-level outcome contingent on architectural, scheduling, and workload factors. Contemporary and emerging LLM inference frameworks increasingly integrate TTFT-minimization into broader multi-objective optimization, reflecting its central role in high-throughput, low-latency model deployment.