
Time-to-First-Token (TTFT) Overview

Updated 7 April 2026
  • TTFT is a latency metric that measures the elapsed time from user prompt arrival to the emission of the first token, incorporating queue, prefill, and decode stages.
  • It highlights bottlenecks in transformer prefill, token processing, and scheduling that directly impact the user-perceived responsiveness of LLM applications.
  • Recent advancements such as dynamic token pruning, KV cache prediction, and advanced scheduling methodologies significantly reduce TTFT while maintaining model accuracy.

Time-to-First-Token (TTFT) is a key latency metric in LLM inference pipelines, quantifying the wall-clock duration from the instant a user prompt arrives at the model server to when the first output token is generated and returned downstream. TTFT subsumes all computation and scheduling within the prefill (prompt-processing) phase and is the principal determinant of perceived responsiveness in interactive LLM applications. Recent research targets both precise characterization and aggressive reduction of TTFT across various architectures, deployments, and optimization strategies.

1. Definition, Mathematical Formalism, and Variants

TTFT is defined formally as

$$\mathrm{TTFT} = t_\mathrm{first\_token} - t_\mathrm{request\_start},$$

where $t_\mathrm{request\_start}$ is the timestamp at request arrival (on server or gateway) and $t_\mathrm{first\_token}$ is the moment the first output token is observed by client or system instrumentation. In LLM systems, TTFT typically decomposes as

$$\mathrm{TTFT} \approx T_\mathrm{queue} + T_\mathrm{prefill} + T_\mathrm{decode}(1),$$

where $T_\mathrm{queue}$ is prefill-side waiting, $T_\mathrm{prefill}$ is the time to process all prompt tokens through the full transformer stack (building the KV cache), and $T_\mathrm{decode}(1)$ is the first autoregressive generation step, which is an order of magnitude smaller than $T_\mathrm{prefill}$ except for extremely short prompts (Liu et al., 5 Feb 2025, Agrawal et al., 2024, Chen et al., 3 Oct 2025).
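As a minimal sketch of this decomposition, assume a server records four timestamps per request on a single monotonic clock. The field names below are hypothetical and are not taken from any of the cited systems.

```python
from dataclasses import dataclass


@dataclass
class RequestTimestamps:
    """Hypothetical per-request timestamps, in seconds on one monotonic clock."""
    request_start: float  # request arrives at server or gateway
    prefill_start: float  # request leaves the queue and prefill begins
    prefill_end: float    # KV cache has been built for the whole prompt
    first_token: float    # first output token is observed


def decompose_ttft(ts: RequestTimestamps) -> dict:
    """Split TTFT into queue, prefill, and first-decode-step components."""
    parts = {
        "queue": ts.prefill_start - ts.request_start,
        "prefill": ts.prefill_end - ts.prefill_start,
        "decode_1": ts.first_token - ts.prefill_end,
    }
    parts["ttft"] = ts.first_token - ts.request_start
    return parts


# Illustrative numbers only: 20 ms queueing, 300 ms prefill, 15 ms first decode step.
example = decompose_ttft(RequestTimestamps(0.000, 0.020, 0.320, 0.335))
```

With timestamps taken on one clock the three components sum to TTFT exactly; the approximation sign in the formula covers overheads that such instrumentation does not attribute to any stage.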

Extensions to TTFT in multimodal models (video, image, or multi-stage networks) incorporate modality-specific pre-processing latencies, e.g.,

$$\mathrm{TTFT}(I) = T_\mathrm{image\_enc}(N) + T_\mathrm{prefill}(N),$$

for MLLMs (multimodal LLMs) with vision transformer encoders and token count $N$ (Sun et al., 26 Nov 2025, Shao et al., 27 May 2025).

In PD-disaggregated (prefill/decode-separated) serving, TTFT is the sum of prefill, network KV transfer, and decode startup contributions (Lai et al., 3 Dec 2025).

2. Bottlenecks in TTFT: Structural and Systemic Sources

The dominant TTFT cost is prompt processing (prefill). For sequence length $N$, transformer prefill scales as $O(N^2)$ in self-attention and linearly in MLP (per layer, batch), with the entire prompt required before the first token can be emitted. This bottleneck is exacerbated for long-context LLMs, multi-modal systems with native-resolution visual tokens, and batch serving workloads (Agrawal et al., 2024, Liu et al., 5 Feb 2025, Sun et al., 26 Nov 2025, Fu et al., 2024, Long et al., 8 Aug 2025).
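The scaling behaviour can be made concrete with a rough FLOP count. The sketch below assumes a vanilla dense transformer layer (standard multi-head attention, a two-matrix MLP with hidden size `mlp_ratio * d_model`, one multiply-add counted as 2 FLOPs). The function name and the example sizes are illustrative; models using grouped-query attention, gated MLPs, or MoE layers will have different constants.

```python
def prefill_flops_per_layer(n_tokens: int, d_model: int, mlp_ratio: int = 4) -> dict:
    """Approximate forward FLOPs for one dense transformer layer during prefill."""
    # Q, K, V, and output projections: four (d x d) matmuls applied to n tokens.
    proj = 4 * 2 * n_tokens * d_model * d_model
    # QK^T scores and the attention-weighted sum over V: two (n x n x d) matmuls.
    # Causal masking can roughly halve this; the quadratic growth is unchanged.
    attn = 2 * 2 * n_tokens * n_tokens * d_model
    # MLP up-projection and down-projection.
    mlp = 2 * 2 * n_tokens * d_model * (mlp_ratio * d_model)
    return {"proj": proj, "attn": attn, "mlp": mlp, "total": proj + attn + mlp}


short = prefill_flops_per_layer(1_000, 4_096)
long = prefill_flops_per_layer(32_000, 4_096)

attn_growth = long["attn"] / short["attn"]  # 32x more tokens -> 1024x attention FLOPs
mlp_growth = long["mlp"] / short["mlp"]     # 32x more tokens -> 32x MLP FLOPs
```

Under these assumptions the attention term overtakes the linear terms only once the prompt is several times longer than `d_model`, which is one reason long-context workloads are where prefill cost becomes most visible.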

Key sources driving up TTFT include:

3. Algorithmic and Architectural Optimizations for TTFT Reduction

Recent research has produced a diverse array of TTFT reduction strategies, driven by architectural redesigns, proxy networks, scheduling reformulations, and token-level compression techniques. Core techniques are enumerated below.

Token Pruning, Merging, and Attention-Guided Prefill Abbreviation

Proxy/Predictor Models for KV Cache Bypass

  • KV Prediction: Accurate approximation of the full base model’s KV cache using a smaller auxiliary model, followed by direct decode using the predicted KV, achieves 2.2× wall-clock TTFT speedup and up to 4× reduction in prefill FLOPs (Horton et al., 2024).
  • Cascade/speculator designs: Small models or off-the-shelf LLMs select token subsets (SpecPrefill) or generate pseudo-KV caches to shortcut the base model prefill computation (Liu et al., 5 Feb 2025, Horton et al., 2024).
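To illustrate the general idea of importance-guided token selection, the sketch below keeps the highest-scoring fraction of prompt tokens and preserves their original order. It assumes per-token importance scores are already available (for example, from a small speculator model); the function name, the `keep_rate` parameter, and the scores are hypothetical, and this is not the SpecPrefill or LazyLLM algorithm.

```python
def select_tokens(importance: list[float], keep_rate: float) -> list[int]:
    """Return indices of the tokens to keep, in original prompt order."""
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    n_keep = max(1, round(len(importance) * keep_rate))
    # Rank positions by score, highest first; ties go to the earlier position.
    ranked = sorted(range(len(importance)), key=lambda i: (-importance[i], i))
    # Restore prompt order so the kept tokens remain a subsequence of the prompt.
    return sorted(ranked[:n_keep])


scores = [0.9, 0.1, 0.4, 0.05, 0.7, 0.3]  # made-up importance scores
kept = select_tokens(scores, keep_rate=0.5)  # -> [0, 2, 4]
```

The base model would then run prefill only on the kept positions, so prefill cost shrinks with `keep_rate` while the quality impact depends on how well the scores reflect what the task needs.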

Advanced Scheduling and Resource Management

  • Preemptive, buffer-aware, and staggered scheduling: TokenFlow (Chen et al., 3 Oct 2025), Staggered Batch Scheduling (SBS) (Tian et al., 18 Dec 2025), Multi-stage Flow Scheduling (MFS) (Sun et al., 18 Mar 2026), and others avoid head-of-line blocking and enable fine-grained resource allocation. They dynamically reorder workloads to eliminate queuing bubbles and synchronize microbatches on device. TokenFlow and SBS achieve up to 80.2% and 40% TTFT reduction, respectively (Chen et al., 3 Oct 2025, Tian et al., 18 Dec 2025).
  • Fair batching and slack-tracking: FairBatching (Lyu et al., 16 Oct 2025) introduces envelope-line SLO tracking, adaptive time-budgets, and dynamic reprioritization, reducing TTFT p99 by 2.29× versus Sarathi-style stall-free decoders, while preserving per-token throughput.
  • Deadline- and SLO-aware request admission: SCORPIO (2505.23022) introduces deadline-based queueing, LDF reordering, and predictive prefill-time modeling, increasing SLO attainment by ~46.5% for TTFT under burst.
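A deadline-aware admission policy can be sketched with a priority queue. The example below uses plain earliest-deadline-first ordering and a hypothetical linear prefill-time predictor on a single worker; it is a generic illustration under those assumptions, not the SCORPIO, FairBatching, or TokenFlow policy, and every name and constant in it is invented for the example.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Pending:
    deadline: float                            # absolute TTFT deadline, seconds
    request_id: str = field(compare=False)
    prompt_tokens: int = field(compare=False)


def predicted_prefill_s(prompt_tokens: int, per_token_s: float = 1e-4) -> float:
    """Hypothetical linear prefill-time predictor; would be calibrated per deployment."""
    return prompt_tokens * per_token_s


def schedule(requests: list, now: float = 0.0):
    """Serve earliest deadline first; reject requests predicted to miss their deadline."""
    heap = list(requests)
    heapq.heapify(heap)
    admitted, rejected = [], []
    while heap:
        req = heapq.heappop(heap)
        finish = now + predicted_prefill_s(req.prompt_tokens)
        if finish <= req.deadline:
            admitted.append(req.request_id)
            now = finish  # the single worker is busy until this prefill completes
        else:
            rejected.append(req.request_id)
    return admitted, rejected


admitted, rejected = schedule([
    Pending(0.50, "a", 2_000),
    Pending(0.25, "b", 1_000),
    Pending(0.35, "c", 3_000),
])
```

Here request `c` is rejected because its predicted prefill would finish after its deadline once `b` has run, which frees capacity for `a` to meet its own deadline.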

System/Cluster Level and Hardware-Aware Techniques

  • Hybrid aggregation/disaggregation and instance type adaptation: TaiChi (Wang et al., 4 Aug 2025) and TokenScale (Lai et al., 3 Dec 2025) jointly tune "P-heavy" and "D-heavy" instance ratios and prefill chunk sizes, and deploy convertible decoders, keeping TTFT within SLO across bursty production loads. TokenScale achieves mean TTFT ≈ 50 ms and p99 ≈ 600 ms on standard benchmarks.
  • Geo-distributed, latency-optimized load balancing: GORGO (Toniolo et al., 12 Feb 2026) uses additive cost models incorporating RTT, cache reuse, GPU queue depths, and prefix similarity to select optimal regions, reducing median TTFT by 2.5× and P99 by up to two orders of magnitude.
  • KV cache prediction, allocation, and migration: MC² (Shen et al., 17 Mar 2025), DiSCo (Sun et al., 17 Feb 2025), and other KV-centric optimizations proactively reserve and migrate KV allocations or dynamically schedule between device/server for TTFT reduction under tight memory and compute constraints.
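The additive cost idea behind latency-aware routing can be sketched as follows. The example estimates TTFT per region as network round-trip time plus queue wait plus prefill time for the uncached part of the prompt, then picks the minimum. The field names, the linear treatment of cache reuse, and all numbers are assumptions made for illustration; this is not GORGO's actual cost model.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    rtt_s: float               # client-to-region round-trip time
    queue_wait_s: float        # wait estimated from GPU queue depth
    cached_prefix_frac: float  # share of the prompt already in this region's KV cache


def estimated_ttft_s(region: Region, full_prefill_s: float) -> float:
    """Additive estimate: network + queueing + prefill for the uncached prompt share."""
    uncached = 1.0 - region.cached_prefix_frac
    return region.rtt_s + region.queue_wait_s + full_prefill_s * uncached


def pick_region(regions, full_prefill_s: float) -> Region:
    """Choose the region with the lowest estimated TTFT."""
    return min(regions, key=lambda r: estimated_ttft_s(r, full_prefill_s))


regions = [
    Region("near", rtt_s=0.02, queue_wait_s=0.30, cached_prefix_frac=0.0),
    Region("far", rtt_s=0.15, queue_wait_s=0.05, cached_prefix_frac=0.8),
]
best = pick_region(regions, full_prefill_s=0.5)
```

In this made-up case the more distant region wins because a warm prefix cache and a short queue outweigh the extra round-trip time.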

4. Measurement Protocols, Metrics, and Pitfalls

TTFT is measured end-to-end from external request arrival to first-token emission, typically at the client or per-inference pipeline step. Black-box measurement uses high-resolution wall-clock timers; system-level profiling decomposes TTFT into queue, prefill, transfer, and decode startup stages (Agrawal et al., 2024, Lai et al., 3 Dec 2025, Wang et al., 4 Aug 2025).
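A black-box measurement of this kind can be sketched with a wall-clock timer around any streaming token iterator. In the example below, `fake_stream` is a hypothetical stand-in for a real streaming client, and the timer starts just before the stream is first pulled, which stands in for request submission.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[Optional[float], list]:
    """Consume a token stream and return (TTFT in seconds, all tokens).

    TTFT is None if the stream yields no tokens.
    """
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in stream:
        if ttft is None:
            # First token observed: record elapsed wall-clock time once.
            ttft = time.perf_counter() - start
        tokens.append(tok)
    return ttft, tokens


def fake_stream(prefill_s: float = 0.05, n_tokens: int = 3) -> Iterator[str]:
    """Hypothetical streaming client; sleeps to mimic queueing plus prefill."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        yield f"tok{i}"


ttft, tokens = measure_ttft(fake_stream())
```

Measured this way at the client, TTFT also includes network and serialization delays, so it will generally exceed a server-side figure for the same request.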

Aggregation

Limitations and Caveats

  • TTFT is highly sensitive to prompt length, because attention cost during prefill grows quadratically with it; SLOs must therefore be reported as a function of prompt length (Agrawal et al., 2024).
  • System configuration (batching, scheduling, resource type) confounds comparison; a system may optimize TTFT at the expense of Time-Per-Output-Token (TPOT) or throughput (Wang et al., 4 Aug 2025, 2505.23022, Tian et al., 18 Dec 2025).
  • TTFT does not reflect streaming smoothness or inter-token fluidity, which motivates new metrics such as fluidity-index (Agrawal et al., 2024).
  • TTFT alone is insufficient for user experience evaluation, as stalls after first token are invisible to this metric.

5. Empirical Outcomes and Theoretical Guarantees

Recent studies report consistent and often large empirical TTFT reductions across diverse LLM and MLLM settings:

| Approach | Model/Context | TTFT Speedup (×) | Accuracy Retention | Reference |
|---|---|---|---|---|
| SpecPrefill | Llama-3.1 405B, LongBench | 7.66 | >95% | (Liu et al., 5 Feb 2025) |
| LazyLLM | Llama 2 7B, LongBench | 1.3–2.9 | >99% | (Fu et al., 2024) |
| SlimInfer | LLaMA3.1-8B-Inst, 32k tok | 2.53 | <0.5% loss | (Long et al., 8 Aug 2025) |
| KV prediction (KVP) | OpenELM 1.1B, CPU | 2.2 | +15–50% rel. | (Horton et al., 2024) |
| ViT-UHD/PVC (MLLM) | LLaVA-UHD v3 | 1.9–2.4 | ≈ baseline | (Sun et al., 26 Nov 2025) |
| HoliTom (video LLM) | LLaVA-OneVision-7B | 2.28 | 99.1% | (Shao et al., 27 May 2025) |
| TokenFlow (sched) | Llama3-8B, Qwen2.5-32B | 1.75–6.5 (P99) | Unchanged | (Chen et al., 3 Oct 2025) |
| SBS (batch sched) | Deepseek-V3, H800 | 1.3–1.4 | >99% | (Tian et al., 18 Dec 2025) |
| MC² (KV allocation) | OPT-13B/175B/LLaMA2 | 2.83–3.29 | SLO↑, throughput↑ | (Shen et al., 17 Mar 2025) |
| GORGO (geo load-balance) | Llama3.1-8B, 3 regions | 2.5 (median P99) | – | (Toniolo et al., 12 Feb 2026) |
| SCORPIO (SLO guard) | Llama3.1-8B, ShareGPT | +46.5% SLO | – | (2505.23022) |
| Layered Prefill (MoE/LLM) | Qwen3-30B, H100 | up to 1.7 | full-throughput | (Lee et al., 9 Oct 2025) |

The table cites the largest or most representative speedup numbers reported; most studies provide breakdowns by prompt length or context size. Task-level breakdowns show TTFT gains are largest for long-context QA, retrieval, or summarization, while information-dense, per-token tasks may exhibit higher degradation at low keep rates (Liu et al., 5 Feb 2025, Fu et al., 2024, Long et al., 8 Aug 2025).
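Results in this area are reported both as speedup factors (×) and as percentage reductions, so it helps to convert between the two when comparing them. For a speedup $s$ the fractional reduction is $1 - 1/s$, and for a reduction $r$ the speedup is $1/(1 - r)$. The helper names below are illustrative.

```python
def speedup_to_reduction(speedup: float) -> float:
    """Fractional latency reduction implied by an s-times speedup: 1 - 1/s."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def reduction_to_speedup(reduction: float) -> float:
    """Speedup implied by a fractional latency reduction r: 1 / (1 - r)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


half = speedup_to_reduction(2.0)    # a 2x speedup halves latency
double = reduction_to_speedup(0.5)  # a 50% reduction is a 2x speedup
kvp = speedup_to_reduction(2.2)     # about 0.545, i.e. roughly a 55% reduction
```

The two scales are far from linear in each other: moving from a 50% to an 80% reduction raises the speedup from 2× to 5×.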

6. Architectural Patterns, Scheduling Policies, and Limitations

Dominant Patterns

Limitations and Open Problems

  • Quality trade-off in aggressive pruning: For information-dense or structured-prediction tasks, aggressive token dropping inevitably degrades performance; tasks requiring per-token output or logit-level access remain out-of-scope for current TTFT-pruning strategies (Liu et al., 5 Feb 2025, Fu et al., 2024).
  • Non-stationarity and heterogeneous SLOs: Optimization under variable-load and heterogeneous user SLOs is not fully solved; many existing policies require offline calibration or struggle under highly non-homogeneous traffic (2505.23022, Tian et al., 18 Dec 2025).
  • Complex deployment topologies: TTFT optimization in cross-region, multi-modal, or multi-stage pipelines may interact with load balancing, transfer overheads, and flow-level contention in complex ways; joint optimization frameworks (MFS, GORGO) are recent advances (Sun et al., 18 Mar 2026, Toniolo et al., 12 Feb 2026).

7. Practitioner Recommendations and Emerging Directions

  • Always report TTFT distributions as a function of prompt/context length (and modality) (Agrawal et al., 2024, Sun et al., 26 Nov 2025).
  • Instrument TTFT at fine granularity—decompose queueing, prefill, transfer, and decode contributions; only this surfaces true sources of latency and allows for actionable SLO definition (Agrawal et al., 2024, Wang et al., 4 Aug 2025, Lai et al., 3 Dec 2025).
  • Select a TTFT optimization strategy that matches the workload: pruning/token-dropping and speculator models excel in compressible, redundant, or long-context settings; scheduling and resource balancing are critical for highly bursty, multi-tenant loads.
  • Integrate TTFT-focused metrics with streaming smoothness metrics: TTFT alone does not capture user experience under token stalls; fluidity-index and per-token deadline compliance should augment TTFT for benchmarking (Agrawal et al., 2024).
  • Future directions: Adaptive, learned token-importance predictors; tighter integration of speculative decoding, KV-prediction, and prompt truncation; multi-stage, globally-aware scheduling; end-to-end TTFT-TPOT joint optimization (TaiChi, MFS); and real-time TTFT SLO negotiation per request (Liu et al., 5 Feb 2025, Lai et al., 3 Dec 2025, Sun et al., 18 Mar 2026, Wang et al., 4 Aug 2025).
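The first recommendation, reporting TTFT as a function of prompt length, can be sketched as a small aggregation step. The example below groups `(prompt_tokens, ttft_s)` samples into prompt-length buckets and reports nearest-rank p50 and p99 per bucket. The bucket edges, the function names, and the sample values are arbitrary choices made for illustration.

```python
import math
from collections import defaultdict


def percentile(sorted_vals, q):
    """Nearest-rank percentile for q in (0, 100] over a non-empty sorted list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def ttft_by_bucket(samples, edges=(1_000, 8_000, 32_000)):
    """Group (prompt_tokens, ttft_s) samples by prompt length; report n, p50, p99."""
    buckets = defaultdict(list)
    for prompt_tokens, ttft_s in samples:
        # First edge that the prompt length fits under, else the overflow bucket.
        label = next((f"<={e}" for e in edges if prompt_tokens <= e), f">{edges[-1]}")
        buckets[label].append(ttft_s)
    report = {}
    for label, values in buckets.items():
        values.sort()
        report[label] = {
            "n": len(values),
            "p50": percentile(values, 50),
            "p99": percentile(values, 99),
        }
    return report


samples = [(500, 0.05), (800, 0.07), (900, 0.06), (20_000, 1.2), (30_000, 2.0)]
report = ttft_by_bucket(samples)
```

With only a handful of samples per bucket the p99 is simply the maximum, so tail percentiles are only meaningful once each bucket holds enough requests.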

References:

(Liu et al., 5 Feb 2025, Agrawal et al., 2024, Sun et al., 26 Nov 2025, Horton et al., 2024, Long et al., 8 Aug 2025, Fu et al., 2024, Shao et al., 27 May 2025, Sun et al., 17 Feb 2025, Shen et al., 17 Mar 2025, Lai et al., 3 Dec 2025, Tian et al., 18 Dec 2025, Wang et al., 4 Aug 2025, Chen et al., 3 Oct 2025, Lyu et al., 16 Oct 2025, Toniolo et al., 12 Feb 2026, Sun et al., 18 Mar 2026, 2505.23022, Lee et al., 9 Oct 2025, Cho et al., 12 Feb 2026, Dexter et al., 7 Feb 2025).
