Time-to-First-Token (TTFT) Overview
- TTFT is a latency metric that measures the elapsed time from user prompt arrival to the emission of the first token, covering queueing, prefill, and the first decode step.
- It highlights bottlenecks in transformer prefill, token processing, and scheduling that directly impact the user-perceived responsiveness of LLM applications.
- Recent advancements such as dynamic token pruning, KV cache prediction, and advanced scheduling methodologies significantly reduce TTFT while maintaining model accuracy.
Time-to-First-Token (TTFT) is a key latency metric in LLM inference pipelines, quantifying the wall-clock duration from the instant a user prompt arrives at the model server to when the first output token is generated and returned downstream. TTFT subsumes all computation and scheduling within the prefill (prompt-processing) phase and is the principal determinant of perceived responsiveness in interactive LLM applications. Recent research targets both precise characterization and aggressive reduction of TTFT across various architectures, deployments, and optimization strategies.
1. Definition, Mathematical Formalism, and Variants
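TTFT is defined formally as

$$\mathrm{TTFT} = t_{\text{first}} - t_{\text{arrival}},$$

where $t_{\text{arrival}}$ represents the timestamp at request arrival (on server or gateway) and $t_{\text{first}}$ the moment the first output token is observed by client or system instrumentation. In LLM systems, TTFT typically decomposes as

$$\mathrm{TTFT} = T_{\text{queue}} + T_{\text{prefill}} + T_{\text{decode},1},$$

where $T_{\text{queue}}$ is prefill-side waiting, $T_{\text{prefill}}$ the time to process all prompt tokens through the full transformer stack (building the KV cache), and $T_{\text{decode},1}$ the first autoregressive generation step, which is an order of magnitude smaller than $T_{\text{prefill}}$ except for extremely short prompts (Liu et al., 5 Feb 2025, Agrawal et al., 2024, Chen et al., 3 Oct 2025).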
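The decomposition into queueing, prefill, and first decode step can be illustrated with a minimal Python sketch. The timestamp field names and the example values are assumptions made for illustration; they do not come from any of the cited systems.

```python
from dataclasses import dataclass


@dataclass
class RequestTimestamps:
    """Monotonic timestamps in seconds (hypothetical field names)."""
    arrival: float        # request reaches the server or gateway
    prefill_start: float  # scheduler hands the prompt to the engine
    prefill_end: float    # KV cache for the whole prompt has been built
    first_token: float    # first output token is observed


def ttft_breakdown(ts: RequestTimestamps) -> dict[str, float]:
    """Split TTFT into queueing, prefill, and first-decode-step components."""
    return {
        "queue": ts.prefill_start - ts.arrival,
        "prefill": ts.prefill_end - ts.prefill_start,
        "first_decode": ts.first_token - ts.prefill_end,
        "ttft": ts.first_token - ts.arrival,
    }


# Illustrative values only.
example = RequestTimestamps(arrival=0.0, prefill_start=0.25,
                            prefill_end=1.0, first_token=1.125)
parts = ttft_breakdown(example)
print(parts)
```

In this example the three components sum to the end-to-end TTFT, with prefill as the largest term.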
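Extensions to TTFT in multimodal models (video, image, or multi-stage networks) incorporate modality-specific pre-processing latencies, e.g.,

$$\mathrm{TTFT} = T_{\text{ViT}} + T_{\text{prefill}}(N)$$

for MLLMs (multimodal LLMs) with vision transformer encoders, where $T_{\text{ViT}}$ is the encoder latency and $N$ the token count seen by the LLM prefill (Sun et al., 26 Nov 2025, Shao et al., 27 May 2025).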
In PD-disaggregated (prefill/decode-separated) serving, TTFT is the sum $\mathrm{TTFT} = T_{\text{prefill}} + T_{\text{transfer}} + T_{\text{decode-start}}$, with prefill, network KV transfer, and decode startup contributions (Lai et al., 3 Dec 2025).
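To give a sense of scale for the KV transfer term in disaggregated serving, the following Python sketch estimates KV cache size and the resulting transfer time. The KV size formula (keys plus values, per layer, per KV head) is standard; the model shape, link bandwidth, and stage latencies are hypothetical values chosen for illustration.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def pd_ttft_seconds(prefill_s: float, kv_bytes: int,
                    link_bytes_per_s: float, decode_start_s: float) -> float:
    """TTFT as prefill + KV transfer + decode startup (no overlap assumed)."""
    transfer_s = kv_bytes / link_bytes_per_s
    return prefill_s + transfer_s + decode_start_s


# Hypothetical shape: 32 layers, 8 KV heads of dimension 128, fp16, 8192-token prompt.
kv = kv_cache_bytes(num_tokens=8192, num_layers=32, num_kv_heads=8, head_dim=128)
# Hypothetical 25 GB/s link, 0.5 s prefill, 20 ms decode startup.
total = pd_ttft_seconds(prefill_s=0.5, kv_bytes=kv,
                        link_bytes_per_s=25e9, decode_start_s=0.02)
print(f"KV cache: {kv / 2**30:.2f} GiB, estimated TTFT: {total:.3f} s")
```

The sketch assumes the transfer does not overlap with prefill; systems that stream KV blocks layer by layer would see a smaller transfer contribution.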
2. Bottlenecks in TTFT: Structural and Systemic Sources
The dominant TTFT cost is prompt processing (prefill). For sequence length $n$, transformer prefill scales as $O(n^2)$ in self-attention and linearly in MLP (per layer, batch), with the entire prompt required before the first token can be emitted. This bottleneck is exacerbated for long-context LLMs, multi-modal systems with native-resolution visual tokens, and batch serving workloads (Agrawal et al., 2024, Liu et al., 5 Feb 2025, Sun et al., 26 Nov 2025, Fu et al., 2024, Long et al., 8 Aug 2025).
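The quadratic and linear contributions can be compared with a rough per-layer FLOP count. The Python sketch below uses a simplified cost model that is an assumption of this example: dense attention without causal-mask savings, four $d \times d$ attention projections, a two-matrix MLP with hidden size `mlp_ratio * d`, and 2 FLOPs per multiply-add.

```python
def prefill_flops_per_layer(n: int, d: int, mlp_ratio: int = 4) -> dict[str, int]:
    """Approximate prefill FLOPs for one transformer layer on an n-token prompt."""
    # QK^T and the attention-weighted sum over V each cost about 2*n*n*d FLOPs.
    attn_quadratic = 4 * n * n * d
    # Q, K, V, and output projections: four n x d by d x d matmuls.
    attn_linear = 8 * n * d * d
    # Up and down projections of the MLP.
    mlp = 4 * mlp_ratio * n * d * d
    return {"attn_quadratic": attn_quadratic, "attn_linear": attn_linear, "mlp": mlp}


def quadratic_share(n: int, d: int) -> float:
    """Fraction of per-layer prefill FLOPs spent in the n^2 attention term."""
    flops = prefill_flops_per_layer(n, d)
    return flops["attn_quadratic"] / sum(flops.values())


for n in (1024, 8192, 65536):
    print(n, f"{quadratic_share(n, d=4096):.1%}")
```

Under these assumptions and a model width of 4096, the quadratic term accounts for 4% of per-layer FLOPs at 1,024 tokens, 25% at 8,192 tokens, and about 73% at 65,536 tokens, which is why long prompts are where prefill cost grows fastest.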
Key sources driving up TTFT include:
- Compute-bound prefill: For long prompts or vision/video applications, attention and MLP layers dominate (Liu et al., 5 Feb 2025, Sun et al., 26 Nov 2025, Shao et al., 27 May 2025).
- Batch scheduling and queueing: Non-preemptive or naive immediate-dispatch policies induce in-engine queues and head-of-line blocking, accumulating queueing delay within device or across nodes (Tian et al., 18 Dec 2025, Chen et al., 3 Oct 2025, Lyu et al., 16 Oct 2025, Shen et al., 17 Mar 2025).
- Network transfer and system architecture: Disaggregated PD architectures and geo-distributed deployments add inter-GPU or inter-region transfer times, making TTFT sensitive to network contention and scheduling (Lai et al., 3 Dec 2025, Sun et al., 18 Mar 2026, Toniolo et al., 12 Feb 2026).
- Resource contention: Memory (KV cache), system bus saturation, and mixture-of-experts (MoE) weight reloads contribute to TTFT via hardware contention and inefficient prefill scheduling (Lee et al., 9 Oct 2025, Shen et al., 17 Mar 2025, 2505.23022).
- Prompt-length and context-size effects: TTFT grows with prompt length, often super-linearly, so practitioners should report TTFT as a function of prompt/context length rather than as a single aggregate (Agrawal et al., 2024, Wang et al., 4 Aug 2025, Fu et al., 2024, Long et al., 8 Aug 2025).
3. Algorithmic and Architectural Optimizations for TTFT Reduction
Recent research has produced a diverse array of TTFT reduction strategies, driven by architectural redesigns, proxy networks, scheduling reformulations, and token-level compression techniques. Core techniques are enumerated below.
Token Pruning, Merging, and Attention-Guided Prefill Abbreviation
- Dynamic token pruning and token importance estimation: Techniques such as Speculative Prefill (Liu et al., 5 Feb 2025), LazyLLM (Fu et al., 2024), SlimInfer (Long et al., 8 Aug 2025), HoliTom (video) (Shao et al., 27 May 2025), and Progressive Visual Compression (PVC) for MLLMs (Sun et al., 26 Nov 2025) eliminate or merge tokens in prefill using importance scoring, attention footprints, or layerwise information-diffusion criteria. This reduces the effective sequence length $n$ in the $O(n^2)$ prefill cost, yielding up to 7.66× TTFT speedup at minimal accuracy loss for SpecPrefill (with 10% of tokens kept) (Liu et al., 5 Feb 2025), 2.53× for SlimInfer (Long et al., 8 Aug 2025), and up to 2.4× for ViT-UHD and 2.28× for HoliTom in MLLMs (Sun et al., 26 Nov 2025, Shao et al., 27 May 2025).
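The core selection step behind importance-based pruning can be sketched in Python as a top-k filter over per-token scores. This is a generic illustration, not the scoring rule of any cited method: the scores are supplied directly here, whereas the methods above derive them from attention statistics or a smaller speculator model.

```python
def keep_top_tokens(scores: list[float], keep_ratio: float) -> list[int]:
    """Return indices of the highest-scoring tokens, restored to prompt order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def attention_cost_fraction(kept: int, total: int) -> float:
    """Remaining share of the quadratic attention cost after pruning."""
    return (kept / total) ** 2


# Hypothetical importance scores for an 8-token prompt.
scores = [0.9, 0.1, 0.05, 0.7, 0.02, 0.3, 0.01, 0.6]
kept = keep_top_tokens(scores, keep_ratio=0.5)
print(kept, attention_cost_fraction(len(kept), len(scores)))
```

Keeping half of the tokens leaves a quarter of the quadratic attention cost, while terms that are linear in sequence length shrink only by half.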
Proxy/Predictor Models for KV Cache Bypass
- KV Prediction: Accurate approximation of the full base model’s KV cache using a smaller auxiliary model, followed by direct decode using the predicted KV, achieves 2.2× wall-clock TTFT speedup and up to 4× reduction in prefill FLOPs (Horton et al., 2024).
- Cascade/speculator designs: Small models or off-the-shelf LLMs select token subsets (SpecPrefill) or generate pseudo-KV caches to shortcut the base model prefill computation (Liu et al., 5 Feb 2025, Horton et al., 2024).
Advanced Scheduling and Resource Management
- Preemptive, buffer-aware, and staggered scheduling: TokenFlow (Chen et al., 3 Oct 2025), Staggered Batch Scheduling (SBS) (Tian et al., 18 Dec 2025), Multi-stage Flow Scheduling (MFS) (Sun et al., 18 Mar 2026), and others avoid head-of-line blocking and enable fine-grained resource allocation, dynamically reordering workloads to eliminate queuing bubbles and synchronize microbatches on device. TokenFlow and SBS achieve up to 80.2% and 40% TTFT reduction, respectively (Chen et al., 3 Oct 2025, Tian et al., 18 Dec 2025).
- Fair batching and slack-tracking: FairBatching (Lyu et al., 16 Oct 2025) introduces envelope-line SLO tracking, adaptive time-budgets, and dynamic reprioritization, reducing TTFT p99 by 2.29× versus Sarathi-style stall-free decoders, while preserving per-token throughput.
- Deadline- and SLO-aware request admission: SCORPIO (2505.23022) introduces deadline-based queueing, LDF reordering, and predictive prefill-time modeling, increasing SLO attainment by ~46.5% for TTFT under burst.
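The effect of head-of-line blocking on TTFT can be seen in a toy single-worker model. The Python sketch below is a deliberately simplified illustration (all requests arrive together, one prefill runs at a time, no batching or preemption) and does not reproduce the policies of TokenFlow, SBS, FairBatching, or SCORPIO; the prefill times are hypothetical.

```python
def mean_ttft(prefill_times: list[float]) -> float:
    """Mean TTFT when simultaneous arrivals are prefetched one at a time, in order."""
    clock = 0.0
    total = 0.0
    for t in prefill_times:
        clock += t      # this request finishes prefill and emits its first token
        total += clock  # its TTFT equals the elapsed time since the shared arrival
    return total / len(prefill_times)


# One long prompt ahead of three short ones (seconds of prefill, hypothetical).
fifo_order = [8.0, 1.0, 1.0, 1.0]
shortest_first = sorted(fifo_order)

print("FIFO mean TTFT:", mean_ttft(fifo_order))
print("Shortest-prefill-first mean TTFT:", mean_ttft(shortest_first))
```

Reordering cuts mean TTFT from 9.5 s to 4.25 s in this example, but the long request's own TTFT rises from 8 s to 11 s, which is the kind of trade-off that deadline-aware and fairness-aware policies are designed to manage.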
System/Cluster Level and Hardware-Aware Techniques
- Hybrid aggregation/disaggregation and instance type adaptation: TaiChi (Wang et al., 4 Aug 2025) and TokenScale (Lai et al., 3 Dec 2025) jointly tune "P-heavy" and "D-heavy" instance ratios, prefill chunk sizes, and deploy convertible decoders, maintaining TTFT within SLO across bursty production loads. TokenScale achieves mean TTFT ≈ 50 ms, p99 ≈ 600 ms on standard benchmarks.
- Geo-distributed, latency-optimized load balancing: GORGO (Toniolo et al., 12 Feb 2026) uses additive cost models incorporating RTT, cache reuse, GPU queue depths, and prefix similarity to select optimal regions, reducing median TTFT by 2.5× and P99 by up to two orders of magnitude.
- KV cache prediction, allocation, and migration: MC² (Shen et al., 17 Mar 2025), DiSCo (Sun et al., 17 Feb 2025), and other KV-centric optimizations proactively reserve and migrate KV allocations or dynamically schedule between device/server for TTFT reduction under tight memory and compute constraints.
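An additive routing cost of the kind described for geo-distributed load balancing can be sketched in Python as follows. The specific terms, their weights, and all names are assumptions of this example rather than GORGO's actual cost model: network round-trip time, a queueing estimate proportional to queue depth, and prefill time discounted by the fraction of the prompt already cached in the region.

```python
from dataclasses import dataclass


@dataclass
class RegionState:
    rtt_s: float                   # client-to-region round-trip time
    queued_requests: int           # requests waiting ahead on the region's GPUs
    cached_prefix_fraction: float  # share of the prompt already in the KV cache


def estimated_ttft(state: RegionState, full_prefill_s: float,
                   wait_per_queued_s: float) -> float:
    """Additive TTFT estimate: network + queueing + uncached prefill."""
    queueing = state.queued_requests * wait_per_queued_s
    prefill = full_prefill_s * (1.0 - state.cached_prefix_fraction)
    return state.rtt_s + queueing + prefill


def pick_region(regions: dict[str, RegionState], full_prefill_s: float,
                wait_per_queued_s: float) -> str:
    """Choose the region with the lowest estimated TTFT."""
    return min(regions, key=lambda name: estimated_ttft(
        regions[name], full_prefill_s, wait_per_queued_s))


# Hypothetical snapshot: the nearby region is busy and has no cached prefix.
regions = {
    "us-east": RegionState(rtt_s=0.020, queued_requests=6, cached_prefix_fraction=0.0),
    "eu-west": RegionState(rtt_s=0.090, queued_requests=1, cached_prefix_fraction=0.75),
}
best = pick_region(regions, full_prefill_s=0.4, wait_per_queued_s=0.05)
print(best)
```

Here the more distant region wins because its shorter queue and cached prefix outweigh the extra round-trip time.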
4. Measurement Protocols, Metrics, and Pitfalls
TTFT is measured end-to-end from external request arrival to first-token emission, typically at the client or per-inference pipeline step. Black-box measurement uses high-resolution wall-clock timers; system-level profiling decomposes TTFT into queue, prefill, transfer, and decode startup stages (Agrawal et al., 2024, Lai et al., 3 Dec 2025, Wang et al., 4 Aug 2025).
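A black-box measurement of this kind can be sketched in Python with a monotonic high-resolution timer. The `start_stream` callable and the `fake_stream` generator are hypothetical stand-ins for a real streaming client; the timer is started before the request is issued so that network and queueing time are included.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_ttft(start_stream: Callable[[], Iterable[str]]) -> tuple[float, str]:
    """Return (TTFT in seconds, first token) for a streaming response."""
    start = time.perf_counter()      # start timing before the request is sent
    stream = iter(start_stream())    # issue the request
    first_token = next(stream)       # block until the first token arrives
    return time.perf_counter() - start, first_token


def fake_stream(first_delay_s: float, tokens: list[str]) -> Iterator[str]:
    """Stand-in for a model server: waits, then streams tokens."""
    time.sleep(first_delay_s)        # simulated queueing plus prefill
    for tok in tokens:
        yield tok


ttft_s, token = measure_ttft(lambda: fake_stream(0.05, ["Hello", " world"]))
print(f"TTFT = {ttft_s * 1000:.1f} ms, first token = {token!r}")
```

`time.perf_counter` is used rather than wall-clock time because it is monotonic and unaffected by system clock adjustments.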
Aggregation
- Median (P50), P90, P95, and P99 percentiles of TTFT are commonly reported for SLO evaluation and to capture the tail of the distribution, as TTFT can vary by several orders of magnitude under burst or queuing conditions (Tian et al., 18 Dec 2025, 2505.23022, Toniolo et al., 12 Feb 2026).
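Percentile reporting can be illustrated with a short Python sketch. It uses the nearest-rank definition of a percentile, which is one of several common conventions and an assumption of this example; the sample values are made up.

```python
def percentile_nearest_rank(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples or not 1 <= p <= 100:
        raise ValueError("need non-empty samples and 1 <= p <= 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100   # ceil(p/100 * n) in integer arithmetic
    return ordered[rank - 1]


# Hypothetical TTFT samples in milliseconds, with one queued outlier.
ttft_ms = [120, 95, 110, 3000, 105, 98, 101, 130, 115, 99]

summary = {f"P{p}": percentile_nearest_rank(ttft_ms, p) for p in (50, 90, 99)}
summary["mean"] = sum(ttft_ms) / len(ttft_ms)
print(summary)
```

In this sample the median is 105 ms and P99 is 3000 ms, while the mean of 397.3 ms describes neither the typical request nor the tail.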
Limitations and Caveats
- TTFT is highly prompt-length-sensitive, with the attention component of prefill scaling quadratically in prompt length; SLOs should therefore be reported as a function of prompt length (Agrawal et al., 2024).
- System configuration (batching, scheduling, resource type) confounds comparison; a system may optimize TTFT at the expense of Time-Per-Output-Token (TPOT) or throughput (Wang et al., 4 Aug 2025, 2505.23022, Tian et al., 18 Dec 2025).
- TTFT does not reflect streaming smoothness or inter-token fluidity, which motivates new metrics such as fluidity-index (Agrawal et al., 2024).
- TTFT alone is insufficient for user experience evaluation, as stalls after first token are invisible to this metric.
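The last two caveats can be made concrete with a Python sketch that compares two token streams sharing the same TTFT. The maximum inter-token gap used here is a simple stand-in chosen for this example; it is not the fluidity-index definition from the cited work, and the timestamps are hypothetical.

```python
def ttft_and_max_gap(arrival_s: float, token_times_s: list[float]) -> tuple[float, float]:
    """Return (TTFT, largest gap between consecutive tokens) in seconds."""
    ttft = token_times_s[0] - arrival_s
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    return ttft, max(gaps, default=0.0)


# Two hypothetical streams for a request arriving at t = 0.
smooth = [0.5, 0.625, 0.75, 0.875]
stalled = [0.5, 0.625, 2.625, 2.75]   # a 2 s stall after the second token

print(ttft_and_max_gap(0.0, smooth))
print(ttft_and_max_gap(0.0, stalled))
```

Both streams report a TTFT of 0.5 s, yet the second contains a 2 s stall that only an inter-token metric reveals.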
5. Empirical Outcomes and Theoretical Guarantees
Recent studies report consistent and often large empirical TTFT reductions across diverse LLM and MLLM settings:
| Approach | Model/Context | TTFT Speedup (×) | Accuracy Retention | Reference |
|---|---|---|---|---|
| SpecPrefill | Llama-3.1 405B, LongBench | 7.66 | >95% | (Liu et al., 5 Feb 2025) |
| LazyLLM | Llama 2 7B, LongBench | 1.3–2.9 | >99% | (Fu et al., 2024) |
| SlimInfer | LLaMA3.1-8B-Inst, 32k tok | 2.53 | <0.5% loss | (Long et al., 8 Aug 2025) |
| KV prediction (KVP) | OpenELM 1.1B, CPU | 2.2 | +15–50% rel. | (Horton et al., 2024) |
| ViT-UHD/PVC (MLLM) | LLaVA-UHD v3 | 1.9–2.4 | ≈ baseline | (Sun et al., 26 Nov 2025) |
| HoliTom (video LLM) | LLaVA-OneVision-7B | 2.28 | 99.1% | (Shao et al., 27 May 2025) |
| TokenFlow (sched) | Llama3-8B, Qwen2.5-32B | 1.75–6.5 (P99) | Unchanged | (Chen et al., 3 Oct 2025) |
| SBS (batch sched) | Deepseek-V3, H800 | 1.3–1.4 | >99% | (Tian et al., 18 Dec 2025) |
| MC² (KV allocation) | OPT-13B/175B/LLaMA2 | 2.83–3.29 | SLO↑, throughput↑ | (Shen et al., 17 Mar 2025) |
| GORGO (geo load-balance) | Llama3.1-8B, 3 regions | 2.5 (median), up to ~100 (P99) | – | (Toniolo et al., 12 Feb 2026) |
| SCORPIO (SLO guard) | Llama3.1-8B, ShareGPT | +46.5% SLO | – | (2505.23022) |
| Layered Prefill (MoE/LLM) | Qwen3-30B, H100 | up to 1.7 | full-throughput | (Lee et al., 9 Oct 2025) |
The table cites the largest or most representative speedup numbers reported; most studies provide breakdowns by prompt length or context size. Task-level breakdowns show TTFT gains are largest for long-context QA, retrieval, or summarization, while information-dense, per-token tasks may exhibit higher degradation at low keep rates (Liu et al., 5 Feb 2025, Fu et al., 2024, Long et al., 8 Aug 2025).
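Because results in this literature are quoted both as multiplicative speedups and as percentage reductions, a conversion helps when comparing them. The Python sketch below applies the identity reduction = 1 - 1/speedup to two figures quoted in this article; the function names are illustrative.

```python
def reduction_from_speedup(speedup: float) -> float:
    """Fractional TTFT reduction implied by a multiplicative speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def speedup_from_reduction(reduction: float) -> float:
    """Multiplicative speedup implied by a fractional TTFT reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


print(f"7.66x speedup  -> {reduction_from_speedup(7.66):.1%} lower TTFT")
print(f"80.2% reduction -> {speedup_from_reduction(0.802):.2f}x speedup")
```

A 7.66× speedup corresponds to roughly an 86.9% reduction, and an 80.2% reduction to roughly a 5.05× speedup, so percentage figures compress large speedups into a narrow range near 100%.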
6. Architectural Patterns, Scheduling Policies, and Limitations
Dominant Patterns
- Prefill/Decode separation: Decoupling these phases exposes fine-grained scaling, queue management, and scheduling opportunities (disaggregation, convertible decoders, chunked/layered prefill) (Lai et al., 3 Dec 2025, Lee et al., 9 Oct 2025, Wang et al., 4 Aug 2025, Tian et al., 18 Dec 2025).
- KV cache management: Proactive, migration-based, or predictor-free policies (MC², DiSCo, SlimInfer) reduce KV-induced queuing and memory interference, key to keeping TTFT low under high load (Shen et al., 17 Mar 2025, Sun et al., 17 Feb 2025, Long et al., 8 Aug 2025).
Limitations and Open Problems
- Quality trade-off in aggressive pruning: For information-dense or structured-prediction tasks, aggressive token dropping can degrade performance; tasks requiring per-token output or logit-level access remain out-of-scope for current TTFT-pruning strategies (Liu et al., 5 Feb 2025, Fu et al., 2024).
- Non-stationarity and heterogeneous SLOs: Optimization under variable-load and heterogeneous user SLOs is not fully solved; many existing policies require offline calibration or struggle under highly non-homogeneous traffic (2505.23022, Tian et al., 18 Dec 2025).
- Complex deployment topologies: TTFT optimization in cross-region, multi-modal, or multi-stage pipelines may interact with load balancing, transfer overheads, and flow-level contention in complex ways; joint optimization frameworks (MFS, GORGO) are recent advances (Sun et al., 18 Mar 2026, Toniolo et al., 12 Feb 2026).
7. Practitioner Recommendations and Emerging Directions
- Always report TTFT distributions as a function of prompt/context length (and modality) (Agrawal et al., 2024, Sun et al., 26 Nov 2025).
- Instrument TTFT at fine granularity: decompose it into queueing, prefill, transfer, and decode contributions. This decomposition exposes the actual sources of latency and makes actionable SLO definitions possible (Agrawal et al., 2024, Wang et al., 4 Aug 2025, Lai et al., 3 Dec 2025).
- Select TTFT optimization strategy to match workload: Pruning/token-dropping and speculator models excel in compressible, redundant, or long-context settings; scheduling and resource balancing are critical for highly bursty, multi-tenant loads.
- Integrate TTFT-focused metrics with streaming smoothness metrics: TTFT alone does not capture user experience under token stalls; fluidity-index and per-token deadline compliance should augment TTFT for benchmarking (Agrawal et al., 2024).
- Future directions: Adaptive, learned token-importance predictors; tighter integration of speculative decoding, KV-prediction, and prompt truncation; multi-stage, globally-aware scheduling; end-to-end TTFT-TPOT joint optimization (TaiChi, MFS); and real-time TTFT SLO negotiation per request (Liu et al., 5 Feb 2025, Lai et al., 3 Dec 2025, Sun et al., 18 Mar 2026, Wang et al., 4 Aug 2025).
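The first recommendation, reporting TTFT as a function of prompt length, can be sketched in Python by bucketing measurements before summarizing them. The bucket edges, sample values, and function name are assumptions of this example; in practice each bucket would also carry tail percentiles and a sample count.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import median


def median_ttft_by_prompt_length(samples: list[tuple[int, float]],
                                 edges: list[int]) -> dict[str, float]:
    """Group (prompt_tokens, ttft_s) samples by length bucket and take medians."""
    labels = [f"<={e}" for e in edges] + [f">{edges[-1]}"]
    buckets: dict[str, list[float]] = defaultdict(list)
    for prompt_tokens, ttft_s in samples:
        # bisect_left places a length equal to an edge in that edge's bucket.
        buckets[labels[bisect_left(edges, prompt_tokens)]].append(ttft_s)
    return {label: median(buckets[label]) for label in labels if label in buckets}


# Hypothetical measurements: (prompt length in tokens, TTFT in seconds).
samples = [(200, 0.08), (900, 0.12), (500, 0.10),
           (3000, 0.40), (3500, 0.50),
           (20000, 4.0), (30000, 6.0)]
report = median_ttft_by_prompt_length(samples, edges=[1024, 8192, 32768])
print(report)
```

A single aggregate over these samples would hide the fact that the longest-prompt bucket is slower than the shortest by well over an order of magnitude.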
References:
(Liu et al., 5 Feb 2025, Agrawal et al., 2024, Sun et al., 26 Nov 2025, Horton et al., 2024, Long et al., 8 Aug 2025, Fu et al., 2024, Shao et al., 27 May 2025, Sun et al., 17 Feb 2025, Shen et al., 17 Mar 2025, Lai et al., 3 Dec 2025, Tian et al., 18 Dec 2025, Wang et al., 4 Aug 2025, Chen et al., 3 Oct 2025, Lyu et al., 16 Oct 2025, Toniolo et al., 12 Feb 2026, Sun et al., 18 Mar 2026, 2505.23022, Lee et al., 9 Oct 2025, Cho et al., 12 Feb 2026, Dexter et al., 7 Feb 2025).