Time-To-First-Token in AI Inference
- Time-To-First-Token (TTFT) is a latency metric that measures the elapsed time from complete input receipt to the generation of the first output token in LLM and multimodal systems.
- It accounts for queuing delays, prompt prefill computations, and initial autoregressive steps, with techniques like KV prediction and speculative prefill reducing latency significantly.
- Optimizing TTFT enhances user experience and system efficiency, proving crucial for interactive chat, real-time streaming, and scalable AI inference architectures.
Time-To-First-Token (TTFT) is a fundamental metric in LLM and multi-modal model inference, quantifying the latency between prompt submission and the emission of the first output token. TTFT captures both system scheduling delays and the computational cost of the prompt “prefill” phase, critically shaping user-perceived responsiveness across interactive and streaming AI systems.
1. Formal Definition and Computational Breakdown
TTFT is defined as the elapsed wall-clock time from when the model receives the complete input (textual prompt or multimodal signal) until the first output token is generated:

$$\mathrm{TTFT} = t_{\text{first token emitted}} - t_{\text{input received}}$$
This metric comprises (i) queuing/scheduling overhead, (ii) prompt prefill computation (embedding and key-value cache generation across all layers), and (iii) the autoregressive inference step for the first token. In transformer-based autoregressive models, the dominant term is typically the prefill FLOPs, which scale roughly as $\mathcal{O}(L)$ for an $L$-token prompt (Horton et al., 10 Oct 2024), or as $\mathcal{O}(L^2)$ for naive attention implementations.
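For concreteness, the sketch below shows one common way to measure TTFT client-side by timing a streaming token iterator; the stub generator and its delay parameters are illustrative stand-ins for a real serving endpoint, not part of any cited system.

```python
import time
from typing import Iterable, Iterator

def measure_ttft(token_stream: Iterable[str]):
    """Time-to-first-token and collected tokens for any streaming token iterator.

    The clock starts just before the stream is consumed (i.e., at request
    submission); the first yielded token stops the TTFT timer, and later
    inter-token gaps would give TBT/TPOT.
    """
    t_submit = time.perf_counter()
    ttft, tokens = None, []
    for tok in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - t_submit  # first token observed
        tokens.append(tok)
    return ttft, tokens

def fake_stream(prompt: str, prefill_s: float = 0.4, tpot_s: float = 0.05) -> Iterator[str]:
    """Stand-in for a real streaming LLM client (an assumption for this sketch):
    a single sleep models queuing + prefill + the first decode step, then tokens
    arrive at a fixed decode rate."""
    time.sleep(prefill_s)
    for word in ("Hello", ",", " world", "!"):
        yield word
        time.sleep(tpot_s)

if __name__ == "__main__":
    ttft, toks = measure_ttft(fake_stream("Explain TTFT in one sentence."))
    print(f"TTFT ~= {ttft * 1000:.1f} ms; output: {''.join(toks)}")
```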
In multi-modal models and video LLMs, TTFT extends to include image/video preprocessing, patch embedding, visual encoding (e.g., via Vision Transformers), and compression layers before context injection (Sun et al., 26 Nov 2025, Shao et al., 27 May 2025).
2. TTFT in LLM Serving Architectures and Workloads
TTFT acts as the primary user-facing latency metric in real-world deployments. For text LLMs, it governs the responsiveness of chat, QA, and code-completion workloads (Horton et al., 10 Oct 2024, Shen et al., 17 Mar 2025, 2505.23022, Tian et al., 18 Dec 2025). For multimodal and video LLMs, TTFT reflects the latency from multimedia ingestion to first output, relevant for live Q&A, robotics, and streaming agents (Sun et al., 26 Nov 2025, Shao et al., 27 May 2025).
Empirical TTFT exhibits strong sensitivity to model scale, device locality, and prompt length. On servers, TTFT often shows heavy-tailed variance due to scheduling and network jitter, while on edge devices (e.g., smartphones) TTFT correlates nearly linearly with prompt length, with high Pearson correlation (Sun et al., 17 Feb 2025, Chen et al., 1 Aug 2025, Wang et al., 4 Aug 2025). In DP+EP distributed clusters, TTFT additionally reflects dispatch latencies, batch-processing synchronization, and network transfer (Tian et al., 18 Dec 2025).
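As a minimal illustration of the reported linear relationship on edge devices, the following sketch fits TTFT ≈ a·L + b over hypothetical (prompt length, TTFT) measurements; the sample numbers are invented for demonstration and are not taken from the cited studies.

```python
import numpy as np

# Hypothetical on-device measurements: (prompt length in tokens, observed TTFT in ms).
lengths = np.array([128, 256, 512, 1024, 2048], dtype=float)
ttft_ms = np.array([90.0, 170.0, 330.0, 660.0, 1310.0])

# Fit TTFT(L) ~= a * L + b and report the Pearson correlation of the raw data.
a, b = np.polyfit(lengths, ttft_ms, deg=1)
r = np.corrcoef(lengths, ttft_ms)[0, 1]

print(f"TTFT(L) ~= {a:.3f} ms/token * L + {b:.1f} ms   (Pearson r = {r:.4f})")
print(f"Predicted TTFT for a 4096-token prompt: {a * 4096 + b:.0f} ms")
```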
3. Algorithmic and System-Level TTFT Optimization
3.1 Prefill-Phase Reductions
- KV Prediction (KVP): A small auxiliary model processes the prompt, and a lightweight linear map predicts the full base-model KV cache, so the base model can begin autoregressive generation with only a single-token computation for the first token instead of an expensive full-prompt prefill pass (Horton et al., 10 Oct 2024). This yields substantial TTFT speedups, including on CPUs, with favorable accuracy/FLOPs tradeoffs, retaining more accuracy than baselines at fixed FLOPs budgets (a schematic sketch follows this list).
- Speculative Prefill: A training-free pipeline judges contextual token importance with a lightweight speculator LLM, selecting only a small fraction of prompt tokens for the prefill pass and thereby substantially reducing TTFT, even on models at the scale of Llama-3.1-405B, with negligible accuracy loss on long-context QA (Liu et al., 5 Feb 2025). Smooth chunked selection and position restoration ensure architectural compatibility with vLLM-style inference engines.
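The toy sketch below illustrates the KV-prediction idea from the first bullet: a cheap auxiliary pass produces hidden states, per-layer linear maps predict the base model's KV cache, and the base model then needs only a single-token step for the first output. Dimensions, module layout, and the random stand-in for the auxiliary model are assumptions for illustration, not the architecture of Horton et al.

```python
import torch
import torch.nn as nn

# Toy dimensions (assumptions for illustration, not the paper's configuration).
d_aux, d_base, n_base_layers, seq_len = 64, 256, 4, 32

class KVPredictor(nn.Module):
    """Per-layer linear maps from auxiliary hidden states to predicted base K/V."""
    def __init__(self):
        super().__init__()
        self.k_maps = nn.ModuleList([nn.Linear(d_aux, d_base) for _ in range(n_base_layers)])
        self.v_maps = nn.ModuleList([nn.Linear(d_aux, d_base) for _ in range(n_base_layers)])

    def forward(self, aux_hidden: torch.Tensor):
        # aux_hidden: [seq_len, d_aux] hidden states from the cheap auxiliary prefill.
        return [(k(aux_hidden), v(aux_hidden)) for k, v in zip(self.k_maps, self.v_maps)]

# 1) Cheap prefill: a small auxiliary model would process the full prompt; a random
#    tensor stands in for its hidden states here.
aux_hidden = torch.randn(seq_len, d_aux)

# 2) Predict the base model's KV cache with the learned linear maps.
predicted_kv = KVPredictor()(aux_hidden)

# 3) The base model then only needs a single-token forward pass (for the first
#    generated token) attending to the predicted cache, instead of an L-token prefill.
print(len(predicted_kv), predicted_kv[0][0].shape)  # 4 layers, K of shape [32, 256]
```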
3.2 Scheduling and Batching
- Staggered Batch Scheduling (SBS): In DP+EP clusters, SBS shifts queuing into the scheduler, using optimal buffering intervals and load-aware batch allocation to eliminate internal engine queueing bubbles. SBS delivers TTFT reductions on the order of 30% and throughput gains on the order of 15% versus immediate-dispatch baselines (Tian et al., 18 Dec 2025).
- TTFT-SLO-Aware Scheduling: Systems such as SCORPIO (2505.23022) and FairBatching (Lyu et al., 16 Oct 2025) use deadline-based sorting and admission control to prioritize requests by their TTFT thresholds, reject or demote unattainable requests, and dynamically split GPU resources between prefill and decode tasks for global fairness, routinely halving P99 TTFT tail latency (a schematic sketch of deadline-based admission follows this list).
- Hybrid PD Aggregation/Disaggregation: Systems (e.g., TaiChi (Wang et al., 4 Aug 2025)) balance prefill-heavy and decode-heavy GPU allocation via tunable sliders for chunk size ratios, dynamically shifting latency between TTFT and TPOT to maximize SLO goodput under mixed constraints.
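The sketch below gives a schematic of deadline-based admission and earliest-deadline-first batching in the spirit of TTFT-SLO-aware schedulers; the class name, feasibility test, and per-token prefill estimate are simplifications, not the actual SCORPIO or FairBatching implementations.

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    deadline: float                      # absolute time by which the first token is due
    rid: str = field(compare=False)
    prompt_tokens: int = field(compare=False, default=0)

class TTFTSLOScheduler:
    """Earliest-deadline-first queue with admission control (schematic only)."""

    def __init__(self, prefill_ms_per_token: float):
        self.queue: list[Request] = []
        self.prefill_ms_per_token = prefill_ms_per_token

    def admit(self, rid: str, prompt_tokens: int, ttft_slo_ms: float) -> bool:
        now = time.monotonic()
        est_prefill_s = prompt_tokens * self.prefill_ms_per_token / 1000.0
        deadline = now + ttft_slo_ms / 1000.0
        if now + est_prefill_s > deadline:        # infeasible even if started immediately
            return False                           # reject or demote to a best-effort tier
        heapq.heappush(self.queue, Request(deadline, rid, prompt_tokens))
        return True

    def next_batch(self, budget_tokens: int) -> list[Request]:
        """Pop earliest-deadline requests until the prefill token budget is filled."""
        batch, used = [], 0
        while self.queue and used + self.queue[0].prompt_tokens <= budget_tokens:
            req = heapq.heappop(self.queue)
            batch.append(req)
            used += req.prompt_tokens
        return batch

sched = TTFTSLOScheduler(prefill_ms_per_token=0.5)
print(sched.admit("a", 800, ttft_slo_ms=500))     # True: ~400 ms estimated prefill fits the SLO
print(sched.admit("b", 4000, ttft_slo_ms=500))    # False: ~2 s prefill cannot meet a 500 ms TTFT
print([r.rid for r in sched.next_batch(budget_tokens=2048)])  # ['a']
```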
3.3 Token and Cache Management
- Token Pruning and Information Diffusion: LazyLLM (Fu et al., 19 Jul 2024) and SlimInfer (Long et al., 8 Aug 2025) exploit attention-guided and blockwise pruning of prompt tokens across layers, leveraging semantic information diffusion to enable aggressive reductions in prefill computation (TTFT speedups of roughly 2x and above) with near-baseline accuracy at context lengths up to 128K (a schematic pruning sketch follows this list).
- Position-Independent Caching (EPIC): Modular reuse of precomputed KV vectors across requests, with chunk-wise boundary recomputation, drastically reduces TTFT relative to prefix-based caching and supports rapid long-context inference (Hu et al., 20 Oct 2024).
- KV Cache Competition Mitigation: CacheOPT applies output-length prediction and demand-based KV cache allocation to minimize TTFT and TBT stragglers, achieving substantial tail-TTFT reductions and higher SLO attainment (Shen et al., 17 Mar 2025).
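A minimal sketch of attention-guided prompt-token pruning follows: token importance is estimated from the attention paid by the final prompt position, and only the top-scoring fraction is carried into later layers. The scoring heuristic and keep-ratio here are illustrative; the cited systems use richer, layer-wise and blockwise criteria.

```python
import torch

def prune_prompt_tokens(hidden: torch.Tensor, attn: torch.Tensor, keep_ratio: float):
    """Keep only the prompt tokens that later computation is likely to need.

    hidden: [L, d] prompt hidden states at some layer.
    attn:   [H, L, L] attention probabilities at that layer (H heads).
    Importance = attention paid to each token by the final prompt position,
    averaged over heads (one simple heuristic, not the papers' exact criterion).
    """
    L = hidden.shape[0]
    importance = attn[:, -1, :].mean(dim=0)                      # [L]
    k = max(1, int(L * keep_ratio))
    keep_idx = torch.topk(importance, k).indices.sort().values   # preserve original order
    return hidden[keep_idx], keep_idx

L, d, H = 16, 8, 2
hidden = torch.randn(L, d)
attn = torch.softmax(torch.randn(H, L, L), dim=-1)
pruned, kept = prune_prompt_tokens(hidden, attn, keep_ratio=0.5)
print(pruned.shape, kept.tolist())   # torch.Size([8, 8]) plus the surviving token indices
```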
4. TTFT in Multimodal and Video LLMs
Visual and video encoding imposes substantial initial latency, expanding TTFT beyond text prefill. Key innovations include:
- Progressive Visual Compression (PVC): RPE and Windowed Token Compression in ViT encoders substantially reduce visual token length and cut ViT compute by roughly half, shrinking TTFT by roughly 1.9x over rival encoders without accuracy loss (Sun et al., 26 Nov 2025).
- Holistic Token Merging (HoliTom): Combining outer-LLM redundancy-aware temporal and spatial token merging with inner-LLM similarity-based token merging, HoliTom cuts TTFT from 3.4 s to 1.5 s (roughly a 2.3x reduction) while retaining near-baseline benchmark scores (Shao et al., 27 May 2025).
These approaches prioritize merging redundant visual tokens prior to LLM input, with further efficiency from inner-LLM similarity aggregation at early layers.
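As a simplified illustration of similarity-based visual token merging, the sketch below greedily averages consecutive tokens whose cosine similarity exceeds a threshold; this is a toy stand-in under assumed shapes, not HoliTom's or PVC's actual merging procedure.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Greedily average consecutive visual tokens whose cosine similarity exceeds
    `threshold`, shrinking the sequence the LLM must prefill.

    tokens: [N, d] visual tokens emitted by the vision encoder.
    """
    merged = [tokens[0]]
    for tok in tokens[1:]:
        sim = F.cosine_similarity(merged[-1].unsqueeze(0), tok.unsqueeze(0)).item()
        if sim > threshold:
            merged[-1] = (merged[-1] + tok) / 2   # fuse redundant neighbours
        else:
            merged.append(tok)
    return torch.stack(merged)

# Toy example: 64 near-duplicate frame tokens collapse to a handful of tokens.
base = torch.randn(1, 32)
tokens = base.repeat(64, 1) + 0.01 * torch.randn(64, 32)
print(tokens.shape, "->", merge_similar_tokens(tokens).shape)
```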
5. TTFT as a QoE and Autoscaling Driver
Beyond its role as a raw latency measurement, TTFT acts as a critical QoE metric for interactive, streaming systems and as a control signal for autoscaling:
- QoE Optimization: Device-server collaborative frameworks such as DiSCo use TTFT and TBT metrics to route, migrate, and buffer requests across endpoints, achieving tail-TTFT improvements of 11% and above alongside notable cost reductions under diverse budgets (Sun et al., 17 Feb 2025) (a schematic routing sketch follows this list).
- Autoscaling via Token Velocity: TokenScale introduces a stage-unified "Token Velocity" metric to proactively predict and mitigate TTFT-SLO violations. Convertible Decoders absorb prefill spikes, avoiding instantiation cold starts and maintaining SLO attainment of roughly 80% and above at several-fold lower cost compared to reactive scaling policies (Lai et al., 3 Dec 2025).
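The following sketch illustrates TTFT-driven device-server routing in the spirit of such QoE frameworks: each endpoint's TTFT is estimated from its queue delay and prefill rate, and the cheapest endpoint that meets the SLO is chosen. The endpoint model, field names, and policy are assumptions for illustration, not DiSCo's algorithm.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    queue_delay_ms: float         # current queuing estimate
    prefill_ms_per_token: float   # measured prefill rate
    cost_per_1k_tokens: float     # monetary cost proxy

def route_request(prompt_tokens: int, ttft_slo_ms: float, endpoints: list[Endpoint]) -> Endpoint:
    """Pick the cheapest endpoint whose estimated TTFT meets the SLO; if none can,
    fall back to the endpoint with the lowest expected TTFT."""
    def est_ttft(ep: Endpoint) -> float:
        return ep.queue_delay_ms + prompt_tokens * ep.prefill_ms_per_token

    feasible = [ep for ep in endpoints if est_ttft(ep) <= ttft_slo_ms]
    if feasible:
        return min(feasible, key=lambda ep: ep.cost_per_1k_tokens)
    return min(endpoints, key=est_ttft)

endpoints = [
    Endpoint("on-device", queue_delay_ms=0.0,  prefill_ms_per_token=1.2, cost_per_1k_tokens=0.0),
    Endpoint("cloud",     queue_delay_ms=80.0, prefill_ms_per_token=0.1, cost_per_1k_tokens=0.5),
]
print(route_request(prompt_tokens=300,  ttft_slo_ms=500, endpoints=endpoints).name)  # on-device
print(route_request(prompt_tokens=4000, ttft_slo_ms=500, endpoints=endpoints).name)  # cloud
```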
TTFT is also foundational for composite metrics, such as fluidity-index, which incorporates per-token deadlines to provide a more holistic assessment of streaming responsiveness (Agrawal et al., 9 Jul 2024).
6. Quantitative TTFT Improvements Across Techniques
| Approach/Method | Reported TTFT Improvement | Context/Model | Accuracy / Other Gains |
|---|---|---|---|
| KV Prediction | ≥ 1.76x | OpenELM 1.1B, Apple M2 Pro | ≥ 15% relative accuracy gain at fixed FLOPs |
| Speculative Prefill | substantial | Llama-3.1-405B-Instruct-FP8 | ~95% on QA |
| SBS Scheduling | ≥ 30% TTFT reduction | DeepSeek-V3, DP+EP aligned cluster | ≥ 15% higher throughput |
| SCORPIO | higher SLO adherence | ShareGPT + LMSYS traces | improved goodput |
| CacheOPT | reduced tail TTFT | OPT-13B/175B, LLaMA3, vLLM, RLP | higher sustainable arrival rate |
| LazyLLM/SlimInfer | ≥ 2.34x | LLaMA2/XGen, LongBench | ~1% accuracy drop |
| EPIC/AttnLink | ≥ 3x | Llama3.1, Mistral7B, Yi Coder 9B | ~7% Rouge/F1 drop |
| HoliTom | ~2.3x (3.4 s → 1.5 s) | LLaVA-OneVision-7B video LLM | benchmark score retained |
| LLaVA-UHD v3 | ≥ 1.9x | MLLM (ViT-UHD, Qwen2-VL) | competitive with MoonViT/Qwen2-VL |
7. Open Challenges, Trade-Offs, and Future Directions
While TTFT minimization is central for user satisfaction and hardware efficiency, several trade-offs and limitations persist:
- Accuracy vs. Aggressive Pruning: Techniques that aggressively reduce TTFT via token dropout or boundary recomputation may incur small but measurable drops in accuracy for certain information-dense, short-context tasks (Fu et al., 19 Jul 2024, Hu et al., 20 Oct 2024, Liu et al., 5 Feb 2025).
- Predictor Complexity vs. Overhead: Linear predictors for KV map estimation accumulate drift over deep stacks (Horton et al., 10 Oct 2024). Further research is warranted into nonlinear or cross-layer predictors.
- Prompt-Length Sensitivity: TTFT grows superlinearly with prompt size in quadratic-attention transformers (Agrawal et al., 9 Jul 2024); context compression and blockwise designs (EdgeInfinite-Instruct) mitigate this but require nontrivial system-level adaptation (Chen et al., 1 Aug 2025).
- Streaming/On-the-Fly Video: Methods such as HoliTom rely on offline video segmentation and are not yet generalizable to online streaming environments (Shao et al., 27 May 2025).
- Resource Allocation and SLO Attainment: Scheduling policies that exploit SLO heterogeneity and admission control (SCORPIO, FairBatching) increase fairness and goodput, but dynamic tuning under bursty loads remains a challenge.
Research continues to explore dynamic layer selection, predictor-integrated prefill pipelines, cross-modal token merging, and auto-adaptive resource allocation to further advance TTFT efficiency while balancing end-to-end quality.
TTFT thus remains the central latency criterion in LLM and MLLM serving, with its reduction driving real-time responsiveness, cost efficiency, and system scalability across diverse workloads, languages, and modalities. Optimization of TTFT is an active research frontier with significant implications for next-generation interactive intelligence infrastructure.