FTED: First-Token Emission Delay in LLMs

Updated 2 April 2026
  • FTED is a latency metric that defines the interval from a request's receipt to the generation of the first output token in large language models.
  • It breaks down into dispatch, prefill, and decode components, with delays influenced by scheduling, memory management, and prompt processing.
  • Techniques such as buffer-aware preemptive scheduling, proactive KV-cache management, and layered prefill offer practical reductions in FTED while enhancing LLM responsiveness.

First-Token Emission Delay (FTED), also widely referred to as Time-to-First-Token (TTFT), is a fundamental latency metric in the serving of LLMs. FTED measures the elapsed wall-clock time from the moment a user’s request is received by the LLM serving system until the emission of the very first output token. As interactive LLM applications proliferate—particularly those requiring low-latency, streaming responses—FTED has become a primary indicator of system responsiveness and user-perceived interactivity.

1. Formal Definition and Measurement

FTED is defined as the time interval between request receipt and the generation or delivery of the first output token. In canonical form for a single request $i$:

$$\mathrm{FTED}_i = t^{\mathrm{emit1}}_i - t^{\mathrm{recv}}_i$$

where $t^{\mathrm{recv}}_i$ is the timestamp when request $i$ is registered by the serving system, and $t^{\mathrm{emit1}}_i$ is when the first output token is produced by the inference engine (Chen et al., 3 Oct 2025).

In clustered and distributed LLM serving architectures, FTED is typically decomposed into the sum of:

  • Scheduler-side dispatch and queuing delay ($T_{\rm dispatch}$)
  • Prompt processing or prefill latency ($T_{\rm prefill}$)
  • First-token decode time ($T_{\rm decode\_first}$)

yielding

$$\mathrm{FTED} = T_{\rm dispatch} + T_{\rm prefill} + T_{\rm decode\_first}$$

Measurement conventionally excludes client-side and network-propagation delays unless stated otherwise (Tian et al., 18 Dec 2025, Lai et al., 3 Dec 2025).

Standard metrics reported include mean, median (P50), and 99th percentile (P99) FTED, which collectively capture average and tail latency characteristics under load (Chen et al., 3 Oct 2025).
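
As a concrete illustration of this measurement convention, the sketch below times first-token emission against a streaming token iterator and aggregates the mean/P50/P99 statistics above. `fake_stream` and its delays are illustrative stand-ins, not any particular serving client's API.

```python
import statistics
import time


def measure_fted(stream_fn, prompt):
    """FTED for one request: wall-clock time from submission until the
    first streamed token arrives. `stream_fn` is any callable returning
    an iterator of output tokens (a stand-in for a real serving client)."""
    t_recv = time.perf_counter()            # request registered
    first_token = next(stream_fn(prompt))   # blocks through prefill + first decode
    t_emit1 = time.perf_counter()           # first output token produced
    return t_emit1 - t_recv


def fted_summary(samples):
    """Aggregate per-request FTED samples into mean, P50, and P99."""
    ordered = sorted(samples)
    p99 = ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
    return {"mean": statistics.mean(ordered),
            "p50": statistics.median(ordered),
            "p99": p99}


def fake_stream(prompt):
    time.sleep(0.05)                        # pretend prefill + first decode step
    yield "first-token"


print(fted_summary([measure_fted(fake_stream, "hi") for _ in range(20)]))
```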

2. Computational Origins of FTED

In transformer-based LLMs, FTED primarily arises from two sources:

  • Prompt Processing (Prefill): Encoding the full input prompt and constructing the key-value (KV) cache across the model’s layers and tokens. The compute cost scales with both model size and prompt length.
  • Decode Iteration for the First Token: Executing the initial autoregressive decode step that emits the first output token. This cost is small relative to prefill, except in highly synchronized data-parallel or expert-parallel systems (Tian et al., 18 Dec 2025, Lai et al., 3 Dec 2025).

A standard rough estimate for prefill compute in a transformer with parameter count $P_B$ is approximately $2P_B$ FLOPs per token, so the total for prompt length $L_p$ is roughly $2 P_B L_p$ FLOPs (Horton et al., 2024).
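
A worked instance of this estimate follows. It is a sketch: the 2-FLOPs-per-parameter rule of thumb, the peak-throughput figure, and the utilization factor are assumptions, not measurements from the cited work.

```python
def prefill_flops(num_params: float, prompt_tokens: int) -> float:
    # Rule-of-thumb forward-pass cost: ~2 FLOPs per parameter per token.
    return 2.0 * num_params * prompt_tokens


def prefill_seconds(num_params, prompt_tokens, peak_flops, mfu=0.4):
    # mfu: assumed fraction of peak throughput actually achieved.
    return prefill_flops(num_params, prompt_tokens) / (peak_flops * mfu)


# Example: 8B-parameter model, 4096-token prompt, ~1e15 FLOP/s accelerator.
print(f"{prefill_seconds(8e9, 4096, 1e15):.3f} s of prefill before any token")
```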

3. System-Level Determinants of FTED

3.1 Request Scheduling and Load Dynamics

Non-preemptive, First-Come-First-Served (FCFS) scheduling in conventional LLM systems leads to long queues under bursty arrivals and insufficient parallelism, inflating FTED. Device-side head-of-line blocking is a key source of delay in immediate-dispatch, non-synchronized scheduling (Tian et al., 18 Dec 2025). In complex architectures (e.g., Data-Parallel + Expert-Parallel clusters), immediate dispatch fragments global capacity and amplifies internal bottlenecks, producing average queuing delays on the order of $T_{\rm batch}$, the per-batch compute time.

3.2 Memory Management and I/O

KV-cache management severely impacts FTED due to the high overhead of moving large memory segments between device, host, or across nodes, particularly under request preemption or context switches. Inefficient memory handling can dominate FTED even when compute is sufficient (Chen et al., 3 Oct 2025).

3.3 Scheduling and Prefill Strategies

Chunk-based prefill, in which long prompts are segmented into small token chunks and interleaved with decoding stages, can maintain steady per-token output rates, but it imposes redundant expert-weight loads in Mixture-of-Experts (MoE) models, leading to excessive memory traffic and inflated FTED (Lee et al., 9 Oct 2025). Conversely, layered prefill partitions the model by layer group, reducing redundant loads and cutting TTFT by up to 70%, with associated energy savings.

4. Techniques for Reducing FTED

4.1 Buffer-Aware Preemptive Scheduling

TokenFlow introduces a buffer-aware, preemptive scheduling algorithm that dynamically prioritizes requests based on each request's unread token buffer and its estimated token-generation utility. Requests with sufficiently large pre-existing buffers can be safely preempted; their KV-caches are offloaded to host memory without stalling end-user consumption. This allows new arrivals to be served with minimal FTED, trading modest context-switch I/O against the risk of buffer exhaustion (Chen et al., 3 Oct 2025). A simplified sketch of the preemption decision follows the results below.

TokenFlow’s evaluation reported:

  • Mean FTED reduction of 48–53%
  • P99 FTED reduction of 68–80%
  • Effective throughput increases of up to 82.5% without degrading raw token output
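
The core decision rule can be rendered compactly. The sketch below is a minimal, hypothetical version of a buffer-slack test, not TokenFlow's actual implementation; the `Request` fields, the 2-second threshold, and the victim-selection rule are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    buffered_tokens: int   # tokens generated but not yet consumed by the client
    read_rate: float       # tokens/s at which the client drains its buffer


def buffer_slack(req: Request) -> float:
    """Seconds of already-generated output the client still has to read.
    A request with large slack can be paused without a visible stall."""
    return req.buffered_tokens / max(req.read_rate, 1e-9)


def pick_preemption_victim(running, min_slack_s=2.0):
    """Choose the running request with the most buffer slack, provided it
    exceeds a safety threshold; its KV-cache would then be offloaded to
    host memory so a new arrival can start prefill immediately."""
    victim = max(running, key=buffer_slack, default=None)
    if victim is not None and buffer_slack(victim) >= min_slack_s:
        return victim
    return None


running = [Request(1, 40, 10.0), Request(2, 5, 10.0)]   # 4 s vs 0.5 s of slack
print(pick_preemption_victim(running))                  # -> Request(rid=1, ...)
```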

4.2 Proactive and Efficient KV-Cache Management

Write-through KV-cache policies, chunked synchronous transfers, and overlapped eviction/load cycles hide transfer latency by aligning I/O with compute execution, minimizing the per-context-switch cost to a few tens of milliseconds. This approach ensures FTED gains do not undermine steady-state throughput or system reliability (Chen et al., 3 Oct 2025).
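
The overlap can be sketched with a background thread standing in for an asynchronous device-to-host copy engine; chunk counts and sleep durations below are placeholders, not measured transfer costs.

```python
import threading
import time


def overlapped_offload(kv_chunks, copy_chunk, decode_steps, decode_step):
    """Hide KV-cache transfer behind compute: a background thread streams
    cache chunks out (standing in for async device-to-host copies) while
    the main thread keeps executing decode iterations."""
    done = threading.Event()

    def drain():
        for chunk in kv_chunks:
            copy_chunk(chunk)      # chunked transfers, not one monolithic copy
        done.set()

    threading.Thread(target=drain, daemon=True).start()
    for _ in range(decode_steps):
        decode_step()              # compute proceeds; I/O overlaps with it
    done.wait()                    # context switch completes once both finish


chunks = [bytes(1 << 20)] * 8                       # eight 1 MiB placeholder chunks
overlapped_offload(
    chunks,
    copy_chunk=lambda c: time.sleep(0.002),         # pretend PCIe copy per chunk
    decode_steps=10,
    decode_step=lambda: time.sleep(0.002),          # pretend decode iteration
)
print("offload hidden behind compute")
```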

4.3 Staggered Batch Scheduling

Staggered Batch Scheduling (SBS) employs a scheduler-side batching window, releasing groups of waiting requests across the compute cluster in globally synchronized batches. This prevents device-side head-of-line blocking and parallelization bubbles, reducing average FTED by 30–40% and boosting throughput by 15–20% in large-scale deployments (Tian et al., 18 Dec 2025). Global allocation policies further optimize resource usage through load-aware balancing during both prefill and decode phases.
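
A minimal sketch of the batching-window behavior follows; the 10 ms window and the arrival trace are illustrative, and a production scheduler would additionally handle deadlines, priorities, and per-device routing.

```python
def staggered_batches(arrivals, window_s=0.010):
    """Instead of dispatching each request on arrival, hold arrivals for a
    batching window and release them as one globally synchronized batch.
    `arrivals` is an iterable of (arrival_time, request) sorted by time."""
    batch, deadline = [], None
    for t, req in arrivals:
        if deadline is None:
            deadline = t + window_s
        if t <= deadline:
            batch.append(req)
        else:
            yield batch                       # release the synchronized batch
            batch, deadline = [req], t + window_s
    if batch:
        yield batch


# Example: a burst collapses into one batch; a straggler forms the next.
arrivals = [(0.000, "a"), (0.004, "b"), (0.008, "c"), (0.025, "d")]
print(list(staggered_batches(arrivals)))      # [['a', 'b', 'c'], ['d']]
```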

4.4 Predictive Autoscaling with Token Velocity

TokenScale introduces Token Velocity—a proactive metric unifying prefill, network, and decode stage rates. Prefiller and decoder autoscalers monitor token arrival rates and provision resources such that FTED meets fine-grained SLOs. Convertible Decoders act as rapid-response buffers, executing prefill operations in sub-millisecond timescales under burst traffic (Lai et al., 3 Dec 2025). In production traces, TokenScale elevated TTFT SLO attainment (e.g., from 62% to 89% for Llama-8B on conversational traffic) while reducing GPU cost.
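
A toy rendering of velocity-driven provisioning appears below. It assumes, as a simplification not stated in the paper, that effective velocity is the minimum of the per-stage rates; the replica-count rule and its headroom factor are likewise hypothetical.

```python
import math


def token_velocity(prefill_rate, network_rate, decode_rate):
    """Unified stage rate in tokens/s. Simplifying assumption: the pipeline
    moves only as fast as its slowest stage."""
    return min(prefill_rate, network_rate, decode_rate)


def desired_replicas(arrival_tokens_per_s, per_replica_velocity, headroom=1.2):
    """Proactive scaling: provision enough replicas that expected token
    arrivals, plus burst headroom, stay under aggregate capacity."""
    return math.ceil(headroom * arrival_tokens_per_s / per_replica_velocity)


v = token_velocity(prefill_rate=9_000, network_rate=20_000, decode_rate=8_000)
print(desired_replicas(50_000, per_replica_velocity=v))  # -> 8 replicas
```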

4.5 Lightweight Inference via KV Prediction

KV Prediction uses a small auxiliary transformer and a learned predictor to generate an approximate KV-cache for the base model, replacing the expensive base-model prompt prefill with a far cheaper auxiliary pass and yielding 2–4× FTED reductions on edge devices at moderate accuracy loss (Horton et al., 2024).
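
Schematically, the predictor can be viewed as a per-layer mapping from the auxiliary model's narrow KV space into the base model's. The NumPy sketch below uses a random projection where the real system uses learned predictors; all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: a small auxiliary model yields a narrow KV cache;
# a per-layer linear predictor maps it into the base model's KV space.
L_p, d_aux, d_base = 128, 256, 1024
aux_kv = rng.standard_normal((L_p, d_aux))            # cheap auxiliary pass
W_pred = 0.01 * rng.standard_normal((d_aux, d_base))  # learned offline in practice

approx_base_kv = aux_kv @ W_pred  # substitutes for the base model's full prefill
# Decoding starts from approx_base_kv instead of running base-model prefill,
# trading some accuracy for the reported 2-4x faster first token.
print(approx_base_kv.shape)       # (128, 1024)
```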

4.6 Layered Prefill for MoE Models

Layered prefill partitions transformer layers into groups and prefetches them sequentially, drastically reducing redundant expert weight loads compared to chunked prefill. In empirical studies, this strategy reduced mean TTFT by 56% for long-context tasks and cut per-token energy by up to 22% (Lee et al., 9 Oct 2025).

5. Analytical Models and Trade-Offs

Analytical modeling of FTED in LLM serving decomposes queueing, prompt processing, data transfer, and decode latencies. For chunked vs. layered prefill in MoE models, the leading-order expert-weight-traffic terms are:

  • Chunked prefill (the selected experts are re-loaded once per chunk, across all layers):

$$T_{\rm chunked} \approx \frac{L_p}{C} \cdot \frac{N \, k \, S_E}{B}$$

  • Layered prefill (each layer group's experts are loaded once for the entire prompt):

$$T_{\rm layered} \approx \frac{N \, k \, S_E}{B}$$

where $N$ is the number of layers, $L_p$ is the prompt length, $k$ is the MoE expert selection multiplicity, $S_E$ is the expert parameter size, $C$ is the chunk size, and $B$ is the memory bandwidth (Lee et al., 9 Oct 2025).
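
Evaluating the two leading-order expressions side by side makes the per-chunk reload multiplier explicit; all parameter values in this sketch are illustrative rather than taken from the paper.

```python
def chunked_prefill_time(N, L_p, k, S_E, C, B):
    """Expert-weight traffic under chunked prefill: every chunk of C tokens
    re-loads the selected experts in all N layers."""
    n_chunks = -(-L_p // C)  # ceiling division
    return n_chunks * N * k * S_E / B


def layered_prefill_time(N, L_p, k, S_E, C, B):
    """Layered prefill loads each layer's experts once for the whole prompt,
    removing the per-chunk reload factor."""
    return N * k * S_E / B


# Example: 32 layers, 8192-token prompt, top-2 experts of 0.5 GB each,
# 512-token chunks, 1 TB/s memory bandwidth (all values illustrative).
args = dict(N=32, L_p=8192, k=2, S_E=0.5e9, C=512, B=1e12)
print(chunked_prefill_time(**args))  # 0.512 s: 16 chunks x 32 ms of weight loads
print(layered_prefill_time(**args))  # 0.032 s: each layer's experts loaded once
```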

Queueing analysis in distributed settings uses M/D/1 and M/D/$c$ (Erlang-C) models to capture wait times under different dispatch paradigms (Tian et al., 18 Dec 2025).
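
For the single-server case, the M/D/1 mean wait has a closed form (Pollaczek-Khinchine with deterministic service), which the snippet below evaluates; the numeric rates are illustrative.

```python
def md1_mean_wait(arrival_rate, service_rate):
    """Mean queueing delay of an M/D/1 queue:
    W_q = rho / (2 * mu * (1 - rho)), with rho = lambda / mu."""
    rho = arrival_rate / service_rate
    assert rho < 1.0, "queue is unstable at rho >= 1"
    return rho / (2.0 * service_rate * (1.0 - rho))


# Example: 8 requests/s against a server completing 10 prefills/s -> 0.2 s wait.
print(md1_mean_wait(8.0, 10.0))
```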

Trade-offs exist among FTED (TTFT), time-between-tokens (TBT), energy consumption, throughput, and system resource utilization. For example, smaller prefill chunk sizes improve TBT but exacerbate FTED through more frequent expert reloads (Lee et al., 9 Oct 2025).

6. Deployment Considerations and Best Practices

Lessons distilled from system studies and microbenchmarks:

  • Exploit slack in consumer read-rate vs. generation rate by building buffers for preemption (Chen et al., 3 Oct 2025).
  • Use predictive, fine-grained indicators like Token Velocity to trigger scaling and avoid reactive overshoot (Lai et al., 3 Dec 2025).
  • Schedule globally rather than immediately, leveraging batching windows to unify fragmented capacity and smooth synchronization overheads (Tian et al., 18 Dec 2025).
  • Balance working set breadth and preemption aggressiveness to optimize FTED without introducing output stutter or violating per-token SLOs (Chen et al., 3 Oct 2025).
  • In energy- and bandwidth-constrained environments, prefer scheduling along the transformer-layer axis versus the token axis to minimize memory traffic (Lee et al., 9 Oct 2025).

Critical deployment parameters (batching interval, buffer conservativeness, group sizing) require tight tuning for workload and hardware context to realize full FTED reductions without violating throughput, per-token latency, or service-level constraints.

7. Limitations and Future Directions

  • Linear predictors in KV Prediction have diminishing returns on deeper models due to growing error accumulation; future work may employ nonlinear mapping or cross-layer attention (Horton et al., 2024).
  • Scheduler tuning for batching intervals is sensitive to arrival variance and may briefly violate FTED SLOs under severe burstiness, even with adaptive controllers (Tian et al., 18 Dec 2025).
  • Memory management schemes must be aligned with hardware interconnect bandwidth to avoid shifting FTED bottlenecks elsewhere (e.g., PCIe, NVLink) (Lee et al., 9 Oct 2025).
  • Trade-offs between FTED and accuracy or energy require workflows that explicitly trace Pareto-optimal frontiers for specific use cases.

Minimizing FTED, while preserving throughput and per-token latency, demands a multi-dimensional co-design of scheduling, memory management, autoscaling, compute strategy, and resource-aware adaptation, as consistently demonstrated in state-of-the-art LLM serving research (Chen et al., 3 Oct 2025, Lee et al., 9 Oct 2025, Lai et al., 3 Dec 2025, Tian et al., 18 Dec 2025, Horton et al., 2024).
