Are all prompt tokens essential for first-token generation?

Determine whether all tokens in an input prompt are essential for predicting the first generated token during the prefilling stage of autoregressive, transformer-based large language model inference (the initial pass that computes the key–value cache for every prompt token).

Background

The paper focuses on improving time-to-first-token (TTFT) in LLM inference, which is dominated by the prefilling stage where key–value (KV) caches are computed for every token in long prompts. Since attention computation scales quadratically with prompt length, TTFT can become a significant bottleneck in user-facing latency.
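To see where the quadratic cost comes from, here is a minimal NumPy sketch of single-head causal prefill attention. The function name, weight matrices, and dimensions are illustrative assumptions, not taken from the paper; the point is only that prefill builds an n×n score matrix over the prompt.

```python
import numpy as np

def prefill_attention(x, Wq, Wk, Wv):
    """Single-head causal prefill over a prompt of n tokens (toy sketch).

    x: (n, d_model) prompt embeddings; Wq/Wk/Wv: (d_model, d) projections.
    Returns the attention output plus the K/V tensors that form the KV cache.
    """
    q = x @ Wq
    k = x @ Wk                                   # cached: keys for decoding
    v = x @ Wv                                   # cached: values for decoding
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                # (n, n): quadratic in prompt length
    scores[np.triu(np.ones((n, n), dtype=bool), 1)] = -np.inf  # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # row-wise softmax
    return w @ v, (k, v)

# Toy usage: doubling the prompt length quadruples the (n, n) score matrix,
# which is why prefilling dominates TTFT for long prompts.
n, d_model, d = 8, 16, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d)) for _ in range(3))
out, (k_cache, v_cache) = prefill_attention(x, Wq, Wk, Wv)
```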

Motivated by the observed sparsity of the attention that produces the next token, the authors ask whether it is necessary to process every prompt token just to generate the first one. They propose LazyLLM, which selectively computes KV entries only for tokens identified as important and defers the rest, testing the hypothesis that many prompt tokens are not essential for the immediate next-token prediction.
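The selective-KV idea can be sketched as follows. This is a simplified illustration, not the paper's exact algorithm: the selection signal (attention paid by the last prompt token), the `keep_ratio` hyperparameter, and the function names are all assumptions made here; LazyLLM itself prunes progressively across layers and can revive deferred tokens in later decoding steps.

```python
import numpy as np

def select_important_tokens(attn_from_last, keep_ratio=0.3):
    """Keep the tokens that receive the most attention from the last prompt
    token. Both the criterion and keep_ratio are illustrative assumptions."""
    n = attn_from_last.shape[0]
    k = max(1, int(round(keep_ratio * n)))
    return np.sort(np.argsort(attn_from_last)[-k:])   # restore prompt order

def lazy_kv_prefill(x, Wk, Wv, attn_from_last, keep_ratio=0.3):
    """Compute K/V only for the selected tokens; defer the rest.

    Deferred tokens' KV entries can be computed later if a future decoding
    step turns out to need them (the 'lazy' part of the scheme).
    """
    keep = select_important_tokens(attn_from_last, keep_ratio)
    deferred = np.setdiff1d(np.arange(x.shape[0]), keep)
    k_cache = x[keep] @ Wk
    v_cache = x[keep] @ Wv
    return (k_cache, v_cache), keep, deferred
```

Only the selected rows pass through the K/V projections and on to subsequent layers, shrinking the effective prompt length and hence the quadratic attention cost during prefilling, which is the source of the TTFT savings the paper targets.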

References

An open question remains whether all prompt tokens are essential for generating the first token.

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference (arXiv:2407.14057 - Fu et al., 2024) in Abstract