Papers

Topics

Authors

Recent

View all

Assistant

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 33 tok/s

Gemini 2.5 Pro 51 tok/s Pro

GPT-5 Medium 24 tok/s Pro

GPT-5 High 26 tok/s Pro

GPT-4o 74 tok/s Pro

Kimi K2 188 tok/s Pro

GPT OSS 120B 362 tok/s Pro

Claude Sonnet 4.5 34 tok/s Pro

2000 character limit reached

Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters (2406.12335v2)

Published 18 Jun 2024 in cs.CL and cs.LG

Abstract: Scaling the context size of LLMs enables them to perform various new tasks, e.g., book summarization. However, the memory cost of the Key and Value (KV) cache in attention significantly limits the practical applications of LLMs. Recent works have explored token pruning for KV cache reduction in LLMs, relying solely on attention scores as a token importance indicator. However, our investigation into value vector norms revealed a notably non-uniform pattern questioning their reliance only on attention scores. Inspired by this, we propose a new method: Value-Aware Token Pruning (VATP) which uses both attention scores and the $ \ell_{1} $ norm of value vectors to evaluate token importance. Extensive experiments on LLaMA2-7B-chat and Vicuna-v1.5-7B across 16 LongBench tasks demonstrate that VATP outperforms attention-score-only baselines in over 12 tasks, confirming the effectiveness of incorporating value vector norms into token importance evaluation of LLMs.

Citations (8)

View on Semantic Scholar