- The paper demonstrates that preventing memory overflow in recurrent LLMs significantly improves long-context retrieval accuracy.
- It introduces OPRM, a training-free strategy that splits contexts into fixed-size chunks to mitigate recurrent memory limitations.
- Empirical results on benchmarks like LongBench show that OPRM extends context lengths and outperforms similar-sized transformer models for very long inputs.
This paper, "Overflow Prevention Enhances Long-Context Recurrent LLMs" (2505.07793), investigates a key limitation of recurrent LLMs like Mamba and RWKV when processing very long contexts: the fixed capacity of their recurrent memory leads to information "overflows" and degradation in performance, even when these models are trained on long sequences.
The authors hypothesize that this fixed memory capacity is the primary bottleneck preventing recurrent models from fully utilizing long contexts. They demonstrate the problem with the Associative Recall (AR) task, in which models must recall values associated with keys presented earlier in the context. Experiments show that as the number of facts (key-value pairs) in the context increases, the retrieval accuracy of leading recurrent LLMs (such as Falcon-Mamba-Inst-7B) drops sharply, exhibiting overflow-like behavior even at relatively short input lengths (1200 tokens). The observation is further validated in a controlled setup in which 2-layer Mamba models are trained: increasing the hidden state size helps but does not prevent the overflow entirely.
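To make the setup concrete, here is a minimal sketch of how a synthetic AR prompt could be constructed; the exact prompt format and vocabulary used in the paper's experiments are not reproduced here, so `make_ar_prompt` and the "key/value" phrasing are illustrative assumptions.

```python
import random

def make_ar_prompt(num_facts: int, seed: int = 0) -> tuple[str, str]:
    """Build a synthetic key-value context plus a query about one of the keys."""
    rng = random.Random(seed)
    pairs = [(f"key{i}", str(rng.randint(100, 999))) for i in range(num_facts)]
    facts = ". ".join(f"{k} is {v}" for k, v in pairs)
    key, value = rng.choice(pairs)
    return f"{facts}. What is {key}?", value

prompt, expected = make_ar_prompt(num_facts=50)
# As num_facts grows, a fixed-size recurrent state must hold more associations,
# which is where the overflow-like drop in retrieval accuracy appears.
```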
To address this, the paper proposes Overflow Prevention for Recurrent Models (OPRM), a simple, training-free inference strategy. OPRM splits the potentially very long context into smaller chunks of fixed length and, during a "speculative prefill" phase, processes each chunk independently and in parallel, with each branch seeing the prompt's prefix, one context chunk, and the suffix containing the query; this yields a recurrent state and the output distribution of the first decoding token for each chunk. In the "selective decoding" phase, OPRM selects the most relevant chunk according to a predefined criterion (either minimizing the entropy of that output distribution or maximizing the probability of the query tokens in the suffix), and the model then performs standard auto-regressive decoding conditioned only on the state derived from the selected chunk. An optional "IDK (I Don't Know) Filter" discards chunks that confidently predict an error token, preventing the selection of irrelevant segments.
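A minimal sketch of the two phases, assuming a HuggingFace-style causal LM (e.g., a Mamba checkpoint) and the minimum-entropy criterion; `oprm_generate`, the chunking details, and the greedy `generate` call are illustrative assumptions rather than the authors' implementation, which runs the branches as a single parallel batch and reuses the selected branch's recurrent state.

```python
import torch

@torch.no_grad()
def oprm_generate(model, tokenizer, prefix, context, suffix,
                  chunk_len=2000, max_new_tokens=64):
    """Speculative prefill over context chunks, then decode from the best branch."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids[0]
    ctx_ids = tokenizer(context, add_special_tokens=False, return_tensors="pt").input_ids[0]
    suffix_ids = tokenizer(suffix, add_special_tokens=False, return_tensors="pt").input_ids[0]
    chunks = [ctx_ids[i:i + chunk_len] for i in range(0, len(ctx_ids), chunk_len)]

    best_entropy, best_branch = float("inf"), None
    for chunk in chunks:  # the paper runs these branches in parallel as one batch
        # Each branch sees [prefix, one context chunk, suffix (the query)].
        branch = torch.cat([prefix_ids, chunk, suffix_ids]).unsqueeze(0)
        probs = model(branch).logits[0, -1].softmax(-1)   # first-token distribution
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
        if entropy < best_entropy:                        # minimum-entropy criterion
            best_entropy, best_branch = entropy, branch

    # Selective decoding: continue auto-regressively from the selected branch only.
    # (A real implementation would reuse that branch's recurrent state instead of
    # re-running the prompt inside `generate`.)
    out = model.generate(best_branch, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0, best_branch.shape[1]:], skip_special_tokens=True)
```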
OPRM offers practical advantages:
- Mitigates Overflows: By processing smaller chunks, the amount of information the model's fixed state must encode at any given time is limited, preventing the overflow observed in the baseline.
- Efficient Computation: For prompts where the context is much longer than the prefix and suffix (|C| ≫ |P|, |S|), OPRM can reduce prefill complexity. If the context is known beforehand, the states for each branch [P, Ci] can be precomputed, making real-time query processing very efficient (O(b·|S|) sequential prefill, where b is the number of chunks); a rough accounting follows this list.
- Context Extension: The method naturally extends the model's usable context length beyond what it was trained on, simply by creating more chunks.
- Memory-Recall Tradeoff: A single hyperparameter, the chunk size (L), allows balancing the risk of overflow (larger L) with the number of states to manage (smaller L implies more chunks and states).
- RAG Compatibility: The structure fits well with Retrieval-Augmented Generation (RAG) settings, where retrieving the most relevant chunk before decoding is a core idea.
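As a rough accounting of the efficiency point above (an illustration using the quantities already named, not additional results from the paper): with b ≈ ⌈|C|/L⌉ chunks, the baseline must prefill the full prompt sequentially in O(|P| + |C| + |S|), whereas OPRM's speculative prefill costs O(|P| + L + |S|) per branch and the b branches can run in parallel; if the states for each [P, Ci] are cached ahead of time, a new query only incurs the O(b·|S|) suffix prefill, which is what makes the RAG-style precomputation attractive.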
The paper evaluates OPRM on several benchmarks:
- Zero-Shot Associative Recall: OPRM effectively eliminates the performance drop seen in the baseline recurrent model as the number of facts increases, maintaining high accuracy.
- LongBench and LongBench v2: Applying OPRM significantly improves the performance of various recurrent LLMs (Falcon-Mamba, Falcon3-Mamba, RecurrentGemma, RWKV6) across diverse long-context tasks, with the gains most pronounced at longer context lengths (4k+ tokens). On the challenging LongBench v2, Falcon3-Mamba-Inst-7B + OPRM achieves a state-of-the-art score (30.8) among 7B-parameter models, and is competitive with or outperforms equivalently sized Transformer models, especially on contexts longer than 32k words.
- Context Extension Tasks (Needle in a Haystack, Document Retrieval): OPRM applied to Mamba models trained on short sequences (2k tokens) demonstrates significant context extension capabilities, enabling effective processing of sequences up to 512k tokens in Needle in a Haystack and 50k tokens in Document Retrieval. It often outperforms dedicated context extension methods for recurrent models, suggesting that preventing overflows is a critical factor in length extrapolation.
Ablation studies confirm the effectiveness of the minimum-entropy criterion for chunk selection, showing that it outperforms both random selection and the maximum query-probability criterion. The IDK Filter is also shown to be beneficial, particularly for Document QA tasks and longer contexts, because it prevents the selection of chunks for which the model is confident the answer is not present. The method is relatively robust to the choice of chunk size L.
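For illustration, a small sketch of how the minimum-entropy criterion and the IDK Filter could be combined over the per-chunk first-token distributions produced by the speculative prefill; the actual IDK marker and prompt instruction are assumptions, as is the `select_chunk` helper.

```python
import torch

def select_chunk(first_token_logits, tokenizer, idk_text=" IDK"):
    """Pick a chunk by minimum entropy, after dropping confident-IDK branches."""
    idk_id = tokenizer(idk_text, add_special_tokens=False).input_ids[0]  # assumed marker
    entropies, keep = [], []
    for logits in first_token_logits:            # one [vocab]-sized tensor per chunk
        probs = logits.softmax(-1)
        entropies.append(-(probs * probs.clamp_min(1e-12).log()).sum().item())
        # IDK Filter: discard chunks whose top first-token prediction is the marker.
        keep.append(probs.argmax().item() != idk_id)
    candidates = [i for i, k in enumerate(keep) if k] or list(range(len(entropies)))
    return min(candidates, key=lambda i: entropies[i])   # minimum-entropy criterion
```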
The findings raise questions about whether recurrent models, as currently designed, truly exploit long-range dependencies across widely separated parts of the input, given that a single-chunk approach achieves strong performance.
Limitations include the current lack of cross-chunk processing in OPRM and its reliance on the base model's existing capabilities (e.g., the IDK filter might benefit from model-specific fine-tuning).
The paper provides implementation details, including specific model checkpoints and evaluation procedures, to support reproducibility. The code is based on standard HuggingFace and Mamba implementations.