RecaLLM: Robust Retrieval for Long Contexts
- RecaLLM is a class of post-trained large language models that alternates chain-of-thought reasoning with evidence copying to counteract the lost-in-thought phenomenon.
- It employs constrained decoding to enforce a strict alternation between reasoning and retrieval, so that retrieval stays reliable even when the context is very long.
- Empirical results show that RecaLLM achieves state-of-the-art long-context retrieval accuracy, scaling to context lengths up to 128K tokens despite training on windows of at most 10K tokens.
RecaLLM denotes a class of post-trained LLMs designed to explicitly interleave chain-of-thought (CoT) reasoning with robust, verifiable in-context retrieval. The approach targets the “lost-in-thought” phenomenon, wherein multi-step reasoning degrades the model’s ability to locate evidence in a long context and reproduce it verbatim. By enforcing alternation between reasoning and copying of evidence via constrained decoding, RecaLLM achieves state-of-the-art long-context performance with negligible computational overhead and minimal dependence on extremely long training samples. It scales to context lengths up to 128K tokens despite training on windows of at most 10K tokens (Whitecross et al., 10 Apr 2026).
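One way to picture the constrained copying step is as a mask over the next-token choice: while the model is inside a retrieval span, only tokens that extend a verbatim match in the context are permitted. The sketch below is a minimal illustration under that assumption, not the authors' implementation. The function names (`allowed_next_tokens`, `constrained_argmax`) and the toy token IDs are hypothetical, and how RecaLLM opens and closes retrieval spans is not specified here.

```python
from typing import List, Sequence, Set


def allowed_next_tokens(context: Sequence[int], copied: Sequence[int]) -> Set[int]:
    """Return every token that extends `copied` into a verbatim span of `context`.

    An empty result means the copied span can only end here (for example,
    because it already reaches the end of the context).
    """
    n = len(copied)
    target = list(copied)
    allowed: Set[int] = set()
    # Scan each position where the copied prefix could start and still be
    # followed by at least one more context token.
    for start in range(len(context) - n):
        if list(context[start:start + n]) == target:
            allowed.add(context[start + n])
    return allowed


def constrained_argmax(logits: List[float], allowed: Set[int]) -> int:
    """Pick the highest-scoring token among `allowed`, ignoring all others."""
    if not allowed:
        raise ValueError("no token can extend the copied span; close the span instead")
    return max(allowed, key=lambda tok: logits[tok])


# Toy example (assumed data): token 7 occurs twice in the context, followed by 9 and by 3.
context = [5, 7, 9, 7, 3]
candidates = allowed_next_tokens(context, copied=[7])

# The unconstrained favourite is token 5, but it cannot follow 7 in the context,
# so the constrained choice falls back to the best permitted token.
logits = [0.0] * 10
logits[5], logits[3], logits[9] = 2.0, 0.9, 0.5
next_token = constrained_argmax(logits, candidates)
```

In this toy setting the mask costs one scan of the context per copied token. A production system would more likely use a suffix automaton or a similar index, which is one plausible reading of the "negligible computational overhead" claim, though the source does not say which structure is used.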
1. Motivation: The Lost-in-Thought Phenomenon
RecaLLM was introduced to address a key bottleneck observed in long-context LLMs: as reasoning traces lengthen, faithful in-context retrieval performance deteriorates substantially. This “lost-in-thought” effect is quantified as a stark drop in retrieval accuracy after any sequence of reasoning tokens, even when the retrieval task that follows would be trivial in isolation:
- Two quantities are compared: the accuracy of a direct key-value retrieval, and the accuracy when the same retrieval is requested after a reasoning sequence of a given length.
- Empirically, for several open-source 7–8B parameter models (Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, ProLong-8B-512K, etc.), retrieval accuracy fell from ~80% to ~40% after only a short CoT trace at 4K context, and from ~25% to ~5% at 128K [(Whitecross et al., 10 Apr 2026), Table 1].
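To make the size of this effect concrete, the comparison can be sketched as an exact-match accuracy computed under each condition, followed by the absolute and relative drop. This is a minimal sketch: the helper names are hypothetical, exact string match is an assumed scoring rule, and the inputs are the approximate figures quoted above rather than the paper's raw numbers.

```python
from typing import Sequence, Tuple


def exact_match_accuracy(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions equal to the gold value (whitespace-trimmed)."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("predictions and golds must be non-empty and equally long")
    hits = sum(p.strip() == g.strip() for p, g in zip(predictions, golds))
    return hits / len(golds)


def accuracy_drop(direct: float, after_reasoning: float) -> Tuple[float, float]:
    """Return (absolute drop, drop relative to direct accuracy)."""
    absolute = direct - after_reasoning
    relative = absolute / direct if direct > 0 else 0.0
    return absolute, relative


# Approximate figures from the text: ~80% -> ~40% at 4K, ~25% -> ~5% at 128K.
abs_4k, rel_4k = accuracy_drop(0.80, 0.40)      # about 40 points, half of direct accuracy
abs_128k, rel_128k = accuracy_drop(0.25, 0.05)  # about 20 points, four fifths of direct accuracy
```

Read this way, the absolute drop is smaller at 128K only because direct accuracy is already low there; relative to the direct baseline, the loss is larger at the longer context.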
Injection studies confirmed that even forcibly re-exposing the model