FoldAct: Efficient RL Context Folding
- FoldAct is a unified framework for efficient context folding in long-horizon reinforcement learning, enabling policy-driven summarization to manage growing interaction histories.
- It introduces separated loss computation and full-context consistency loss to address challenges like gradient dilution, self-conditioning collapse, and training cost explosion.
- Selective segment training reduces computation by up to 80.7% while maintaining performance, as validated by significant improvements in memory efficiency and training speed.
FoldAct is a unified framework for efficient and stable context folding in long-horizon reinforcement learning (RL) agents based on LLMs. It addresses the scalability and stability limitations inherent in standard RL paradigms when the agent’s interaction context grows arbitrarily large. The core innovations of FoldAct comprise separated loss computation for summary and action tokens, a full-context consistency loss to regularize observation distribution shifts, and selective segment training to reduce computational cost. FoldAct enables policy-driven context summarization while maintaining stable training dynamics and computational tractability in domains such as information retrieval, open-ended search, and resource-constrained dialogue systems (Shao et al., 28 Dec 2025).
1. Context Folding in Long-Horizon RL Agents
In conventional RL setups for LLM-based agents, each timestep $t$ involves observing the entire interaction history $h_t$, generating an action $a_t$, receiving an environment response $o_t$, and appending both to the context. This results in unbounded context growth, yielding expensive inference and prohibitive on-policy training costs. Context folding compresses this history by replacing prior chunks with policy-generated summary tokens, yielding a compressed observation $\tilde{h}_t$ prior to action $a_t$. Crucially, because summaries are generated by the policy itself, future observations become policy-dependent, violating the standard RL assumption of a stationary observation distribution. This setting induces three primary challenges: gradient dilution for summary tokens, self-conditioning collapse through summary-policy feedback, and training cost explosion due to unique observation sequences at every turn (Shao et al., 28 Dec 2025).
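The folding mechanism described above can be sketched minimally as follows; the function names and the chunking scheme are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of context folding: when the interaction history exceeds
# a turn budget, the oldest chunk is replaced by a policy-generated summary,
# keeping the observation bounded. `summarize` is a hypothetical stand-in for
# the summary head of the same policy LLM that produces actions.

def summarize(chunk):
    """Stand-in for policy-driven summarization of a history chunk."""
    return f"<summary of {len(chunk)} turns>"

def fold_context(history, max_turns=4, fold_size=2):
    """Repeatedly replace the oldest `fold_size` entries with one summary
    until the history fits within `max_turns`. Earlier summaries may be
    re-folded, so compression is recursive."""
    while len(history) > max_turns:
        folded, history = history[:fold_size], history[fold_size:]
        history.insert(0, summarize(folded))
    return history

history = [f"turn-{i}" for i in range(6)]
print(fold_context(history))
```

Because the summary string depends on the policy, the next observation the agent sees is itself policy-dependent, which is exactly the non-stationarity the consistency loss later addresses.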
2. Challenges in Policy-Driven Context Folding
FoldAct identifies and explicitly addresses three fundamental challenges induced by policy-dependent, non-stationary observations:
- Gradient Dilution (C1): Standard policy gradient approaches treat summary and action tokens identically. When summary tokens are a small fraction of the output, they receive proportionally less training signal, impairing summary policy learning.
- Self-Conditioning Collapse (C2): As policy updates change summary outputs, future observations are altered, producing a feedback loop where the training distribution continually shifts, potentially leading to collapse.
- Training Cost Explosion (C3): With unique contexts at each turn, on-policy algorithms such as PPO require distinct forward passes per step, causing combinatorially growing computational and memory costs (Shao et al., 28 Dec 2025).
3. Core Innovations: Loss Decomposition and Consistency Regularization
FoldAct introduces three mechanisms to systematically resolve the above challenges:
- Separated Loss Computation: The policy gradient is decomposed into separate surrogate losses $\mathcal{L}_{\text{sum}}$ for summary tokens and $\mathcal{L}_{\text{act}}$ for action tokens. Let $y_{1:T}$ be the generated tokens, with binary masks $m^{\text{sum}}$ and $m^{\text{act}}$ marking summary and action positions. The PPO objective is optimized independently for each category using masked importance-sampling ratios and category-specific advantages. For summaries, rewards include a hallucination penalty and a retention bonus awarded when summary tokens contribute to task success.
- Full-Context Consistency Loss: To break the self-conditioning feedback loop, a KL-regularization term is applied between the policy distribution conditioned on the compressed context and the one conditioned on the full context. This regularizes the observation distribution shift and restores approximate stationarity.
- Selective Segment Training: Training efficiency is enhanced by stochastically subsampling a subset of timesteps $\mathcal{S}$ with retention probability $p$, evaluating losses only at turns in $\mathcal{S}$. This reduces memory requirements and forward passes by up to 80.7% while preserving policy performance (Shao et al., 28 Dec 2025).
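The first two mechanisms can be sketched together: a PPO surrogate averaged only over masked token positions, plus a KL term between next-token distributions under the folded and full contexts. All tensor names, shapes, and the weight value are illustrative assumptions, not the paper's code:

```python
import numpy as np

def masked_ppo_loss(ratios, advantages, mask, clip_eps=0.2):
    """Clipped PPO surrogate averaged only over masked token positions,
    so summary and action tokens each get an undiluted training signal."""
    clipped = np.clip(ratios, 1 - clip_eps, 1 + clip_eps)
    per_token = -np.minimum(ratios * advantages, clipped * advantages)
    return (per_token * mask).sum() / max(mask.sum(), 1)

def kl_consistency(logp_folded, logp_full):
    """KL(full || folded) between next-token distributions under the
    compressed and full contexts; penalizes distribution shift."""
    p_full = np.exp(logp_full)
    return (p_full * (logp_full - logp_folded)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
T, V = 8, 5                               # tokens, vocabulary size (toy)
ratios = rng.uniform(0.8, 1.2, T)         # importance-sampling ratios
adv = rng.normal(size=T)                  # advantages (shared here for brevity)
sum_mask = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=float)  # summary tokens
act_mask = 1.0 - sum_mask                                   # action tokens

# Toy log-probabilities under full vs. folded context.
logits_full = rng.normal(size=(T, V))
logits_fold = logits_full + 0.01 * rng.normal(size=(T, V))
logp_full = logits_full - np.log(np.exp(logits_full).sum(-1, keepdims=True))
logp_fold = logits_fold - np.log(np.exp(logits_fold).sum(-1, keepdims=True))

lam = 0.1                                 # consistency weight (hypothetical)
total = (masked_ppo_loss(ratios, adv, sum_mask)
         + masked_ppo_loss(ratios, adv, act_mask)
         + lam * kl_consistency(logp_fold, logp_full))
print(float(total))
```

Separating the two masked losses means the handful of summary tokens is averaged against its own denominator rather than being swamped by the far more numerous action tokens.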
4. Training Algorithm and Computational Workflow
The FoldAct framework is instantiated via the following high-level protocol:
- Initialize policy parameters $\theta$.
- Execute a rollout, generating the interaction history and retaining per-turn tuples of compressed observation, action, and reward.
- For a sampled subset of turns $\mathcal{S}$, compute advantages, PPO ratios, separated losses, and the KL consistency regularizer as detailed in the training pseudocode.
- Update policy parameters via gradient descent on the aggregated total loss $\mathcal{L} = \mathcal{L}_{\text{act}} + \mathcal{L}_{\text{sum}} + \lambda\,\mathcal{L}_{\text{cons}}$, where $\lambda$ modulates the regularization strength.
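The selective-segment step of this protocol can be sketched in a few lines; `select_segments` is a hypothetical helper, not the paper's code:

```python
import random

def select_segments(num_turns, p, rng):
    """Keep each turn independently with retention probability p; losses are
    then evaluated only at the kept turns, skipping forward passes for the
    rest. Falls back to one random turn so the batch is never empty."""
    kept = [t for t in range(num_turns) if rng.random() < p]
    return kept or [rng.randrange(num_turns)]

rng = random.Random(0)
segments = select_segments(num_turns=20, p=0.2, rng=rng)
print(segments)  # only these turns contribute to the loss this update
```

With retention probability $p$, the expected number of per-turn forward passes drops to a fraction $p$ of the full rollout, which is the source of the reported reduction in training cost.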
The compressor replaces folded segments of the history with the latest summary. Empirically, FoldAct with subsampled training and full-context consistency achieves dramatic reductions in peak memory (84.9 GB vs. >441 GB) and training latency (97.8 s/step vs. 4,846 s/step), a roughly 50× wall-clock speedup (Shao et al., 28 Dec 2025).
5. Empirical Performance and Benchmark Results
FoldAct’s effectiveness is validated on both retrieval-augmented generation (RAG) and web search environments:
- Local RAG: On HotpotQA, FoldAct-7B with the consistency loss achieves F1/EM scores of 38.5/29.5, outperforming RL and few-shot baselines (best prior EM: 22.7 and 27.5, respectively). On PopQA, it scores 32.9/29.0 against a best prior EM of 27.5.
- Web Search: FoldAct-7B with the consistency loss matches or outperforms larger agents (32B-scale, e.g., ASearcher-Web-QwQ, GPT-4.1-mini) across WebWalker, GAIA, BrowseComp, and XBench-DeepSearch.
- Stability: Excluding the full-context consistency loss precipitates training instability, with KL divergence and output length becoming unbounded by step 173, whereas including it yields stable loss and output statistics throughout (Shao et al., 28 Dec 2025).
6. Significance, Limitations, and Outlook
FoldAct is the first RL framework to explicitly treat policy-generated summary tokens as distinct outputs that compress context while directly shaping the agent’s future observation space. By isolating gradient flow to summaries, regularizing context compression against full-history policy distributions, and sub-sampling training segments, FoldAct achieves stable, efficient learning for long-horizon LLM agents, with empirically demonstrated speedups and competitive benchmarks relative to model scale (Shao et al., 28 Dec 2025). This architecture suggests expanding RL paradigms to fully account for non-stationary context-transforming actions, and provides scalable foundations for information-intensive, open-ended agent deployments where context management is essential. A plausible implication is the applicability of FoldAct principles to broader context-adaptive models beyond dialogue and search, provided future work extends consistency regularization and loss decomposition to additional forms of interaction summarization.