FoldGRPO: RL for Context Folding
- FoldGRPO is an end-to-end RL framework that dynamically optimizes context folding for long-horizon LLM agents by managing memory through procedural task decomposition.
- It introduces specialized branch and return actions to effectively collapse lengthy interaction histories while preserving essential context with concise summaries.
- Empirical evaluations show that FoldGRPO significantly improves finish rates and scope accuracy compared to traditional methods while compressing context by over 90%.
FoldGRPO is an end-to-end reinforcement learning framework designed for learning context folding in long-horizon LLM agents. It enables procedural task decomposition via dynamic branching and context folding, allowing LLM agents to manage their working memory efficiently during multi-turn, tool-augmented reasoning and acting sequences. FoldGRPO integrates specialized process rewards and group-relative policy optimization to jointly optimize folding behavior and task performance, yielding major improvements on complex, memory-intensive reasoning tasks compared to existing methods (Sun et al., 13 Oct 2025).
1. Formalization of Context Folding
FoldGRPO operates in the ReAct-style multi-turn agent setting, where an agent interacts with an environment via actions (text, reasoning, or tool calls) and receives observations. The core challenge is the unbounded accumulation of interaction history, which rapidly exceeds the LLM’s context window on long-horizon tasks.
FoldGRPO introduces two special management actions:
- `branch(description, prompt)`: Forks a sub-trajectory in a new, separate context for a specified subtask.
- `return(message)`: Collapses ("folds") all intermediate steps of the current branch, retains only the concise summary in `message`, and appends it to the parent context.
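Let $\mathcal{F}$ denote the context manager that, given the full interaction history $\tau_{<i}$ up to step $i$, folds away all segments between matching branch/return tool calls. The FoldGRPO agent's policy over a trajectory $\tau = (a_1, o_1, \dots, a_H, o_H)$ therefore factorizes as:

$$
p_\theta(\tau \mid q) = \prod_{i=1}^{H} \pi_\theta\big(a_i \mid q,\ \mathcal{F}(\tau_{<i})\big),
$$

where $q$ is the agent's initial query or prompt, $a_i$ and $o_i$ are the action and observation at step $i$, and the folded context $\mathcal{F}(\tau_{<i})$ ensures a bounded "active token budget" at all times, regardless of the total interaction length.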
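The folding operation can be illustrated with a minimal Python sketch. The `Step` record and the `fold` function are hypothetical names introduced here, and the sketch assumes that the branch call itself stays visible in the parent context next to the returned summary; the source does not specify the exact data layout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    kind: str  # "action", "observation", "branch", or "return" (hypothetical labels)
    text: str  # raw content; for "return" this is the summary message


def fold(history: List[Step]) -> List[Step]:
    """Return a folded view of `history`.

    Every completed branch is reduced to its branch call plus the summary
    carried by the matching return call. A branch that is still open is
    left fully visible, since it is the active working context.
    """
    folded: List[Step] = []
    branch_start: Optional[int] = None  # index in `folded` of the open branch call
    for step in history:
        if step.kind == "branch" and branch_start is None:
            branch_start = len(folded)
            folded.append(step)
        elif step.kind == "return" and branch_start is not None:
            # Drop everything produced inside the branch, keep only the summary.
            del folded[branch_start + 1:]
            folded.append(Step("observation", step.text))
            branch_start = None
        else:
            folded.append(step)
    return folded


history = [
    Step("action", "plan"),
    Step("branch", "check paper X"),
    Step("action", "search"),
    Step("observation", "long search result"),
    Step("return", "X is eligible"),
    Step("action", "answer"),
]
folded = fold(history)
print([s.text for s in folded])
```

In this toy run the two intermediate branch steps disappear from the folded view, while the total history keeps growing; only the folded view would be fed back to the policy.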
2. Agent Architecture and Management of Folding
The FoldGRPO agent distinguishes between two operational states:
- Planning state: The agent operates on the main-thread context, selects reasoning/tool actions, or chooses to branch.
- Execution state: Within a branch, further branching is disabled until a return action occurs. Upon return, the branch’s summary is appended to its spawn-point in the parent context.
Implementation details:
- Each branch maintains a separate KV-cache for efficient context rollback on folding.
- Once a branch is folded, all of its intermediate tokens are removed from the LLM's current context and replaced by a concise summary message.
This architecture guarantees that no step of the agent ever exceeds its active memory budget, even as the total (folded) trajectory grows large.
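The two operational states can be sketched as a small state machine. The class and method names below (`FoldingAgent`, `open_branch`, `return_to_main`) are hypothetical, and contexts are modeled as plain lists of strings rather than token sequences or KV-caches; this illustrates the transition rules only and is not the authors' implementation.

```python
from typing import List, Optional


class FoldingAgent:
    """Tracks the planning/execution state and the two contexts."""

    def __init__(self, query: str) -> None:
        self.main: List[str] = [query]           # main-thread context
        self.branch: Optional[List[str]] = None  # branch context, None while planning

    @property
    def state(self) -> str:
        return "planning" if self.branch is None else "execution"

    def act(self, text: str) -> None:
        """Record an ordinary reasoning/tool step in the active context."""
        target = self.main if self.branch is None else self.branch
        target.append(text)

    def open_branch(self, prompt: str) -> None:
        """Planning -> execution. Nested branching is rejected."""
        if self.branch is not None:
            raise RuntimeError("branching is disabled in the execution state")
        self.main.append(f"branch: {prompt}")
        self.branch = [prompt]

    def return_to_main(self, message: str) -> None:
        """Execution -> planning. Only the summary survives in the parent."""
        if self.branch is None:
            raise RuntimeError("return is only valid inside a branch")
        self.main.append(f"summary: {message}")
        self.branch = None  # intermediate branch steps are discarded


agent = FoldingAgent("Which papers meet the criteria?")
agent.open_branch("Verify eligibility criteria for Paper X")
agent.act("search(...)")
agent.act("read(...)")
agent.return_to_main("Paper X meets all criteria")
print(agent.state, agent.main)
```

Discarding `self.branch` on return plays the role of the context rollback: the main thread grows by one summary line per branch, independent of how many steps the branch took.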
3. FoldGRPO RL Objective, Signal Design, and Optimization
FoldGRPO extends Group Relative Policy Optimization (GRPO) by integrating process rewards and folding events into the RL optimization. The learning loop is characterized by:
- State: The context may be either (i) the active main thread (bounded by the active token budget) or (ii) an active branch context.
- Actions: Ordinary LLM tokens (reasoning/tool calls), `branch`, and `return`.
- Rewards:
  - Token-level process rewards $Q_{i,t}$, penalizing overlong main contexts, out-of-scope branching (using an LLM judge), and failed tool calls.
  - A final trajectory-level verification reward $R_i$, e.g., pass@1 or unit test pass.
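The policy-gradient objective takes the clipped, group-relative form:

$$
\mathcal{J}_{\text{FoldGRPO}}(\theta) = \mathbb{E}_{q,\ \{\tau_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{\sum_{i=1}^{G} |\tau_i|} \sum_{i=1}^{G} \sum_{t=1}^{|\tau_i|} \min\Big( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\ 1-\epsilon_{\text{low}},\ 1+\epsilon_{\text{high}}\big)\, \hat{A}_{i,t} \Big) \right],
$$

where, for the $i$-th of $G$ trajectories sampled for query $q$, with final reward $R_i$ and process reward $Q_{i,t}$ at token $t$, the advantage estimator is

$$
\hat{A}_{i,t} = \frac{\operatorname{clip}(R_i + Q_{i,t},\ 0,\ 1) - \operatorname{mean}\big(\{R_j\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{R_j\}_{j=1}^{G}\big)},
$$

and

$$
r_{i,t}(\theta) = \frac{\pi_\theta\big(\tau_{i,t} \mid q,\ \mathcal{F}(\tau_{i,<t})\big)}{\pi_{\theta_{\text{old}}}\big(\tau_{i,t} \mid q,\ \mathcal{F}(\tau_{i,<t})\big)}
$$

is the per-token importance ratio, computed on the folded context $\mathcal{F}(\cdot)$. Only LLM-generated tokens are used for gradient updates; tool-observation tokens are masked out. No separate critic is required due to the group-relative advantage structure.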
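A group-relative advantage with token-level process rewards, followed by a masked clipped surrogate, can be sketched in plain Python as follows. Several choices here are assumptions rather than details confirmed by the text: the shaped reward is clipped to [0, 1] before normalization, the population standard deviation is used, a zero standard deviation falls back to 1.0, the clip range is 0.2 on both sides, and the sum is normalized by the total token count. The function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def fold_grpo_advantages(
    final_rewards: List[float],          # one R_i per trajectory in the group
    process_rewards: List[List[float]],  # Q_{i,t} per trajectory and token
) -> List[List[float]]:
    """Token-level advantages normalized by group statistics of R_i."""
    mu = mean(final_rewards)
    sigma = pstdev(final_rewards)  # assumption: population std
    sigma = sigma if sigma > 0 else 1.0  # assumption: guard for uniform groups
    return [
        [(clip(r_final + q, 0.0, 1.0) - mu) / sigma for q in q_tokens]
        for r_final, q_tokens in zip(final_rewards, process_rewards)
    ]


def clipped_surrogate(
    ratios: List[List[float]],      # importance ratio per token
    advantages: List[List[float]],  # output of fold_grpo_advantages
    llm_mask: List[List[int]],      # 1 for LLM-generated tokens, 0 for tool output
    eps_low: float = 0.2,
    eps_high: float = 0.2,
) -> float:
    """PPO-style clipped objective; masked tokens contribute zero."""
    total, n_tokens = 0.0, 0
    for r_i, a_i, m_i in zip(ratios, advantages, llm_mask):
        for r, a, m in zip(r_i, a_i, m_i):
            n_tokens += 1
            if m:
                total += min(r * a, clip(r, 1.0 - eps_low, 1.0 + eps_high) * a)
    return total / n_tokens


# Two rollouts: one solves the task but is penalized on its second token.
adv = fold_grpo_advantages([1.0, 0.0], [[0.0, -1.0], [0.0, 0.0]])
obj = clipped_surrogate([[1.0, 1.5], [1.0, 1.0]], adv, [[1, 1], [1, 0]])
print(adv, obj)
```

Because the process reward enters per token, a penalized token in an otherwise successful trajectory can receive a negative advantage while its neighbors receive a positive one, which is what lets the signal target specific folding decisions.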
4. Folding-Summary Generation and Prompting Protocol
Whenever the agent emits `return(message)`, context folding occurs:
- The sub-trajectory is collapsed, and only the summary tokens from `message` are appended to the parent context.
- The summary is produced by the LLM using the branch’s original prompt, augmented with an instruction to produce a concise, stand-alone summary (typically ≤100 tokens).
Prompt engineering details:
- The “branch” action uses tightly-specified subtask prompts (e.g., “Verify eligibility criteria for Paper X”).
- The “return” action receives a fixed summarization prompt, ensuring the generated summary captures all necessary findings for the upper-level reasoning thread.
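The prompting protocol above can be sketched as two small helpers. The instruction wording in `RETURN_INSTRUCTION`, the helper names, and the use of whitespace-separated words as a stand-in for tokenizer tokens are all assumptions made for illustration; the source does not give the actual prompt text.

```python
# Hypothetical fixed summarization instruction; the real wording is not given.
RETURN_INSTRUCTION = (
    "Summarize the findings of this branch as a concise, stand-alone message "
    "for the main thread."
)


def build_summary_request(branch_prompt: str, max_summary_tokens: int = 100) -> str:
    """Augment the branch's original prompt with the summarization instruction."""
    return (
        f"{branch_prompt}\n\n{RETURN_INSTRUCTION} "
        f"Use at most {max_summary_tokens} tokens."
    )


def truncate_summary(summary: str, max_summary_tokens: int = 100) -> str:
    """Enforce the length cap, counting whitespace-separated words as tokens."""
    words = summary.split()
    return " ".join(words[:max_summary_tokens])


request = build_summary_request("Verify eligibility criteria for Paper X")
print(request)
print(truncate_summary("Paper X satisfies criteria A and B but not C", 5))
```

A hard truncation like this is only a safety net; in the described setup the summary length is primarily controlled by the instruction itself.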
5. Empirical Evaluation and Ablation Analysis
FoldGRPO was systematically benchmarked on two challenging long-horizon settings:
- Deep Research (BrowseComp-Plus): Agents must search, retrieve, and synthesize information across multi-step web browsing using tool calls.
- Agentic Software Engineering (SWE-Bench Verified): Agents interact with a software environment using tools (e.g., bash execution, code editing) and must assemble multi-step fix sequences, validated by unit tests.
Key results (on BC-Plus / SWE):
- ReAct agent + GRPO: pass@1 of 0.446 / 0.480.
- Summary agent (contextual summarization only) + GRPO: 0.527 / 0.550.
- Folding agent + FoldGRPO: 0.620 / 0.580, on par with or better than some larger 100B+ LLM baselines though still below GPT-5 (0.793 / 0.718), while using a main context of ≈8K tokens, over 90% compression vs. naive ReAct (Sun et al., 13 Oct 2025).
Ablation studies show:
- Without RL, the folding agent compresses some context but finish rate stalls at ≈0.80.
- With standard GRPO, scope accuracy and finish rate both degrade (0.738 finish).
- FoldGRPO restores high finish rates (0.935), high scope-accuracy (0.895), and compacts the main-thread context to ≈7.7K tokens.
This demonstrates that both process rewards and dynamic policy optimization of folding decisions are critical for effective long-horizon performance.
6. Limitations and Future Directions
FoldGRPO as published is limited to one-layer, depth-first branching; multi-layer (hierarchical) folding is not yet supported, and parallel/breadth-first branching yielded only modest benefits. Dependence on an external LLM (GPT-5-nano) for out-of-scope detection introduces external supervision, which may be undesirable in some deployment settings. Proposed directions include:
- Hierarchical/multilayer folding strategies,
- Learned adaptive branching policies (depth vs. breadth balancing),
- Integration with external vector-databases for cross-branch context retrieval,
- Joint training of in-scope judges to reduce reliance on out-of-band LLMs.
7. Significance and Positioning
FoldGRPO represents the first end-to-end RL system that enables learned, scalable, and highly compressive context management in LLM agents conducting long-horizon reasoning and tool use. It provides a concrete alternative to simple heuristic summarization or to architectural approaches based on ever-expanding context windows. The dense process reward structure and the group-relative advantage normalization prove essential for sample-efficient and stable learning of folding policies. FoldGRPO sets a precedent for fully integrated memory management and reasoning in modern agentic LLM frameworks (Sun et al., 13 Oct 2025).