AgentFold: Long-Horizon Web Agents with Proactive Context Management (2510.24699v1)

Published 28 Oct 2025 in cs.CL, cs.AI, and cs.LG

Abstract: LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that fixedly summarize the full history at each step risk the irreversible loss of critical details. Addressing these, we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a 'folding' operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as the DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI's o4-mini.

Summary

The paper presents AgentFold, an agent that actively manages its own context through granular condensation and deep consolidation to prevent context saturation.
It demonstrates state-of-the-art performance with sub-linear token growth and significant memory savings even after 100+ interactions.
The methodology enhances scalable and robust agentic reasoning for long-horizon tasks, paving the way for efficient web agents.

AgentFold: Proactive Context Management for Long-Horizon Web Agents

Introduction and Motivation

AgentFold introduces a paradigm shift in LLM-based web agents by addressing the critical challenge of context management in long-horizon information-seeking tasks. Traditional ReAct-based agents accumulate exhaustive histories, leading to context saturation and degraded reasoning due to noise. Conversely, agents employing uniform, step-wise summarization risk irreversible loss of crucial details. AgentFold resolves this trade-off by treating context as a dynamic cognitive workspace, actively sculpted through learned folding operations inspired by human retrospective consolidation.

Figure 1: AgentFold-30B-A3B matches or surpasses much larger agents on long-horizon benchmarks, enabled by proactive context folding that maintains concise context even after 100+ turns.

AgentFold Architecture and Context Management

AgentFold's context is partitioned into four components: the invariant user question, available tools, multi-scale state summaries (long-term memory), and the latest interaction (working memory). The agent's operational loop at each step consists of perceiving the current context, reasoning, issuing a folding directive, and acting. The folding directive operates at two scales:

Granular Condensation: Converts the latest interaction into a fine-grained summary block, preserving high-resolution details.
Deep Consolidation: Fuses the latest interaction with a chain of prior summaries into a single, coarse-grained block, abstracting away completed sub-tasks or failed investigations.

This enables AgentFold to maintain both situational awareness and long-term coherence, dynamically balancing detail retention and context conciseness.
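
A minimal sketch of how a folding directive might be applied to the list of state summaries, assuming the directive carries a step range and a replacement summary. The names `SummaryBlock` and `apply_fold`, the field names `"range"` and `"summary"`, and the block-boundary check are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class SummaryBlock:
    start: int  # first step covered by this block
    end: int    # last step covered by this block
    text: str


def apply_fold(summaries, latest_step, directive):
    """Return a new summary list after folding the latest interaction.

    Hypothetical directive shape: {"range": [k, t], "summary": "..."} where t is
    the step of the latest interaction and k <= t is where the fold begins.
    """
    k, t = directive["range"]
    if t != latest_step or not 1 <= k <= t:
        raise ValueError(f"invalid fold range {directive['range']} at step {latest_step}")
    # Assumption: a fold may not split an existing block in the middle.
    if any(b.start < k <= b.end for b in summaries):
        raise ValueError("fold range must align with existing block boundaries")
    # Blocks that end before k survive; everything from step k onward is retracted.
    kept = [b for b in summaries if b.end < k]
    return kept + [SummaryBlock(k, t, directive["summary"])]


blocks = [
    SummaryBlock(1, 1, "found candidate site"),
    SummaryBlock(2, 2, "query A returned nothing"),
    SummaryBlock(3, 3, "query B returned nothing"),
]
# Granular condensation: step 4 becomes its own fine-grained block.
blocks = apply_fold(blocks, 4, {"range": [4, 4], "summary": "query C returned nothing"})
# Deep consolidation: steps 2-5 collapse into one coarse block.
blocks = apply_fold(blocks, 5, {"range": [2, 5], "summary": "keyword search is a dead end"})
```

In this sketch a range of [t, t] appends one fine-grained block, while a range of [k, t] with k < t replaces every block from step k onward with a single coarse block, so the same operation covers both scales.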

Figure 2: AgentFold context at an intermediate step, showing multi-scale state summaries and latest interaction, with folding directives enabling both granular and deep consolidation.

Training Methodology

AgentFold requires a specialized dataset of trajectories demonstrating both situational action and strategic context curation. The Fold-Generator pipeline leverages open-source LLMs and rejection sampling to produce high-quality, structured interaction pairs for supervised fine-tuning. This approach internalizes the folding skill, moving beyond fragile prompt engineering and enabling efficient inference.
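
The rejection-sampling filter could look roughly like the sketch below. The step layout (`"fold"`, `"action"`, `"env_error"` keys), the helper names, the retry budget, and the `MAX_ENV_ERRORS` threshold are all assumptions made for illustration; the source only says that steps violating the required format and trajectories with too many environmental errors are discarded.

```python
import json

MAX_ENV_ERRORS = 3  # hypothetical threshold; the source says only "too many"


def step_is_well_formed(step):
    """Check that a generated step has a parseable fold directive and an action."""
    try:
        directive = json.loads(step["fold"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return False
    return (
        isinstance(directive, dict)
        and isinstance(directive.get("summary"), str)
        and isinstance(directive.get("range"), list)
        and len(directive["range"]) == 2
        and bool(step.get("action"))
    )


def sample_step(generate, max_attempts=4):
    """Step-level rejection: redraw until a candidate passes the format check."""
    for _ in range(max_attempts):
        candidate = generate()
        if step_is_well_formed(candidate):
            return candidate
    return None  # no valid candidate; the caller would abandon the trajectory


def keep_trajectory(steps):
    """Trajectory-level rejection: drop runs with too many environment errors."""
    errors = sum(1 for s in steps if s.get("env_error"))
    return errors <= MAX_ENV_ERRORS
```

Here `generate` stands in for a call to the generator LLM; only trajectories whose steps all pass `sample_step` and which satisfy `keep_trajectory` would reach the fine-tuning set.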

Experimental Results

AgentFold-30B-A3B, fine-tuned from Qwen3-30B-A3B, achieves state-of-the-art results on BrowseComp (36.2%), BrowseComp-ZH (47.3%), WideSearch (62.1%), and GAIA (67.0%). It matches or outperforms open-source agents more than 20x larger by total parameter count (e.g., DeepSeek-V3.1-671B-A37B) and matches or surpasses leading proprietary agents such as OpenAI's o4-mini.

Figure 3: Growth curve of AgentFold's context, showing sub-linear token count increase over 100 turns, remaining well below model capacity.

AgentFold's context length grows sub-linearly, doubling from ~3.5k to ~7k tokens over 100 turns, compared to uncontrolled linear growth in ReAct agents. The number of context blocks also grows sub-linearly due to deep consolidation, maintaining structural simplicity and cognitive manageability.

Figure 4: Case study illustrating AgentFold's multi-scale context structure and strategic deep consolidation after a series of failed attempts.

Figure 5: Context of case 1 at step 17, showing both fine-grained and consolidated summary blocks.

Figure 6: Response of case 1 at step 17, demonstrating folding directive and re-planning after recognizing a dead end.

Scaling experiments demonstrate AgentFold's robustness: accuracy continues to improve up to 256 turns, while baseline agents saturate and fail due to context overflow. Extended experiments with 500 turns show context remains below 20k tokens, with non-monotonic growth due to self-correcting deep consolidation.

Figure 7: Context of case 2 at step 45, illustrating context structure in another long-horizon trajectory.

Figure 8: Response of case 2 at step 45, showing folding and action planning in a complex scenario.

Theoretical and Practical Implications

AgentFold's proactive context management fundamentally advances agentic reasoning by integrating context curation as a learnable, core action. The agent autonomously decides what to remember, abstract, or discard, mitigating both context saturation and the compounding risk of information loss. Quantitatively, if each uniform summarization step retains a given key detail with probability 0.99, the probability that the detail survives 100 steps is only ~36.6% ($0.99^{100}$), collapsing to ~0.66% after 500 steps ($0.99^{500}$). AgentFold's granular condensation preserves such details, while deep consolidation prunes irrelevant history, yielding both robustness and computational efficiency.
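
The survival figures can be checked with a few lines of arithmetic. The sketch assumes that every summarization pass keeps a given detail independently with probability 0.99, which is the per-step retention implied by the $0.99^{100}$ term; the function name is illustrative.

```python
def survival_probability(per_step_retention, steps):
    """Probability that a detail survives `steps` independent summarization passes."""
    return per_step_retention ** steps


for steps in (100, 500):
    p = survival_probability(0.99, steps)
    print(f"{steps} steps: {p:.2%}")
# prints:
# 100 steps: 36.60%
# 500 steps: 0.66%
```

Because the loss compounds multiplicatively, even a 1% per-step risk becomes near-certain loss over hundreds of steps, which is the argument for not re-summarizing the whole history at every step.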

Practically, AgentFold enables scalable, cost-effective deployment of long-horizon agents, with significant memory savings (up to 7GB per instance at 100 turns) and the ability to tackle tasks requiring hundreds of interactions. The architecture is compatible with further optimization via RL, which could enable discovery of non-obvious folding policies and further improve task success rates.

Future Directions

The current implementation relies on supervised fine-tuning. Future work should explore RL-based optimization for autonomous folding policy discovery, integration with external context augmentation, and application to broader agentic domains beyond web information seeking. The paradigm of self-aware knowledge management is likely to become foundational for next-generation agentic systems.

Conclusion

AgentFold establishes a new standard for long-horizon web agents by resolving the trade-off between context saturation and information loss through proactive, multi-scale context folding. Its architecture and training methodology yield superior performance and efficiency, enabling agents to sustain hundreds of interactions with focused, coherent reasoning. The implications for scalable, robust agentic systems are substantial, and further research into autonomous context management is warranted.

Explain it Like I'm 14

A simple explanation of “AgentFold: Long-Horizon Web Agents with Proactive Context Management”

What is this paper about?

This paper introduces AgentFold, a computer program (an “AI web agent”) that browses the internet to answer complex questions. The big idea is to help the agent manage its “memory” better during long tasks, so it doesn’t get overwhelmed by too much information or lose important details by summarizing too aggressively.

Think of it like doing a long school project online: if you copy every single thing you see, your notes become a mess. But if you summarize too early, you might throw away something important. AgentFold learns to keep its notes clean and useful by “folding” its memory—saving key details and merging finished parts into neat summaries at the right times.

What questions did the researchers want to answer?

The paper focuses on a few simple questions:

Can an AI agent manage its own memory while browsing, so it neither drowns in details nor forgets crucial facts?
Can this help the agent solve long, multi-step tasks (called “long-horizon” tasks) more accurately?
Can a smarter memory strategy let a smaller model beat bigger, more expensive ones?

How does AgentFold work? (Methods in plain language)

AgentFold treats its memory like a tidy workspace with two kinds of notes:

Multi-Scale State Summaries: short, organized notes about what’s already been done (like clean summaries in your notebook).
Latest Interaction: a full, detailed record of the most recent step (like a fresh sticky note with all the details you just found).

At each step, the agent does three things:

Decides how to “fold” its memory:
- Granular Condensation: turns the latest detailed step into a short, precise summary (like condensing a paragraph into one line).
- Deep Consolidation: merges several related steps into one higher-level summary once a mini-task is finished (like wrapping up a whole chapter with one takeaway).
Explains briefly why it’s taking the next step.
Takes an action, like searching the web or opening a page.

After acting, the result gets added as the new “Latest Interaction,” and the cycle repeats.
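
For readers who prefer code, the cycle above can be sketched as a short loop. Everything here is a simplified assumption rather than the paper's code: `policy` stands in for the fine-tuned model, `execute` for the tool runner, the response keys (`"fold"`, `"explanation"`, `"action"`, `"answer"`) and the `"finish"` action are hypothetical, and summaries are kept as plain `(first_step, last_step, text)` tuples.

```python
def build_context(question, tools, summaries, latest):
    """Assemble the workspace: question, tools, state summaries, latest interaction."""
    lines = [f"Question: {question}", "Tools: " + ", ".join(tools), "State summaries:"]
    lines += [f"  [steps {a}-{b}] {text}" for a, b, text in summaries]
    if latest is None:
        lines.append("Latest interaction: (none yet)")
    else:
        lines.append(f"Latest interaction: {latest['action']} -> {latest['observation']}")
    return "\n".join(lines)


def run_agent(question, tools, policy, execute, max_turns=10):
    """Run the fold / explain / act cycle until the policy finishes or turns run out."""
    summaries = []  # (first_step, last_step, text) blocks: the tidy notes
    latest = None   # full record of the previous step: the fresh sticky note
    for step in range(1, max_turns + 1):
        response = policy(build_context(question, tools, summaries, latest))
        if latest is not None:
            # Fold the previous step: drop notes from step k onward, add one new note.
            k = response["fold"]["range"][0]
            summaries = [b for b in summaries if b[1] < k]
            summaries.append((k, step - 1, response["fold"]["summary"]))
        if response["action"] == "finish":
            return response.get("answer"), summaries
        observation = execute(response["action"])
        latest = {
            "explanation": response["explanation"],
            "action": response["action"],
            "observation": observation,
        }
    return None, summaries
```

The first turn has nothing to fold, so the sketch skips the fold until a Latest Interaction exists; from then on each turn folds the previous step before taking the next action.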

How they trained it:

The team built a special data pipeline called “Fold-Generator” to create good examples of how to browse and fold memory properly.
They filtered out bad examples using rejection sampling (if a step was messy or didn’t follow the rules, they threw it away).
They then taught the model with supervised fine-tuning (SFT)—basically showing it many good examples until it learned the pattern—using an open-source base model called Qwen3-30B-A3B.

Key terms explained:

Long-horizon task: a problem that needs many steps to solve (dozens or even hundreds).
Context: the “memory” the AI uses—past actions, observations, and summaries.
Folding: the agent’s way of cleaning up and organizing memory at the right moments, either by condensing a step or merging many steps.

What did they find, and why does it matter?

AgentFold did very well on tough web-browsing tests:

BrowseComp: 36.2%
BrowseComp-ZH (Chinese): 47.3%
WideSearch: 62.1%
GAIA (general tasks, text-only subset): 67.0%

Why this is impressive:

It beat or matched much larger open-source models (some over 20 times bigger), like DeepSeek-V3.1-671B.
It also outperformed a leading proprietary agent (OpenAI’s o4-mini) on some benchmarks.
Its memory stayed compact even in long tasks: after 100 steps, the context stayed around “7k tokens” (think: a manageable notebook), not exploding in size like typical agents.
It can keep working well for hundreds of steps (tested up to 500 turns), which is great for deep research tasks.

In short: managing memory proactively helped the agent stay focused, use fewer resources, and solve longer, harder problems more reliably.

Why is this important for the future?

If AI agents can organize their own memory on the fly, they can:

Do serious, multi-hour research without getting “lost.”
Work efficiently, using less compute and memory.
Scale to very long tasks—bigger projects, better accuracy, and more stable results.

The authors suggest the next step is to combine this with reinforcement learning (RL), so the agent can discover even smarter folding strategies by directly practicing and optimizing for success.

The big takeaway

AgentFold shows that how an AI manages its memory can matter as much as how big the AI is. By “folding” its notes like a careful student—saving important details and wrapping up finished parts—AgentFold stays sharp over long tasks and can outperform much larger models.

Knowledge Gaps

Knowledge Gaps, Limitations, and Open Questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper, organized by theme to guide follow-up research.

Method and Design Limitations

No mechanism to “unfold” or retrieve pruned details after a deep consolidation; unclear how the agent recovers if critical information is lost or mis-summarized.
Lack of principled criteria for selecting the folding range k and the aggressiveness of folding; no learned utility model or confidence-aware policy for deciding granular vs deep consolidation.
Latest Interaction only retains the immediately preceding step; unexplored whether a short rolling window (last K full steps) or adaptive windowing would reduce near-term information loss.
No ablation isolating the contributions of Granular Condensation vs Deep Consolidation vs both; unclear which operation drives gains on which task regimes.
No analysis of the effect of the “thinking/explanation” components on folding quality and downstream performance (e.g., with/without visible CoT, distilled rationales).
Absent theoretical guarantees or bounds on context growth (e.g., expected token/block complexity over horizon H under different folding policies).
Folding directives depend on strict structured output and parsing; robustness to minor format deviations or parser failures is not studied.

Data and Training Pipeline Gaps

Fold-Generator data quality, diversity, and bias are under-specified; no diagnostics on coverage of task types, failure modes, or linguistic/style diversity of summaries.
Rejection sampling ensures format compliance but may bias trajectories toward “clean” paths; impact on generalization to noisy or atypical web interactions is unknown.
Potential overfitting to generator model style: no experiments with cross-generator training or style-randomization to test robustness.
No ablation on dataset size/composition vs performance (how many steps/trajectories are needed; which folds are most learnable).
The generated trajectories lack ground-truth labels for “fold correctness”; no human or automated verification that summaries preserve key facts.
Unclear whether the training data or evaluation benchmarks are contaminated by overlaps with generator outputs or web content used in data creation.

Evaluation and Experimental Limitations

Limited metrics: beyond final task score and token/block counts, no measures of summary fidelity, factual preservation, or error propagation due to folding.
No significance testing, confidence intervals, or variance analysis across runs for large benchmarks; robustness of reported gains is unclear.
In scaling-to-500 turns, success rates and accuracy are not reported—only context length dynamics—leaving the utility of very long horizons unquantified.
Comparisons may be confounded by heterogeneous toolchains/environments across baselines; the browser/tool setup is insufficiently specified for fair reproducibility.
No cost/latency profiling (e.g., tokens processed, wall-clock time, memory) per turn and per horizon; the compute benefits of folding remain anecdotal.
No head-to-head ablations against stronger, learned memory baselines (e.g., episodic memory stores, retrieval-augmented memories, memory graphs) under identical conditions.
Lack of cross-domain testing (e.g., code search, academic literature, multi-hop QA with structured sources); generality beyond web browsing benchmarks is unproven.
No human evaluation of readability, auditability, or traceability of state summaries for debugging and oversight.

Robustness, Safety, and Reliability Gaps

No evaluation under adversarial web settings: prompt injection, malicious scripts, content hijacking, cloaked pages, or misleading metadata.
No study of robustness to dynamic/non-stationary environments (content changes between steps, rate-limits, intermittent tool failures).
No mechanisms to detect and correct erroneous or contradictory folds; absence of self-verification or rollback strategies after mis-consolidations.
Hallucination control and misinformation propagation are not assessed; folding could compress and lock in fabricated or biased content.
Privacy/compliance and data retention concerns are unaddressed, especially given persistent multi-scale summaries that may store sensitive content.

Scalability and Generalization Questions

Portability to other base LLMs (smaller or larger, different architectures) is untested; sensitivity to instruction-following and reasoning strength is unknown.
Interaction with multimodal inputs (images, PDFs, tables, embedded maps) is unexplored; can folding handle heterogeneous content effectively?
Interoperability with external knowledge bases or vector stores (combining intra-task folding with external memory retrieval) is untested.
Team/interactive settings (multi-agent collaboration, task handoff, shared memory) are not considered; how folding integrates in collaborative workflows is open.
Persistence across sessions (inter-task memory) and lifelong adaptation are not addressed; folding is scoped to within-task context only.

Policy Learning and Control Open Questions

The paper proposes RL as future work; currently no evidence on whether RL or bandit-style feedback improves folding policies over SFT, nor on safe reward design.
No uncertainty-aware or risk-sensitive folding policy; how to tie folding decisions to calibrated confidence or downstream utility remains open.
No exploration of adaptive budgets (token/time/turn) that trade off folding aggressiveness and action depth dynamically per task.

Reproducibility and Transparency

Exact tool configurations, browsing stack, and environment setup (e.g., rendering engine, JavaScript handling, anti-bot measures) are insufficiently documented for replication.
Availability of the Fold-Generator code, trained datasets, and full logs is unclear; without them, independent verification of folding behaviors is difficult.
Criteria for forced termination and its effect on reported performance (especially for tasks nearing the turn limit) are not disentangled from agent competence.

These gaps suggest concrete next steps: design reversible or uncertainty-aware folding; build ground-truthed fold-fidelity datasets; ablate folding modes and training data properties; add rigorous robustness/safety evaluations; report cost-performance curves; test across models, domains, modalities, and memory baselines; and introduce RL or utility-driven policies for when and how to fold.

Practical Applications

Immediate Applications

Below are practical use cases that can be deployed now using the paper’s AgentFold paradigm and tooling.

Enterprise competitive intelligence and market landscape analysis (finance, software, manufacturing)
- What: Long-horizon web research across hundreds of sources; AgentFold’s deep consolidation reduces noise while granular condensation preserves critical facts, producing Multi-Scale State Summaries as a navigable briefing.
- Product/workflow: “AgentFold CI Analyst” with a fold-aware research brief UI; headless browser + search APIs + exportable summary blocks.
- Assumptions/dependencies: Open-web access, anti-bot compliance (CAPTCHAs, rate limits), paywall constraints, human-in-the-loop validation for high-stakes decisions.
Academic literature reviews and meta-analyses (academia, healthcare)
- What: Tiered summaries of papers and sub-topics; failed leads folded into concise conclusions; key methods/results preserved at fine-grain.
- Tools/workflows: PubMed/arXiv APIs; “Folded Literature Map” with per-subtopic blocks; citation capture and deduplication.
- Assumptions/dependencies: Access to licensed content; reproducibility checks; domain expert oversight for clinical or statistical claims.
Legal due diligence and e-discovery triage (policy/legal)
- What: Multi-source review (case law, filings, regulatory guidance) with explicit folding directives to compress repetitive or unproductive trails and retain pivotal precedents.
- Product/workflow: “Folded Case Digest” with block-level audit trail; export to matter management systems.
- Assumptions/dependencies: Legal accuracy and confidentiality; paywall/licensing; human attorney review; document quality/OCR fidelity.
Investigative journalism and OSINT (media, policy)
- What: Cross-source verification over extended browsing; timeline synthesis via multi-scale summaries; explicit reasoning trails support editorial review.
- Product/workflow: “AgentFold OSINT Desk” with provenance tracking for each folded block.
- Assumptions/dependencies: Misinformation risk; need for independent fact-checking; source reliability scoring.
Customer support knowledge base curation (software/SaaS)
- What: Mine tickets/forums/docs; consolidate recurring issues into KB articles while preserving edge-case details in granular blocks.
- Tools/workflows: Helpdesk integrations; “Context Folding KB Builder” that turns long threads into structured resolutions.
- Assumptions/dependencies: Access to internal data, privacy/compliance, PII redaction.
Cyber threat intelligence scanning (cybersecurity)
- What: Continuous monitoring of indicators across forums, repos, and advisories; deep consolidation suppresses noise while retaining confirmed IoCs and TTPs.
- Product/workflow: “Folded CTI Watch” dashboards with block-level provenance.
- Assumptions/dependencies: Secure browsing sandbox; rate limits; integration with SIEM/SOAR; analyst validation.
SEO/content audit and competitive site reviews (marketing)
- What: Crawl and compare hundreds of pages; consolidate redundant patterns (e.g., thin content) and preserve exemplar pages in high-fidelity blocks.
- Tools/workflows: Headless crawler + SERP APIs; exportable content gap summary.
- Assumptions/dependencies: Robots.txt compliance; site permissions; accurate parsing/rendering.
Procurement/vendor evaluation dossiers (manufacturing, energy, public sector)
- What: Compile specs, certifications, financials; fold dead-end checks; retain critical compliance details as fine-grained blocks.
- Product/workflow: “AgentFold Vendor Dossier” with multi-scale evidence trails.
- Assumptions/dependencies: Source reliability; up-to-date certifications; human review for awards and risk.
Personal research assistant for complex decisions (daily life)
- What: Multi-step product comparisons and travel planning; long-horizon exploration with compact context and clear trade-off summaries.
- Tools/workflows: Browser plug-in; integrations to booking or shopping sites.
- Assumptions/dependencies: API integrations; authentication/paywall handling; user privacy.
Cost-efficient agent orchestration and LLM app tuning (software/ML ops)
- What: Integrate AgentFold’s proactive folding into existing agent stacks (e.g., LangChain/LlamaIndex) to cut context size and inference cost at long horizons.
- Tools/workflows: “Context Folding SDK” and middlewares that parse/apply folding directives (JSON).
- Assumptions/dependencies: Long-context model availability; tool execution reliability; telemetry for fold efficacy.

Long-Term Applications

The following applications are viable but need further research, scaling, RL-based policy learning, stronger integrations, or governance.

Autonomous scientific discovery assistants and living reviews (academia, healthcare)
- What: Closed-loop hypothesis formation across literature, data, and lab notebooks; multi-scale memory of experiments, negative results, and evolving rationale.
- Tools/products: “AgentFold Science Pod” with ELN/LIMS integration; RL to learn non-obvious folding policies.
- Dependencies: Lab safety, experimental APIs, rigorous validation; regulatory approvals for medical domains.
Regulatory surveillance and compliance copilots (policy/finance/healthcare)
- What: Continuous, multi-jurisdiction monitoring of evolving rules; retain authoritative clauses at granular level and fold minor updates.
- Product/workflow: “AgentFold Compliance Radar” with audit-ready folding logs and explainable summaries.
- Dependencies: Timely access to official sources; legal review; governance for auditability and accountability.
Enterprise-grade RPA for semi-structured web workflows (operations/software)
- What: Months-long tasks spanning procurement, onboarding, licensing checks; folding maintains tractable memory and allows course corrections after dead ends.
- Tools/products: Integration with ERP/CRM; reliability enhancements (retry policies, CAPTCHA solving).
- Dependencies: Robustness to site changes; safety/guardrails; task-level RL.
Clinical evidence synthesizers and living guidelines (healthcare)
- What: Continually updated systematic reviews; preserve pivotal trial details while folding incremental updates.
- Tools/products: Guideline authoring platforms; traceable fold logs for medical boards.
- Dependencies: Content licensing; bias control; expert oversight; patient safety considerations.
Crisis response OSINT fusion centers (public safety/policy)
- What: 24/7 multilingual monitoring of social, news, and official channels; deep consolidation for evolving situational summaries.
- Tools/products: “AgentFold Situational Awareness” with geotagging and credibility scoring.
- Dependencies: Robust misinformation handling; secure infrastructure; cross-agency workflows.
Autonomous product design and trend research (industry/design)
- What: Long-horizon scanning of patents, forums, catalogs; multi-scale memory of design patterns and constraints.
- Tools/products: CAD/PLM integrations; IP risk tracking in folded blocks.
- Dependencies: IP compliance; expert validation; domain-specific toolchains.
Personal long-term memory managers (daily life/education)
- What: Lifelong assistants that fold digital traces (emails, notes, browsing) into multi-scale summaries for reflection, learning, and planning.
- Tools/products: Privacy-preserving local agents; fold-aware personal knowledge graphs.
- Dependencies: PII protection, consent; on-device inference or secure cloud; user control over folding policies.
High-level task memory for embodied agents (robotics)
- What: Apply proactive context folding to task-level plans and failures over hundreds of steps (e.g., household tasks, inspections).
- Tools/products: “Folded Task Memory” modules attached to planners.
- Dependencies: Sensor/action integration; mapping to spatial/temporal representations; safety and verification.
Energy market and grid operations analysis (energy/finance)
- What: Long-horizon monitoring of markets, weather, outages; retain critical events; fold routine fluctuations.
- Tools/products: “AgentFold Grid Intel” with data feeds and operator consoles.
- Dependencies: Real-time data licenses; model calibration; operator oversight.
Fold-aware governance and audit suites for AI agents (software/policy)
- What: Standardized folding directive logs (JSON) for compliance, reproducibility, and post-hoc review across agent ecosystems.
- Tools/products: “AgentFold Governance Suite” with policy engines; cross-agent “Folding API” standard.
- Dependencies: Industry standards; regulatory alignment; secure logging and provenance.

Cross-cutting assumptions and dependencies

Model capability and context: Performance depends on a strong base LLM with sufficient context window; folding reduces token pressure but still requires long-context models.
Tooling: Reliable headless browsing, search APIs, data parsing, and rate-limit handling; robust error recovery aligned with the folding policy.
Data access and licensing: Paywalls, authentication, and content rights may constrain deployment; multilingual performance requires strong translation.
Accuracy and oversight: Human-in-the-loop validation for high-stakes domains (legal, healthcare, policy); fact-checking and bias control are essential.
Governance and safety: Auditability via structured folding directives; PII handling and privacy; alignment and guardrails for autonomous long-horizon behavior.
Scaling: RL and domain-specific fine-tuning can improve folding decisions; industrialization requires MLOps, monitoring, and reliability engineering.

Glossary

AgentFold: A web agent paradigm that proactively manages its context via learned folding to handle long-horizon tasks efficiently. "we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation."
append-only context: A logging strategy that accumulates all past interactions without pruning, often causing bloated, noisy context. "However, the append-only context inherent to the ReAct paradigm leads to context saturation on long-horizon tasks, impairing reasoning as critical signals become buried in noise."
cognitive workspace: The agent’s structured internal context, actively curated to support reasoning and action. "AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled."
context saturation: The degradation of agent performance due to excessive, noisy context accumulation. "Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories,"
deep consolidation: A folding operation that merges multiple prior steps and the latest interaction into a single coarse summary. "or as a deep consolidation, it fuses the Latest Interaction with a chain of prior summaries, retracting these specific entries and replacing them with a single abstraction at a coarser strategic scale."
dynamic 'look-back' mechanism: A retrospective process where the agent revisits past steps to distill insights and discard irrelevancies. "This involves a dynamic 'look-back' mechanism: after several actions, irrelevant steps are discarded, intermediate findings are distilled, and key insights are abstracted."
Fold-Generator: A specialized data collection pipeline that produces trajectories demonstrating structured folding for training. "To this end, we develop Fold-Generator, a specialized LLM-oriented data collection pipeline that can automatically generate trajectories for training."
folding directive: A structured command specifying how to update state summaries (range and replacement summary) at each step. "This folding directive has a dual (two-scale) character:"
folding operation: The learned action of compressing or consolidating context segments during task execution. "it learns to execute a 'folding' operation, which manages its historical trajectory at multiple scales"
GAIA: A benchmark for evaluating general AI assistant capabilities. "and 67.0% on general benchmark GAIA"
granular condensation: A folding operation that compresses only the latest step into a fine-grained summary while preserving key details. "it can perform granular condensations to preserve vital, fine-grained details"
Intra-Task Context Curation: Managing and refining the context generated within the current task to maintain relevance over long horizons. "Our work, in contrast, pursues Intra-Task Context Curation, which focuses on managing the context generated within the task itself to maintain relevance and efficiency over long horizons."
Item-F1: A task-specific evaluation metric (F1 score at the item level) used in WideSearch. "WideSearch-en (the most detailed metric: Item-F1)"
JSON object: The data format used to encode the folding directive’s range and summary. "It takes the form of a JSON object:"
Latest Interaction: The complete record of the most recent step (explanation, action, observation) serving as working memory. "and the Latest Interaction, which is the complete record of the most recent action and observation."
long-horizon tasks: Tasks requiring many steps and sustained reasoning, where context management is crucial. "LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management."
long-term memory: The agent’s curated, abstracted summaries of past steps used for coherent long-range reasoning. "The Multi-Scale State Summaries function as the agent's curated long-term memory."
Multi-Scale State Summaries: A sequence of condensed summaries at varying granularities capturing the historical trajectory. "AgentFold's workspace (i.e., context) is explicitly partitioned into the invariant user question, the curated Multi-Scale State Summaries representing long-term memory, and the high-fidelity Latest Interaction serving as the immediate working memory."
rejection sampling: A filtering mechanism that discards generated steps or trajectories that violate format or contain errors. "we leverage a rejection sampling mechanism, discarding any generated step that fails to strictly adhere required formats, or any trajectory that contains too many environmental errors."
retrospective consolidation: A cognitive process of integrating and abstracting past information to support sustained reasoning. "inspired by the human cognitive process of retrospective consolidation."
ReAct: A paradigm where agents iterate in a reasoning–action–observation loop, typically with append-only history. "Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories"
Supervised Fine-Tuning (SFT): A training method that fine-tunes models on curated trajectory pairs to internalize structured responses. "This curated dataset is then used for conducting conventional Supervised Fine-Tuning (SFT) on open-source LLMs."
tool call: An external action invoking a specified tool with arguments to obtain an observation for the next step. "Simultaneously, the resulting observation from the executed tool call then, combined with the action, constitutes the new Latest Interaction for the subsequent cycle."
trajectory: The ordered sequence of agent steps (reasoning, actions, observations) that defines progress on a task. "It operates not on a monolithic log, but on a dynamic trajectory composed of Multi-Scale State Summaries—several distilled records of past events—and the Latest Interaction"
uniform full-history summarization: A policy that summarizes the entire history at every step, risking the loss of crucial details. "AgentFold's design offers a novel approach to context management, resolving the trade-off between the append-only history of ReAct, which leads to context saturation, and uniform full-history summarization, which risks irreversible information loss."
working memory: The immediate, high-fidelity memory of the latest interaction used for short-term decision-making. "The Latest Interaction acts as a high-fidelity working memory."
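
Several of the entries above (folding directive, granular condensation, deep consolidation, Multi-Scale State Summaries, Latest Interaction) describe one mechanism, which the following minimal Python sketch tries to make concrete. It is an illustration under stated assumptions, not the paper's implementation: the glossary says only that the directive is a JSON object carrying a range and a replacement summary, so the key names `range` and `summary`, the 1-based inclusive step indexing, and the `Workspace`, `Summary`, `record_interaction`, and `apply_fold` names are all hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Summary:
    start: int  # first step covered (inclusive, 1-based; an assumption)
    end: int    # last step covered (inclusive)
    text: str


@dataclass
class Workspace:
    question: str                                   # invariant user question
    summaries: list = field(default_factory=list)   # Multi-Scale State Summaries
    latest_step: int = 0                            # step index of the Latest Interaction
    latest_interaction: str = ""                    # raw record of the most recent step


def record_interaction(ws: Workspace, text: str) -> None:
    """Store a new action/observation record as the Latest Interaction."""
    ws.latest_step += 1
    ws.latest_interaction = text


def apply_fold(ws: Workspace, directive_json: str) -> None:
    """Apply a hypothetical folding directive {"range": [k, t], "summary": "..."}.

    k == t -> granular condensation: only the latest step is summarized.
    k <  t -> deep consolidation: earlier summaries from step k onward are
              retracted and merged with the latest step into one entry.
    """
    directive = json.loads(directive_json)
    start, end = directive["range"]
    if end != ws.latest_step or not 1 <= start <= end:
        raise ValueError("range must end at the Latest Interaction")
    kept = [s for s in ws.summaries if s.end < start]
    folded = [s for s in ws.summaries if s.start >= start]
    if len(kept) + len(folded) != len(ws.summaries):
        raise ValueError("range must not split an existing summary")
    ws.summaries = kept + [Summary(start, end, directive["summary"])]
    ws.latest_interaction = ""  # the raw record is now represented by its summary


# Usage: two granular condensations, then one deep consolidation.
ws = Workspace(question="Which site hosts the dataset?")
record_interaction(ws, "search(...) -> ten results")
apply_fold(ws, '{"range": [1, 1], "summary": "Found candidate site A."}')
record_interaction(ws, "visit(site A) -> landing page")
apply_fold(ws, '{"range": [2, 2], "summary": "Site A landing page has no data link."}')
record_interaction(ws, "visit(site A archive) -> 404")
apply_fold(ws, '{"range": [2, 3], "summary": "Site A was a dead end."}')

for s in ws.summaries:
    print(f"[steps {s.start}-{s.end}] {s.text}")
```

In this sketch the context grows by at most one summary per step and shrinks whenever a deep consolidation replaces several entries, which is the intuition behind keeping the workspace compact over long horizons.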

Open Problems

We found no open problems mentioned in this paper.

Authors (15)
