FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents (2510.03204v1)

Published 3 Oct 2025 in cs.CL

Abstract: Web agents powered by LLMs must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases computational cost processing; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines, while reducing observation size by over 50%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.

Summary

The paper introduces an LLM-based retriever that dynamically prunes AxTree observations to balance context relevance with computational efficiency.
It achieves a mean token reduction of over 50% while maintaining task success rates near baseline performance across benchmarks.
The security evaluation demonstrates a drastic reduction in prompt injection success, showcasing enhanced safety in web agent operations.

FocusAgent: LLM-Based Pruning for Efficient and Secure Web Agents

Motivation and Problem Setting

Web agents leveraging LLMs must process observations derived from complex, real-world web pages, often represented as accessibility trees (AxTrees) with tens of thousands of tokens. This scale saturates LLM context windows, increases inference cost, and exposes agents to prompt injection attacks. Existing observation pruning strategies—semantic similarity-based retrieval and naive truncation—either discard relevant content or retain excessive irrelevant context, leading to suboptimal action prediction and increased vulnerability. The FocusAgent framework addresses these challenges by introducing an LLM-based retriever that dynamically prunes AxTree observations, guided by the task goal and optionally the agent’s action history, to retain only the most relevant lines for each decision step.

System Architecture and Retrieval Pipeline

FocusAgent operates in a two-stage pipeline. At each agent step, a lightweight LLM retriever receives a prompt containing the current task goal, a line-numbered AxTree observation, and optionally the interaction history. The retriever generates a chain-of-thought (CoT) rationale and selects line ranges deemed relevant for the current decision. Post-processing removes irrelevant lines, optionally inserting placeholders to maintain AxTree structure. The pruned observation is then passed to the main agent LLM for action prediction.

Figure 1: Overview of the FocusAgent pipeline, illustrating the retrieval and pruning stages, and the mitigation of prompt injection attacks via observation sanitization.

Figure 2: Step-wise operation of the FocusAgent retriever, showing line numbering, CoT-based relevance selection, and AxTree pruning for a WebArena task.

This architecture is agnostic to the retriever’s context window, as long AxTrees can be chunked and processed sequentially. The retrieval prompt is designed to favor recall over precision, encouraging the inclusion of potentially relevant lines in ambiguous cases.

Empirical Evaluation: Efficiency and Effectiveness

FocusAgent is evaluated on the WorkArena L1 and WebArena benchmarks using the BrowserGym framework. The agent is compared against three baselines: (1) bottom-truncation (GenericAgent-BT), (2) embedding-based retrieval (EmbeddingAgent), and (3) BM25 keyword-based retrieval (BM25Agent). All agents are evaluated under standardized step and context constraints.

Key findings:

Observation size reduction: FocusAgent achieves a mean pruning rate of 51–59% on AxTree tokens, with some instances exceeding 80% reduction, while maintaining performance close to the full-context baseline.
Task success rate (SR): On WorkArena L1, FocusAgent achieves 51.5% SR (GPT-4.1 backbone), nearly matching the 53.0% of GenericAgent-BT, but with a 51% reduction in observation size. On WebArena, FocusAgent achieves 32.3% SR (GPT-4.1), compared to 36.5% for the baseline, with a 59% reduction in context.
Retrieval method comparison: Embedding and BM25-based retrievals yield similar pruning rates but suffer a >10% absolute drop in SR, indicating that LLM-based retrieval is superior for step-wise, context-sensitive pruning in interactive environments.
Figure 3: Success rate versus average pruning across agents, highlighting the trade-off between context reduction and task performance.

Figure 4: Distribution of original versus pruned AxTree token counts, demonstrating the variability and effectiveness of FocusAgent’s pruning across tasks.

Security: Mitigating Prompt Injection Attacks

Prompt injection attacks, such as adversarial banners and pop-ups, are a critical threat to LLM-based web agents. FocusAgent’s retrieval mechanism inherently sanitizes observations by removing irrelevant or suspicious lines, thereby reducing the attack surface.

Experiments using the DoomArena framework on WebArena Reddit tasks demonstrate:

Attack success rate (ASR): DefenseFocusAgent (FocusAgent variant with an attack warning prompt) reduces ASR from 32.4% (banner) and 90.4% (popup) for GenericAgent to 0.9% and 1.0%, respectively, with GPT-4.1 backbone.
Task success under attack: DefenseFocusAgent maintains a task SR of 42.1% (banner) and 2.0% (popup), whereas the baseline with a guard layer (LLM-judge) halts all tasks under attack, yielding 0% SR.
Trade-off: While DefenseFocusAgent does not fully eliminate all attacks (especially in degenerate page states), it achieves a strong Pareto improvement in the ASR–SR space compared to prior defense strategies.

Figure 5: Examples of banner and popup prompt injection attacks in DoomArena.

Figure 6: Example of a banner attack on WebArena Reddit Task 27, with malicious instructions embedded in SVG alt fields.

Figure 7: Example of a popup attack on WebArena Reddit Task 27, with adversarial text in a non-visible AxTree element.

Ablation: Retrieval and Formatting Strategies

Ablation studies reveal that:

Prompting strategy: Soft retrieval prompting (favoring recall) yields the best trade-off between pruning and SR. Aggressive or neutral strategies increase pruning but degrade performance, especially on more diverse benchmarks.
History inclusion: Including the agent’s action history in the retriever prompt does not improve, and may even degrade, retrieval quality—likely due to confusion from verbose CoT traces.
AxTree formatting: Removing irrelevant lines entirely maximizes pruning, but retaining minimal structure (e.g., bid and role placeholders) can slightly improve SR at the cost of reduced pruning efficiency.

Figure 8: Distribution of token pruning ratios across AxTrees, illustrating the variability in pruning effectiveness.

Figure 9: Pruning ratio distribution, showing the frequency of different pruning levels across tasks.

Theoretical and Practical Implications

FocusAgent demonstrates that LLM-based, step-wise retrieval is a practical and robust strategy for observation pruning in web agents. Unlike static semantic similarity or keyword-based methods, LLM retrievers can leverage planning context and user intent, preserving elements critical for action prediction and state tracking. This approach not only improves computational efficiency and cost but also provides a first line of defense against prompt injection, without the utility-security trade-off inherent in external guard layers.

The results challenge the assumption that classic retrieval methods suffice for interactive environments, and highlight the necessity of context-sensitive, LLM-driven observation filtering. The findings also suggest that further improvements in agent robustness may be achieved by integrating more sophisticated threat detection into the retrieval prompt, or by combining LLM-based pruning with structural sanitization of interactive elements.

Future Directions

Scalability: As web applications and AxTrees continue to grow, further research is needed on chunked or hierarchical retrieval strategies for extremely long contexts.
Adaptive security: Dynamic adjustment of retrieval aggressiveness based on detected threat likelihood could further optimize the utility-security trade-off.
Integration with multimodal agents: Extending LLM-based retrieval to multimodal observations (e.g., screenshots, OCR) may enhance agent robustness in visually rich environments.
Generalization: Investigating the transferability of retrieval prompts and strategies across domains and agent architectures will be critical for broad adoption.

Conclusion

FocusAgent establishes that targeted, LLM-based retrieval is an effective mechanism for both efficient context management and security in web agents. By pruning over 50% of AxTree tokens while maintaining near-baseline performance and dramatically reducing prompt injection success rates, FocusAgent provides a strong foundation for the next generation of robust, deployable web agents. The approach’s simplicity, generality, and empirical effectiveness position it as a key component in the design of scalable, secure LLM-based agents for real-world web automation.