- The paper demonstrates that semantic-level context pruning and summarization significantly improve task completion, achieving 91.6% full itemization.
- The methodology uses recency-based pruning combined with selective summarization to prevent context overflow while reducing token usage and execution time.
- The approach implies that calibrated context management enhances efficiency and scalability for LLM agents in complex, tool-centric enterprise workflows.
Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents
Introduction
"Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents" (2606.10209) introduces and systematically evaluates semantic-level context management strategies for LLM agents operating in long-horizon, multi-step enterprise workflows—specifically, tool-using agents interacting with verbose enterprise systems such as Microsoft Dynamics 365 Finance and Operations (D365 F&O). The study addresses the acute challenge of context window overflow in agentic tool use, where the accumulation of large, partially redundant tool responses not only strains memory and inference cost but degrades downstream performance (context rot). The paper defines, implements, and rigorously benchmarks recency-based pruning and selective summarization context policies as practical remedies that improve both task success and efficiency.
LLM agents automate complex, step-wise tasks by invoking tool APIs (via the Model Context Protocol, MCP) and managing intermediate state through dialogue history. In enterprise environments, each tool response returns extensive structured data—form snapshots, metadata, and nested system information—that dramatically increases token budget requirements for context retention as workflows lengthen. Full conversation history retention, while preserving all information, quickly becomes untenable both due to context window constraints and linear increases in inference costs.
Prior token-level prompt compression approaches (e.g., LLMLingua (Jiang et al., 2023), Selective Context (Li, 2023)) operate below the semantic granularity necessary for tool-centric workflows and can degrade form state integrity. External memory architectures (e.g., MemoryBank [zhong2024memorybank], LongMem [wang2024longmem]) are suited for factual recall across sessions, not for maintenance of current, decision-relevant tool state within a single task episode. The most relevant prior line comprises learned or compaction-based agent context policies (e.g., ACON (Kang et al., 1 Oct 2025)); however, these often involve more machinery and lack evaluation on stringent enterprise task benchmarks with hard completion criteria.
Methods
Task Setup
The primary benchmark task is hotel receipt expense itemization in D365 F&O: decompose a receipt into line items (mapping free-form names to fixed subcategories), accurately allocate amounts, and fully reconcile the report, achieving a zero residual. Each benchmark task includes multiple line items, repeated subcategory names, and nontrivial mapping challenges—all requiring not just raw recall but correct sequential tool-driven reasoning.
System and Policy Configurations
Four agent configurations are evaluated:
- No User Model (C1): GPT-5 receives a static prompt and executes unassisted; serves as ablation.
- Full Context (C2): GPT-5 with a user model (GPT-4.1) and retention of the full conversation history—standard baseline practice.
- Last 5 Tool Calls (C3): Pruning to retain only the 5 most recent tool call/response pairs; all prior context is dropped.
- Last 5 Tool Calls + Summarization (C4): Last 5 interactions retained, with earlier context compressed via a summarizer LLM operating over a 3-pair summary window.
The context construction algorithm operates deterministically at the message (tool pair) level, evicting older interactions and, if enabled, inserting a single summary message generated from the most recently evicted interactions.
Metrics and Evaluation
Metrics include the primary business-critical measure (percentage of tasks with full itemization and zero residual), as well as other task quality indicators and comprehensive efficiency statistics (total/input/output token count, execution time). Each configuration is evaluated on 50 tasks, with results averaged across five runs. Additional experiments probe sensitivity to window/summarization parameters, generalization across related expense categories, and cross-model portability (Claude Sonnet 4.5).
Results
Impact of Context Engineering
Both performance and efficiency are strongly affected by context policy. Summary of high-level results on the hotel benchmark:
| Config |
Complete Itemization (%) |
Tokens (K) |
Time (hrs) |
| C1 (no user) |
8.0 |
532.6 |
3.08 |
| C2 (full) |
71.0 |
1,481.0 |
14.56 |
| C3 (prune) |
79.0 |
535.3 |
5.39 |
| C4 (prune+sum) |
91.6 |
553.4 |
5.79 |
Key findings:
- Context pruning (C3) increases complete itemization to 79.0% (+8pts over C2), while reducing tokens by 63.9% and execution time by 63.0%.
- Adding summarization (C4) further increases task completion to 91.6% (+12.6pts over C3) with just 3.4% additional token cost.
- Full context (C2) incurs both lower accuracy and prohibitive inefficiency: more context does not equate to better agentic performance in this class of workflow.
Statistical analysis confirms the robustness of these improvements: in particular, the C4 vs. C3 pooled 95% Wilson confidence intervals do not overlap, confirming the superiority of recency+summarization over pruning alone.
Failure Taxonomy
Failure analysis over incomplete runs reveals clear, policy-driven trends:
- Full context (C2): high rate of stale-state reference errors (agent acts on superseded form data).
- Pruning only (C3): reduction in stale-state errors, but increased premature task termination (loss of global task awareness).
- Pruning + summarization (C4): significant suppression of both error types, with most residual failures arising from model-level issues (e.g., ambiguous subcategory mappings) not directly addressable by context policy.
Generalization and Sensitivity
- Results are consistent across structurally distinct expense task types (hotel, travel, meals/gifts), with C4 achieving 91–96% complete itemization and >60% token savings over full context in all domains.
- Hyperparameter sweeps indicate that accuracy plateaus at N=5 (pruning window) and W=3 (summarization window); further increasing these incurs unwanted token cost with no meaningful performance gain.
- Claude Sonnet 4.5, when tested on the same benchmark, does not exhibit the GPT-5-specific stalling observed without a user model, but still follows the improvement ordering: pruning + summarization reliably raises performance (full context: 88.0%, prune: 92.0%, prune+summarize: 94.5%).
Implications and Future Directions
This work establishes that, for complex, sequential tool-using workflows with verbose API responses, semantic-level context engineering—retaining only a recent window of tool interactions and supplementing pruned context with summarization—outperforms both full-history retention and pruning-alone strategies on both cost and accuracy. The agent's working memory is effectively calibrated to the task's sequential structure, obviating the need for all historical state which, if retained, can actively degrade performance through stale or conflicting cues.
Practically, this enables highly efficient, scalable LLM agent deployment for enterprise automation: token budgets for high-reliability runs are reduced by 2.7x or more, enabling orders-of-magnitude cost and throughput gains in production settings without sacrificing correctness. The context engineering approach is model-agnostic, inference-time only (no model retraining), and interpretable (retained context and summaries are readable and auditable).
Theoretically, the results challenge the assumption that more context is generally beneficial for tool-using LLMs—a principle unlikely to hold in domains where old states are repeatedly superseded and system state evolves through a fixed-API interaction loop.
Open questions for future investigation include:
- Optimal context window selection as a function of task complexity and tool API structure.
- Integration of structured or learned summarization/compression modules for further efficiency gains.
- Application to domains beyond expense management (CRM, healthcare administration, supply chain).
- Adaptive policies that dynamically tune pruning and summarization in response to task progress or model uncertainty signals.
Conclusion
By formalizing and empirically validating recency-based pruning and summarization context policies for tool-rich, long-horizon LLM agents, this paper provides strong evidence that semantic-level context engineering is a critical enabler for efficient, effective agentic automation in enterprise environments. The results generalize across categories and model families, highlighting that carefully crafted context windows, rather than maximal retention, are key to high-reliability, cost-effective agent deployment in real-world, tool-oriented scenarios (2606.10209).