
Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets

Published 24 May 2026 in cs.SE, cs.AI, and cs.CL | (2605.26165v1)

Abstract: Agentic RAG systems that equip LLMs with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas consume the same context window needed for retrieval-augmented generation. We present the first systematic study of this tool-context trade-off, evaluating 14 models spanning 1.5B-32B local models plus one frontier API model across 6,566 controlled API calls at three context budgets (8K, 16K, 32K) with 28 tool definitions. Applying TSCG conservative-profile compression (44-50% schema token savings), we observe a binary enablement effect: at 8K tokens, JSON-schema tool definitions overflow the context window entirely, yielding near-zero EM (2.6% average), while compressed schemas restore RAG functionality with +20.5 pp average exact-match lift across all eight models (+24.7 pp among the six exhibiting full enablement). At 32K -- where both formats fit -- four of five tested models show delta <= 1 pp, confirming the effect is purely budget-driven. External validation on HotpotQA (50 multi-hop questions) shows +48 pp EM under the same overflow scenario. Frontier scaling tests demonstrate that JSON schemas overflow at ~494 tools while compressed schemas remain operational beyond 800 tools. Our results establish tool-schema compression as a necessary infrastructure layer for agentic RAG in constrained-context deployments. All code, data, and checkpoints are publicly available.

Authors (1)

Summary

  • The paper shows that TSCG schema compression frees context budget, raising agentic RAG exact-match (EM) accuracy by 20.5 percentage points on average under an 8K-token budget.
  • It employs a controlled study across 14 models with synthetic and real-world benchmarks to rigorously evaluate the impact of schema compression.
  • Results indicate that schema compression is a necessary infrastructural layer for scalable agentic RAG deployments in context-constrained environments.

Tool-Schema Compression as a Prerequisite for Agentic RAG Under Context Constraints

Introduction and Problem Formulation

The intersection of tool-augmented LLMs and retrieval-augmented generation (RAG) has produced agentic RAG systems capable of dynamic tool selection, API invocation, and evidence synthesis within a single context window. Contemporary production deployments often expose 20–100+ tools via protocols such as the Model Context Protocol (MCP), so the context window must be divided among system prompts, tool schemas, retrieval chunks, conversation history, and generated output.

A fundamental bottleneck arises from the token cost of tool schemas, which are commonly formatted as verbose JSON Schema objects that can each consume hundreds of tokens. At moderate tool counts (e.g., 28 tools), schema blocks can exceed 11,000 tokens. This overflows an 8K context window outright and leaves little room for retrieved evidence at 16K, contradicting the assumption, prevalent in existing RAG literature, that tool definitions fit easily within context.
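The budget conflict described above can be made concrete with a little arithmetic. The following is a minimal sketch, not the paper's accounting: the 11,000-token schema block comes from the text, while the 45% savings figure is one point inside the reported 44–50% range, and the system-prompt size, output reserve, and per-chunk size are all assumed values chosen for illustration.

```python
def retrieval_budget(context_window, schema_tokens,
                     system_tokens=500, output_reserve=500):
    """Tokens left for retrieved chunks after fixed costs.

    A negative result means the prompt overflows before any chunk is added.
    system_tokens and output_reserve are assumed, illustrative values.
    """
    return context_window - schema_tokens - system_tokens - output_reserve


def chunk_slots(context_window, schema_tokens, chunk_tokens=400, **fixed):
    """Number of whole retrieval chunks that fit (assumed 400 tokens each)."""
    remaining = retrieval_budget(context_window, schema_tokens, **fixed)
    return max(0, remaining // chunk_tokens)


RAW_SCHEMA = 11_000                       # from the text: 28 tools, raw JSON Schema
COMPRESSED = int(RAW_SCHEMA * (1 - 0.45))  # assumed 45% savings -> 6,050 tokens

for window in (8_192, 16_384, 32_768):
    print(window, chunk_slots(window, RAW_SCHEMA), chunk_slots(window, COMPRESSED))
```

Under these assumptions the raw schemas leave zero chunk slots at 8K while the compressed ones leave a couple, which is the shape of the effect the paper reports; the exact slot counts depend entirely on the assumed sizes.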

Methodology

The paper presents the first controlled study of the tool-context budget trade-off in agentic RAG. The authors evaluate 14 models, spanning local models from 1.5B to 32B parameters plus one frontier API model (Claude Sonnet 4), across 6,566 API calls at three context budgets (8K, 16K, 32K) with a fixed set of 28 tool definitions. The core intervention is TSCG, a deterministic, rule-based schema compressor (Sakizli, 4 May 2026), applied in its conservative profile, which yields 44–50% token savings over raw JSON schemas.

The evaluation uses a synthetic benchmark (NovaTech-28) designed to emulate a typical enterprise workflow: 28 tools spanning databases, search, computation, and communication, an associated corpus of 40 retrieval chunks, and 100 queries reflecting realistic single-hop, multi-hop, and tool-use scenarios. The agent uses a ReAct-style architecture, and all experiments rely on paired-prompt designs, significance testing (Wilcoxon signed-rank, Cohen's d), and bootstrap confidence intervals.
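For readers unfamiliar with the paired analysis named above, here is a minimal standard-library sketch of two of its ingredients: a paired effect size and a percentile bootstrap confidence interval on the mean per-query difference. It is an illustration under assumptions, not the authors' analysis code: the per-query scores are made up, and the effect size shown is the paired variant (mean difference divided by the standard deviation of differences), which may differ from the variant the paper uses. The Wilcoxon signed-rank test itself is not reimplemented here; in practice it would typically come from a statistics library.

```python
import random
import statistics


def cohens_d_paired(baseline, treated):
    """Paired effect size: mean of differences / sample std of differences."""
    diffs = [t - b for b, t in zip(baseline, treated)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


def bootstrap_ci(baseline, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [t - b for b, t in zip(baseline, treated)]
    means = sorted(
        statistics.mean(rng.choice(diffs) for _ in diffs)
        for _ in range(n_resamples)
    )
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - lo_idx - 1
    return means[lo_idx], means[hi_idx]


# Made-up per-query EM scores (1 = exact match) for the same eight queries.
json_em = [0, 0, 1, 0, 0, 0, 0, 1]
compressed_em = [1, 0, 1, 1, 0, 1, 0, 1]

d = cohens_d_paired(json_em, compressed_em)
ci_low, ci_high = bootstrap_ci(json_em, compressed_em)
print(f"d = {d:.2f}, 95% CI for mean EM difference = [{ci_low:.3f}, {ci_high:.3f}]")
```

Pairing matters here because each query is scored under both schema formats, so the analysis works on per-query differences rather than on two independent samples.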

Experimental Results

Binary Enablement Under Tight Budgets

At an 8K-token context window, conventional (uncompressed) JSON schemas for 28 tools alone exceed the available context, causing overflow and near-zero exact-match (EM) accuracy (2.6% on average across models). Applying TSCG compression restores enough budget to fit minimal retrieval evidence, producing a mean EM gain of +20.5 percentage points across all eight models tested at this budget (+24.7 among the six that exhibit full enablement). This binary enablement, a transition from total system inoperability to meaningful RAG functionality, is the paper's primary result.

At a 32K window, both raw and compressed schemas fit comfortably with room for all retrieval chunks, and the performance difference vanishes for four out of five models ($|\Delta| \leq 1$ pp EM), confirming that the observed effect is purely context-budget driven rather than an artifact of compression per se.

16K and the Saturation Plateau

At 16K, TSCG compression alleviates retrieval bottlenecks by roughly tripling the available chunk slots (from 6–11 to 25–28), but accuracy improvements are negligible for the majority of models. This plateau stems from empirical saturation in the synthetic benchmark: most queries become retrieval-saturated by 9 chunks, and further context inclusion introduces diminishing (or even negative) returns due to distractor dilution, especially in smaller models ($\leq$ 8B parameters).

External and Frontier Validation

Testing on HotpotQA (50 multi-hop questions) with the same tool set at 8K yielded a 48-point EM lift when TSCG was used, indicating that the effect generalizes to independent questions and more information-dense passage formats.

Scaling experiments with up to 800 synthetic tools in a 200K-token context show that uncompressed JSON schemas overflow the context at approximately 494 tools, while TSCG-compressed schemas accommodate more than 800 tools, extending the operational range by 63%. Compression thus becomes a categorical enabler for tool-dense agentic RAG pipelines.
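The overflow threshold can be approximated with a back-of-the-envelope model. The sketch below assumes that schema cost grows linearly with tool count and that the whole 200K window is available for schemas; neither assumption is stated in the source. Under them, an overflow near 494 tools implies roughly 405 tokens per raw tool definition, and that derived figure (not a reported one) is used as the input.

```python
import math


def max_tools(context_window, tokens_per_tool, reserved_tokens=0):
    """Largest tool count whose schemas fit, assuming linear schema cost."""
    return (context_window - reserved_tokens) // tokens_per_tool


def compressed_tokens(tokens_per_tool, savings):
    """Per-tool cost after compression, rounded up to whole tokens."""
    return math.ceil(tokens_per_tool * (1 - savings))


WINDOW = 200_000
RAW_PER_TOOL = 405  # derived assumption: about 200,000 / 494

print("raw:", max_tools(WINDOW, RAW_PER_TOOL))
for savings in (0.44, 0.50):  # the reported savings range
    per_tool = compressed_tokens(RAW_PER_TOOL, savings)
    print(f"{savings:.0%} savings:", max_tools(WINDOW, per_tool))
```

Both ends of the 44–50% savings range put the estimated compressed ceiling above 800 tools, which is consistent with the reported observation that compressed schemas were still operational at the largest tool count tested.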

Analysis

A descriptive context utilization model, $C(k) = C_{\text{max}} (1 - e^{-\lambda k}) + C_0$, is fit to the entire result set, revealing that the marginal gain from the first few retrieval chunks is dominant. Consequently, schema compression that recovers even minimal retrieval budget converts a system from non-functional (overflow) to functional by enabling this crucial first evidence chunk.
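To see why the first chunks dominate under this model, it helps to evaluate the marginal gain $C(k) - C(k-1)$ directly. The sketch below does so with placeholder parameters: the values of `c_max`, `lam`, and `c0` are assumptions picked for illustration, not the fitted values from the paper.

```python
import math


def utilization(k, c_max=0.60, lam=0.5, c0=0.03):
    """C(k) = c_max * (1 - exp(-lam * k)) + c0, for k retrieved chunks.

    All three parameter defaults are illustrative placeholders.
    """
    return c_max * (1 - math.exp(-lam * k)) + c0


def marginal_gain(k, **params):
    """Accuracy added by the k-th chunk: C(k) - C(k - 1)."""
    return utilization(k, **params) - utilization(k - 1, **params)


for k in (1, 2, 3, 5, 9, 20):
    print(k, round(utilization(k), 3), round(marginal_gain(k), 4))
```

Because each marginal gain is a constant factor $e^{-\lambda}$ smaller than the previous one, going from zero chunks to one is the largest single step available, and gains far along the curve are close to zero. With the placeholder rate used here, the curve is nearly flat by about nine chunks, similar in shape to the saturation described for the 16K setting.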

Furthermore, when excessive context is available (after compression), small models sometimes see degraded accuracy—a distractor dilution effect reminiscent of "lost in the middle" (Liu et al., 2023), highlighting the importance of balancing context breadth with model attention capacity.

Implications and Theoretical Significance

The results reconceptualize schema compression from a niche optimization into a necessary infrastructural layer for agentic RAG, especially in context-constrained deployments for edge or on-premises settings. They demonstrate that practically relevant agentic RAG deployments cannot rely on uncompressed schemas for moderate tool counts under industry-standard context limits.

Compression of structured machine-readable representations (e.g., JSON Schema), as distinct from natural-language prompt compression, must preserve type fidelity and structural invariants. Techniques such as TSCG, which exploit deterministic, layout-preserving reductions, outperform general-purpose methods and maintain lossless operational semantics up to high tool counts (Sakizli, 4 May 2026).
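As a rough illustration of what a deterministic, type-preserving reduction can look like, the sketch below renders a JSON Schema tool definition as a one-line typed signature. This is not the TSCG algorithm or its output format, neither of which is described in this summary; the signature syntax, the `compact_tool` helper, and the example tool are all hypothetical.

```python
import json


def compact_tool(tool):
    """Render one tool definition as a single typed signature line.

    Keeps parameter names, types, enum values, and required/optional status;
    per-parameter descriptions and nested schemas are not handled here.
    """
    params = tool["parameters"]
    required = set(params.get("required", []))
    parts = []
    for name, spec in params.get("properties", {}).items():
        type_str = spec.get("type", "any")
        if "enum" in spec:
            type_str += "{" + "|".join(map(str, spec["enum"])) + "}"
        marker = "" if name in required else "?"  # "?" marks optional
        parts.append(f"{name}{marker}:{type_str}")
    return f"{tool['name']}({', '.join(parts)}) # {tool['description']}"


# Hypothetical tool definition in the usual JSON Schema function-calling shape.
tool = {
    "name": "search_orders",
    "description": "Search orders by customer.",
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "status": {"type": "string", "enum": ["open", "closed"]},
            "limit": {"type": "integer"},
        },
        "required": ["customer_id"],
    },
}

compact = compact_tool(tool)
print(compact)
print(len(json.dumps(tool)), "->", len(compact), "characters")
```

The character counts printed at the end are only a proxy; actual token savings depend on the tokenizer. The point of the sketch is that the transformation is rule-based and reversible for the fields it covers, which is the property that separates schema compression from lossy natural-language prompt compression.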

At larger context windows (32K+) or for deployments with sparse tool exposure, the operational benefits of compression diminish, and no adverse effects were observed: compressed and uncompressed schemas yield near-equivalent results (four of five models within 1 pp EM at 32K).

The findings also inform scaling: even with rapid increases in context capacity, tool schema costs will dominate once deployments cross the overflow threshold. In such regimes, combining schema compression with intelligent tool selection and dynamic prompt allocation will be essential for sustainable, scalable, and robust agentic RAG systems.

Conclusion

Tool-schema compression is established as a prerequisite for functional agentic RAG under constrained context budgets. The empirical binary enablement effect, a transition from system failure to substantial retrieval-based accuracy, arises solely from recovering context budget previously consumed by verbose tool schemas. Compression via TSCG or comparable techniques is shown to cost essentially no accuracy when context is not the limiting factor, to generalize beyond synthetic settings, and to be essential for scaling to hundreds of tools.

Theoretical implications include the need for future RAG architectures to model context allocation as a zero-sum optimization and to deploy structured compression where schema cost approaches operational limits. Practically, schema compression is a critical deployment primitive, and future work should explore the co-design of dynamic retrieval and compression strategies, as well as model architectural innovations that further mitigate distractor risks in wide-context inference.
