TeaRAG: A Token-Efficient Agentic Retrieval-Augmented Generation Framework (2511.05385v1)

Published 7 Nov 2025 in cs.IR and cs.AI

Abstract: Retrieval-Augmented Generation (RAG) utilizes external knowledge to augment LLMs' (LLMs) reliability. For flexibility, agentic RAG employs autonomous, multi-round retrieval and reasoning to resolve queries. Although recent agentic RAG has improved via reinforcement learning, they often incur substantial token overhead from search and reasoning processes. This trade-off prioritizes accuracy over efficiency. To address this issue, this work proposes TeaRAG, a token-efficient agentic RAG framework capable of compressing both retrieval content and reasoning steps. 1) First, the retrieved content is compressed by augmenting chunk-based semantic retrieval with a graph retrieval using concise triplets. A knowledge association graph is then built from semantic similarity and co-occurrence. Finally, Personalized PageRank is leveraged to highlight key knowledge within this graph, reducing the number of tokens per retrieval. 2) Besides, to reduce reasoning steps, Iterative Process-aware Direct Preference Optimization (IP-DPO) is proposed. Specifically, our reward function evaluates the knowledge sufficiency by a knowledge matching mechanism, while penalizing excessive reasoning steps. This design can produce high-quality preference-pair datasets, supporting iterative DPO to improve reasoning conciseness. Across six datasets, TeaRAG improves the average Exact Match by 4% and 2% while reducing output tokens by 61% and 59% on Llama3-8B-Instruct and Qwen2.5-14B-Instruct, respectively. Code is available at https://github.com/Applied-Machine-Learning-Lab/TeaRAG.

Summary

The paper presents a novel agentic RAG framework that reduces token usage by up to 61% while enhancing QA accuracy.
It details a five-module pipeline combining semantic and graph-based retrieval with Personalized PageRank filtering to optimize token efficiency.
Empirical evaluations show improved Exact Match scores and concise, stable reasoning paths across multiple multi-hop QA benchmarks.

Token-Efficient Agentic RAG via TeaRAG

TeaRAG presents an agentic Retrieval-Augmented Generation (RAG) framework focused on maximizing token efficiency during both retrieval and reasoning phases for LLMs. This approach addresses redundant token consumption endemic in current agentic RAG architectures by compressing retrieved content, optimizing reasoning steps, and innovatively combining semantic and graph-based retrieval with efficient reward-driven learning protocols.

Motivation and Problem Analysis

Agentic RAG systems empower LLMs to autonomously plan, break down, and solve complex queries through multi-step retrieval and reasoning. However, extant agentic RAG implementations typically optimize for final accuracy, disregarding efficiency metrics such as total token usage. This results in excessive content retrieval (often entire document chunks with substantial irrelevant material) and redundant, overlong reasoning paths ("overthinking"), both of which incur significant computational and economic cost. In addition, reliance on outcome-based sparse reward signals during RL-based training further destabilizes optimization and impedes control over intermediate process efficiency.

Figure 1: TeaRAG achieves higher token efficiency by compressing both retrieval content and reasoning steps via graph-based triplet retrieval and process-aware preference learning.

Framework Architecture and Pipeline

TeaRAG is realized as an agentic pipeline orchestrated by an LLM agent, encompassing five sequential modules:

Important Entity Recognition: Identification of anchor entities guiding query decomposition and retrieval.
Subquery Generation: Decomposition of the original task into entity-focused subproblems to promote targeted retrieval.
Hybrid Context Retrieval: Parallel extraction of relevant content via (a) chunk-based semantic retrieval and (b) knowledge graph-driven triplet retrieval. A Knowledge Association Graph (KAG) is constructed over the union of these sources, integrating semantic similarity and co-occurrence signals.
Token-Efficient Context Selection: KAG is filtered via Personalized PageRank (PPR), using a personalization vector and tradeoff parameter $\alpha$ to balance query relevance and graph topology, maximizing information density per retrieved token.
Summary Generation and Termination: Iterative agentic rollouts are performed until a process-aware reward signal detects sufficient evidence accumulation, at which point the agent halts further reasoning and outputs the answer.
Figure 2: The TeaRAG pipeline interleaves LLM-driven entity recognition, subquery decomposition, hybrid chunk/triplet retrieval, and PPR-based content compression over a large-scale knowledge graph.

This workflow exploits complementary strengths of semantic chunks (contextual richness) and graph triplets (high factual density), leveraging co-occurrence as a robust filter for token-efficient evidence selection.

Process-Aware Learning via IP-DPO

The framework is trained in two distinct phases:

Stage 1: Supervised Fine-Tuning (SFT): Leveraging multi-hop QA datasets (MuSiQue, HotpotQA, NQ), reasoning chains and ground-truth evidences are converted to natural language in a chain-of-thought format with aligned chunk/triplet context. SFT teaches the LLM agent to emulate agentic workflows and compositional reasoning.
Stage 2: Iterative Process-aware Direct Preference Optimization (IP-DPO): To enforce token efficiency, process-aware rewards are crafted evaluating:
- Knowledge sufficiency (through memory vectors aggregating evidence coverage throughout reasoning steps)
- Format adherence
- Reasoning conciseness (steps required)
- Entity-subquery consistency

Preference pairs are constructed between ⟨chosen, rejected⟩ reasoning paths using outcome, format, and process rewards. DPO is performed iteratively, with each cycle resampling outputs using the latest model weights, progressively reinforcing concise and effective reasoning patterns while stabilizing alignment with ground-truth evidence distributions.

Figure 3: TeaRAG two-stage model training: initial SFT on structured multi-hop reasoning, followed by multi-round IP-DPO guided with process-aware rewards.

Empirical Evaluation

Experiments were conducted across six QA benchmarks (NQ, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle), using Llama3-8B and Qwen2.5-14B as backbone models. Baselines encompassed standard RAG, agentic RAG approaches (IRCoT, R1-Searcher, Search-R1), and hybrid retrieval variants.

Key findings:

Token Reduction: TeaRAG achieves a 61% output token reduction with Llama3-8B and 59% reduction with Qwen2.5-14B compared to Search-R1 baselines, despite matching or exceeding accuracy.
Improved EM/F1: TeaRAG improves average Exact Match scores by 4% (Llama3-8B, statistically significant) and 2% (Qwen2.5-14B) over the best agentic RAG baselines.
Reasoning Steps: The number of reasoning steps per query is notably reduced; IP-DPO yields highly stable and concise chains immune to model scale sensitivity, contrasting with unstable baseline step distributions.
Figure 4: Token usage (a), reasoning step distributions (b), and F1 performance (c) across benchmarks demonstrate TeaRAG’s simultaneous gains in efficiency and accuracy.

Figure 5: Comparing TeaRAG with baselines, TeaRAG produces shorter, more efficient reasoning paths across model scales.

Figure 6: TeaRAG reduces overall output token usage and per-retrieval token count, confirming effectiveness of graph-based filtering.

Ablations and Analysis

Detailed ablations confirm:

Hybrid retrieval and PPR filtering are critical: Direct concatenation or single-mode retrieval yields suboptimal accuracy or excessive token bloat. PPR-based KAG filtering selectively amplifies high-confidence content, maintaining accuracy under input length constraints.
Process reward necessity: Training with outcome-only signals leads to unstable learning and collapse; process-aware supervision is essential for maintaining reasoning path structure and concise step generation.
IP-DPO convergence: Performance gains attenuate after several iterations as reasoning path quality plateaus, indicating stable alignment.
Scalability: TeaRAG is robust to changes in key hyperparameters (number of input contents per retrieval; $\alpha$ in PPR), without observable performance degradation.
Figure 7: TeaRAG accuracy steadily rises across IP-DPO iterations for Llama3-8B, converging after several rounds.

Engineering and Deployment Considerations

Computational Resources: TeaRAG’s training and inference runtimes are substantially lower than PPO-based RL baselines, with lower per-GPU memory profiles attributed to LoRA parameter-efficient finetuning and decoupled sample/train workflow.
Graph Scale: Scalable KAG construction over millions of Wikipedia-derived entities and triplets is validated; PPR can be performed efficiently within practical inference pipelines.
Domain Transferability: Out-of-domain generalization is strong, supporting cross-corpus and multi-hop QA transfer.
Reproducibility: Codebase is open-sourced for full pipeline replication.

Future Directions

Future work can focus on extending KAG construction to dynamically update knowledge graphs under streaming corpora, incorporating more granular process supervision (e.g., cross-step causal alignment), generalizing to retrieval with multi-modal evidence sources, and further coupling token efficiency with fine-grained latency profiling for real-world production deployment (e.g., cloud LLMs with strict cost constraints).

Conclusion

TeaRAG establishes a robust foundation for scalable, process-efficient agentic RAG. By integrating hybrid, graph-enhanced retrieval with process-aware preference modeling, TeaRAG achieves state-of-the-art results in QA accuracy and token efficiency, advancing both the theoretical understanding and practical deployment of retrieval-augmented LLM systems.