
MAIN-RAG: Multi-Agent Filtering for RAG

Updated 19 November 2025
  • The paper introduces MAIN-RAG, which enhances retrieval-augmented generation by integrating multiple LLM agents to filter noisy evidence and synthesize accurate answers.
  • The modular framework decomposes the process into predictor, judge, and final-predictor agents, using adaptive thresholding to dynamically assess document relevance.
  • Experimental results demonstrate improved performance on QA benchmarks, with significant gains in answer faithfulness, precision, and robustness across diverse domains.

Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG) is a modular framework for enhancing the correctness, robustness, and interpretability of LLM-based retrieval-augmented generation systems through collaborative filtering and adaptive control performed by multiple specialized agents. MAIN-RAG explicitly addresses the persistent issue of noisy or irrelevant retrievals degrading response quality in traditional RAG settings by inserting an ensemble of LLM agents—each focused on successive stages of evidence selection, query decomposition, and answer synthesis—between the retrieval and generation phases. The architecture supports both training-free and learned variants, enables flexible workflow orchestration, and demonstrates consistent improvements across a broad array of QA and knowledge-intensive generation benchmarks (Chang et al., 31 Dec 2024, Besrour et al., 17 Oct 2025, Besrour et al., 20 Jun 2025).

1. Architectural Foundations and Agent Roles

MAIN-RAG decomposes the retrieval-augmented generation pipeline into a set of autonomous LLM agents, each targeting a distinct pipeline subcomponent, and connected by structured communication protocols. The canonical training-free architecture comprises three roles (Chang et al., 31 Dec 2024):

  • Predictor Agent: For each retrieved document $d_i$, generates a provisional answer $a_i$ conditioned on the query $q$; outputs document-query-answer triplets.
  • Judge Agent: Evaluates each $(d_i, q, a_i)$ triplet to compute a continuous log-odds relevance score $s_i = \log P_\text{Yes} - \log P_\text{No}$, interpreting this as the document's supportiveness for the given query and candidate answer (a scoring sketch follows this list).
  • Final-Predictor Agent: Receives the adaptively filtered, relevance-ranked set of documents and synthesizes the final answer conditioned on this pruned context.
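
To make the Judge's scoring concrete, the sketch below shows one way to derive the log-odds score from an LLM's next-token probabilities. The `llm.token_logprobs` interface is a hypothetical stand-in for whatever logprob access the underlying model exposes; the papers specify only the $\log P_\text{Yes} - \log P_\text{No}$ form.

```python
def judge_score(llm, query: str, doc: str, answer: str) -> float:
    """Log-odds supportiveness score s_i = log P(Yes) - log P(No).

    `llm.token_logprobs` is a hypothetical interface returning the
    log-probability the model assigns to each candidate next token.
    """
    prompt = (
        f"Question: {query}\n"
        f"Document: {doc}\n"
        f"Candidate answer: {answer}\n"
        "Does the document support answering the question? Answer Yes or No:"
    )
    logprobs = llm.token_logprobs(prompt, candidates=["Yes", "No"])
    return logprobs["Yes"] - logprobs["No"]
```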

Extensible designs introduce further roles, including Planner Agents (for decomposing complex queries or orchestrating multi-hop retrieval), Step Definer and Extractor Agents (for explicit evidence extraction), Refiner/Reviser Agents (for iterative correctness and coverage verification), and Citation Agents (for attributing claims) (Besrour et al., 17 Oct 2025, Besrour et al., 20 Jun 2025).

The table below summarizes typical agent roles in MAIN-RAG implementations:

| Agent Role | Description | Example Implementations |
|---|---|---|
| Predictor | Provisional answer generation per document | (Chang et al., 31 Dec 2024; Besrour et al., 20 Jun 2025) |
| Judge/Filter | Document supportiveness scoring and filtering | (Chang et al., 31 Dec 2024; Besrour et al., 17 Oct 2025) |
| Final-Predictor | Answer synthesis over filtered documents | (Chang et al., 31 Dec 2024; Besrour et al., 20 Jun 2025) |
| Planner/Decomposer | Query/step decomposition | (Nguyen et al., 26 May 2025; Besrour et al., 17 Oct 2025) |
| Refiner | Completeness checking, gap-filling | (Besrour et al., 20 Jun 2025) |
| Citation Generator | In-line attribution generation | (Besrour et al., 17 Oct 2025; Besrour et al., 20 Jun 2025) |

MAIN-RAG architectures generally operate in a plug-and-play manner: each agent is invoked on demand, and the system can dynamically adjust agent invocation according to query complexity or evidence sufficiency (Chang et al., 31 Dec 2024, Besrour et al., 17 Oct 2025).

2. Adaptive Filtering and Consensus Mechanisms

The core innovation of MAIN-RAG lies in its adaptive document filtering strategy, implemented through inter-agent consensus and dynamic relevance thresholding. After initial retrieval (commonly top-$k$ from a dense or hybrid index), the filtering proceeds as follows (Chang et al., 31 Dec 2024, Besrour et al., 17 Oct 2025):

  1. For each document $d_i$, the Predictor produces $a_i$.
  2. The Judge computes relevance scores $s_i = \log P_\text{Yes}(d_i, q, a_i) - \log P_\text{No}(d_i, q, a_i)$, forming the set $R = \{s_1, \ldots, s_N\}$.
  3. The adaptive “judge bar” threshold is set as

$$\tau = \mu_R - n\sigma_R,$$

where $\mu_R$ is the mean and $\sigma_R$ the standard deviation of $\{s_i\}$, and $n$ is a tunable hyperparameter (typically $n \in [0, 0.5]$).

  4. Documents with $s_i \geq \tau$ are retained; others are discarded.
  5. The filtered, ranked list is then passed to downstream agents for synthesis.

This adaptive thresholding is self-calibrating: if retrieved documents are high quality, only the top-confidence items survive; if retrieval is mediocre (low $\mu_R$), the filter is more permissive to preserve recall.
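
As a minimal sketch, the filtering step follows directly from the formulas above; `scored_docs` is an assumed list of (document, judge score) pairs produced by the Judge agent.

```python
import statistics

def adaptive_filter(scored_docs, n=0.25):
    """Keep documents whose judge score clears the adaptive bar
    tau = mu_R - n * sigma_R, returned in descending score order.

    scored_docs: list of (document, s_i) pairs from the Judge agent.
    n: tunable hyperparameter, typically in [0, 0.5].
    """
    scores = [s for _, s in scored_docs]
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)  # std over the retrieved set R
    tau = mu - n * sigma
    kept = [(d, s) for d, s in scored_docs if s >= tau]
    # Descending score order matches the recommended context concatenation.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```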

Several extensions refine this consensus mechanism, as described in the sections that follow.

3. Integration with Query Decomposition and Modular Reasoning

MAIN-RAG supports explicit decomposition of complex information-seeking tasks via planning and step-defining agents (Nguyen et al., 26 May 2025, Besrour et al., 17 Oct 2025, Salemi et al., 12 Jun 2025). For compound or ambiguous queries:

  • A Planner/Decomposer agent splits the top-level question $Q$ into a set of sub-questions $\{q_1, \ldots, q_m\}$, each targeting a distinct aspect of $Q$.
  • For each $q_i$, retrieval and evidence filtering proceed independently (using adaptive judge bars as above).
  • A synthesis agent then aggregates answers and supporting evidence, optionally verifying compositional coverage and attribution; the control flow is sketched after this list.
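
A schematic of this control flow, with the agent prompts abstracted behind caller-supplied functions (all names here are illustrative, not from the papers), reusing `adaptive_filter` from the sketch above:

```python
from typing import Callable

def answer_compound_query(
    question: str,
    plan: Callable[[str], list],               # Planner/Decomposer agent
    retrieve: Callable[[str], list],           # top-k documents for a query
    predict: Callable[[str, str], str],        # Predictor: answer per doc
    judge: Callable[[str, str, str], float],   # Judge: log-odds score
    synthesize: Callable[[str, dict], str],    # aggregation/synthesis agent
    n: float = 0.25,
) -> str:
    """Decompose Q into sub-questions, filter evidence per sub-question
    with the adaptive judge bar, then synthesize a final answer."""
    evidence = {}
    for q_i in plan(question):
        scored = [(d, judge(q_i, d, predict(q_i, d))) for d in retrieve(q_i)]
        evidence[q_i] = adaptive_filter(scored, n=n)  # per-sub-question bar
    return synthesize(question, evidence)
```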

This modular design is crucial for multi-hop reasoning, question ambiguity resolution, and integrating evidence from heterogeneous sources (Nguyen et al., 26 May 2025, Besrour et al., 20 Jun 2025).

4. Hybrid and Domain-Aware Retrieval

MAIN-RAG systems frequently employ hybrid retrieval—combining sparse (BM25) and dense (embedding) retrieval—to maximize recall and coverage in large or heterogeneous corpora (Besrour et al., 17 Oct 2025, Besrour et al., 20 Jun 2025). The hybrid score for a document $d$ and query $q$ is computed as

$$S_\text{hybrid}(d \mid q) = \alpha\, S_\text{sparse}(d \mid q) + (1 - \alpha)\, S_\text{dense}(d \mid q),$$

with $\alpha$ tuned per task or dataset (e.g., $\alpha = 0.35$ in arXiv-scale scientific QA (Besrour et al., 17 Oct 2025)) to interpolate between term-based and semantic similarity.
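
A minimal fusion sketch, assuming min-max normalization to put the two retrievers' scores on a common scale (a common convention; the papers do not specify the normalization):

```python
def hybrid_scores(sparse: dict, dense: dict, alpha: float = 0.35) -> dict:
    """Fuse per-document sparse (BM25) and dense (embedding) scores as
    S_hybrid = alpha * S_sparse + (1 - alpha) * S_dense."""
    def minmax(scores: dict) -> dict:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against identical scores
        return {doc: (s - lo) / span for doc, s in scores.items()}

    s, d = minmax(sparse), minmax(dense)
    # A document missing from one retriever scores 0 in that channel.
    return {
        doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
        for doc in set(s) | set(d)
    }
```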

Domain-aware agent routing is applied for complex data environments, with specialized agents and modular adapters for structured (SQL), semi-structured (NoSQL, graph), or unstructured (text) sources (Salve et al., 8 Dec 2024). Lightweight classifiers or learned routing modules dispatch queries to the appropriate agent, reducing token overhead and irrelevant data fusion.

5. Experimental Results and Performance Analysis

MAIN-RAG consistently outperforms traditional RAG baselines on a variety of open-domain, closed-domain, and multi-hop QA tasks, as well as in specialized domains such as scientific QA and fintech (Chang et al., 31 Dec 2024, Besrour et al., 17 Oct 2025, Cook et al., 29 Oct 2025, Besrour et al., 20 Jun 2025). Representative results include:

  • Training-free MAIN-RAG (Mistral-7B): TriviaQA accuracy 71.0% (vs. 69.4% baseline), PopQA 58.9% (vs. 55.5%), ARC-C 58.9% (vs. 57.1%) (Chang et al., 31 Dec 2024).
  • SQuAI (domain-wide scientific QA): 12% improvement in faithfulness, precision gains in claim relevance, with every factual claim grounded by in-line citation and evidence context extraction (Besrour et al., 17 Oct 2025).
  • RAGentA: +10.7% in answer faithfulness over standard RAG baselines, with hybrid retrieval yielding +12.5% Recall@20 and integrated revision agents boosting answer coverage (Besrour et al., 20 Jun 2025).
  • Fintech MAIN-RAG: Strict Hit@5 increased from 54.1% to 62.4%, semantic accuracy improved by 0.69, with procedural queries seeing nearly double coverage (Cook et al., 29 Oct 2025).

In all cases, the cost of the additional agentic processing (e.g., multi-agent filtering and sub-query decomposition) is increased latency, typically 3–5 seconds per query in complex scientific or industry domains (Cook et al., 29 Oct 2025). The reliability and faithfulness improvements, however, are substantial.

6. Extensions, Analysis, and Deployment Considerations

MAIN-RAG is extensible to learned settings such as multi-agent reinforcement learning and adversarial training:

  • Multi-agent RL optimization (e.g., MMOA-RAG) aligns Query-Rewriter, Selector, and Generator agents by optimizing a shared reward (such as F1), with policy gradients distributed across all roles (Chen et al., 25 Jan 2025).
  • Adversarial multi-agent tuning introduces an Attacker agent that fabricates plausible but misleading documents, forcing the Generator to develop robustness to fake or noisy contexts via KL regularization (Zhu et al., 28 May 2024).
  • Plug-and-play deployment is a key design feature: MAIN-RAG can be overlaid onto any retriever-generator pipeline without retraining or fine-tuning, with agent prompts modular and independently upgradable (Chang et al., 31 Dec 2024).

Deployment guidelines emphasize minimal prompt design, small retrieved document sets (top-20), monitorable relevance-distribution statistics, and descending score order for context concatenation. The architecture suits on-premise, data-sensitive, or high-stakes environments where robustness and interpretability are paramount (Cook et al., 29 Oct 2025, Besrour et al., 17 Oct 2025).
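
For the monitoring guideline in particular, a small sketch of per-query relevance-distribution telemetry (the field names are illustrative, not from the papers):

```python
import statistics

def judge_bar_telemetry(scores: list, n: float = 0.25) -> dict:
    """Per-query statistics of the Judge's score distribution, useful
    for tracking retrieval quality and filter behavior in production."""
    mu, sigma = statistics.mean(scores), statistics.pstdev(scores)
    tau = mu - n * sigma
    return {
        "mean_score": mu,    # a low mean signals weak retrieval
        "score_std": sigma,
        "judge_bar": tau,    # the adaptive threshold actually applied
        "kept_fraction": sum(s >= tau for s in scores) / len(scores),
    }
```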

7. Interpretability, Modular Control, and Future Work

MAIN-RAG advances interpretable retrieval-augmented generation by exposing intermediate agent decisions, filtering trajectories, and explicit consensus scores at each stage (Chang et al., 31 Dec 2024, Besrour et al., 20 Jun 2025). Modular agent invocation allows dynamic resource allocation and selective refinement, particularly in ambiguous or multi-aspect queries. Emerging research suggests further gains can be realized by:

  • Generalizing threshold parameters as learned, query-dependent functions.
  • Enriching agent collaboration protocols (e.g., incorporating verification and contradiction-resolution agents).
  • Extending to open-world, multi-modal RAG with specialized retrievers for text, graph, and structured tabular data under unified supervisor agents (Xu et al., 1 Sep 2025).

MAIN-RAG establishes a cohesive paradigm for robust, precise, and maintainable retrieval-augmented generation, and provides a blueprint for future adaptation and application across domains that require high-fidelity LLM outputs grounded in dynamic, heterogeneous evidence (Chang et al., 31 Dec 2024, Besrour et al., 17 Oct 2025, Besrour et al., 20 Jun 2025).
