
MAIN-RAG: Multi-Agent Filtering RAG

Updated 19 November 2025
  • MAIN-RAG is a multi-agent system that decomposes the retrieval and generation process into specialized agents for filtering, scoring, and synthesizing documents.
  • It employs dynamic thresholding, adaptive consensus, and geometric merging to systematically prune noisy or irrelevant context from retrieved data.
  • The framework improves QA performance by delivering significant gains in correctness, faithfulness, and interpretability across diverse open-domain and specialized tasks.

Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG) is a system class that integrates multiple autonomous LLM agents into the Retrieval-Augmented Generation (RAG) paradigm, enabling robust filtering, selection, and synthesis of external documents for improved answer accuracy, faithfulness, and compositional reasoning in open-domain and specialized question answering. Architectures in this domain deploy multi-agent workflows to overcome limitations of single-agent retrieval baselines—particularly issues of noisy or irrelevant context inclusion, insufficient grounding, and poor modularity. MAIN-RAG approaches use dynamic thresholding, agentic reasoning, adaptive consensus scoring, and iterative refinement to systematically “winnow” non-supporting evidence, often without model fine-tuning. Recent research demonstrates significant gains in correctness, faithfulness, and transparency across diverse QA tasks, with key frameworks including WinnowRAG (Wang et al., 1 Nov 2025), RAGentA (Besrour et al., 20 Jun 2025), SIRAG (Wang et al., 17 Sep 2025), and foundational training-free variants (Chang et al., 31 Dec 2024).

1. MAIN-RAG Framework Architectures and Principles

MAIN-RAG frameworks instantiate multiple independent or cooperative agents (typically LLMs, occasionally lightweight variants) that, sequentially or in parallel, process, score, filter, and synthesize the documents and context retrieved for a user query. These architectures usually decompose the RAG pipeline into modules such as hybrid retrieval, per-document judging and scoring, adaptive filtering, and answer synthesis with optional revision.

Modularity and strict agent separation are key: each agent performs a specialized function, passes its outputs in a canonical format, and may interact over well-defined protocols (e.g., JSON over HTTP microservices (Besrour et al., 17 Oct 2025)).
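As a concrete illustration of this separation, the following is a minimal Python sketch of agents exchanging a canonical JSON-serializable message; the agent names, message fields, and placeholder logic are hypothetical and not taken from any of the cited systems.

```python
# Minimal sketch of strict agent separation over a canonical JSON message format.
# Agent names, message fields, and placeholder logic are illustrative only.
import json
from typing import Callable

Message = dict  # JSON-serializable payload passed between agents

def retriever_agent(msg: Message) -> Message:
    """Hypothetical retriever: attaches candidate documents to the payload."""
    msg["documents"] = [{"id": i, "text": f"document {i}"} for i in range(5)]  # stub
    return msg

def judge_agent(msg: Message) -> Message:
    """Hypothetical judge: scores each document's support for the query."""
    for doc in msg["documents"]:
        doc["score"] = 0.0  # placeholder; a real judge would query an LLM
    return msg

def synthesizer_agent(msg: Message) -> Message:
    """Hypothetical synthesizer: drafts an answer from retained documents."""
    retained = [d["text"] for d in msg["documents"] if d["score"] >= 0.0]
    msg["answer"] = " ".join(retained)  # placeholder generation
    return msg

def run_pipeline(query: str, agents: list[Callable[[Message], Message]]) -> Message:
    msg: Message = {"query": query}
    for agent in agents:
        # Round-trip through JSON to enforce the canonical-format contract,
        # mirroring the "JSON over HTTP microservices" pattern described above.
        msg = json.loads(json.dumps(agent(msg)))
    return msg

result = run_pipeline("What is MAIN-RAG?", [retriever_agent, judge_agent, synthesizer_agent])
```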

2. Document Filtering Mechanisms

MAIN-RAG systems systematically filter noisy, irrelevant, or misleading documents using agentic consensus and adaptive statistical criteria. Typical mechanisms include:

  • Log-probability difference scoring:

$$s_i = \log p(\mathrm{Yes} \mid T_i) - \log p(\mathrm{No} \mid T_i)$$

where the triplet $T_i = (q, d_i, a_i)$ combines the query, document, and candidate answer. This continuous score captures the LLM judge’s confidence in the document’s support (Besrour et al., 20 Jun 2025, Chang et al., 31 Dec 2024, Besrour et al., 17 Oct 2025).

  • Adaptive thresholding: The filter threshold for a query is dynamically computed from score distributions,

$$\tau_q = \mathrm{mean}(s_i), \quad \sigma = \mathrm{std}(s_i), \quad \tau'_q = \tau_q - n \cdot \sigma$$

Documents with $s_i \ge \tau'_q$ are retained, with $n$ tuned (often $0.5$) to balance recall and precision (Besrour et al., 20 Jun 2025, Chang et al., 31 Dec 2024, Besrour et al., 17 Oct 2025).

  • Clustering and merge-based winnowing: Query-aware clustering, i.e., K-means on embeddings of $(q \oplus d_i)$, groups retrieved documents into topic-centric clusters, each assigned to an agent for localized answering (Wang et al., 1 Nov 2025). Merging techniques further consolidate high-certainty answers and their supporting document pools using geometry-inspired (ellipse/hyperbola) rules.
  • Multi-agent consensus: Multiple judges may be run in parallel, and their scores averaged for robust selection (Chang et al., 31 Dec 2024).

This agent-driven, flexible filtering approach enables context pruning tailored to the per-query noise level, outperforming static top-$K$ selection.
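The log-probability scoring and adaptive thresholding above can be combined into a single filtering step. The following is a minimal Python sketch under the assumption of a hypothetical `judge_logprobs` helper that prompts an LLM judge with the $(q, d_i, a_i)$ triplet and returns the log-probabilities of answering “Yes” and “No”.

```python
# Minimal sketch of log-probability difference scoring with an adaptive threshold.
# `judge_logprobs` is a hypothetical helper wrapping an LLM judge; it must return
# (log p("Yes" | T_i), log p("No" | T_i)) for the triplet T_i = (query, document, answer).
import statistics

def judge_logprobs(query: str, document: str, answer: str) -> tuple[float, float]:
    raise NotImplementedError("wrap the LLM judge of your choice here")

def filter_documents(query, documents, answers, n: float = 0.5):
    # s_i = log p("Yes" | T_i) - log p("No" | T_i)
    scores = []
    for doc, ans in zip(documents, answers):
        log_yes, log_no = judge_logprobs(query, doc, ans)
        scores.append(log_yes - log_no)

    # Adaptive per-query threshold: tau'_q = mean(s_i) - n * std(s_i)
    tau_q = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    threshold = tau_q - n * sigma

    # Retain documents clearing the threshold, ranked by judge confidence
    retained = [(doc, s) for doc, s in zip(documents, scores) if s >= threshold]
    return sorted(retained, key=lambda pair: pair[1], reverse=True)
```

Averaging the scores of several judge instances before thresholding yields the multi-agent consensus variant mentioned above.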

3. Hybrid Retrieval Strategies and Document Scoring

MAIN-RAG designs frequently use hybrid retrieval to enhance recall and context diversity. The standard approach linearly interpolates sparse (BM25) and dense (embedding) retrieval scores:

$$S_{\mathrm{hybrid}}(d) = \alpha \, S_{\mathrm{sparse}}(d) + (1-\alpha) \, S_{\mathrm{dense}}(d)$$

Values of $\alpha$ are tuned per system (e.g., $0.65$ in RAGentA (Besrour et al., 20 Jun 2025), $0.35$ in SQuAI (Besrour et al., 17 Oct 2025)), with the top candidates by $S_{\mathrm{hybrid}}$ passed down the pipeline. Dense retrievers (e.g., E5, SBERT) encode queries and documents into latent vectors, scored by cosine similarity.
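A minimal sketch of the interpolation follows; the min-max normalization step is an added assumption so that BM25 and cosine-similarity scores are on comparable scales, and the default $\alpha = 0.65$ mirrors the RAGentA setting cited above.

```python
# Minimal sketch of hybrid sparse/dense score interpolation.
# Min-max normalization is an assumption added so the two score scales are comparable.
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

def hybrid_scores(sparse: dict[str, float], dense: dict[str, float],
                  alpha: float = 0.65) -> dict[str, float]:
    sparse, dense = min_max_normalize(sparse), min_max_normalize(dense)
    # S_hybrid(d) = alpha * S_sparse(d) + (1 - alpha) * S_dense(d)
    return {
        d: alpha * sparse.get(d, 0.0) + (1 - alpha) * dense.get(d, 0.0)
        for d in set(sparse) | set(dense)
    }

# Rank candidates by hybrid score before passing the top ones down the pipeline
ranked = sorted(hybrid_scores({"d1": 12.3, "d2": 7.8}, {"d1": 0.62, "d2": 0.81}).items(),
                key=lambda kv: kv[1], reverse=True)
```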

Advanced systems may also perform query-aware clustering over the top-$N$ documents ($N$ often $20$–$50$), then assign a local LLM agent per cluster (Wang et al., 1 Nov 2025). This promotes topical diversity and reduces redundant evidence during answer synthesis.
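A minimal sketch of this query-aware clustering with scikit-learn K-means is shown below; the `embed` helper is a hypothetical stand-in for a dense encoder such as E5 or SBERT.

```python
# Minimal sketch of query-aware clustering: K-means over embeddings of (q ⊕ d_i),
# yielding one topic-centric document pool per local answering agent.
# `embed` is a hypothetical helper wrapping a dense encoder (e.g., E5, SBERT).
import numpy as np
from sklearn.cluster import KMeans

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("wrap a dense encoder here")

def cluster_documents(query: str, documents: list[str], k: int = 4) -> dict[int, list[str]]:
    # Embed the concatenation of the query with each retrieved document
    vectors = np.stack([embed(query + " " + doc) for doc in documents])
    labels = KMeans(n_clusters=k, n_init="auto", random_state=0).fit_predict(vectors)
    # Group documents by cluster; each pool is handed to a local LLM agent
    pools: dict[int, list[str]] = {}
    for doc, label in zip(documents, labels):
        pools.setdefault(int(label), []).append(doc)
    return pools
```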

4. Answer Generation, Citation Integration, and Dynamic Refinement

Synthesis modules (often labeled “Final-Predictor” or “Reviser”) generate the final answer from the filtered, ranked documents, frequently with in-line citations for factual traceability. Citation enforcement is achieved through prompt engineering, e.g., instructing the model to append citation tokens (e.g., “[3]”, “[7,12]”) immediately after every factual claim (Besrour et al., 20 Jun 2025, Besrour et al., 17 Oct 2025). Agentic workflows verify completeness by decomposing multi-part queries into sub-questions, assessing coverage, and issuing additional targeted retrievals if required (Besrour et al., 20 Jun 2025, Besrour et al., 17 Oct 2025). This dynamic refinement loop halts when all sub-questions are “fully” answered or the retriever returns no new relevant contexts.
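The refinement loop can be outlined as below; the prompt wording and the `decompose`, `is_answered`, and `retrieve_more` callables are illustrative stand-ins for the agentic steps, not the cited systems’ actual prompts or interfaces.

```python
# Minimal sketch of citation-enforced generation with a dynamic refinement loop.
# `llm`, `decompose`, `is_answered`, and `retrieve_more` are caller-supplied stubs.
CITATION_PROMPT = (
    "Answer the question using only the numbered documents below. "
    "Append citation tokens such as [3] or [7,12] immediately after every factual claim.\n"
    "Question: {question}\nDocuments:\n{documents}"
)

def refine_and_answer(question, documents, llm, decompose, is_answered, retrieve_more,
                      max_rounds: int = 3) -> str:
    sub_questions = decompose(question)            # split multi-part queries
    for _ in range(max_rounds):
        unanswered = [sq for sq in sub_questions if not is_answered(sq, documents)]
        if not unanswered:
            break                                   # every sub-question is covered
        extra = retrieve_more(unanswered)           # targeted follow-up retrieval
        if not extra:
            break                                   # retriever has nothing new; stop
        documents = documents + extra
    numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
    return llm(CITATION_PROMPT.format(question=question, documents=numbered))
```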

5. Evaluation Metrics and Empirical Performance

MAIN-RAG frameworks emphasize metrics at both context and answer levels:

  • Correctness ($C$): defined by coverage ($\mathrm{Cov}$) and relevance ($\mathrm{Rel}$) with respect to the ground truth.

$$\mathrm{Cov} = \frac{|\text{facts in } A \,\cap\, \text{facts in } GT|}{|\text{facts in } GT|}, \qquad \mathrm{Rel} = \frac{|\text{relevant tokens in } A|}{|\text{tokens in } A|}$$

LLM-based judges may emit categorical or continuous values for answer types (Besrour et al., 20 Jun 2025).

  • Faithfulness ($F$): the proportion of claims in the answer verifiably grounded in the retained documents; scored 1 if fully supported, 0 if incomplete, −1 if unsupported (Besrour et al., 20 Jun 2025, Besrour et al., 17 Oct 2025).
  • Retrieval metrics: MRR@20, Recall@20, and clustering-based precision/recall for filtered pools (Besrour et al., 20 Jun 2025, Wang et al., 1 Nov 2025).
  • Empirical gains: RAGentA yields +1.1% in correctness and +10.7% in faithfulness over strong hybrid RAG baselines (500 QA pairs, FineWeb) (Besrour et al., 20 Jun 2025); WinnowRAG achieves 68.1% accuracy on PopQA, outperforming InstructRAG-ICL by 4 points (8B, zero-shot) (Wang et al., 1 Nov 2025). Ablations demonstrate that removing multi-agent filtering, clustering, or merging degrades accuracy by 2–10%.
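As a minimal sketch, the coverage and relevance definitions above reduce to simple set and token ratios once fact extraction and relevance labeling (normally performed by an LLM judge) have produced plain Python collections.

```python
# Minimal sketch of the coverage and relevance metrics. Extracting facts and
# labeling relevant tokens is assumed to happen upstream (e.g., by an LLM judge).
def coverage(answer_facts: set[str], ground_truth_facts: set[str]) -> float:
    # Cov = |facts in A ∩ facts in GT| / |facts in GT|
    if not ground_truth_facts:
        return 0.0
    return len(answer_facts & ground_truth_facts) / len(ground_truth_facts)

def relevance(answer_tokens: list[str], relevant_tokens: set[str]) -> float:
    # Rel = |relevant tokens in A| / |tokens in A|
    if not answer_tokens:
        return 0.0
    return sum(tok in relevant_tokens for tok in answer_tokens) / len(answer_tokens)
```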

6. Agentic Ablations, Limitations, and Extensibility

Ablation studies across MAIN-RAG variants reveal that every agent contributes materially to performance:

  • Remove filtering module: −2 to −7 pts accuracy
  • Replace query-aware clustering: −2 to −4 pts accuracy
  • Omit geometric merging: −5 pts accuracy
  • Drop iterative winnowing: −6 to −10 pts accuracy

MAIN-RAG architectures are typically training-free, operating via zero- or few-shot prompting, and compatible with plug-and-play extension to new LLM families or retrieval modalities (Chang et al., 31 Dec 2024, Besrour et al., 20 Jun 2025, Wang et al., 1 Nov 2025). Limitations include increased inference cost for agentic evaluation, prompt sensitivity in judge modules, and scaling challenges as the number of agents or sub-questions grows. Robustness can be further enhanced via process-level RL reward schemes (Wang et al., 17 Sep 2025).

7. Broader Impact and Research Directions

MAIN-RAG methods generalize across open-domain, scientific, and domain-specialized QA tasks, providing increased robustness against spurious and noisy context and enabling more interpretable, auditable reasoning chains, especially when combined with in-line citation workflows (Besrour et al., 20 Jun 2025, Besrour et al., 17 Oct 2025). Future directions include dynamic inter-agent thresholding, multi-modal agent teams, RL-driven agent orchestration, and process-distilled lightweight critics for scalable multi-agent supervision.

MAIN-RAG research establishes multi-agent filtering as an essential paradigm in RAG system design, proving that modular agent decomposition can systematically improve retrieval, denoising, and answer synthesis, and enabling high-accuracy, transparent, and reliable information integration for LLMs (Besrour et al., 20 Jun 2025, Wang et al., 1 Nov 2025, Chang et al., 31 Dec 2024).
