- The paper’s main contribution is MAC-RAG, a multi-agent framework that iteratively resolves conflicting evidence by synthesizing individual document responses.
- It introduces the RAMDocs dataset to simulate real-world scenarios with varied valid answers, misinformation, and noise for robust retrieval-augmented generation.
- Empirical results demonstrate that MAC-RAG outperforms baselines, significantly improving accuracy in handling ambiguity and evidence imbalance.
This paper addresses the challenge of handling conflicting information within Retrieval-Augmented Generation (RAG) systems (2504.13079). Real-world RAG applications often encounter ambiguous user queries and retrieve documents containing a mix of correct information, misinformation, and irrelevant noise. Existing methods typically tackle these issues in isolation, whereas this work proposes a benchmark and a multi-agent method that handle them simultaneously.
Problem: Standard RAG systems struggle when retrieved documents present conflicting information due to:
- Ambiguity: The query refers to multiple entities (e.g., "Michael Jordan" the basketball player vs. the professor), requiring the system to present all valid answers.
- Misinformation: Some documents contain plausible but factually incorrect information.
- Noise: Some documents are irrelevant to the query.
The challenge is to differentiate between valid conflicts (ambiguity) and invalid ones (misinformation/noise) and respond appropriately.
Proposed Solution: RAMDocs Dataset and MAC-RAG Method
- RAMDocs Dataset: To evaluate systems under realistic conflict scenarios, the paper introduces the RAMDocs dataset.
- Construction: Built upon AmbigDocs [lee2024ambigdocs], which focuses on ambiguity. RAMDocs samples queries with 1-3 correct answers.
- Features:
- Variable Document Support: The number of documents supporting each valid answer is varied (1-3 docs per answer), simulating real-world retrieval imbalances. Supporting documents are retrieved using the Brave Search API and chunked.
- Misinformation: Documents containing plausible but incorrect entities (swapped using a method similar to [longpre-etal-2021-entity]) are added (0-2 per query).
- Noise: Irrelevant or low-quality documents are included (0-2 per query).
- Purpose: Provides a challenging benchmark requiring simultaneous handling of ambiguity, misinformation, noise, and evidence imbalance. On average, each query has 2.20 valid answers within a pool of 5.53 documents (3.84 supporting valid answers, 1.70 misinformation/noise). A hypothetical example entry is sketched below.
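To make the document mix concrete, here is a hypothetical RAMDocs-style entry sketched in Python; the field names, labels, and document texts are illustrative assumptions, not the released file schema.

```python
# Hypothetical RAMDocs-style entry (field names are assumptions, not the released schema).
# One ambiguous query, two valid answers with imbalanced document support,
# plus one misinformation document and one noise document.
example_entry = {
    "query": "Which university is Michael Jordan associated with?",
    "valid_answers": ["University of North Carolina", "UC Berkeley"],  # the athlete vs. the professor
    "documents": [
        {"text": "Michael Jordan played college basketball at the University of North Carolina ...",
         "label": "supports: University of North Carolina"},
        {"text": "Before joining the NBA, Jordan starred for the North Carolina Tar Heels ...",
         "label": "supports: University of North Carolina"},
        {"text": "Michael I. Jordan is a professor of computer science at UC Berkeley ...",
         "label": "supports: UC Berkeley"},
        {"text": "Michael Jordan played his college basketball at Duke University ...",
         "label": "misinformation"},   # plausible but incorrect (swapped entity)
        {"text": "The Air Jordan sneaker line debuted in 1985 ...",
         "label": "noise"},            # irrelevant to the query
    ],
}
```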
- MAC-RAG (Multi-Agent Conflict Resolution RAG): A multi-agent framework to process conflicting retrieved documents.
- Architecture:
- Individual Agents: Each retrieved document $d_i$ is assigned to an independent LLM agent $L_i$ (instantiated from the same base LLM). The agent generates an initial response $r_i^{(0)} = L_i(q, d_i)$ based only on its assigned document and the query $q$. This isolates perspectives and prevents dominant but potentially incorrect information from overwhelming minority valid answers early on.
- Aggregator: A central aggregator module $A$ receives all agent responses $R^{(t)} = \{r_i^{(t)}\}_{i=1}^{n}$ in round $t$ and synthesizes a combined answer and explanation $(y^{(t)}, e^{(t)}) = A(R^{(t)})$. It identifies consensus, resolves conflicts, distinguishes ambiguity from misinformation, and filters noise based on the agents' reasoning.
- Multi-Round Debate: The process iterates for up to $T$ rounds. In each round $t > 0$, agents receive the aggregator's summary from the previous round, $(y^{(t-1)}, e^{(t-1)})$, and are prompted to reflect and potentially revise their response: $r_i^{(t)} = L_i(q, d_i, y^{(t-1)}, e^{(t-1)})$. This allows agents supporting valid but conflicting answers (due to ambiguity) to maintain their stance, while agents relying on misinformation may retract or be overruled based on the collective evidence and reasoning.
- Early Stopping: The debate stops if all agent responses remain unchanged from the previous round ($r_i^{(t)} = r_i^{(t-1)}$ for all $i$), or after $T$ rounds. The final answer is $y = A(R^{(t_{\text{end}})})$.
- Implementation Detail: The interaction flow can be visualized as follows:

```mermaid
graph LR
    subgraph R0 ["Round t"]
        A["Agent 1 (Doc 1)"] -- "r1(t)" --> AGG[Aggregator]
        B["Agent 2 (Doc 2)"] -- "r2(t)" --> AGG
        C["..."] -- "..." --> AGG
        D["Agent n (Doc n)"] -- "rn(t)" --> AGG
    end
    AGG -- "y(t), e(t)" --> E{Summary}
    subgraph R1 ["Round t+1"]
        E -- "y(t), e(t)" --> A1[Agent 1]
        E -- "y(t), e(t)" --> B1[Agent 2]
        E -- "y(t), e(t)" --> C1["..."]
        E -- "y(t), e(t)" --> D1[Agent n]
        A1 -- "r1(t+1)" --> AGG1[Aggregator]
        B1 -- "r2(t+1)" --> AGG1
        C1 -- "..." --> AGG1
        D1 -- "rn(t+1)" --> AGG1
    end
    Q[Query] --> A & B & C & D
    Doc1[Doc 1] --> A
    Doc2[Doc 2] --> B
    DocN[Doc n] --> D
    AGG1 --> FinalAnswer[Final Answer]
```
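A minimal sketch of this agent/aggregator debate loop in Python, assuming a generic `llm(prompt)` completion callable; the prompt wording and helper names are illustrative assumptions, not the paper's released prompts (those appear in its appendix).

```python
def mac_rag(llm, query, documents, max_rounds=3):
    """Sketch of the multi-round debate: one agent per document plus an aggregator.

    `llm` is any callable mapping a prompt string to a response string; the
    prompt wording below is illustrative, not the paper's exact prompts.
    """
    def agent_answer(doc):
        # Round 0: each agent answers from its single assigned document only.
        return llm(
            "Answer the question using ONLY the document below.\n"
            f"Document: {doc}\nQuestion: {query}\n"
            "Give your answer with a brief justification."
        )

    def agent_revise(doc, prev, summary):
        # Rounds t > 0: the agent sees the aggregator's summary and may revise.
        return llm(
            f"Document: {doc}\nQuestion: {query}\n"
            f"Your previous answer: {prev}\nAggregator summary: {summary}\n"
            "Keep your answer if your document still supports it; otherwise revise it."
        )

    def aggregate(responses):
        # The aggregator keeps every valid answer (ambiguity) and discards answers
        # grounded in misinformation or irrelevant documents (noise).
        return llm(
            "Combine these per-document answers to the question. List ALL valid "
            "answers, discard answers based on misinformation or irrelevant text, "
            "and explain your reasoning.\n"
            f"Question: {query}\nAgent answers:\n" + "\n".join(responses)
        )

    responses = [agent_answer(d) for d in documents]
    summary = aggregate(responses)                     # (y^(0), e^(0))

    for _ in range(1, max_rounds):
        revised = [agent_revise(d, r, summary) for d, r in zip(documents, responses)]
        if revised == responses:                       # early stopping: nothing changed
            break
        responses = revised
        summary = aggregate(responses)                 # (y^(t), e^(t)) = A(R^(t))

    return summary                                     # final answer y = A(R^(t_end))
```

Since the per-round agent calls are independent of one another, they could be issued in parallel; the total number of LLM calls, however, is unchanged (see the cost discussion under Practical Implementation Considerations).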
Experiments and Results
- Datasets: FaithEval (misinformation), AmbigDocs (ambiguity), RAMDocs (combined challenges).
- Models: Llama3.3-70B-Instruct, Qwen2.5-72B-Instruct, GPT-4o-mini.
- Baselines: No RAG (parametric only), Concatenated-prompt (standard RAG), Astute RAG [wang2024astuteragovercomingimperfect].
- Metric: Exact Match (must include all gold answers and no incorrect answers; a minimal check is sketched after this list).
- Key Findings:
- MAC-RAG consistently outperformed baselines across all datasets and models (e.g., +11.40% vs. Astute RAG on AmbigDocs with Llama3.3; +15.80% vs. Concatenated-prompt on FaithEval with Llama3.3).
- RAMDocs proved significantly more challenging for all methods, with top scores around 30-35% EM, highlighting the difficulty of handling combined conflicts. MAC-RAG still provided the best results on RAMDocs.
- Ablations showed that both the multi-round debate (+5.30% accuracy on FaithEval) and the aggregator module (+19% accuracy on FaithEval) significantly contribute to MAC-RAG's performance, particularly improving precision by effectively filtering misinformation.
- Analysis showed MAC-RAG is more robust than baselines to varying numbers of supporting documents (less drop in performance with imbalance) and increasing levels of misinformation (degrades less severely).
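For reference, a minimal sketch of the all-and-only exact-match criterion described above, assuming simple lowercase/whitespace normalization (the normalization is an assumption, not the paper's exact scoring script).

```python
def exact_match(predicted_answers, gold_answers):
    """True only if the prediction contains all gold answers and no incorrect ones."""
    def norm(s):
        return " ".join(s.lower().split())
    return {norm(a) for a in predicted_answers} == {norm(a) for a in gold_answers}

# A prediction covering only one of two valid answers does not count as a match.
assert exact_match(["UNC", "UC Berkeley"], ["UC Berkeley", "UNC"])
assert not exact_match(["UNC"], ["UC Berkeley", "UNC"])
```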
Practical Implementation Considerations
- Computational Cost: MAC-RAG requires multiple LLM calls per query: N agent calls plus 1 aggregator call per round, for potentially multiple rounds (T). This is significantly more expensive than standard RAG (1 call) or Astute RAG (a few calls). Early stopping helps mitigate this, but the cost scales with the number of documents N and the number of rounds T (a rough call-count sketch follows this list).
- Prompting: Specific prompts are needed for the agents (focusing on their single document initially, then incorporating the summary) and the aggregator (synthesizing diverse inputs, explaining reasoning). Examples are provided in the paper's appendix.
- Scalability: The approach might face challenges with very large numbers of documents (N), as the aggregator needs to process N potentially lengthy responses, and the summary passed back to agents could grow complex.
- Trade-offs: There is a clear trade-off between computational cost and robustness: MAC-RAG invests more computation to achieve better handling of complex conflicts. The ablations also highlight a precision/recall trade-off managed by the aggregator.
- Use Cases: Best suited for applications where handling ambiguity correctly (presenting multiple valid options) and robustness to misinformation are critical, even at a higher computational cost, such as complex information synthesis, research assistance tools, or fact-checking systems.
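To make the cost comparison concrete, here is a rough per-query call-count bound under the assumption of N agent calls plus one aggregator call in every round that is actually run.

```python
def mac_rag_llm_calls(n_docs, rounds_run):
    """Rough upper bound on LLM calls per query for the debate loop sketched above.

    Assumes each executed round issues one call per document-level agent plus one
    aggregator call; early stopping reduces `rounds_run`.
    """
    return rounds_run * (n_docs + 1)

# With the RAMDocs average of ~5.5 documents and 3 debate rounds:
print(mac_rag_llm_calls(n_docs=6, rounds_run=3))  # 21 calls, vs. 1 call for standard RAG
```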
Code and Data: The RAMDocs dataset and MAC-RAG code are available at: https://github.com/HanNight/RAMDocs.