- The paper’s main contribution is MAC-RAG, a multi-agent framework that iteratively resolves conflicting evidence by synthesizing individual document responses.
- It introduces the RAMDocs dataset to simulate real-world scenarios with varied valid answers, misinformation, and noise for robust retrieval-augmented generation.
- Empirical results demonstrate that MAC-RAG outperforms baselines, significantly improving accuracy in handling ambiguity and evidence imbalance.
This paper addresses the challenge of handling conflicting information within Retrieval-Augmented Generation (RAG) systems (2504.13079). Real-world RAG applications often encounter ambiguous user queries and retrieve documents containing a mix of correct information, misinformation, and irrelevant noise. Existing methods typically tackle these issues in isolation, whereas this work proposes a benchmark and a multi-agent method that handle them simultaneously.
Problem: Standard RAG systems struggle when retrieved documents present conflicting information due to:
- Ambiguity: The query refers to multiple entities (e.g., "Michael Jordan" the basketball player vs. the professor), requiring the system to present all valid answers.
- Misinformation: Some documents contain plausible but factually incorrect information.
- Noise: Some documents are irrelevant to the query.
The challenge is to differentiate between valid conflicts (ambiguity) and invalid ones (misinformation/noise) and respond appropriately.
Proposed Solution: RAMDocs Dataset and MAC-RAG Method
- RAMDocs Dataset: To evaluate systems under realistic conflict scenarios, the paper introduces the RAMDocs dataset.
- Construction: Built upon AmbigDocs [lee2024ambigdocs], which focuses on ambiguity. RAMDocs samples queries with 1-3 correct answers.
- Features:
- Variable Document Support: The number of documents supporting each valid answer is varied (1-3 docs per answer), simulating real-world retrieval imbalances. Supporting documents are retrieved using the Brave Search API and chunked.
- Misinformation: Documents containing plausible but incorrect entities (swapped using a method similar to [longpre-etal-2021-entity]) are added (0-2 per query).
- Noise: Irrelevant or low-quality documents are included (0-2 per query).
- Purpose: Provides a challenging benchmark requiring simultaneous handling of ambiguity, misinformation, noise, and evidence imbalance. On average, each query has 2.20 valid answers within a pool of 5.53 documents (3.84 supporting valid answers, 1.70 misinformation/noise). A hypothetical example entry is sketched below.
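To make the document mix concrete, here is a hypothetical RAMDocs-style entry sketched in Python; the field names, labels, and document texts are illustrative assumptions, not the released file schema.

```python
# Hypothetical RAMDocs-style entry (field names are assumptions, not the released schema).
# One ambiguous query, two valid answers with imbalanced document support,
# plus one misinformation document and one noise document.
example_entry = {
    "query": "Which university is Michael Jordan associated with?",
    "valid_answers": ["University of North Carolina", "UC Berkeley"],  # the athlete vs. the professor
    "documents": [
        {"text": "Michael Jordan played college basketball at the University of North Carolina ...",
         "label": "supports: University of North Carolina"},
        {"text": "Before joining the NBA, Jordan starred for the North Carolina Tar Heels ...",
         "label": "supports: University of North Carolina"},
        {"text": "Michael I. Jordan is a professor of computer science at UC Berkeley ...",
         "label": "supports: UC Berkeley"},
        {"text": "Michael Jordan played his college basketball at Duke University ...",
         "label": "misinformation"},   # plausible but incorrect (swapped entity)
        {"text": "The Air Jordan sneaker line debuted in 1985 ...",
         "label": "noise"},            # irrelevant to the query
    ],
}
```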
- MAC-RAG (Multi-Agent Conflict Resolution RAG): A multi-agent framework to process conflicting retrieved documents.
- Architecture:
- Individual Agents: Each retrieved document $d_i$ is assigned to an independent LLM agent $L_i$ (instantiated from the same base LLM). The agent generates an initial response $r_i^{(0)} = L_i(q, d_i)$ based only on its assigned document and the query $q$. This isolates perspectives and prevents dominant but potentially incorrect information from overwhelming minority valid answers early on.
- Aggregator: A central aggregator module $A$ receives all agent responses $R^{(t)} = \{r_i^{(t)}\}_{i=1}^{n}$ in round $t$ and synthesizes a combined answer and explanation $(y^{(t)}, e^{(t)}) = A(R^{(t)})$. It identifies consensus, resolves conflicts, distinguishes ambiguity from misinformation, and filters noise based on the agents' reasoning.
- Multi-Round Debate: The process iterates for up to $T$ rounds. In each round $t > 0$, agents receive the aggregator's summary from the previous round, $(y^{(t-1)}, e^{(t-1)})$, and are prompted to reflect and potentially revise their response: $r_i^{(t)} = L_i(q, d_i, y^{(t-1)}, e^{(t-1)})$. This allows agents supporting valid but conflicting answers (due to ambiguity) to maintain their stance, while agents relying on misinformation may retract or be overruled based on the collective evidence and reasoning.
- Early Stopping: The debate stops if all agent responses remain unchanged from the previous round ($r_i^{(t)} = r_i^{(t-1)}$ for all $i$), or after $T$ rounds. The final answer is $y = A(R^{(t_{\text{end}})})$.
- Implementation Detail: The interaction flow can be visualized as follows:

```mermaid
graph LR
    subgraph R0 ["Round t"]
        A["Agent 1 (Doc 1)"] -- "r1(t)" --> AGG[Aggregator]
        B["Agent 2 (Doc 2)"] -- "r2(t)" --> AGG
        C["..."] -- "..." --> AGG
        D["Agent n (Doc n)"] -- "rn(t)" --> AGG
    end
    AGG -- "y(t), e(t)" --> E{Summary}
    subgraph R1 ["Round t+1"]
        E -- "y(t), e(t)" --> A1[Agent 1]
        E -- "y(t), e(t)" --> B1[Agent 2]
        E -- "y(t), e(t)" --> C1["..."]
        E -- "y(t), e(t)" --> D1[Agent n]
        A1 -- "r1(t+1)" --> AGG1[Aggregator]
        B1 -- "r2(t+1)" --> AGG1
        C1 -- "..." --> AGG1
        D1 -- "rn(t+1)" --> AGG1
    end
    Q[Query] --> A & B & C & D
    Doc1[Doc 1] --> A
    Doc2[Doc 2] --> B
    DocN[Doc n] --> D
    AGG1 --> FinalAnswer[Final Answer]
```
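A minimal sketch of this agent/aggregator debate loop in Python, assuming a generic `llm(prompt)` completion callable; the prompt wording and helper names are illustrative assumptions, not the paper's released prompts (those appear in its appendix).

```python
def mac_rag(llm, query, documents, max_rounds=3):
    """Sketch of the multi-round debate: one agent per document plus an aggregator.

    `llm` is any callable mapping a prompt string to a response string; the
    prompt wording below is illustrative, not the paper's exact prompts.
    """
    def agent_answer(doc):
        # Round 0: each agent answers from its single assigned document only.
        return llm(
            "Answer the question using ONLY the document below.\n"
            f"Document: {doc}\nQuestion: {query}\n"
            "Give your answer with a brief justification."
        )

    def agent_revise(doc, prev, summary):
        # Rounds t > 0: the agent sees the aggregator's summary and may revise.
        return llm(
            f"Document: {doc}\nQuestion: {query}\n"
            f"Your previous answer: {prev}\nAggregator summary: {summary}\n"
            "Keep your answer if your document still supports it; otherwise revise it."
        )

    def aggregate(responses):
        # The aggregator keeps every valid answer (ambiguity) and discards answers
        # grounded in misinformation or irrelevant documents (noise).
        return llm(
            "Combine these per-document answers to the question. List ALL valid "
            "answers, discard answers based on misinformation or irrelevant text, "
            "and explain your reasoning.\n"
            f"Question: {query}\nAgent answers:\n" + "\n".join(responses)
        )

    responses = [agent_answer(d) for d in documents]
    summary = aggregate(responses)                     # (y^(0), e^(0))

    for _ in range(1, max_rounds):
        revised = [agent_revise(d, r, summary) for d, r in zip(documents, responses)]
        if revised == responses:                       # early stopping: nothing changed
            break
        responses = revised
        summary = aggregate(responses)                 # (y^(t), e^(t)) = A(R^(t))

    return summary                                     # final answer y = A(R^(t_end))
```

Since the per-round agent calls are independent of one another, they could be issued in parallel; the total number of LLM calls, however, is unchanged (see the cost discussion under Practical Implementation Considerations).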
Experiments and Results
- Datasets: FaithEval (misinformation), AmbigDocs (ambiguity), RAMDocs (combined challenges).
- Models: Llama3.3-70B-Instruct, Qwen2.5-72B-Instruct, GPT-4o-mini.
- Baselines: No RAG (parametric only), Concatenated-prompt (standard RAG), Astute RAG [wang2024astuteragovercomingimperfect].
- Metric: Exact Match (must include all gold answers and no incorrect answers; a minimal check is sketched after this list).
- Key Findings:
- MAC-RAG consistently outperformed baselines across all datasets and models (e.g., +11.40% vs. Astute RAG on AmbigDocs with Llama3.3; +15.80% vs. Concatenated-prompt on FaithEval with Llama3.3).
- RAMDocs proved significantly more challenging for all methods, with top scores around 30-35% EM, highlighting the difficulty of handling combined conflicts. MAC-RAG still provided the best results on RAMDocs.
- Ablations showed that both the multi-round debate (+5.30% accuracy on FaithEval) and the aggregator module (+19% accuracy on FaithEval) significantly contribute to MAC-RAG's performance, particularly improving precision by effectively filtering misinformation.
- Analysis showed MAC-RAG is more robust than baselines to varying numbers of supporting documents (less drop in performance with imbalance) and increasing levels of misinformation (degrades less severely).
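For reference, a minimal sketch of the all-and-only exact-match criterion described above, assuming simple lowercase/whitespace normalization (the normalization is an assumption, not the paper's exact scoring script).

```python
def exact_match(predicted_answers, gold_answers):
    """True only if the prediction contains all gold answers and no incorrect ones."""
    def norm(s):
        return " ".join(s.lower().split())
    return {norm(a) for a in predicted_answers} == {norm(a) for a in gold_answers}

# A prediction covering only one of two valid answers does not count as a match.
assert exact_match(["UNC", "UC Berkeley"], ["UC Berkeley", "UNC"])
assert not exact_match(["UNC"], ["UC Berkeley", "UNC"])
```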
Practical Implementation Considerations
- Computational Cost: MAC-RAG requires multiple LLM calls per query: N agent calls plus 1 aggregator call per round, for potentially multiple rounds (T). This is significantly more expensive than standard RAG (1 call) or Astute RAG (a few calls). Early stopping helps mitigate this, but the cost scales with the number of documents N and the number of rounds T (a rough call-count sketch follows this list).
- Prompting: Specific prompts are needed for the agents (focusing on their single document initially, then incorporating the summary) and the aggregator (synthesizing diverse inputs, explaining reasoning). Examples are provided in the paper's appendix.
- Scalability: The approach might face challenges with very large numbers of documents (N), as the aggregator needs to process N potentially lengthy responses, and the summary passed back to agents could grow complex.
- Trade-offs: There is a clear trade-off between computational cost and robustness: MAC-RAG invests more computation to achieve better handling of complex conflicts. The ablations also highlight a precision/recall trade-off managed by the aggregator.
- Use Cases: Best suited for applications where handling ambiguity correctly (presenting multiple valid options) and robustness to misinformation are critical, even at a higher computational cost, such as complex information synthesis, research assistance tools, or fact-checking systems.
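To make the cost comparison concrete, here is a rough per-query call-count bound under the assumption of N agent calls plus one aggregator call in every round that is actually run.

```python
def mac_rag_llm_calls(n_docs, rounds_run):
    """Rough upper bound on LLM calls per query for the debate loop sketched above.

    Assumes each executed round issues one call per document-level agent plus one
    aggregator call; early stopping reduces `rounds_run`.
    """
    return rounds_run * (n_docs + 1)

# With the RAMDocs average of ~5.5 documents and 3 debate rounds:
print(mac_rag_llm_calls(n_docs=6, rounds_run=3))  # 21 calls, vs. 1 call for standard RAG
```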
Code and Data: The RAMDocs dataset and MAC-RAG code are available at: https://github.com/HanNight/RAMDocs.