BayesRAG: Bayesian Enhancements in RAG

Updated 13 January 2026
  • BayesRAG is a Bayesian inference-based enhancement to Retrieval-Augmented Generation that integrates probabilistic evidence fusion to improve quality and consistency.
  • It leverages domain-informed priors, LLM-derived likelihoods, and Dempster–Shafer theory to assess and select the most informative context over multiple modalities.
  • Experimental evaluations reveal significant gains in QA accuracy, retrieval recall, and robustness across unimodal and multimodal document settings.

BayesRAG refers to a family of Bayesian inference-based enhancements to Retrieval-Augmented Generation (RAG) architectures for both unimodal (text) and multimodal (text-image) document question answering. Classical RAG relies primarily on vector similarity retrieval to select context chunks, but this approach does not account for the variable informativeness of retrieved chunks, conflicting evidence, or cross-modal corroboration. BayesRAG injects a probabilistic evidence fusion step between retrieval and LLM generation, using Bayesian principles, Dempster–Shafer theory, and domain-informed priors to optimize the relevance and consistency of selected context. This yields statistically significant improvements in QA accuracy, retrieval recall, and the robustness of final responses, especially in document-heavy or multimodal settings (Rao, 2024, Li et al., 12 Jan 2026).

1. Bayesian Formulation and Motivation

In standard RAG, a user query $Q$ is used to retrieve the top-$k$ chunks from a corpus based on embedding similarity, with little post-processing beyond concatenation. The information content and trustworthiness of these chunks often vary, and conflicting or irrelevant passages can degrade LLM responses. BayesRAG reframes chunk selection as inference under Bayes’ theorem, modeling the posterior probability that a chunk (or tuple) will contribute to a high-quality answer. Specifically, for a candidate piece of evidence $E$ (a chunk or multimodal tuple):

$$P(\mathrm{quality}=1 \mid E) = \frac{P(E\mid \mathrm{quality}=1)\cdot P(\mathrm{quality}=1)}{P(E)}$$

In practice, $P(E)$ is assumed uniform across the top candidates and can be dropped for ranking. The score thus becomes:

$$\mathrm{score}(E) \propto \text{Likelihood}(E) \times \text{Prior}(E)$$

Likelihood represents how well $E$ semantically answers $Q$, and Prior reflects a domain-dependent confidence in $E$ before observing the query (Rao, 2024, Li et al., 12 Jan 2026).
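As a small worked illustration of the ranking rule, consider two candidates $E_1$ and $E_2$ for the same query. The labels and all numbers below are hypothetical and are not taken from the cited papers. If every top candidate shares the same marginal $P(E) = c > 0$, dividing by $c$ rescales all posteriors equally, so the ordering depends only on the product of likelihood and prior:

$$\mathrm{score}(E_1) = 0.6 \times 1.0 = 0.60, \qquad \mathrm{score}(E_2) = 0.9 \times 0.5 = 0.45$$

$$\frac{\mathrm{score}(E_1)/c}{\mathrm{score}(E_2)/c} = \frac{0.60}{0.45} > 1$$

Here $E_1$ outranks $E_2$ even though its likelihood is lower, because its prior is twice as large. This is the sense in which the prior can reorder candidates that pure similarity or relevance scoring would rank differently.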

2. Term Definitions and Evidence Modeling

Unimodal BayesRAG:

  • Prior $P(\mathrm{quality}=1)$:

Determined by the page number from which a text chunk is extracted. The assumption is that early document pages are more likely to contain summaries or direct answers. The schedule is linear decay for pages 1–10 ($1.0 \to 0.55$) and slow decay for subsequent pages to a floor of $0.1$ by page 60.

  • Likelihood $P(C\mid \mathrm{quality}=1)$:

Estimated by prompting the LLM with a zero-shot classification to rate chunk relevance on $[0.0, 1.0]$. The LLM’s internal classification head converts this into a normalized likelihood score; temperature is set to $0$ for deterministic output.

  • Deduplication:

Near-duplicate paragraphs (e.g., Jaccard overlap $>0.7$) are filtered to prevent redundancy.
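The three unimodal components above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the decay between pages 10 and 60 is assumed to be linear (the source only calls it "slow"), the LLM reply is assumed to contain a plain number, and Jaccard overlap is computed on lowercase word sets.

```python
import re


def page_prior(page: int) -> float:
    """Page-based prior: 1.0 at page 1, 0.55 at page 10, floor of 0.1 from page 60."""
    if page <= 1:
        return 1.0
    if page <= 10:
        # Linear decay from 1.0 (page 1) to 0.55 (page 10), as described in the text.
        return 1.0 - 0.45 * (page - 1) / 9
    if page < 60:
        # ASSUMPTION: the "slow decay" is linear from 0.55 (page 10) to 0.1 (page 60).
        return 0.55 - 0.45 * (page - 10) / 50
    return 0.1


def parse_rating(reply: str) -> float:
    """Extract the first number from an LLM reply and clamp it to [0.0, 1.0]."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        # ASSUMPTION: an unparseable reply is treated as "not relevant".
        return 0.0
    return min(1.0, max(0.0, float(match.group())))


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the lowercase word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def deduplicate(chunks: list[str], threshold: float = 0.7) -> list[str]:
    """Keep a chunk only if its overlap with every kept chunk is <= threshold."""
    kept: list[str] = []
    for chunk in chunks:
        if all(jaccard(chunk, other) <= threshold for other in kept):
            kept.append(chunk)
    return kept
```

Because `deduplicate` keeps the first copy it sees, passing chunks already sorted by posterior score would retain the highest-scoring member of each near-duplicate group.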

Multimodal BayesRAG:

  • Candidate Tuples:

Pair text chunks $t_i$ and images $v_j$, retrieved independently as top-$K_\text{text}$ and top-$K_\text{image}$.

  • Semantic Similarity:

Each modality provides a score $S_\text{text}, S_\text{image} \in [0,1]$ (normalized cosine similarity).

  • Likelihood via Dempster–Shafer Fusion:

Each score is converted into a basic probability assignment (mass function) over “relevant,” “irrelevant,” and “uncertainty.” These are fused using Dempster’s rule to penalize conflicting evidence and to model combined belief, with the final probability extracted through the pignistic transformation:

$$P(Q \mid E) \approx \text{BetP}_\text{final}(Y) = m_\text{final}(Y) + \frac{1}{2} m_\text{final}(\Omega)$$

Here $Y$ denotes the “relevant” hypothesis and $m_\text{final}(\Omega)$ is the fused mass left on “uncertainty.”

  • Consistency Prior $P(E)$:
    • Layout prior: $P(E) = 1.0$ if the text and image bounding boxes are on the same or adjacent pages and close to each other; otherwise $P(E) = \epsilon = 0.1$.
    • Graph prior: Build a multimodal knowledge graph and convert edge weights to probabilities, giving $P(E)$ as the edge probability between $t_i$ and $v_j$ (Li et al., 12 Jan 2026).
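The fusion and layout-prior steps above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the published implementation. The mapping from a similarity score to a mass function (a fixed `uncertainty` share, with the remainder split by the score) is an assumption, as are the `distance` and `max_distance` parameters used to decide whether two bounding boxes are "close". All names are hypothetical. Only Dempster's rule on the two-hypothesis frame and the pignistic formula follow directly from the text.

```python
from typing import NamedTuple


class Mass(NamedTuple):
    relevant: float    # m(Y)
    irrelevant: float  # m(N)
    uncertain: float   # m(Omega), mass on the whole frame


def score_to_mass(score: float, uncertainty: float = 0.2) -> Mass:
    # ASSUMPTION: reserve a fixed share for uncertainty and split the rest by the score.
    committed = 1.0 - uncertainty
    return Mass(committed * score, committed * (1.0 - score), uncertainty)


def dempster_combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule of combination on the frame {relevant, irrelevant}."""
    # Conflict: one source says relevant while the other says irrelevant.
    conflict = m1.relevant * m2.irrelevant + m1.irrelevant * m2.relevant
    if conflict >= 1.0:
        # The rule is undefined under total conflict.
        # ASSUMPTION: fall back to complete uncertainty in that case.
        return Mass(0.0, 0.0, 1.0)
    norm = 1.0 - conflict
    relevant = (m1.relevant * m2.relevant
                + m1.relevant * m2.uncertain
                + m1.uncertain * m2.relevant) / norm
    irrelevant = (m1.irrelevant * m2.irrelevant
                  + m1.irrelevant * m2.uncertain
                  + m1.uncertain * m2.irrelevant) / norm
    uncertain = (m1.uncertain * m2.uncertain) / norm
    return Mass(relevant, irrelevant, uncertain)


def pignistic_relevant(m: Mass) -> float:
    """BetP(Y) = m(Y) + m(Omega) / 2."""
    return m.relevant + 0.5 * m.uncertain


def layout_prior(text_page: int, image_page: int, distance: float,
                 max_distance: float = 200.0, eps: float = 0.1) -> float:
    """1.0 for same/adjacent pages with nearby boxes, otherwise eps."""
    # ASSUMPTION: "close" means a box distance of at most max_distance (arbitrary units).
    same_or_adjacent = abs(text_page - image_page) <= 1
    return 1.0 if same_or_adjacent and distance <= max_distance else eps


if __name__ == "__main__":
    fused = dempster_combine(score_to_mass(0.8), score_to_mass(0.6))
    likelihood = pignistic_relevant(fused)
    print(likelihood * layout_prior(text_page=3, image_page=4, distance=50.0))
```

Under this mass construction, two agreeing high scores reinforce each other, while a high text score paired with a low image score is pulled back toward 0.5, which is one way to read the "penalize conflicting evidence" behavior described above.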

3. Algorithmic Pipeline

The information flow in BayesRAG is outlined in the following sequence:

| Step | Brief Description |
|---|---|
| Vector Search / Retrieval | Retrieve initial set of top-$k$ (unimodal) or top-$K$ per modality |
| Compute Prior for Each Candidate | Unimodal: page-based; Multimodal: layout or graph topology |
| Compute Likelihood / Belief Mass | Unimodal: LLM rating; Multimodal: DS fusion of similarity scores |
| Posterior Score Calculation | Multiply likelihood and prior |
| Deduplicate (Unimodal) / Pairwise Fusion | Remove near duplicates or semantically distant pairs |
| Context Assembly | Select top-$n$ ranked by posterior score |
| LLM Generation | Prompt LLM with the pruned, ranked context |

This pipeline enhances standard RAG by inserting an explicit evidence quality filter before final generation (Rao, 2024, Li et al., 12 Jan 2026).
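A compact sketch of the unimodal flow, from scored candidates to an assembled prompt, might look like the following. It is a hedged illustration only: `Candidate`, `select_context`, `build_prompt`, and the two toy scoring functions are hypothetical names, retrieval is assumed to have already produced the candidate list, and the toy likelihood (query-word overlap) and toy prior (inverse page number) merely stand in for the LLM rating and the page schedule described earlier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    text: str
    page: int


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0


def select_context(
    query: str,
    candidates: list[Candidate],
    likelihood_fn: Callable[[str, Candidate], float],
    prior_fn: Callable[[Candidate], float],
    top_n: int = 3,
    dup_threshold: float = 0.7,
) -> list[tuple[float, Candidate]]:
    # Posterior score (up to a constant) = likelihood x prior.
    scored = [(likelihood_fn(query, c) * prior_fn(c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    selected: list[tuple[float, Candidate]] = []
    for score, cand in scored:
        # Skip near-duplicates of anything already selected.
        if any(jaccard(cand.text, kept.text) > dup_threshold for _, kept in selected):
            continue
        selected.append((score, cand))
        if len(selected) == top_n:
            break
    return selected


def build_prompt(query: str, selected: list[tuple[float, Candidate]]) -> str:
    """Assemble the pruned, ranked context into a single generation prompt."""
    context = "\n\n".join(cand.text for _, cand in selected)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def toy_likelihood(query: str, cand: Candidate) -> float:
    # STAND-IN for the LLM relevance rating: share of query words found in the chunk.
    q = set(query.lower().split())
    return len(q & set(cand.text.lower().split())) / len(q) if q else 0.0


def toy_prior(cand: Candidate) -> float:
    # STAND-IN for the page-based prior: earlier pages get more weight.
    return 1.0 / cand.page


if __name__ == "__main__":
    docs = [
        Candidate("revenue grew 12 percent in 2023", 2),
        Candidate("revenue grew 12 percent in 2023 overall", 40),
        Candidate("the office moved to Berlin", 1),
    ]
    question = "how much did revenue grow in 2023"
    picked = select_context(question, docs, toy_likelihood, toy_prior, top_n=2)
    print(build_prompt(question, picked))
```

Passing the scoring functions as arguments keeps the selection logic independent of how likelihood and prior are obtained, so the same loop could in principle be reused with the multimodal scores.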

4. Experimental Evaluation and Empirical Results

Unimodal Evaluation

  • Datasets:

Three proprietary enterprise domains (≈10M pages: financial, technical, policy); Wikipedia subset (≈2M pages) for ablation.

  • Baselines:

Standard RAG (vector-only), TF-IDF re-ranking, LLM re-ranking (likelihood only).

  • Metrics:

F1/EM on QA pairs, LLM-judge (GPT-4, 1–5 scale), human expert 3-point scale.

  • Results:

| Setting | Standard RAG | BayesRAG | Improvement |
|---|---|---|---|
| Wikipedia QA (EM) | 32 | 44 | +12 points |
| Wikipedia F1 | 48 | 63 | +15 points |
| Enterprise LLM Good Answer | 40% | 52% | +12 points (+30% relative) |
| Human-Correct QA | 22% | 47% | +25 points |

  • Ablations:

Prior-only: +5 F1; Likelihood-only: +8 F1; Both: +15 F1 (Rao, 2024).
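Since the reported gains mix absolute points with a relative percentage, a short calculation makes the two views explicit. The baseline and BayesRAG values below are copied from the results table; the helper name `improvement` is hypothetical.

```python
def improvement(baseline: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent of the baseline)."""
    absolute = new - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# (Standard RAG, BayesRAG) pairs from the unimodal results table.
results = {
    "Wikipedia QA (EM)": (32, 44),
    "Wikipedia F1": (48, 63),
    "Enterprise LLM Good Answer (%)": (40, 52),
    "Human-Correct QA (%)": (22, 47),
}

for name, (base, new) in results.items():
    absolute, relative = improvement(base, new)
    print(f"{name}: +{absolute} points absolute, +{relative:.1f}% relative")
```

For the enterprise row, 40% to 52% is a gain of 12 points, which is the 30% relative improvement quoted in the table.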

Multimodal Evaluation

  • Datasets:

DocBench (229 docs, 1,102 QA pairs), MMLongBench-Doc (135 docs, 1,082 questions).

  • Metrics:

Retrieval Recall@N, Rule-Acc, GPT-Score.

  • Results:

| Model/Setting | DocBench Overall (%) | MMLongBench-Doc Rule-Acc / GPT-Score |
|---|---|---|
| BayesRAG | 51.2 | 38.8 / 44.1 |
| RAGAnything | 48.7 | 37.1 / 41.5 |
| ViDoRAG | 43.5 | |

| Ablation setting | GPT-Score |
|---|---|
| Embedding-RAG (no prior, no Bayes) | 32.7 |
| Layout+DS | 36.8 |
| Graph+linear | 40.6 |
| Full BayesRAG | 44.1 |

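Relative to the embedding-only ablation, the layout prior (Layout+DS) adds 4.1 GPT-Score points (32.7 → 36.8) and linear fusion with the graph prior adds 7.9 points (32.7 → 40.6); replacing linear fusion with DS fusion under the graph prior adds a further 3.5 points (40.6 → 44.1). Retrieval Recall@1 increased from 24.4% to 27.5%, and Recall@20 from 56.6% to 76.6% (Li et al., 12 Jan 2026).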
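For readers unfamiliar with the retrieval metric, one common way to compute Recall@N is sketched below. The source does not spell out its exact definition, so this version (the share of gold evidence items found in the top N results, averaged over queries) is an assumption, and the function names and toy IDs are hypothetical.

```python
def recall_at_n(ranked_ids: list[str], gold_ids: list[str], n: int) -> float:
    """Share of gold evidence items that appear among the top-n ranked results."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(set(ranked_ids[:n]) & gold) / len(gold)


def mean_recall_at_n(runs: list[tuple[list[str], list[str]]], n: int) -> float:
    """Average Recall@n over (ranked_ids, gold_ids) pairs, one pair per query."""
    return sum(recall_at_n(ranked, gold, n) for ranked, gold in runs) / len(runs)


# Toy example: two queries with hypothetical evidence IDs.
runs = [
    (["a", "b", "c"], ["b"]),
    (["x", "y", "z"], ["z", "q"]),
]
print(mean_recall_at_n(runs, 1), mean_recall_at_n(runs, 3))
```

A large gap between Recall@1 and Recall@20, as in the figures above, indicates that relevant evidence is often retrieved but not ranked first, which is the situation a re-ranking step is meant to address.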

5. Limitations and Design Considerations

  • Uniform marginal assumption:

$P(C)$ or $P(E)$ is held uniform; this ignores possible chunk frequency or length biases. A fully normalized model could assign more complex weights, but empirical results suggest uniformity suffices for ranking the top-$k$ candidates.

  • Page-based Prior Limitations:

The hypothesis that early pages better answer queries is domain-dependent and may underperform in highly modular or reference-dense documents.

  • Reliance on LLM calibration:

The LLM’s zero-shot score for likelihood can be unstable; lightweight classifier head fine-tuning may yield more robust scoring.

  • Multimodal constraints:

Corroborative power drops in purely textual corpora or in continuous tables where structure is less relevant. Layout and graph priors contribute minimally for such data.

  • Computational Overhead:

Scoring a large $k$ with an LLM or performing mass fusion over many pairs introduces latency; surrogate classifiers or asynchronous retrieval/fusion are suggested extensions (Rao, 2024, Li et al., 12 Jan 2026).
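As one possible reading of the asynchronous extension mentioned under computational overhead, candidate scoring could be issued concurrently with a cap on in-flight requests. The sketch below is an assumption-laden illustration, not a method described in the cited papers: `score_all` and `fake_llm_score` are hypothetical names, and the fake scorer only simulates latency instead of calling a real model.

```python
import asyncio
from typing import Awaitable, Callable


async def score_all(
    candidates: list[str],
    score_fn: Callable[[str], Awaitable[float]],
    max_concurrency: int = 8,
) -> list[float]:
    """Score all candidates concurrently, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(chunk: str) -> float:
        async with semaphore:
            return await score_fn(chunk)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(c) for c in candidates))


async def fake_llm_score(chunk: str) -> float:
    # STAND-IN for a real LLM call: sleep briefly, then return a length-based score.
    await asyncio.sleep(0.01)
    return min(1.0, len(chunk) / 100)


if __name__ == "__main__":
    chunks = ["short chunk", "a somewhat longer chunk of retrieved text"]
    print(asyncio.run(score_all(chunks, fake_llm_score)))
```

With this pattern, total latency is governed by the number of batches of `max_concurrency` calls rather than by $k$ sequential calls, at the cost of higher peak load on the scoring service.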

6. Extensions and Future Directions

  • Learning dynamic or dataset-specific priors, potentially per document type or user feedback loops.
  • End-to-end fine-tuning of Bayesian fusion weights and Dempster–Shafer masses.
  • Incorporating additional modalities (e.g., tables, audio).
  • Joint Bayesian inference over context sets, accounting for mutual information and downstream impact on LLM response.
  • Closing the retrieval–generation gap by aligning BayesRAG retrieval with next-generation multimodal LLMs, and integrating approximate/asynchronous fusion for low-latency production QA (Rao, 2024, Li et al., 12 Jan 2026).
