BayesRAG: Bayesian Enhancements in RAG
- BayesRAG is a Bayesian inference-based enhancement to Retrieval-Augmented Generation that integrates probabilistic evidence fusion to improve quality and consistency.
- It leverages domain-informed priors, LLM-derived likelihoods, and Dempster–Shafer theory to assess and select the most informative context over multiple modalities.
- Experimental evaluations reveal significant gains in QA accuracy, retrieval recall, and robustness across unimodal and multimodal document settings.
BayesRAG refers to a family of Bayesian inference-based enhancements to Retrieval-Augmented Generation (RAG) architectures for both unimodal (text) and multimodal (text-image) document question answering. Classical RAG relies primarily on vector similarity retrieval to select context chunks, but this approach does not account for the variable informativeness of retrieved chunks, conflicting evidence, or cross-modal corroboration. BayesRAG injects a probabilistic evidence fusion step between retrieval and LLM generation, using Bayesian principles, Dempster–Shafer theory, and domain-informed priors to optimize the relevance and consistency of selected context. This yields statistically significant improvements in QA accuracy, retrieval recall, and the robustness of final responses, especially in document-heavy or multimodal settings (Rao, 2024, Li et al., 12 Jan 2026).
1. Bayesian Formulation and Motivation
In standard RAG, a user query is used to retrieve the top-$k$ chunks from a corpus based on embedding similarity, with little post-processing beyond concatenation. The information content and trustworthiness of these chunks often vary, and conflicting or irrelevant passages can degrade LLM responses. BayesRAG reframes chunk selection as inference under Bayes’ theorem, modeling the posterior probability that a chunk (or tuple) will contribute to a high-quality answer. Specifically, for a candidate evidence item $e$ (chunk or multimodal tuple) and a query $q$:
$$P(e \mid q) = \frac{P(q \mid e)\, P(e)}{P(q)}$$
In practice, $P(q)$ is assumed uniform across top candidates and can be dropped for ranking. The score thus becomes:
$$\mathrm{Score}(e) \propto P(q \mid e)\, P(e)$$
The likelihood $P(q \mid e)$ represents how well $e$ semantically answers $q$, and the prior $P(e)$ reflects a domain-dependent confidence in $e$ before observing the query (Rao, 2024, Li et al., 12 Jan 2026).
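The ranking rule described above (likelihood times prior, with the uniform marginal dropped) can be illustrated with a minimal sketch. The `Candidate` container, the function names, and the numeric values are hypothetical and chosen only for illustration; they are not taken from the cited papers.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical container for a retrieved chunk and its two factors."""
    text: str
    likelihood: float  # how well the chunk answers the query
    prior: float       # confidence in the chunk before seeing the query


def posterior_score(c: Candidate) -> float:
    """Unnormalized posterior: likelihood * prior (marginal dropped)."""
    return c.likelihood * c.prior


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Sort candidates by descending unnormalized posterior."""
    return sorted(candidates, key=posterior_score, reverse=True)


if __name__ == "__main__":
    # Illustrative values only: a strong prior can outweigh a
    # slightly higher likelihood.
    pool = [
        Candidate("appendix note", likelihood=0.9, prior=0.2),
        Candidate("executive summary", likelihood=0.7, prior=0.9),
    ]
    for c in rank(pool):
        print(f"{c.text}: {posterior_score(c):.2f}")
```

Because only the ordering matters for context selection, the scores are left unnormalized in this sketch.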
2. Term Definitions and Evidence Modeling
Unimodal BayesRAG:
- Prior:
Determined by the page number from which a text chunk is extracted, on the assumption that early document pages are more likely to contain summaries or direct answers. The schedule decays linearly over pages 1–10, then more slowly over subsequent pages until it reaches a floor of $0.1$ by page 60.
- Likelihood:
Estimated by prompting the LLM with a zero-shot classification task that rates each chunk’s relevance to the query. The LLM’s internal classification head converts this rating into a normalized likelihood score; temperature is set to $0$ for deterministic output.
- Deduplication:
Near-duplicate paragraphs (e.g., those whose Jaccard overlap exceeds a threshold) are filtered to prevent redundancy.
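A rough sketch of the page-based prior and the Jaccard deduplication step might look as follows. Only the two-phase shape and the $0.1$ floor at page 60 come from the description above; the breakpoint values (1.0 at page 1, 0.5 at page 10), the linear form of the second phase, the word-level tokenization, and the 0.8 duplicate threshold are all assumptions made for illustration.

```python
def page_prior(page: int) -> float:
    """Two-phase page prior.

    ASSUMED values: 1.0 at page 1, 0.5 at page 10, linear second phase.
    From the text: faster decay over pages 1-10, floor of 0.1 by page 60.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page <= 10:
        return 1.0 - 0.5 * (page - 1) / 9    # faster linear decay
    if page < 60:
        return 0.5 - 0.4 * (page - 10) / 50  # slower decay toward the floor
    return 0.1                               # floor from page 60 onward


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two paragraphs."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def deduplicate(chunks: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a chunk only if it is not a near-duplicate of one already kept.

    The threshold of 0.8 is an assumption, not a value from the source.
    """
    kept: list[str] = []
    for chunk in chunks:
        if all(jaccard(chunk, k) < threshold for k in kept):
            kept.append(chunk)
    return kept
```

Passing the chunks in descending score order means that, among near-duplicates, the highest-scoring copy is the one retained.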
Multimodal BayesRAG:
- Candidate Tuples:
Pair text chunks with images, where each modality is retrieved independently as its own top-$k$ list.
- Semantic Similarity:
Each modality provides a score (normalized cosine similarity).
- Likelihood via Dempster–Shafer Fusion:
Each score is converted into a basic probability assignment (mass function) over “relevant,” “irrelevant,” and “uncertainty.” These are fused using Dempster’s rule to penalize conflicting evidence and to model combined belief, with the final probability extracted through the pignistic transformation, which splits the fused uncertainty mass equally between the two outcomes:
$$\mathrm{BetP}(\text{relevant}) = m(\text{relevant}) + \tfrac{1}{2}\, m(\text{uncertainty})$$
- Consistency Prior:
- Layout prior: takes a higher value if the text and image bounding boxes lie on the same or adjacent pages and are spatially close, and a lower value otherwise.
- Graph prior: build a multimodal knowledge graph and convert edge weights to probabilities, taking the prior for a text–image pair as the edge probability between the two nodes (Li et al., 12 Jan 2026).
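The Dempster–Shafer fusion step can be sketched for the two-outcome frame {relevant, irrelevant}. The mapping from a similarity score to a mass function (a fixed `reliability` discount of 0.8) is an assumption made for illustration, as are all names below; the combination rule and the pignistic transformation are the standard textbook forms, not necessarily the exact variant used by the authors.

```python
from typing import NamedTuple


class Mass(NamedTuple):
    relevant: float
    irrelevant: float
    uncertain: float  # mass on the whole frame {relevant, irrelevant}


def score_to_mass(score: float, reliability: float = 0.8) -> Mass:
    """ASSUMED mapping: discount a [0, 1] score by a source reliability."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return Mass(reliability * score,
                reliability * (1.0 - score),
                1.0 - reliability)


def dempster_combine(a: Mass, b: Mass) -> Mass:
    """Dempster's rule of combination on a binary frame."""
    # Conflict: one source says relevant while the other says irrelevant.
    conflict = a.relevant * b.irrelevant + a.irrelevant * b.relevant
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    norm = 1.0 - conflict
    rel = (a.relevant * b.relevant
           + a.relevant * b.uncertain
           + a.uncertain * b.relevant) / norm
    irr = (a.irrelevant * b.irrelevant
           + a.irrelevant * b.uncertain
           + a.uncertain * b.irrelevant) / norm
    unc = a.uncertain * b.uncertain / norm
    return Mass(rel, irr, unc)


def pignistic_relevance(m: Mass) -> float:
    """Split the uncertainty mass equally between the two outcomes."""
    return m.relevant + m.uncertain / 2.0


if __name__ == "__main__":
    agree = dempster_combine(score_to_mass(0.9), score_to_mass(0.9))
    clash = dempster_combine(score_to_mass(0.9), score_to_mass(0.1))
    print(pignistic_relevance(agree), pignistic_relevance(clash))
```

In this sketch, two agreeing modalities reinforce each other beyond what either reports alone, while symmetric disagreement pulls the fused relevance back to one half.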
3. Algorithmic Pipeline
The information flow in BayesRAG is outlined in the following sequence:
| Step | Brief Description |
|---|---|
| Vector Search / Retrieval | Retrieve the initial top-$k$ candidates (unimodal) or the top-$k$ per modality (multimodal) |
| Compute Prior for Each Candidate | Unimodal: page-based; Multimodal: layout or graph topology |
| Compute Likelihood / Belief Mass | Unimodal: LLM rating; Multimodal: DS fusion of similarity scores |
| Posterior Score Calculation | Multiply likelihood and prior |
| Deduplicate (Unimodal) / Pairwise Fusion | Remove near duplicates or semantically distant pairs |
| Context Assembly | Select the top-ranked candidates by posterior score |
| LLM Generation | Prompt LLM with the pruned, ranked context |
This pipeline enhances standard RAG by inserting an explicit evidence quality filter before final generation (Rao, 2024, Li et al., 12 Jan 2026).
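The stages in the table can be wired together as a small orchestration function. This is a sketch under stated assumptions: the prior, likelihood, and duplicate checks are passed in as callables because their concrete forms differ between the unimodal and multimodal variants, and every name here (`bayes_rerank`, `build_prompt`, `top_n`) is hypothetical.

```python
from typing import Callable, Sequence


def bayes_rerank(
    query: str,
    candidates: Sequence[str],
    prior_fn: Callable[[str], float],
    likelihood_fn: Callable[[str, str], float],
    is_duplicate: Callable[[str, str], bool],
    top_n: int = 3,
) -> list[str]:
    """Score retrieved candidates, drop near-duplicates, keep the best top_n."""
    if top_n <= 0:
        return []
    # Posterior score = likelihood * prior for each retrieved candidate.
    scored = [(likelihood_fn(query, c) * prior_fn(c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    context: list[str] = []
    for _, cand in scored:
        # Skip candidates that duplicate something already selected.
        if any(is_duplicate(cand, kept) for kept in context):
            continue
        context.append(cand)
        if len(context) == top_n:
            break
    return context


def build_prompt(query: str, context: Sequence[str]) -> str:
    """Assemble the pruned, ranked context into a generation prompt."""
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return f"Context:\n{numbered}\n\nQuestion: {query}\nAnswer:"
```

Keeping retrieval and generation outside these functions makes the filter a drop-in step for an existing RAG stack, which matches the pipeline's positioning between the two.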
4. Experimental Evaluation and Empirical Results
Unimodal Evaluation
- Datasets:
Three proprietary enterprise domains (≈10M pages: financial, technical, policy); Wikipedia subset (≈2M pages) for ablation.
- Baselines:
Standard RAG (vector-only), TF-IDF re-ranking, LLM re-ranking (likelihood only).
- Metrics:
F1/EM on QA pairs, LLM-judge rating (GPT-4, 1–5 scale), human expert rating (3-point scale).
- Results:
| Setting | Standard RAG | BayesRAG | Improvement |
|---|---|---|---|
| Wikipedia QA (EM) | 32 | 44 | +12 points |
| Wikipedia F1 | 48 | 63 | +15 points |
| Enterprise LLM Good Answer | 40% | 52% | +12 points (+30% relative) |
| Human-Correct QA | 22% | 47% | +25 points |
- Ablations:
Prior-only: +5 F1; Likelihood-only: +8 F1; Both: +15 F1 (Rao, 2024).
Multimodal Evaluation
- Datasets:
DocBench (229 docs, 1,102 QA), MMLongBench-Doc (135 docs, 1,082 Q).
- Metrics:
Retrieval Recall@N, Rule-Acc, GPT-Score.
- Results:
| Model/Setting | DocBench Overall (%) | MMLongBench-Doc Rule-Acc / GPT-Score |
|---|---|---|
| BayesRAG | 51.2 | 38.8 / 44.1 |
| RAGAnything | 48.7 | 37.1 / 41.5 |
| ViDoRAG | 43.5 | — |
| Embedding-RAG (ablation; no prior, no Bayes) | — | GPT-Score 32.7 |
| Layout+DS (ablation) | — | GPT-Score 36.8 |
| Graph+linear (ablation) | — | GPT-Score 40.6 |
| Full BayesRAG | — | GPT-Score 44.1 |
Layout prior alone yields +4.1 GPT-Score points over the embedding-only baseline (32.7 to 36.8), linear fusion with the graph prior yields +7.9 points (to 40.6), and DS fusion with the graph prior adds a further +3.5 points (to 44.1). Retrieval Recall@1 increased from 24.4% to 27.5%, and Recall@20 from 56.6% to 76.6% (Li et al., 12 Jan 2026).
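For reference, Recall@N as reported above is commonly computed as the fraction of gold evidence items that appear among the top N retrieved results. The sketch below uses that common definition; whether the benchmarks apply exactly this form (for example, per-query averaging or page-level matching) is an assumption, and the identifiers are hypothetical.

```python
from typing import Iterable, Sequence


def recall_at_n(ranked_ids: Sequence[str], relevant_ids: Iterable[str], n: int) -> float:
    """Fraction of relevant items found in the top-n ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("at least one relevant item is required")
    top = set(ranked_ids[:n])
    return len(top & relevant) / len(relevant)


def mean_recall_at_n(runs: Sequence[tuple[Sequence[str], Iterable[str]]], n: int) -> float:
    """Average Recall@N over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(recall_at_n(r, g, n) for r, g in runs) / len(runs)
```

A larger gain at N=20 than at N=1, as in the figures above, indicates that more relevant evidence enters the candidate pool even when it is not ranked first.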
5. Limitations and Design Considerations
- Uniform marginal assumption:
The marginal term is held uniform; this ignores possible chunk frequency or length biases. A fully normalized model could assign more complex weights, but empirical results suggest uniformity suffices for ranking the top candidates.
- Page-based Prior Limitations:
The hypothesis that early pages better answer queries is domain-dependent and may underperform in highly modular or reference-dense documents.
- Reliance on LLM calibration:
The LLM’s zero-shot score for likelihood can be unstable; lightweight classifier head fine-tuning may yield more robust scoring.
- Multimodal constraints:
Corroborative power drops in purely textual corpora or in continuous tables where structure is less relevant. Layout and graph priors contribute minimally for such data.
- Computational Overhead:
Scoring a large candidate set with an LLM or performing mass fusion over many pairs introduces latency; surrogate classifiers or asynchronous retrieval/fusion are suggested extensions (Rao, 2024, Li et al., 12 Jan 2026).
6. Extensions and Future Directions
- Learning dynamic or dataset-specific priors, potentially per document type or user feedback loops.
- End-to-end fine-tuning of Bayesian fusion weights and Dempster–Shafer masses.
- Incorporating additional modalities (e.g., tables, audio).
- Joint Bayesian inference over context sets, accounting for mutual information and downstream impact on LLM response.
- Closing the retrieval–generation gap by aligning BayesRAG retrieval with next-generation multimodal LLMs, and integrating approximate/asynchronous fusion for low-latency production QA (Rao, 2024, Li et al., 12 Jan 2026).