
BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models (2404.12494v3)

Published 18 Apr 2024 in cs.CL

Abstract: Predictive models often need to work with incomplete information in real-world tasks. Consequently, they must provide reliable probability or confidence estimation, especially in large-scale decision-making and planning tasks. Current LLMs are insufficient for accurate estimations, but they can generate relevant factors that may affect the probabilities, produce coarse-grained probabilities when the information is more complete, and help determine which factors are relevant to specific downstream contexts. In this paper, we make use of these capabilities of LLMs to provide a significantly more accurate probabilistic estimation. We propose BIRD, a novel probabilistic inference framework that aligns a Bayesian network with LLM abductions and then estimates more accurate probabilities in a deduction step. We show BIRD provides reliable probability estimations that are 30% better than those provided directly by LLM baselines. These estimates further contribute to better and more trustworthy decision making.


Summary

  • The paper introduces BIRD, a novel Bayesian framework that enhances LLM decision-making through integrated abductive reasoning and deductive modeling.
  • It employs a three-stage process involving factor generation, entailment mapping, and probabilistic modeling to achieve superior performance and interpretability.
  • Experimental results reveal a 35% improvement over GPT-4 in preference evaluations and a 1.3% average gain in cross-domain performance.

BIRD: A Trustworthy Bayesian Inference Framework for LLMs

The paper introduces BIRD (Bayesian Inference from Abduction and Deduction), a novel Bayesian inference framework designed to enhance the reliability of LLM decision-making by integrating abductive reasoning, LLM entailment, and deductive Bayesian modeling. The framework addresses the limitations of LLMs, which primarily rely on inductive reasoning, often leading to unreliable decisions in real-world scenarios with incomplete information. BIRD aims to provide controllable and interpretable probability estimations for model decisions, thereby improving their trustworthiness.

Core Components of BIRD

Figure 1: Two examples of temporal reasoning and planning. GPT-4 estimates the same probabilities under two different conditions in both examples, while BIRD successfully distinguishes them and can thus help the user make a better-informed decision.

BIRD operates through three main stages: abductive factor generation, LLM entailment for context-factor mapping, and deductive Bayesian probabilistic modeling.

  1. Abductive Factor Generation: LLMs conceptualize the input query into relevant factors, creating an intermediate symbolic representation. This involves generating sentences that describe situations increasing the likelihood of different outcomes and summarizing these sentences into structured factors with corresponding values.
  2. LLM Entailment: LLMs map the given context, including a scenario and additional conditions, to the factors identified in the previous stage. This process uses LLM entailment to determine which factors are implied by the provided information, ensuring consistent mapping to the same factor structure.
  3. Deductive Bayesian Probabilistic Modeling: An external, learnable text-based Bayesian model is employed to align LLM decisions and estimate outcome probabilities based on the identified factors. This model uses the law of total probability to separate world modeling from observations, improving the reliability of the probability estimations (a code sketch of these stages follows Figure 2).

Figure 2: An overview of BIRD. Given a scenario, abductive factor generation is conducted first, followed by LLM classification for factor-outcome mapping; factor values in blue support outcome 1 and those in yellow support outcome 2. The Bayesian model is learned from the LLM's coarse classification decisions to enable better alignment. During inference, given an additional condition, LLM entailment performs context-factor mapping, and probabilities are then estimated through the trained Bayesian model over the complete information space.
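
To make the control flow of the first two stages concrete, here is a minimal Python sketch. The prompt wordings, the `call_llm` stub, and the `Factor` data structure are illustrative assumptions rather than the paper's actual prompts or implementation; the deduction step is illustrated after the mathematical formulation below.

```python
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Hypothetical LLM wrapper; plug in any chat-completion client."""
    raise NotImplementedError

@dataclass
class Factor:
    name: str                  # e.g. "time available before the deadline"
    values: tuple[str, ...]    # e.g. ("plenty", "limited", "none")

def abductive_factor_generation(scenario: str) -> list[Factor]:
    """Stage 1: elicit situations favoring each outcome, then have the
    LLM summarize them into structured factors with values."""
    situations = call_llm(
        f"For the scenario '{scenario}', describe situations that make "
        "each outcome more likely.")
    structured = call_llm(
        "Summarize these situations into factors, one per line, formatted "
        f"as 'factor: value1 | value2 | ...':\n{situations}")
    factors = []
    for line in structured.splitlines():
        name, _, vals = line.partition(":")
        if vals:
            factors.append(Factor(name.strip(),
                                  tuple(v.strip() for v in vals.split("|"))))
    return factors

def entail_context(context: str, factors: list[Factor]) -> dict[str, str]:
    """Stage 2: LLM entailment maps the context (scenario + condition) to
    factor values; factors the context says nothing about stay unobserved."""
    observed: dict[str, str] = {}
    for factor in factors:
        for value in factor.values:
            answer = call_llm(
                f"Does the context '{context}' entail that {factor.name} "
                f"is '{value}'? Answer yes or no.")
            if answer.strip().lower().startswith("yes"):
                observed[factor.name] = value
                break
    return observed
```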

Mathematical Formulation

The framework's mathematical formulation centers on estimating the probability of an outcome $O_i$ given a context $C$, which comprises a scenario $S$ and an additional condition $U$. Instead of direct induction ($X \rightarrow Y$), BIRD employs abduction ($X \rightarrow Z$) to conceptualize input queries into intermediate factors, followed by deduction ($X, Z \rightarrow Y$) to fit a Bayesian model.

The predictive probability is derived by marginalizing over the complete information space $\mathcal{F}$:

$$\mathbb{P}(O_i \mid C) = \sum_{f \in \mathcal{F}} \mathbb{P}(O_i \mid f)\,\mathbb{P}(f \mid C)$$

where $f$ denotes a specific instance in the information space, and $\mathbb{P}(O_i \mid f)$ and $\mathbb{P}(f \mid C)$ denote the world preferences and the observations, respectively.
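
As a worked illustration of this total-probability deduction, the sketch below evaluates the sum over a toy information space built from two factors. The factor names and the numeric world preferences $\mathbb{P}(O_i \mid f)$ are invented for illustration; in BIRD they come from the learned Bayesian model and the entailment step, respectively.

```python
import itertools

# Toy information space F: each combination of factor values is one instance f.
factors = {
    "time_available": ("plenty", "limited"),
    "weather": ("clear", "stormy"),
}

# Illustrative world preferences P(O1 | f) for outcome 1 (made-up numbers).
world_pref = {
    ("plenty", "clear"): 0.9,
    ("plenty", "stormy"): 0.6,
    ("limited", "clear"): 0.5,
    ("limited", "stormy"): 0.2,
}

def p_outcome_given_context(p_f_given_c: dict) -> float:
    """P(O1 | C) = sum over f of P(O1 | f) * P(f | C)."""
    return sum(world_pref[f] * p_f_given_c.get(f, 0.0)
               for f in itertools.product(*factors.values()))

# The context entails weather = stormy but leaves time_available unobserved,
# so P(f | C) spreads its mass over the instances consistent with the context.
p_f_given_c = {("plenty", "stormy"): 0.5, ("limited", "stormy"): 0.5}
print(p_outcome_given_context(p_f_given_c))  # 0.5*0.6 + 0.5*0.2 = 0.4
```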

Experimental Results

The effectiveness of BIRD was evaluated on three reasoning and planning datasets: Com2Sense (commonsense reasoning), a temporal reasoning dataset, and PlaSMa (procedural planning). Experiments were conducted using the Llama-2-70b-instruct model. The framework's performance was assessed intrinsically, through the reliability of its estimated probabilities and its ability to make hard-label decisions, and extrinsically, via its utility in generating training signals and follow-up questions.

BIRD demonstrated superior alignment with human preferences, achieving an F1 score of 65%, surpassing GPT-4 by 35% in preference-based pairwise evaluations. The framework also showed comparable performance to chain-of-thought (CoT) methods in decision-making tasks, while providing enhanced interpretability and controllability. Furthermore, BIRD's probability estimations served as reliable training signals, leading to a 1.3% average performance increase on cross-domain datasets when fine-tuning a T5-large model. The framework also facilitated the generation of more humanly preferable follow-up questions, indicating its potential for interactive agent systems.

Ablation Studies

Ablation studies were conducted to assess the impact of different components within the BIRD framework. The results indicated that the proposed abductive sampling method outperformed direct factor generation, improving accuracy by 4.4% and reducing the unknown answer rate by 16.9%. Additionally, the learnable Bayesian modeling component, particularly when trained with reliable signals from LLMs, contributed positively to uncertainty calibration, enhancing the overall performance.
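
The paper describes the Bayesian modeling as learnable, trained on the LLM's coarse classification decisions. One plausible reading, sketched below under that assumption, parameterizes the world preferences $\mathbb{P}(O_i \mid f)$ as a logistic model over factor-value indicator features and fits it to those coarse decisions; this is an illustrative reconstruction, not the paper's exact training algorithm.

```python
import numpy as np

def fit_world_preferences(X, y, lr=0.1, epochs=500):
    """Fit P(O1 | f) = sigmoid(w . x_f), where x_f one-hot encodes the factor
    values of instance f, against the LLM's coarse decisions y (1 if the LLM
    preferred outcome 1). Plain logistic regression; illustrative only."""
    X = np.asarray(X, dtype=float)   # shape (n, d): indicator features
    y = np.asarray(y, dtype=float)   # shape (n,): coarse LLM decisions
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted P(O1 | f)
        w -= lr * X.T @ (p - y) / len(y)      # gradient step on log-loss
    return w

# Two binary factors -> four indicator columns (hypothetical training data).
X = [[1, 0, 1, 0],   # time=plenty,  weather=clear
     [1, 0, 0, 1],   # time=plenty,  weather=stormy
     [0, 1, 1, 0],   # time=limited, weather=clear
     [0, 1, 0, 1]]   # time=limited, weather=stormy
y = [1, 1, 1, 0]     # hypothetical coarse LLM decisions
w = fit_world_preferences(X, y)
```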

Implications and Future Directions

BIRD offers a significant advancement in enhancing the trustworthiness and reliability of LLM decision-making. By integrating abductive and deductive reasoning, the framework provides interpretable and controllable probability estimations, addressing the limitations of purely inductive approaches. The results suggest that BIRD can be effectively applied in real-world applications requiring robust decision-making under uncertainty. Future research directions include exploring improved factor generation techniques, expanding the framework to handle more complex scenarios, and integrating BIRD into interactive agent systems for more efficient and controlled decision-making processes.

Conclusion

The BIRD framework presents a valuable approach for improving the reliability and trustworthiness of LLMs in decision-making contexts. By combining abductive factor generation, LLM entailment, and deductive Bayesian modeling, BIRD achieves state-of-the-art performance in aligning with human judgments and generating reliable probability estimations. The framework's modular design and promising experimental results highlight its potential for broader applications in AI and NLP.
