
On the Importance of Uncertainty in Decision-Making with Large Language Models

Published 3 Apr 2024 in cs.LG | (2404.02649v2)

Abstract: We investigate the role of uncertainty in decision-making problems with natural language as input. For such tasks, using LLMs as agents has become the norm. However, none of the recent approaches employ any additional phase for estimating the uncertainty the agent has about the world during the decision-making task. We focus on a fundamental decision-making framework with natural language as input, which is the one of contextual bandits, where the context information consists of text. As a representative of the approaches with no uncertainty estimation, we consider an LLM bandit with a greedy policy, which picks the action corresponding to the largest predicted reward. We compare this baseline to LLM bandits that make active use of uncertainty estimation by integrating the uncertainty in a Thompson Sampling policy. We employ different techniques for uncertainty estimation, such as Laplace Approximation, Dropout, and Epinets. We empirically show on real-world data that the greedy policy performs worse than the Thompson Sampling policies. These findings suggest that, while overlooked in the LLM literature, uncertainty plays a fundamental role in bandit tasks with LLMs.


Summary

  • The paper demonstrates that leveraging epistemic uncertainty via Thompson Sampling in LLM-based contextual bandits significantly reduces cumulative regret.
  • It adapts scalable uncertainty estimation techniques such as Dropout, Laplace Approximation, and Epinets for effective decision-making.
  • Empirical results validate that TS-based policies achieve balanced exploration and enhanced performance over greedy methods in real-world tasks.

Uncertainty Estimation in LLM-Based Contextual Bandits: Empirical and Algorithmic Insights

Introduction

This paper rigorously investigates the role of epistemic uncertainty in decision-making tasks where natural language is the input, specifically focusing on contextual bandit problems solved with LLMs. The authors contrast the standard greedy approach—selecting actions based solely on the highest predicted reward—with policies that actively estimate and leverage uncertainty via Thompson Sampling (TS). They adapt scalable uncertainty estimation techniques, including Dropout, Laplace Approximation (LA), and Epinets, to the LLM setting and empirically demonstrate that TS-based policies consistently outperform greedy baselines in real-world bandit tasks. The work provides both algorithmic adaptations and strong empirical evidence for the necessity of uncertainty modeling in LLM-driven decision-making.

Contextual Bandit Formulation with LLMs

The contextual bandit framework considered involves sequential decision-making over batches of text contexts, with the agent selecting actions and receiving rewards. Regret is defined as the difference between the cumulative reward of an optimal policy and that of the learned policy, normalized by the number of time steps. The agent's goal is to minimize this regret.
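As a rough illustration of this definition, the sketch below computes an average regret from two reward sequences. The function name `average_regret` and the toy reward values are assumptions made for this example; the paper's exact definition (for instance, whether it uses expected or realized rewards) is not reproduced here.

```python
from typing import Sequence


def average_regret(optimal_rewards: Sequence[float],
                   received_rewards: Sequence[float]) -> float:
    """Mean per-step gap between the optimal policy's reward and the agent's reward."""
    if len(optimal_rewards) != len(received_rewards):
        raise ValueError("reward sequences must have the same length")
    if not optimal_rewards:
        raise ValueError("need at least one time step")
    # Cumulative regret: optimal reward minus received reward, summed over steps.
    cumulative = sum(o - r for o, r in zip(optimal_rewards, received_rewards))
    # Normalize by the number of time steps.
    return cumulative / len(optimal_rewards)


# Toy values (hypothetical): the agent misses the optimal reward at one of four steps.
optimal = [1.0, 1.0, 0.0, 1.0]
received = [1.0, 0.0, 0.0, 1.0]
avg = average_regret(optimal, received)
```

With these toy values the agent loses one unit of reward over four steps, so the average regret is 0.25.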

LLMs are used as feature extractors, with a linear regression head predicting expected rewards for each action. The model is fine-tuned on observed data, with the pre-trained weights serving as the prior for regularization. This setup enables leveraging the representational power of LLMs while adapting them for reward prediction in bandit settings.
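One plausible reading of "pre-trained weights serving as the prior" is a squared-distance penalty that pulls the fine-tuned parameters back toward their pre-trained values. The sketch below shows that reading on plain Python lists; the name `regularized_mse`, the `reg_strength` parameter, and the exact form of the penalty are assumptions for illustration, not the paper's stated loss.

```python
from typing import Sequence


def regularized_mse(predictions: Sequence[float],
                    rewards: Sequence[float],
                    params: Sequence[float],
                    prior_params: Sequence[float],
                    reg_strength: float) -> float:
    """MSE on observed rewards plus an L2 penalty toward the pre-trained weights."""
    # Data-fit term: mean squared error between predicted and observed rewards.
    mse = sum((p - r) ** 2 for p, r in zip(predictions, rewards)) / len(predictions)
    # Prior term: squared distance from the pre-trained (prior) parameters.
    penalty = sum((w - w0) ** 2 for w, w0 in zip(params, prior_params))
    return mse + reg_strength * penalty


# Toy values (hypothetical) to show how the two terms combine.
loss = regularized_mse(
    predictions=[1.0, 2.0],
    rewards=[0.0, 2.0],
    params=[1.0, 0.0],
    prior_params=[0.0, 0.0],
    reg_strength=0.1,
)
```

Here the data-fit term is 0.5 and the penalty is 1.0, so the loss is 0.5 + 0.1 × 1.0 = 0.6. A larger `reg_strength` keeps the model closer to its pre-trained weights.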

Uncertainty Types and Their Relevance

The paper distinguishes between aleatoric uncertainty (irreducible noise in the data) and epistemic uncertainty (uncertainty in model parameters due to limited data). While aleatoric uncertainty is modeled as Gaussian noise in the reward, epistemic uncertainty is critical for exploration in bandit problems. Maintaining a posterior over model parameters allows the agent to quantify what it does not know, which is essential for balancing exploration and exploitation.

Decision Policies: Greedy vs. Thompson Sampling

The greedy policy selects actions with the highest predicted reward, updating model parameters via regularized MSE loss after each batch. This approach ignores uncertainty and is prone to suboptimal exploration, often leading to high variance and persistent selection of suboptimal actions.

Thompson Sampling, in contrast, samples model parameters from the posterior distribution at each decision point, selecting actions according to the sampled model. This probabilistic approach naturally balances exploration and exploitation, with the posterior concentrating as more data is observed.
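The contrast between the two policies can be sketched on a toy linear reward head. The example assumes an independent Gaussian posterior over each action's weights, which is a simplification chosen for readability; the helper names, the feature vector, and the numeric values are all hypothetical.

```python
import random
from typing import List, Sequence


def predicted_reward(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear reward prediction for one action."""
    return sum(w * x for w, x in zip(weights, features))


def greedy_action(weight_means: Sequence[Sequence[float]],
                  features: Sequence[float]) -> int:
    """Pick the action whose point-estimate reward is largest."""
    scores = [predicted_reward(w, features) for w in weight_means]
    return max(range(len(scores)), key=lambda a: scores[a])


def thompson_action(weight_means: Sequence[Sequence[float]],
                    weight_variances: Sequence[Sequence[float]],
                    features: Sequence[float],
                    rng: random.Random) -> int:
    """Sample one set of weights from the posterior, then act greedily on the sample."""
    sampled: List[List[float]] = [
        [rng.gauss(m, v ** 0.5) for m, v in zip(means_a, vars_a)]
        for means_a, vars_a in zip(weight_means, weight_variances)
    ]
    return greedy_action(sampled, features)


rng = random.Random(0)
features = [1.0, 1.0]                       # stand-in for LLM features of one text
weight_means = [[1.0, 0.0], [0.0, 0.5]]     # posterior means for two actions
weight_variances = [[1.0, 1.0], [1.0, 1.0]]  # wide posterior: little data seen so far

greedy_choice = greedy_action(weight_means, features)
ts_choices = [thompson_action(weight_means, weight_variances, features, rng)
              for _ in range(200)]
```

The greedy policy always returns the same action for this context, while Thompson Sampling picks both actions across repeated draws as long as the posterior is wide. As the variances shrink toward zero, the two policies coincide.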

Scalable Epistemic Uncertainty Estimation for LLMs

Given the computational constraints of LLMs, the paper adapts three scalable uncertainty estimation techniques:

  • Dropout: Applied during both training and action selection, dropout enables approximate posterior sampling without additional memory or training overhead. The dropout rate used during LLM pre-training is reused, and ablation studies confirm its effectiveness.
  • Laplace Approximation (LA): The posterior over parameters is approximated as a Gaussian centered at the MAP estimate, with covariance given by the inverse Hessian of the loss. To address computational infeasibility, the authors employ recursive Hessian updates, diagonal approximations, and Fisher matrix approximations. Additionally, a last-layer LA variant computes the full Hessian only for the final regression layer, reducing memory and computation while maintaining exploration quality.
  • Epinets: An auxiliary neural network estimates epistemic uncertainty by taking LLM features and a random epistemic index as input. The combined prediction of the base model and epinet is used for TS. The architecture is kept lightweight to avoid excessive overhead.
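To make the first two techniques more concrete, the sketch below shows dropout-style sampling and a diagonal Laplace posterior for a linear reward head with Gaussian noise. It is a simplified stand-in: in the paper dropout acts inside the LLM rather than on a feature list, and the class name `DiagonalLaplaceHead`, the `prior_precision` and `noise_variance` parameters, and the toy inputs are assumptions introduced here. For a linear-Gaussian head the Hessian of the negative log posterior is the prior precision plus the sum of outer products of the features divided by the noise variance, so its diagonal can be accumulated one observation at a time.

```python
import random
from typing import List, Sequence


def dropout_features(features: Sequence[float], rate: float,
                     rng: random.Random) -> List[float]:
    """Inverted dropout: zero each feature with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


class DiagonalLaplaceHead:
    """Diagonal Gaussian posterior around a given MAP estimate of the head weights."""

    def __init__(self, map_weights: Sequence[float],
                 prior_precision: float = 1.0, noise_variance: float = 1.0) -> None:
        self.map_weights = list(map_weights)
        self.noise_variance = noise_variance
        # Diagonal of the Hessian starts at the prior precision.
        self.hessian_diag = [prior_precision] * len(self.map_weights)

    def update(self, features: Sequence[float]) -> None:
        """Recursive update: add this observation's diagonal curvature x_j^2 / sigma^2."""
        for j, x in enumerate(features):
            self.hessian_diag[j] += x * x / self.noise_variance

    def posterior_variance(self) -> List[float]:
        """Posterior variance per weight is the inverse of the diagonal Hessian."""
        return [1.0 / h for h in self.hessian_diag]

    def sample_weights(self, rng: random.Random) -> List[float]:
        """Draw one weight vector for Thompson Sampling."""
        return [rng.gauss(m, var ** 0.5)
                for m, var in zip(self.map_weights, self.posterior_variance())]


rng = random.Random(0)
head = DiagonalLaplaceHead(map_weights=[0.0, 0.0], prior_precision=1.0,
                           noise_variance=0.5)
head.update([1.0, 2.0])                 # one observed feature vector (toy values)
variances = head.posterior_variance()
sampled_weights = head.sample_weights(rng)
masked = dropout_features([1.0, 2.0], rate=0.0, rng=rng)  # rate 0 leaves input unchanged
```

After one update the diagonal Hessian is [3.0, 9.0], so the weight on the larger feature has the smaller posterior variance. The sketch takes the MAP weights as given; in practice they would come from fine-tuning.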

Empirical Evaluation

The primary experimental task is automated content moderation, using the "measuring hate speech" dataset. The agent must decide whether or not to publish each user comment, with asymmetric rewards reflecting the risk of publishing toxic content. All models are initialized with GPT-2 (124M parameters) and trained with the Adam optimizer.
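The shape of such an asymmetric reward can be sketched as follows. The summary does not state the actual reward values, so every number below is a placeholder, and the function name `moderation_reward`, its parameters, and the action labels are hypothetical.

```python
def moderation_reward(action: str, is_toxic: bool,
                      publish_benign_reward: float,
                      publish_toxic_penalty: float,
                      not_publish_reward: float = 0.0) -> float:
    """Reward for one moderation decision; publishing toxic content is penalized."""
    if action == "publish":
        # Asymmetry: the penalty for a toxic publication can exceed the benign gain.
        return -publish_toxic_penalty if is_toxic else publish_benign_reward
    if action == "not_publish":
        return not_publish_reward
    raise ValueError(f"unknown action: {action!r}")


# Placeholder values only; they are NOT taken from the paper.
benign_gain = 1.0
toxic_penalty = 5.0
r_publish_benign = moderation_reward("publish", False, benign_gain, toxic_penalty)
r_publish_toxic = moderation_reward("publish", True, benign_gain, toxic_penalty)
r_reject_toxic = moderation_reward("not_publish", True, benign_gain, toxic_penalty)
```

With a penalty larger than the gain, an agent that is unsure about a comment's toxicity has a reason to hold it back, which is where a good uncertainty estimate matters.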

Regret Analysis

Empirical results show that all TS-based policies (Dropout, Diag. LA, Last LA, Epinet TS) achieve significantly lower average regret than the greedy baseline.

Figure 1: Average regret (± std. err.) obtained on the toxic content detection bandit task.

The confidence interval for the greedy policy is notably larger, indicating high variance and frequent under-exploration. CDF analysis of cumulative regret further reveals that greedy policies can suffer from persistent suboptimal action selection, especially in runs requiring balanced exploration.

Action Selection Dynamics

Analysis of action selection ratios demonstrates that greedy policies often get stuck in suboptimal arms, while TS policies maintain more balanced exploration.

Figure 2: Action selection ratio for the action "publish" for two particular sample runs.

Dropout Rate Ablation

Ablation experiments on dropout rates confirm that the pre-training dropout rate is near-optimal for the bandit task, with lower rates leading to increased variance due to reduced exploration.

Figure 3: Average regret obtained with different dropout values.

Generalization to Other Datasets

Experiments on the IMDb dataset corroborate the findings: TS policies, especially Last LA and Epinet TS, achieve lower regret than greedy policies, even when hyperparameters are not retuned for the new task.

Figure 4: Average regret (± std. err.) obtained on the IMDb dataset.

Algorithmic and Practical Implications

The results provide strong evidence that epistemic uncertainty estimation is essential for effective exploration in LLM-based contextual bandits. Dropout and LA variants are practical for large models, requiring minimal changes to training or inference. Epinets offer a flexible, non-Bayesian alternative, though their performance is sensitive to architectural choices.

From a deployment perspective, these techniques enable scalable uncertainty-aware decision-making in real-world systems, such as automated moderation, recommendation, and interactive agents. The use of pre-trained weights as priors and leveraging pre-training dropout rates are practical strategies for integrating uncertainty estimation into existing LLM pipelines.

Theoretical Implications and Future Directions

The work highlights the limitations of greedy policies in high-dimensional, stochastic environments and demonstrates the necessity of uncertainty modeling for regret minimization. The adaptation of LA and dropout to LLMs opens avenues for further research on scalable Bayesian inference in deep models. The performance variability of Epinets suggests that architecture search and integration with other uncertainty estimation methods could yield further improvements.

Future research may explore:

  • More efficient Hessian approximations for LA in even larger models.
  • Hybrid uncertainty estimation techniques combining Bayesian and non-Bayesian approaches.
  • Application to reinforcement learning and sequential decision-making beyond bandits.
  • Robustness analysis under distributional shift and adversarial contexts.

Conclusion

This paper provides a comprehensive empirical and algorithmic study of uncertainty estimation in LLM-based contextual bandits. By adapting scalable techniques and demonstrating their superiority over greedy baselines, the authors establish epistemic uncertainty as a fundamental component for exploration and regret minimization in decision-making tasks with natural language input. The findings have direct implications for the design and deployment of LLM agents in real-world systems, advocating for the centrality of uncertainty modeling in future AI development.
