Uncertainty Estimation in LLM-Based Bandits
- The paper introduces uncertainty estimation methods, including Laplace Approximation, MC Dropout, and Epinet, to enhance exploration in LLM-based contextual bandits.
- It details scalable Bayesian inference techniques, such as diagonal and last-layer approximations, to robustly quantify epistemic uncertainty and minimize cumulative regret.
- Empirical evaluations on datasets like Hate Speech and IMDb show that uncertainty-aware Thompson Sampling significantly outperforms greedy policies in balancing exploration and exploitation.
Uncertainty estimation in LLM-based contextual bandits addresses the critical challenge of quantifying epistemic uncertainty in sequential decision-making tasks where context is provided in natural language. While recent LLM-based agents commonly rely on greedy reward prediction, uncertainty-aware policies—such as those leveraging Thompson Sampling—integrate principled probabilistic modeling, leading to fundamentally improved exploration-exploitation tradeoffs. This topic is exemplified in the setting of batch contextual bandits, where scalable approximations to Bayesian inference—Laplace Approximation, Monte Carlo Dropout, and Epistemic Neural Networks—can be layered atop LLMs to robustly estimate uncertainty and minimize cumulative regret (Felicioni et al., 2024).
1. Formalization of LLM-Based Contextual Bandits
A batch contextual bandit with text-based contexts is characterized as follows. At each round $t = 1, \dots, T$, the agent observes a batch of contexts $x_{t,1}, \dots, x_{t,B}$, each presented as a natural language input. For each context $x_{t,b}$, the agent selects an action $a_{t,b} \in \mathcal{A}$, where $|\mathcal{A}| = K$, and receives a real-valued reward $r_{t,b}$. The expected reward is modeled via a parametric function $f_\theta : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, so that $\mathbb{E}[r_{t,b} \mid x_{t,b}, a_{t,b}] = f_{\theta^*}(x_{t,b}, a_{t,b})$ for some unknown $\theta^*$.
The objective is the minimization of the average cumulative regret
$$\mathrm{Regret}(T) = \frac{1}{T} \sum_{t=1}^{T} \sum_{b=1}^{B} \left( \max_{a \in \mathcal{A}} f_{\theta^*}(x_{t,b}, a) - f_{\theta^*}(x_{t,b}, a_{t,b}) \right),$$
where the inner maximum denotes the expected reward of the optimal policy on round $t$ (Felicioni et al., 2024).
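To make the regret computation concrete, the short NumPy sketch below evaluates one round's contribution to the average regret for a batch of three contexts and two actions; the expected-reward values are made up for illustration and are not from the paper.

```python
import numpy as np

# Hypothetical example: true expected rewards f_theta*(x, a) for a batch of
# B = 3 contexts and K = 2 actions (rows = contexts, columns = actions).
true_expected_rewards = np.array([
    [0.9, 0.1],   # context 1: action 0 is optimal
    [0.2, 0.7],   # context 2: action 1 is optimal
    [0.5, 0.4],   # context 3: action 0 is optimal
])

chosen_actions = np.array([0, 0, 1])  # actions picked by the agent this round

# Per-context instantaneous regret: optimal expected reward minus the
# expected reward of the chosen action.
optimal = true_expected_rewards.max(axis=1)
chosen = true_expected_rewards[np.arange(len(chosen_actions)), chosen_actions]
round_regret = (optimal - chosen).sum()

print(round_regret)  # 0.0 + 0.5 + 0.1 = 0.6; averaging over T rounds gives Regret(T)
```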
2. Greedy LLM Bandit Baseline
The standard baseline employs a pre-trained LLM as a feature extractor $\phi(x) \in \mathbb{R}^d$, feeding its output into a linear layer $W \in \mathbb{R}^{K \times d}$ to produce $f_\theta(x) = W \phi(x)$, so that $f_\theta(x, a)$ is the $a$-th coordinate. The greedy policy is thus:
$$a_{t,b} = \arg\max_{a \in \mathcal{A}} f_{\hat{\theta}_t}(x_{t,b}, a),$$
where the parameter $\hat{\theta}_t$ is estimated by minimizing an $L_2$-regularized mean squared error (MSE) loss:
$$\mathcal{L}(\theta) = \sum_{(x, a, r) \in \mathcal{D}_t} \left( f_\theta(x, a) - r \right)^2 + \frac{1}{2\sigma_0^2} \lVert \theta - \theta_0 \rVert_2^2,$$
with the regularization term enforcing a Gaussian prior of variance $\sigma_0^2$ on $\theta$ centered at the pre-trained initialization $\theta_0$ (Felicioni et al., 2024).
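As an illustration of this baseline, the PyTorch sketch below wires a feature extractor into a linear reward head, forms the $L_2$-to-initialization regularized MSE, and acts greedily; the `encoder` interface, the handling of `theta_0`, and the regularization weight are assumptions for the sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GreedyLLMBandit(nn.Module):
    """Reward model f_theta(x) = W phi(x): LLM feature extractor + linear head."""

    def __init__(self, encoder: nn.Module, feature_dim: int, num_actions: int):
        super().__init__()
        self.encoder = encoder                    # assumed to return pooled (B, d) features
        self.head = nn.Linear(feature_dim, num_actions)

    def forward(self, input_ids, attention_mask):
        phi = self.encoder(input_ids, attention_mask)   # (B, d)
        return self.head(phi)                           # (B, K) predicted rewards

def regularized_mse(model, theta_0, preds, actions, rewards, lam=1e-4):
    """Squared error on observed (x, a, r) triples plus an L2 penalty pulling
    theta toward the pre-trained initialization theta_0 (a list of tensor
    copies of model.parameters()); lam is an illustrative value."""
    chosen = preds.gather(1, actions.unsqueeze(1)).squeeze(1)   # f_theta(x, a)
    mse = ((chosen - rewards) ** 2).sum()
    reg = sum(((p - p0) ** 2).sum() for p, p0 in zip(model.parameters(), theta_0))
    return mse + lam * reg

def greedy_action(model, input_ids, attention_mask):
    """Greedy policy: a = argmax_a f_theta_hat(x, a)."""
    with torch.no_grad():
        return model(input_ids, attention_mask).argmax(dim=1)
```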
3. Scalable Uncertainty Estimation Methods
Uncertainty modeling is a prerequisite for effective exploration in contextual bandits, particularly for sampling-based policies such as Thompson Sampling. Three mechanisms for approximate posterior inference in LLM-based bandits are examined:
3.1 Laplace Approximation (LA)
Laplace Approximation takes a Bayesian perspective, constructing a local Gaussian approximation to the posterior:
- Compute the MAP estimate $\theta_{\mathrm{MAP}} = \arg\min_\theta \mathcal{L}(\theta)$, i.e., the minimizer of the regularized loss above.
- Approximate the negative log posterior by a second-order Taylor expansion around $\theta_{\mathrm{MAP}}$:
$$-\log p(\theta \mid \mathcal{D}) \approx \mathrm{const} + \tfrac{1}{2} (\theta - \theta_{\mathrm{MAP}})^\top H (\theta - \theta_{\mathrm{MAP}}), \qquad H = \nabla^2_\theta \mathcal{L}(\theta) \big|_{\theta = \theta_{\mathrm{MAP}}}.$$
- This yields the Gaussian posterior approximation $p(\theta \mid \mathcal{D}) \approx \mathcal{N}(\theta_{\mathrm{MAP}}, H^{-1})$.
For LLMs, scalable computation is obtained via the following devices (a minimal code sketch follows the list):
- Recursive Hessian updates: $H_t = H_{t-1} + \nabla^2_\theta \mathcal{L}_t(\theta) \big|_{\theta = \theta_{\mathrm{MAP}, t}}$, accumulating curvature from each new batch of data.
- Fisher-Hessian approximation: replace the Hessian with the diagonal of the empirical Fisher information, $H \approx \mathrm{diag}(F)$.
- Last-layer LA: compute the approximation only for the linear output layer, keeping the LLM backbone fixed during posterior sampling (Felicioni et al., 2024).
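The sketch below illustrates the last two devices, a diagonal empirical-Fisher approximation restricted to the last (linear) layer, under the assumption that per-example gradients of that layer are available as flat vectors; function names and the prior-precision default are illustrative.

```python
import torch

def diag_fisher_last_layer(per_example_grads):
    """Diagonal empirical Fisher for the last layer: F_diag = sum_i g_i ** 2,
    where each g_i is the flattened gradient of the loss on example i with
    respect to the linear head's weights."""
    return torch.stack([g ** 2 for g in per_example_grads]).sum(dim=0)

def sample_last_layer(theta_map, fisher_diag, prior_precision=1.0):
    """One Thompson sample from N(theta_map, H^{-1}), with H approximated by
    diag(F) plus the prior precision (both diagonal, so sampling is O(d))."""
    precision = fisher_diag + prior_precision      # diagonal of the approximate H
    std = precision.rsqrt()                        # elementwise H^{-1/2}
    return theta_map + std * torch.randn_like(theta_map)

# Usage sketch: keep the LLM backbone at theta_MAP, replace only the head's
# weights with theta_tilde = sample_last_layer(theta_map, fisher_diag) before
# scoring the round's contexts.
```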
3.2 Monte Carlo Dropout
Applying dropout at inference time is interpreted as sampling from a variational posterior $q(\theta)$:
- At each Thompson Sampling decision, sample a dropout mask across the network, yielding a thinned parameter $\tilde{\theta} \sim q(\theta)$.
- Estimate predictive uncertainty by aggregating Monte Carlo samples:
$$\hat{\mathbb{E}}\left[ f(x, a) \right] \approx \frac{1}{M} \sum_{m=1}^{M} f_{\tilde{\theta}_m}(x, a), \qquad \tilde{\theta}_m \sim q(\theta).$$
This mechanism implements efficient, scalable uncertainty quantification with minimal modifications to pre-trained LLMs (Felicioni et al., 2024).
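A minimal sketch of how MC Dropout plugs into action selection follows, assuming `model` is a reward network (such as the earlier baseline) that contains `nn.Dropout` layers; keeping the module in train mode at inference is what makes each forward pass a posterior sample.

```python
import torch

def thompson_action_with_dropout(model, input_ids, attention_mask):
    """One TS decision via MC Dropout: keeping dropout active at inference makes
    a single forward pass equivalent to acting under one sampled network
    theta_tilde ~ q(theta)."""
    model.train()          # keeps nn.Dropout layers stochastic at inference
    with torch.no_grad():  # no gradients are needed for action selection
        rewards = model(input_ids, attention_mask)   # (B, K) under one dropout mask
    return rewards.argmax(dim=1)

def mc_dropout_predictive(model, input_ids, attention_mask, num_samples=20):
    """Aggregate several stochastic passes to estimate the predictive mean and
    the epistemic spread of f(x, a), as in the Monte Carlo estimate above."""
    model.train()
    with torch.no_grad():
        samples = torch.stack([model(input_ids, attention_mask)
                               for _ in range(num_samples)])
    return samples.mean(dim=0), samples.std(dim=0)
```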
3.3 Epistemic Neural Networks (Epinets)
Epinets integrate an auxiliary network $\sigma_\eta$ that takes as input the extracted base features $\phi(x)$ (with gradients stopped) and a random "epistemic index" $z \sim P_z$:
$$f_\theta(x, a, z) = \mu_\zeta(x, a) + \sigma_\eta\!\left( \mathrm{sg}[\phi(x)], a, z \right).$$
The training objective is:
$$\mathcal{L}(\theta) = \sum_{(x, a, r) \in \mathcal{D}} \mathbb{E}_{z \sim P_z}\!\left[ \left( f_\theta(x, a, z) - r \right)^2 \right] + \text{regularization},$$
with $z$ re-sampled per data point. At inference, sampling a single index $z$ mimics Thompson Sampling, even in the absence of an explicit Bayesian formulation (Felicioni et al., 2024).
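The following sketch shows one common way to structure such an auxiliary head, following the standard epinet recipe (stop-gradient base features concatenated with an index $z$, with the output projected back onto $z$); layer sizes and the index distribution are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Epinet(nn.Module):
    """Auxiliary head sigma_eta(sg[phi(x)], a, z): a small MLP over the
    gradient-stopped base features concatenated with an epistemic index z."""

    def __init__(self, feature_dim: int, index_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.index_dim = index_dim
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim + index_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions * index_dim),
        )

    def forward(self, phi, z):
        phi = phi.detach()                               # stop-gradient on base features
        h = self.mlp(torch.cat([phi, z.expand(phi.size(0), -1)], dim=1))
        h = h.view(phi.size(0), -1, self.index_dim)      # (B, K, index_dim)
        return h @ z                                     # project onto z -> (B, K)

# Epistemic reward prediction: f(x, a, z) = base_head(phi(x))[a] + epinet(phi(x), z)[a],
# with a fresh z ~ N(0, I) per data point during training and a single z per TS round.
```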
4. Thompson Sampling with Uncertainty Estimates
Thompson Sampling (TS) with parameter uncertainty is executed per round $t$:
- Observe the batch of contexts $x_{t,1}, \dots, x_{t,B}$.
- Sample parameters $\tilde{\theta}_t \sim \hat{p}(\theta \mid \mathcal{D}_{t-1})$ using one of the aforementioned approximations.
- For each $b = 1, \dots, B$, select $a_{t,b} = \arg\max_{a \in \mathcal{A}} f_{\tilde{\theta}_t}(x_{t,b}, a)$.
- Observe rewards $r_{t,1}, \dots, r_{t,B}$, augment the dataset, and update the posterior approximation.
The process is formalized as:
\begin{algorithmic}[1]
\Require prior $p(\theta)$, initial dataset $\mathcal{D}_0 = \emptyset$
\For{$t = 1, \dots, T$}
  \State observe contexts $x_{t,1}, \dots, x_{t,B}$
  \State sample $\tilde{\theta}_t \sim \hat{p}(\theta \mid \mathcal{D}_{t-1})$
  \For{$b = 1, \dots, B$}
    \State $a_{t,b} \gets \arg\max_{a \in \mathcal{A}} f_{\tilde{\theta}_t}(x_{t,b}, a)$
    \State observe reward $r_{t,b}$
  \EndFor
  \State update $\mathcal{D}_t \gets \mathcal{D}_{t-1} \cup \{(x_{t,b}, a_{t,b}, r_{t,b})\}_{b=1}^{B}$
  \State update posterior approximation $\hat{p}(\theta \mid \mathcal{D}_t)$
\EndFor
\end{algorithmic}
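For concreteness, a non-authoritative Python rendering of the same loop is sketched below; `posterior_sampler`, `env_reward`, and `fit_fn` are hypothetical hooks standing in for the chosen uncertainty method, the environment, and the per-round fine-tuning step.

```python
import torch

def thompson_sampling_round(model, posterior_sampler, contexts, dataset, fit_fn):
    """One round of the loop above. `posterior_sampler` loads a single parameter
    sample theta_tilde into `model` (dropout mask, Laplace sample, or epinet
    index); `contexts` yields (input_ids, attention_mask, env_reward) triples,
    where env_reward is a callback returning the observed reward; `fit_fn`
    refits the model and its posterior approximation on the augmented data."""
    posterior_sampler(model)                        # theta_tilde ~ p_hat(theta | D_{t-1})
    for input_ids, attention_mask, env_reward in contexts:
        with torch.no_grad():
            action = model(input_ids, attention_mask).argmax(dim=-1)  # greedy under the sample
        reward = env_reward(action)                 # observe r_{t,b}
        dataset.append((input_ids, attention_mask, action, reward))
    fit_fn(model, dataset)                          # update posterior approximation p_hat(theta | D_t)
```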
5. Empirical Evaluation and Key Findings
Empirical evaluation utilizes the "Measuring Hate Speech" dataset (≈136k comments), with each round presenting a batch of comments. The actions are "publish" or "not-publish": the reward favors publishing non-toxic comments and withholding toxic ones, while not-publishing yields a fixed reward independent of the comment.
Protocols employ multiple rounds and 20 random seeds, with fine-tuning for 50 epochs per round using the Adam optimizer. Models assessed include:
- Greedy (no uncertainty estimation)
- TS with Dropout
- TS with diagonal Fisher Laplace Approximation ("Diag. LA")
- TS with last-layer full Laplace Approximation ("Last LA")
- TS with Epinet
Key findings:
- All TS variants, regardless of uncertainty estimation strategy, achieve substantially lower average regret than greedy.
- TS methods exhibit tighter confidence intervals; greedy incurs higher variance and in some seeds nearly constant regret.
- CDF analysis over random seeds shows TS dominates greedy both in worst-case and average performance.
- Action-selection ratio curves demonstrate greedy's under-exploration relative to TS methods, which achieve more balanced action distribution.
An additional IMDb dataset experiment confirms the qualitative pattern: epistemic uncertainty improves exploration and regret minimization (Felicioni et al., 2024).
6. Theoretical Insights and Practical Implications
Thompson Sampling, when the posterior is exact, attains near-optimal regret bounds, e.g., $\tilde{O}(d^{3/2}\sqrt{T})$ for $d$-dimensional linear bandits (Agrawal & Goyal, 2017). Even approximate posteriors that capture the core epistemic uncertainty can yield significantly improved empirical exploration compared to the greedy policy. Epistemic uncertainty drives exploration toward actions with high reward uncertainty; in contrast, deterministic greedy strategies risk premature convergence on suboptimal arms (Felicioni et al., 2024).
A plausible implication is that scalable, approximate uncertainty estimation can be deployed in LLM-based decision-making tasks with minimal incremental cost and substantial improvement in online performance. Even simple MC-Dropout, without re-tuning dropout probability, is highly competitive. More sophisticated methods (e.g., Laplace Approximation, Epinet) can further enhance robustness, but the primary benefit arises from incorporating any epistemic uncertainty into the decision policy.
7. Conclusions
Fine-tuning LLMs for reward prediction while neglecting epistemic uncertainty leads to unreliable and suboptimal decision-making in contextual bandit settings. Incorporating scalable uncertainty estimates—via Dropout, Laplace Approximation, or Epinet—within Thompson Sampling dramatically reduces regret and ensures more robust exploration. The principal finding is that modeling epistemic uncertainty is not optional, but fundamental for safe and efficient online decision-making in LLM-driven bandit problems (Felicioni et al., 2024).