
Uncertainty Estimation in LLM-Based Bandits

Updated 18 December 2025
  • The paper introduces uncertainty estimation methods, including Laplace Approximation, MC Dropout, and Epinet, to enhance exploration in LLM-based contextual bandits.
  • It details scalable Bayesian inference techniques, such as diagonal and last-layer approximations, to robustly quantify epistemic uncertainty and minimize cumulative regret.
  • Empirical evaluations on datasets like Hate Speech and IMDb show that uncertainty-aware Thompson Sampling significantly outperforms greedy policies in balancing exploration and exploitation.

Uncertainty estimation in LLM-based contextual bandits addresses the critical challenge of quantifying epistemic uncertainty in sequential decision-making tasks where context is provided in natural language. While recent LLM-based agents commonly rely on greedy reward prediction, uncertainty-aware policies—such as those leveraging Thompson Sampling—integrate principled probabilistic modeling, leading to fundamentally improved exploration-exploitation tradeoffs. This topic is exemplified in the setting of batch contextual bandits, where scalable approximations to Bayesian inference—Laplace Approximation, Monte Carlo Dropout, and Epistemic Neural Networks—can be layered atop LLMs to robustly estimate uncertainty and minimize cumulative regret (Felicioni et al., 2024).

1. Formalization of LLM-Based Contextual Bandits

A batch contextual bandit with text-based contexts is characterized as follows. At each round $t=1,\ldots,T$, the agent observes a batch of $B$ contexts $x_t^1,\ldots,x_t^B\in\mathcal X$, each presented as a natural language input. For each context $x_t^b$, the agent selects an action $a_t^b\in\mathcal A$, where $|\mathcal A|=K$, and receives a real-valued reward $r_t^b\in\mathbb R$. The expected reward is modeled via a parametric function $f_\theta:\mathcal X\times\mathcal A\rightarrow\mathbb R$:

$$r = f_\theta(x, a) + \epsilon, \quad \epsilon\sim\mathcal N(0, \sigma_{\mathrm{obs}}^2),$$

so that $\mathbb E[r\mid x,a]=f_\theta(x,a)$.

The objective is the minimization of average cumulative regret:

$$R_T = \frac{1}{T} \sum_{t=1}^T \left[r_t^* - \frac{1}{B} \sum_{b=1}^B r_t^b\right],$$

where $r_t^*$ denotes the expected reward of the optimal policy on round $t$ (Felicioni et al., 2024).
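
To make the quantity concrete, the following minimal sketch (illustrative names, not from the paper) computes the average cumulative regret from logged per-round rewards:

import numpy as np

def average_cumulative_regret(optimal_rewards, batch_rewards):
    # optimal_rewards: shape (T,), expected reward r_t^* of the optimal policy per round
    # batch_rewards:   shape (T, B), observed rewards for the B contexts served each round
    per_round_gap = optimal_rewards - batch_rewards.mean(axis=1)
    return per_round_gap.mean()  # (1/T) * sum_t [ r_t^* - (1/B) sum_b r_t^b ]

# e.g. average_cumulative_regret(np.ones(100), np.random.rand(100, 32))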

2. Greedy LLM Bandit Baseline

The standard baseline employs a pre-trained LLM as a feature extractor $\tilde\pi_{\theta_{\mathrm{PT}}}(x)\in\mathbb R^d$, feeding its output into a linear layer to produce $f_\theta(x) = \mathrm{Linear}(\tilde\pi_{\theta_{\mathrm{PT}}}(x))\in\mathbb R^K$, so that $f_\theta(x,a)$ is the $a$-th coordinate. The greedy policy is thus:

$$a_{\mathrm{greedy}}(x) = \arg\max_{a\in\mathcal A} f_{\hat\theta}(x,a),$$

where the parameter $\hat\theta$ is estimated by minimizing an $\ell_2$-regularized mean squared error (MSE) loss:

$$L(\theta; \mathcal D_t) = \sum_{b=1}^B \left(r_t^b - f_\theta(x_t^b, a_t^b)\right)^2 + \lambda \|\theta - \theta_{t-1}\|_2^2,$$

with regularization $\lambda=\sigma_{\mathrm{obs}}^2/\sigma_p^2$ enforcing a Gaussian prior of variance $\sigma_p^2$ on $\theta$, centered at $\theta_{t-1}$ (Felicioni et al., 2024).
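
A minimal sketch of how such a greedy head could sit on top of an LLM backbone, assuming a PyTorch encoder that returns pooled (batch, feature_dim) features; all class and helper names here are illustrative, not the paper's code.

import torch
import torch.nn as nn

class GreedyLLMBandit(nn.Module):
    # Pre-trained LLM feature extractor followed by a K-output linear reward head.
    def __init__(self, encoder, feature_dim, num_actions):
        super().__init__()
        self.encoder = encoder        # assumed: maps token ids to (batch, feature_dim) features
        self.head = nn.Linear(feature_dim, num_actions)

    def forward(self, input_ids, attention_mask):
        feats = self.encoder(input_ids, attention_mask)   # pooled features of the text context
        return self.head(feats)                           # predicted reward for each of the K actions

def regularized_mse(model, prev_params, preds, rewards, actions, lam):
    # Squared error on the chosen actions plus an L2 pull toward the previous
    # round's parameters (Gaussian prior centred at theta_{t-1}).
    chosen = preds.gather(1, actions.unsqueeze(1)).squeeze(1)
    mse = ((rewards - chosen) ** 2).sum()
    l2 = sum(((p - q) ** 2).sum() for p, q in zip(model.parameters(), prev_params))
    return mse + lam * l2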

3. Scalable Uncertainty Estimation Methods

Uncertainty modeling is a prerequisite for effective exploration in contextual bandits and is required by sampling-based policies such as Thompson Sampling. Three mechanisms for approximate posterior inference in LLM-based bandits are examined:

3.1 Laplace Approximation (LA)

Laplace Approximation takes a Bayesian perspective, constructing a local Gaussian approximation to the posterior:

  • Compute $\theta_{\mathrm{MAP}} = \arg\min_\theta L(\theta; \mathcal D)$.
  • Approximate the negative log posterior by a second-order Taylor expansion around $\theta_{\mathrm{MAP}}$:

$$L(\theta; \mathcal D) \approx L(\theta_{\mathrm{MAP}}; \mathcal D) + \frac{1}{2} (\theta - \theta_{\mathrm{MAP}})^\top H (\theta - \theta_{\mathrm{MAP}}), \quad H=\nabla^2_\theta L(\theta; \mathcal D)\big|_{\theta_{\mathrm{MAP}}}$$

  • This yields $p(\theta\mid\mathcal D) \approx \mathcal N(\theta_{\mathrm{MAP}}, H^{-1})$.

For LLMs, scalable computation is obtained via:

  • Recursive Hessian updates: $H^{(1:t)} = H_l^{(t)} + H^{(1:t-1)} + H_p$.
  • Fisher-Hessian approximation: use the diagonal $H_l = \mathrm{diag}\left(\sum_{\text{data}} (\nabla_\theta f_\theta(x, a))^2/\sigma_{\mathrm{obs}}^2\right)$.
  • Last-layer LA: compute $H$ only for the linear output layer, keeping the LLM backbone fixed during posterior sampling (Felicioni et al., 2024); see the sketch after this list.
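
As a rough illustration of the last-layer, diagonal-Fisher variant, the sketch below accumulates per-coordinate Fisher terms for a linear head $f(x,a) = w_a^\top \phi(x)$ on top of frozen LLM features and samples head weights from the resulting Gaussian; the NumPy helpers and names are assumptions for exposition, not the paper's implementation.

import numpy as np

def diag_fisher_last_layer(features, actions, sigma_obs2, num_actions):
    # features: (N, d) pooled LLM features phi(x), treated as fixed (last-layer LA)
    # actions:  (N,) indices of the actions actually taken
    # Returns a (K, d) array of diagonal Fisher terms for the head weights.
    fisher = np.zeros((num_actions, features.shape[1]))
    for phi, a in zip(features, actions):
        # d f(x, a) / d w_a = phi(x); the rows of the other actions get zero gradient
        fisher[a] += phi ** 2 / sigma_obs2
    return fisher

def sample_head_weights(w_map, fisher, prior_precision):
    # Posterior approximation N(w_MAP, H^{-1}) with diagonal H = Fisher + prior precision.
    precision = fisher + prior_precision
    return w_map + np.random.randn(*w_map.shape) / np.sqrt(precision)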

3.2 Monte Carlo Dropout

Applying dropout at inference time is interpreted as sampling from a variational posterior $q(\theta)$:

  • At each Thompson Sampling decision, sample a dropout mask with rate $p_\mathrm{drop}$ across the network, yielding a thinned parameter vector $\hat\theta$.
  • Estimate predictive uncertainty by aggregating Monte Carlo samples:

$$p(r\mid x,a, \mathcal D) \approx \frac{1}{M}\sum_{i=1}^M \mathcal N\!\left(r;\, f_{\theta^{(i)}}(x,a),\, \sigma_{\mathrm{obs}}^2\right)$$

This mechanism implements efficient, scalable uncertainty quantification with minimal modifications to pre-trained LLMs (Felicioni et al., 2024).
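
A minimal sketch of this mechanism, reusing the hypothetical GreedyLLMBandit-style module from above: dropout is simply left active at decision time, so each forward pass corresponds to one Thompson sample.

import torch

def thompson_action_with_dropout(model, input_ids, attention_mask):
    # Keep dropout layers stochastic at inference to draw one sample theta^(i) ~ q(theta),
    # then act greedily with respect to the sampled reward predictions.
    model.train()
    with torch.no_grad():
        sampled_rewards = model(input_ids, attention_mask)   # shape (B, K)
    return sampled_rewards.argmax(dim=1)                     # one action per context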

3.3 Epistemic Neural Networks (Epinets)

Epinets integrate an auxiliary network $\mathrm{epi}_\eta$ that takes as input the extracted base features and a random "epistemic index" $z\sim P_Z$:

$$g_{\theta,\eta}(x,a;z) = f_\theta(x,a) + \mathrm{epi}_\eta\!\left([\mathrm{sg}(\tilde\pi_\theta(x)), z]\right)^\top z$$

The training objective is:

$$L(\theta, \eta; \mathcal D) = \sum \left(r - g_{\theta,\eta}(x,a;z)\right)^2 + \lambda \|\theta-\theta_{\mathrm{prev}}\|_2^2,$$

with $z$ re-sampled per data point. At inference, $z$ is sampled afresh to mimic Thompson Sampling, even in the absence of an explicit Bayesian formulation (Felicioni et al., 2024).
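
A minimal sketch of the epinet forward pass, assuming the auxiliary head outputs one index-dimensional vector per action that is contracted with the epistemic index $z$; dimensions and names are illustrative only.

import torch
import torch.nn as nn

class Epinet(nn.Module):
    # Maps [stop-grad base features, z] to per-action coefficients contracted with z.
    def __init__(self, feature_dim, index_dim, num_actions, hidden=64):
        super().__init__()
        self.num_actions, self.index_dim = num_actions, index_dim
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim + index_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions * index_dim),
        )

    def forward(self, base_rewards, features, z):
        # base_rewards: (B, K) from f_theta; features: (B, d); z: (B, index_dim) ~ P_Z
        inp = torch.cat([features.detach(), z], dim=1)        # detach() plays the role of sg(.)
        coeffs = self.mlp(inp).view(-1, self.num_actions, self.index_dim)
        correction = torch.einsum("bki,bi->bk", coeffs, z)    # epi_eta(...)^T z, per action
        return base_rewards + correction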

4. Thompson Sampling with Uncertainty Estimates

Thompson Sampling (TS) with parameter uncertainty is executed per round:

  1. Observe the batch $x_t^1, \ldots, x_t^B$.
  2. Sample parameters $\tilde\theta\sim p(\theta\mid\mathcal D_{t-1})$ using one of the aforementioned approximations.
  3. For each $b$, select $a_t^b = \arg\max_a f_{\tilde\theta}(x_t^b,a)$.
  4. Observe rewards $r_t^b$, augment the dataset, and update the posterior.

The process is formalized as:

\begin{algorithmic}[1]
  \Require prior $p(\theta)$
  \For{$t = 1, \ldots, T$}
    \State observe contexts $x_t^1, \ldots, x_t^B$
    \State sample $\tilde\theta \sim p(\theta \mid \mathcal D_{t-1})$
    \For{$b = 1, \ldots, B$}
      \State $a_t^b \gets \arg\max_{a \in \mathcal A} f_{\tilde\theta}(x_t^b, a)$
      \State observe reward $r_t^b$
    \EndFor
    \State update $\mathcal D_t \gets \mathcal D_{t-1} \cup \{(x_t^b, a_t^b, r_t^b)\}_{b=1}^{B}$
    \State update posterior approximation $p(\theta \mid \mathcal D_t)$
  \EndFor
\end{algorithmic}
(Felicioni et al., 2024)
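
In code, the loop above might be organized as follows; the three callables stand in for whichever of the uncertainty methods (LA, MC Dropout, or Epinet) is in use and are hypothetical helpers, not APIs from the paper.

def thompson_sampling_loop(num_rounds, prior_state, env,
                           sample_parameters, select_actions, update_posterior):
    # Schematic batch Thompson Sampling driver; the callables implement the
    # chosen posterior approximation and action selection.
    dataset, posterior = [], prior_state
    for _ in range(num_rounds):
        contexts = env.sample_batch()                         # x_t^1, ..., x_t^B
        theta_sample = sample_parameters(posterior)           # theta~ ~ p(theta | D_{t-1})
        actions = select_actions(theta_sample, contexts)      # argmax_a f_{theta~}(x, a)
        rewards = env.reveal_rewards(contexts, actions)       # r_t^1, ..., r_t^B
        dataset.extend(zip(contexts, actions, rewards))
        posterior = update_posterior(posterior, dataset)      # refit and update approximation
    return dataset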

5. Empirical Evaluation and Key Findings

Empirical evaluation utilizes the "Measuring Hate Speech" dataset (≈136k comments), with each round presenting $B=32$ comments. Actions are "publish" or "not-publish", with reward assignments: publishing a non-toxic comment yields $+1$, publishing a toxic comment yields $-0.5$, and not-publishing any comment yields $+0.5$.
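
The reward rule, written out as a small illustrative function:

def moderation_reward(action, is_toxic):
    # Reward structure of the hate-speech moderation bandit.
    if action == "publish":
        return -0.5 if is_toxic else 1.0
    return 0.5  # "not-publish" always yields +0.5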

Protocols employ $T=100$ rounds and 20 random seeds, with fine-tuning for 50 epochs per round using Adam ($\text{lr}=3\times10^{-5}$). Models assessed include:

  • Greedy (no uncertainty estimation)
  • TS with Dropout
  • TS with diagonal Fisher Laplace Approximation ("Diag. LA")
  • TS with last-layer full Laplace Approximation ("Last LA")
  • TS with Epinet

Key findings:

  • All TS variants, regardless of uncertainty estimation strategy, achieve substantially lower average regret than greedy.
  • TS methods exhibit tighter confidence intervals; greedy incurs higher variance and in some seeds nearly constant regret.
  • CDF analysis over random seeds shows TS dominates greedy both in worst-case and average performance.
  • Action-selection ratio curves demonstrate greedy's under-exploration relative to TS methods, which achieve more balanced action distribution.

An additional IMDb dataset experiment confirms the qualitative pattern: epistemic uncertainty improves exploration and regret minimization (Felicioni et al., 2024).

6. Theoretical Insights and Practical Implications

Thompson Sampling with an exact posterior attains near-optimal regret guarantees, e.g., cumulative regret growing as $O(\sqrt{T})$ (up to dimension-dependent and logarithmic factors) for linear bandits (Agrawal & Goyal, 2017). Even approximate posteriors that capture the core epistemic uncertainty can yield significantly improved empirical exploration compared to the greedy policy. Epistemic uncertainty drives exploration toward actions with high reward uncertainty; in contrast, deterministic greedy strategies risk premature convergence on suboptimal arms (Felicioni et al., 2024).

A plausible implication is that scalable, approximate uncertainty estimation can be deployed in LLM-based decision-making tasks with minimal incremental cost and substantial improvement in online performance. Even simple MC-Dropout, without re-tuning dropout probability, is highly competitive. More sophisticated methods (e.g., Laplace Approximation, Epinet) can further enhance robustness, but the primary benefit arises from incorporating any epistemic uncertainty into the decision policy.

7. Conclusions

Fine-tuning LLMs for reward prediction while neglecting epistemic uncertainty leads to unreliable and suboptimal decision-making in contextual bandit settings. Incorporating scalable uncertainty estimates—via Dropout, Laplace Approximation, or Epinet—within Thompson Sampling dramatically reduces regret and ensures more robust exploration. The principal finding is that modeling epistemic uncertainty is not optional, but fundamental for safe and efficient online decision-making in LLM-driven bandit problems (Felicioni et al., 2024).

References (1)
