
Uncertainty Estimation in LLM-Based Bandits

Updated 18 December 2025
  • The paper introduces uncertainty estimation methods, including Laplace Approximation, MC Dropout, and Epinet, to enhance exploration in LLM-based contextual bandits.
  • It details scalable Bayesian inference techniques, such as diagonal and last-layer approximations, to robustly quantify epistemic uncertainty and minimize cumulative regret.
  • Empirical evaluations on datasets like Hate Speech and IMDb show that uncertainty-aware Thompson Sampling significantly outperforms greedy policies in balancing exploration and exploitation.

Uncertainty estimation in LLM-based contextual bandits addresses the critical challenge of quantifying epistemic uncertainty in sequential decision-making tasks where context is provided in natural language. While recent LLM-based agents commonly rely on greedy reward prediction, uncertainty-aware policies—such as those leveraging Thompson Sampling—integrate principled probabilistic modeling, leading to fundamentally improved exploration-exploitation tradeoffs. This topic is exemplified in the setting of batch contextual bandits, where scalable approximations to Bayesian inference—Laplace Approximation, Monte Carlo Dropout, and Epistemic Neural Networks—can be layered atop LLMs to robustly estimate uncertainty and minimize cumulative regret (Felicioni et al., 2024).

1. Formalization of LLM-Based Contextual Bandits

A batch contextual bandit with text-based contexts is characterized as follows. At each round $t = 1, \ldots, T$, the agent observes a batch of $B$ contexts $x_t^1, \ldots, x_t^B \in \mathcal{X}$, each presented as a natural language input. For each context $x_t^b$, the agent selects an action $a_t^b \in \mathcal{A}$, where $|\mathcal{A}| = K$, and receives a real-valued reward $r_t^b \in \mathbb{R}$. The expected reward is modeled via a parametric function $f_\theta : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$:

$$r = f_\theta(x, a) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_{\mathrm{obs}}^2),$$

so that $\mathbb{E}[r \mid x, a] = f_\theta(x, a)$.

The objective is the minimization of average cumulative regret:

$$\mathrm{Regret}(T) = \frac{1}{TB} \sum_{t=1}^{T} \sum_{b=1}^{B} \left( r^*(x_t^b) - \mathbb{E}[r \mid x_t^b, a_t^b] \right),$$

where $r^*(x_t^b) = \max_{a \in \mathcal{A}} \mathbb{E}[r \mid x_t^b, a]$ denotes the expected reward of the optimal policy on round $t$ (Felicioni et al., 2024).
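As a concrete illustration, the average cumulative regret above can be computed directly from tabulated expected rewards. The following is a hypothetical sketch; the array layout and helper name are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def average_cumulative_regret(expected_rewards, chosen):
    """Average cumulative regret over T rounds of batch size B.

    expected_rewards: (T, B, K) array holding E[r | x_t^b, a] for every action.
    chosen: (T, B) integer array with the actions the agent actually took.
    """
    T, B, _ = expected_rewards.shape
    optimal = expected_rewards.max(axis=-1)  # r*(x_t^b): best expected reward
    taken = np.take_along_axis(
        expected_rewards, chosen[..., None], axis=-1)[..., 0]
    return float((optimal - taken).sum() / (T * B))
```

Note that regret is measured against expected rewards, not noisy realized rewards, so it is always non-negative.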

2. Greedy LLM Bandit Baseline

The standard baseline employs a pre-trained LLM as a feature extractor $\phi : \mathcal{X} \to \mathbb{R}^d$, feeding its output into a linear layer to produce $f_\theta(x) \in \mathbb{R}^K$, so that $f_\theta(x, a)$ is the $a$-th coordinate. The greedy policy is thus:

$$a_t^b = \arg\max_{a \in \mathcal{A}} f_{\hat{\theta}_t}(x_t^b, a),$$

where the parameter $\hat{\theta}_t$ is estimated by minimizing an $\ell_2$-regularized mean squared error (MSE) loss:

$$\mathcal{L}(\theta) = \sum_{(x, a, r) \in \mathcal{D}_t} \left( f_\theta(x, a) - r \right)^2 + \lambda \|\theta - \theta_0\|_2^2,$$

with regularization $\lambda$ enforcing a Gaussian prior of variance $\sigma_0^2$ on $\theta$ centered at the pre-trained weights $\theta_0$ (Felicioni et al., 2024).
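A minimal sketch of this baseline, where frozen embeddings stand in for $\phi(x)$ and a per-action ridge regression stands in for the regularized MSE fit. Names and shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fit_head(Phi, actions, rewards, K, lam=1.0):
    """Fit a K-armed linear head by L2-regularized least squares.

    Phi: (n, d) frozen LLM features; actions: (n,) chosen arms; rewards: (n,).
    The ridge penalty lam corresponds to a Gaussian prior on the weights.
    """
    d = Phi.shape[1]
    W = np.zeros((K, d))
    for a in range(K):
        X, y = Phi[actions == a], rewards[actions == a]
        W[a] = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return W

def greedy_action(phi_x, W):
    """Greedy policy: act on the point estimate, with no exploration."""
    return int(np.argmax(W @ phi_x))
```

The key limitation motivating the rest of the article is visible here: `greedy_action` uses only the point estimate `W` and carries no notion of how uncertain that estimate is.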

3. Scalable Uncertainty Estimation Methods

Uncertainty modeling is a prerequisite for effective exploration in contextual bandits and is required by sampling-based policies such as Thompson Sampling. Three mechanisms for approximate posterior inference in LLM-based bandits are examined:

3.1 Laplace Approximation (LA)

Laplace Approximation takes a Bayesian perspective, constructing a local Gaussian approximation to the posterior:

  • Compute the MAP estimate $\hat{\theta} = \arg\min_\theta \mathcal{L}(\theta)$.
  • Approximate the negative log posterior by a second-order Taylor expansion around $\hat{\theta}$:

$$-\log p(\theta \mid \mathcal{D}) \approx -\log p(\hat{\theta} \mid \mathcal{D}) + \tfrac{1}{2} (\theta - \hat{\theta})^\top \Lambda (\theta - \hat{\theta}), \qquad \Lambda = \nabla^2_\theta \mathcal{L}(\hat{\theta}).$$

  • This yields the Gaussian posterior $p(\theta \mid \mathcal{D}) \approx \mathcal{N}(\hat{\theta}, \Lambda^{-1})$.

For LLMs, scalable computation is obtained via:

  • Recursive Hessian updates: $\Lambda_t = \Lambda_{t-1} + \nabla^2_\theta \mathcal{L}_t(\hat{\theta}_t)$, accumulating curvature from each new batch.
  • Fisher–Hessian approximation: use a diagonal Fisher-information approximation in place of the full Hessian.
  • Last-layer LA: compute the Gaussian posterior only for the linear output layer, keeping the LLM backbone fixed during posterior sampling (Felicioni et al., 2024).
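For the linear-Gaussian last layer, the Laplace posterior is available in closed form. The following per-arm sketch (variable names are assumptions) shows the precision computation and the sampling step:

```python
import numpy as np

def last_layer_laplace(Phi, y, sigma_obs=1.0, sigma_prior=1.0):
    """Gaussian posterior over one arm's last-layer weights.

    For a linear-Gaussian model, the Hessian of the negative log posterior
    is exact: prior precision plus Phi^T Phi / sigma_obs^2, so the Laplace
    approximation coincides with the true posterior.
    """
    d = Phi.shape[1]
    precision = np.eye(d) / sigma_prior**2 + Phi.T @ Phi / sigma_obs**2
    cov = np.linalg.inv(precision)
    mean = cov @ (Phi.T @ y) / sigma_obs**2
    return mean, cov

def sample_weights(mean, cov, rng):
    """One posterior draw, as consumed by Thompson Sampling."""
    return rng.multivariate_normal(mean, cov)
```

The nonlinearity of the full LLM is what makes the general case approximate; restricting to the last layer recovers this exact conjugate computation at negligible cost.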

3.2 Monte Carlo Dropout

Applying dropout at inference is interpreted as sampling from a variational posterior $q(\theta)$:

  • At each Thompson Sampling decision, sample a dropout mask $m \sim \mathrm{Bernoulli}(1 - p)$ across the network, yielding a thinned parameter $\tilde{\theta} = \theta \odot m$.
  • Estimate predictive uncertainty by aggregating Monte Carlo samples:

$$\hat{\mathbb{E}}[r \mid x, a] = \frac{1}{M} \sum_{i=1}^{M} f_{\tilde{\theta}_i}(x, a), \qquad \tilde{\theta}_i \sim q(\theta).$$

This mechanism implements efficient, scalable uncertainty quantification with minimal modifications to pre-trained LLMs (Felicioni et al., 2024).
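A minimal sketch of MC Dropout on a two-layer head. The network and mask placement here are illustrative assumptions; in the paper, dropout acts inside the LLM itself:

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, p=0.1, n_samples=20, rng=None):
    """Keep dropout active at inference: each forward pass draws a fresh
    Bernoulli keep-mask over the hidden units, i.e. one sample of a thinned
    parameter; the spread across samples estimates epistemic uncertainty."""
    rng = rng if rng is not None else np.random.default_rng(0)
    preds = []
    for _ in range(n_samples):
        h = np.maximum(W1 @ x, 0.0)                # ReLU hidden layer
        mask = rng.random(h.shape) >= p            # keep each unit w.p. 1 - p
        preds.append(W2 @ (h * mask / (1.0 - p)))  # inverted-dropout scaling
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)
```

For Thompson Sampling, a single forward pass (one mask) is used per decision rather than the averaged prediction; averaging is shown here only to expose the predictive spread.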

3.3 Epistemic Neural Networks (Epinets)

Epinets integrate an auxiliary network $\sigma_\eta$ that takes as input the extracted base features and a random "epistemic index" $z \sim \mathcal{N}(0, I_{d_z})$:

$$f(x, a, z) = \mu_\theta(x, a) + \sigma_\eta(\mathrm{sg}[\phi(x)], a, z),$$

where $\mathrm{sg}[\cdot]$ denotes stop-gradient. The training objective is:

$$\mathcal{L}(\theta, \eta) = \sum_{(x, a, r) \in \mathcal{D}} \left( f(x, a, z) - r \right)^2 + \lambda \|\theta - \theta_0\|_2^2,$$

with $z$ re-sampled per data point. Inference samples $z$ to mimic Thompson Sampling, even in the absence of an explicit Bayesian formulation (Felicioni et al., 2024).
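A structural sketch of an epinet-style prediction, following the general epinet design in which the correction splits into a learnable part and a fixed random "prior" part. Shapes and names are assumptions for illustration:

```python
import numpy as np

def epinet_predict(phi_x, base_head, sigma_learn, sigma_prior, z):
    """Epinet-style output: base point prediction plus an index-dependent
    correction. phi_x is treated as a stop-gradient input during training;
    sigma_learn is trained, while sigma_prior is a fixed random network whose
    contribution keeps predictions diverse across epistemic indices z."""
    inp = np.concatenate([phi_x, z])
    return base_head @ phi_x + (sigma_learn + sigma_prior) @ inp
```

Drawing many values of `z` and inspecting the spread of `epinet_predict` outputs plays the role of posterior sampling: contexts on which the corrections disagree are exactly those with high epistemic uncertainty.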

4. Thompson Sampling with Uncertainty Estimates

Thompson Sampling (TS) with parameter uncertainty is executed per round:

  1. Observe batch $x_t^1, \ldots, x_t^B$.
  2. Sample parameters $\tilde{\theta}_t$ using one of the aforementioned approximations.
  3. For each $b = 1, \ldots, B$, select $a_t^b = \arg\max_{a \in \mathcal{A}} f_{\tilde{\theta}_t}(x_t^b, a)$.
  4. Observe rewards $r_t^1, \ldots, r_t^B$, augment the dataset, and update the posterior.

Each sampled parameter induces one plausible reward model, and acting greedily with respect to it realizes probability matching: actions are chosen roughly in proportion to the posterior probability that they are optimal (Felicioni et al., 2024).
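The four steps above can be sketched end-to-end on a toy linear-Gaussian bandit standing in for the LLM-plus-head model. All names, constants, and the conjugate per-arm update are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, B = 3, 2, 30, 8
true_W = rng.normal(size=(K, d))                 # unknown reward model
prec = [np.eye(d) for _ in range(K)]             # per-arm posterior precision
xty = [np.zeros(d) for _ in range(K)]            # per-arm Phi^T y accumulator

regret = 0.0
for t in range(T):
    X = rng.normal(size=(B, d))                  # 1. observe batch of contexts
    W_tilde = np.stack([                         # 2. sample one reward model
        rng.multivariate_normal(np.linalg.solve(prec[a], xty[a]),
                                np.linalg.inv(prec[a]))
        for a in range(K)])
    acts = (X @ W_tilde.T).argmax(axis=1)        # 3. act greedily on the sample
    rews = (X * true_W[acts]).sum(axis=1) + 0.1 * rng.normal(size=B)
    regret += ((X @ true_W.T).max(axis=1)
               - (X * true_W[acts]).sum(axis=1)).sum()
    for x, a, r in zip(X, acts, rews):           # 4. conjugate posterior update
        prec[a] += np.outer(x, x)
        xty[a] += r * x

avg_regret = regret / (T * B)
```

As the posterior concentrates, the sampled models agree with the true one and per-round regret shrinks; a greedy variant would simply replace step 2 with the posterior mean.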

5. Empirical Evaluation and Key Findings

Empirical evaluation utilizes the "Measuring Hate Speech" dataset (≈136k comments), with each round presenting a batch of $B$ comments. Actions are "publish" or "not-publish": publishing a non-toxic comment is rewarded, publishing a toxic comment is penalized, and not publishing yields a neutral reward.

Protocols employ $T$ rounds and 20 random seeds, with fine-tuning for 50 epochs per round using Adam. Models assessed include:

  • Greedy (no uncertainty estimation)
  • TS with Dropout
  • TS with diagonal Fisher Laplace Approximation ("Diag. LA")
  • TS with last-layer full Laplace Approximation ("Last LA")
  • TS with Epinet

Key findings:

  • All TS variants, regardless of uncertainty estimation strategy, achieve substantially lower average regret than greedy.
  • TS methods exhibit tighter confidence intervals; greedy incurs higher variance, and in some seeds its per-round regret stays nearly constant, indicating a failure to explore.
  • CDF analysis over random seeds shows TS dominates greedy both in worst-case and average performance.
  • Action-selection ratio curves demonstrate greedy's under-exploration relative to TS methods, which achieve more balanced action distribution.

An additional IMDb dataset experiment confirms the qualitative pattern: epistemic uncertainty improves exploration and regret minimization (Felicioni et al., 2024).

6. Theoretical Insights and Practical Implications

Thompson Sampling, when the posterior is exact, attains near-optimal regret bounds, e.g., $\tilde{O}(d^{3/2}\sqrt{T})$ for linear bandits (Agrawal & Goyal, 2017). Even approximate posteriors that capture core epistemic uncertainty can result in significantly improved empirical exploration compared to the greedy policy. Epistemic uncertainty drives exploration toward actions with high reward uncertainty; in contrast, deterministic greedy strategies risk premature convergence on suboptimal arms (Felicioni et al., 2024).

A plausible implication is that scalable, approximate uncertainty estimation can be deployed in LLM-based decision-making tasks with minimal incremental cost and substantial improvement in online performance. Even simple MC-Dropout, without re-tuning dropout probability, is highly competitive. More sophisticated methods (e.g., Laplace Approximation, Epinet) can further enhance robustness, but the primary benefit arises from incorporating any epistemic uncertainty into the decision policy.

7. Conclusions

Fine-tuning LLMs for reward prediction while neglecting epistemic uncertainty leads to unreliable and suboptimal decision-making in contextual bandit settings. Incorporating scalable uncertainty estimates—via Dropout, Laplace Approximation, or Epinet—within Thompson Sampling dramatically reduces regret and ensures more robust exploration. The principal finding is that modeling epistemic uncertainty is not optional, but fundamental for safe and efficient online decision-making in LLM-driven bandit problems (Felicioni et al., 2024).

References (1)
