Enough Coin Flips Can Make LLMs Act Bayesian (2503.04722v1)

Published 6 Mar 2025 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs exhibit the ability to generalize given few-shot examples in their input prompt, an emergent capability known as in-context learning (ICL). We investigate whether LLMs utilize ICL to perform structured reasoning in ways that are consistent with a Bayesian framework or rely on pattern matching. Using a controlled setting of biased coin flips, we find that: (1) LLMs often possess biased priors, causing initial divergence in zero-shot settings, (2) in-context evidence outweighs explicit bias instructions, (3) LLMs broadly follow Bayesian posterior updates, with deviations primarily due to miscalibrated priors rather than flawed updates, and (4) attention magnitude has negligible effect on Bayesian inference. With sufficient demonstrations of biased coin flips via ICL, LLMs update their priors in a Bayesian manner.


The paper "Enough Coin Flips Can Make LLMs Act Bayesian" explores the capacity of LLMs to perform structured reasoning consistent with a Bayesian framework through in-context learning (ICL). The paper uses biased coin flips as a controlled setting to assess whether LLMs update their priors in a Bayesian manner or rely on pattern matching.

The authors investigate the following:

Whether LLMs possess biased priors.
How in-context evidence and explicit bias instructions are utilized.
If LLMs follow Bayesian posterior updates.
The effect of attention magnitude on Bayesian inference.

The key findings include:

LLMs often exhibit biased priors, leading to initial divergence in zero-shot settings.
In-context evidence outweighs explicit bias instructions.
LLMs broadly follow Bayesian posterior updates, with deviations primarily due to miscalibrated priors.
Attention magnitude has a negligible effect on Bayesian inference.

The authors model coin flips as a Bayesian process in which the probability of obtaining heads, $\theta$, is treated as an unknown random variable. Given $\theta$, the flips are independent Bernoulli trials, so the likelihood of the observed data, $p(D|\theta)$, for a specific sequence containing $k$ heads in $n$ flips is:

$p(D|\theta) = \theta^k (1-\theta)^{n-k}$

$p(D|\theta)$ is the likelihood of the observed data given $\theta$
$\theta$ is the probability of obtaining heads
$k$ is the number of heads observed
$n$ is the number of coin flips

A Beta distribution is used as a prior distribution for $\theta$ :

$p(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}$

$p(\theta)$ is the prior distribution of $\theta$
$\alpha$ and $\beta$ are the parameters of the Beta distribution
$B(\alpha,\beta)$ is the Beta function

The posterior distribution is proportional to the product of the likelihood and the prior:

$p(\theta|D) \propto p(D|\theta) p(\theta) \propto \theta^k (1-\theta)^{n-k} \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1}$

$p(\theta|D)$ is the posterior distribution
$p(D|\theta)$ is the likelihood
$p(\theta)$ is the prior distribution
$k$ is the number of heads observed
$n$ is the number of coin flips
$\alpha$ and $\beta$ are the parameters of the Beta distribution

The posterior distribution for $\theta$ is also a Beta distribution:

$\theta|D \sim \text{Beta}(\alpha+k, \beta+n-k)$

$\theta|D$ is the posterior distribution of $\theta$ given the data
$\alpha$ and $\beta$ are the parameters of the Beta distribution
$k$ is the number of heads observed
$n$ is the number of coin flips
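
As a minimal sketch of the conjugate update above, the posterior parameters can be computed by simple counting. The class name `BetaPosterior`, the uniform $\text{Beta}(1, 1)$ prior, and the example counts (7 heads in 10 flips) are illustrative assumptions, not values taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaPosterior:
    """Beta(alpha, beta) belief over theta, the probability of heads."""

    alpha: float
    beta: float

    def update(self, heads: int, flips: int) -> "BetaPosterior":
        """Return Beta(alpha + k, beta + n - k) after k heads in n flips."""
        if not 0 <= heads <= flips:
            raise ValueError("heads must be between 0 and flips")
        return BetaPosterior(self.alpha + heads, self.beta + flips - heads)

    def mean(self) -> float:
        """Mean of the Beta distribution: alpha / (alpha + beta)."""
        return self.alpha / (self.alpha + self.beta)


# Assumed prior: uniform Beta(1, 1). Assumed data: 7 heads in 10 flips.
prior = BetaPosterior(alpha=1.0, beta=1.0)
posterior = prior.update(heads=7, flips=10)
print(posterior)         # Beta parameters after the update: alpha=8, beta=4
print(posterior.mean())  # 8 / 12, roughly 0.667
```

Because the Beta prior is conjugate to the Bernoulli likelihood, the update needs no numerical integration, which is what makes this setting convenient as a reference for comparing model behavior.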

The authors measure the total variation distance (TVD) between the true posterior distribution $p^*(o)$ and the normalized model distribution $\hat{p}_{\mathcal{M}}(o)$ over the support $\Omega$:

$\delta(p^*,\hat{p}_{\mathcal{M}}) = \frac{1}{2}\sum_{o \in \Omega} \left|p^*(o) - \hat{p}_{\mathcal{M}}(o)\right|$

$\delta(p^*,\hat{p}_{\mathcal{M}})$ is the total variation distance
$p^*(o)$ is the true posterior distribution
$\hat{p}_{\mathcal{M}}(o)$ is the normalized model distribution
$\Omega$ is the support set of possible outcomes
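
The following sketch shows one way the TVD could be computed for a two-outcome support. The helper names (`normalize`, `total_variation_distance`) and the probability values are hypothetical. In particular, the raw model mass is made up for illustration and assumes that some probability falls on tokens outside the support, which is why it is renormalized first.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale non-negative scores over the support so they sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have positive total mass")
    return {outcome: score / total for outcome, score in scores.items()}


def total_variation_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two distributions on a shared support."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in support)


# Hypothetical reference distribution for a coin with 75% heads bias.
reference = {"heads": 0.75, "tails": 0.25}

# Hypothetical raw model probabilities; the remaining 10% of mass is
# assumed to sit on tokens outside the support and is dropped.
raw_model_mass = {"heads": 0.54, "tails": 0.36}
model = normalize(raw_model_mass)

tvd = total_variation_distance(reference, model)
print(f"TVD = {tvd:.3f}")  # approximately 0.150
```

A TVD of 0 means the two distributions match exactly, and 1 means they place all their mass on disjoint outcomes.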

The authors evaluated models such as Gemma-2, Phi-2/Phi-3.5 (mini), Llama-3.1 (8B), Mistral 7B, and OLMoE (7B), alongside their instruction-tuned variants, as well as the Pythia Scaling Suite.

The paper found that all evaluated LLMs exhibit a heads-biased prior for $\theta$ . Non-instruct LLMs often ignore explicit biasing instructions, while instruct models show slightly better performance, especially at extreme bias values (0% and 100% heads biases). Scaling the model size does not substantially change the LLMs' priors or improve performance in modeling expected biased distributions.

The paper investigates how LLMs incorporate in-context evidence into their posteriors. The results show that models converge to the expected distribution as more evidence is provided via in-context learning. Model scale has a negligible impact on ICL benefits, with models performing comparably across scales.

To explore whether models perform pure Bayesian updates, the authors construct a 100-step generative process in which the first 50 samples are drawn from $\text{Bernoulli}(\theta_1)$ and the second 50 samples are drawn from $\text{Bernoulli}(\theta_2)$, where $\theta_1=0.75$ and $\theta_2=0.25$. They introduce an exponential decay factor $\gamma$ to modify the updates:

$\alpha \leftarrow \gamma \alpha + I(H), \quad \beta \leftarrow \gamma \beta + I(T)$

$\alpha$ and $\beta$ are the parameters of the Beta distribution
$\gamma$ is the exponential decay factor
$I(H)$ is an indicator that equals 1 if the latest result is heads and 0 otherwise
$I(T)$ is an indicator that equals 1 if the latest result is tails and 0 otherwise

The posterior mean is:

$\mathbb{E}[p] = \frac{\alpha}{\alpha + \beta}$

$\mathbb{E}[p]$ is the posterior mean
$\alpha$ and $\beta$ are the parameters of the Beta distribution
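
The discounted update can be sketched as below, under stated assumptions: the $\text{Beta}(1, 1)$ starting point, the random seed, the value $\gamma = 0.9$, and the function names are illustrative choices, not the paper's fitted values or code. Setting $\gamma = 1$ recovers the standard Bayesian update.

```python
import random


def discounted_posterior_means(flips, gamma, alpha=1.0, beta=1.0):
    """Apply alpha <- gamma*alpha + I(H), beta <- gamma*beta + I(T) per flip.

    Returns the posterior mean alpha / (alpha + beta) after every step.
    The Beta(1, 1) starting point is an assumption for illustration.
    """
    means = []
    for flip in flips:
        alpha = gamma * alpha + (1.0 if flip == "H" else 0.0)
        beta = gamma * beta + (1.0 if flip == "T" else 0.0)
        means.append(alpha / (alpha + beta))
    return means


def simulate_flips(rng, n_per_phase=50, theta_1=0.75, theta_2=0.25):
    """Draw n_per_phase flips with P(H)=theta_1, then n_per_phase with theta_2."""
    first = ["H" if rng.random() < theta_1 else "T" for _ in range(n_per_phase)]
    second = ["H" if rng.random() < theta_2 else "T" for _ in range(n_per_phase)]
    return first + second


rng = random.Random(0)  # fixed seed so the run is repeatable
flips = simulate_flips(rng)

for gamma in (1.0, 0.9):  # 0.9 is an arbitrary example of gamma < 1
    means = discounted_posterior_means(flips, gamma)
    print(f"gamma={gamma}: step 50 mean={means[49]:.3f}, step 100 mean={means[99]:.3f}")
```

With $\gamma = 1$ every flip counts equally, so after the switch the estimate moves only slowly away from the first phase. With $\gamma < 1$ older flips are down-weighted geometrically, so the estimate tracks the second-phase bias more quickly.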

The paper found that using $\gamma < 1$, which implies a shortened time horizon, fits the behavior almost perfectly in some cases, empirically suggesting that models are performing local Bayesian updates with a slight discount factor. The value of $\gamma$ is notably different for each model, suggesting that models have architecture-specific time-horizon behavior. Instruction-tuned models generally have much lower $\gamma$ values than their non-instruction-tuned counterparts, implying that they weight recent evidence more heavily (that is, they are more local) when performing ICL.

The paper found that the amount of attention paid to any segment of the context is generally uncorrelated with the quality of the final point estimate, suggesting that the total magnitude of attention per segment does not dramatically affect model quality. Models do pay relatively more attention to data that is more likely to change the final distribution, but paying more or less attention is not strongly correlated with final model quality.

The authors conclude that although LLMs typically begin with biases that deviate from real-world frequencies, they can approximate Bayesian updating once they see enough in-context evidence. This suggests that the primary limitation in simulating stochastic processes arises from poor priors, not from a failure of ICL itself. Accurate simulation in complex domains relies heavily on well-calibrated base assumptions, and without calibration or sufficient prompting, LLMs may misrepresent even simple coin-flip dynamics.


Authors (7)

Ritwik Gupta
Rodolfo Corona
Jiaxin Ge
Eric Wang
Dan Klein
Trevor Darrell
David M. Chan
