
Enough Coin Flips Can Make LLMs Act Bayesian (2503.04722v1)

Published 6 Mar 2025 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs exhibit the ability to generalize given few-shot examples in their input prompt, an emergent capability known as in-context learning (ICL). We investigate whether LLMs utilize ICL to perform structured reasoning in ways that are consistent with a Bayesian framework or rely on pattern matching. Using a controlled setting of biased coin flips, we find that: (1) LLMs often possess biased priors, causing initial divergence in zero-shot settings, (2) in-context evidence outweighs explicit bias instructions, (3) LLMs broadly follow Bayesian posterior updates, with deviations primarily due to miscalibrated priors rather than flawed updates, and (4) attention magnitude has negligible effect on Bayesian inference. With sufficient demonstrations of biased coin flips via ICL, LLMs update their priors in a Bayesian manner.

The paper "Enough Coin Flips Can Make LLMs Act Bayesian" explores the capacity of LLMs to perform structured reasoning consistent with a Bayesian framework through in-context learning (ICL). The paper uses biased coin flips as a controlled setting to assess whether LLMs update their priors in a Bayesian manner or rely on pattern matching.

The authors investigate the following:

  • Whether LLMs possess biased priors.
  • How in-context evidence and explicit bias instructions are utilized.
  • If LLMs follow Bayesian posterior updates.
  • The effect of attention magnitude on Bayesian inference.

The key findings include:

  1. LLMs often exhibit biased priors, leading to initial divergence in zero-shot settings.
  2. In-context evidence outweighs explicit bias instructions.
  3. LLMs broadly follow Bayesian posterior updates, with deviations primarily due to miscalibrated priors.
  4. Attention magnitude has a negligible effect on Bayesian inference.

The authors model coin flips as a Bayesian process in which the probability of obtaining heads, $\theta$, is treated as a random variable. The likelihood of the observed data $D$, given $k$ heads in $n$ independent coin flips, is:

p(Dθ)=θk(1θ)nkp(D|\theta) = \theta^k (1-\theta)^{n-k}

  • $p(D \mid \theta)$ is the likelihood of the observed data given $\theta$
  • $\theta$ is the probability of obtaining heads
  • $k$ is the number of heads observed
  • $n$ is the number of coin flips

A Beta distribution is used as the prior distribution for $\theta$:

$$p(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}$$

  • $p(\theta)$ is the prior distribution of $\theta$
  • $\alpha$ and $\beta$ are the parameters of the Beta distribution
  • $B(\alpha,\beta)$ is the Beta function

The posterior distribution is proportional to the product of the likelihood and the prior:

$$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) \propto \theta^{k} (1-\theta)^{n-k} \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1}$$

  • $p(\theta \mid D)$ is the posterior distribution
  • $p(D \mid \theta)$ is the likelihood
  • $p(\theta)$ is the prior distribution
  • $k$ is the number of heads observed
  • $n$ is the number of coin flips
  • $\alpha$ and $\beta$ are the parameters of the Beta distribution

The posterior distribution for $\theta$ is also a Beta distribution:

$$\theta \mid D \sim \mathrm{Beta}(\alpha + k,\ \beta + n - k)$$

  • $\theta \mid D$ is the posterior distribution of $\theta$ given the data
  • $\alpha$ and $\beta$ are the parameters of the Beta distribution
  • $k$ is the number of heads observed
  • $n$ is the number of coin flips
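
As an illustrative sketch (not taken from the paper), this conjugate update can be computed directly; the uniform Beta(1, 1) prior and the example flip sequence below are arbitrary choices:

```python
import numpy as np
from scipy import stats

def beta_binomial_update(alpha, beta, flips):
    """Conjugate Beta-Binomial update: heads increment alpha, tails increment beta."""
    k = sum(1 for f in flips if f == "H")  # number of heads observed
    n = len(flips)                         # total number of flips
    return alpha + k, beta + n - k         # posterior is Beta(alpha + k, beta + n - k)

# Example: a uniform Beta(1, 1) prior updated with 7 heads in 10 flips.
alpha_post, beta_post = beta_binomial_update(1.0, 1.0, list("HHTHHHTHTH"))
posterior = stats.beta(alpha_post, beta_post)
print(posterior.mean())  # posterior mean alpha / (alpha + beta) = 8/12 ≈ 0.667
```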

The authors measure the total variation distance (TVD) between the true posterior distribution $p^{*}(o)$ and the normalized model distribution $\hat{p}_{\mathcal{M}}(o)$ over the support $\Omega$:

$$\delta(p^{*}, \hat{p}_{\mathcal{M}}) = \frac{1}{2}\sum_{o \in \Omega} \left| p^{*}(o) - \hat{p}_{\mathcal{M}}(o) \right|$$

  • $\delta(p^{*}, \hat{p}_{\mathcal{M}})$ is the total variation distance
  • $p^{*}(o)$ is the true posterior distribution
  • $\hat{p}_{\mathcal{M}}(o)$ is the normalized model distribution
  • $\Omega$ is the support set of possible outcomes
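
As a concrete illustration (not from the paper), the metric can be computed as follows; the normalization step assumes the model's raw outcome probabilities may not sum to one:

```python
import numpy as np

def total_variation_distance(p_true, p_model):
    """TVD between two distributions over the same finite support Omega."""
    p_true = np.asarray(p_true, dtype=float)
    p_model = np.asarray(p_model, dtype=float)
    p_model = p_model / p_model.sum()  # normalize the raw model probabilities
    return 0.5 * np.abs(p_true - p_model).sum()

# Example over Omega = {heads, tails}: true posterior vs. unnormalized model scores.
print(total_variation_distance([0.75, 0.25], [0.6, 0.6]))  # 0.25
```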

The authors evaluated models such as Gemma-2, Phi-2/Phi-3.5 (mini), Llama-3.1 (8B), Mistral 7B, and OLMoE (7B), alongside their instruction-tuned variants, as well as the Pythia Scaling Suite.

The paper found that all evaluated LLMs exhibit a heads-biased prior for $\theta$. Non-instruct LLMs often ignore explicit biasing instructions, while instruct models show slightly better performance, especially at extreme bias values (0% and 100% heads bias). Scaling the model size does not substantially change the LLMs' priors or improve performance in modeling the expected biased distributions.

The paper investigates how LLMs incorporate in-context evidence into their posteriors. The results show that models converge to the expected distribution as more evidence is provided via in-context learning. Model scale has a negligible impact on ICL benefits, with models performing comparably across scales.

To explore whether models perform pure Bayesian updates, the authors model a generative process of 100 steps, in which the first 50 samples are drawn $\sim \mathrm{Bernoulli}(\theta_1)$ and the second 50 samples are drawn $\sim \mathrm{Bernoulli}(\theta_2)$, with $\theta_1 = 0.75$ and $\theta_2 = 0.25$. They introduce an exponential decay factor $\gamma$ that modifies the updates:

$$\alpha \leftarrow \gamma \alpha + I(H), \quad \beta \leftarrow \gamma \beta + I(T)$$

  • $\alpha$ and $\beta$ are the parameters of the Beta distribution
  • $\gamma$ is the exponential decay factor
  • $I(H)$ is the indicator that the latest flip is heads
  • $I(T)$ is the indicator that the latest flip is tails

The posterior mean is:

$$\mathbb{E}[p] = \frac{\alpha}{\alpha + \beta}$$

  • $\mathbb{E}[p]$ is the posterior mean
  • $\alpha$ and $\beta$ are the parameters of the Beta distribution
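
A small simulation (illustrative only; $\gamma = 0.9$ and the random seed are arbitrary choices, since the paper fits $\gamma$ per model) shows how the discount shortens the effective time horizon:

```python
import numpy as np

rng = np.random.default_rng(0)

# The paper's generative process: 50 flips at theta_1 = 0.75, then 50 at theta_2 = 0.25.
flips = np.concatenate([rng.random(50) < 0.75, rng.random(50) < 0.25])  # True = heads

def discounted_posterior_means(flips, alpha=1.0, beta=1.0, gamma=0.9):
    """Beta update with exponential decay gamma; gamma = 1 recovers the exact update."""
    means = []
    for heads in flips:
        alpha = gamma * alpha + float(heads)    # alpha <- gamma * alpha + I(H)
        beta = gamma * beta + float(not heads)  # beta  <- gamma * beta  + I(T)
        means.append(alpha / (alpha + beta))    # posterior mean E[p]
    return means

# With gamma < 1, old evidence decays, so the estimate re-converges toward
# theta_2 = 0.25 after the change point instead of averaging both regimes.
means = discounted_posterior_means(flips, gamma=0.9)
print(means[49], means[99])  # near 0.75 before the switch, near 0.25 after
```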

The paper found that using $\gamma < 1$, implying a shortened time horizon, fits the observed behavior almost perfectly in some cases, empirically suggesting that models perform local Bayesian updates with a slight discount factor. The fitted value of $\gamma$ differs notably across models, suggesting architecture-specific time-horizon behavior. Instruction-tuned models generally have much lower $\gamma$ values than their non-instruction-tuned counterparts, implying they are more local when performing ICL.

The paper found that the amount of attention paid to any in-context segment is generally uncorrelated with the overall quality of the point estimate: models pay relatively more attention to data that is more likely to change the final distribution, but the total attention magnitude on each segment does not dramatically impact final model quality.

The authors conclude that although LLMs typically begin with biases that deviate from real-world frequencies, they can approximate Bayesian updating once they see enough in-context evidence. This suggests that the primary limitation in simulating stochastic processes arises from poor priors, not from a failure of ICL itself. Accurate simulation in complex domains relies heavily on well-calibrated base assumptions, and without calibration or sufficient prompting, LLMs may misrepresent even simple coin-flip dynamics.

Authors (7)
  1. Ritwik Gupta
  2. Rodolfo Corona
  3. Jiaxin Ge
  4. Eric Wang
  5. Dan Klein
  6. Trevor Darrell
  7. David M. Chan