The paper "Enough Coin Flips Can Make LLMs Act Bayesian" explores the capacity of LLMs to perform structured reasoning consistent with a Bayesian framework through in-context learning (ICL). The paper uses biased coin flips as a controlled setting to assess whether LLMs update their priors in a Bayesian manner or rely on pattern matching.
The authors investigate the following:
- Whether LLMs possess biased priors.
- How in-context evidence and explicit bias instructions are utilized.
- Whether LLMs follow Bayesian posterior updates.
- The effect of attention magnitude on Bayesian inference.
The key findings include:
- LLMs often exhibit biased priors, leading to initial divergence in zero-shot settings.
- In-context evidence outweighs explicit bias instructions.
- LLMs broadly follow Bayesian posterior updates, with deviations primarily due to miscalibrated priors.
- Attention magnitude has a negligible effect on Bayesian inference.
The authors introduce the idea of modeling coin flips as a Bayesian process, where the probability of obtaining heads, $\theta$, is treated as a random variable. The likelihood of the observed data $D$, given $h$ heads in $n$ independent coin flips, follows a binomial distribution:

$$P(D \mid \theta) = \binom{n}{h}\, \theta^{h}\, (1 - \theta)^{n - h}$$

- $P(D \mid \theta)$ is the likelihood of the observed data given $\theta$
- $\theta$ is the probability of obtaining heads
- $h$ is the number of heads observed
- $n$ is the number of coin flips
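As a quick sanity check of this formula, here is a minimal sketch in Python; the values of `h`, `n`, and `theta` are illustrative choices, not values from the paper:

```python
from scipy.stats import binom

# Likelihood of observing h = 7 heads in n = 10 flips, given theta = 0.5.
# binom.pmf evaluates C(n, h) * theta**h * (1 - theta)**(n - h).
h, n, theta = 7, 10, 0.5
likelihood = binom.pmf(h, n, theta)
print(f"P(D | theta={theta}) = {likelihood:.4f}")  # ~0.1172
```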
A Beta distribution is used as the prior for $\theta$:

$$P(\theta) = \frac{\theta^{\alpha - 1}\, (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}$$

- $P(\theta)$ is the prior distribution of $\theta$
- $\alpha$ and $\beta$ are the parameters of the Beta distribution
- $B(\alpha, \beta)$ is the Beta function
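The prior density can likewise be evaluated directly; a minimal sketch, with arbitrary illustrative parameters (not the paper's):

```python
from scipy.stats import beta

# Beta(alpha, beta) prior density over theta; alpha = beta = 2 encodes a
# mild prior belief that the coin is close to fair.
alpha, b = 2.0, 2.0
for theta in (0.1, 0.5, 0.9):
    print(f"P(theta={theta}) = {beta.pdf(theta, alpha, b):.4f}")
```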
The posterior distribution is proportional to the product of the likelihood and the prior:

$$P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta) \propto \theta^{h + \alpha - 1}\, (1 - \theta)^{n - h + \beta - 1}$$

- $P(\theta \mid D)$ is the posterior distribution
- $P(D \mid \theta)$ is the likelihood
- $P(\theta)$ is the prior distribution
- $h$ is the number of heads observed
- $n$ is the number of coin flips
- $\alpha$ and $\beta$ are the parameters of the Beta distribution
The posterior distribution for $\theta$ is therefore also a Beta distribution:

$$\theta \mid D \sim \mathrm{Beta}(\alpha + h,\; \beta + n - h)$$

- $P(\theta \mid D)$ is the posterior distribution of $\theta$ given the data
- $\alpha$ and $\beta$ are the parameters of the Beta distribution
- $h$ is the number of heads observed
- $n$ is the number of coin flips
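Putting the pieces together, a minimal sketch of the conjugate update; the prior parameters and flip counts are again illustrative choices:

```python
from scipy.stats import beta

# Conjugate Beta-Binomial update: prior Beta(alpha, beta_) plus h heads
# in n flips yields the posterior Beta(alpha + h, beta_ + n - h).
alpha, beta_ = 2.0, 2.0   # illustrative prior
h, n = 7, 10              # illustrative data: 7 heads in 10 flips

posterior = beta(alpha + h, beta_ + n - h)
print(f"Posterior: Beta({alpha + h}, {beta_ + n - h})")
print(f"Posterior mean: {posterior.mean():.4f}")  # (alpha + h) / (alpha + beta_ + n)
```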
The authors measure the total variation distance (TVD) between the true posterior distribution $P$ and the normalized model distribution $Q$ over the support $\mathcal{X}$:

$$\mathrm{TVD}(P, Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} \left| P(x) - Q(x) \right|$$

- $\mathrm{TVD}(P, Q)$ is the total variation distance
- $P$ is the true posterior distribution
- $Q$ is the normalized model distribution
- $\mathcal{X}$ is the support set of possible outcomes
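A minimal sketch of the TVD computation over a discrete support; the two distributions below are made-up examples over the outcomes {heads, tails}:

```python
def total_variation_distance(p, q):
    """TVD(P, Q) = 0.5 * sum_x |P(x) - Q(x)| over a shared discrete support."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)

# Illustrative values: a true posterior vs. a normalized model distribution.
true_posterior = {"heads": 0.7, "tails": 0.3}
model_dist = {"heads": 0.6, "tails": 0.4}
print(total_variation_distance(true_posterior, model_dist))  # 0.1
```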
The authors evaluated models such as Gemma-2, Phi-2/Phi-3.5 (mini), Llama-3.1 (8B), Mistral 7B, and OLMoE (7B), alongside their instruction-tuned variants, and the Pythia scaling suite.
The paper found that all evaluated LLMs exhibit a heads-biased prior for $\theta$. Non-instruct LLMs often ignore explicit biasing instructions, while instruct models show slightly better performance, especially at extreme bias values (0% and 100% heads bias). Scaling the model size does not substantially change the LLMs' priors or improve performance in modeling the expected biased distributions.
The paper then investigates how LLMs incorporate in-context evidence into their posteriors. The results show that models converge to the expected distribution as more evidence is provided in context. Model scale has a negligible impact on these ICL benefits; models of different sizes perform comparably.
To explore whether models perform pure Bayesian updates, the authors define a generative process of 100 steps, where the first 50 samples are drawn $x_t \sim \mathrm{Bernoulli}(\theta_1)$ and the second 50 samples are drawn $x_t \sim \mathrm{Bernoulli}(\theta_2)$, with $\theta_1 \neq \theta_2$. They introduce an exponential decay factor $\gamma$ to modify the updates:

$$\alpha_{t+1} = \gamma\, \alpha_t + \mathbb{1}[x_t = \mathrm{H}], \qquad \beta_{t+1} = \gamma\, \beta_t + \mathbb{1}[x_t = \mathrm{T}]$$

- $\alpha$ and $\beta$ are the parameters of the Beta distribution
- $\gamma$ is the exponential decay factor
- $\mathbb{1}[x_t = \mathrm{H}]$ equals 1 when the latest flip is heads
- $\mathbb{1}[x_t = \mathrm{T}]$ equals 1 when the latest flip is tails
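A minimal sketch of this decayed update, assuming $\gamma$ discounts both parameters before the indicator for the latest flip is added:

```python
def decayed_update(alpha, beta_, flip, gamma):
    """One step of the exponentially decayed Beta update.

    flip: 1 for heads, 0 for tails; gamma: decay factor in (0, 1].
    gamma = 1 recovers the standard conjugate update.
    """
    alpha = gamma * alpha + (1 if flip == 1 else 0)
    beta_ = gamma * beta_ + (1 if flip == 0 else 0)
    return alpha, beta_
```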
The posterior mean is:

$$\mathbb{E}[\theta \mid D] = \frac{\alpha}{\alpha + \beta}$$

- $\mathbb{E}[\theta \mid D]$ is the posterior mean
- $\alpha$ and $\beta$ are the parameters of the Beta distribution
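To illustrate how $\gamma$ shortens the effective time horizon, the `decayed_update` helper from the previous sketch can be run over the 100-step regime-switching process described above; $\theta_1$, $\theta_2$, $\gamma$, and the prior below are illustrative choices, not values from the paper:

```python
import random

random.seed(0)
theta1, theta2 = 0.8, 0.2            # illustrative regime parameters
flips = [int(random.random() < theta1) for _ in range(50)] + \
        [int(random.random() < theta2) for _ in range(50)]

alpha, beta_, gamma = 1.0, 1.0, 0.9  # illustrative prior and decay factor
for t, flip in enumerate(flips, start=1):
    alpha, beta_ = decayed_update(alpha, beta_, flip, gamma)
    if t % 25 == 0:
        # With gamma < 1, the mean tracks the recent regime rather than
        # the全-history average of all flips seen so far.
        print(f"step {t:3d}: posterior mean = {alpha / (alpha + beta_):.3f}")
```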
The paper found that using a $\gamma < 1$, implying a shortened time horizon, fits the observed behavior almost perfectly in some cases, empirically suggesting that models perform local Bayesian updates with a slight discount factor. The fitted value of $\gamma$ differs notably across models, suggesting architecture-specific time-horizon behavior. Instruction-tuned models generally have much lower $\gamma$ values than their non-instruction-tuned counterparts, implying they are more local when performing ICL.
The paper found that the amount of attention paid to any given segment is largely uncorrelated with the quality of the final point estimate: models attend relatively more to data that is likely to change the final distribution, but the total attention magnitude a segment receives does not dramatically affect model quality.
The authors conclude that although LLMs typically begin with biases that deviate from real-world frequencies, they can approximate Bayesian updating once they see enough in-context evidence. This suggests that the primary limitation in simulating stochastic processes arises from poor priors, not from a failure of ICL itself. Accurate simulation in complex domains relies heavily on well-calibrated base assumptions, and without calibration or sufficient prompting, LLMs may misrepresent even simple coin-flip dynamics.