
Position: Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints (2503.01747v3)

Published 3 Mar 2025 in cs.AI, cs.LG, and stat.ML

Abstract: Rigorous statistical evaluations of LLMs, including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios. We provide a simple Python library for these Bayesian methods at https://github.com/sambowyer/bayes_evals .

The paper addresses the problem of uncertainty quantification in evaluating LLMs on benchmarks. It argues that while methods based on the Central Limit Theorem (CLT) are suitable for large benchmarks (thousands of examples), they are unreliable for smaller, more specialized benchmarks, which are becoming increasingly common. The authors demonstrate that CLT-based methods underestimate uncertainty in small-data settings and propose alternative frequentist and Bayesian methods. A Python library, `bayes_evals`, is provided for implementing the Bayesian methods.

The authors highlight that many current LLM benchmarks contain large evaluation sets, focusing on tasks that LLMs have largely mastered. However, real-world applications often require more targeted benchmarks with high-quality labels, which are more expensive to create and thus involve fewer examples. The paper cites examples such as CUAD, FrontierMath, SWE Bench Verified, MLE Bench, and LiveBench to illustrate this trend. The authors also note that even large benchmarks like Big Bench are often broken down into smaller sub-tasks, violating the independent and identically distributed (IID) assumption required for the CLT.

To demonstrate the limitations of CLT-based confidence intervals, the authors conduct experiments with simulated data, comparing frequentist and Bayesian methods. They generate datasets with known parameter values and measure the coverage of different intervals, i.e., the proportion of times the intervals contain the true value. The results show that CLT-based methods produce unreliable intervals that fail to achieve their target confidence level, especially with smaller datasets.
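
As a concrete illustration of this coverage check, the following sketch draws many synthetic datasets with a known success probability, builds a CLT interval for each, and records how often the interval contains the truth. All names and settings here are illustrative, not the paper's actual experiment code:

```python
import numpy as np
from scipy import stats

def clt_interval(y, alpha=0.05):
    """CLT-based confidence interval for the mean of a sample of 0/1 scores."""
    se = y.std(ddof=1) / np.sqrt(len(y))
    z = stats.norm.ppf(1 - alpha / 2)
    return y.mean() - z * se, y.mean() + z * se

def clt_coverage(n=20, alpha=0.05, trials=10_000, seed=0):
    """Fraction of simulated datasets whose interval contains the true theta."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        theta = rng.beta(1, 1)               # true success probability
        y = rng.binomial(1, theta, size=n)   # simulated eval outcomes
        lo, hi = clt_interval(y, alpha)
        hits += lo <= theta <= hi
    return hits / trials

print(clt_coverage(n=20))  # typically noticeably below the nominal 0.95
```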

The paper then discusses the Central Limit Theorem and its role in frequentist uncertainty quantification. The CLT states that for a sufficiently large sample size $N$, the sampling distribution of the sample mean will be approximately normal, regardless of the distribution of the original population. Formally, if $X_1, \dots, X_N$ are IID random variables with mean $\mu$ and finite variance $\sigma^2$, then

$\sqrt{N} (\hat{\mu} - \mu) \xrightarrow{d} \mathcal{N}\left( 0, \sigma^2 \right)$ as $N \rightarrow \infty$,

where $\hat{\mu} = \frac{1}{N}\sum_{i=1}^N X_i$ is the sample mean.

  • $\hat{\mu}$ is the sample mean
  • $\mu$ is the population mean
  • $N$ is the sample size
  • $\sigma^2$ is the population variance

In practice, the population variance $\sigma^2$ is rarely known and is estimated empirically by the sample variance, $S^2 = \frac{1}{N-1}\sum_{i=1}^N (X_i - \hat{\mu})^2$. The standard error (SE) of the sample mean is given by $\text{SE}(\hat{\mu}) = \sqrt{S^2 / N}$.

For a desired confidence level $1-\alpha$, the general form of a two-sided CLT-based confidence interval for the parameter of interest $\mu$ is written as

$\text{CI}_{1-\alpha}(\mu) = \hat{\mu} \pm z_{\alpha/2}\,\text{SE}(\hat{\mu})$,

  • $\text{CI}_{1-\alpha}(\mu)$ is the confidence interval for $\mu$ with confidence level $1-\alpha$
  • $\hat{\mu}$ is the sample mean
  • $z_{\alpha/2}$ is the critical value of the standard normal distribution satisfying $P(Z > z_{\alpha/2}) = \alpha/2$, i.e., its $(1-\alpha/2)$-th quantile
  • $\text{SE}(\hat{\mu})$ is the standard error of the sample mean

When comparing two independent samples with sizes $N_A$ and $N_B$ and sample variances $S_A^2$ and $S_B^2$, the standard error of the difference in sample means is $\text{SE}(\hat{\mu}_A - \hat{\mu}_B) = \sqrt{S_A^2 / N_A + S_B^2 / N_B}$.

For paired samples, where each observation in sample $A$ is matched with one in sample $B$, the CLT applies directly to the differences $D_i = X_{A,i} - X_{B,i}$, which are treated as a single sample.
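
A minimal sketch of these three CLT-based intervals (single mean, independent difference, and paired difference), assuming per-question scores stored as NumPy arrays; the function names are illustrative:

```python
import numpy as np
from scipy import stats

def clt_ci(mean, se, alpha=0.05):
    """Two-sided CLT interval: mean +/- z * SE."""
    z = stats.norm.ppf(1 - alpha / 2)
    return mean - z * se, mean + z * se

def mean_ci(y, alpha=0.05):
    """Interval for a single model's average score."""
    se = y.std(ddof=1) / np.sqrt(len(y))
    return clt_ci(y.mean(), se, alpha)

def independent_diff_ci(y_a, y_b, alpha=0.05):
    """Interval for the difference in means of two independent samples."""
    se = np.sqrt(y_a.var(ddof=1) / len(y_a) + y_b.var(ddof=1) / len(y_b))
    return clt_ci(y_a.mean() - y_b.mean(), se, alpha)

def paired_diff_ci(y_a, y_b, alpha=0.05):
    """Paired samples: apply the single-sample interval to the differences D_i."""
    return mean_ci(y_a - y_b, alpha)
```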

The paper provides several examples of the failures of CLT-based confidence intervals in LLM evaluations. First, it shows that CLT-based confidence intervals fail in the IID questions setting. It is common to assume that benchmarks contain IID questions, so that each eval outcome is a Bernoulli trial with some underlying probability of success $\theta$. However, as the empirical mean $\hat{\theta}$ approaches 0 or 1, the width of the confidence interval shrinks towards zero, incorrectly suggesting certainty.

The authors evaluate the coverage of different types of intervals by generating data from

$\theta \sim \text{Beta}(1, 1)$,

$y_i \sim \text{Bernoulli}(\theta)$ for $i=1,\dots,N$.

They find that both CLT-based and bootstrap confidence intervals show poor calibration, with actual coverage well below the nominal $1-\alpha$ level. In contrast, Bayesian methods based on a Beta-Bernoulli model and the frequentist Wilson score interval perform well. The posterior is available in closed form:

$P(\theta \mid y_{1:N}) = \text{Beta}\!\left(1+\sum_{i=1}^N y_i,\; 1 + \sum_{i=1}^N (1-y_i)\right)$.
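
Both recommended alternatives in this IID setting take only a few lines. The sketch below is illustrative (it is not the `bayes_evals` API) and computes the Wilson score interval and the equal-tailed credible interval from the Beta posterior above:

```python
import numpy as np
from scipy import stats

def wilson_interval(successes, n, alpha=0.05):
    """Frequentist Wilson score interval for a Bernoulli success probability."""
    z = stats.norm.ppf(1 - alpha / 2)
    p_hat = successes / n
    centre = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

def beta_credible_interval(successes, n, alpha=0.05):
    """Equal-tailed credible interval from the Beta(1 + s, 1 + n - s) posterior."""
    posterior = stats.beta(1 + successes, 1 + n - successes)
    return posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2)

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # toy eval outcomes
print(wilson_interval(y.sum(), len(y)))
print(beta_credible_interval(y.sum(), len(y)))
```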

Then, the paper discusses the failure of CLT-based confidence intervals in the clustered questions setting, such as reading comprehension benchmarks with multiple questions about a single passage of text. To address this, the authors suggest using clustered standard errors, a post-hoc adjustment that accounts for the correlation among the $T$ clusters of $N_t$ questions each.

To assess the effectiveness of this approach, the authors use data from the following generative model:

$d \sim \text{Gamma}(1, 1)$,

$\theta \sim \text{Beta}(1, 1)$,

$\theta_t \sim \text{Beta}(d \theta, d (1-\theta))$,

$y_{i,t} \sim \text{Bernoulli}(\theta_t)$.

  • $d$ controls the range of difficulties of the tasks or clusters
  • $\theta$ is the global performance of the model
  • $\theta_t$ is the performance on a given task

They perform inference using importance sampling with the prior as proposal. The results show that only the Bayesian method based on a clustered model achieves the right coverage across different sample sizes.
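
A rough, hand-rolled sketch of this importance-sampling scheme (the paper's actual implementation may differ): draw $(d, \theta, \theta_t)$ from the prior, weight each draw by the Bernoulli likelihood of the observed per-cluster outcomes, and read off a weighted credible interval for $\theta$.

```python
import numpy as np

def clustered_posterior_theta(y_clusters, n_samples=100_000, alpha=0.05, seed=0):
    """Importance sampling for the global performance theta in the clustered model.

    y_clusters: list of 1-D arrays, one array of 0/1 outcomes per cluster.
    Proposal = prior: d ~ Gamma(1,1), theta ~ Beta(1,1), theta_t ~ Beta(d*theta, d*(1-theta)).
    """
    rng = np.random.default_rng(seed)

    d = rng.gamma(1.0, 1.0, size=n_samples)
    theta = rng.beta(1.0, 1.0, size=n_samples)
    log_w = np.zeros(n_samples)
    for y_t in y_clusters:
        # Per-cluster success probability drawn from the prior given (d, theta).
        theta_t = np.clip(rng.beta(d * theta, d * (1.0 - theta)), 1e-12, 1 - 1e-12)
        k, n = y_t.sum(), len(y_t)
        # Bernoulli log-likelihood of this cluster's outcomes.
        log_w += k * np.log(theta_t) + (n - k) * np.log1p(-theta_t)

    # Self-normalised importance weights.
    w = np.exp(log_w - log_w.max())
    w /= w.sum()

    # Weighted equal-tailed credible interval for theta.
    order = np.argsort(theta)
    cdf = np.cumsum(w[order])
    lo = theta[order][np.searchsorted(cdf, alpha / 2)]
    hi = theta[order][np.searchsorted(cdf, 1 - alpha / 2)]
    return lo, hi
```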

The paper also examines the failure of CLT-based confidence intervals in the independent model comparison setting. They consider two LLMs, $A$ and $B$, with true probabilities of success $\theta_A$ and $\theta_B$. They construct confidence intervals on the difference in performances, $\operatorname{Diff} = \theta_A - \theta_B$, and the odds ratio, $\operatorname{OR} = \frac{\theta_A / (1 - \theta_A)}{\theta_B / (1 - \theta_B)}$. The CLT can only construct a confidence interval on the difference in performances, since the odds ratio is a non-linear transformation of the parameters. The results show that the CLT-based interval has coverage far below the target level when $N$ is small. For the odds ratio, the standard frequentist method involves inverting Fisher's exact test (FET), which tends to be overly conservative for small $N$. The Bayesian approach gives credible intervals for both the difference and the odds ratio.

The authors highlight a key benefit of the Bayesian approach to model comparison: the ability to compute the probability that one model outperforms the other. Given $K$ samples $(\theta_A^{(k)}, \theta_B^{(k)})$ from the posterior, this probability is estimated as

$\mathbb{P}(\theta_A > \theta_B \mid y_{A,1:N}, y_{B,1:N}) \approx \frac{1}{K} \sum_{k=1}^K \mathbb{1}[\theta_A^{(k)} > \theta_B^{(k)}]$.
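
With independent Beta(1,1) priors, as in the IID setting above, this probability (and the credible intervals for the difference and odds ratio) can be estimated directly from posterior samples. A minimal sketch with illustrative names, not the `bayes_evals` API:

```python
import numpy as np

def prob_a_beats_b(y_a, y_b, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(theta_A > theta_B) under independent Beta(1,1) priors."""
    rng = np.random.default_rng(seed)
    theta_a = rng.beta(1 + y_a.sum(), 1 + len(y_a) - y_a.sum(), size=n_samples)
    theta_b = rng.beta(1 + y_b.sum(), 1 + len(y_b) - y_b.sum(), size=n_samples)
    return np.mean(theta_a > theta_b)

# Credible intervals for the difference and odds ratio come from the same samples, e.g.:
# np.quantile(theta_a - theta_b, [0.025, 0.975])
# np.quantile((theta_a / (1 - theta_a)) / (theta_b / (1 - theta_b)), [0.025, 0.975])
```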

The paper also addresses the failure of CLT-based confidence intervals in paired model comparison settings, where two models have been evaluated on the same set of questions. To simulate paired evals data, the authors sample probabilities of success for each model, along with a correlation $\rho$, and then sample $N$ points from a bivariate Gaussian:

$(a_1,b_1), \ldots, (a_N, b_N) \sim \mathcal{N}\!\left(\begin{pmatrix} \Phi^{-1}(\theta_A) \\ \Phi^{-1}(\theta_B)\end{pmatrix}, \begin{pmatrix}1 & \rho \\ \rho & 1\end{pmatrix} \right)$,

where $\Phi(\cdot)$ is the standard univariate Gaussian cumulative distribution function (CDF).

The authors perform Bayesian inference on the posterior distribution of $\theta_A - \theta_B$ using importance sampling. The results show that all non-Bayesian methods fall far short of nominal coverage for small $N$.
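
The sketch below illustrates one way such an importance sampler could look, under assumptions not spelled out in the summary: binary outcomes are obtained by thresholding the latent bivariate Gaussian at zero, with $\theta_A, \theta_B \sim \text{Beta}(1,1)$ and a uniform prior on $\rho$. It is slow but simple, and intended only as an illustration:

```python
import numpy as np
from scipy import stats

def paired_diff_interval(n11, n10, n01, n00, n_samples=10_000, alpha=0.05, seed=0):
    """Importance sampling (prior as proposal) for theta_A - theta_B from paired evals.

    n11, n10, n01, n00: counts of questions on which (model A, model B) scored
    (1,1), (1,0), (0,1), (0,0). Illustrative priors: theta_A, theta_B ~ Beta(1,1),
    rho ~ Uniform(-0.99, 0.99) (bounded away from +/-1 to keep the covariance valid).
    """
    rng = np.random.default_rng(seed)
    theta_a = rng.beta(1, 1, size=n_samples)
    theta_b = rng.beta(1, 1, size=n_samples)
    rho = rng.uniform(-0.99, 0.99, size=n_samples)

    log_w = np.empty(n_samples)
    for k in range(n_samples):
        mean = np.array([stats.norm.ppf(theta_a[k]), stats.norm.ppf(theta_b[k])])
        cov = np.array([[1.0, rho[k]], [rho[k], 1.0]])
        # Orthant probabilities of the latent Gaussian give the four cell probabilities.
        p00 = stats.multivariate_normal.cdf([0.0, 0.0], mean=mean, cov=cov)
        p01 = (1 - theta_a[k]) - p00   # A wrong, B right
        p10 = (1 - theta_b[k]) - p00   # A right, B wrong
        p11 = 1 - p00 - p01 - p10
        p = np.clip([p11, p10, p01, p00], 1e-12, 1.0)
        log_w[k] = (n11 * np.log(p[0]) + n10 * np.log(p[1])
                    + n01 * np.log(p[2]) + n00 * np.log(p[3]))

    # Self-normalised weights and a weighted equal-tailed interval for the difference.
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    diff = theta_a - theta_b
    order = np.argsort(diff)
    cdf = np.cumsum(w[order])
    return (diff[order][np.searchsorted(cdf, alpha / 2)],
            diff[order][np.searchsorted(cdf, 1 - alpha / 2)])
```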

Finally, the paper discusses the failure of CLT-based confidence intervals when metrics are not averages of IID variables. Many metrics for LLM evaluations, such as $F_\beta$-scores, the Matthews correlation coefficient (MCC), or the G-score, are non-linear in the parameters, so the CLT is not applicable. As an example, consider the $F_1$ score:

$F_1 = 2\,\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$,

where $\text{precision} = \frac{N_\text{TP}}{N_\text{TP} + N_\text{FP}}$ and $\text{recall} = \frac{N_\text{TP}}{N_\text{TP} + N_\text{FN}}$.

The authors simulate an evaluation dataset by sampling ground truth parameters from a uniform Dirichlet prior, which is conjugate to the categorical likelihood:

$\bm{\theta} \sim \text{Dirichlet}(1, 1, 1, 1)$,

$y_i \sim \text{Categorical}(\bm{\theta})$,

$\bm{\theta} \mid y_{1:N} \sim \text{Dirichlet}(\mathbf{1} + N_\text{conf})$.

  • $\bm{\theta}$ is the parameter vector
  • $y_i$ is the outcome of the $i$-th trial
  • $N_\text{conf}$ is the vector of counts in the confusion matrix

The authors compare Bayesian credible intervals against the bootstrap and find that the Bayesian intervals closely track the nominal coverage, while the bootstrap ones systematically under-cover.
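
A small sketch of this Dirichlet-posterior approach for the $F_1$ score (illustrative names, not the library's API): sample cell probabilities from the posterior, map each sample to an $F_1$ value, and take quantiles.

```python
import numpy as np

def f1_credible_interval(n_tp, n_fp, n_fn, n_tn,
                         n_samples=100_000, alpha=0.05, seed=0):
    """Credible interval for F1 under a Dirichlet(1,1,1,1) prior on the confusion-matrix cells."""
    rng = np.random.default_rng(seed)
    # Posterior over cell probabilities: Dirichlet(1 + counts).
    probs = rng.dirichlet([1 + n_tp, 1 + n_fp, 1 + n_fn, 1 + n_tn], size=n_samples)
    p_tp, p_fp, p_fn, _ = probs.T
    precision = p_tp / (p_tp + p_fp)
    recall = p_tp / (p_tp + p_fn)
    f1 = 2 * precision * recall / (precision + recall)
    return np.quantile(f1, [alpha / 2, 1 - alpha / 2])

print(f1_credible_interval(n_tp=12, n_fp=3, n_fn=5, n_tn=30))
```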

In conclusion, the paper recommends against using the CLT to construct confidence intervals for LLM evaluations on small benchmarks, arguing that its assumptions are rarely satisfied in these settings, and suggests more appropriate frequentist methods and Bayesian credible intervals as reliable alternatives.

Authors (3)
  1. Sam Bowyer (3 papers)
  2. Laurence Aitchison (66 papers)
  3. Desi R. Ivanova (8 papers)