Position: Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints
(2503.01747v3)
Published 3 Mar 2025 in cs.AI, cs.LG, and stat.ML
Abstract: Rigorous statistical evaluations of LLMs, including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios. We provide a simple Python library for these Bayesian methods at https://github.com/sambowyer/bayes_evals .
The paper addresses the problem of uncertainty quantification in evaluating LLMs on benchmarks. It argues that while methods based on the Central Limit Theorem (CLT) are suitable for large benchmarks (thousands of examples), they are unreliable for smaller, more specialized benchmarks, which are becoming increasingly common. The authors demonstrate that CLT-based methods underestimate uncertainty in small-data settings and propose alternative frequentist and Bayesian methods. A Python library, `bayes_evals`, is provided for implementing the Bayesian methods.
The authors highlight that many current LLM benchmarks contain large evaluation sets, focusing on tasks that LLMs have largely mastered. However, real-world applications often require more targeted benchmarks with high-quality labels, which are more expensive to create and thus involve fewer examples. The paper cites examples such as CUAD, FrontierMath, SWE Bench Verified, MLE Bench, and LiveBench to illustrate this trend. The authors also note that even large benchmarks like Big Bench are often broken down into smaller sub-tasks, violating the independent and identically distributed (IID) assumption required for the CLT.
To demonstrate the limitations of CLT-based confidence intervals, the authors conduct experiments with simulated data, comparing frequentist and Bayesian methods. They generate datasets with known parameter values and measure the coverage of different intervals, i.e., the proportion of times the intervals contain the true value. The results show that CLT-based methods produce unreliable intervals that fail to achieve their target confidence level, especially with smaller datasets.
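To make the coverage experiment concrete, here is a minimal sketch (not the authors' code) of how one might check the empirical coverage of a 95% CLT interval on simulated Bernoulli evals of size N:

```python
import numpy as np

rng = np.random.default_rng(0)

def clt_interval(y, z=1.96):
    """Two-sided 95% CLT confidence interval for the mean of y."""
    n = len(y)
    mu_hat = y.mean()
    se = y.std(ddof=1) / np.sqrt(n)
    return mu_hat - z * se, mu_hat + z * se

def coverage(n_questions, n_trials=10_000):
    """Fraction of simulated datasets whose CLT interval contains the true theta."""
    hits = 0
    for _ in range(n_trials):
        theta = rng.uniform()  # true (known) success probability
        y = rng.binomial(1, theta, size=n_questions).astype(float)
        lo, hi = clt_interval(y)
        hits += (lo <= theta <= hi)
    return hits / n_trials

for n in (10, 30, 100, 1000):
    print(f"N={n:4d}  empirical coverage of the 95% CLT interval: {coverage(n):.3f}")
```

For small N the printed coverage falls noticeably below the nominal 0.95, which is precisely the failure mode the paper documents.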
The paper then discusses the Central Limit Theorem and its role in frequentist uncertainty quantification. The CLT states that for a sufficiently large sample size $N$, the sampling distribution of the sample mean will be approximately normal, regardless of the distribution of the original population. Formally, if $X_1, \ldots, X_N$ are IID random variables with mean $\mu$ and finite variance $\sigma^2$, then
$\sqrt{N}(\hat{\mu} - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ as $N \to \infty$,
where $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} X_i$ is the sample mean, $\mu$ is the population mean, $N$ is the sample size, and $\sigma^2$ is the population variance.
In practice, the population variance $\sigma^2$ is rarely known and is estimated empirically by the sample variance, $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \hat{\mu})^2$. The standard error (SE) of the sample mean is then $\text{SE}(\hat{\mu}) = \sqrt{\nicefrac{S^2}{N}}$.
For a desired confidence level $1-\alpha$, the general form of a two-sided CLT-based confidence interval for the parameter of interest $\mu$ is
$\text{CI}_{1-\alpha}(\mu) = \hat{\mu} \pm z_{\nicefrac{\alpha}{2}}\,\text{SE}(\hat{\mu}),$
where $\text{CI}_{1-\alpha}(\mu)$ is the confidence interval for $\mu$ at confidence level $1-\alpha$, $\hat{\mu}$ is the sample mean, $z_{\nicefrac{\alpha}{2}}$ is the $\nicefrac{\alpha}{2}$-th quantile of the standard normal distribution, and $\text{SE}(\hat{\mu})$ is the standard error of the sample mean.
When comparing two independent samples with sizes $N_A$ and $N_B$ and sample variances $S_A^2$ and $S_B^2$, the standard error of the difference in sample means is $\text{SE}(\hat{\mu}_A - \hat{\mu}_B) = \sqrt{\nicefrac{S_A^2}{N_A} + \nicefrac{S_B^2}{N_B}}$.
For paired samples, where each observation in sample A is matched with one in sample B, the CLT applies directly to the differences $D_i = X_{A,i} - X_{B,i}$, which are treated as a single sample.
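As a concrete reference for these formulas, here is a minimal sketch (not taken from the paper or from `bayes_evals`) of the unpaired and paired CLT intervals in Python:

```python
import numpy as np
from scipy.stats import norm

def clt_diff_interval(y_a, y_b, alpha=0.05):
    """Unpaired CLT interval for the difference in means of two independent samples."""
    y_a, y_b = np.asarray(y_a, float), np.asarray(y_b, float)
    diff = y_a.mean() - y_b.mean()
    se = np.sqrt(y_a.var(ddof=1) / len(y_a) + y_b.var(ddof=1) / len(y_b))
    z = norm.ppf(1 - alpha / 2)
    return diff - z * se, diff + z * se

def clt_paired_interval(y_a, y_b, alpha=0.05):
    """Paired CLT interval: apply the one-sample CLT to the per-question differences."""
    d = np.asarray(y_a, float) - np.asarray(y_b, float)
    se = np.sqrt(d.var(ddof=1) / len(d))
    z = norm.ppf(1 - alpha / 2)
    return d.mean() - z * se, d.mean() + z * se

# Toy data: binary outcomes (1 = correct) for two models on 20 questions.
rng = np.random.default_rng(1)
y_a = rng.binomial(1, 0.8, size=20)
y_b = rng.binomial(1, 0.6, size=20)
print(clt_diff_interval(y_a, y_b))
print(clt_paired_interval(y_a, y_b))
```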
The paper provides several examples of the failures of CLT-based confidence intervals in LLM evaluations. First, it shows that CLT-based confidence intervals fail in the IID questions setting. It is common to assume that a benchmark contains IID questions, so that each eval outcome is a Bernoulli trial with some underlying probability of success $\theta$. However, as the empirical mean $\hat{\theta}$ approaches 0 or 1, the width of the CLT confidence interval shrinks to zero, incorrectly suggesting near-certainty.
The authors evaluate the coverage of different types of intervals by generating data from
$\theta \sim \text{Beta}(1, 1)$,
$y_i \sim \text{Bernoulli}(\theta)$ for $i = 1, \ldots, N$.
They find that both CLT-based and bootstrap confidence intervals show poor calibration, with actual coverage well below the nominal $1-\alpha$ level. In contrast, Bayesian credible intervals based on a Beta-Bernoulli model and the frequentist Wilson score interval perform well. The posterior is available in closed form:
$\theta \mid y_{1:N} \sim \text{Beta}\!\left(1 + \textstyle\sum_{i=1}^{N} y_i,\ 1 + N - \sum_{i=1}^{N} y_i\right).$
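For comparison, both better-calibrated intervals are easy to compute; the following is a sketch using SciPy and statsmodels rather than the paper's `bayes_evals` API:

```python
import numpy as np
from scipy.stats import beta
from statsmodels.stats.proportion import proportion_confint

y = np.array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1])  # 9 successes out of N = 10 questions
k, n = int(y.sum()), len(y)

# Naive CLT interval: collapses towards a point as the sample mean approaches 0 or 1.
theta_hat = y.mean()
se = y.std(ddof=1) / np.sqrt(n)
print("CLT:   ", (theta_hat - 1.96 * se, theta_hat + 1.96 * se))

# Frequentist alternative: Wilson score interval.
print("Wilson:", proportion_confint(k, n, alpha=0.05, method="wilson"))

# Bayesian alternative: equal-tailed 95% interval from the Beta(1 + k, 1 + N - k) posterior.
lo, hi = beta(1 + k, 1 + n - k).ppf([0.025, 0.975])
print("Bayes: ", (lo, hi))
```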
Then, the paper discusses the failure of CLT-based confidence intervals in the clustered questions setting, such as reading comprehension benchmarks with multiple questions about a single passage of text. To address this, the authors suggest using clustered standard errors, a post-hoc adjustment that accounts for the correlation among questions within each of the $T$ clusters, with $N_t$ questions in cluster $t$.
To assess the effectiveness of this approach, the authors use data from the following generative model:
$d \sim \text{Gamma}(1, 1)$,
$\theta \sim \text{Beta}(1, 1)$,
$\theta_t \sim \text{Beta}(d\theta,\ d(1-\theta))$,
$y_{i,t} \sim \text{Bernoulli}(\theta_t)$,
where $d$ controls the range of difficulties of the tasks (clusters), $\theta$ is the global performance of the model, and $\theta_t$ is its performance on task $t$.
They perform inference using importance sampling with the prior as proposal. The results show that only the Bayesian method based on a clustered model achieves the right coverage across different sample sizes.
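The prior-as-proposal importance sampling step is straightforward to implement. The sketch below is my own illustration of that idea (function names are not the `bayes_evals` API): it draws $(d, \theta, \theta_{1:T})$ from the prior, weights each draw by the likelihood of the observed per-cluster counts, and reads off a weighted credible interval for the global accuracy $\theta$.

```python
import numpy as np

def clustered_posterior_theta(correct, totals, n_samples=100_000, seed=0):
    """Importance sampling with the prior as proposal for the clustered eval model.

    correct[t], totals[t]: number correct and number of questions in cluster t.
    Returns prior samples of the global accuracy theta and their normalised weights.
    """
    rng = np.random.default_rng(seed)
    correct, totals = np.asarray(correct), np.asarray(totals)

    d = rng.gamma(1.0, 1.0, size=n_samples)             # d ~ Gamma(1, 1)
    theta = rng.beta(1.0, 1.0, size=n_samples)          # theta ~ Beta(1, 1)
    a = d[:, None] * theta[:, None]
    b = d[:, None] * (1.0 - theta[:, None])
    theta_t = rng.beta(a, b)                            # theta_t ~ Beta(d*theta, d*(1-theta))

    # Log-likelihood of the observed per-cluster counts under each prior draw.
    log_w = np.sum(correct * np.log(theta_t) + (totals - correct) * np.log1p(-theta_t), axis=1)
    w = np.exp(log_w - log_w.max())
    return theta, w / w.sum()

def weighted_interval(samples, weights, alpha=0.05):
    """Equal-tailed credible interval from weighted posterior samples."""
    order = np.argsort(samples)
    cdf = np.cumsum(weights[order])
    lo = samples[order][np.searchsorted(cdf, alpha / 2)]
    hi = samples[order][np.searchsorted(cdf, 1 - alpha / 2)]
    return lo, hi

# Example: 5 clusters (e.g. passages) with 4 questions each.
theta, w = clustered_posterior_theta(correct=[4, 3, 1, 4, 2], totals=[4, 4, 4, 4, 4])
print(weighted_interval(theta, w))
```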
The paper also examines the failure of CLT-based confidence intervals in the independent model comparison setting. They consider two LLMs, A and B, with true probabilities of success $\theta_A$ and $\theta_B$, and construct intervals on the difference in performance, $\text{Diff} = \theta_A - \theta_B$, and on the odds ratio, $\text{OR} = \frac{\theta_A / (1 - \theta_A)}{\theta_B / (1 - \theta_B)}$. The CLT can only be applied directly to the difference in performance, since the odds ratio is a non-linear transformation of the parameters. The results show that the CLT-based interval for the difference has coverage far below the target level when $N$ is small. For the odds ratio, the standard frequentist method is to invert Fisher's exact test (FET), which tends to be overly conservative for small $N$. The Bayesian approach gives credible intervals for both the difference and the odds ratio.
The authors highlight a key benefit of the Bayesian approach to model comparison: the ability to compute the probability that one model outperforms another, $P(\theta_A > \theta_B \mid \text{data})$, directly from posterior samples.
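With independent $\text{Beta}(1, 1)$ priors on $\theta_A$ and $\theta_B$ (as in the earlier IID setting), this probability, together with credible intervals for the difference and the odds ratio, takes only a few lines of Monte Carlo (a sketch, not the `bayes_evals` interface):

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed results: model A gets 14/20 correct, model B gets 9/20.
k_a, n_a, k_b, n_b = 14, 20, 9, 20

# Independent Beta(1 + successes, 1 + failures) posteriors for each model.
theta_a = rng.beta(1 + k_a, 1 + n_a - k_a, size=100_000)
theta_b = rng.beta(1 + k_b, 1 + n_b - k_b, size=100_000)

diff = theta_a - theta_b
odds_ratio = (theta_a / (1 - theta_a)) / (theta_b / (1 - theta_b))

print("P(theta_A > theta_B) ≈", (theta_a > theta_b).mean())
print("95% credible interval for the difference:", np.percentile(diff, [2.5, 97.5]))
print("95% credible interval for the odds ratio:", np.percentile(odds_ratio, [2.5, 97.5]))
```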
The paper also addresses the failure of CLT-based confidence intervals in paired model comparison settings, where two models have been evaluated on the same set of questions. To simulate paired evals data, the authors sample probabilities of success $\theta_A$ and $\theta_B$ for each model, along with a correlation $\rho$, and then sample $N$ points from a bivariate Gaussian with unit variances and correlation $\rho$,
$(z_{A,i}, z_{B,i}) \sim \mathcal{N}\!\left(\mathbf{0}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right),$
which are thresholded to produce correlated binary outcomes $y_{A,i} = \mathbb{I}[\Phi(z_{A,i}) \le \theta_A]$ and $y_{B,i} = \mathbb{I}[\Phi(z_{B,i}) \le \theta_B]$, where $\Phi(\cdot)$ is the standard univariate Gaussian cumulative distribution function (CDF).
The authors perform Bayesian inference on the posterior distribution of $\theta_A - \theta_B$ using importance sampling. The results show that all non-Bayesian methods fall well short of nominal coverage for small $N$.
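A minimal sketch of this Gaussian-copula-style simulation, reflecting my reading of the setup described above rather than the authors' exact code:

```python
import numpy as np
from scipy.stats import norm

def simulate_paired_evals(n, theta_a, theta_b, rho, seed=0):
    """Correlated binary outcomes for two models evaluated on the same n questions."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    u = norm.cdf(z)  # uniform marginals, correlated across the two models
    y_a = (u[:, 0] <= theta_a).astype(int)
    y_b = (u[:, 1] <= theta_b).astype(int)
    return y_a, y_b

y_a, y_b = simulate_paired_evals(n=30, theta_a=0.8, theta_b=0.7, rho=0.6)
print("model A accuracy:", y_a.mean(), " model B accuracy:", y_b.mean())
print("per-question agreement:", (y_a == y_b).mean())
```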
Finally, the paper discusses the failure of CLT-based confidence intervals when metrics are not averages of IID variables. Many metrics used in LLM evaluations, such as $F_\beta$-scores, the Matthews correlation coefficient (MCC), or the G-score, are non-linear in the parameters, so the CLT does not apply directly. As an example, consider the F1 score,
$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$
where $\text{precision} = \frac{N_\text{TP}}{N_\text{TP} + N_\text{FP}}$ and $\text{recall} = \frac{N_\text{TP}}{N_\text{TP} + N_\text{FN}}$.
The authors simulate an evaluation dataset by sampling ground-truth parameters from a uniform Dirichlet prior over the confusion-matrix cell probabilities, which is conjugate to the categorical likelihood and therefore gives a closed-form posterior,
$\boldsymbol{\theta}_\text{conf} \mid \mathbf{N}_\text{conf} \sim \text{Dirichlet}(\mathbf{1} + \mathbf{N}_\text{conf}),$
where $\mathbf{N}_\text{conf}$ is the vector of counts in the confusion matrix (in the binary case, $N_\text{TP}, N_\text{FP}, N_\text{FN}, N_\text{TN}$).
The authors compare Bayesian credible intervals against the bootstrap and find that the Bayesian intervals closely track the nominal coverage, while the bootstrap ones systematically under-cover.
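The Dirichlet-posterior credible interval for F1 is similarly short to compute; the sketch below assumes a binary confusion matrix with counts ordered as (TP, FP, FN, TN) and is not the `bayes_evals` API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed confusion-matrix counts: TP, FP, FN, TN.
n_conf = np.array([12, 3, 5, 20])

# Dirichlet(1 + N_conf) posterior over the four cell probabilities.
probs = rng.dirichlet(1 + n_conf, size=100_000)
p_tp, p_fp, p_fn, _ = probs.T

# F1 as a non-linear function of the cell probabilities, evaluated per posterior draw.
precision = p_tp / (p_tp + p_fp)
recall = p_tp / (p_tp + p_fn)
f1 = 2 * precision * recall / (precision + recall)

print("posterior mean F1:", f1.mean())
print("95% credible interval:", np.percentile(f1, [2.5, 97.5]))
```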
In conclusion, the paper recommends against using the CLT to construct confidence intervals for LLM evaluations, arguing that the assumptions are rarely satisfied. They suggest using more appropriate frequentist methods and Bayesian credible intervals as more reliable alternatives.