The paper addresses the problem of uncertainty quantification in evaluating LLMs on benchmarks. It argues that while methods based on the Central Limit Theorem (CLT) are suitable for large benchmarks (thousands of examples), they are unreliable for smaller, more specialized benchmarks, which are becoming increasingly common. The authors demonstrate that CLT-based methods underestimate uncertainty in small-data settings and propose alternative frequentist and Bayesian methods. A Python library, `bayes_evals`, is provided for implementing the Bayesian methods.
The authors highlight that many current LLM benchmarks contain large evaluation sets, focusing on tasks that LLMs have largely mastered. However, real-world applications often require more targeted benchmarks with high-quality labels, which are more expensive to create and thus involve fewer examples. The paper cites examples such as CUAD, FrontierMath, SWE Bench Verified, MLE Bench, and LiveBench to illustrate this trend. The authors also note that even large benchmarks like Big Bench are often broken down into smaller sub-tasks, violating the independent and identically distributed (IID) assumption required for the CLT.
To demonstrate the limitations of CLT-based confidence intervals, the authors conduct experiments with simulated data, comparing frequentist and Bayesian methods. They generate datasets with known parameter values and measure the coverage of different intervals, i.e., the proportion of times the intervals contain the true value. The results show that CLT-based methods produce unreliable intervals that fail to achieve their target confidence level, especially with smaller datasets.
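As an illustration of this experimental protocol, here is a minimal NumPy sketch (not the authors' code; function names and parameter values are illustrative assumptions) of a coverage check for a naive CLT-based interval on simulated Bernoulli evals:

```python
# Sketch of a coverage experiment: repeatedly simulate an eval with a known
# success probability and check how often a nominal 95% interval contains it.
import numpy as np

def clt_interval(y, z=1.96):
    """Naive CLT-based 95% confidence interval for a Bernoulli mean."""
    m = y.mean()
    se = np.sqrt(y.var(ddof=1) / len(y))
    return m - z * se, m + z * se

def coverage(n_questions=30, n_trials=5000, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        theta = rng.beta(1, 1)                         # true success probability
        y = rng.binomial(1, theta, size=n_questions)   # simulated eval outcomes
        lo, hi = clt_interval(y)
        hits += (lo <= theta <= hi)
    return hits / n_trials  # ~0.95 if the interval is well calibrated

print(coverage())  # typically noticeably below 0.95 for small n_questions
```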
The paper then discusses the Central Limit Theorem and its role in frequentist uncertainty quantification. The CLT states that for a sufficiently large sample size $N$, the sampling distribution of the sample mean will be approximately normal, regardless of the distribution of the original population. Formally, if $y_1, \dots, y_N$ are IID random variables with mean $\mu$ and finite variance $\sigma^2$, then
$\sqrt{N}(\hat{\mu} - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ as $N \to \infty$,
where
- $\hat{\mu} = \frac{1}{N}\sum_{i=1}^N y_i$ is the sample mean
- $\mu$ is the population mean
- $N$ is the sample size
- $\sigma^2$ is the population variance
In practice, the population variance $\sigma^2$ is rarely known and is estimated empirically by the sample variance, $S^2$. The standard error (SE) of the sample mean is given by $\text{SE}(\hat{\mu}) = \sqrt{\nicefrac{S^2}{N}}$.
For a desired confidence level $1-\alpha$, the general form of a two-sided CLT-based confidence interval for the parameter of interest $\mu$ is written as
$\text{CI}_{1-\alpha}(\mu) = \hat{\mu} \pm z_{\nicefrac{\alpha}{2}}\text{SE}(\hat{\mu})$,
where
- $\text{CI}_{1-\alpha}(\mu)$ is the confidence interval for $\mu$ with confidence level $1-\alpha$
- $\hat{\mu}$ is the sample mean
- $z_{\nicefrac{\alpha}{2}}$ is the $\nicefrac{\alpha}{2}$-th quantile of the standard normal distribution
- $\text{SE}(\hat{\mu})$ is the standard error of the sample mean
When comparing two independent samples with sizes $N_A$ and $N_B$ and sample variances $S_A^2$ and $S_B^2$, the standard error of the difference in sample means is $\text{SE}(\hat{\mu}_A - \hat{\mu}_B) = \sqrt{\nicefrac{S_A^2}{N_A} + \nicefrac{S_B^2}{N_B}}$.
For paired samples, where each observation in sample $A$ is matched with one in sample $B$, the CLT applies directly to the differences $d_i = y_{A,i} - y_{B,i}$, which are treated as a single sample.
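To make these formulas concrete, here is a minimal sketch (not taken from the paper) of the single-sample, independent-comparison, and paired-comparison CLT intervals in NumPy/SciPy; the outcome arrays `y_a` and `y_b` are hypothetical 0/1 results for two models.

```python
# Sketch of the three CLT-based intervals defined above, for hypothetical
# binary per-question outcomes of two models A and B.
import numpy as np
from scipy import stats

def clt_ci(mean, se, alpha=0.05):
    """Two-sided CLT interval: mean +/- z_{alpha/2} * SE."""
    z = stats.norm.ppf(1 - alpha / 2)
    return mean - z * se, mean + z * se

rng = np.random.default_rng(0)
y_a = rng.binomial(1, 0.8, size=50)   # hypothetical outcomes for model A
y_b = rng.binomial(1, 0.7, size=50)   # hypothetical outcomes for model B

# Single sample: SE(mu_hat) = sqrt(S^2 / N).
se_a = np.sqrt(y_a.var(ddof=1) / len(y_a))
print(clt_ci(y_a.mean(), se_a))

# Independent comparison: SE of the difference in sample means.
se_diff = np.sqrt(y_a.var(ddof=1) / len(y_a) + y_b.var(ddof=1) / len(y_b))
print(clt_ci(y_a.mean() - y_b.mean(), se_diff))

# Paired comparison: apply the single-sample formula to the differences d_i.
d = y_a - y_b
se_d = np.sqrt(d.var(ddof=1) / len(d))
print(clt_ci(d.mean(), se_d))
```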
The paper provides several examples of the failures of CLT-based confidence intervals in LLM evaluations. First, it shows that CLT-based confidence intervals fail in the IID questions setting. It is common to assume that benchmarks contain IID questions, so that each eval outcome is a Bernoulli trial with some underlying probability of success $\theta$. However, as the empirical mean $\hat{\theta}$ approaches 0 or 1, the width of the CLT-based confidence interval shrinks towards zero, incorrectly suggesting near-certainty.
The authors evaluate the coverage of different types of intervals by generating data from
$\theta \sim \betadist(1, 1)$,
$y_i \sim \bernoulli(\theta)$ for $i = 1, \dots, N$.
They find that both CLT-based and bootstrap confidence intervals show poor calibration, with actual coverage well below the nominal level. In contrast, Bayesian methods based on a Beta-Bernoulli model and the frequentist Wilson score interval perform well. The posterior over $\theta$ is available in closed form:
$P(\theta|y_{1:N}) = \betadist(1+\sum_{i=1}^N y_i, 1 + \sum_{i=1}^N (1-y_i))$.
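For reference, both alternatives can be computed in a few lines; this is an illustrative SciPy sketch, not the `bayes_evals` API.

```python
# Sketch of the two better-calibrated alternatives for IID binary outcomes:
# the frequentist Wilson score interval and the Beta-Bernoulli credible interval.
import numpy as np
from scipy import stats

def wilson_interval(successes, n, alpha=0.05):
    """Wilson score interval for a binomial proportion."""
    z = stats.norm.ppf(1 - alpha / 2)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def beta_credible_interval(successes, n, alpha=0.05):
    """Equal-tailed interval from the Beta(1 + s, 1 + n - s) posterior."""
    posterior = stats.beta(1 + successes, 1 + n - successes)
    return posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2)

# Hypothetical eval: 19 correct answers out of 20 questions.
print(wilson_interval(19, 20))          # stays inside (0, 1) with nonzero width
print(beta_credible_interval(19, 20))   # equal-tailed 95% credible interval
```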
Next, the paper discusses the failure of CLT-based confidence intervals in the clustered questions setting, such as reading comprehension benchmarks with multiple questions about a single passage of text. To address this, the authors suggest using clustered standard errors, a post-hoc adjustment that accounts for the correlation within each of the $T$ clusters of $I$ questions.
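A minimal sketch of a cluster-robust standard error for a simple mean is shown below; this is a standard estimator written out as an assumption, not necessarily the paper's exact implementation. Residuals are summed within each cluster before being squared, so positive within-cluster correlation inflates the variance estimate.

```python
# Sketch of a cluster-robust standard error for a simple mean: residuals are
# summed within each cluster before being squared, so that within-cluster
# correlation is reflected in the variance estimate.
import numpy as np

def clustered_se(y, cluster_ids):
    """Cluster-robust standard error of the mean of y."""
    y = np.asarray(y, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    resid = y - y.mean()
    var = sum(resid[cluster_ids == c].sum() ** 2 for c in np.unique(cluster_ids))
    return np.sqrt(var) / len(y)

# Hypothetical benchmark: 3 passages (clusters) with 4 questions each.
y = np.array([1, 1, 1, 1,  0, 0, 1, 0,  1, 0, 1, 1])
clusters = np.repeat([0, 1, 2], 4)
print(clustered_se(y, clusters))         # clustered SE
print(np.sqrt(y.var(ddof=1) / len(y)))   # naive IID SE, smaller here
```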
To assess the effectiveness of this approach, the authors use data from the following generative model:
$d \sim \gammadist(1, 1)$,
$\theta \sim \betadist(1, 1)$,
$\theta_t \sim \betadist(d \theta, d (1-\theta))$,
$y_{i,t} \sim \bernoulli(\theta_t)$,
where
- $d$ controls the range of difficulties of the tasks or clusters
- $\theta$ is the global performance of the model
- $\theta_t$ is the performance of the model on a given task $t$
They perform inference using importance sampling with the prior as the proposal distribution. The results show that only the Bayesian method based on the clustered model achieves nominal coverage across different sample sizes.
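A rough sketch of this inference scheme is given below, assuming the generative model above and outcomes stored as a $T \times I$ binary array; it is illustrative NumPy code, not the `bayes_evals` implementation.

```python
# Sketch of importance sampling with the prior as proposal for the clustered
# model: draw (d, theta, theta_t) from the prior, weight each draw by the
# Bernoulli likelihood of the observed outcomes, and summarise the weighted
# samples of the global performance theta.
import numpy as np

def clustered_posterior_ci(y, alpha=0.05, n_samples=100_000, seed=0):
    """y: (T, I) array of 0/1 outcomes, T tasks with I questions each."""
    rng = np.random.default_rng(seed)
    T, I = y.shape
    d = rng.gamma(1.0, 1.0, size=n_samples)        # prior over difficulty spread
    theta = rng.beta(1.0, 1.0, size=n_samples)     # prior over global performance
    theta_t = rng.beta(d[:, None] * theta[:, None],
                       d[:, None] * (1 - theta[:, None]),
                       size=(n_samples, T))        # per-task performances
    theta_t = np.clip(theta_t, 1e-12, 1 - 1e-12)
    # Log importance weights: Bernoulli log-likelihood summed over all outcomes.
    s = y.sum(axis=1)                              # successes per task
    log_w = (s * np.log(theta_t) + (I - s) * np.log1p(-theta_t)).sum(axis=1)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Weighted quantiles of theta give an equal-tailed credible interval.
    order = np.argsort(theta)
    cdf = np.cumsum(w[order])
    lo = theta[order][np.searchsorted(cdf, alpha / 2)]
    hi = theta[order][np.searchsorted(cdf, 1 - alpha / 2)]
    return lo, hi

# Hypothetical data: 10 tasks, 5 questions each, with task-level variation.
rng = np.random.default_rng(1)
y = rng.binomial(1, rng.beta(2, 2, size=(10, 1)), size=(10, 5))
print(clustered_posterior_ci(y))
```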
The paper also examines the failure of CLT-based confidence intervals in the independent model comparison setting. It considers two LLMs, $A$ and $B$, with true probabilities of success $\theta_A$ and $\theta_B$, and constructs confidence intervals on the difference in performance, $\theta_A - \theta_B$, and on the odds ratio, $\frac{\theta_A / (1 - \theta_A)}{\theta_B / (1 - \theta_B)}$. The CLT can only be used to construct a confidence interval on the difference in performance, since the odds ratio is a non-linear transformation of the parameters. The results show that the CLT-based interval has coverage far below the target level when $N$ is small. For the odds ratio, the standard frequentist method involves inverting Fisher's exact test (FET), which tends to be overly conservative for small $N$. The Bayesian approach yields credible intervals for both the difference and the odds ratio.
The authors highlight a key benefit of the Bayesian approach to model comparison: the ability to compute probabilities that one model outperforms another:
$\mathbb{P}(\theta_A > \theta_B | y_{A;1:N}, y_{B;1:N}) \approx \frac{1}{K} \sum_{k=1}^K \mathbbm{1}[\theta_A^{(k)} > \theta_B^{(k)}]$,
where $\theta_A^{(k)}$ and $\theta_B^{(k)}$ are the $k$-th of $K$ samples drawn from the respective posteriors.
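A minimal sketch of the Bayesian independent comparison (assuming Beta(1, 1) priors and hypothetical counts, not the `bayes_evals` API) shows how the difference, the odds ratio, and this probability all come from the same posterior samples:

```python
# Sketch of the Bayesian independent comparison: independent Beta posteriors,
# Monte Carlo samples, and derived summaries for the difference, the odds
# ratio, and P(theta_A > theta_B).
import numpy as np

rng = np.random.default_rng(0)
K = 100_000

# Hypothetical results: model A gets 18/20 correct, model B gets 14/20.
s_A, n_A = 18, 20
s_B, n_B = 14, 20

theta_A = rng.beta(1 + s_A, 1 + n_A - s_A, size=K)   # posterior samples for A
theta_B = rng.beta(1 + s_B, 1 + n_B - s_B, size=K)   # posterior samples for B

diff = theta_A - theta_B
odds_ratio = (theta_A / (1 - theta_A)) / (theta_B / (1 - theta_B))

print(np.percentile(diff, [2.5, 97.5]))        # credible interval, difference
print(np.percentile(odds_ratio, [2.5, 97.5]))  # credible interval, odds ratio
print((theta_A > theta_B).mean())              # P(theta_A > theta_B | data)
```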
The paper also addresses the failure of CLT-based confidence intervals in paired model comparison settings, where two models have been evaluated on the same set of questions. To simulate paired evals data, the authors sample probabilities of success $\theta_A$ and $\theta_B$ for each model, along with a correlation $\rho$, and then sample $N$ points from a bivariate Gaussian whose marginals are thresholded to produce correlated binary outcomes:
$\begin{pmatrix} z_{A,i} \\ z_{B,i} \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)$, $\quad y_{A,i} = \mathbbm{1}[\Phi(z_{A,i}) < \theta_A]$, $\quad y_{B,i} = \mathbbm{1}[\Phi(z_{B,i}) < \theta_B]$,
where $\Phi$ is the standard univariate Gaussian cumulative distribution function (CDF).
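A short sketch of this simulation, assuming the Gaussian-copula construction described above with hypothetical parameter values:

```python
# Sketch of simulating paired eval outcomes: correlated latent Gaussians are
# pushed through the standard normal CDF and thresholded at each model's
# success probability, yielding correlated Bernoulli outcomes.
import numpy as np
from scipy.stats import norm

def simulate_paired_evals(theta_A, theta_B, rho, N, seed=0):
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=N)   # latent correlated scores
    u = norm.cdf(z)                                         # uniform marginals
    y_A = (u[:, 0] < theta_A).astype(int)
    y_B = (u[:, 1] < theta_B).astype(int)
    return y_A, y_B

y_A, y_B = simulate_paired_evals(theta_A=0.8, theta_B=0.7, rho=0.6, N=50)
print(y_A.mean(), y_B.mean(), np.corrcoef(y_A, y_B)[0, 1])
```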
The authors perform Bayesian inference over the parameters of this model using importance sampling. The results show that all non-Bayesian methods fall severely short of nominal coverage for small $N$.
Finally, the paper discusses the failure of CLT-based confidence intervals when metrics are not averages of IID variables. Many metrics for LLM evaluations, such as $F_1$-scores, the Matthews correlation coefficient (MCC), or the G-score, are non-linear functions of the underlying parameters, so the CLT is not applicable. As an example, consider the $F_1$ score:
$F_1 = 2\,\frac{\precision\cdot\recall}{\precision+\recall}$,
where $\precision = \frac{N_\text{TP}}{N_\text{TP} + N_\text{FP}}$ and $\recall = \frac{N_\text{TP}}{N_\text{TP} + N_\text{FN}}$.
The authors simulate an evaluation dataset by sampling ground truth parameters from a uniform Dirichlet prior, which is conjugate to the categorical likelihood:
$\bm{\theta} \sim \dirichlet(1, 1, 1, 1)$,
$y_i \sim \categorical(\bm{\theta})$,
$\bm{\theta} | y_{1:N} \sim \dirichlet(1 + N_\text{conf})$,
where
- $\bm{\theta}$ is the parameter vector
- $y_i$ is the outcome of the $i$-th trial
- $N_\text{conf}$ is the vector of counts in the confusion matrix
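A minimal sketch of this approach (with hypothetical confusion-matrix counts, not the `bayes_evals` API): sample $\bm{\theta}$ from the Dirichlet posterior and compute the implied $F_1$ for each sample.

```python
# Sketch of the Dirichlet-categorical approach to F1 uncertainty: sample
# confusion-matrix probabilities from the Dirichlet posterior and compute the
# implied F1 for each sample.
import numpy as np

rng = np.random.default_rng(0)
K = 100_000

# Hypothetical confusion-matrix counts: [TP, FP, FN, TN].
N_conf = np.array([12, 3, 5, 30])

# Dirichlet(1,1,1,1) prior + categorical likelihood -> Dirichlet(1 + counts) posterior.
theta = rng.dirichlet(1 + N_conf, size=K)
tp, fp, fn = theta[:, 0], theta[:, 1], theta[:, 2]

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(np.percentile(f1, [2.5, 97.5]))   # 95% credible interval on F1
```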
The authors compare Bayesian credible intervals against the bootstrap and find that the Bayesian intervals closely track the nominal coverage, while the bootstrap ones systematically under-cover.
In conclusion, the paper recommends against using the CLT to construct confidence intervals for LLM evaluations, arguing that the assumptions are rarely satisfied. They suggest using more appropriate frequentist methods and Bayesian credible intervals as more reliable alternatives.