Signal and Noise in LLM Evaluation

Updated 20 August 2025
  • The Signal and Noise Framework is a set of rigorous methodologies for quantifying score dispersion across models (signal) and score variability within a model (noise) in LLM benchmarks.
  • It defines key metrics, namely relative dispersion, relative standard deviation, and the signal-to-noise ratio (SNR), to improve decision reliability and predictive accuracy.
  • The framework recommends actionable interventions, such as metric selection, filtering noisy subtasks, and checkpoint averaging, to enhance benchmark reliability.

The Signal and Noise Framework encompasses a set of rigorous concepts and methodologies for analyzing, quantifying, and designing evaluation benchmarks in the context of LLM development. Its foundation is the recognition that typical LLM development cycles rely on decisions made from small-scale experiments, evaluated on (potentially noisy or uninformative) multi-task benchmark suites. The framework introduces quantitative, statistical definitions of "signal" and "noise," establishes their relationship to decision reliability and predictive accuracy, and proposes direct interventions to optimize benchmark quality. These developments are supported by empirical results over 30 benchmarks, 375 open-weight LLMs, and 900K benchmark runs (Heineman et al., 18 Aug 2025).

1. Formal Definitions: Signal, Noise, and SNR

The framework defines signal as a benchmark’s discriminatory power: its ability to separate the scores of “better” models from those of “worse” models across an evaluation suite. Concretely, signal is measured via the relative dispersion of the scores achieved by an ensemble of models $M$:

$$\text{Rel. Dispersion}(M) = \frac{\max_{j,k} |m_j - m_k|}{\bar{m}}$$

where $m_j$ and $m_k$ are the scores of models $j$ and $k$, and $\bar{m}$ is the mean score across all models.

Noise encapsulates the benchmark’s sensitivity to random, non-systematic fluctuation, primarily checkpoint-level randomness between training runs. It is quantified (per model) as the relative standard deviation over the last $n$ training checkpoints:

$$\text{Rel. Std}(m) = \frac{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (m_i - \bar{m})^2}}{\bar{m}}$$

where $m_i$ is the evaluation score at checkpoint $i$ and $\bar{m}$ is their mean.

The crucial metric, the Signal-to-Noise Ratio (SNR), is defined as the ratio between signal (relative dispersion across model scores) and noise (typical per-model relative standard deviation across checkpoints):

$$\text{SNR} = \frac{\text{Rel. Dispersion}(M)}{\text{Rel. Std}(m)}$$

A high SNR corresponds to a benchmark where inter-model score differences significantly exceed intra-model stochastic fluctuations, implying more reliable, actionable differences in any evaluation.
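These quantities can be computed directly from benchmark scores. The following is a minimal Python sketch of the definitions above; the array names and example values are illustrative, not taken from the paper.

```python
import numpy as np

def relative_dispersion(model_scores: np.ndarray) -> float:
    """Signal: largest pairwise score gap across models, normalized by the mean score."""
    return (model_scores.max() - model_scores.min()) / model_scores.mean()

def relative_std(checkpoint_scores: np.ndarray) -> float:
    """Noise: sample standard deviation over a model's final checkpoints, normalized by their mean."""
    return checkpoint_scores.std(ddof=1) / checkpoint_scores.mean()

def snr(model_scores: np.ndarray, checkpoint_scores: np.ndarray) -> float:
    """Signal-to-noise ratio of a benchmark."""
    return relative_dispersion(model_scores) / relative_std(checkpoint_scores)

# Illustrative values: five models' scores on one benchmark, and one model's last five checkpoints.
models = np.array([0.42, 0.47, 0.55, 0.61, 0.66])
checkpoints = np.array([0.60, 0.62, 0.61, 0.63, 0.60])
print(f"SNR = {snr(models, checkpoints):.1f}")
```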

The framework further considers scaling law prediction error, measuring the accuracy of performance extrapolations from small models to larger models:

$$\text{Prediction Error} = \frac{|\text{Measured Value} - \text{True Value}|}{|\text{True Value}|}$$
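A sketch of the corresponding relative-error computation, assuming the scaling-law prediction and the true target-model score are available as plain floats (the argument names are assumptions of this example):

```python
def prediction_error(predicted: float, true_value: float) -> float:
    """Relative error of a scaling-law extrapolation against the true target-model score."""
    return abs(predicted - true_value) / abs(true_value)

print(prediction_error(predicted=0.58, true_value=0.61))  # illustrative values: ~4.9% relative error
```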

2. Interventions for Improving Benchmark Reliability

Three practical interventions are proposed to modulate signal and noise in benchmark design:

(a) Metric Selection:

Switching from discontinuous metrics (e.g., accuracy, exact match) to a continuous loss-based metric (e.g., bits-per-byte, BPB) both increases signal and reduces noise. BPB, defined as the negative log-likelihood of the correct answer normalized by its UTF-8 byte count, produces smoother, more dispersed, and less volatile model-wise score distributions, directly yielding higher-SNR benchmarks.
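As an illustration, the sketch below computes BPB for a single gold answer from hypothetical per-token log-probabilities (assumed to be natural-log values); the function and variable names are assumptions for this example, not part of the framework's tooling.

```python
import math

def bits_per_byte(answer_text: str, token_logprobs: list[float]) -> float:
    """Negative log-likelihood of the gold answer's tokens, converted from nats to bits
    and normalized by the answer's UTF-8 byte length."""
    nll_nats = -sum(token_logprobs)          # total NLL in nats
    nll_bits = nll_nats / math.log(2)        # convert nats -> bits
    return nll_bits / len(answer_text.encode("utf-8"))

# Hypothetical example: a gold answer scored with made-up per-token log-probabilities.
print(bits_per_byte("Paris", [-0.7, -0.2, -0.1]))
```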

(b) Filtering Noisy Subtasks:

For multi-task or composite benchmarks, individual subtasks may vary in their SNR. By computing SNR for each subtask and curating a high-SNR subset (ranking and selecting the top-k subtasks), the aggregate benchmark’s SNR improves, leading to higher decision accuracy and lower prediction error. Empirical results demonstrate that high-SNR subsets (e.g., a 16-task subset from MMLU) perform better than full, noisier task suites for a given evaluation budget.
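A minimal sketch of this subtask-filtering step, assuming per-subtask score arrays are already available; the dictionary names and the use of a single development model for the noise estimate are assumptions of this example:

```python
import numpy as np

def top_k_subtasks(model_scores: dict[str, np.ndarray],
                   ckpt_scores: dict[str, np.ndarray],
                   k: int) -> list[str]:
    """Rank subtasks by SNR and keep the k highest.
    model_scores[t]: every model's score on subtask t (signal estimate).
    ckpt_scores[t]:  one model's final-checkpoint scores on subtask t (noise estimate)."""
    def subtask_snr(t: str) -> float:
        m, c = model_scores[t], ckpt_scores[t]
        signal = (m.max() - m.min()) / m.mean()
        noise = c.std(ddof=1) / c.mean()
        return signal / noise
    return sorted(model_scores, key=subtask_snr, reverse=True)[:k]
```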

(c) Averaging over Checkpoints:

Averaging the outputs (scores) from the last several model checkpoints—rather than relying on a single run—reduces stochastic noise. Both (i) development model scores used for small-scale decision-making and (ii) target model scores used for scaling law fits benefit from checkpoint averaging, leading to more reliable model ranking and predictive performance characterization.
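A sketch of this smoothing step; the default window size here is a free parameter of the example, not a value prescribed by the paper:

```python
import numpy as np

def checkpoint_averaged_score(per_checkpoint_scores: np.ndarray, n_last: int = 5) -> float:
    """Average a model's benchmark score over its last n_last checkpoints to suppress
    checkpoint-level noise before ranking models or fitting scaling laws."""
    return float(np.mean(per_checkpoint_scores[-n_last:]))
```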

3. Empirical Results and Decision-Theoretic Implications

Benchmarks with higher SNR empirically exhibit higher decision accuracy: the probability that development-time, small-model rankings match those at the intended large-model scale. Formally, decision accuracy over all model pairs $\mathcal{P}$ is:

$$\text{Decision Accuracy} = \frac{1}{|\mathcal{P}|} \sum_{(a, b) \in \mathcal{P}} \mathbb{I}\Big[\operatorname{sign}\big(B(s_a) - B(s_b)\big) = \operatorname{sign}\big(B(m_a) - B(m_b)\big)\Big]$$

where $B(s_a)$ and $B(s_b)$ are the benchmark scores of small models $s_a, s_b$, and similarly $B(m_a), B(m_b)$ for the corresponding large models.
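The sketch below evaluates this agreement for a set of candidate configurations; the recipe names and scores are hypothetical.

```python
from itertools import combinations
import numpy as np

def decision_accuracy(small_scores: dict[str, float], large_scores: dict[str, float]) -> float:
    """Fraction of candidate pairs whose ranking under small-scale scores matches
    their ranking under the corresponding large-scale (target) scores."""
    pairs = list(combinations(small_scores, 2))
    agree = sum(
        np.sign(small_scores[a] - small_scores[b]) == np.sign(large_scores[a] - large_scores[b])
        for a, b in pairs
    )
    return agree / len(pairs)

# Hypothetical data recipes evaluated at small and large scale.
small = {"recipe_A": 0.41, "recipe_B": 0.44, "recipe_C": 0.39}
large = {"recipe_A": 0.58, "recipe_B": 0.63, "recipe_C": 0.55}
print(decision_accuracy(small, large))  # 1.0: all pairwise orderings agree
```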

Lower noise, as evidenced by checkpoint variance, correlates with lower scaling law prediction error, ensuring more precise extrapolation from resource-efficient small-scale experiments to large-scale model performance.

The experimental study spans 30 evaluation benchmarks, 375 models (60M–32B parameters), and nearly 1 million distinct evaluations, robustly confirming the tight relationship between SNR and reliability in both ranking and predictive settings.

4. Recommendations for Benchmark Designers and Practitioners

To enhance the reliability and utility of LLM benchmarks, the following are recommended:

  • Design or select benchmarks to maximize signal: ensure that model scores are well dispersed, avoiding tasks where all plausible models collapse to similar scores.
  • Minimize measured noise: use smoother, continuous metrics (e.g., bits-per-byte), average over multiple checkpoints, and maximize dataset homogeneity.
  • Filter or reweight low-SNR subtasks: prioritize subtasks that yield high SNR, even if it means reducing the size or breadth of the benchmark.
  • When fitting scaling laws or making cross-scale predictions, leverage checkpoint averaging to robustly suppress stochasticity-induced errors.
  • Quantify and report SNR explicitly for both the full and filtered benchmarks, to support evidence-based benchmark selection.

These recommendations are operationalized by directly computing the SNR and related statistics as outlined in the formalism above.

5. Quantitative Formulations and Key Equations

The framework is built upon explicit, reproducible mathematical definitions:

Metric | Formula | Description
Relative Dispersion | $\frac{\max_{j,k}|m_j - m_k|}{\bar{m}}$ | Measures score spread across models (signal)
Relative Standard Deviation | $\frac{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(m_i - \bar{m})^2}}{\bar{m}}$ | Measures score variability across checkpoints (noise)
Signal-to-Noise Ratio (SNR) | $\frac{\text{Rel. Dispersion}(M)}{\text{Rel. Std}(m)}$ | Composite measure of benchmark reliability
Decision Accuracy | $\frac{1}{|\mathcal{P}|}\sum_{(a,b)\in\mathcal{P}} \mathbb{I}\left[\operatorname{sign}(B(s_a)-B(s_b)) = \operatorname{sign}(B(m_a)-B(m_b))\right]$ | Agreement between small-scale and target-scale rankings
Prediction Error | $\frac{|\text{Measured Value} - \text{True Value}|}{|\text{True Value}|}$ | Accuracy of scaling-law extrapolation

6. Broader Implications and Future Directions

The Signal and Noise Framework for LLM evaluation provides a principled paradigm for diagnosing and improving benchmark utility, transforming benchmark design into a quantitative, optimization-driven discipline. Its concepts extend beyond LLM selection—any high-stakes model evaluation, especially in resource-constrained, small-experiment regimes, can benefit from explicit SNR quantification and the associated interventions. As LLM evaluation suites continue to proliferate and diversify, integrating SNR-based design principles is likely to become standard practice for ensuring both rigorous model comparison and efficient resource allocation (Heineman et al., 18 Aug 2025).
