You Don't Need to Run Every Eval

Published 22 Jun 2026 in cs.LG | (2606.24020v1)

Abstract: A modern model release reports scores on 40+ benchmarks and the same evaluations were run many more times before it: to track training progress, compare design choices, and select the checkpoint for the release. But do we need to run every eval? We compile a public score matrix of 84 frontier models on 133 benchmarks (2,604 cells, 23.3% filled) and find it is approximately rank-2: a model's scores across all 133 benchmarks are largely determined by just two numbers. We confirm this in two ways: scores hidden from the matrix are best recovered using two factors, and two factors already explain over 90% of the variation among models on the benchmarks they share. Building on this, we design BenchPress: a logit-space rank-2 matrix completion method that recovers held-out scores to within 4.6 points, and a confidence layer that says when each prediction can be trusted. Using BenchPress, we find a subset of five benchmarks {GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1} that can recover the rest of a model's public scorecard to within 3.93 points. For a tighter inference budget, a cheaper set {GPQA-D, MMLU-Pro, Aider Polyglot, MATH-500, AIME 2026} can predict a model's evals to within 4.55. We release the score matrix, the BenchPress code, and an interactive tool that predicts any model's score on any benchmark.

Summary

  • The paper demonstrates that a rank-2 latent structure explains over 90% of LLM performance variance across diverse benchmarks.
  • The authors introduce BENCHPRESS, a logit-space ALS matrix completion algorithm that infers missing scores with a median error of ~4.6 points.
  • Using a small, strategically chosen probe set, the approach preserves model rankings and enables cost-effective, reliable LLM evaluation.

Summary of "You Don't Need to Run Every Eval" (2606.24020)

Motivation and Problem Formulation

The costly and redundant nature of LLM benchmark evaluation has become evident as modern frontier model releases report scores across dozens of benchmarks. Each evaluation run incurs significant monetary and wall-clock costs, especially considering the repetition needed for model selection, training progress tracking, and deployment decisions. The paper investigates whether it's necessary to conduct every individual benchmark evaluation or if a smaller subset can efficiently infer the remainder of a model's evaluation suite with high fidelity.

Score Matrix Construction and Low-Rank Structure

The authors compile a public score matrix of 84 recent frontier models (spanning 13 providers) evaluated on 133 benchmarks. The matrix is sparse: 2,604 of its 84 × 133 = 11,172 cells are observed (23.3% filled). It nonetheless covers a broad benchmark spectrum: math, coding, agentic, multimodal, factuality, preference, and safety, among others.

Analysis shows that this matrix is approximately rank-2: two latent factors explain over 90% of the variation among models on the benchmarks they share. Two lines of evidence support this finding:

  • Soft-Impute rank sweep: Median absolute percentage error (MedAPE) on held-out data is minimized at rank 2 both for raw-score and logit-transformed spaces.
  • SVD: Singular value decompositions of fully-observed submatrices consistently show that two components dominate variance.
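
The SVD check in the second bullet can be illustrated with a minimal sketch. It runs on synthetic data built from two hidden factors, not on the paper's matrix, and the column-centering step and the helper name `explained_variance_ratio` are assumptions made for illustration.

```python
import numpy as np

def explained_variance_ratio(scores: np.ndarray) -> np.ndarray:
    """Fraction of variance captured by each singular component of a
    fully observed (models x benchmarks) block, after column-centering."""
    centered = scores - scores.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    energy = singular_values ** 2
    return energy / energy.sum()

# Synthetic stand-in: 20 models x 8 benchmarks driven by two hidden factors
# plus a little noise. This is NOT the paper's score matrix.
rng = np.random.default_rng(0)
factors = rng.normal(size=(20, 2))
loadings = rng.normal(size=(2, 8))
block = factors @ loadings + 0.05 * rng.normal(size=(20, 8))

ratio = explained_variance_ratio(block)
top2_share = float(ratio[:2].sum())
print(f"variance explained by the top two components: {top2_share:.3f}")
```

On real data the same computation only applies to submatrices with no missing cells, which is why the summary speaks of fully-observed submatrices.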

The operational implication is that a model's scores across disparate benchmarks are largely determined by two numbers, so benchmark scores cannot be treated as independent measurements.

BENCHPRESS: Logit-Space Low-Rank Matrix Completion

Building on the observed low-rank structure, the authors develop BENCHPRESS, a logit-transformed, bias-decomposed alternating least squares (ALS) matrix completion algorithm. Its recipe:

  1. Apply a logit transform to percentage scores, then standardize each column.
  2. Fit a rank-2 bias-decomposed ALS model consisting of a global offset, a row (model) offset, a column (benchmark) offset, and a rank-2 residual correction.
  3. Invert the standardization and the logit transform to map predicted scores back to the original scale.
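
The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the released BENCHPRESS code: the clipping constant `eps`, the ridge strength `reg`, the iteration count, the initialization, and the function name `complete` are all hypothetical choices, and the demo data is synthetic.

```python
import numpy as np

def logit(score, eps=1e-3):
    """Map a 0-100 percentage to logit space, clipping away exact 0 and 100."""
    p = np.clip(score / 100.0, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def inv_logit(z):
    """Map logit-space values back to the 0-100 scale."""
    return 100.0 / (1.0 + np.exp(-z))

def complete(scores, rank=2, reg=0.1, iters=100, seed=0):
    """Fill a models x benchmarks matrix of percentages (np.nan = missing).

    Assumes every benchmark column has at least two distinct observed scores.
    """
    mask = ~np.isnan(scores)
    z = np.where(mask, logit(np.where(mask, scores, 50.0)), 0.0)

    # Step 1: standardize each column using its observed entries only.
    counts = mask.sum(axis=0)
    col_mean = z.sum(axis=0) / counts
    col_std = np.sqrt((mask * (z - col_mean) ** 2).sum(axis=0) / counts)
    x = np.where(mask, (z - col_mean) / col_std, 0.0)

    # Step 2: global offset + row offset + column offset + low-rank residual,
    # fitted by alternating ridge-regularized least-squares updates.
    n_models, n_benchmarks = x.shape
    rng = np.random.default_rng(seed)
    mu = x[mask].mean()
    row_off = np.zeros(n_models)
    col_off = np.zeros(n_benchmarks)
    U = 0.1 * rng.normal(size=(n_models, rank))
    V = 0.1 * rng.normal(size=(n_benchmarks, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        resid = (x - mu - col_off[None, :] - U @ V.T) * mask
        row_off = resid.sum(axis=1) / (mask.sum(axis=1) + reg)
        resid = (x - mu - row_off[:, None] - U @ V.T) * mask
        col_off = resid.sum(axis=0) / (mask.sum(axis=0) + reg)
        target = x - mu - row_off[:, None] - col_off[None, :]
        for i in range(n_models):
            obs = mask[i]
            U[i] = np.linalg.solve(V[obs].T @ V[obs] + ridge, V[obs].T @ target[i, obs])
        for j in range(n_benchmarks):
            obs = mask[:, j]
            V[j] = np.linalg.solve(U[obs].T @ U[obs] + ridge, U[obs].T @ target[obs, j])

    # Step 3: undo the standardization, then the logit transform.
    fitted = mu + row_off[:, None] + col_off[None, :] + U @ V.T
    return inv_logit(fitted * col_std + col_mean)

# Synthetic check (not the paper's data): rank-2 structure in logit space,
# with a quarter of the cells hidden and then predicted.
rng = np.random.default_rng(1)
latent = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 12)) + rng.normal(size=12)
truth = inv_logit(latent)
hidden = rng.random(truth.shape) < 0.25
observed = np.where(hidden, np.nan, truth)
completed = complete(observed)
medae = float(np.median(np.abs(completed[hidden] - truth[hidden])))
print(f"MedAE on hidden cells: {medae:.2f} points")
```

Working in logit space keeps every prediction inside the 0 to 100 range after the inverse transform, which a raw-score factorization would not ensure.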

BENCHPRESS achieves a median absolute error (MedAE) of 4.63 points on held-out entries at full coverage. Regression-based alternatives provide marginally better MedAE but at lower coverage, while BENCHPRESS offers deterministic, full-coverage prediction.
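
For reference, the two error metrics used throughout can be written in a few lines. The function names and the toy numbers are illustrative; skipping zero-valued true scores in MedAPE is an assumption made here to avoid division by zero.

```python
from statistics import median

def medae(true_scores, predicted_scores):
    """Median absolute error, in score points."""
    return median(abs(p - t) for t, p in zip(true_scores, predicted_scores))

def medape(true_scores, predicted_scores):
    """Median absolute percentage error, in percent of the true score.
    Entries with a true score of zero are skipped to avoid dividing by zero."""
    return median(100.0 * abs(p - t) / abs(t)
                  for t, p in zip(true_scores, predicted_scores) if t != 0)

# Toy values, not taken from the paper.
true_scores = [80.0, 50.0, 20.0]
predicted_scores = [84.0, 49.0, 25.0]
points_error = medae(true_scores, predicted_scores)
percent_error = medape(true_scores, predicted_scores)
print(points_error, percent_error)
```

A 5-point miss on a true score of 20 is a 25% relative error, while a 4-point miss on 80 is only 5%, so the two metrics can rank methods differently on low-scoring benchmarks.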

Benchmark Probe Selection and Scorecard Recovery

A key practical application is budgeted scorecard recovery: selecting a small probe set of benchmarks to run, and using BENCHPRESS to infer the rest. Using a greedy selection strategy, five benchmarks (GPQA Diamond, HLE, Codeforces, MMLU-Pro, ARC-AGI-1) recover the remainder of a model’s scorecard to within 3.93 points MedAE (or 4.55 with low-cost probes), outperforming random probe sets by significant margins.
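
Greedy probe selection can be sketched as a forward-selection loop. To stay short, this sketch swaps BENCHPRESS for ordinary least squares as the predictor and scores each candidate set in-sample on a fully observed synthetic matrix; both are simplifying assumptions, and `recovery_error` and `greedy_probes` are hypothetical names.

```python
import numpy as np

def recovery_error(scores, probes):
    """In-sample MedAE when every non-probe column is predicted from the
    probe columns by ordinary least squares (a stand-in predictor)."""
    rest = [j for j in range(scores.shape[1]) if j not in probes]
    design = np.column_stack([np.ones(scores.shape[0]), scores[:, probes]])
    coef, *_ = np.linalg.lstsq(design, scores[:, rest], rcond=None)
    return float(np.median(np.abs(design @ coef - scores[:, rest])))

def greedy_probes(scores, budget):
    """Add one benchmark at a time, always the candidate that yields the
    lowest recovery error. Requires budget < number of benchmark columns."""
    chosen = []
    for _ in range(budget):
        candidates = [j for j in range(scores.shape[1]) if j not in chosen]
        best = min(candidates, key=lambda j: recovery_error(scores, chosen + [j]))
        chosen.append(best)
    return chosen

# Synthetic, fully observed history: 30 models x 8 score-like columns
# generated from two hidden factors plus noise (not the paper's data).
rng = np.random.default_rng(0)
history = 50 + 10 * rng.normal(size=(30, 2)) @ rng.normal(size=(2, 8)) + rng.normal(size=(30, 8))
probes = greedy_probes(history, budget=3)
print("selected benchmark columns:", probes)
```

Greedy selection evaluates far fewer candidate sets than an exhaustive search over all five-benchmark subsets, at the cost of not being guaranteed to find the best set.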

Notably, reasoning and math-oriented benchmarks dominate probe sets, reflecting their high mutual predictivity and alignment with the principal axes of matrix variance.

Ranking Preservation

The preservation of pairwise model rankings is evaluated. With a five-point margin on true scores, BENCHPRESS-completed scores match the actual orderings 92.1% of the time for same-benchmark model pairs. For practical deployment, this validates the use of matrix-completed scores for comparative evaluation and candidate shortlisting under uncertainty tolerances.
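
One way to compute such a ranking-agreement figure is sketched below. It assumes the margin means that only model pairs whose true scores differ by at least five points are counted; the summary does not spell out the exact rule, so treat this reading, the function name, and the toy scores as assumptions.

```python
from itertools import combinations

def pairwise_agreement(true_scores, pred_scores, margin=5.0):
    """Share of model pairs on one benchmark whose predicted order matches
    the true order, counting only pairs with a true gap of at least `margin`."""
    agree = 0
    total = 0
    for a, b in combinations(range(len(true_scores)), 2):
        true_gap = true_scores[a] - true_scores[b]
        if abs(true_gap) < margin:
            continue  # too close to call on the true scores; skip the pair
        total += 1
        pred_gap = pred_scores[a] - pred_scores[b]
        if true_gap * pred_gap > 0:
            agree += 1
    return agree / total if total else float("nan")

# Toy scores for four models on a single benchmark (not from the paper).
true_scores = [80.0, 70.0, 72.0, 50.0]
pred_scores = [78.0, 73.0, 69.0, 75.0]
agreement = pairwise_agreement(true_scores, pred_scores)
print(agreement)
```

Here five pairs clear the margin and the predictions order three of them correctly; the pair with true scores 70 and 72 is ignored because its true gap is under five points.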

Temporal Generalization for Deployment

For newly released models absent from the training matrix, BENCHPRESS can still predict scorecards reliably after a small seed evaluation. With five revealed scores, MedAE drops to 4.83 points; with ten, it falls to 2.57. The error distribution narrows as more probes are observed, demonstrating robust temporal transfer given sufficient anchor points.
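
A plausible way to score a model that is absent from the training matrix is a "fold-in": keep the fitted benchmark-side parameters fixed and solve a small ridge regression for the new model's offset and two factors from its seed scores. Whether the paper does this or refits the whole matrix is not stated in this summary, so the sketch below is an assumption. It works in standardized logit space, and `fold_in`, `col_off`, and `V` are hypothetical names.

```python
import numpy as np

def fold_in(seed_values, seed_idx, col_off, V, reg=1e-6):
    """Estimate a new model's row offset and latent factors from a few seed
    benchmarks, then predict every benchmark (standardized logit space)."""
    design = np.column_stack([np.ones(len(seed_idx)), V[seed_idx]])
    rhs = seed_values - col_off[seed_idx]
    gram = design.T @ design + reg * np.eye(design.shape[1])
    theta = np.linalg.solve(gram, design.T @ rhs)  # [row offset, factor 1, factor 2]
    return theta[0] + col_off + V @ theta[1:]

# Synthetic benchmark-side parameters for 12 benchmarks (not fitted on real data).
rng = np.random.default_rng(0)
V = rng.normal(size=(12, 2))
col_off = rng.normal(size=12)

# A synthetic "new model" that follows the rank-2 structure exactly.
truth = 0.3 + col_off + V @ np.array([1.0, -0.5])
seed_idx = np.array([0, 3, 5, 8, 10])  # five revealed benchmarks
predicted = fold_in(truth[seed_idx], seed_idx, col_off, V)
max_error = float(np.max(np.abs(predicted - truth)))
print(f"largest error in standardized logit units: {max_error:.2e}")
```

With only three unknowns per model, each extra seed score adds redundancy that averages out noise, which is consistent with the error falling as more probes are revealed.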

Prediction Reliability and Trustworthiness

Prediction reliability is characterized by both benchmark-side and model-side factors:

  • Benchmark-side: Prediction becomes harder with wider score spread (across models), fewer observed model scores, and absence of strongly correlated benchmark neighbors.
  • Model-side: Reasoning models and higher-scoring models are easier to predict, as are models with many observed scores and correlated peers, and those anchored by recent models in the training matrix.

A hybrid reliability estimator (using both ensemble spread and matrix support features) reliably identifies low-risk predictions, achieving selective MedAE as low as 1.83 points for the safest 20% of predictions. This estimator enables practitioners to triage which predictions are trustworthy enough to substitute for benchmark runs.
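
Selective MedAE can be computed by ranking predictions by estimated risk and scoring only the safest share. The sketch below assumes a risk score is already available per prediction; how the paper's hybrid estimator produces it is not reproduced here, and the numbers are toy values.

```python
from statistics import median

def selective_medae(abs_errors, risk_scores, keep_fraction=0.2):
    """MedAE over the `keep_fraction` of predictions with the lowest risk."""
    ranked = sorted(zip(risk_scores, abs_errors))  # lowest risk first
    keep = max(1, round(keep_fraction * len(ranked)))
    return median(err for _, err in ranked[:keep])

# Toy absolute errors and risk scores for ten predictions (not from the paper).
abs_errors = [1.0, 9.0, 2.0, 7.0, 0.5, 6.0, 3.0, 8.0, 1.5, 4.0]
risk_scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.4, 0.85, 0.15, 0.5]
safest = selective_medae(abs_errors, risk_scores)
overall = median(abs_errors)
print(safest, overall)
```

A useful risk score makes the selective figure much lower than the overall median, as in the toy data, where the two safest predictions have errors of 0.5 and 1.0.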

Limitations and Future Directions

The approach is inherently snapshot-dependent: expansion of the matrix or emergence of models with new capability profiles may alter the rank-2 geometry underlying BENCHPRESS. Prediction is only as good as benchmark construction and reporting; for noisy or poorly specified benchmarks, predictions faithfully reproduce noise. Application to instance-level benchmark outcomes or specialized domains (audio, robotics, scientific simulation) remains unexplored. Integrating metadata—model architecture, training data, external features—could further anchor predictions for outliers. Regular recalibration of probe sets and rank selection is required as the matrix evolves.

Conclusion

The paper demonstrates that large-scale LLM benchmark evaluation is highly redundant and governed by a low-dimensional latent structure. BENCHPRESS provides principled, efficient matrix completion based on logit-space ALS, enabling substantial reduction in evaluation effort, robust scorecard recovery, and preservation of model ranking decisions. These claims rest on empirical results (4.63 MedAE at full coverage, 92.1% pairwise ranking agreement) rather than formal guarantees. The practical implications are significant: practitioners can conduct evaluations with a small subset of informative benchmarks and rely on matrix completion for the remainder, subject to explicit reliability characterization. Future work should address cross-snapshot stability, extension to new domains, integration of model metadata for outlier prediction, and instance-level evaluation compression.

Explain it Like I'm 14

Overview: What this paper is about

Imagine every AI model (like ChatGPT or Claude) has a giant report card with scores on lots of different tests—math, coding, reasoning, tool use, and more. Running all those tests is slow and expensive. This paper asks: do we really need to run every test every time?

The authors show that you can usually guess the rest of a model’s report card from just a few test results. They build a simple tool called BENCHPRESS that predicts missing scores very accurately, so people can run fewer tests and still understand how a model performs.

What questions did the authors ask?

  • Are all these different AI tests measuring lots of different things, or mostly the same few skills?
  • If we know a model’s score on a small set of tests, can we reliably predict its scores on the rest?
  • Which small set of tests should we run to get the best picture of a model?
  • When should we trust these predictions, and when should we be careful?

How did they study it? (Simple explanation of the methods)

Think of a big table:

  • Rows = AI models (84 models).
  • Columns = benchmarks (133 tests).
  • Cells = scores (only about 23% of cells were filled from public sources).

The authors did four main things:

  1. Built the “big table” of scores
    • They collected public scores from blogs, model cards, leaderboards, and reports.
    • They cleaned it up so different versions of the same test or model didn’t double-count.
  2. Looked for hidden patterns
    • They found that a model’s performance across many tests is mostly governed by just two hidden factors, like two “sliders” that control how well a model does on lots of different tests.
    • In math terms, they say the table is “rank-2,” which means two main ingredients explain most of the variation. A good analogy is music charts: if you know someone’s taste along two axes (say “pop vs. rock” and “old vs. new”), you can predict a lot of their song ratings.
  3. Built a predictor called BENCHPRESS
    • BENCHPRESS is a “fill-in-the-blanks” method for the score table. This approach is similar to how streaming apps recommend movies: they look at patterns across people and films to guess what you’ll like. Here, it looks at patterns across models and tests to guess missing scores.
    • It uses a technique called “matrix completion,” together with a smart re-scaling of percentage scores so that differences near 0% and 100% are treated fairly. In practice:
      • It adjusts for each model’s overall level and each test’s difficulty.
      • Then it uses two hidden factors (the “two sliders”) to fine-tune the guess.
  4. Added a “trust meter”
    • The tool also estimates how confident it is about each prediction, based on things like:
      • How much data there is for that model and test,
      • Whether there are similar models to compare to,
      • How recently the data was updated,
      • How much different reasonable methods disagree.
    • It produces prediction intervals (like: “we’re 90% sure the score is between X and Y”).

They also compared BENCHPRESS to asking a powerful AI (like GPT‑5.5) to guess scores directly. The AI did well when it saw real names of models and tests (which might tap into memory of public leaderboards), but BENCHPRESS was more consistent and cheaper when names were hidden and when you need many predictions.

What did they find, and why is it important?

  1. Two numbers explain most of the score patterns
    • Across all those tests and models, two hidden factors explain over 90% of the differences.
    • This means many tests overlap a lot in what they measure.
  2. Accurate predictions from a few tests
    • BENCHPRESS predicts missing scores with a typical error of about 4.6 points (on a 0–100 scale) when filling the whole table.
    • Even better: if you actually run just 5 carefully chosen tests for a new model, you can predict the rest of its public scorecard with a typical error of about:
      • 3.93 points using a best-performing set of five tests.
      • 4.55 points using a cheaper, low-compute set of five tests.
    • Example five-test sets the paper found:
      • High-signal set: GPQA-Diamond, HLE, Codeforces Rating, MMLU‑Pro, ARC‑AGI‑1
      • Lower-cost set: GPQA‑Diamond, MMLU‑Pro, Aider Polyglot, MATH‑500, AIME 2026
  3. Rankings are mostly preserved
    • If you care about “which model is better on this test?”, predictions preserve 92.1% of pairwise model orderings (within a 5-point tolerance). That’s useful for quick comparisons.
  4. Works on new models too
    • For models that weren’t in the training data, BENCHPRESS still does well: with 5 seed scores, the typical error was about 5.0 points.
  5. LLMs vs. BENCHPRESS as predictors
    • A powerful AI can guess some scores if it sees real model and test names (likely using public knowledge).
    • But when names are hidden, BENCHPRESS is more accurate and far more scalable (it fits once and predicts everything, rather than paying for many AI guesses).

Why this matters:

  • Running fewer tests saves time and money without losing much accuracy.
  • Teams can quickly track training progress, compare design choices, and decide which checkpoints to release.
  • Users can shortlist models faster for their needs.

What are the limits and cautions?

The authors are careful about what this does—and doesn’t—mean:

  • Mixed data sources: Many scores come from different places, with different prompts and settings. That adds noise and bias.
  • Snapshot: The “two-slider” pattern is based on the current collection (84 models × 133 tests). If future models are very different, the pattern could shift.
  • Benchmarks still matter: Predictability doesn’t mean tests are pointless. You still need real evaluations to:
    • Find new failure modes,
    • Check for cheating or contamination,
    • Track distribution shifts,
    • Shape good incentives for model developers.
  • Probe sets can change: The best “five tests” today might not be the best tomorrow as benchmarks and models evolve.

Bottom line: Why this research is useful

This paper shows that AI model evaluations are highly connected—many tests check similar underlying abilities. Because of that, you don’t always have to run every test to get a good picture. With BENCHPRESS, a few well-chosen tests can predict the rest, giving:

  • Faster and cheaper evaluations,
  • Reliable comparisons between models,
  • Practical guidance on which tests to run first,
  • A built-in “trust meter” for knowing when predictions are safe to use.

They also released the dataset and code, so others can try it, improve it, and keep it up to date as new models and benchmarks appear.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, focused list of the main uncertainties and unaddressed issues that the paper leaves open, phrased to guide actionable follow-up work:

  • Quantify and correct heterogeneous reporting bias: How much do vendor-reported scores, differing harnesses, judges, prompts, and decoding settings inflate cross-benchmark correlations and compress apparent rank? Run controlled re-evaluations to measure and de-bias the matrix.
  • Model measurement noise explicitly: Incorporate per-benchmark uncertainty (e.g., replicate variance, item counts) via weighted matrix completion; assess whether weighting reduces prediction error and alters probe-set choices.
  • Address structured missingness (MNAR): Popular models × benchmarks dominate; evaluate MNAR-aware matrix completion and selection models, and test on synthetic balanced subsamples to quantify bias from non-uniform observation.
  • Generalization under temporal and distribution shift: Use time-split evaluations (train on data up to date T, test on post-T models/benchmarks) and hold out entire benchmark families (e.g., safety, multimodal) to stress-test the rank-2 assumption and error.
  • Detect and adapt when rank-2 breaks: Develop online tests to flag when new rows/columns are incompatible with rank-2 geometry and devise adaptive-rank or mixture-of-factors alternatives.
  • Cold-start capability: What can be predicted for a brand-new model with zero seed scores? Explore hybrid models that use metadata (provider, size, reasoning mode, release date) to provide initial priors before any probes.
  • Interpretability of latent factors: What do the two factors capture (e.g., general reasoning vs. knowledge, provider-specific residuals)? Study factor stability across time, providers, and benchmark categories.
  • Category- and modality-specific reliability: Provide fine-grained error and calibration for underrepresented categories (safety/behavior, hallucination/factuality, long-context, multimodal/vision) and investigate whether bespoke subspace models improve accuracy.
  • Robustness to non-percentage scales: Systematically evaluate and extend link functions for non-percentage metrics (e.g., Elo, Codeforces rating) and consider per-benchmark monotone transforms instead of a one-size-fits-all logit.
  • Hyperparameter and initialization sensitivity: Quantify variability from ALS regularization and random initializations; provide stability diagnostics and recommended settings for practitioners.
  • Uncertainty calibration under MNAR and shift: The conformal intervals and trust probabilities are proposed—validate coverage and sharpness on future releases and out-of-domain benchmarks; analyze when ensemble spread miscalibrates true error.
  • Tail-risk characterization: Report and optimize for 90th/95th percentile errors (not only medians); design risk-aware probe selection objectives that bound worst-case deviations.
  • Probe-set overfitting and portability: Current probe sets are chosen on the same snapshot; evaluate selection on held-out future data and propose procedures (e.g., cross-time selection, stability selection) that yield durable probe sets.
  • Richer, explicit cost models: Move beyond a “low-cost allowlist” to per-benchmark dollar/time/latency costs and parallelization constraints; formulate multi-objective probe selection (cost, MedAE/MedAPE, risk, category coverage).
  • Adversarial/gaming resistance: Study whether models can overfit to the chosen probes to inflate imputed scores; design rotation, diversification, or audit mechanisms to mitigate Goodhart’s law.
  • Provider- and source-bias correction: With ~80% provider-sourced scores, quantify optimistic bias vs. third-party runs; learn and remove source-specific offsets or fit separate models by provenance.
  • Harness/judge normalization: Introduce multi-view or mixed-effects models to adjust for systematic differences between evaluation harnesses and judges within the same benchmark.
  • Active benchmark design: Use residuals and factor loadings to propose new benchmarks that maximally increase rank or reduce predictive uncertainty; formalize an information-gain criterion for new column creation.
  • Integration with item-level compression: Quantify end-to-end savings and accuracy by combining BENCHPRESS (cross-benchmark) with item-level methods (e.g., Scales++, MetaBench); define coordinated probe+item selection strategies.
  • Metadata-augmented predictors: Compare pure collaborative filtering to models that incorporate side-information (architecture, training data scale, reasoning setting) for better cold-start and OOD performance.
  • Benchmark contamination and leakage: Assess how training-data contamination or benchmark familiarity affects cross-benchmark correlations and low-rank geometry; validate on decontaminated subsets.
  • OOD novelty and anomaly detection: Build detectors that trigger “don’t trust” flags for predictions when model/benchmark representations are far from the training manifold; define automatic fallback to running the real eval.
  • Per-benchmark transform selection: Explore per-column link functions or generalized linear models that better handle ceiling/floor effects and varying difficulty profiles instead of a uniform logit transform.
  • Multi-lingual and domain coverage: Evaluate whether low-rank structure persists across non-English tasks, domain-specific evaluations (e.g., legal, medical), and code vs. natural language; extend the matrix accordingly.
  • Comparative baselines: Test stronger probabilistic and non-linear models (e.g., Bayesian PMF with hierarchical priors, MNAR-aware CF, kernelized/non-linear latent factor models) with calibrated uncertainty against BENCHPRESS.
  • Governance and incentives: Analyze whether free imputed scores disincentivize actual evaluation; propose reporting standards that balance efficiency with scientific rigor and safety monitoring.

Practical Applications

Below are actionable, real-world applications derived from the paper’s findings (rank‑2 geometry of LLM evals) and innovations (BENCHPRESS: logit‑space rank‑2 ALS matrix completion, probe-set selection, reliability/uncertainty layer). Each item lists target sectors, potential tools/products/workflows, and key assumptions/dependencies that affect feasibility.

Immediate Applications

  • Benchmark cost reduction via probe-based scorecard recovery (software/AI, cloud, MLOps, academia): BENCHPRESS can reconstruct a model’s public scorecard to within ≈4–5 raw points from 5 probe benchmarks (3.93 MedAE with top probes; 4.55 with low-cost probes), letting teams run a handful of tests instead of 40+.
    • Tools/products/workflows: Integrate microsoft/benchpress in CI to auto-run {GPQA-D, HLE, Codeforces, MMLU‑Pro, ARC‑AGI‑1} or low-cost {GPQA‑D, MMLU‑Pro, Aider Polyglot, MATH‑500, AIME 2026} probes, then impute the rest; dashboard with predicted scores and 90% conformal intervals.
    • Assumptions/dependencies: Rank‑2 structure holds for your model family; probe scores are available; public-score heterogeneity means predictions reflect the public matrix rather than a fully standardized re-eval.
  • Rapid model triage during training and ablation sweeps (software/AI R&D, academia): Use a few probes per checkpoint to approximate broad evals, identify promising runs, and cut repeated full-suite executions.
    • Tools/products/workflows: Auto-eval “triage step” in training pipelines; schedule full eval only for checkpoints that pass predicted-score thresholds.
    • Assumptions/dependencies: Similarity of in-progress checkpoints to matrix models; structured missingness may bias estimates when experimenting far from current frontier models.
  • Model selection and procurement shortlisting (enterprise IT, finance, healthcare, government, education): Shortlist models for pilots using predicted per-benchmark scores and preserved rankings (≈92% pairwise ranking preservation within ±5 points).
    • Tools/products/workflows: Vendor-neutral selection dashboards with predicted scores + uncertainty; RFP templates that accept “probe+impute” evidence for initial down‑selection.
    • Assumptions/dependencies: Regulatory context may require subsequent verification on high-stakes tasks; predictions carry measurement noise from sources.
  • Leaderboard maintenance and gap-filling (evaluation platforms, communities, media): Fill missing cells in public leaderboards to keep comparisons current as new models/benchmarks appear.
    • Tools/products/workflows: “Impute leaderboard” mode with confidence bands; automatic flagging of low-reliability cells for human review.
    • Assumptions/dependencies: Transparency about imputed vs. measured entries; community acceptance of provisional rankings.
  • Budget-aware evaluation planning (MLOps, academic labs, startups): Plan an eval budget with low-cost probe allowlists that achieve near-best recovery accuracy.
    • Tools/products/workflows: Budget planners that recommend the next best probe per dollar; cost-aware greedy selection embedded in eval harnesses.
    • Assumptions/dependencies: Probe costs and harness availability; low-cost list must be refreshed as the matrix evolves.
  • Predicting newly released models from seed scores (vendors, integrators, analysts): With ~5 seed scores, BENCHPRESS attains ≈5.0 MedAE on models released after the training snapshot, enabling quick capability previews.
    • Tools/products/workflows: “Fast launch readout” for model releases; partner enablement kits that generate provisional scorecards from small seed evals.
    • Assumptions/dependencies: New model not radically out-of-distribution relative to the matrix; recency of anchors matters for reliability.
  • Reliability-gated decision-making (safety teams, risk & compliance, platform ops): Use BENCHPRESS’s trust probabilities and conformally calibrated 90% intervals to gate actions (e.g., require full eval if the interval is wide).
    • Tools/products/workflows: Policy rules: “If interval width > X or trust < τ, auto‑schedule full eval”; exception queues for manual review.
    • Assumptions/dependencies: Calibration depends on matrix composition; uncertainty estimates degrade if peer coverage is very low.
  • Detection of novelty and drift in capability profiles (research, eval ops): Low trust signals (few peers, weak neighbor benchmarks, stale anchors) flag potential novelty or distribution shift and trigger expanded testing.
    • Tools/products/workflows: “OOD lite” monitor using ensemble spread and coverage signals; alerting to run diverse or new benchmarks.
    • Assumptions/dependencies: Novel capability axes beyond rank‑2 reduce predictability; requires periodic matrix refresh.
  • Benchmark portfolio pruning and de-duplication (benchmark maintainers, academic consortia): Identify redundant benchmarks (highly predictable from others) and prioritize unique ones for routine reporting.
    • Tools/products/workflows: Correlation maps and per-benchmark predictability scores; governance to retire or rotate redundant tests.
    • Assumptions/dependencies: Redundancy at current snapshot; future tasks may restore uniqueness.
  • Course and lab evaluation on a budget (education, bootcamps): Teach evaluation best practices with probe+impute to conserve compute while exposing students to multi-benchmark analysis.
    • Tools/products/workflows: Classroom kits using the released dataset and code; assignments on probe selection and uncertainty.
    • Assumptions/dependencies: Institutional acceptance of predicted scores for instruction; compute access to run a few probes.
  • Marketplace discovery and pricing signals (cloud model hubs, API providers): Show estimated performance on unreported benchmarks to improve search, filtering, and price/performance comparisons.
    • Tools/products/workflows: Model cards augmented with predicted scores and confidence bars; “estimated ranking” badges.
    • Assumptions/dependencies: Clear labeling of predicted vs. measured; legal/marketing review for claims.
  • Audit triage for provider-reported scores (independent labs, journalists): When provider claims diverge from BENCHPRESS predictions beyond calibrated intervals, prioritize those cells for reproduction.
    • Tools/products/workflows: “Claim deviation” tracker; allocation of limited audit funds to high-discrepancy, high-impact cells.
    • Assumptions/dependencies: Public-score heterogeneity (prompting, harnesses) can cause benign divergences; requires careful forensic checks.
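
The reliability-gated rule in the list above ("if interval width > X or trust < τ, schedule a full eval") can be sketched with a basic split-conformal interval. This uses one global half-width taken from held-out absolute errors, which is a simplifying assumption and probably cruder than the paper's calibrated intervals; the thresholds `max_width` and `min_trust` and the calibration numbers are placeholders.

```python
import math

def conformal_halfwidth(calibration_abs_errors, coverage=0.9):
    """Split-conformal half-width: the ceil((n + 1) * coverage)-th smallest
    absolute error on a held-out calibration set."""
    n = len(calibration_abs_errors)
    rank = math.ceil((n + 1) * coverage)
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(calibration_abs_errors)[rank - 1]

def needs_full_eval(halfwidth, trust, max_width=10.0, min_trust=0.8):
    """Run the real benchmark if the interval is too wide or the trust score
    is too low. Both thresholds are placeholders, not values from the paper."""
    return 2.0 * halfwidth > max_width or trust < min_trust

# Nineteen made-up calibration errors: 0.5, 1.0, ..., 9.5 points.
calibration = [0.5 * k for k in range(1, 20)]
halfwidth = conformal_halfwidth(calibration)
prediction = 62.0
interval = (max(0.0, prediction - halfwidth), min(100.0, prediction + halfwidth))
run_real_benchmark = needs_full_eval(halfwidth, trust=0.95)
print(interval, run_real_benchmark)
```

In this toy case the interval is 18 points wide, so the gate asks for the real benchmark even though the trust score is high.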

Long-Term Applications

  • Adaptive, closed-loop evaluation orchestration — MLOps, platform engineering Systems that choose the next best benchmark or even next best items adaptively based on value of information and uncertainty, combining BENCHPRESS with item-level methods (IRT, Scales++, etc.) for end-to-end 10–100× cost savings. Tools/products/workflows: “Evaluation autopilot” that mixes probe selection and item condensation; SLAs on ranking accuracy with minimal cost. Assumptions/dependencies: Robust integration of benchmark- and item-level selection; reliable uncertainty estimation across evolving model families.
  • Standards for “predict-then-verify” compliance — regulators, auditors, policy Regulatory frameworks that allow probe+impute for preliminary conformance, followed by targeted verification on critical benchmarks. Tools/products/workflows: Compliance checklists that specify approved probe sets and minimum trust thresholds; randomized spot checks. Assumptions/dependencies: Policy acceptance; sector-specific safety constraints (e.g., healthcare) may mandate full measurement.
  • Data-driven benchmark design and investment — funders, research consortia Use residuals and factor geometry to identify new benchmarks that add orthogonal information (break predictability) and retire redundant ones. Tools/products/workflows: “Orthogonality score” and expected information gain dashboards to guide new benchmark development. Assumptions/dependencies: Stable estimation of latent factors as the ecosystem grows; community coordination.
  • Capability progress forecasting and roadmapping — strategy teams, analysts Track latent factors over time to forecast benchmark outcomes and plan compute and research investments. Tools/products/workflows: Time-series models over factor scores; scenario planning tools for “if factor A improves by X%…”. Assumptions/dependencies: Factor stability and stationarity; new modalities or training regimes may shift geometry.
  • Enterprise-grade evaluation governance — risk, procurement, legal Formalize evaluation tiers (predictive screening, targeted verification, full audit) with cost, latency, and assurance trade-offs. Tools/products/workflows: Policy engines that map business criticality to required eval tier; automated enforcement in deployment gates. Assumptions/dependencies: Alignment between business risk and acceptable uncertainty; periodic recalibration with incident learnings.
  • Provisional leaderboards with uncertainty (community platforms, media): Persistent leaderboards that display imputed scores with conformal intervals and confidence ratings, updating as new data arrives. Tools/products/workflows: “Nowcasting” leaderboards; APIs serving point estimates and intervals; provenance tracking for reproducibility. Assumptions/dependencies: Norms around reporting predicted results; clear visual and textual disclaimers.
  • Sector-specific pre-certification (healthcare, finance, public sector): Use probe+impute to pre-screen models for non-safety-critical workloads (e.g., back-office summarization), reserving full eval for mission-critical use. Tools/products/workflows: Domain-specific probe bundles; integration with red-teaming and safety checklists. Assumptions/dependencies: Regulatory boundaries; domain shift between public benchmarks and sector data.
  • Model development loop optimization (frontier labs): Incorporate predicted cross-benchmark deltas into early stopping, curriculum decisions, and resource allocation for large-scale training. Tools/products/workflows: “Predicted gains” dashboards for checkpoints; auto-prioritization of data/architecture variants to validate with measured runs. Assumptions/dependencies: Predictive validity for models mid-training; sensitivity to novel training techniques.
  • Drift and novelty sentinels for production LLM ops (platform ops): Monitor prediction residuals over time to detect changes in model behavior or evaluation conditions, triggering re-benchmarking. Tools/products/workflows: Residual-based anomaly detection; feedback loops to refresh probe sets and retrain BENCHPRESS on new snapshots. Assumptions/dependencies: Sufficient fresh ground-truth to recalibrate; robust handling of changing eval harnesses.
  • Marketplace pricing and SLAs tied to estimated performance (cloud/providers): Use predicted performance on key suites to inform pricing tiers and performance SLAs where full measurement is prohibitively costly. Tools/products/workflows: “Estimated SLA” calculators; customer-facing guidance for expected quality on unreported tasks. Assumptions/dependencies: Contractual acceptance of estimates; mechanisms to reconcile when measured outcomes diverge.
  • Cross-modal and safety expansion (multimodal AI, safety research): Extend low-rank prediction and reliability tooling to multimodal, tool-use, and safety/hazard evaluations as datasets mature. Tools/products/workflows: Multimodal score matrices; safety-specific probe bundles; conservative uncertainty thresholds. Assumptions/dependencies: Comparable scoring scales and sufficient coverage; evolving tasks may be less low-rank.
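
The "imputed scores with conformal intervals" idea in the leaderboard item can be illustrated with plain split conformal prediction. The sketch below is a generic textbook construction, not the paper's reliability layer; the function name `split_conformal_interval` and its inputs are hypothetical, and it assumes a held-out calibration set of predicted and measured scores in score points.

```python
import math


def split_conformal_interval(cal_pred, cal_true, new_pred, coverage=0.90):
    """Symmetric split-conformal interval around a new point prediction.

    cal_pred / cal_true: predicted and measured scores on a held-out
    calibration set (hypothetical inputs, in score points).
    """
    # Absolute residuals on the calibration set, sorted ascending.
    residuals = sorted(abs(p - t) for p, t in zip(cal_pred, cal_true))
    n = len(residuals)
    # Finite-sample conformal rank: ceil((n + 1) * coverage).
    k = math.ceil((n + 1) * coverage)
    if k > n:
        # Too few calibration points to certify the requested coverage.
        return (-math.inf, math.inf)
    q = residuals[k - 1]
    return (new_pred - q, new_pred + q)


# Illustrative numbers only: nine calibration residuals of 1..9 points.
low, high = split_conformal_interval(
    cal_pred=[1, 2, 3, 4, 5, 6, 7, 8, 9],
    cal_true=[0] * 9,
    new_pred=50.0,
)
```

A leaderboard would typically also clip the interval to the valid score range and show it next to the point estimate; an unbounded interval signals that more measured scores are needed before a prediction is displayed.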

Notes on global assumptions across applications:

  • Low-rank geometry is snapshot-dependent; new capability profiles or modalities can break rank‑2 structure, requiring re-derivation of probe sets and retraining the predictor.
  • Public-score heterogeneity (prompts, harnesses, dates) injects noise, so the reported prediction errors are likely upper bounds on what fully standardized re-evals would show.
  • Structured missingness (popular models/benchmarks overrepresented) can bias completion; the reliability layer should be used to gate high-stakes decisions.
  • Benchmarks remain essential for discovering failure modes, data contamination, and distribution shifts; prediction complements, not replaces, measurement.
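
Because the low-rank geometry is snapshot-dependent, one cheap sanity check on a new snapshot is to recompute the stable rank of a fully observed, mean-centered submatrix. The sketch below is an assumption-laden illustration, not the paper's code: `stable_rank` is a hypothetical helper, it uses only the standard library, and it estimates the largest singular value by power iteration.

```python
import math


def stable_rank(matrix, iters=200):
    """Stable rank ||M||_F^2 / s_1^2 of a fully observed score submatrix.

    Rows are models, columns are benchmarks. Columns are mean-centered
    first; s_1 is estimated by power iteration on M^T M. Values near 1
    mean that one component dominates.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    m = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    frob_sq = sum(x * x for row in m for x in row)
    if frob_sq == 0.0:
        return 0.0  # all models identical on every benchmark
    v = [1.0 / (j + 1) for j in range(n_cols)]  # arbitrary nonzero start
    for _ in range(iters):
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in m]  # u = M v
        w = [sum(m[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = M^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # start vector fell in the null space
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in m]
    s1_sq = sum(x * x for x in u) / sum(x * x for x in v)
    return math.inf if s1_sq == 0.0 else frob_sq / s1_sq


# Illustrative matrix: the second benchmark is exactly twice the first.
one_factor = stable_rank([[10.0, 20.0], [20.0, 40.0], [30.0, 60.0]])
```

A stable rank that drifts well above the value seen on the original snapshot would be one signal that probe sets and the predictor should be re-derived; the threshold for "well above" is a judgment call that this sketch does not settle.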

Glossary

  • Alternating least squares (ALS): An iterative optimization method that factorizes a matrix by alternating closed-form least-squares updates for its factors. "alternating least squares (ALS) matrix-completion method in logit space (Koren et al., 2009)."
  • Arcsinh: A nonlinear transform that smoothly compresses large values and is defined at zero; used to reshape percentage scores. "Arcsinh. Apply arcsinh(s/50), a smooth approximation to log that is defined at zero."
  • Benchmark-KNN (Bench-KNN): A k-nearest neighbors approach that predicts a score using the k most correlated benchmarks as neighbors. "Benchmark-KNN (Bench-KNN). For each missing entry, find the k benchmarks most correlated with the target benchmark and predict from the model's observed scores on those neighbors, using correlation-based weights."
  • Bias-decomposed alternating least squares (ALS): An ALS variant that models global, row, and column biases plus a low-rank residual interaction. "Bias-decomposed alternating least squares (ALS) (Koren et al., 2009)."
  • Coefficient of determination ($R^2$): A goodness-of-fit metric indicating the fraction of variance explained by a model. "We use the coefficient of determination $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$, where $\mathrm{SST} = \sum_i (y_i - \bar{y})^2$ is the total variance of the target around its mean $\bar{y}$ and $\mathrm{SSE} = \sum_i (y_i - \hat{y}_i)^2$ is the residual variance left by the fit, with $y_i$ the values we are trying to predict..."
  • Collaborative filtering: A prediction technique that infers missing entries using patterns across similar users/items; here applied to models/benchmarks. "Zhang et al. (2024) applied collaborative filtering to LLM scores;"
  • Conformally-calibrated prediction intervals: Statistically calibrated ranges for predictions that achieve a target coverage rate under minimal assumptions. "to estimate trust probabilities and conformally-calibrated 90% prediction intervals for BENCHPRESS predictions (Section 6)."
  • Ensemble spread: A reliability signal capturing disagreement across multiple plausible predictors. "together with ensemble spread, a reliability signal measuring how much plausible score predictors disagree, to estimate trust probabilities..."
  • Item Response Theory (IRT): A psychometric framework modeling item difficulty and respondent ability to select informative subsets. "IRT-based methods include MetaBench (Kipnis et al., 2025), which uses item response theory to keep 3% of items across six benchmarks while preserving aggregate conclusions,"
  • Logit transform: A nonlinear mapping of bounded percentages to an unbounded scale, separating values near 0 and 100. "Logit. Apply log(s/(100 - s)), mapping the bounded score range to an unbounded scale that symmetrically spreads apart scores near both 0 and 100."
  • Low-rank completion: Filling in missing matrix entries by assuming the data lies near a matrix of small rank. "the score matrix should be predictable from a low-rank completion."
  • Matrix completion: The task of inferring missing matrix entries from observed ones under structural assumptions. "Soft-Impute (Mazumder et al., 2010), a standard matrix-completion method that alternates between filling missing entries and taking a low-rank SVD approximation."
  • Median absolute error (MedAE): The median of absolute prediction errors measured in the original score units. "With only five benchmark probes selected on the current matrix, pooled MedAE drops to 3.93 score points"
  • Median absolute percentage error (MedAPE): The median of absolute percentage errors, robust to heavy-tailed error distributions. "We evaluate on held-out entries using Median Absolute Percentage Error (MedAPE), the median of absolute percentage errors |predicted - true| / |true| × 100%."
  • Mean-centering: Subtracting column means so leading components reflect variation rather than average levels. "We mean-center each benchmark column before computing SVD, so that the leading component reflects directions of model variation rather than the shared average score level"
  • Model-KNN: A k-nearest neighbors method that predicts a score using the closest models by distance over shared benchmarks. "Model-KNN. Find the k models closest to the target model by root-mean-square distance over shared observed benchmarks, then average their scores on the target benchmark."
  • Non-negative Matrix Factorization (NMF): A factorization technique restricting both factors to be non-negative, often for interpretability. "NMF. Non-negative matrix factorization (Lee and Seung, 1999), constraining both factors to be non-negative."
  • Nuclear norm minimization: A convex relaxation of rank minimization that penalizes the sum of singular values. "Nuclear norm minimization. Convex relaxation of rank minimization (Candès and Recht, 2009): minimize the nuclear norm of the completed matrix plus a squared-error fit on observed entries,"
  • PCA (Principal Component Analysis): A dimensionality reduction method based on the SVD that extracts dominant variance directions. "via PCA, consistent with the rank-2 geometry we recover in Section 3.3 on a different (heterogeneous, frontier-era) matrix."
  • Pearson correlation: A measure of linear association between two variables, ranging from −1 to 1. "Throughout the paper, 'correlation' refers to the Pearson correlation: for two columns a, b of length n, ρ(a, b) = ..."
  • Probabilistic Matrix Factorization (PMF): A Bayesian factorization model with Gaussian priors on latent factors. "PMF. Probabilistic matrix factorization (Mnih and Salakhutdinov, 2008) with Gaussian priors on both factors."
  • Probit transform: A nonlinear mapping using the inverse standard normal CDF to unbound percentage scores. "Probit. Apply Φ⁻¹(s/100), where Φ is the standard normal CDF."
  • Quantile transform: A non-parametric mapping replacing scores with within-benchmark ranks to achieve uniform marginals. "Quantile. Replace each score with its within-benchmark rank divided by n + 1, producing uniform marginals."
  • Regularization: A penalty term (often λ) used to prevent overfitting in factorization or regression. "We fix the rank to 2 following the held-out rank sweep in Section 3.3; the only tunable hyperparameter is the regularization λ."
  • Root-mean-square (RMS) distance: A distance metric based on the square root of mean squared differences across dimensions. "Find the k models closest to the target model by root-mean-square distance over shared observed benchmarks,"
  • Singular Value Decomposition (SVD): A matrix factorization into singular vectors and singular values, used for low-rank approximation. "Singular Value Decomposition (SVD) of fully-observed submatrices shows matching rank-2 geometry."
  • Soft-Impute: An iterative matrix completion algorithm alternating between low-rank SVD and re-imputation. "Soft-Impute (Mazumder et al., 2010) iterates between SVD truncation at a chosen rank and re-imputation of missing entries until convergence."
  • Stable rank: An effective rank measure, the squared Frobenius norm divided by the squared largest singular value, indicating how concentrated a matrix’s singular values are. "the stable rank $\|M\|_F^2 / s_1^2$ measures effective dimensionality: values near 1 mean that one component dominates."
  • Standardization: Scaling columns to zero mean and unit variance before modeling to normalize features. "After applying the chosen transform, we standardize each benchmark column to zero mean and unit variance, so every prediction method operates entirely in the transformed, standardized space."
  • Structured missingness: Non-random patterns of missing data that can bias analyses and invalidate assumptions. "Structured missingness. Popular models × popular benchmarks are over-represented, violating the uniform sampling assumption underlying standard matrix completion guarantees."
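
Several glossary entries (logit transform, mean-centering, ALS, regularization, matrix completion) combine into one pipeline. The sketch below shows a simplified rank-2 ALS completion in logit space. It is not the BENCHPRESS implementation: it keeps only a column mean rather than the full bias decomposition, skips unit-variance standardization, and the names `logit`, `inv_logit`, `als_complete`, the `eps` clipping, and the default `lam` are assumptions made for illustration.

```python
import math
import random


def logit(s, eps=0.5):
    """Map a percentage score to logit space; eps clipping is an assumption."""
    s = min(max(s, eps), 100.0 - eps)
    return math.log(s / (100.0 - s))


def inv_logit(z):
    """Map a logit-space value back to a percentage score."""
    return 100.0 / (1.0 + math.exp(-z))


def _ridge_solve_2d(rows, lam):
    """Closed-form ridge solution for a 2-d factor; rows = [(features, target)]."""
    a11 = a22 = lam
    a12 = b1 = b2 = 0.0
    for (f1, f2), y in rows:
        a11 += f1 * f1
        a12 += f1 * f2
        a22 += f2 * f2
        b1 += f1 * y
        b2 += f2 * y
    det = a11 * a22 - a12 * a12  # positive whenever lam > 0
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


def als_complete(scores, lam=0.1, iters=50, seed=0):
    """Fill a models-by-benchmarks matrix; None marks a missing cell."""
    n, m = len(scores), len(scores[0])
    obs = [(i, j, logit(scores[i][j]))
           for i in range(n) for j in range(m) if scores[i][j] is not None]
    # Center each benchmark column in logit space.
    col_mean = []
    for j in range(m):
        vals = [z for (_, jj, z) in obs if jj == j]
        col_mean.append(sum(vals) / len(vals) if vals else 0.0)
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(n)]  # model factors
    V = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(m)]  # benchmark factors
    for _ in range(iters):
        # Alternate: fix V and solve each model row, then fix U and solve each column.
        for i in range(n):
            U[i] = _ridge_solve_2d(
                [(V[j], z - col_mean[j]) for (ii, j, z) in obs if ii == i], lam)
        for j in range(m):
            V[j] = _ridge_solve_2d(
                [(U[i], z - col_mean[j]) for (i, jj, z) in obs if jj == j], lam)
    return [[inv_logit(col_mean[j] + U[i][0] * V[j][0] + U[i][1] * V[j][1])
             for j in range(m)] for i in range(n)]


# Illustrative scores only: the last model is missing its third benchmark.
grid = [[20.0, 50.0, 80.0], [20.0, 50.0, 80.0], [20.0, 50.0, 80.0], [20.0, 50.0, None]]
filled = als_complete(grid)
```

Predictions are made in logit space and mapped back with `inv_logit`, so every filled value stays strictly inside the 0 to 100 range. `lam` must be positive for the 2×2 solve to be well defined, including for models or benchmarks with no observed cells.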
