
You Don't Need to Run Every Eval

Modern frontier language models are evaluated on dozens of benchmarks, and each run carries significant cost and time. This presentation reveals a surprising mathematical structure hidden in benchmark scores: an effectively rank-2 latent space that accounts for over 90% of performance variance. By exploiting this low-dimensional geometry, the authors introduce BENCHPRESS, a matrix completion method that recovers full evaluation scorecards from just five strategically chosen benchmarks with a median error of 3.93 points, cutting evaluation overhead by over 95% while preserving pairwise model rankings 92% of the time.

Script

Evaluating a single frontier language model on 50 or 100 different benchmarks costs thousands of dollars and days of compute time. But the authors discovered something remarkable: nearly all of those scores are governed by just two underlying dimensions.

When they assembled scores from 84 frontier models across 133 benchmarks, the resulting matrix turned out to be effectively rank 2. Over 90% of all performance variance collapses onto a two-dimensional plane, which means benchmark scores are far from independent.
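One way to check a claim like this is to look at how much variance the top two singular directions of the score matrix capture. The sketch below is only an illustration: it uses a synthetic, fully observed matrix built from two latent factors (the real matrix has missing entries and is not reproduced here), and the function name `variance_explained` is hypothetical.

```python
import numpy as np

def variance_explained(scores: np.ndarray, k: int = 2) -> float:
    """Fraction of variance captured by the top-k singular directions
    of a fully observed, column-centered score matrix."""
    centered = scores - scores.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values, descending
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

# Synthetic stand-in, NOT the authors' data: 84 "models" x 133 "benchmarks"
# generated from two latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(84, 2))
loadings = rng.normal(size=(2, 133))
scores = latent @ loadings + 0.1 * rng.normal(size=(84, 133))

ratio = variance_explained(scores, k=2)
print(f"top-2 variance share: {ratio:.3f}")
```

A share close to 1 for k=2 is what "effectively rank 2" means in practice; a matrix of independent benchmarks would spread its variance across many more directions.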

BENCHPRESS exploits this structure by running alternating least squares on logit-transformed scores. It decomposes each score into a global baseline, a per-model offset, a per-benchmark offset, and a rank-2 residual correction. The result is full scorecard recovery with a median error of just 4.63 points.
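The decomposition described here can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: scores are assumed to lie on a 0 to 100 scale with NaN marking missing entries, and the function name `complete_scores`, the ridge penalty `reg`, the clipping constant `eps`, and the iteration count are all hypothetical choices.

```python
import numpy as np

def logit(p, eps=1e-3):
    # Clip so scores of exactly 0 or 100 stay finite after the transform.
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def complete_scores(scores, rank=2, reg=0.1, iters=50, seed=0):
    """Predict every cell of a (models x benchmarks) matrix of 0-100 scores.
    NaN marks a missing score; the return value is the model's prediction
    for all cells, observed or not."""
    mask = ~np.isnan(scores)
    # The placeholder 50 only avoids NaN arithmetic; masked cells never enter a fit.
    z = logit(np.where(mask, scores, 50.0) / 100.0)
    n, m = z.shape
    rng = np.random.default_rng(seed)
    mu = z[mask].mean()                     # global baseline
    a, b = np.zeros(n), np.zeros(m)         # per-model and per-benchmark offsets
    U = 0.1 * rng.normal(size=(n, rank))    # model factors
    V = 0.1 * rng.normal(size=(m, rank))    # benchmark factors
    eye = reg * np.eye(rank)
    for _ in range(iters):
        r = z - mu - U @ V.T                # part the offsets must explain
        for i in range(n):
            o = mask[i]
            a[i] = (r[i, o] - b[o]).sum() / (o.sum() + reg)
        for j in range(m):
            o = mask[:, j]
            b[j] = (r[o, j] - a[o]).sum() / (o.sum() + reg)
        res = z - mu - a[:, None] - b[None, :]  # part the low-rank term must explain
        for i in range(n):                  # ridge solve for each model's factors
            o = mask[i]
            U[i] = np.linalg.solve(V[o].T @ V[o] + eye, V[o].T @ res[i, o])
        for j in range(m):                  # ridge solve for each benchmark's factors
            o = mask[:, j]
            V[j] = np.linalg.solve(U[o].T @ U[o] + eye, U[o].T @ res[o, j])
    return 100.0 * sigmoid(mu + a[:, None] + b[None, :] + U @ V.T)
```

Fitting in logit space keeps predictions inside the 0 to 100 range after the inverse transform, and alternating between the model side and the benchmark side turns each update into a small regularized least-squares problem over observed entries only.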

Using greedy selection, just five benchmarks are enough to recover an entire scorecard. GPQA Diamond, HLE, Codeforces, MMLU-Pro, and ARC-AGI-1 form the selected probe set, which predicts the remaining benchmarks with a median error of 3.93 points, cutting evaluation cost by over 95 percent while preserving pairwise model rankings 92 percent of the time.
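Greedy probe selection and the pairwise ranking check can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' procedure: the score matrix is assumed fully observed, plain least squares with an intercept stands in for the matrix completion predictor, the error is measured in-sample, and the names `probe_error`, `greedy_probes`, and `pairwise_agreement` are hypothetical.

```python
import numpy as np

def probe_error(scores, probes):
    """Median absolute error when the non-probe columns are predicted from the
    probe columns by least squares with an intercept (in-sample stand-in)."""
    rest = [j for j in range(scores.shape[1]) if j not in probes]
    X = np.column_stack([np.ones(scores.shape[0]), scores[:, probes]])
    coef, *_ = np.linalg.lstsq(X, scores[:, rest], rcond=None)
    return float(np.median(np.abs(X @ coef - scores[:, rest])))

def greedy_probes(scores, k=5):
    """Add one benchmark at a time, each time taking the candidate that gives
    the lowest error together with those already chosen.
    Requires k < number of benchmarks."""
    chosen = []
    for _ in range(k):
        candidates = [j for j in range(scores.shape[1]) if j not in chosen]
        chosen.append(min(candidates, key=lambda j: probe_error(scores, chosen + [j])))
    return chosen

def pairwise_agreement(true_scores, predicted_scores):
    """Fraction of model pairs ordered the same way by true and predicted scores."""
    t = np.sign(np.subtract.outer(true_scores, true_scores))
    p = np.sign(np.subtract.outer(predicted_scores, predicted_scores))
    upper = np.triu_indices(len(true_scores), k=1)  # count each pair once
    return float((t[upper] == p[upper]).mean())
```

Greedy selection is a heuristic: it never revisits earlier picks, so the resulting set is not guaranteed to be the best possible set of five, but it avoids searching every combination of benchmarks.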

Not all predictions are equally trustworthy. The authors built a hybrid reliability estimator that flags which inferences are safe to use. For the most reliable 20 percent of predictions, median error drops to just 1.83 points, while less reliable predictions show wider error spreads and have less support from observed entries in the matrix.
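The script does not say how the hybrid estimator computes its scores, so the sketch below covers only the evaluation step: given some per-prediction reliability score (a hypothetical input here), compare the median error of the most reliable 20 percent with that of the rest. The function name `error_by_reliability` is also an assumption.

```python
import numpy as np

def error_by_reliability(abs_errors, reliability, top_fraction=0.2):
    """Median absolute error for the most reliable fraction of predictions
    and for the remainder. Needs at least two predictions."""
    abs_errors = np.asarray(abs_errors, dtype=float)
    reliability = np.asarray(reliability, dtype=float)
    n_top = max(1, int(round(top_fraction * len(abs_errors))))
    order = np.argsort(-reliability)        # most reliable first
    top, rest = order[:n_top], order[n_top:]
    return float(np.median(abs_errors[top])), float(np.median(abs_errors[rest]))

# Toy numbers only: errors grow as reliability falls.
top_err, rest_err = error_by_reliability(
    abs_errors=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    reliability=[10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
)
```

A useful reliability score is one where the first number comes out clearly smaller than the second on held-out predictions.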

The redundancy in benchmark evaluation isn't just wasteful; it's mathematically unnecessary. By running a handful of informative probes and letting matrix completion fill in the rest, practitioners can deploy models faster and cheaper without sacrificing ranking fidelity. To dive deeper into this work and create your own video explanations, visit EmergentMind.com.