Efficient Benchmarking Is Just Feature Selection and Multiple Regression

Published 25 May 2026 in stat.ML, cs.AI, cs.CL, and cs.LG | (2605.25773v1)

Abstract: Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchmarking methods can be greatly improved by simply using kernel ridge regression at the prediction stage. Additionally, using an information-theoretic feature-selection algorithm called minimum redundancy maximum relevance (mRMR), we can further improve upon these methods by selecting question subsets that will be maximally useful for prediction. Except in very data-poor settings, these approaches consistently achieve smaller prediction errors (in both MAE and RMSE), and greater ranking correlation between predicted and true scores (in both Spearman $ρ$ and Kendall $τ$) across a range of benchmarks using both binary and continuous metrics. Furthermore, mRMR subsampling is much faster than competitor methods (which often involve fitting probabilistic models or running clustering algorithms), and is more likely to select the same questions under different random seeds or training data splits. Tutorial code can be found at https://github.com/sambowyer/mrmr_eval .

Authors (3)

Summary

The paper presents efficient LLM benchmarking as a regression and feature selection challenge using mRMR and kernel ridge regression.
It empirically demonstrates that mRMR++ achieves lower RMSE, higher ranking correlation, and significant compute savings compared to traditional methods.
The approach offers practical and theoretical benefits, enabling robust, scalable LLM evaluation with minimal coresets and improved representativity.

Recasting Efficient LLM Benchmarking as Feature Selection and Multiple Regression

Problem Setting and Motivation

The rapid growth in LLM size and complexity has made standard evaluation procedures computationally expensive, especially on large-scale benchmarks with hundreds to thousands of examples, or in settings that require repeated ablations and checkpoint evaluations. The central challenge is to estimate a model's performance on a full benchmark efficiently, using only a subset ("coreset") of the benchmark's questions, for both practical (compute savings) and theoretical (representativeness, generalizability) reasons.

This paper conceptualizes efficient benchmarking as a classic statistical problem: regression with feature selection. Formally, the goal is to (1) select a subset of benchmark questions (feature selection), and (2) learn a predictive function mapping responses on the coreset to the full benchmark score (multiple regression). The authors argue—and demonstrate empirically—that principled application of feature selection via minimum redundancy maximum relevance (mRMR) and kernel ridge regression (KRR) yields superior results compared to the state-of-the-art, including methods based on clustering, item response theory (IRT), sampling, and adaptive selection.

Methodology: mRMR Feature Selection and Kernel Ridge Regression

Feature Selection via mRMR

Feature selection is performed using the mRMR algorithm with the mutual information quotient (MIQ) criterion: select questions that maximize mutual information with the target (the full benchmark score, denoted $Y$) while minimizing redundancy among the selected questions. The objective is optimized greedily: at each step the coreset is extended by the question whose relevance to $Y$, divided by its average redundancy with the current coreset, is largest.
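
The greedy step described above can be written out as a formula. The notation is an assumption made here for illustration and may differ from the paper's: $X_j$ is the vector of source-model scores on question $j$, $S$ is the current coreset, and $I(\cdot\,;\cdot)$ denotes mutual information.

```latex
j^{*} \;=\; \arg\max_{j \notin S}\; \frac{I(X_j; Y)}{\dfrac{1}{|S|} \sum_{i \in S} I(X_j; X_i)}
```

When $S$ is empty, the first question is typically chosen by relevance $I(X_j; Y)$ alone. The difference-based MID variant subtracts the redundancy term from the relevance term instead of dividing by it.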

The paper implements both discrete-discrete and discrete-continuous MI estimation (using Kraskov–Stögbauer–Grassberger, PCA-corrected estimators, and Ross's algorithm), with empirical tuning of the nearest-neighbors parameter $k$. Binary scores are handled by empirical estimation over model responses; continuous metrics are handled via adapted estimators.
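
A minimal sketch of greedy MIQ selection for the binary case, assuming a plug-in (count-based) mutual information estimate and a target that has already been binned into discrete levels. The function names, the `eps` guard against zero redundancy, and the toy data are illustrative assumptions; the paper's tutorial code and its nearest-neighbor estimators for continuous scores are not reproduced here.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def mrmr_miq(features, target, size, eps=1e-12):
    """Greedy mRMR with the MIQ criterion (relevance / mean redundancy).

    features: list of columns, one per question; each column holds one
              discrete score per source model.
    target:   discrete (e.g. binned) full-benchmark score per source model.
    eps:      assumed guard so that zero redundancy does not divide by zero.
    """
    n_features = len(features)
    relevance = [mutual_information(col, target) for col in features]
    # The first pick uses relevance alone because the coreset is empty.
    selected = [max(range(n_features), key=relevance.__getitem__)]
    redundancy_sum = [0.0] * n_features
    while len(selected) < min(size, n_features):
        last = selected[-1]
        best_j, best_score = None, -math.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Accumulate redundancy with the most recently added question.
            redundancy_sum[j] += mutual_information(features[j], features[last])
            mean_redundancy = redundancy_sum[j] / len(selected)
            score = relevance[j] / (mean_redundancy + eps)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected


# Toy data (assumption): 8 source models, 4 questions, target binned to 4 levels.
target = [0, 0, 1, 1, 2, 2, 3, 3]
features = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # q0: separates weak from strong models
    [0, 0, 0, 0, 1, 1, 1, 1],  # q1: exact duplicate of q0 (fully redundant)
    [0, 0, 1, 1, 0, 0, 1, 1],  # q2: equally relevant, independent of q0
    [1, 1, 1, 1, 1, 1, 1, 1],  # q3: every model is correct (uninformative)
]
print(mrmr_miq(features, target, 2))  # [0, 2]
```

In this toy case q1 is as relevant as q2, but it duplicates q0, so the quotient favors q2 for the second slot.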

Prediction via Ridge and Kernel Ridge Regression

Regression is performed using classical ridge regression and kernel ridge regression (KRR) with polynomial kernels up to degree $d$, enabling the model to capture not only additive effects of individual questions, but also synergistic multiplicative effects (e.g., $d=2$ enables interaction terms between pairs of questions).
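
To see why $d=2$ brings in pairwise terms, it can help to expand a degree-2 polynomial kernel. The inhomogeneous form with offset 1 used below is an assumption (the summary only states "polynomial kernels up to degree $d$"); $x$ and $z$ stand for the coreset score vectors of two models.

```latex
(1 + x^{\top} z)^{2} \;=\; 1 \;+\; 2 \sum_{i} x_i z_i \;+\; \sum_{i} \sum_{j} (x_i x_j)(z_i z_j)
```

The last sum is an inner product over all products $x_i x_j$, so a predictor built on this kernel implicitly has access to every pair of coreset questions without those products being computed explicitly.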

Prediction is performed on new/test models using the learned mapping from coreset scores to the mean or summary benchmark score. Cross-validation is used for regularization parameter selection.
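
The prediction stage can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the authors' code: the kernel is taken to be `(1 + <x, z>)**degree`, the regularization strength `lam` is fixed rather than cross-validated, and the synthetic data (binary coreset responses plus one pairwise interaction in the target) is invented for the demonstration.

```python
import numpy as np


def poly_kernel(A, B, degree=2, coef0=1.0):
    """Polynomial kernel (coef0 + <a, b>)**degree between all rows of A and B."""
    return (coef0 + A @ B.T) ** degree


def fit_krr(X, y, lam=1e-3, degree=2):
    """Fit kernel ridge regression by solving (K + lam*I) alpha = y - mean(y)."""
    y_mean = float(np.mean(y))
    K = poly_kernel(X, X, degree)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y - y_mean)
    return {"X": X, "alpha": alpha, "y_mean": y_mean, "degree": degree}


def predict_krr(model, X_new):
    """Predict full-benchmark scores for new models from their coreset scores."""
    K_new = poly_kernel(X_new, model["X"], model["degree"])
    return K_new @ model["alpha"] + model["y_mean"]


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Toy data (assumption): 80 models, 5 binary coreset questions, and a
# "full score" that contains one pairwise interaction between questions.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(80, 5)).astype(float)
y = X.mean(axis=1) + 0.5 * X[:, 0] * X[:, 1]
X_train, y_train, X_test, y_test = X[:60], y[:60], X[60:], y[60:]

# degree=1 is ordinary (linear) ridge; degree=2 adds pairwise interactions.
err_linear = rmse(predict_krr(fit_krr(X_train, y_train, degree=1), X_test), y_test)
err_quadratic = rmse(predict_krr(fit_krr(X_train, y_train, degree=2), X_test), y_test)
print(err_linear, err_quadratic)
```

Because the toy target contains a product of two questions, the linear fit cannot represent it, while the quadratic kernel can. In practice `lam` would be chosen by cross-validation, as the text describes.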

Comparative Evaluation and Results

The method is evaluated across multiple datasets:

Binary Metrics: Open-LLM-Leaderboard and HELM-Lite, comprising hundreds of models evaluated on benchmarks like IFEval, OpenLLM-Math, MMLU-Pro, ARC-Challenge, BBH, GPQA, MuSR, CommonsenseQA, GSM8K, LegalBench, MATH, Med-QA, MMLU.
Continuous Metrics: Summarization datasets (GovReport, BioLaySumm), with metrics such as ROUGE-L, BERTScore, and FKGL.
Pass@$k$ Metrics: Python coding benchmarks (MBPP, MBPP+, LBPP, HumanEval) with varying $k$.
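
For readers unfamiliar with pass@$k$, the sketch below shows the widely used unbiased estimator computed from `n` sampled generations of which `c` pass the tests. Whether the paper computes its per-question pass@$k$ scores in exactly this way is an assumption; the summary does not say.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of generations sampled for the problem
    c: number of those generations that pass the tests
    k: sample budget being scored (requires 1 <= k <= n)
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing generation.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimate reduces to the plain fraction of passing samples.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

Per-question values of this kind are continuous in [0, 1], which is why pass@$k$ sits alongside the other continuous metrics rather than the binary ones.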

Strong baselines include random sampling, exhaustive random search, K-means clustering-based approaches (AnchorPoints), IRT-driven methods (gp-IRT, Metabench), Lasso regression, and variants using alternative regression methods.

Strong Numerical Claims

mRMR/KRR Outperforms Competitors: Across nearly all settings (binary, continuous, pass@$k$), mRMR++ (MIQ variant with KRR, $d=2$) achieves lower RMSE and MAE, as well as higher ranking correlations (Spearman $\rho$, Kendall $\tau$), particularly with moderate source-model pool sizes.
Enhanced Ranking Correlation: Even with small coresets (e.g., 5% of the benchmark), ranking correlations remain high—indicating robust model ordering and comparability.
Speed and Stability: mRMR-based coreset construction is orders-of-magnitude faster than clustering or IRT inference, with greater stability (measured via $Y$ 2 and Hamming metrics)—the coreset shows high overlap across seeds and data splits.
KRR Improves Other Methods: Kernel ridge regression, when applied to AnchorPoints/TinyBenchmarks-generated coresets, improves their predictive performance—even outperforming their original aggregation strategies.
Quadratic Kernels Optimal: Degree $d=2$ polynomial kernels strike a balance between capturing pairwise effects and avoiding overfitting, outperforming higher-degree variants.
Coreset Difficulty: mRMR selects questions of intermediate difficulty, focusing on maximal information content rather than broad representativity.
Figure 1: mRMR++ consistently delivers lower RMSE and higher Kendall $\tau$ compared to baselines, for both binary and pass@$k$ benchmarks.
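
The error and ranking metrics named in these claims can be computed with a few lines of plain Python. The sketch below is illustrative only: it uses the tie-free "tau-a" form of Kendall's coefficient, and the example scores are made up.

```python
import math
from itertools import combinations


def rmse(true, pred):
    """Root mean squared error between true and predicted benchmark scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def mae(true, pred):
    """Mean absolute error between true and predicted benchmark scores."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def kendall_tau_a(true, pred):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (t1, p1), (t2, p2) in combinations(zip(true, pred), 2):
        sign = (t1 - t2) * (p1 - p2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(true) * (len(true) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up scores for four test models; the predictor swaps models 2 and 3.
true_scores = [0.30, 0.45, 0.60, 0.75]
pred_scores = [0.32, 0.61, 0.58, 0.70]
print(rmse(true_scores, pred_scores))
print(mae(true_scores, pred_scores))
print(kendall_tau_a(true_scores, pred_scores))
```

A predictor can have small error yet still misorder close models, which is why the error metrics and the rank correlations are reported side by side.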

Figure 2: For $Y$ 6, mRMR++ dominates across coreset sizes in prediction error and ranking correlation; performance degrades for very small $Y$ 7.

Figure 3: mRMR selects questions with a balanced spectrum of difficulty compared to competitor methods; the stability and timing advantages are notable.
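
One way to quantify the selection stability mentioned in this caption is to compare the coresets chosen under different seeds or data splits. The sketch below uses Jaccard overlap and a normalized Hamming distance between selection indicator vectors. Both definitions, and the example runs, are assumptions made for illustration, since the summary does not spell out the paper's exact stability metrics.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two coresets: |A & B| / |A | B| (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def hamming(a, b, n_questions):
    """Fraction of benchmark questions whose selected/unselected status differs."""
    return len(set(a) ^ set(b)) / n_questions


def mean_pairwise(selections, fn):
    """Average a pairwise statistic over all pairs of runs."""
    pairs = list(combinations(selections, 2))
    return sum(fn(a, b) for a, b in pairs) / len(pairs)


# Hypothetical coresets (question indices) from three seeds on a 10-question benchmark.
runs = [[0, 2, 5], [0, 2, 7], [0, 2, 5]]
print(mean_pairwise(runs, jaccard))
print(mean_pairwise(runs, lambda a, b: hamming(a, b, n_questions=10)))
```

Higher Jaccard and lower Hamming values both indicate that the method keeps picking the same questions.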

Figure 4: True vs. predicted summary scores on ARC-C, showing that kernel ridge regression yields low training RMSE and reduced prediction variance.

Analytical Insights and Ablations

Extensive ablation experiments cover:

MI Estimator Parameters: The nearest-neighbors parameter $k$ is tuned separately for binary and for continuous metrics; the MIQ variant outperforms MID, and F-statistic-based selectors underperform.
Regression Method Variants: KRR ($d=2$) outperforms alternatives (Lasso, Random Forests, logit transformation).
Relevance Target Options: Using summary score as the target is optimal; first principal component alternatives perform worse.
Underspecification Effects: Performance gains from polynomial ridge regression diminish as the setting becomes more underspecified, which reduces signal-to-noise.
IRT Dimensionality: For binary, $k$ 2 is generally best; for continuous, results are less sensitive, but higher-dimensional continuous beta-IRT shows instability.
Figure 5: Optimality of quadratic kernels ($d=2$) for all methods; higher degrees degrade prediction.

Figure 6: mRMR scheme and MI estimator hyperparameter $k$ ablations: MIQ and PCA-KSG estimators yield best performance.

Figure 7: mRMR with MIQ achieves superior relevance-redundancy ratios for coreset selection.

Practical and Theoretical Implications

The approach provides substantial compute savings, enabling rigorous model evaluation at scale for in-distribution settings such as training checkpoint tracking or inference ablation studies. The focus on static coresets is motivated by practical needs—batch processing, interpretability, and throughput. By reframing efficient benchmarking through the lens of classic feature selection and regression, well-developed statistical tools from supervised learning theory become applicable.

From a theoretical standpoint, these results challenge the necessity, and the complexity, of more elaborate clustering or IRT-based approaches for the majority of real-world LLM evaluation workloads. The demonstrated generalization and stability of mRMR/KRR-based coresets, even to related metrics (pass@$k$ for varying $k$), indicate powerful representational properties of the selected coreset.

Figure 8: Performance benefits from polynomial ridge regression are sensitive to the underspecification ratio.

Future Directions

Unsolved issues and future avenues include:

Continuous Metrics: mRMR’s performance on continuous datasets remains sub-optimal; enhancements may be feasible via refined MI estimators or alternative feature importance functions.
Out-of-Distribution and Adaptive Settings: Extending the approach to OOD scenarios (test models outside training distribution) or adaptive benchmarking strategies (model-specific question selection) demands additional theoretical and algorithmic advances.
Data Augmentation: Integrating alternative data modalities (activations, multiple generations, in-context learning effects) may further enhance coreset utility.

Conclusion

This work demonstrates that efficient benchmarking for LLMs is well-captured by the intersection of information-theoretic feature selection (mRMR, especially MIQ) and kernel ridge regression. These techniques consistently outperform competitive alternatives across diverse metrics and benchmarks, particularly for binary-scored datasets. The proposed framework delivers compute-efficient, stable, and generalizable evaluation with minimal complexity, opening robust pathways for further research in scalable LLM benchmarking.

Figure 9: Optimization of MIQ objective during mRMR coreset construction directly enhances prediction error and ranking correlation.

Explain it Like I'm 14

What is this paper about?

This paper is about testing LLMs more quickly and cheaply. Instead of running a model on every single question in a big benchmark (which can take a lot of time and computing power), the authors show how to pick a small set of questions and still predict the model’s overall score very accurately. Their main message is simple: efficient benchmarking can be done well by treating it like two classic steps in statistics—pick useful “features” (questions) and learn a prediction formula (multiple regression).

What questions were the researchers trying to answer?

They wanted to know:

Can we choose a small group of questions that are especially good at telling us how a model will do on the whole benchmark?
Given a model’s answers on just those questions, can we accurately predict its full score and how it ranks compared to other models?
Will this work for different kinds of scores (like right/wrong, or continuous scores such as ROUGE or pass@k), and can it be fast and reliable?

How did they do it?

Think of a benchmark as a big exam with lots of questions. The authors split the problem into two stages:

Stage 1: Pick a small “coreset” of questions that are most useful.
Stage 2: Use the answers on that coreset to predict the model’s overall exam score.

Here’s how they handled each stage:

Picking the questions (feature selection with mRMR)

Feature selection means choosing the most informative items from a larger set. Here, each question is treated as a “feature.”
mRMR stands for Minimum Redundancy Maximum Relevance:
- Maximum relevance: choose questions that closely relate to the overall benchmark score. In everyday terms, these questions are strong clues to how the model will do overall.
- Minimum redundancy: avoid questions that duplicate each other’s information. Don’t pick five near-identical questions; pick diverse ones.
To measure “how related” two things are, they use mutual information, a number that tells you how much knowing one thing helps you predict another. High mutual information means a question is very helpful; low redundancy means questions aren’t repeating the same signal.

Predicting the overall score (regression)

Regression is a way to learn a formula that predicts an outcome (the full score) from inputs (the coreset question scores).
Ridge regression: a sturdy, simple method that prevents overfitting by shrinking overly large weights. Think of it as a balanced recipe that doesn’t let any single ingredient dominate too much.
Kernel ridge regression: an upgraded version that can learn from combinations of questions (not just each question alone). It’s like noticing that doing well on Question A and Question B together tells you more than either alone. The kernel trick lets the method account for these interactions without explicitly building a separate input for every combination of questions.

They also compared their approach against other popular methods, like:

AnchorPoints (choosing representative questions by clustering),
TinyBenchmarks/gp-IRT (using Item Response Theory, a model from testing/psychometrics),
Lasso (another regression method that picks features by making some weights exactly zero),
Random selection and exhaustive random search baselines.

What did they find, and why is it important?

Here are the main results:

Using mRMR to pick the questions and kernel ridge regression to predict the score (they call this mRMR++) gave the best performance in many cases.
It consistently had smaller prediction errors and better agreement with the true ranking of models (high Spearman and Kendall rank correlations), especially on benchmarks with right/wrong scoring.
Kernel ridge regression also improved older methods: swapping in kernel ridge (instead of their original predictors) made AnchorPoints and similar approaches more accurate.
mRMR was fast. It ran much quicker than methods that need heavy probabilistic modeling or large clustering steps.
mRMR was stable. Across different random seeds or training splits, it tended to pick the same questions more often, which means the selected coreset is reliable.
It worked on both kinds of metrics:
- Binary scores (correct/incorrect).
- Continuous scores (like ROUGE-L, BERTScore, and pass@k in coding).
For pass@k (coding tasks), coresets chosen with pass@1 often helped predict other pass@k values too, especially when using kernel ridge.
Caveat: In very data-poor settings (too few source models to learn from), some IRT-based methods could have lower error, though they were less stable and worse at ranking.

Why this matters: Testing LLMs is expensive, especially on complex tasks. These results mean you can evaluate models faster and cheaper while still getting trustworthy scores and rankings. That saves computing resources and lets researchers iterate on training and settings more quickly.

What’s the impact?

Practical savings: You can test models using only 5–15% of the original questions and still get strong predictions of the full score.
Better workflows: Faster, stable evaluations help during model training, ablation studies, and tuning (like temperature or prompts), because you get good feedback sooner.
Broad applicability: The approach works across different benchmarks and scoring styles, making it easy to adopt.
Simple tools, strong results: mRMR and kernel ridge regression are well-known, easy to implement, and do not require complex modeling. Just smarter question selection plus a better prediction formula gets you most of the gains.

In short, the paper shows that efficient benchmarking can be treated as “pick the best questions” plus “fit a good regression,” and doing so delivers accurate, fast, and stable evaluations of LLMs without needing to run every single benchmark question.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, actionable list of what remains missing, uncertain, or unexplored in the paper.

Data requirements and sample complexity: The approach relies on per-question scores from M source models and degrades in data-poor settings; quantify minimal M versus coreset size n needed to achieve target error or rank-correlation thresholds, and derive sample-complexity curves to guide practitioners.
Out-of-distribution generalization: The method shows an S-shaped bias (overpredicting very weak models and underpredicting very strong ones) and is not recommended for models far outside the source-model performance range; develop OOD-robust predictors (e.g., importance weighting, covariate-shift correction, conformal calibration) and evaluate on frontier models.
Continuous-metric feature selection: On non-pass@k continuous metrics, mRMR coresets underperform brute-force search; investigate alternative selection objectives better suited to continuous targets (e.g., HSIC-based criteria, submodular mutual information, Fisher kernels, or regression-aware forward selection).
Mutual information (MI) estimation with few samples: The MI estimators (KSG and mixed discrete–continuous) can be biased/unstable for small M and sensitive to the k-NN parameter; assess estimator robustness, try parametric or shrinkage MI, copula-based MI, or discretization schemes with bias correction, and provide guidance for estimator/hyperparameter choice.
Pass@k generalization strategy: Coresets built on pass@1 only partially generalize across k; design multi-k-aware coresets that jointly optimize predictive performance across k∈{1,…,64}, or model pass@k as a parametric function of pass@1 to improve transfer.
Cost-aware coreset selection: The method assumes uniform per-question cost; extend selection to incorporate heterogeneous evaluation costs (e.g., code execution, multi-turn interactions) to maximize information per unit compute/latency.
Coverage and capability diversity: mRMR tends to select average-difficulty items; introduce explicit coverage constraints (e.g., across subjects/capabilities in MMLU) or diversity-aware objectives (e.g., submodular facility location, DPPs) and quantify capability coverage.
Stability guarantees: While empirical stability is reported, there are no theoretical guarantees; analyze stability of greedy mRMR under sampling noise and derive conditions under which selected coresets remain invariant as M or the training pool changes.
Theoretical performance bounds: Provide generalization/error bounds for ridge/kernel ridge predictors in this setting (as a function of M, n, kernel degree d, and score noise), including conditions for when polynomial features help versus overfit.
Handling stochastic/heteroscedastic noise: Per-question scores can be noisy (e.g., pass@k variance, summarization metric instability); incorporate uncertainty into selection and prediction (e.g., variance-aware MI, weighted regression, errors-in-variables models) and measure robustness to label noise.
Missing data robustness: Leaderboard datasets often have missing per-model per-question entries; develop imputation strategies or selection/prediction methods robust to missingness mechanisms, and quantify the impact on stability and error.
Feature representation beyond per-model scores: The method represents questions only by historical model scores; explore augmenting with question text embeddings, metadata, difficulty proxies, or activation features to reduce dependence on large M and improve cold-start for new benchmarks.
Model class exploration: Beyond (kernel) ridge, assess non-linear predictors (e.g., GAMs with interactions, tree ensembles, GPs) and models with heteroscedastic errors; establish model selection protocols that remain reliable under tiny-sample regimes.
Hyperparameter selection under small M: Cross-validation for λ (regularization) and kernel degree d may be unreliable with few models; investigate Bayesian/empirical-Bayes tuning, marginal-likelihood selection, or nested CV variants designed for low M.
Joint multi-benchmark coresets: Benchmarks are treated independently; study whether cross-benchmark coresets can share items/skills to amortize evaluation cost (e.g., jointly selecting items across related benchmarks) and compare to IRT-based multi-benchmark methods without relying on IRT.
Dynamic updates and benchmark drift: Coresets are static; design incremental update rules that refresh coresets as new models/versions arrive or as benchmarks evolve, and add drift detection to trigger re-selection.
Uncertainty quantification for decisions: Provide calibrated confidence intervals for predicted scores and ranks (e.g., via bootstrap, Bayesian ridge, or conformal methods) and evaluate calibration quality for model comparisons.
Fairness across model families: Assess whether coresets overfit to the distribution of training-model families (e.g., open vs. closed, instruction-tuned vs. base); design selection that enforces fairness/coverage across families and audit subgroup performance.
Evaluation breadth and realism: Experiments omit long-horizon, interactive, or human-in-the-loop tasks (e.g., SWE-bench, multi-turn dialogue with human ratings); extend to these settings and adapt selection to path-dependent or subjective scoring.
End-to-end compute trade-offs: Report selection time but not full trade-offs (selection + coreset evaluation + prediction) under realistic batching/parallelism constraints; provide compute–accuracy/ranking curves to guide deployment.
Robustness to item perturbations: Test whether coresets remain informative under paraphrasing/rephrasing or adversarial edits; design selection robust to surface-form variations.
Interpretability of interactions: Kernel ridge with degree d=2 leverages pairwise interactions without interpretability; develop post-hoc analysis to identify influential questions and interactions driving predictions to support auditability.
Comparative breadth in feature selection: The paper does not compare against DPPs, facility-location-based selection, influence-function-based core sets, or determinantal submodular methods; include these baselines and analyze stability–performance trade-offs.
Label shift across pass@k values: Question difficulty shifts across k; formalize this as a domain shift and evaluate domain-adaptation corrections (e.g., reweighting or invariant selection) to improve cross-k generalization.
Human-evaluation budgets: Methods assume automatic metrics; adapt selection when labels come from humans with tight budgets by integrating active learning/uncertainty sampling and measuring gains in human-judgment settings.
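
As one concrete reading of the uncertainty-quantification item in the list above, a split-conformal interval can be wrapped around any point predictor. The sketch below is an illustration, not something the paper implements: it assumes a held-out set of calibration models whose true full-benchmark scores are known and that new models are exchangeable with them (that is, in-distribution). The function name and the example numbers are hypothetical.

```python
import math


def split_conformal_interval(cal_true, cal_pred, new_pred, alpha=0.1):
    """Interval for a new prediction with roughly (1 - alpha) coverage.

    cal_true / cal_pred: true and predicted full scores for held-out
    calibration models that were not used to fit the predictor.
    """
    scores = sorted(abs(t - p) for t, p in zip(cal_true, cal_pred))
    n = len(scores)
    # Finite-sample corrected rank of the residual quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration models for the requested coverage level.
        return (-math.inf, math.inf)
    q = scores[rank - 1]
    return (new_pred - q, new_pred + q)


# Hypothetical calibration set: nine models with absolute errors 0.01 .. 0.09.
cal_pred = [0.5] * 9
cal_true = [0.5 + 0.01 * i for i in range(1, 10)]
low, high = split_conformal_interval(cal_true, cal_pred, new_pred=0.6, alpha=0.2)
print(low, high)
```

The interval widens automatically when the calibration residuals are large, and it becomes uninformative when too few calibration models are available, which mirrors the data-poor caveat raised elsewhere in the list.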

Practical Applications

Immediate Applications

Below are actionable use cases that can be deployed now using the paper’s mRMR-based coreset selection and (kernel) ridge regression approach, along with sector links, potential tools/workflows, and key assumptions that affect feasibility.

Training checkpoint triage and gating (software/ML infrastructure)
- What: Rapidly approximate full-benchmark scores for each training checkpoint using a small, static coreset; trigger early stopping or promote promising runs.
- Tools/workflows: “Eval-accelerator” module integrated into MLflow, Weights & Biases, LangSmith; CI/CD gating step that runs coreset evals on each checkpoint.
- Assumptions/dependencies: Access to per-question scores from M ≥ 15–30 source models on the same benchmark; test models are in-distribution relative to source models; kernel degree/cross-validation set up; predictions less reliable at performance extremes.
Efficient hyperparameter and inference-time sweeps (software)
- What: Use coreset predictions to rank sampling settings, prompt formats, or few-shot configurations before committing to full evaluations.
- Tools/workflows: Automated sweep controllers that use mRMR coresets + kernel ridge to prune the search space.
- Assumptions/dependencies: In-distribution sweeps; enough M/n for stable regression (e.g., coreset ≈ 5–15% of items); continuous metrics supported but may benefit from model selection between linear vs. quadratic kernels.
Enterprise model shortlisting and procurement (cross-sector: finance, healthcare, enterprise IT)
- What: Shortlist candidate LLMs with low-cost coreset evals while preserving rank orderings close to full-benchmark results.
- Tools/workflows: Procurement dashboard that runs coreset-based “screening” and reserves full evals for finalists.
- Assumptions/dependencies: Candidate models similar to source pool; legal ability to run benchmark queries; communicate uncertainty and use full evals for contractual decisions.
Continuous integration (CI) quality gates for model releases (software vendors/platforms)
- What: Regressions/quality checks on a stable, repeatable mRMR coreset per benchmark before release.
- Tools/workflows: GitHub Actions/Buildkite step invoking mRMR++ predictions; fail/pass thresholds on rank correlations (Kendall τ/Spearman ρ) or MAE/RMSE.
- Assumptions/dependencies: Stable coreset reproducibility; periodic refresh if data distribution drifts.
Cheaper code-generation evaluation using pass@1 coresets to estimate pass@k (software/DevTools)
- What: Build coresets on pass@1 and reuse them to predict pass@k (k ∈ {2,4,…,64}) to reduce runs and generations.
- Tools/workflows: IDE/CI plugin that estimates pass@k from fewer generations; benchmarking scripts calling kernel or linear ridge on coreset results.
- Assumptions/dependencies: Coreset generalizes across k; models are in-distribution; ensure enough generations for low-variance pass@1 signals.
Benchmark curation and maintenance (academia/open-source leaders)
- What: Use mRMR to identify maximally informative, minimally redundant questions; curate “tiny” yet representative benchmark slices.
- Tools/workflows: Coreset builder CLI and notebooks (based on the paper’s GitHub) to refresh coresets as new source models arrive.
- Assumptions/dependencies: Sufficient source-model diversity; consider fairness and coverage—avoid overfitting to a narrow capability profile.
Compute budgeting and evaluation scheduling (energy/finance/operations)
- What: Use quick coreset predictions to decide whether a full, costly eval is warranted, optimizing compute spend and carbon budget.
- Tools/workflows: Budget-aware scheduler that triggers full runs only when predicted gains exceed a threshold.
- Assumptions/dependencies: Organization-defined error tolerance; track calibration errors and re-estimate when drift occurs.
Governance and audit sampling (regulated sectors: finance, healthcare, public sector)
- What: Stable, repeatable coresets act as audit samples for compliance checks and internal model reviews.
- Tools/workflows: Governance dashboards that store coreset definitions, predicted vs. true score deltas, and ranking stability over time.
- Assumptions/dependencies: Static coresets suit routine checks; document limitations (e.g., binary metrics strongest, continuous metrics require care).
Community leaderboard pre-screening (academia/open-source)
- What: Quickly estimate a model’s placement before committing to full, high-variance, multi-benchmark runs.
- Tools/workflows: Leaderboard integration that accepts coreset results and displays provisional standings with confidence notes.
- Assumptions/dependencies: Transparent caveats; top candidates still need full evaluation; avoid gaming by enforcing coreset secrecy or rotation.
Research triage for ablations and ideas (academia/industry R&D)
- What: Rank-order ablations and research ideas for likely performance impact across multiple benchmarks using coreset predictions.
- Tools/workflows: Meta-evaluation notebooks with cross-validated kernel degree selection (d=1 vs. d=2).
- Assumptions/dependencies: Sufficient source data; beware under/overfitting on small M/n; continuous metrics may benefit from linear kernels.
“Coresets-as-a-service” for common benchmarks (industry/academia)
- What: Distribute precomputed, stable coresets (and trained predictors) for popular benchmarks to reduce barrier to entry.
- Tools/workflows: Package registry with coreset IDs, seed, and mRMR settings; API for score prediction given coreset results.
- Assumptions/dependencies: Licensing of benchmarks; transparency on source model pool and update cadence.
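
Several of the workflows above (checkpoint gating, budget-aware scheduling, pre-screening) reduce to the same decision rule: run the full benchmark only when the coreset prediction suggests a gain larger than the predictor's own error. A minimal sketch follows, with a hypothetical function name and thresholds; the error margin could, for example, be the predictor's held-out RMSE.

```python
def should_run_full_eval(predicted_score, baseline_score, error_margin, min_gain=0.0):
    """Return True when the predicted gain clears min_gain plus the error margin.

    predicted_score: full-benchmark score predicted from coreset results
    baseline_score:  score of the current best model or checkpoint
    error_margin:    assumed tolerance for prediction error
    min_gain:        smallest improvement worth the cost of a full run
    """
    predicted_gain = predicted_score - baseline_score
    return predicted_gain > min_gain + error_margin


# A 2-point predicted gain is inside a 3-point error margin, so skip the full run.
print(should_run_full_eval(0.62, 0.60, error_margin=0.03))  # False
# A 10-point predicted gain clears the margin, so the full run is triggered.
print(should_run_full_eval(0.70, 0.60, error_margin=0.03))  # True
```

The margin keeps the gate conservative: small predicted improvements that the coreset predictor cannot reliably distinguish from noise do not consume full-evaluation budget.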

Long-Term Applications

The following applications require further research, scaling, standardization, or integration with adjacent advances (e.g., adaptive evaluation, OOD prediction, uncertainty quantification).

Standardized efficient-evaluation protocols and certifications (policy/regulation)
- What: Bodies like NIST/ISO codify efficient benchmarking protocols (coreset selection, prediction, error reporting).
- Tools/workflows: Compliance templates, audit packs, certified coreset libraries.
- Assumptions/dependencies: Stakeholder consensus; rules for uncertainty, drift, and out-of-distribution (OOD) models.
Adaptive evaluation with early stopping (software/AI assurance)
- What: Combine mRMR-informed priors with adaptive test selection to dynamically tailor question choice per model and stop early when confident.
- Tools/workflows: Active-testing orchestrators with stopping rules and budget control.
- Assumptions/dependencies: New algorithms to merge static coresets with adaptive selection; robust uncertainty estimates.
OOD performance prediction for frontier models (academia/industry)
- What: Extend to prediction-powered inference or covariate-shift approaches to extrapolate from weaker source models to stronger test models.
- Tools/workflows: Hybrid predictors (mRMR++ + PPI) with guardrails and small calibration subsets.
- Assumptions/dependencies: Reliable small labeled sets for calibration; theoretical guarantees; careful bias control.
Multi-metric and cross-benchmark forecasting for safety/robustness (AI safety/assurance)
- What: Jointly predict across related metrics (e.g., toxicity, factuality, robustness) using multi-target kernel ridge and shared coresets.
- Tools/workflows: Safety dashboards with cross-metric error tracking and rank-correlation matrices.
- Assumptions/dependencies: High-quality, multi-metric source data; fairness and subgroup checks; calibrated uncertainty.
Energy- and carbon-aware evaluation planners (energy/sustainability)
- What: Optimize evaluation plans to minimize carbon while maintaining rank fidelity (Kendall τ/Spearman ρ) targets.
- Tools/workflows: Carbon-cost models integrated with coreset schedulers; green-compute reporting.
- Assumptions/dependencies: Accurate carbon accounting; organizational incentives and policies.
Drift-aware continual monitoring with coreset rotation (production ML across sectors)
- What: Detect benchmark- or data-drift and refresh coresets/predictors to retain accuracy in long-lived systems.
- Tools/workflows: Drift detectors coupled with automatic coreset regeneration (mRMR) and predictor retraining.
- Assumptions/dependencies: Monitoring infrastructure; retraining budget; versioning and auditability.
Regulated domain evaluation kits (healthcare, finance, legal)
- What: Domain-specific coreset kits aligned with regulatory expectations for coverage and repeatability.
- Tools/workflows: Packaged question sets, documentation, and reference predictors; evidence-ready reports.
- Assumptions/dependencies: Access to sensitive data under governance; IRB/ethics approvals; human oversight loops.
Semi-automated benchmark design (academia/benchmark publishers)
- What: Use mRMR signals to identify gaps (low relevance/high redundancy) and automatically generate new items to improve coverage.
- Tools/workflows: Item-generation assistants, MI-based diagnostics, iterative curation pipelines.
- Assumptions/dependencies: High-quality item generation (synthetic/human-in-the-loop); robust MI estimation in low-data regimes.
Federated/edge benchmarking with minimal data movement (edge/IoT/privacy)
- What: Send only coreset queries to edge models and predict overall performance server-side to reduce bandwidth and protect data.
- Tools/workflows: Federated evaluation protocols; privacy-preserving telemetry; on-device coreset caching.
- Assumptions/dependencies: Privacy constraints; secure transport; model heterogeneity management.
Procurement and SLA frameworks with uncertainty-aware predictions (enterprise/legal)
- What: Contractual SLAs based on predicted performance plus confidence bands; periodic full audits for validation.
- Tools/workflows: UQ-augmented predictors; legal templates for “predict-then-validate” procurement.
- Assumptions/dependencies: Trusted uncertainty quantification; legal acceptance; dispute resolution tied to full re-tests.
Educational modules on efficient evaluation and information theory (education)
- What: Curricula and labs demonstrating mRMR, MI estimation, and kernel ridge with real benchmarks, reducing compute needs for teaching.
- Tools/workflows: Interactive notebooks, lightweight datasets, GitHub tutorials.
- Assumptions/dependencies: Stable open datasets; institutional adoption; basic Python/ML literacy.
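
Several of the applications above are framed around rank-fidelity targets (Kendall τ and Spearman ρ between predicted and true scores). As a minimal sketch of how such a target might be checked, the Python below computes both coefficients from scratch. The score lists, the thresholds, and the helper name `meets_rank_fidelity` are illustrative assumptions rather than values or code from the paper, and the Kendall variant shown is tau-a (ties contribute zero).

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs; tied pairs add zero."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) / 2)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def spearman_rho(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def meets_rank_fidelity(true_scores, predicted, tau_min, rho_min):
    """Hypothetical gate: both rank correlations must reach their targets."""
    return (kendall_tau(true_scores, predicted) >= tau_min
            and spearman_rho(true_scores, predicted) >= rho_min)


# Made-up scores for five models; the predictions swap one adjacent pair.
true_scores = [0.62, 0.71, 0.55, 0.80, 0.67]
predicted = [0.60, 0.69, 0.58, 0.78, 0.70]
ok = meets_rank_fidelity(true_scores, predicted, tau_min=0.75, rho_min=0.85)
```

In this toy case one of the ten model pairs is ordered differently by the predictions, so the gate passes at the thresholds shown but would fail a stricter Kendall τ target such as 0.9.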

Cross-cutting assumptions and caveats

Data requirements: The method needs per-question scores from a set of source models (M typically ≥ 15–30 for strong results; performance degrades in very low-M regimes).
Distributional match: Best for in-distribution test models relative to the source pool; accuracy declines for models far below/above source performance.
Underspecification: Keep a reasonable M/n ratio; extremely large coresets relative to M reduce regression reliability.
Metric sensitivity: Binary metrics show the strongest gains; for continuous metrics, compare linear vs. quadratic kernels via cross-validation; brute-force search baselines are sometimes competitive.
Stability and refresh: Static coresets are stable and fast, but should be refreshed as benchmarks or model landscapes evolve.
Transparency: Communicate prediction error, rank-correlation, and known biases (e.g., slight S-shaped regression at extremes); use full evals for high-stakes decisions.
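
The metric-sensitivity caveat suggests comparing linear and quadratic kernels by cross-validation. A minimal sketch of that comparison follows, assuming NumPy, a polynomial kernel of the form (x·x' + 1)^d, leave-one-model-out cross-validation, a small λ grid, and synthetic stand-in data. None of these choices is confirmed by the paper, and its tutorial code may be organised differently.

```python
import numpy as np


def poly_kernel(A, B, degree):
    """Polynomial kernel (a.b + 1)^degree between the rows of A and the rows of B."""
    return (A @ B.T + 1.0) ** degree


def krr_fit_predict(X_train, y_train, X_test, degree, lam):
    """Kernel ridge: alpha = (K + lam * I)^-1 y, prediction = k(x*, X) @ alpha."""
    K = poly_kernel(X_train, X_train, degree)
    alpha = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)
    return poly_kernel(X_test, X_train, degree) @ alpha


def loo_rmse(X, y, degree, lam):
    """Leave-one-model-out RMSE for one (degree, lam) setting."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        pred = krr_fit_predict(X[keep], y[keep], X[i:i + 1], degree, lam)[0]
        errors.append(pred - y[i])
    return float(np.sqrt(np.mean(np.square(errors))))


# Synthetic stand-in: M source models, n coreset questions with continuous scores.
rng = np.random.default_rng(0)
M, n = 30, 5
X = rng.uniform(0.0, 1.0, size=(M, n))
# Hypothetical "full benchmark score" with a pairwise interaction term.
y = X.mean(axis=1) + 0.3 * X[:, 0] * X[:, 1]

# Degree 1 is the linear kernel, degree 2 the quadratic kernel.
grid = [(d, lam) for d in (1, 2) for lam in (1e-3, 1e-2, 1e-1, 1.0)]
scores = {cfg: loo_rmse(X, y, *cfg) for cfg in grid}
best_degree, best_lam = min(scores, key=scores.get)
```

Selecting the degree and λ jointly on held-out source models keeps the choice tied to prediction error rather than in-sample fit, which matters when the coreset size n is large relative to M.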

Glossary

AnchorPoints: A clustering-based efficient benchmarking method that selects representative questions and uses a weighted mean for prediction. "Using kernel ridge regression also improves the performance of existing efficient benchmarking methods AnchorPoints and TinyBenchmarks."
BERTScore: An automatic evaluation metric that compares text generation to references using contextual embeddings. "For GovReport we have two metrics, ROUGE-L \citep{lin_automatic_2004} and BERTScore \citep{zhang_bertscore_2020}."
Beta-IRT: A continuous variant of Item Response Theory modeling scores on [0,1] via the Beta distribution. "To adapt gp-IRT to continuous data, we use the $\beta$-IRT model from \citet{noel_beta_2007}, described in \cref{app:cirt}."
Coreset: A selected subset of questions used to estimate full-benchmark performance efficiently. "Select a coreset of questions $C \subset D$ of size $|C| = n < |D| = N$."
Cross-validation: A model selection technique that partitions data into folds to tune hyperparameters. "where $\lambda > 0$ is a regularisation coefficient (which we select via cross-validation) and $I_m$ is the $m \times m$ identity matrix."
Entropy: A measure of uncertainty in a random variable, used to define mutual information. "can be expressed in terms of the entropy of the variables $H(\cdot)$ (with joint entropy $H(\cdot, \cdot)$ and conditional entropy $H(\cdot | \cdot)$):"
Fisher information: A quantity measuring how informative an observation is about model parameters. "select the most informative question (in terms of Fisher information)."
GAM: Generalized Additive Model; a regression model where the response depends on a sum of smooth functions of predictors. "Prediction happens on coreset questions using a linear GAM."
gp-IRT: Generalised performance-IRT; a method combining IRT-based predictions with clustering-based weighted means. "The second method, Generalised p-IRT (gp-IRT), predicts a convex combination of the p-IRT score and an AnchorPoints score"
Item Response Theory (IRT): A probabilistic framework modeling the interaction between latent model ability and item parameters to predict correctness. "The two-parameter IRT model represents the probability of model $m$ correctly answering question $i$ as:"
K-means clustering: An algorithm that partitions data into k clusters by minimizing within-cluster variance. "\citet{vivek_anchor_2024} perform K-means clustering over $q_i \in \mathbb{R}^M, i \in [N]$"
Kendall’s tau (τ): A rank correlation coefficient measuring the agreement between two rankings. "reduced prediction error (RMSE) and increased ranking correlation (Kendall $\tau$)."
Kernel ridge regression: Ridge regression performed in a feature space induced by a kernel function to capture nonlinear relationships. "we find that existing efficient benchmarking methods can be greatly improved by simply using kernel ridge regression at the prediction stage."
Kernel trick: A technique allowing inner products in high-dimensional feature spaces to be computed via kernels without explicit mapping. "Thanks to the kernel trick (\citet{boser_training_1992}; see \cref{app:kernel_trick})"
KL-divergence: A measure of discrepancy between two probability distributions, used to express mutual information. "Equivalently, the MI can be expressed in terms of a KL-divergence between the joint distribution of the two variables"
KSG estimator: A k-nearest-neighbor-based estimator for mutual information between continuous variables. "based on the continuous-continuous Kraskov–Stögbauer–Grassberger (KSG) estimator \citep{kraskov_estimating_2004}."
L1 penalty: A sparsity-inducing regularization term that encourages zero coefficients. "Replacing ridge regression's $L_2$ penalty with an $L_1$ penalty causes many of the coefficients $w_i$ to shrink to 0"
L2 penalty: A regularization term that shrinks coefficients to reduce variance and ensure invertibility. "The $L_2$ penalty shrinks coefficients and renders the matrix $(X^TX + \lambda I)$ invertible"
Lasso: A linear model with L1 regularization that performs variable selection by zeroing coefficients. "Lasso & $L_1$ penalty & Linear model"
MAE: Mean Absolute Error; an average of absolute differences between predicted and true values. "smaller prediction errors (in both MAE and RMSE)"
Maximum likelihood: A method for estimating parameters by maximizing the likelihood of observed data. "we estimate $\theta_{m^*}$ for a new test model $m^*$ via maximum likelihood on \cref{eq:irt}"
MCMC: Markov Chain Monte Carlo; sampling-based algorithms for approximate Bayesian inference. "fitted using MCMC or variational inference."
Medoid: The most centrally located data point in a cluster (minimizing distances to others). "select the $K=n$ medoid questions to form $\mathcal{C}$."
Minimum Redundancy Maximum Relevance (mRMR): A feature selection criterion maximizing relevance to the target while minimizing redundancy among selected features. "an information-theoretic feature-selection algorithm called minimum redundancy maximum relevance (mRMR)"
Mutual information (MI): A measure of dependence between variables, quantifying shared information. "Mutual information measures the dependence between two variables"
NP-complete: A class of decision problems for which no polynomial-time exact algorithm is known, and none is believed to exist. "Finding the globally optimal subset is NP-complete \citep{davies_np-completeness_1994}"
Pass@k: A code-generation metric measuring whether any of k attempts solves a task. "such as ROUGE-L \citep{lin_automatic_2004} or pass@$k$"
PCA-corrected KSG estimator: A variant of the KSG MI estimator that applies PCA to improve estimation in certain settings. "we use a PCA-corrected KSG estimator \citep{gao_efficient_2015}."
Polynomial kernel: A kernel function corresponding to all monomials up to a given degree, enabling interaction features. "with degree-$d$ polynomial kernels."
Reproducing Kernel Hilbert Space (RKHS): A Hilbert space of functions associated with a kernel, enabling inner-product computations via kernels. "the Reproducing Kernel Hilbert Space, or RKHS"
Ridge regression: Linear regression with L2 regularization to prevent overfitting and handle multicollinearity. "we proceed with ridge regression \citep{hoerl_ridge_1970, frank_statistical_1993, hastie_ridge_2020, hastie_elements_2001} since it is simple, quick and effective."
RMSE: Root Mean Squared Error; the square root of the average squared difference between predicted and true values. "reduced prediction error (RMSE)"
ROUGE-L: A summarization metric based on the longest common subsequence between the generated text and a reference. "For GovReport we have two metrics, ROUGE-L \citep{lin_automatic_2004} and BERTScore \citep{zhang_bertscore_2020}."
Spearman rho (ρ): A rank correlation coefficient based on the Pearson correlation of ranks. "greater ranking correlation between predicted and true scores (in both Spearman $\rho$ and Kendall $\tau$)"
Stratified sampling: Sampling that preserves the distribution across predefined strata (bins). "The training (source) models are selected via stratified sampling over ten equally-spaced bins"
Variational inference: An optimization-based method for approximate Bayesian inference using tractable surrogate distributions. "fitted using MCMC or variational inference."
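
To make the entropy, mutual information, and mRMR entries concrete, here is a minimal sketch of greedy mRMR selection on binary question scores. It assumes the common "difference" form of the criterion (relevance minus mean redundancy with already-selected questions) and a plug-in MI estimate on discrete data, with the full-benchmark score binned into four levels. The paper itself uses KSG-style estimators for continuous variables and may use a different mRMR variant, so the function names and toy data below are illustrative only.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )


def mrmr_select(columns, target, k):
    """Greedy mRMR: start with the most relevant column, then repeatedly add
    the column maximizing relevance minus mean redundancy with those selected."""
    relevance = [mutual_information(col, target) for col in columns]
    selected = [max(range(len(columns)), key=relevance.__getitem__)]
    while len(selected) < k:
        def score(j):
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy
        candidates = [j for j in range(len(columns)) if j not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy data: 8 source models, 4 questions (1 = correct, 0 = incorrect).
questions = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # q0: highly relevant
    [0, 0, 0, 0, 1, 1, 1, 1],  # q1: exact duplicate of q0 (fully redundant)
    [0, 0, 1, 1, 0, 0, 1, 0],  # q2: less relevant, but nearly independent of q0
    [0, 1, 0, 1, 0, 1, 0, 1],  # q3: carries no information about the target
]
# Hypothetical full-benchmark score for each model, binned into four levels.
target = [0, 0, 1, 1, 2, 2, 3, 3]

coreset = mrmr_select(questions, target, k=2)
```

Ranking by relevance alone would pick q0 and its duplicate q1, whereas the redundancy penalty steers the second pick to q2, which adds information q0 does not already provide.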
