EssenceBench: Scalable LLM Benchmark Compression
- EssenceBench is a benchmark compression framework that selects minimal, decisive coresets to efficiently evaluate large language models.
- It combines redundancy-aware filtering with multi-stage genetic algorithm optimization to maintain score reconstruction fidelity and model ranking consistency.
- Empirical results on HellaSwag show up to 95% ranking preservation using as little as 1/200 of the original samples.
EssenceBench is a coarse-to-fine benchmark compression and ranking-preservation framework designed to efficiently evaluate LLMs using substantially compressed subsets of benchmark data. Its principal innovation lies in combining redundancy-aware sample filtering with multi-stage Genetic Algorithm (GA) optimization to retain the core evaluative power of large benchmarks, such as HellaSwag, while drastically reducing the computational burden. EssenceBench surpasses prior methods by maintaining both score reconstruction fidelity and model ranking consistency, supporting rapid and scalable benchmarking as LLM evaluations expand in scope and cost (Wang et al., 12 Oct 2025).
1. Motivation and Theoretical Foundation
EssenceBench was developed in response to scaling issues in LLM evaluation, where benchmark suites (e.g., HellaSwag, MMLU, ARC) have grown to tens of thousands of samples, making comprehensive model evaluation computationally expensive. Large suites often harbor considerable redundancy, both in text content and in ranking information, resulting in superfluous samples that contribute little novel evaluative signal. The task addressed by EssenceBench is to identify a minimal subset (“coreset”) that, when used to assess a suite of models, yields aggregate scores and relative rankings consistent with those obtained on the full benchmark.
The framework formalizes this as an optimization problem:

$$\min_{\mathbf{m}\in\{0,1\}^{N},\ \|\mathbf{m}\|_{1}=k}\ \mathrm{RMSE}\!\left(\mathbf{a},\ g\!\left(\tfrac{1}{k}\,R\,\mathbf{m}\right)\right)$$

where $\mathbf{m}$ is a binary mask indicating the selected $k$ samples, $\mathbf{a}$ is the vector of model accuracies on the full set, $R$ is the binary response (correct/incorrect) matrix, and $g$ is a Generalized Additive Model (GAM) used to map the compressed-score results back to the original aggregate scores. The loss function is Root Mean Square Error (RMSE) for score reconstruction.
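As a concrete illustration, the following is a minimal sketch of this objective in Python. It assumes a boolean mask over an (models × samples) response matrix and uses `pygam`'s `LinearGAM` as a stand-in for the paper's GAM; the exact GAM configuration used by EssenceBench is not specified here.

```python
import numpy as np
from pygam import LinearGAM  # stand-in for the paper's GAM score-mapping model

def objective(mask, R, full_scores):
    """RMSE between full-set accuracies and GAM-reconstructed subset scores.

    mask:        (N,) boolean array selecting k samples
    R:           (M, N) binary correct/incorrect matrix (M models, N samples)
    full_scores: (M,) model accuracies on the full benchmark
    """
    subset_scores = R[:, mask].mean(axis=1)     # per-model accuracy on the subset
    X = subset_scores.reshape(-1, 1)
    gam = LinearGAM().fit(X, full_scores)       # map subset scores -> full scores
    reconstructed = gam.predict(X)              # in-sample fit, for brevity
    return float(np.sqrt(np.mean((reconstructed - full_scores) ** 2)))
```

Note that fitting and evaluating the GAM on the same models gives an in-sample RMSE; a held-out split of models would give a less optimistic estimate.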
2. Redundancy-Aware Filtering
EssenceBench initiates compression with a coarse filtering phase, targeting two forms of redundancy (both sketched in code after this list):
- Text-level redundancy: Pairwise semantic similarities are computed between sample texts (using embeddings). Highly similar samples are detected, and only the first of each group of similar samples is retained.
- Ranking-level redundancy: Performance rankings induced by the samples are compared using correlation metrics (Pearson, Spearman). Samples that lead to near-identical rankings across models are identified; extraneous samples are eliminated, ensuring that retained items collectively diversify the induced model orderings.
This stage reduces the search space for subsequent GA-based optimization, discarding samples with low discriminative power before invoking costlier search procedures.
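The sketch below illustrates both filters under simplifying assumptions: cosine similarity over pre-computed, L2-normalized embeddings for text redundancy, and Spearman correlation between per-sample response columns as a proxy for the ranking signal each sample induces. The thresholds are placeholders, not the paper's settings, and both passes are O(N²) greedy scans kept simple for clarity.

```python
import numpy as np
from scipy.stats import spearmanr

def filter_text_redundancy(embeddings, sim_threshold=0.95):
    """Greedy pass: keep a sample only if no already-kept sample is too similar."""
    # embeddings: (N, d), assumed L2-normalized so a dot product is cosine similarity
    kept = []
    for i in range(len(embeddings)):
        if all(float(embeddings[i] @ embeddings[j]) < sim_threshold for j in kept):
            kept.append(i)
    return kept

def filter_ranking_redundancy(R, candidates, corr_threshold=0.99):
    """Drop samples whose induced model ordering nearly duplicates a kept one."""
    # R: (M, N) binary response matrix; column R[:, i] is sample i's signal
    final = []
    for i in candidates:
        rhos = (spearmanr(R[:, i], R[:, j])[0] for j in final)
        # nan_to_num guards against constant columns, for which rho is undefined
        if all(abs(np.nan_to_num(rho)) < corr_threshold for rho in rhos):
            final.append(i)
    return final
```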
3. Genetic Algorithm Optimization
Having pruned the dataset, EssenceBench applies a two-stage iterative GA for fine compression; both stages are sketched in code after the list:
- Fitness-based subset search: Candidate subsets are encoded as binary masks. Standard GA operations—tournament selection, crossover (combining parent masks), mutation (random bit flips), and k-element constraint enforcement—optimize the loss (RMSE between the subset-predicted and full scores). At the end of each generation, the best-performing individuals (subsets) are selected for further rounds.
- Attribution-based sample search: An Explainable Boosting Machine (EBM) is trained to obtain “attribution scores” for each sample in the elite subsets. These scores quantify each sample’s contribution to score prediction. Samples are partitioned into high-, low-, and random-attribution groups, and further GA runs are performed within these partitions to enhance diversity while maintaining overall fidelity.
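The following is a minimal sketch of the fitness-based subset search, reusing the `objective` function from the earlier sketch. The hyperparameters (population size, mutation rate, tournament size, elite count) are placeholders, not the paper's values, and refitting the GAM on every evaluation is illustrative rather than efficient.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(n, k):
    """A random boolean mask with exactly k bits set."""
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=k, replace=False)] = True
    return mask

def repair(mask, k):
    """Enforce the k-element constraint after crossover/mutation."""
    on = np.flatnonzero(mask)
    if len(on) > k:
        mask[rng.choice(on, size=len(on) - k, replace=False)] = False
    elif len(on) < k:
        off = np.flatnonzero(~mask)
        mask[rng.choice(off, size=k - len(on), replace=False)] = True
    return mask

def ga_subset_search(R, full_scores, k, pop_size=50, generations=100,
                     tournament=3, p_mut=0.01, elite=5):
    n = R.shape[1]
    pop = [random_mask(n, k) for _ in range(pop_size)]
    for _ in range(generations):
        fitness = np.array([objective(m, R, full_scores) for m in pop])  # lower is better
        order = np.argsort(fitness)
        next_pop = [pop[i].copy() for i in order[:elite]]                # elitism
        while len(next_pop) < pop_size:
            # tournament selection of two parents
            p1 = pop[min(rng.choice(pop_size, tournament), key=lambda i: fitness[i])]
            p2 = pop[min(rng.choice(pop_size, tournament), key=lambda i: fitness[i])]
            cross = rng.random(n) < 0.5                                  # uniform crossover
            child = np.where(cross, p1, p2)
            child ^= rng.random(n) < p_mut                               # bit-flip mutation
            next_pop.append(repair(child, k))
        pop = next_pop
    fitness = np.array([objective(m, R, full_scores) for m in pop])
    return pop[int(np.argmin(fitness))]
```

For the attribution stage, a hedged sketch assuming the `interpret` library's `ExplainableBoostingRegressor` and its `term_importances()` helper; the paper's exact EBM configuration and partitioning scheme may differ.

```python
from interpret.glassbox import ExplainableBoostingRegressor

def attribution_scores(R, full_scores, mask):
    """Per-sample contribution scores from an EBM fit on subset responses."""
    X = R[:, mask].astype(float)                      # (models, k) feature matrix
    ebm = ExplainableBoostingRegressor(interactions=0).fit(X, full_scores)
    return ebm.term_importances()                     # one importance per selected sample
```

Samples can then be ranked by these scores and split into high-, low-, and random-attribution partitions before the follow-up GA runs.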
4. Evaluation Metrics and Empirical Results
EssenceBench’s core contributions are validated on the HellaSwag benchmark (~10,000 samples). The primary metrics are listed below, with a computation sketch following the list:
- Root Mean Square Error (RMSE): Measures error in aggregate score reconstruction from compressed subsets.
- Correlation coefficients: Pearson’s r, Spearman’s ρ, Kendall’s τ between scores/rankings on the full and compressed subsets.
- Ranking-specific metrics: Mean positional deviation, pair accuracy, NDCG@50, top-tier retrieval accuracy.
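Most of these metrics can be computed with standard scientific-Python tooling. The sketch below assumes `subset_scores` holds the GAM-reconstructed scores from a candidate coreset; top-tier retrieval accuracy is omitted for brevity.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.metrics import ndcg_score

def fidelity_metrics(full_scores, subset_scores, k_top=50):
    """Score- and ranking-fidelity metrics between full and compressed results."""
    metrics = {
        "rmse": float(np.sqrt(np.mean((full_scores - subset_scores) ** 2))),
        "pearson_r": pearsonr(full_scores, subset_scores)[0],
        "spearman_rho": spearmanr(full_scores, subset_scores)[0],
        "kendall_tau": kendalltau(full_scores, subset_scores)[0],
    }
    # rank 0 = best model under each scoring
    full_rank = np.argsort(np.argsort(-full_scores))
    sub_rank = np.argsort(np.argsort(-subset_scores))
    metrics["mean_positional_deviation"] = float(np.mean(np.abs(full_rank - sub_rank)))
    # pair accuracy: fraction of model pairs whose relative order is preserved
    i, j = np.triu_indices(len(full_scores), k=1)
    metrics["pair_accuracy"] = float(np.mean(
        np.sign(full_scores[i] - full_scores[j])
        == np.sign(subset_scores[i] - subset_scores[j])))
    metrics["ndcg@50"] = float(ndcg_score(full_scores[None, :],
                                          subset_scores[None, :], k=k_top))
    return metrics
```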
Empirical findings indicate that EssenceBench:
- Preserves model rankings within a 5% positional shift using 1/25 of the original samples.
- Maintains 95% ranking preservation within a 10% shift using only 1/200 of the samples.
- Outperforms baselines (MetaBench, Random Sampling, GraNd, Perplexity) in both ranking retention and score fidelity at multiple compression ratios.
- Compression does not introduce systematic ranking bias; both top and lower-tier models maintain positions with high accuracy.
5. Comparative Framework Analysis
Relative to prior work, EssenceBench integrates coarse redundancy filtering with optimization and attribution-based refinement. MetaBench uses IRT-based information and random subsampling, GraNd applies gradient-based sample selection, and perplexity-based methods select samples based on LLM response uncertainty. EssenceBench’s combination of redundancy metrics and population-based optimization yields lower RMSE and higher ranking stability, particularly on long-tailed model performance distributions.
6. Practical Implications and Limitations
The framework is applicable to any large-scale LLM benchmark, supporting efficient leaderboard maintenance and fast-turnaround comprehensive evaluation. The reduction in necessary test items translates to lower computational cost, runtime, and resource requirements, which is especially critical in commercial and academic benchmarking contexts. However, calibrating redundancy thresholds and GA hyperparameters (number of generations, population size, rounds) requires empirical tuning. Further, the choice of scoring and attribution models (e.g., GAM, EBM) may need to be revisited as model families evolve and benchmarks grow more challenging.
7. Future Directions and Extensions
Continued research may refine sample selection metrics beyond text/ranking redundancy, incorporating behavioral or error analysis. Extensions to adaptive evaluation (e.g., tailored subsets per model under test), integration with psychometric modeling, or compressed benchmarks for multi-modal LLMs represent promising directions. As benchmarks continue to grow in scale and complexity, the principle of redundancy-aware compression and ranking-preserving coresets will remain central to efficient model evaluation frameworks.
EssenceBench constitutes a robust, data-driven methodology for benchmark subset selection, enabling low-error reconstruction and highly faithful model comparisons with orders-of-magnitude fewer test items. Its design, grounded in redundancy-aware filtering and optimization, positions it as an indispensable tool for future scalable LLM evaluation (Wang et al., 12 Oct 2025).