Query-efficient model evaluation using cached responses

Published 8 May 2026 in cs.LG, cs.AI, and stat.ME | (2605.07096v1)

Abstract: Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice, responses from previously-evaluated models are often cached -- creating a potential opportunity to use this additional information to decrease the number of queries required to accurately evaluate a new model. In this paper, we introduce an approach for predicting benchmark performance that leverages cached model responses based on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting. Theoretically, we show that DKPS-based methods are query-efficient under certain conditions. Empirically, we demonstrate that DKPS-based methods achieve the same mean absolute error as baselines with a substantially decreased query budget. We conclude by proposing an offline method for selecting a set of queries that maximizes the goodness-of-fit on reference models, improving prediction accuracy over random query selection.


Authors (3)

Summary

The paper presents a DKPS method that predicts benchmark scores using cached responses, reducing queries by up to 16x while maintaining comparable MAE.
The methodology leverages multidimensional scaling of embedded responses to construct a low-dimensional space where similar outputs yield similar scores.
Empirical results demonstrate that combining offline query selection with DKPS regression enables robust, cost-effective model evaluation across diverse benchmarks.


Motivation and Problem Formulation

The proliferation of large-scale multi-task benchmarks has shifted model evaluation from single-task accuracy measurement to holistic assessment across diverse domains. However, fully scoring a new model, especially a generative model or LLM, has become prohibitively expensive because it requires thousands of query-response pairs. At the same time, model variants are produced rapidly through parameter-efficient fine-tuning, instruction tuning, model merging, and prompt engineering, which exposes a bottleneck: evaluation resources cannot keep pace with model generation.

This work proposes leveraging cached responses from previously evaluated models to reduce the required queries for new model benchmarking. The central objective is to predict a model's benchmark score from a small subset of its responses and the reference corpus of models and their cached responses. The approach is strictly black-box and response-based, not requiring access to model internals or scoring functions beyond those used by public benchmarks.

Data Kernel Perspective Space Methodology

At the core of the framework is the Data Kernel Perspective Space (DKPS), originally introduced for multi-agent monitoring. DKPS constructs a low-dimensional Euclidean representation of models based on their average embedded responses to a set of queries. Pairwise distances in this induced space correspond to empirical response similarities, permitting the use of standard statistical inference techniques for model-level benchmarking.

DKPS representations are computed via multidimensional scaling on matrices of embedded responses, with the embedding function $g$ mapping responses into $\mathbb{R}^p$. The geometry of DKPS is determined jointly by the choice of reference models, query subset, and embedding function, and the quality of the estimated representation is bounded theoretically as the number of models, queries, and replicates grows.
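The construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the embedded responses are already available as a dense array, uses the unnormalized Frobenius distance between per-model mean response matrices, and applies classical (eigendecomposition-based) multidimensional scaling. The function name `dkps_embed` and the array layout are hypothetical.

```python
import numpy as np

def dkps_embed(responses, n_components=2):
    """Sketch of a DKPS-style embedding (assumed layout, not the paper's code).

    responses: array of shape (n_models, n_queries, n_replicates, p) holding
    embedded responses g(response) in R^p.
    Returns (coords, dist): coordinates of shape (n_models, n_components)
    and the pairwise distance matrix they were derived from.
    """
    # Average over replicates: one (n_queries, p) matrix per model.
    mean_resp = responses.mean(axis=2)
    n_models = mean_resp.shape[0]
    flat = mean_resp.reshape(n_models, -1)

    # Pairwise Frobenius distances between the mean-response matrices.
    dist = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)

    # Classical MDS: double-center the squared distances, then eigendecompose.
    J = np.eye(n_models) - np.ones((n_models, n_models)) / n_models
    B = -0.5 * J @ (dist ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard tiny negatives
    coords = eigvecs[:, order] * np.sqrt(top_vals)
    return coords, dist
```

Under this sketch, a new model would be placed in the same space by stacking its (few) embedded responses with the cached responses of the reference models on the same queries and embedding them jointly.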

Theoretical Guarantees of Query Efficiency

The paper establishes query-efficiency bounds for DKPS-based regression under two assumptions:

Lipschitz continuity: The benchmark score function is Lipschitz in DKPS, such that models close in DKPS have similar benchmark scores.
Dense support: The distribution of reference models covers the target model's vicinity in DKPS.
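In symbols, the Lipschitz assumption can be written as follows. The notation is assumed here for illustration and may differ from the paper's: $\psi_i$ denotes the DKPS representation of model $i$, $f$ the benchmark score as a function of that representation, and $L$ a finite constant.

$$
|f(\psi_i) - f(\psi_j)| \le L \, \|\psi_i - \psi_j\|_2 \quad \text{for all models } i, j .
$$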

Under these assumptions, the mean squared error of DKPS-based nearest-neighbor regression can be made arbitrarily small with sufficiently many reference models, queries, and replicates. Notably, DKPS-based predictions can outperform the simple subsample-score estimate at the same query budget, particularly when that estimate is inexact (i.e., computed from only a subset of the queries), which provides a formal foundation for query-efficient inference.
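As a rough illustration of the nearest-neighbor regression discussed above, the sketch below predicts a target model's benchmark score as the average score of its $k$ closest reference models in DKPS. It assumes the DKPS coordinates have already been computed; the function name `knn_predict` and the choice of an unweighted average are assumptions made here, not details taken from the paper.

```python
import numpy as np

def knn_predict(ref_coords, ref_scores, target_coord, k=3):
    """Predict a benchmark score from the k nearest reference models.

    ref_coords:   (n_refs, d) DKPS coordinates of the reference models.
    ref_scores:   (n_refs,) known full-benchmark scores of those models.
    target_coord: (d,) DKPS coordinates of the new model.
    """
    # Euclidean distance from the target to every reference model.
    dists = np.linalg.norm(ref_coords - target_coord, axis=1)
    # Indices of the k closest reference models.
    nearest = np.argsort(dists)[:k]
    # Unweighted average of their known scores.
    return float(ref_scores[nearest].mean())

ref_coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
ref_scores = np.array([0.2, 0.4, 0.6, 1.0])
prediction = knn_predict(ref_coords, ref_scores, np.array([0.1, 0.1]), k=3)
```

The Lipschitz assumption is what makes this reasonable: if nearby models have similar scores, averaging over neighbors gives a low-error estimate, and dense support ensures such neighbors exist.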

Empirical Results and Numerical Analysis

Empirical validation on the HELM-Lite benchmark suite demonstrates the following:

DKPS-based methods match the mean absolute error (MAE) of baseline sample-score approaches with up to an order of magnitude fewer queries (e.g., on LegalBench, DKPS at $m=1$ reaches MAE ~0.048, which the sample-score baseline matches only at $m=16$, a 16x reduction).
The Ensemble method, which combines DKPS and sample-score predictions, achieves lower MAE than either component for nearly all (task, $m$) pairs.
Numerical gains are persistent in out-of-family evaluation protocols (LOFO), confirming robustness to architectural and training differences.
The choice of embedding function critically affects DKPS performance at low query budgets, leading to up to 20% MAE reduction.
Offline query set selection, using goodness-of-fit of DKPS regression (via $R^2$ criteria on reference models), yields lower prediction error, especially at small $m$, with gains broadly distributed across reference models.
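The summary does not specify how the Ensemble method combines its two inputs. One simple possibility, shown purely as an assumption, is a convex combination whose weight is fitted by least squares on reference models with known scores. The helper names `fit_ensemble_weight` and `mae` are hypothetical.

```python
import numpy as np

def fit_ensemble_weight(dkps_pred, sample_pred, true_scores):
    """Least-squares weight w in [0, 1] for w * dkps + (1 - w) * sample.

    All inputs are 1-D arrays over reference models with known scores.
    """
    d = dkps_pred - sample_pred
    denom = float(d @ d)
    if denom == 0.0:
        return 0.5  # both predictors agree everywhere; any weight works
    # Closed-form minimizer of ||w * d + sample_pred - true_scores||^2.
    w = float(d @ (true_scores - sample_pred)) / denom
    # Clip so the result stays a convex combination.
    return min(1.0, max(0.0, w))

def mae(pred, truth):
    """Mean absolute error between predicted and true benchmark scores."""
    return float(np.mean(np.abs(pred - truth)))

# Toy numbers for illustration only (not results from the paper).
dkps_pred = np.array([0.5, 0.6, 0.7])
sample_pred = np.array([0.3, 0.8, 0.5])
truth = 0.75 * dkps_pred + 0.25 * sample_pred
w = fit_ensemble_weight(dkps_pred, sample_pred, truth)
ensemble_pred = w * dkps_pred + (1.0 - w) * sample_pred
```

In practice the weight would be fitted on held-out reference models (for example in a leave-one-out fashion) and could depend on the query budget $m$, since the sample-score estimate becomes more reliable as $m$ grows.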

Practical Implications and Extensions

The work has several practical implications:

Practitioners can achieve comprehensive benchmark predictions with approximately 10% of the original query budget, enabling more frequent and less expensive model evaluation.
DKPS-based methods are applicable to modalities beyond text, provided responses can be embedded, and to alternative distance metrics (beyond Frobenius norm).
The black-box nature permits benchmarking in settings where response-level scores are unavailable, expensive, or proprietary, which makes the approach suitable for human-based or otherwise costly evaluations.
Offline query selection via DKPS enables principled active sampling, further compounding efficiency gains.
Integration of DKPS with principled subset selection (e.g., Item Response Theory, anchor point selection) extends utility to structured benchmarks.
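The offline query selection idea mentioned above can be sketched as a random search: draw candidate query subsets of size $m$, build a DKPS from the reference models' cached responses restricted to each subset, regress the known scores on the DKPS coordinates, and keep the subset with the highest $R^2$. The search strategy, the linear regression, and all function names below are assumptions for illustration; the paper's actual selection procedure may differ.

```python
import numpy as np

def classical_mds(dist, n_components):
    """Classical MDS coordinates from a pairwise distance matrix."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (dist ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))

def r_squared(coords, scores):
    """R^2 of an ordinary least-squares fit of scores on coords.

    Assumes the scores are not all identical (nonzero total variance).
    """
    X = np.column_stack([np.ones(len(scores)), coords])
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
    resid = scores - X @ beta
    total = scores - scores.mean()
    return 1.0 - float(resid @ resid) / float(total @ total)

def select_queries(mean_resp, scores, m, n_candidates=200,
                   n_components=2, seed=0):
    """Random search for the size-m query subset with the best R^2.

    mean_resp: (n_models, n_queries, p) replicate-averaged embedded
               responses of the reference models (cached, so no new queries).
    scores:    (n_models,) known benchmark scores of the reference models.
    """
    rng = np.random.default_rng(seed)
    n_models, n_queries, _ = mean_resp.shape
    best_r2, best_subset = -np.inf, None
    for _ in range(n_candidates):
        subset = np.sort(rng.choice(n_queries, size=m, replace=False))
        flat = mean_resp[:, subset, :].reshape(n_models, -1)
        dist = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
        r2 = r_squared(classical_mds(dist, n_components), scores)
        if r2 > best_r2:
            best_r2, best_subset = r2, subset
    return best_subset, best_r2
```

Because the search uses only cached reference responses and scores, it can run entirely offline; the selected queries are then the only ones sent to a new model.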

The approach is particularly impactful in continuous benchmarking scenarios, where datasets evolve, models are asynchronously evaluated, and caching efficiency is paramount. Limitations include the need for common query sets among reference models and deterministic scoring functions, with promising avenues for extension via matrix completion and stochastic inference.

Theoretical and Future Directions

Theoretical guarantees extend to consistent estimation of DKPS representations and concentration bounds on response-based embeddings, permitting principled inference even in high-dimensional settings (2605.07096). Future directions include:

Extending DKPS to incomplete query sets via aggregation or matrix completion, broadening practical scope.
Incorporating stochastic scoring and multiple response samples for LLM-as-judge paradigms.
Identifying task conditions predictive of DKPS efficiency to optimize deployment.
Exploring adaptive embedding and dimension selection strategies for further accuracy and efficiency improvements.

Conclusion

This work establishes a framework for query-efficient model evaluation that uses cached responses, via DKPS, to predict benchmark scores with formal guarantees and with empirical reductions in query requirements exceeding 10x. The approach is compatible with other query selection techniques and positions DKPS as a useful abstraction for scalable model benchmarking as evaluation costs continue to rise. The results underscore the value of cached response information for making model assessment more accessible and for streamlining evaluation workflows with minimal loss in accuracy.
