
AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators

Published 19 Dec 2025 in cs.CL and cs.AI | (2512.17267v1)

Abstract: Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward to equal effect as a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.

Summary

  • The paper introduces AutoMetrics, a pipeline dynamically synthesizing evaluation metrics from sparse human feedback.
  • It employs a multi-stage process—generation, retrieval, and regression composition—to boost correlation with human judgments by up to 33.4%.
  • The approach demonstrates strong data efficiency, robust sensitivity to quality degradation, and success in downstream tool optimization.

AutoMetrics: Synthesizing Automatic Evaluators from Sparse Human Feedback

Problem Context and Motivation

Effective evaluation remains a critical challenge as LLMs facilitate rapid prototyping of user-facing applications across creative, subjective, and open-ended domains (e.g., text generation, dialogue, travel planning, product recommendation). Conventional gold-standard evaluation relies on explicit user signals or behavioral metrics, but these are typically data-scarce and cannot keep pace with the rapid development cycles of LLM applications. While reward models or rubric-based LLM-as-a-Judge mechanisms offer partial automation, these approaches suffer from data inefficiency, poor generalization, and limited interpretability, and they carry no guarantee of alignment with actual user preferences. Consequently, adaptive, data-efficient metric synthesis and evaluator induction from minimal human feedback is an unsolved problem of practical and theoretical importance.

Methodology

AutoMetrics presents a general-purpose pipeline for dynamic evaluative metric synthesis in few-shot, low-resource settings. The proposed framework operates in four sequential stages:

  1. Metric Generation: The system instantiates a large candidate set of evaluation metrics for a new task—a mix of automatically generated LLM-judge criteria, rubric-based evaluators, example-based judges, and prompt-optimized LLM judges. Each metric is accompanied by a structured “Metric Card,” providing the definition, usage instructions, and limitations, and is designed to be cheap to generate.
  2. Metric Retrieval and Curation: The set of generated metrics is augmented with the curated “MetricBank” (48 hand-implemented canonical metrics spanning reference-based, reference-free, semantic, fluency, and special-purpose reward models). Using a hybrid ColBERT + LLM reranker retrieval procedure, the candidate pool is filtered to a manageable subset (typically k = 30).
  3. Metric Composition via Regression: All candidate metrics are scored on available labeled data (as few as N ≈ 80 examples). Z-scored metric outputs are combined via Partial Least Squares (PLS) regression to induce a latent evaluator most predictive of human labels. PLS is chosen for resilience to high covariance among predictors and applicability to high-p, low-n settings. Iterative ablation, ranking, and removal of spuriously or negatively correlated metrics yield a final composite (a minimal sketch of this composition step follows the list).
  4. Interpretability and Reporting: AutoMetrics produces a ranked metric report detailing the weight, domain, and interpretability of selected metrics, with traces of LLM-judge rationales to facilitate human oversight and system optimization.
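
To make the composition step (stage 3) concrete, the sketch below fits a PLS model over z-scored metric outputs with scikit-learn and reports the resulting weights and Kendall correlation. It is a minimal illustration under assumed data shapes and synthetic scores, not the authors' implementation, and it omits the iterative ablation of spurious or negatively correlated metrics.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.cross_decomposition import PLSRegression

# Assumed shapes: N labeled outputs scored by K candidate metrics.
# X[i, j] = score of metric j on output i; y[i] = human rating of output i.
rng = np.random.default_rng(0)
N, K = 80, 30                      # ~80 feedback points, ~30 retrieved metrics
X = rng.normal(size=(N, K))        # stand-in for real metric scores
y = X[:, :5].mean(axis=1) + 0.3 * rng.normal(size=N)   # stand-in human signal

# 1. Z-score each metric so learned weights are comparable across scales.
Xz = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

# 2. Fit a low-rank PLS model: robust when metrics are highly correlated
#    and the number of metrics approaches the number of labeled examples.
pls = PLSRegression(n_components=3)
pls.fit(Xz, y)

# 3. The composite evaluator is the PLS prediction from the z-scored metrics;
#    its per-metric weights keep the evaluator inspectable.
composite = pls.predict(Xz).ravel()
weights = pls.coef_.ravel()

# 4. Report correlation with human ratings and the most influential metrics.
tau, _ = kendalltau(composite, y)
print(f"Kendall tau vs. human ratings: {tau:.3f}")
for j in np.argsort(-np.abs(weights))[:5]:
    print(f"metric_{j}: weight {weights[j]:+.3f}")
```

In practice the correlation would be computed on held-out feedback rather than the fitting data; the in-sample value here only illustrates the reporting step.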

Validity, Robustness, and Comparative Evaluation

AutoMetrics evaluation is grounded in the three central tenets of measurement validity: content, criterion, and construct validity.

  • Criterion validity: Assessed via Kendall’s Tau correlation with gold human ratings across five tasks (SimpEval, HelpSteer2, EvalGen, RealHumanEval, CoGym). AutoMetrics achieves up to 33.4% improvement over state-of-the-art LLM-based baselines and consistently outperforms MetaMetrics, DnA-Eval, finetuned LLMs, and all “best single metric” candidates. Notably, this performance is achieved with only ~80 feedback points, making it far more data-efficient than standard reward models.
  • Construct validity (robustness): Sensitivity and stability are measured by generating output perturbations of known negative (quality-destroying) and neutral (quality-preserving) effect. AutoMetrics displays strong sensitivity to degradations (81–98% detection rates) and robust invariance under neutral perturbations (substantially above random baseline), both in- and out-of-distribution (a minimal sketch of this check follows the list).
  • Practical data efficiency: Performance saturates after ~80 labeled examples. “Generated metrics only” mode often yields superior results to the full MetricBank in ultra-sparse settings due to reduced risk of spurious correlations.
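
The perturbation-based robustness check can be approximated with a small helper that compares evaluator scores before and after a perturbation. The perturbation functions and `scorer` below are hypothetical stand-ins, not the paper's perturbation suite, and the threshold-based detection criterion is a simplifying assumption.

```python
from typing import Callable, Sequence

def detection_rate(
    scorer: Callable[[str], float],      # composite evaluator: text -> score
    outputs: Sequence[str],
    perturb: Callable[[str], str],       # perturbation applied to each output
    degrading: bool,                     # True if the perturbation should lower quality
    margin: float = 0.0,
) -> float:
    """Fraction of outputs where the evaluator reacts as expected.

    For degrading perturbations the score should drop by more than `margin`;
    for neutral perturbations it should stay within `margin`.
    """
    hits = 0
    for text in outputs:
        delta = scorer(perturb(text)) - scorer(text)
        if degrading:
            hits += delta < -margin       # sensitivity to quality loss
        else:
            hits += abs(delta) <= margin  # invariance to harmless edits
    return hits / max(len(outputs), 1)

# Hypothetical stand-in perturbations:
truncate = lambda t: t[: len(t) // 2]    # quality-destroying
respace = lambda t: " ".join(t.split())  # quality-preserving
# degrade_rate = detection_rate(scorer, outputs, truncate, degrading=True)
# neutral_rate = detection_rate(scorer, outputs, respace, degrading=False, margin=0.1)
```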

Baseline comparisons systematically demonstrate that AutoMetrics yields higher human correlation for all tasks when using state-of-the-art Qwen-3-32B and GPT-4o-mini backbones. LLM-judge and DnA-Eval are competitive only in special cases, but lack the interpretability, adaptability, and robustness of the induced composite metric.

Analysis of Design Decisions

Detailed ablations demonstrate the efficacy of:

  • Broad metric generation (including single-criterion, rubric, prompt-optimized judges), which generalizes well across tasks.
  • Hybrid retrieval using ColBERT + LLM reranking over rich Metric Cards, which outperforms simple keyword or BM25-based retrieval (sketched after this list).
  • PLS regression for metric combination, which consistently yields high correlation and low variance in the high-p, low-n regime.
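
The retrieval ablated above can be viewed as a two-stage filter: a cheap dense pass over Metric Cards followed by an LLM reranking pass. The sketch below uses placeholder `embed` and `llm_rerank` callables rather than the paper's ColBERT + LLM implementation; only the overall structure is intended to match.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MetricCard:
    name: str
    definition: str       # what the metric measures
    usage: str            # when to apply it
    limitations: str      # known failure modes

def retrieve_metrics(
    task_description: str,
    bank: List[MetricCard],
    embed: Callable[[str], List[float]],                              # placeholder dense encoder
    llm_rerank: Callable[[str, List[MetricCard]], List[MetricCard]],  # placeholder reranker
    k_dense: int = 60,
    k_final: int = 30,
) -> List[MetricCard]:
    """Two-stage retrieval: dense similarity over Metric Cards, then LLM reranking."""
    def cosine(a: List[float], b: List[float]) -> float:
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / (den or 1.0)

    query_vec = embed(task_description)
    dense_top = sorted(
        bank,
        key=lambda card: cosine(query_vec, embed(f"{card.name}: {card.definition}")),
        reverse=True,
    )[:k_dense]                                                # cheap first-stage filter
    return llm_rerank(task_description, dense_top)[:k_final]   # expensive second stage
```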

AutoMetrics’ recombination process preferentially leverages generated metrics and high-quality reward models, minimizing dependence on brittle reference-based metrics.

Downstream Optimization

A salient extension is the use of AutoMetrics as a proxy reward in system optimization settings—specifically, for optimizing tool-use agents in the τ-bench airline domain. When AutoMetrics-derived metrics are used as the GEPA optimizer reward, they match or even exceed the performance obtained from a verifiable ground-truth reward. After extensive rollouts, systems optimized using AutoMetrics demonstrate statistically significant improvement over non-optimized baselines, confirming the suitability of AutoMetrics for reward alignment throughout the model development lifecycle.
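
Using the induced evaluator as a proxy reward amounts to substituting the composite score for a verifiable check inside whatever optimizer drives the agent. The sketch below assumes a generic candidate-selection loop with hypothetical `rollout` and `composite_score` callables; it is not the GEPA API.

```python
from typing import Callable, List

def optimize_with_proxy_reward(
    candidate_prompts: List[str],
    rollout: Callable[[str], str],            # runs the agent with a given prompt
    composite_score: Callable[[str], float],  # AutoMetrics-style proxy reward
    n_rollouts: int = 8,
) -> str:
    """Select the candidate whose rollouts score highest under the proxy reward.

    Any prompt or policy optimizer can consume this scalar reward in place
    of a verifiable ground-truth check; only the selection rule changes.
    """
    best_prompt, best_reward = candidate_prompts[0], float("-inf")
    for prompt in candidate_prompts:
        rewards = [composite_score(rollout(prompt)) for _ in range(n_rollouts)]
        mean_reward = sum(rewards) / len(rewards)
        if mean_reward > best_reward:
            best_prompt, best_reward = prompt, mean_reward
    return best_prompt
```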

Practical and Theoretical Implications

AutoMetrics advances the field in several dimensions:

  • Practicality: The approach enables interpretable, actionable evaluator synthesis suitable for rapid prototyping, ablation, and A/B testing in real-world language system development.
  • Scalability: Data requirements are orders of magnitude smaller than traditional reward modeling; the framework can serve under both data scarcity and emerging-task conditions.
  • Generalization: The method extends to subjective, open-ended, or multi-dimensional tasks where explicit reference standards are unavailable, addressing a critical gap in evaluation methodology.
  • Transparency and Control: By providing metric rationales, weighting, and modular reporting, human experts retain oversight, essential for trustworthy deployment in safety-critical domains.

Limitations and Future Directions

Performance is sensitive to distributional drift, quality/diversity of initial feedback, and the particular LLM used for both metric generation and application. As LLMs improve, periodic regeneration (not merely copying induced metrics) is required. High-p, low-n regression remains vulnerable to overfitting and spurious correlation, though built-in filtering and warning mechanisms mitigate misuse. There are open questions regarding longer-term adoption, system “gaming” of induced metrics, and possible extensions to RL/system supervision beyond the current class of generative LMs. Future work should explore continual metric induction, adversarial robustness, and real-time human-in-the-loop evaluation.

Conclusion

AutoMetrics provides a systematic, data-efficient, and interpretable pipeline for inducing evaluation metrics that reliably approximate human judgment using automatically generated metric candidates. Empirical observation across multiple tasks, datasets, and architectures demonstrates that it consistently aligns with and predicts human preference signals, surpasses existing automated evaluators, and serves as an effective proxy for downstream system optimization. By open-sourcing the AutoMetrics toolkit and MetricBank, the authors establish a natural foundation for community-driven research on adaptive, scalable, and transparent evaluation in next-generation LLM-based systems.

Reference: "AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators" (2512.17267)
