Evolutionary Search for Automated Design of Uncertainty Quantification Methods

Published 3 Apr 2026 in cs.CL and cs.AI | (2604.03473v1)

Abstract: Uncertainty quantification (UQ) methods for LLMs are predominantly designed by hand based on domain knowledge and heuristics, limiting their scalability and generality. We apply LLM-powered evolutionary search to automatically discover unsupervised UQ methods represented as Python programs. On the task of atomic claim verification, our evolved methods outperform strong manually-designed baselines, achieving up to 6.7% relative ROC-AUC improvement across 9 datasets while generalizing robustly out-of-distribution. Qualitative analysis reveals that different LLMs employ qualitatively distinct evolutionary strategies: Claude models consistently design high-feature-count linear estimators, while Gpt-oss-120B gravitates toward simpler and more interpretable positional weighting schemes. Surprisingly, only Sonnet 4.5 and Opus 4.5 reliably leverage increased method complexity to improve performance -- Opus 4.6 shows an unexpected regression relative to its predecessor. Overall, our results indicate that LLM-powered evolutionary search is a promising paradigm for automated, interpretable hallucination detector design.


Authors (6)

Summary

The paper introduces an LLM-powered evolutionary framework that automatically synthesizes interpretable uncertainty quantification methods.
It reports statistically significant improvements on atomic claim verification, with up to a 6.7% relative ROC-AUC improvement over manually designed baselines.
The study reveals model-dependent complexity tradeoffs and underscores the potential of evolutionary search for scalable UQ design across diverse applications.

Automated Evolutionary Design of Uncertainty Quantification Methods

Overview

The paper "Evolutionary Search for Automated Design of Uncertainty Quantification Methods" (2604.03473) tackles the automated discovery of uncertainty quantification (UQ) algorithms for hallucination detection in LLMs. The conventional approach to UQ for LLMs relies on hand-crafted techniques informed by domain knowledge, which limits scalability and often yields suboptimal performance in out-of-distribution (OOD) scenarios. The authors leverage LLM-based evolutionary search to synthesize unsupervised UQ methods, represented as explicit Python programs, improving both interpretability and scalability relative to end-to-end deep learning or manual feature engineering.

Methodological Framework

The authors employ the EvoTune evolutionary search framework, enabling LLM-driven mutation and refinement within a fixed prompt structure. Initial candidates are seeded with a strong baseline (typically Sequence Probability), and successive generations of UQ methods are produced by an LLM under stochastic sampling (controlled by temperature). Fitness evaluation is performed on a single training dataset (PopQA), with further generalization assessed on multiple OOD claim verification datasets. The evolutionary pipeline is agnostic to input representation, supporting statistics derived from logits, attention maps, or hidden states.
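
The loop described above can be sketched in a few lines. This is a minimal toy illustration, not the EvoTune implementation: the LLM mutation step is replaced by random Gaussian perturbation of a weight vector, the data are synthetic, and every name (`make_toy_data`, `fitness`, `evolve`) is a hypothetical helper introduced here. It only shows the shape of the procedure: seed with Sequence Probability, propose variants, score them by ROC-AUC on one training set, and keep the best.

```python
import random

def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_toy_data(rng, n=100, length=4):
    """Synthetic claims: (token log-probs, label), label 1 = unsupported claim."""
    data = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        # Toy assumption: unsupported claims get lower log-probs on later tokens.
        logprobs = [-rng.expovariate(1.0) * (1.0 + label * i) for i in range(length)]
        data.append((logprobs, label))
    return data

def fitness(weights, data):
    """ROC-AUC of a position-weighted negative log-probability score."""
    scores = [-sum(w * lp for w, lp in zip(weights, lps)) for lps, _ in data]
    labels = [y for _, y in data]
    return roc_auc(scores, labels)

def evolve(data, generations=10, population=4, offspring=8, seed=0):
    rng = random.Random(seed)
    length = len(data[0][0])
    pool = [[1.0] * length]  # seed: uniform weights, i.e. Sequence Probability
    for _ in range(generations):
        children = []
        for _ in range(offspring):
            parent = rng.choice(pool)
            # Stand-in for the LLM proposing a modified program.
            children.append([w + rng.gauss(0.0, 0.3) for w in parent])
        # Keep the fittest candidates; the current best always survives.
        ranked = sorted(pool + children, key=lambda w: fitness(w, data), reverse=True)
        pool = ranked[:population]
    return pool[0]

data = make_toy_data(random.Random(1))
best = evolve(data)
print("seed fitness:", round(fitness([1.0] * 4, data), 3))
print("best fitness:", round(fitness(best, data), 3))
```

Because the best candidate is never discarded, training fitness cannot decrease across generations. Whether the gain carries over to other datasets is a separate question, which is why the paper evaluates out-of-distribution.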

Distinct prompting strategies foster the synthesis of both simple and complex UQ routines. Notably, complexity constraints ("use up to 3 features") are respected to varying degrees across models: Gpt-oss-120B adheres to them strictly, while Claude models disregard them and generate high-dimensional linear estimators.
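
To make the contrast between the two styles concrete, the sketch below shows what a positional weighting scheme and a multi-feature linear estimator might look like when both take token log-probabilities as input. These are illustrative forms only: the function names, the `decay` parameter, and the feature set are assumptions made here, not methods reported in the paper.

```python
import math

def positional_weighting(logprobs, decay=0.8):
    """Few-feature style: position-weighted mean negative log-prob.

    With decay < 1, earlier tokens contribute more than later ones.
    """
    weights = [decay ** i for i in range(len(logprobs))]
    return -sum(w * lp for w, lp in zip(weights, logprobs)) / sum(weights)

def linear_estimator(logprobs, coef):
    """Many-feature style: linear combination of log-prob summary statistics."""
    n = len(logprobs)
    mean = sum(logprobs) / n
    features = {
        "neg_mean": -mean,
        "neg_min": -min(logprobs),
        "neg_first": -logprobs[0],
        "neg_last": -logprobs[-1],
        "std": math.sqrt(sum((lp - mean) ** 2 for lp in logprobs) / n),
    }
    return sum(coef[name] * value for name, value in features.items())

# Hypothetical token log-probs for one claim; higher output = more uncertain.
claim = [-0.1, -0.3, -2.5, -0.2]
coef = {"neg_mean": 1.0, "neg_min": 0.5, "neg_first": 0.2, "neg_last": 0.2, "std": 0.3}
print(positional_weighting(claim))
print(linear_estimator(claim, coef))
```

The first form has one tunable knob and is easy to read off. The second trades that readability for flexibility, which is the tradeoff the complexity constraint in the prompt is meant to control.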

Experimental Evaluation

Claim Verification

Evolved UQ methods outperform established baselines on atomic claim verification across nine datasets, with statistically significant ROC-AUC improvements of up to 6.7% in relative terms. The evolutionary search uses only the PopQA train split for candidate evaluation, yet the resulting methods generalize robustly to claims from diverse sources, including human-written text, synthetic data, and fragments of long-form LLM output.
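
Since the headline number is a relative improvement, it helps to see how it relates to the absolute ROC-AUC gap. The snippet below is a small worked example. The two ROC-AUC values are hypothetical, chosen only to show what a relative gain of roughly 6.7% looks like, and are not results from the paper.

```python
def relative_improvement(auc_new, auc_base):
    """Relative gain of auc_new over auc_base, as a fraction of the baseline."""
    return (auc_new - auc_base) / auc_base

# Hypothetical ROC-AUC values, not taken from the paper.
auc_base, auc_new = 0.700, 0.747

print(f"absolute gain: {auc_new - auc_base:.3f}")                        # 0.047
print(f"relative gain: {relative_improvement(auc_new, auc_base):.1%}")   # 6.7%
```

The same relative figure corresponds to a smaller absolute gap when the baseline is weaker and a larger one when it is stronger, so relative numbers are best read alongside the baseline they refer to.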

Pairwise bootstrap significance testing confirms that evolved solutions (especially those from Claude models) deliver superior performance compared to Sequence Probability and attention-based UQ (RAUQ), with six datasets evidencing robust gains.
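
A paired bootstrap comparison of two UQ methods can be sketched as follows. This is one common way to run such a test, written under assumptions: the paper's exact resampling scheme, number of resamples, and significance threshold are not given in this summary, and `paired_bootstrap` is a hypothetical helper. The key point is that both methods are scored on the same resampled examples, so the difference is paired.

```python
import random

def roc_auc(scores, labels):
    """Pairwise ROC-AUC; returns None if a class is missing from the sample."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap(scores_a, scores_b, labels, n_resamples=500, seed=0):
    """Return (mean AUC difference A - B, share of resamples where A is not better)."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        y = [labels[i] for i in idx]
        auc_a = roc_auc([scores_a[i] for i in idx], y)
        auc_b = roc_auc([scores_b[i] for i in idx], y)
        if auc_a is None or auc_b is None:
            continue  # skip degenerate resamples containing a single class
        diffs.append(auc_a - auc_b)
    mean_diff = sum(diffs) / len(diffs)
    p_value = sum(d <= 0 for d in diffs) / len(diffs)
    return mean_diff, p_value

# Toy check: method A ranks perfectly, method B is uninformative.
labels = [i % 2 for i in range(40)]
scores_a = [float(y) for y in labels]
scores_b = [0.0] * 40
print(paired_bootstrap(scores_a, scores_b, labels))
```

With several datasets and several method pairs, many such tests are run at once, so reported significance should be read with the number of comparisons in mind.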

Selective Prediction

Cross-task generalization to selective prediction is less pronounced. Methods optimized for binary claim verification rarely transfer well to tasks requiring nuanced quality discrimination (e.g., summarization with ROUGE-L metrics). Evolution on task-specific datasets (CoQA, SAMSum) yields competitive, lightweight estimators that do not depend on expensive attention map computation, suggesting the approach's adaptability.
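
Selective prediction differs from claim verification in that the target is a graded quality score rather than a binary label. A minimal sketch of the evaluation idea is given below: reject the most uncertain outputs and measure the average quality of what remains. The helper name and the toy numbers are assumptions, and the paper's actual selective prediction metric may be defined differently.

```python
def mean_quality_after_rejection(uncertainty, quality, reject_fraction):
    """Drop the most uncertain outputs, then average quality over the rest."""
    order = sorted(range(len(quality)), key=lambda i: uncertainty[i])
    keep = max(1, round(len(order) * (1 - reject_fraction)))
    kept = order[:keep]  # indices of the least uncertain outputs
    return sum(quality[i] for i in kept) / keep

# Hypothetical per-output uncertainty and quality (e.g. ROUGE-L) values.
uncertainty = [0.1, 0.9, 0.5, 0.3]
quality = [0.8, 0.2, 0.4, 0.6]

for frac in (0.0, 0.25, 0.5):
    print(frac, mean_quality_after_rejection(uncertainty, quality, frac))
```

A useful uncertainty score makes the retained quality rise as the rejection fraction grows. A score tuned to separate true from false claims need not order outputs by graded quality, which is consistent with the weak transfer described above.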

Feature and Complexity Analysis

Ablation studies reveal that UQ estimators relying solely on logit-based features perform best; adding attention or hidden-state signals yields no further gains. Model behavior also diverges significantly with respect to complexity: only Sonnet 4.5 and Opus 4.5 (Claude family) exploit increased method complexity to achieve improved generalization, while Opus 4.6 regresses relative to its predecessor, a result conjectured to stem from output diversity collapse. Logistic regression trained on the feature sets of evolved methods produces distinct weightings, suggesting that evolutionary search reaches solutions that gradient-based fitting of the same features does not.

Similarity analysis (Spearman correlation between predictions) confirms that evolved UQ methods are not merely variants of Sequence Probability: they induce different input rankings with comparable accuracy, indicating that the search discovers genuinely different algorithms.
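
The similarity check can be illustrated with a short Spearman correlation routine: rank the examples under each method's scores and correlate the ranks. This is a stdlib-only sketch with hypothetical helper names and toy scores. In practice a library routine would normally be used, and the paper's exact setup is not specified here.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Hypothetical uncertainty scores from two methods on the same four claims.
method_a = [0.1, 0.4, 0.6, 0.9]
method_b = [0.2, 0.7, 0.5, 0.8]
print(spearman(method_a, method_b))
```

A correlation near 1 would mean the two methods order the examples almost identically. Values well below 1 combined with similar ROC-AUC indicate that the methods reach comparable accuracy through different rankings.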

Theoretical and Practical Implications

The results foreground LLM-driven evolutionary search as a scalable paradigm for automated program synthesis in UQ. By explicitly guiding LLM search in constrained but expressive domains, the approach generates interpretable, computationally efficient hallucination detectors that avoid the brittleness of hand-engineered heuristics.

The behavioral heterogeneity across evolution-driving LLMs raises a theoretical question: how do the qualitative priors encoded in frontier models shape exploration strategies, complexity exploitation, and eventual generalization? The implicit regularization observed in high-feature-count methods diverges from what classical statistical learning theory and gradient optimization principles would predict, hinting at new mechanisms in LLM-based program search. This warrants further study across model families, scales, and decoding regimes.

Practically, the evolutionary search enables rapid prototyping of lightweight UQ methods with strong OOD generalization, potentially extending to tasks beyond hallucination detection (e.g., scientific discovery, hardware design). The explicit program representation facilitates downstream interpretability, auditing, and deployment.

Limitations and Future Directions

Limitations include reliance on a single model's internal signals (Llama-3.1-8B-Instruct), a narrow focus on unsupervised, interpretable programs, and candidate evaluation on a single training set. Extending the framework to supervised or neural detector families, broader dataset diversity, and varied generator architectures remains an open avenue.

The empirical observation of a model-dependent complexity-performance tradeoff, together with the non-trivial generalization of high-dimensional evolved estimators, invites further theoretical investigation. Understanding how LLM knowledge and discrete evolutionary selection interact to produce implicit regularization may inform future automated algorithm discovery frameworks.

Conclusion

LLM-powered evolutionary search provides a robust framework for automated, interpretable UQ method synthesis. Evolved estimators surpass manual baselines in atomic claim verification and demonstrate competitive performance in selective prediction tasks, with clear evidence for strong OOD generalization. The approach surfaces behavioral idiosyncrasies across frontier LLMs and points toward new research directions in program synthesis and hallucination mitigation. Further work may generalize these findings to complex scientific and engineering domains, leveraging LLM-driven evolution as a scalable alternative for automated algorithm design.