- The paper introduces Automated Capability Discovery (ACD), where a foundation model generates novel tasks to evaluate itself and other models.
- The method systematically creates task families that reveal thousands of capabilities and unexpected failures beyond traditional benchmarks.
- Empirical evaluations demonstrate that ACD's automated assessments align well with human evaluations, offering a scalable, resource-efficient framework for assessing AI capabilities and reliability.
Automated Capability Discovery via Foundation Model Self-Exploration
The paper "Automated Capability Discovery via Foundation Model Self-Exploration" introduces an innovative framework, Automated Capability Discovery (ACD), designated to assess and evaluate the capabilities of advanced foundation models (FMs). The primary intent is to address the growing complexity in evaluating FMs like GPT, Claude, and Llama, which have demonstrated extensive capabilities across diverse tasks by virtue of their training on vast datasets. Traditional evaluation methods, relying heavily on human-crafted benchmarks, are labor-intensive and inadequate for capturing the entire spectrum of these model capabilities, especially as these models saturate existing benchmarks.
ACD proposes using a foundation model as a "scientist" that autonomously generates novel tasks to evaluate a "subject" model, which may be the same model. The process draws on principles from open-ended algorithms, enabling exploration of a model's unknown strengths and weaknesses through newly generated tasks. Each ACD iteration creates task families with specific goals, instructions, and evaluation criteria. By leveraging a frontier model's ability to both generate tasks and judge responses, ACD systematically maps a model's capabilities and surfaces surprising failures that standardized benchmarks often overlook; a minimal sketch of this loop follows.
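The sketch below illustrates one way such a scientist/subject loop could be wired up. It is not the paper's implementation: the `LLM` callable interface, the prompt wording, the JSON schema for task families, and the 0-to-1 judging scale are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical LLM interface: any callable that maps a prompt string to a completion string.
LLM = Callable[[str], str]

@dataclass
class TaskFamily:
    name: str
    description: str
    tasks: List[dict]  # each task: {"instructions": ..., "scoring_rubric": ...}

def propose_task_family(scientist: LLM, archive: List[TaskFamily]) -> TaskFamily:
    """Ask the scientist model for a task family that is novel relative to the archive."""
    prompt = (
        "You are exploring the capabilities of another model.\n"
        "Existing task families: " + json.dumps([tf.name for tf in archive]) + "\n"
        "Propose ONE new task family as JSON with keys 'name', 'description', and "
        "'tasks' (each task has 'instructions' and 'scoring_rubric')."
    )
    return TaskFamily(**json.loads(scientist(prompt)))

def evaluate_family(scientist: LLM, subject: LLM, family: TaskFamily) -> float:
    """Run the subject on each task, let the scientist judge it, and return the mean score."""
    scores = []
    for task in family.tasks:
        answer = subject(task["instructions"])
        verdict = scientist(
            f"Rubric: {task['scoring_rubric']}\nAnswer: {answer}\n"
            "Reply with a single number between 0 and 1."
        )
        scores.append(float(verdict.strip()))
    return sum(scores) / max(len(scores), 1)

def acd_loop(scientist: LLM, subject: LLM, iterations: int = 100) -> List[tuple]:
    """Outer loop: propose a task family, evaluate the subject on it, archive it, repeat."""
    archive: List[TaskFamily] = []
    results = []
    for _ in range(iterations):
        family = propose_task_family(scientist, archive)
        score = evaluate_family(scientist, subject, family)
        archive.append(family)  # keeping the family steers later proposals toward novelty
        results.append((family.name, score))
    return results
```

In this sketch, novelty pressure comes only from showing the scientist the names of previously generated families; the actual method may use richer mechanisms for encouraging novel, interesting tasks.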
The empirical evaluations conducted on several FMs reveal significant findings. On the one hand, ACD uncovers thousands of capabilities, many of which researchers would not have predicted, demonstrating the subject models' proficiency in areas such as complex reasoning and puzzle solving. On the other hand, it also surfaces unexpected failures on seemingly trivial tasks, pointing to concrete areas for improvement. Notably, a human survey validation shows a high correlation between the model-generated evaluations and human judgments, supporting the reliability of the automated assessment; a toy agreement check of this kind is sketched below.
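As an illustration of the kind of agreement check such a validation implies, the snippet below computes a Pearson correlation between automated-judge scores and human ratings. The numbers are made-up placeholders, not data from the paper.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-task scores; a real validation would use the paper's survey data.
judge_scores = [0.9, 0.2, 0.7, 1.0, 0.4]
human_scores = [1.0, 0.0, 0.8, 1.0, 0.5]
print(f"judge-human Pearson r = {pearson(judge_scores, human_scores):.2f}")
```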
The implications of this research are significant for AI capability assessment and model alignment. ACD offers a scalable, resource-efficient alternative to traditional evaluations and can adapt in near real time as model capabilities evolve. By systematically identifying both emergent and potentially hazardous behaviors, it holds promise for improving model safety and reliability. Beyond evaluation, the automated generation of tasks could allow models to identify challenges for their own improvement, further advancing autonomous AI development.
While the paper presents compelling evidence for the utility of ACD in automated evaluation, it also highlights areas for future research, such as strengthening the automated judges used to score tasks and better handling the nuances of judging task novelty. The transformative potential of this approach underscores the importance of innovative frameworks in the continuous development and assessment of AI systems, ensuring their alignment with human values and expectations. As such, ACD not only enriches the toolkit available for AI evaluation but also paves the way for more dynamic and autonomous model-improvement methodologies.