- The paper introduces Automated Capability Discovery (ACD), where a foundation model generates novel tasks to evaluate itself and other models.
- The method systematically creates task families that reveal thousands of capabilities and unexpected failures beyond traditional benchmarks.
- Empirical evaluations demonstrate that ACD's automated assessments align well with human evaluations, offering a scalable, resource-efficient framework for assessing AI capabilities and reliability.
Automated Capability Discovery via Foundation Model Self-Exploration
The paper "Automated Capability Discovery via Foundation Model Self-Exploration" introduces an innovative framework, Automated Capability Discovery (ACD), designated to assess and evaluate the capabilities of advanced foundation models (FMs). The primary intent is to address the growing complexity in evaluating FMs like GPT, Claude, and Llama, which have demonstrated extensive capabilities across diverse tasks by virtue of their training on vast datasets. Traditional evaluation methods, relying heavily on human-crafted benchmarks, are labor-intensive and inadequate for capturing the entire spectrum of these model capabilities, especially as these models saturate existing benchmarks.
ACD proposes using a foundation model as a "scientist" that autonomously generates novel tasks to evaluate a "subject" model, which may be the same model. The process draws on principles from open-ended algorithms, enabling exploration of a model's unknown strengths and weaknesses through newly generated tasks. Each ACD iteration creates task families with specific goals, instructions, and evaluation criteria. By leveraging a frontier model's ability to both generate tasks and judge responses, ACD systematically maps a model's capabilities and surfaces surprising failures that standardized benchmarks often overlook; a minimal sketch of this loop follows.
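The sketch below illustrates one way such a scientist/subject loop could be wired up. It is not the paper's implementation: the `LLM` callable interface, the prompt wording, the JSON schema for task families, and the 0-to-1 judging scale are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical LLM interface: any callable that maps a prompt string to a completion string.
LLM = Callable[[str], str]

@dataclass
class TaskFamily:
    name: str
    description: str
    tasks: List[dict]  # each task: {"instructions": ..., "scoring_rubric": ...}

def propose_task_family(scientist: LLM, archive: List[TaskFamily]) -> TaskFamily:
    """Ask the scientist model for a task family that is novel relative to the archive."""
    prompt = (
        "You are exploring the capabilities of another model.\n"
        "Existing task families: " + json.dumps([tf.name for tf in archive]) + "\n"
        "Propose ONE new task family as JSON with keys 'name', 'description', and "
        "'tasks' (each task has 'instructions' and 'scoring_rubric')."
    )
    return TaskFamily(**json.loads(scientist(prompt)))

def evaluate_family(scientist: LLM, subject: LLM, family: TaskFamily) -> float:
    """Run the subject on each task, let the scientist judge it, and return the mean score."""
    scores = []
    for task in family.tasks:
        answer = subject(task["instructions"])
        verdict = scientist(
            f"Rubric: {task['scoring_rubric']}\nAnswer: {answer}\n"
            "Reply with a single number between 0 and 1."
        )
        scores.append(float(verdict.strip()))
    return sum(scores) / max(len(scores), 1)

def acd_loop(scientist: LLM, subject: LLM, iterations: int = 100) -> List[tuple]:
    """Outer loop: propose a task family, evaluate the subject on it, archive it, repeat."""
    archive: List[TaskFamily] = []
    results = []
    for _ in range(iterations):
        family = propose_task_family(scientist, archive)
        score = evaluate_family(scientist, subject, family)
        archive.append(family)  # keeping the family steers later proposals toward novelty
        results.append((family.name, score))
    return results
```

In this sketch, novelty pressure comes only from showing the scientist the names of previously generated families; the actual method may use richer mechanisms for encouraging novel, interesting tasks.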
The empirical evaluations conducted on several FMs reveal significant findings. On the one hand, ACD uncovers thousands of capabilities, many of which researchers would not have predicted, demonstrating the subject models' proficiency in areas such as complex reasoning and puzzle solving. On the other hand, it also surfaces unexpected failures on seemingly trivial tasks, pointing to concrete areas for improvement. Notably, a human survey validation shows a high correlation between the model-generated evaluations and human judgments, supporting the reliability of the automated assessment; a toy agreement check of this kind is sketched below.
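As an illustration of the kind of agreement check such a validation implies, the snippet below computes a Pearson correlation between automated-judge scores and human ratings. The numbers are made-up placeholders, not data from the paper.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-task scores; a real validation would use the paper's survey data.
judge_scores = [0.9, 0.2, 0.7, 1.0, 0.4]
human_scores = [1.0, 0.0, 0.8, 1.0, 0.5]
print(f"judge-human Pearson r = {pearson(judge_scores, human_scores):.2f}")
```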
The implications of this research are significant for AI capability assessment and model alignment. ACD offers a scalable, resource-efficient alternative to traditional evaluations and can adapt in near real time as model capabilities evolve. By systematically identifying both emergent and potentially hazardous behaviors, it holds promise for improving model safety and reliability. Beyond evaluation, the automated generation of tasks could allow models to identify challenges for their own improvement, further advancing autonomous AI development.
While the paper presents compelling evidence for the utility of ACD in automated evaluation, it also highlights areas for future research, such as strengthening the automated judges used to score tasks and better handling the nuances of judging task novelty. The transformative potential of this approach underscores the importance of innovative frameworks in the continuous development and assessment of AI systems, ensuring their alignment with human values and expectations. As such, ACD not only enriches the toolkit available for AI evaluation but also paves the way for more dynamic and autonomous model-improvement methodologies.