- The paper introduces a framework that classifies AI systems by their performance and generality, establishing distinct levels of capability on the path to AGI.
- It distills six key principles for a useful AGI ontology, emphasizing capabilities over processes, ecological validity, and responsible deployment.
- The study outlines requirements for a living AGI benchmark spanning diverse cognitive and metacognitive tasks, and discusses how human-AI interaction paradigms shape responsible deployment.
Levels of AGI for Operationalizing Progress
The paper "Levels of AGI for Operationalizing Progress on the Path to AGI" (2311.02462) introduces a framework for classifying AI systems based on their capabilities, generality, and autonomy, drawing an analogy to the levels of autonomous driving. The framework aims to provide a common language for comparing models, assessing risks, and measuring progress toward AGI. The paper analyzes existing definitions of AGI and distills six principles for a useful AGI ontology, among them focusing on capabilities rather than processes, treating generality and performance as separate dimensions, and defining stages along the path to AGI rather than a single endpoint. The authors propose "Levels of AGI" based on depth (performance) and breadth (generality) of capabilities, discuss how current systems fit into this ontology, address the requirements for future benchmarks, and emphasize the importance of Human-AI Interaction paradigms for responsible and safe deployment.
Case Studies of AGI Definitions
The paper analyzes nine prominent AGI definitions, including the Turing Test, Strong AI, analogies to the human brain, human-level performance on cognitive tasks, the ability to learn tasks, economically valuable work, flexible and general intelligence, Artificial Capable Intelligence (ACI), and SOTA LLMs as generalists. It argues that the Turing Test is insufficient, Strong AI is impractical, and analogies to the human brain are not inherently necessary. While definitions emphasizing human-level performance, learning ability, and economic value have strengths, they also have shortcomings. Wozniak's "Coffee Test" (a machine entering an unfamiliar home and making a cup of coffee) requires robotic embodiment, while the assertion that SOTA LLMs are already AGIs overlooks the need for reliable correctness.
Six Principles for Defining AGI
The paper articulates six key principles for defining AGI:
- Focus on capabilities, not processes, excluding human-like thinking and consciousness as requirements.
- Focus on generality and performance as key components.
- Focus on cognitive and metacognitive tasks, with metacognitive abilities (such as learning new skills or knowing when to ask humans for help) being key prerequisites for generality.
- Focus on potential, not deployment, to avoid non-technical hurdles.
- Focus on ecological validity, choosing tasks that align with real-world tasks that people value.
- Focus on the path to AGI, not a single endpoint, advocating for "Levels of AGI."
Levels of AGI Ontology
The paper introduces a matrixed leveling system based on performance and generality. Performance refers to the depth of an AI system’s capabilities compared to human-level performance, while generality refers to the breadth of tasks for which an AI system reaches a target performance threshold. The levels range from Level 0 (No AI) to Level 5 (Superhuman), with intermediate levels including Emerging, Competent, Expert, and Virtuoso. The taxonomy specifies the minimum performance over most tasks, allowing systems to have higher performance on a subset of tasks. Frontier LLMs are considered Level 1 General AI ("Emerging AGI") until their performance increases across a broader set of tasks. The highest level, Level 5 General AI, corresponds to Artificial Superintelligence (ASI): systems capable of a wide range of tasks at a level no human can match, including tasks qualitatively different from existing human skills.
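The matrixed taxonomy can be sketched as a small lookup. The percentile thresholds below follow the paper's leveling table (Competent at the 50th percentile of skilled adults, Expert at the 90th, Virtuoso at the 99th, Superhuman outperforming all humans); the function and its signature are illustrative, not part of the paper.

```python
# Sketch of the paper's performance x generality matrix.
# Percentile thresholds follow the paper's leveling table; the API is illustrative.

PERFORMANCE_LEVELS = [
    # (level, label, minimum percentile of skilled adults outperformed)
    (5, "Superhuman", 100),
    (4, "Virtuoso",   99),
    (3, "Expert",     90),
    (2, "Competent",  50),
    (1, "Emerging",    0),   # equal to or somewhat better than an unskilled human
]

def classify(percentile: float, general: bool) -> str:
    """Map a performance percentile to a level, qualified as Narrow or General."""
    breadth = "General" if general else "Narrow"
    for level, label, threshold in PERFORMANCE_LEVELS:
        if percentile >= threshold:
            return f"Level {level}: {label} {breadth} AI"
    return f"Level 0: No AI ({breadth})"

# Frontier LLMs under this taxonomy: broad coverage, Emerging-level performance.
print(classify(10, general=True))   # Level 1: Emerging General AI
```

Because performance and generality are scored independently, the same percentile can describe a Narrow system (e.g., AlphaFold-style specialists at Superhuman Narrow) or a General one.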
Testing for AGI
The paper emphasizes the importance of measurement, particularly in terms of the set of tasks that constitute the generality criteria. It suggests that an AGI benchmark should include a broad suite of cognitive and metacognitive tasks, measuring diverse properties such as linguistic intelligence, mathematical and logical reasoning, and creativity. The benchmark might include tests covering psychometric categories, but these must be evaluated for suitability for benchmarking computing systems. The paper suggests that the benchmark should be a living benchmark that includes a framework for generating and agreeing upon new tasks.
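The "living benchmark" idea can be illustrated as a task registry that accepts new, agreed-upon tasks over time and scores generality as breadth of passing performance. The task categories echo the paper's examples; the classes and method names here are hypothetical.

```python
# Sketch of a "living benchmark": a registry that grows as new tasks are agreed
# upon, rather than a frozen suite. All names and the API are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    category: str                 # e.g. "linguistic", "mathematical", "creativity"
    metacognitive: bool = False   # e.g. learning new skills, knowing when to ask for help

@dataclass
class LivingBenchmark:
    tasks: list = field(default_factory=list)

    def propose(self, task: Task) -> None:
        """Add a task once consensus forms; the suite is never considered final."""
        self.tasks.append(task)

    def generality(self, results: dict) -> float:
        """Breadth: fraction of benchmark tasks on which a system meets the
        target performance threshold (pass/fail per task name)."""
        if not self.tasks:
            return 0.0
        passed = sum(1 for t in self.tasks if results.get(t.name, False))
        return passed / len(self.tasks)

bench = LivingBenchmark()
bench.propose(Task("theory-of-mind", "social", metacognitive=True))
bench.propose(Task("math-word-problems", "mathematical"))
print(bench.generality({"math-word-problems": True}))  # 0.5
```

The pass/fail scoring is a simplification; the paper leaves open what the target performance threshold per task should be.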
Risk, Autonomy, and Human-AI Interaction
The paper discusses the relationship between levels of AGI and different types of AI risk, including misuse, alignment, and structural risks. It introduces six Levels of Autonomy, ranging from "No AI" to "AI as an Agent," and argues that higher levels of autonomy are "unlocked" by AGI capability progression. The paper emphasizes the importance of carefully considered choices around human-AI interaction and alignment, and argues that the interplay of model capabilities and interaction design will enable more nuanced risk assessments and responsible deployment decisions.
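The autonomy labels below are the paper's six Levels of Autonomy; the capability check is an illustrative heuristic for the paper's "unlocking" argument, not its actual capability-to-autonomy mapping.

```python
# The paper's six Levels of Autonomy. The paper argues higher autonomy is
# "unlocked" by capability progression; the simple capability >= autonomy
# comparison below is an illustrative heuristic, not the paper's mapping.
AUTONOMY_LEVELS = {
    0: "No AI",
    1: "AI as a Tool",
    2: "AI as a Consultant",
    3: "AI as a Collaborator",
    4: "AI as an Expert",
    5: "AI as an Agent",
}

def deployment_warning(capability_level: int, autonomy_level: int) -> str:
    """Flag interaction paradigms whose autonomy outpaces demonstrated capability."""
    label = AUTONOMY_LEVELS[autonomy_level]
    if autonomy_level > capability_level:
        return f"caution: '{label}' exceeds demonstrated capability (Level {capability_level})"
    return f"'{label}' is within demonstrated capability"

# An Emerging-level system deployed as a fully autonomous agent is the
# mismatch the paper warns about.
print(deployment_warning(capability_level=1, autonomy_level=5))
```

The point of the sketch is that the deployment decision is a separate, human choice: a highly capable system can still be deployed at a low autonomy level when the interaction design calls for it.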
Conclusion
The paper concludes by emphasizing the need for a clear, operationalizable definition of AGI, and introduces a "Levels of AGI" ontology that considers generality and performance. The authors discuss the implications of their principles for developing an AGI benchmark and reshaping discussions around the risks associated with AGI, noting that AGI is not necessarily synonymous with autonomy. They advocate for investing in human-AI interaction research in tandem with model improvements.