AGI Score: Benchmarking AI Progress
- AGI Score is a quantitative metric that assesses how closely AI systems approach human-level cognitive performance across multiple domains.
- Methodologies incorporate psychometric testing, coherence-based aggregation, and cluster stability indices to identify strengths and weaknesses.
- AGI Score integrates economic and policy factors, guiding research, deployment, and governance for advancing safe AI development.
AGI Score is an aggregate metric or set of metrics designed to quantify the progress, proficiency, and safety of AI systems as they approach general intelligence. The concept encompasses multidomain benchmarking, cognitive profiling, safety readiness, ability generalization, and robustness, often referencing empirically validated human cognitive frameworks. The AGI Score serves both technical and policy-oriented functions: tracking advancement in AI capabilities, diagnosing cognitive deficits, evaluating safety, and guiding research, deployment, and governance.
1. Foundations and Definitions
Various AGI Score frameworks are grounded in human psychometrics, cognitive theory, and signal-level invariants. A widely adopted quantitative definition operationalizes AGI as the capacity of an AI system to match or exceed the cognitive versatility and proficiency of a well-educated adult, as measured using frameworks such as Cattell-Horn-Carroll (CHC) theory (Hendrycks et al., 21 Oct 2025). This approach breaks cognition down into ten domains—including knowledge, reasoning, memory, perception, and speed—each weighted equally in aggregate scoring.
The basic formula for the "CHC-style" AGI Score is

$$\text{AGI Score} = \sum_{d=1}^{10} s_d, \qquad s_d \in [0\%, 10\%],$$

where each $s_d$ denotes the normalized score for domain $d$, ranging from 0–10%, so the total yields an interpretable percentage of human-level general intelligence.
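A minimal sketch of this equal-weight aggregation (the domain names below are illustrative shorthand, not the official battery labels):

```python
# Sketch of the CHC-style AGI Score: ten equally weighted domains,
# each contributing up to 10 percentage points.
CHC_DOMAINS = [
    "knowledge", "reading_writing", "math", "reasoning", "working_memory",
    "memory_storage", "memory_retrieval", "visual", "auditory", "speed",
]

def agi_score(domain_scores: dict) -> float:
    """Aggregate normalized domain scores (each in [0, 1]) into a
    percentage of human-level general intelligence."""
    assert set(domain_scores) == set(CHC_DOMAINS)
    # Each domain is capped at 10% of the total score.
    return sum(10.0 * min(max(s, 0.0), 1.0) for s in domain_scores.values())

# Example: a "jagged" profile -- strong knowledge, absent memory storage.
profile = {d: 0.9 for d in CHC_DOMAINS}
profile["memory_storage"] = 0.0
print(f"AGI Score: {agi_score(profile):.1f}%")  # 81.0%
```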
2. Methodological Approaches
Multi-domain Profiling and Psychometric Batteries
Evaluation methodologies often adapt human psychometric batteries to test individual cognitive components (e.g., Raven's Progressive Matrices for inductive reasoning, alongside text-comprehension and arithmetic tasks) (Hendrycks et al., 21 Oct 2025). Domain scores expose "jagged" cognitive profiles in contemporary AI, with high proficiency in knowledge-intensive areas and marked deficits in foundational domains such as long-term memory storage.
Coherence-based and Compensability-adjusted Aggregation
Traditional arithmetic-mean scoring (full compensability) allows strengths in certain domains to mask critical failures elsewhere. A coherence-based measure (Fourati, 23 Oct 2025) integrates generalized means across compensability exponents $p \in [-1, 1]$, spanning arithmetic ($p = 1$), geometric ($p \to 0$), and harmonic ($p = -1$) regimes. The area under the curve,

$$\text{AUC} = \frac{1}{2} \int_{-1}^{1} M_p(\mathbf{s}) \, dp, \qquad M_p(\mathbf{s}) = \left( \frac{1}{n} \sum_{i=1}^{n} s_i^{\,p} \right)^{1/p},$$

robustly penalizes imbalance and inter-domain dependency. For example, GPT‑5's arithmetic mean score (~58%) collapses to ~24% coherence-adjusted AUC, reflecting persistent weak domains.
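A compact numerical sketch of this aggregation (the grid resolution and the illustrative score profile are assumptions; the cited paper's exact normalization may differ):

```python
import numpy as np

def generalized_mean(scores, p: float) -> float:
    """Power mean M_p: p=1 arithmetic, p->0 geometric, p=-1 harmonic."""
    s = np.asarray(scores, dtype=float)
    if abs(p) < 1e-9:  # geometric-mean limit as p -> 0
        return float(np.exp(np.mean(np.log(np.maximum(s, 1e-12)))))
    return float(np.mean(s ** p) ** (1.0 / p))

def coherence_auc(scores, n_grid: int = 201) -> float:
    """Average M_p over p in [-1, 1] on a uniform grid, i.e. the
    normalized area under the compensability curve."""
    p_grid = np.linspace(-1.0, 1.0, n_grid)
    return float(np.mean([generalized_mean(scores, p) for p in p_grid]))

# An imbalanced ("jagged") profile: the AUC falls well below the mean.
jagged = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.45, 0.2, 0.1, 0.05]
print(f"arithmetic mean: {np.mean(jagged):.1%}")  # 57.5%
print(f"coherence AUC:   {coherence_auc(jagged):.1%}")
```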
Cluster Stability and Causal Centrality Weighting
General intelligence is conceptualized as a homeostatic property cluster—a set of abilities maintained co-present under perturbation (Reynolds, 17 Oct 2025). The AGI Score is refined by weighting each domain by its causal centrality, estimated from empirical CHC loadings combined with mechanistic priors. Cluster Stability Indices—Profile Stability (pCSI), Durable Learning (dCSI), and Error-Decay (eCSI)—are computed for persistence across sessions, penalizing brittleness, instability, and lack of robust learning.
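A minimal sketch of these two refinements, assuming the geometric-mean combination named in the comparative table below (the centrality weights and index values are hypothetical):

```python
import numpy as np

def weighted_agi_score(scores, centrality) -> float:
    """Domain scores weighted by causal centrality; weights are
    normalized so the result stays on the same 0-1 scale."""
    w = np.asarray(centrality, dtype=float)
    return float(np.dot(w / w.sum(), scores))

def cluster_stability(p_csi: float, d_csi: float, e_csi: float) -> float:
    """Geometric mean of the three Cluster Stability Indices: a
    near-zero index (e.g., no durable learning) collapses the composite."""
    return float(np.cbrt(p_csi * d_csi * e_csi))

# Example: strong per-session scores but poor durable learning.
score = weighted_agi_score([0.9, 0.8, 0.3], centrality=[2.0, 1.5, 1.0])
csi = cluster_stability(p_csi=0.8, d_csi=0.1, e_csi=0.6)
print(f"centrality-weighted score: {score:.2f}, composite CSI: {csi:.2f}")
```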
Economic Impact Metrics
AGI Scores have been connected quantitatively to macroeconomic indicators. A mathematical algorithm based on Cobb–Douglas production functions relates the level of AGI technology ($A$) to real GDP ($Y$), treating $A$ as the technology term in $Y = A \cdot K^{\alpha} L^{\beta}$ (Gondauri, 20 May 2025). Regression analysis reveals that a 12.5% increase in AGI level associates with a 1% increase in GDP, with a robust Pearson correlation.
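The elasticity can be recovered with a log-log regression; the time series below is hypothetical and chosen only so the fitted elasticity (~0.08) matches the 12.5%-to-1% relation reported above:

```python
import numpy as np

# Hypothetical series of AGI technology level A_t and real GDP Y_t.
A = np.array([1.0, 1.2, 1.5, 1.9, 2.4, 3.0])
Y = np.array([100.0, 101.5, 103.2, 105.0, 107.1, 109.3])

# A Cobb-Douglas-style relation Y = c * A^beta is linear in logs:
# log Y = log c + beta * log A.
beta, log_c = np.polyfit(np.log(A), np.log(Y), 1)
r = np.corrcoef(np.log(A), np.log(Y))[0, 1]

# beta is the elasticity: a 1% rise in A associates with a beta% rise in Y.
print(f"elasticity beta = {beta:.3f}, Pearson r = {r:.3f}")
print(f"GDP gain from a 12.5% AGI increase: {beta * 12.5:.2f}%")  # ~1%
```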
3. Benchmarking Practices
Comprehensive AGI Score evaluation employs multidomain and multimodal approaches:
- AGIBench (Tang et al., 2023) labels each task by ability branch, knowledge category, difficulty (human-referenced accuracy), and input modality, graded over 20 categories. Multi-granularity, zero-shot testing and auto-scoring algorithms (heuristic regex with fallback extraction) yield average, worst, best, majority-vote, and repeatability metrics that aggregate into composite scores (a toy version of this scoring flow follows this list).
- Signal-level benchmarks such as AGITB (Šprogar, 6 Apr 2025) probe low-level cognitive precursors through binary sequence prediction, isolating core computational invariants (determinism, sensitivity, generalization), engineered to resist brute-force and memorization strategies. All tests must be passed for AGI-level competence.
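A toy version of the AGIBench-style auto-scoring flow (the regex patterns and metric definitions here are illustrative approximations, not the benchmark's actual implementation):

```python
import re
from collections import Counter

def extract_answer(response: str):
    """Heuristic regex extraction with a fallback: look for an explicit
    'Answer: X' pattern first, then fall back to the last standalone
    option letter. Patterns are illustrative placeholders."""
    m = re.search(r"[Aa]nswer\s*[:is]*\s*\(?([A-D])\)?", response)
    if m:
        return m.group(1)
    letters = re.findall(r"\b([A-D])\b", response)
    return letters[-1] if letters else None

def score_task(responses: list, gold: str) -> dict:
    """Aggregate repeated runs of one task into multi-metric scores."""
    hits = [extract_answer(r) == gold for r in responses]
    votes = Counter(extract_answer(r) for r in responses)
    top_answer, top_count = votes.most_common(1)[0]
    return {
        "average": sum(hits) / len(hits),
        "worst": float(all(hits)),   # 1 only if every run is correct
        "best": float(any(hits)),    # 1 if at least one run is correct
        "majority_vote": float(top_answer == gold),
        "repeatability": top_count / len(responses),
    }

print(score_task(["Answer: B", "I think the answer is B", "C"], gold="B"))
```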
Competitive frameworks like AGI-Elo (Sun et al., 19 May 2025) model both agent and test case difficulty as ratings in a joint probabilistic system, updating scores per "competitive interaction" and quantifying "competency gaps" in long-tail challenges.
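The rating dynamics can be sketched with a standard Elo-style update, treating each evaluation as a match between agent and test case (the K-factor and scale below are conventional chess values assumed for illustration; AGI-Elo's exact parameterization may differ):

```python
def elo_update(agent_r: float, task_r: float, outcome: float,
               k: float = 16.0, scale: float = 400.0):
    """One 'competitive interaction' between an agent and a test case.
    outcome: 1.0 if the agent solves the task, 0.0 otherwise. Ratings
    move in opposite directions, so unsolved cases climb the scale."""
    expected = 1.0 / (1.0 + 10 ** ((task_r - agent_r) / scale))
    delta = k * (outcome - expected)
    return agent_r + delta, task_r - delta

# Example: an agent fails a task rated far below it -> large penalty.
agent, task = elo_update(1600.0, 1200.0, outcome=0.0)
print(f"agent {agent:.1f}, task {task:.1f}")  # agent 1585.5, task 1214.5
```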
4. Safety, Reliability, and Policy Dimensions
AGI Score frameworks increasingly integrate safety, control, and policy readiness:
- Composite metrics include value alignment (IRL, CIRL compliance), robustness to adversarial attacks, transparency (e.g., ReluPlex formal verification, t-SNE auditing), and corrigibility (willingness to accept shutdown or modification commands) (Everitt et al., 2018).
- Public policy can directly elevate AGI Score: requirements for full disclosure, peer-reviewed safety trials, cross-checking training data (reward corruption prevention), and compliance with regulatory standards (intrinsic and extrinsic measures) enhance safety and controllability.
- Scaling laws and risk matrices account for autonomy levels in deployment (Levels 0–5), mapping performance, breadth, and autonomy to guide risk mitigation (Morris et al., 2023).
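How the composite metrics and autonomy levels above might combine is sketched below; the weakest-link rule, component names, and autonomy discounts are assumptions for illustration, not a published scoring formula:

```python
# Hypothetical discount per deployment autonomy level (Levels 0-5):
# higher autonomy demands a larger safety margin.
AUTONOMY_DISCOUNT = {0: 1.0, 1: 0.9, 2: 0.8, 3: 0.6, 4: 0.4, 5: 0.2}

def safety_readiness(components: dict, autonomy_level: int) -> float:
    """Weakest-link aggregation scaled by autonomy: no strength can
    compensate for a failing safety component."""
    return min(components.values()) * AUTONOMY_DISCOUNT[autonomy_level]

print(safety_readiness(
    {"value_alignment": 0.8, "robustness": 0.6,
     "transparency": 0.7, "corrigibility": 0.3},
    autonomy_level=4,
))  # 0.3 * 0.4 = 0.12
```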
5. Limitations, Gaps, and Future Progress
Despite advances, current AGI Scores highlight significant limitations:
- State-of-the-art models (e.g., GPT-4: 27%, GPT-5: 58% (Hendrycks et al., 21 Oct 2025)) reveal "jagged" cognitive profiles, with major deficits in foundational domains.
- Coherence-based measures expose a much lower effective score (~7% for GPT-4, ~24% for GPT-5 (Fourati, 23 Oct 2025)), stressing the impact of imbalanced abilities.
- Signal-level benchmarks show that no current AI system passes all low-level cognitive tests required for AGI (Šprogar, 6 Apr 2025).
- High aggregate scores on benchmarks like ARC-AGI (e.g., o3 at 87.5%) are achieved predominantly via brute-force search and skill exploitation, falling short of the adaptability and efficiency crucial for AGI (Pfister et al., 13 Jan 2025).
- Economic correlation models point to a lag between AGI development and realized GDP impact—posing policy and strategic challenges (Gondauri, 20 May 2025).
- Transformative AGI—defined as full substitution for human labor across nearly all valuable tasks—is judged extremely unlikely (<1% by 2043) under conservative cascading probability models (Allyn-Feuer et al., 2023).
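A worked sketch of the cascading model (stage probabilities below are hypothetical placeholders, not the paper's published estimates; the point is that conjunctive requirements compound multiplicatively):

```python
import math

# Probability that each necessary precondition holds by 2043, e.g.
# algorithmic progress, cheap inference compute, robotic embodiment,
# scaled deployment, absence of derailment. Values are hypothetical.
stage_probs = [0.60, 0.55, 0.50, 0.50, 0.45, 0.60, 0.70, 0.80, 0.70, 0.60]

p_joint = math.prod(stage_probs)
print(f"P(transformative AGI by 2043) ~ {p_joint:.2%}")  # ~0.52%
```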
6. Practical Applications and Directions
AGI Score frameworks provide operational metrics for:
- Diagnosing specific cognitive bottlenecks and guiding targeted research (e.g., long-term memory integration, robust reasoning, continual learning module development).
- Benchmark development—advising on multidomain, multimodal, and process-resilient testing protocols, such as persistent learning, stability under perturbation, and anti-gaming safeguards (Reynolds, 17 Oct 2025).
- Economic planning—quantifying investment leverage through AGI-induced productivity gains (Gondauri, 20 May 2025).
- Governance—informing global regulatory frameworks, compliance standards, and deployment readiness through transparent, reproducible scoring.
A plausible implication is that future AGI Score formulations will increasingly weigh domain coherence, stability, and real-world competence, leveraging data science methodologies (e.g., out-of-time testing, agency calibration (Hawkins, 2 Oct 2025)) over synthetic, easily exploited proxies.
7. Comparative Table of AGI Score Frameworks
| Framework / Paper | Key Metric Type | Domain Coverage / Weighting |
|---|---|---|
| CHC-based Score (Hendrycks et al., 21 Oct 2025) | Arithmetic mean of 10 domains | Equal (10% per domain), psychometric task grounding |
| Coherence-based AUC (Fourati, 23 Oct 2025) | Integral of generalized means (p in [-1,1]) | Penalizes imbalance, reflects adequacy across all domains |
| Cluster Stability Index (Reynolds, 17 Oct 2025) | Geometric mean of persistence, learning, error-correction indices | Weighted by causal centrality, sensitivity bands |
| AGIBench (Tang et al., 2023) | Multi-metric (average, worst, best, repeatability) | Granular branches, multimodal input, human-referenced difficulty |
| Signal-level (AGITB) (Šprogar, 6 Apr 2025) | All-tests-must-pass, low-level cognitive invariant compliance | Determinism, sensitivity, generalization, biological alignment |
| AGI-Elo (Sun et al., 19 May 2025) | Competitive rating, competency gap analysis | Agent and test case difficulty, across vision/language/action |
| Economic Score (Gondauri, 20 May 2025) | Regression coefficient, Pearson's r | AGI-induced GDP increase per incremental AGI development |
The diversity of AGI Score frameworks reflects multiple priorities: cognitive breadth, coherence, economic impact, safety, and robustness. The prevailing trend is toward composite, multidomain metrics supported by rigorous benchmarking and transparent diagnostics, designed to resist superficial gaming and guide the incremental, safe advance of Artificial General Intelligence.