SuperARC Test: Algorithmic Intelligence Evaluation
- The SuperARC Test is an evaluation framework based on algorithmic probability, using recursive compression to assess an AI's ability to model data.
 - It employs a CTM/BDM hybrid approach to search for minimal generative programs that abstract and predict input sequences.
 - Empirical findings reveal that neurosymbolic models outperform LLMs in abstraction and prediction, highlighting the limits of purely data-driven methods.
 
The SuperARC Test is an open-ended, algorithmically principled evaluation of machine intelligence, founded on the principles of algorithmic probability and recursive compression. It is designed to provide an agnostic, contamination-resistant framework for distinguishing between narrow AI, artificial general intelligence (AGI), and superintelligence (ASI) by quantifying an agent’s ability to abstract, synthesize, and predict through model creation, rather than statistical memorization or pattern matching (Hernández-Espinosa et al., 20 Mar 2025).
1. Theoretical Foundations
SuperARC grounds its test design in Algorithmic Information Theory, specifically leveraging the concepts of Kolmogorov complexity and algorithmic probability. Intelligence in this context is operationally defined as the system’s ability to produce a computer-executable “model” that losslessly and minimally reproduces the observed data. Two complementary capacities are central:
- Recursive Compression (Abstraction): The ability to encode the salient features of a dataset into the shortest possible generative program.
 - Prediction (Planning): The ability to decompress or use the model to generate subsequent tokens or predict future data points.
 
Formally, the Kolmogorov complexity $K_U(s)$ of a string $s$ is defined as

$$K_U(s) = \min\{\, |p| \;:\; U(p) = s \,\},$$

with $U$ denoting a universal Turing machine and $|p|$ the length of the program $p$.
This theoretical underpinning equates predictive power directly with compressive power: superior prediction implies superior compression, and vice versa.
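To make the compression-prediction link concrete, the following Python sketch (an expository illustration, not the SuperARC implementation) selects the shortest candidate program that losslessly reproduces an observed sequence and then reuses it to predict further terms; the candidate set and the use of source-code length as a stand-in for the program length $|p|$ are assumptions made purely for illustration.

```python
# Illustrative sketch: the shortest program that reproduces the data also predicts it.
# The candidate "programs" and the source-length proxy for |p| are assumptions.
observed = [1, 4, 9, 16, 25, 36]  # observed prefix of an integer sequence

# Candidate generative programs as (source text, callable) pairs.
candidates = [
    ("lambda n: n * n",                  lambda n: n * n),
    ("lambda n: 2 * n - 1",              lambda n: 2 * n - 1),
    ("lambda n: [1,4,9,16,25,36][n-1]",  lambda n: [1, 4, 9, 16, 25, 36][n - 1]),  # "print"-style
]

def reproduces(program, data):
    """Check that the program regenerates the observed prefix exactly."""
    return all(program(n) == x for n, x in enumerate(data, start=1))

# Keep programs that losslessly reproduce the data, then take the shortest description.
valid = [(src, prog) for src, prog in candidates if reproduces(prog, observed)]
src, best = min(valid, key=lambda pair: len(pair[0]))

print("shortest model:", src)                                    # lambda n: n * n
print("predicted next terms:", [best(n) for n in range(7, 10)])  # [49, 64, 81]
# The explicit-lookup candidate also reproduces the prefix but is longer and cannot
# extend beyond term 6, illustrating why compressive models are also predictive.
```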
2. Methodological Structure
The SuperARC Test employs methodological constructs explicitly designed to evade problems of benchmark contamination endemic to static challenge sets. Its methodology can be summarized as:
- Algorithmic Model Search: The test requires the discovery of a compression algorithm $f$ and its inverse $f^{-1}$ (a decompression or generation procedure), such that for the target sequence $s$:

$$f^{-1}(f(s)) = s,$$

subject to minimizing $|f(s)|$, where $|f(s)|$ is a measure of description length or model complexity.
- CTM/BDM Hybrid: The Coding Theorem Method (CTM) is used for exhaustive enumeration and probability estimation of small programs. For larger structures, the Block Decomposition Method (BDM) is employed:

$$\mathrm{BDM}(s) = \sum_{(b_i,\, n_i)} \big[\, \mathrm{CTM}(b_i) + \log_2 n_i \,\big],$$

where $s$ is decomposed into blocks $b_i$, each occurring $n_i$ times, enabling the hybrid assessment of local algorithmic complexity and global structure.
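As a concrete illustration of the BDM expression above, the following Python sketch (a minimal, self-contained toy) decomposes a binary string into fixed-size blocks and sums per-block complexity estimates plus a logarithmic multiplicity term. The CTM values in the lookup table are invented placeholders; real CTM estimates are obtained by exhaustive enumeration of small Turing machines, for example as distributed with the pybdm package.

```python
import math
from collections import Counter

# Hypothetical CTM estimates for a few 4-bit blocks (placeholder values, illustration only).
CTM_APPROX = {
    "0000": 10.2, "1111": 10.2,
    "0101": 12.8, "1010": 12.8,
    "0011": 13.5, "1100": 13.5,
}

def bdm(string, block_size=4):
    """BDM(s) = sum over unique blocks b of CTM(b) + log2(n_b),
    where n_b is the number of occurrences of block b in s."""
    blocks = [string[i:i + block_size] for i in range(0, len(string), block_size)]
    counts = Counter(blocks)
    return sum(CTM_APPROX[b] + math.log2(n) for b, n in counts.items())

print(bdm("0101" * 8))                          # highly regular string: low BDM value
print(bdm("00000101001111001111101001010000"))  # more varied blocks: higher BDM value
```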
3. Distinctions from Conventional Benchmarks
Unlike classical intelligence benchmarks or the static Abstraction and Reasoning Corpus (ARC) challenge, SuperARC is explicitly open-ended. This prevents models from benefiting through indirect memorization, precomputed answer sets, or information leakage over time. The test’s foundation in algorithmic probability, rather than Shannon entropy or statistical compression (e.g., gzip, LZW), ensures that it evaluates for genuine model synthesis capabilities beyond surface-level pattern recognition.
The emphasis shifts from answer correctness in a human-annotated dataset to the capacity for synthesis: a correct answer must be represented as a minimal generative program (e.g., algorithm, formula), penalizing solutions that merely return or “print” the target sequence without offering interpretive abstraction.
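As a toy illustration of this penalty (a heuristic assumed here for exposition, not the test's actual classifier), the sketch below flags candidate solutions that embed the target sequence verbatim and therefore offer no interpretive abstraction.

```python
# Crude illustrative heuristic: a solution that contains the target sequence as a
# literal is treated as a "print"-style answer. This is an assumption for exposition,
# not the SuperARC classification procedure.
def classify_solution(source_code: str, target: list) -> str:
    literal = ",".join(str(x) for x in target)
    if literal in source_code.replace(" ", ""):
        return "trivial print"
    return "algorithmic"

target = [1, 4, 9, 16, 25]
print(classify_solution("print([1, 4, 9, 16, 25])", target))           # trivial print
print(classify_solution("print([n*n for n in range(1, 6)])", target))  # algorithmic
```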
4. Evaluation Metrics and Scoring
Assessment of performance is based on both the correctness and the compressiveness of the generated models:
- Solutions are classified by type: non-print/non-ordinal (algorithmic), ordinal, trivial print, or incorrect.
 - The principal metric combines solution types with the degree of compression,

$$\text{Score} = \sum_{i} w_i\, f_i,$$

where the $f_i$ denote the fractions of outputs in each category, and the $w_i$ are the relative weightings that reflect achieved compression.
Scoring inherently rewards algorithmic abstraction over rote repetition. For example, a solution that identifies and formalizes the generative law behind a sequence achieves a higher score than one that produces an explicit output enumeration.
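A minimal sketch of such a category-weighted score is shown below; the weights and category fractions are hypothetical numbers chosen only to illustrate the computation, not values reported for the benchmark.

```python
# Hypothetical example of the category-weighted score: Score = sum_i w_i * f_i.
def weighted_score(fractions: dict, weights: dict) -> float:
    return sum(weights[c] * fractions.get(c, 0.0) for c in weights)

# Invented category fractions for some model's outputs, with weights that favor
# algorithmic abstraction over rote repetition.
fractions = {"algorithmic": 0.25, "ordinal": 0.15, "trivial print": 0.50, "incorrect": 0.10}
weights   = {"algorithmic": 1.0,  "ordinal": 0.5,  "trivial print": 0.1,  "incorrect": 0.0}

print(round(weighted_score(fractions, weights), 3))  # 0.375
```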
5. Empirical Findings on LLMs and Neurosymbolic Models
Empirical results yielded critical insights into the performance of LLMs versus algorithmic-neurosymbolic approaches. Key findings include:
- LLM Limitations: When presented with non-trivial binary or non-binary integer sequences, modern LLMs predominantly returned trivial “print” solutions—directly replicating the output or providing simple ordinal mappings—rather than constructing compressed generative programs. Their prediction accuracy and abstraction performance degraded substantially with input complexity, indicating a reliance on memorization rather than genuine synthesis.
 - Model Version Fragility: Newer model versions displayed inconsistent advancement, and in some cases, regression compared to previous iterations. This suggests that model improvement is heavily tied to training data size rather than a qualitative leap in reasoning capacity.
 - Hybrid Neurosymbolic Outperformance: The CTM/BDM hybrid model consistently outperformed LLMs in proof-of-concept tasks focused on short sequence abstraction and prediction, successfully uncovering minimal generative models and showing complexity measures congruent with the input data’s structure. Theoretical analysis attributes this to the hybrid model’s grounding in algorithmic probability and optimal Bayesian inference (Solomonoff induction), which, in principle, guarantees universal intelligence.
 
6. Implications for AGI and Superintelligence Assessment
The SuperARC Test exposes fundamental limitations in LLM-based systems regarding AGI and ASI claims. Since current statistical models achieve mastery primarily in perceptual fluency and memorization rather than in model-based synthesis and planning, their ability to generalize beyond training data is questionable within this framework.
By explicitly targeting abilities central to intelligence—explanation, abstraction, inductive synthesis, and planning—the SuperARC Test establishes a robust evaluation axis for both natural and artificial agents. The demonstrable gap between LLMs and CTM/BDM hybrids in generating minimal, predictive models indicates the necessity for future AI research to integrate algorithmic probability and recursive synthesis rather than relying solely on large-scale data-driven learning.
7. Prospective Applications and Future Directions
SuperARC is formulated to be broadly applicable across diverse AI paradigms and agnostic to the architecture under evaluation. Its open-ended, contamination-resistant design is suitable for continuous frontier model evaluation as algorithmic benchmarks evolve. The CTM/BDM hybrid’s theoretical properties imply that integration of neurosymbolic, inference-driven approaches could form the basis for more generalizable intelligence in future AI systems.
A plausible implication is that future developments in artificial intelligence, particularly those aimed at AGI or ASI, will require moving toward models that manifest optimal inference and synthesis abilities as dictated by algorithmic probability and minimal description length, rather than further scaling of data and parameter count.
SuperARC thus constitutes a fundamental reorientation of intelligence evaluation, distinguishing systems capable of algorithmic synthesis and genuine generalization from those that merely replicate observable data, and providing a contamination-resistant, principled measure of progress toward general or superintelligent artificial agents.