
SuperARC: Recursive Compression Intelligence Test

Updated 19 November 2025
  • SuperARC is a framework that quantitatively assesses intelligence via recursive compression and generative model construction.
  • It evaluates agents by measuring causal inference and model abstraction rather than mere statistical pattern matching.
  • The framework leverages Algorithmic Information Theory and the Block Decomposition Method to benchmark compression efficiency and predictive generalization.

SuperARC is a framework for the quantitative assessment of machine and biological intelligence, designed to be agnostic with respect to architecture, training method, and species. Grounded in the formalism of Algorithmic Information Theory (AIT), SuperARC operationalizes a test of intelligence through recursive compression and algorithmic probability, with a specific emphasis on abduction and generative model construction. Unlike tests that rely on statistical compression and benchmark data, SuperARC seeks to evaluate the depth of causal inference and model abstraction displayed by the agent, thus enabling principled comparison across narrow AI, AGI, theoretical ASI, and even natural intelligence such as humans and animals (Hernández-Espinosa et al., 20 Mar 2025).

1. Foundational Principles

SuperARC is rooted in the premise that intelligence is the ability to construct the shortest effective model of observed data (abstraction via recursive compression) and to use this model for extrapolation and planning (prediction via algorithmic probability). The motivation is to avoid benchmark contamination and to prevent overfitting to narrow datasets, which can afflict standardized intelligence or capability tests. Evaluations are performed on tasks generated dynamically, preventing memorization or mere surface-level pattern matching.
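
As a minimal illustration of what dynamic task generation can look like (the sequence families and parameters below are our own, not the paper's generator), each test instance can be drawn fresh from a space of small generating programs:

```python
import random

def make_task(rng: random.Random, length: int = 12):
    """Draw a fresh integer sequence from a small family of generating
    programs, so there is no fixed benchmark set to memorize."""
    a, b = rng.randint(1, 5), rng.randint(0, 9)
    family = rng.choice(["affine", "quadratic", "geometric"])
    if family == "affine":          # n -> a*n + b
        seq = [a * n + b for n in range(length)]
    elif family == "quadratic":     # n -> a*n^2 + b
        seq = [a * n * n + b for n in range(length)]
    else:                           # n -> b + a^n (kept short)
        seq = [b + a ** n for n in range(min(length, 8))]
    return family, seq

print(make_task(random.Random()))   # unseeded: a new task every run
```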

Key mathematical underpinnings include:

  • Kolmogorov Complexity: K(x), the length of the shortest program computing x on a universal prefix machine.
  • Algorithmic Probability / Coding Theorem:

-\log m(x) = K(x) \pm O(1)

where m(x) = \sum_{p : U(p) = x} 2^{-|p|} (the Solomonoff–Levin semi-measure).

  • Block Decomposition Method (BDM): An additive decomposition that approximates K(x) for large objects by summing the complexity of blocks, determined via exhaustive enumeration of small Turing machines, plus the logarithm of their multiplicities.
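
A minimal sketch of the BDM computation, assuming a toy CTM lookup table (the placeholder values below are illustrative; real tables, produced by exhaustive enumeration of small Turing machines, ship with libraries such as pybdm):

```python
from collections import Counter
from math import log2

# Illustrative CTM values (in bits) for 2-bit blocks; real values come
# from exhaustive enumeration of small Turing machines, not from here.
CTM = {"00": 2.5, "01": 3.0, "10": 3.0, "11": 2.5}

def bdm(s: str, block: int = 2) -> float:
    """BDM(s): sum, over distinct blocks, of CTM(block) plus the
    logarithm of the block's multiplicity."""
    parts = [s[i:i + block] for i in range(0, len(s), block)]
    return sum(CTM[b] + log2(n) for b, n in Counter(parts).items())

print(bdm("00000000"))  # 4.5: one repeated block, low complexity
print(bdm("01101001"))  # 8.0: two distinct blocks, each repeated twice
```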

This theoretical apparatus establishes that optimal compression and optimal prediction are functionally equivalent: if an agent can compress a sequence infinitely often, then it can predict it infinitely often, and vice versa.

2. Formal Test Protocol and Evaluation

Each SuperARC instance provides a sequence τ (e.g., binary strings or integer sequences) as input. The agent must produce:

  • A: An encoding program that recursively compresses τ into a compact representation ∂.
  • A^{-1}: A decoding program that reconstructs τ from ∂.

Programs are evaluated both for their descriptive parsimony and for their predictive generalization: A^{-1} is executed beyond the given data to test model extrapolation. Trivial solutions (e.g., print(...)) are penalized; programmatic solutions are rewarded in proportion to their compression ratio and correctness.
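
As a hypothetical example of a non-trivial submission (our own illustration, not taken from the paper), consider τ = 1, 4, 9, 16, ...: the encoder A reduces τ to a parameter tuple ∂, and the decoder A^{-1} both reconstructs and extrapolates:

```python
tau = [1, 4, 9, 16, 25, 36]           # the observed sequence

def A(seq):
    """Encoder: recognize the generating rule and emit a compact ∂.
    A print-style solution would store seq verbatim and be penalized."""
    assert all(v == (n + 1) ** 2 for n, v in enumerate(seq))
    return ("square", len(seq))       # ∂: rule identifier + length

def A_inv(partial, extra=0):
    """Decoder: regenerate tau from ∂; running it past the observed
    prefix (extra > 0) is how predictive generalization is tested."""
    kind, n = partial
    assert kind == "square"
    return [(i + 1) ** 2 for i in range(n + extra)]

d = A(tau)
assert A_inv(d) == tau                # faithful reconstruction
print(A_inv(d, extra=3)[-3:])         # extrapolation: [49, 64, 81]
```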

The scoring metric Φ is defined to reflect the complexity-normalized ability to both compress and predict. Each candidate response is assigned to one of four classes:

  • ρ_1 = fraction of correct, non-print, non-ordinal solutions;
  • ρ_2 = fraction of correct, ordinal solutions;
  • ρ_3 = fraction of correct, print solutions;
  • ρ_4 = fraction of incorrect solutions.

A normalized score δ_{k,j} is calculated as the harmonic mean of the block-normalized BDM of the target data and of the candidate program, and Φ aggregates scores across the classes, heavily rewarding non-print, non-ordinal solutions. A sketch of this aggregation follows the table below.

| Model | ρ_1 | ρ_2 | ρ_3 | ρ_4 | δ_1 | δ_2 | δ_3 | Φ |
|---|---|---|---|---|---|---|---|---|
| ASI (CTM/BDM) | 1.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 1.000 | 1.000 |
| chatgpt_4.5 | 0.00 | 1.00 | 0.00 | 0.00 | 0.000 | 0.419 | 0.000 | 0.042 |
| claude_3.5 | 0.06 | 0.14 | 0.00 | 0.80 | 0.449 | 0.428 | 0.000 | 0.033 |
| o1_mini | 0.00 | 0.64 | 0.00 | 0.36 | 0.000 | 0.537 | 0.000 | 0.034 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
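
As referenced above, a minimal sketch of the scoring, assuming the harmonic-mean form described in the text; the class weights are our own assumption, chosen so that the sketch reproduces the Φ values in the table:

```python
def harmonic_mean(a: float, b: float) -> float:
    return 2 * a * b / (a + b) if a + b else 0.0

def delta(bdm_data: float, bdm_prog: float) -> float:
    """delta for one solution: harmonic mean of the block-normalized
    BDM of the target data and of the candidate program (both in [0, 1])."""
    return harmonic_mean(bdm_data, bdm_prog)

def phi(rho, deltas, weights=(1.0, 0.1, 0.0)) -> float:
    """Weighted aggregate over response classes; non-print, non-ordinal
    solutions (rho_1) dominate. The weights here are assumed."""
    return sum(w * r * d for w, r, d in zip(weights, rho, deltas))

# claude_3.5 row from the table: rho = (0.06, 0.14, 0.00), delta = (0.449, 0.428, 0.000)
print(round(phi((0.06, 0.14, 0.00), (0.449, 0.428, 0.000)), 3))  # 0.033
```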

3. Theoretical Distinction: Recursive versus Statistical Compression

SuperARC explicitly differentiates between statistical and recursive compression. Statistical compressors (e.g., GZIP, LZW) exploit substring redundancy and symbol frequencies, in line with Shannon entropy. Such tools cannot capture the algorithmic simplicity of "climber" sequences: strings with significant underlying structure yet minimal statistical regularity.

By contrast, SuperARC employs BDM/CTM-based recursive compression, compelling agents to discover the causal, mechanistic programs generating a sequence. The test thereby rewards true model induction rather than mere frequency analysis, erecting a robust barrier against shortcut and trivial solutions.
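
The gap is easy to demonstrate. In the sketch below (our illustration; the paper's climber sequences are constructed differently), a byte string produced by a one-line deterministic program is statistically incompressible, yet its shortest description is just the generator plus a seed:

```python
import random
import zlib

# Statistically random, algorithmically simple: a one-line generator.
x = bytes(random.Random(0).randrange(256) for _ in range(4096))

print(len(zlib.compress(x, 9)))   # ~4100 bytes: GZIP-style coding gains nothing
# A recursive compressor can instead emit the generator line and the
# seed 0: a description of well under 100 bytes for the same 4 KB.
```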

The formal equivalence between the capacity for compression and the capacity for prediction is established using computably enumerable (super)martingales: a sequence is infinitely often compressible if and only if a left-semicomputable martingale can exploit it, which in turn holds if and only if the sequence is infinitely often predictable.
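
Schematically, writing K for prefix complexity and X↾n for the length-n prefix of an infinite binary sequence X, one common form of this correspondence reads (notation ours, paraphrasing the result):

K(X \upharpoonright n) \le n - c \ \text{for infinitely many } n \quad \Longleftrightarrow \quad \limsup_{n \to \infty} d(X \upharpoonright n) = \infty \ \text{for some c.e. supermartingale } d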

4. Empirical Findings: Model Benchmarks

Experiments compared frontier LLMs (including ChatGPT 4.5, Claude 3.5, and o1-mini; see the table above) and time-series foundation models (Chronos, TimeGPT-1, Lag-Llama) against “ASI”, an oracle implementation based on CTM/BDM. Two primary sequence types were evaluated:

  • Random binary strings: all models performed at chance (~50% prediction error), while CTM/BDM performed optimally.
  • Climber binary strings: models marginally exceeded chance on simple cases (Lag-Llama reached ~70% accuracy), but the gains did not generalize; CTM/BDM achieved 100% accuracy.

LLMs predominantly resorted to print- or ordinal-index solutions, failing to synthesize minimal programmatic representations. Their Φ scores remained well below 0.05, while CTM/BDM scored exactly 1.00 on all measures.

A plausible implication is that current LLMs lack the architectural bias or inference mechanism required for deep model abstraction or causal induction, exhibiting instead fragility, trivial memorization, and inability to generalize beyond training data.

5. Implications, Constraints, and Limitations

The unification of compression and predictive power places a lower bound on what can be meaningfully termed “intelligence” in both artificial and biological domains. Competence that amounts to pattern recognition, rather than resilient, generative model construction, is quantitatively exposed by SuperARC.

Notable implications and constraints include:

  • Encoding invariance: complexity values are machine-independent only up to an additive constant, so comparisons assume a fixed reference encoding.
  • Resource bounds: All approximations to Kolmogorov complexity, such as BDM and CTM, are computationally bounded; the gold standard AIXI remains uncomputable.
  • Test domain: Current SuperARC applications focus on static sequence generation. Extensions to interactive or embodied intelligence, or to multimodal data, are under investigation.

Open questions concern the alignment of SuperARC scores with human psychometrics, test robustness on non-sequential data, and the integration of neurosymbolic systems that combine LLM-style pretraining with CTM/BDM inference modules.

6. Significance for AGI, ASI, and the Future of Intelligence Evaluation

SuperARC provides a theory-grounded, contamination-resistant standard for the assessment of intelligence. Its operationalization of the universality bound via CTM/BDM approximates—within resource constraints—the theoretical performance of AIXI, permitting principled comparisons across agent types. Human and animal intelligence, narrow domain AI, and frontier AGI/ASI proposals can all be situated along the same metric axis, provided they generate recursive, not statistical, models.

This suggests that advancements toward AGI and ASI demand architectural solutions capable of explicit, minimal program induction. Failure to achieve compression-predictive equivalence, as exposed by SuperARC, provides a diagnostic for shallow, non-generalizing models. The broader implication is a reorientation of intelligence research toward the synthesis and manipulation of explicit programs, as opposed to optimization for statistical language mastery (Hernández-Espinosa et al., 20 Mar 2025).

