CellARC: A Synthetic CA Benchmark

Updated 28 June 2026

CellARC is a synthetic abstraction-and-reasoning benchmark constructed from 1D cellular automata, enabling systematic comparisons of neural and symbolic models.
It features tunable difficulty knobs—such as alphabet size, radius, Langton’s lambda, and per-cell entropy—to precisely control task complexity.
Data episodes combine five support (input, output) pairs with a masked query, facilitating evaluation via in-context learning and task embedding methods.

CellARC is a synthetic abstraction-and-reasoning benchmark constructed from multicolor one-dimensional cellular automata (1D CA), designed to evaluate the ability of models—both neural and symbolic—to discover and generalize new rules under tightly controlled information, parameter, and data budgets. Each CellARC episode consists of five (input, output) support pairs and a query, serialized within 256 tokens, enabling direct comparison of small-scale neural models and symbolic baselines. The benchmark introduces tunable “difficulty knobs” including alphabet size, radius, rule family, Langton’s lambda ( $\lambda$ ), window coverage, and per-cell entropy, providing an explicitly quantifiable and reproducible framework for measuring abstraction, reasoning, and systematic generalization ability outside anthropocentric task priors (Lžičař, 11 Nov 2025).

1. Formal Framework: Cellular Automata and Task Construction

CellARC tasks are generated from formal 1D CA, parameterized by:

Alphabet size $k$: Number of discrete cell states, $k \in \{2,\dots,6\}$.
Radius $r$: Local neighborhood size, $r \in \{1,2,3\}$, so that each cell’s update depends on itself and its $r$ nearest neighbors on each side.
Update rule $F$: Maps each local configuration of length $2r+1$ to a next state via $a_i^{t+1} = F(a_{i-r}^t,\ldots,a_{i+r}^t)$, with $a_i^t \in \{0,\dots,k-1\}$.
Langton’s lambda ($\lambda$): Activity level, capturing the fraction of non-quiescent transitions:

$$\lambda = 1 - \frac{n_q}{k^{2r+1}}$$

where $n_q$ is the count of neighborhood patterns mapping to the quiescent state $q$.
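The update rule and Langton’s lambda can be illustrated with a minimal Python sketch. It uses the standard definition (one minus the fraction of neighborhoods that map to the quiescent state). The function names, the periodic boundary handling, and the choice of state 0 as quiescent are assumptions made for illustration, not details confirmed by the benchmark.

```python
import itertools
import random


def random_rule_table(k, r, seed=0):
    """Hypothetical helper: a random lookup table F over all k^(2r+1) neighborhoods."""
    rng = random.Random(seed)
    width = 2 * r + 1
    return {nb: rng.randrange(k) for nb in itertools.product(range(k), repeat=width)}


def ca_step(state, rule, r):
    """One synchronous update a_i^{t+1} = F(a_{i-r}^t, ..., a_{i+r}^t).

    Periodic (wrap-around) boundaries are an assumption of this sketch.
    """
    n = len(state)
    return [rule[tuple(state[(i + d) % n] for d in range(-r, r + 1))] for i in range(n)]


def langton_lambda(rule, quiescent=0):
    """Fraction of neighborhoods that do NOT map to the quiescent state."""
    n_q = sum(1 for out in rule.values() if out == quiescent)
    return 1.0 - n_q / len(rule)


# A binary radius-1 rule that XORs the two outer neighbors (a linear rule).
xor_rule = {nb: nb[0] ^ nb[2] for nb in itertools.product(range(2), repeat=3)}
print(ca_step([0, 0, 1, 0, 0], xor_rule, r=1))  # [0, 1, 0, 1, 0]
print(langton_lambda(xor_rule))                 # 0.5

# A random 3-state, radius-1 table has 3^3 = 27 entries.
rule = random_rule_table(k=3, r=1, seed=0)
print(len(rule), round(langton_lambda(rule), 3))
```

Storing $F$ as a dictionary keyed by neighborhood tuples keeps the sketch close to the formal definition; a packed array indexed by a base-$k$ integer would be the faster choice for large-scale generation.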

Rule families are sampled from random tables, totalistic, outer/inner-totalistic, threshold, and linear-mod-$k$ variants, systematically diversifying functional complexity.
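To make two of these families concrete, the following sketch builds a totalistic table (the next state depends only on the neighborhood sum) and a linear-mod-$k$ table (the next state is a weighted sum reduced modulo $k$). The function names and the random sampling of the sum table and weights are illustrative assumptions; the benchmark’s actual samplers may differ.

```python
import itertools
import random


def totalistic_rule(k, r, seed=0):
    """Next state depends only on the sum of the 2r+1 neighborhood cells."""
    rng = random.Random(seed)
    width = 2 * r + 1
    # One output per possible sum, from 0 to width * (k - 1).
    by_sum = [rng.randrange(k) for _ in range(width * (k - 1) + 1)]
    return {nb: by_sum[sum(nb)] for nb in itertools.product(range(k), repeat=width)}


def linear_mod_k_rule(k, r, seed=0):
    """Next state is a fixed weighted sum of the neighborhood, reduced mod k."""
    rng = random.Random(seed)
    width = 2 * r + 1
    weights = [rng.randrange(k) for _ in range(width)]
    return {
        nb: sum(w * a for w, a in zip(weights, nb)) % k
        for nb in itertools.product(range(k), repeat=width)
    }


tot = totalistic_rule(k=3, r=1, seed=1)
lin = linear_mod_k_rule(k=3, r=1, seed=1)

# Totalistic: neighborhoods with the same sum share an output.
print(tot[(0, 1, 2)] == tot[(2, 1, 0)] == tot[(1, 1, 1)])  # True

# Linear: F(a + b mod k) == (F(a) + F(b)) mod k.
a, b = (1, 2, 0), (2, 2, 1)
ab = tuple((x + y) % 3 for x, y in zip(a, b))
print(lin[ab] == (lin[a] + lin[b]) % 3)  # True
```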

2. Episodic Structure, Serialization, and Regimes

Each episode includes five support (input, output) pairs, generated by running CA steps from random initial strings, and one query input, whose output is masked at training and must be inferred at test. Serialization uses six reserved tokens:

I (Input), O (Output), Q (Query), T (Target), M (Mask), E (End)

Training episodes flatten to a single token sequence built from these markers, with all query output cells masked by M. The total sequence is constrained to $\le 256$ tokens (median 142), ensuring computational feasibility and stringent comparison.
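As a rough illustration of the flattening step, the sketch below serializes an episode using the six reserved markers. The exact token order, the use of string tokens, and the assumption that the masked target has the same length as the query input are all hypothetical choices for this example; only the marker names and the 256-token budget come from the description above.

```python
def serialize_episode(supports, query_input, max_tokens=256):
    """Flatten supports and a masked query into one token list (assumed layout).

    Assumed layout: I x O y (per support pair), then Q x_q T M...M E.
    """
    tokens = []
    for x, y in supports:
        tokens += ["I", *map(str, x), "O", *map(str, y)]
    tokens += ["Q", *map(str, query_input), "T"]
    # Every query output cell is replaced by the mask token.
    tokens += ["M"] * len(query_input)
    tokens.append("E")
    if len(tokens) > max_tokens:
        raise ValueError(f"episode needs {len(tokens)} tokens, budget is {max_tokens}")
    return tokens


supports = [([0, 1, 2, 0], [1, 1, 0, 2])] * 5
seq = serialize_episode(supports, [2, 0, 1, 1])
print(len(seq))     # 61
print(seq[-6:])     # ['T', 'M', 'M', 'M', 'M', 'E']
```

With five pairs of length-4 strings, each pair costs 10 tokens, the query costs 6, and the masked target plus end marker costs 5, for 61 tokens in total.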

Two experimental regimes are defined:

ICL (In-context learning): All pairs serialized in a single input sequence.
TE (Task Embedding): Models train on support pairs independently, learn per-episode embeddings, and decode the query response conditioned on the embedding plus the query input.

3. Difficulty Knobs and Quantitative Metrics

CellARC exposes explicit controls over:

Alphabet size $k$ and radius $r$ (as above).
Rule family (random, totalistic, threshold, etc.).
Langton’s $\lambda$: high values correspond to chaotic, unpredictable dynamics; low $\lambda$ yields more regular evolution.
Query-weighted coverage:

$$\mathrm{cov} = \frac{\left|\{\, w \in W_q : w \in W_s \,\}\right|}{|W_q|}$$

where $W_q$ denotes all local windows in the query (counted with multiplicity) and $W_s$ those in the supports, quantifying the overlap between observed and to-be-inferred patterns.
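One way to compute such a coverage score is sketched below: the fraction of query windows, counted with repetition, that also occur somewhere in the support inputs. The window width of $2r+1$, the non-wrapping sliding window, and the function names are assumptions for illustration rather than the benchmark’s exact definition.

```python
def windows(seq, r):
    """All contiguous windows of width 2r+1 (no wrap-around in this sketch)."""
    width = 2 * r + 1
    return [tuple(seq[i:i + width]) for i in range(len(seq) - width + 1)]


def query_coverage(support_inputs, query_input, r):
    """Share of query windows (with multiplicity) already seen in the supports."""
    seen = set()
    for x in support_inputs:
        seen.update(windows(x, r))
    query_windows = windows(query_input, r)
    if not query_windows:
        return 0.0
    return sum(1 for w in query_windows if w in seen) / len(query_windows)


# Supports expose windows (0,1,0) and (1,0,1); the query has (0,1,0) and (1,0,0).
print(query_coverage([[0, 1, 0, 1]], [0, 1, 0, 0], r=1))  # 0.5
```

A coverage of 1.0 means a lookup-style solver has seen every local pattern it needs, while values near 0 force a model to generalize beyond the observed windows.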

Per-cell Shannon entropy:

$$H_i = -\sum_{s=0}^{k-1} p_i(s)\,\log p_i(s)$$

where $p_i(s)$ denotes the empirical frequency of state $s$ at cell $i$, capturing diversity and unpredictability of the CA evolution.
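A per-cell entropy of this kind can be estimated from a rollout as in the sketch below. It assumes that the empirical frequency at each cell is taken over the rows (time steps) of the rollout and that logarithms are base 2; both choices, and the function name, are assumptions of this example.

```python
import math
from collections import Counter


def per_cell_entropy(rows):
    """Shannon entropy (in bits) of the state distribution at each cell position.

    `rows` is a list of equal-length configurations, e.g. successive CA steps.
    """
    n_rows = len(rows)
    entropies = []
    for column in zip(*rows):  # iterate over cell positions
        counts = Counter(column)
        h = sum(-(c / n_rows) * math.log2(c / n_rows) for c in counts.values())
        entropies.append(h)
    return entropies


rollout = [
    [0, 0, 2],
    [0, 1, 2],
    [0, 0, 1],
    [0, 1, 0],
]
h = per_cell_entropy(rollout)
print(h[0], h[1])  # 0.0 1.0  (constant cell vs. evenly split two-state cell)
print(sum(h) / len(h))  # mean entropy across cells
```

Averaging the per-cell values gives a single scalar per episode that can be used to bin tasks by unpredictability.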

Episodes may be synthesized with arbitrary values of these metrics, allowing systematic interpolation or extrapolation splits for out-of-distribution evaluation.

4. Dataset Composition and Splits

CellARC-100k contains 95,317 training episodes plus reproducible validation and test splits. Tasks are allocated as below:

Split	Count
Train	95,317
Validation	1,000
Test (Interp.)	1,000
Test (Extra.)	1,000
TestLLM_i	100
TestLLM_e	100

split_test_i (interpolation): Matches the parameter distributions of the training set.
split_test_e (extrapolation): Contains the lowest-coverage queries and the highest $\lambda$ and per-cell entropy, in the chaotic regime; covers 100% chaotic rules.
Median sequence length is 142 tokens; maximum 256 tokens.

This structure guarantees reproducibility, fine-grained difficulty control, and facilitates direct comparison between learning systems under fixed memory and compute.

5. Methodological Baselines and Evaluation

Six baseline classes establish reference performance on the benchmark:

Symbolic: Copycat, Most-Frequent, Random, and the de Bruijn local-map (which counts all $(2r+1)$-neighborhoods in supports, with majority back-off).
Recurrent Neural Networks: LSTM (encoder-only, trained without teacher forcing).
Convolutional Neural Networks: 1D dilated CNN with residual and GELU blocks.
Transformer: Encoder-only with ReLU feedforward and sinusoidal position embeddings.
Recursive Reasoning: TRM (Tiny Recursive Model), HRM (Hierarchical Reasoning Model), Transformer-ACT, each with Adaptive Computation Time unrolled up to 16 steps.
Closed LLM: GPT-5 High prompted in few-shot mode.
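The idea behind the symbolic local-map baseline can be sketched as a voting table from input windows to output cells. This is a simplified illustration, not the benchmark’s implementation: the window half-width `w`, the wrap-around boundaries, the equal input and output lengths, and the back-off to the globally most frequent output symbol for unseen windows are all assumptions.

```python
from collections import Counter, defaultdict


def fit_local_map(supports, w):
    """Count, for every input window of half-width w, the output symbols observed."""
    table = defaultdict(Counter)
    global_counts = Counter()
    for x, y in supports:
        n = len(x)
        for i in range(n):
            window = tuple(x[(i + d) % n] for d in range(-w, w + 1))
            table[window][y[i]] += 1
            global_counts[y[i]] += 1
    return table, global_counts


def predict(table, global_counts, query, w):
    """Majority vote per window; unseen windows fall back to the global majority."""
    fallback = global_counts.most_common(1)[0][0]
    n = len(query)
    out = []
    for i in range(n):
        window = tuple(query[(i + d) % n] for d in range(-w, w + 1))
        votes = table.get(window)
        out.append(votes.most_common(1)[0][0] if votes else fallback)
    return out


# Supports generated by XOR of the two outer neighbors with periodic boundaries.
supports = [
    ([0, 0, 1, 0, 0], [0, 1, 0, 1, 0]),
    ([1, 1, 0, 1, 0], [1, 1, 0, 0, 0]),
]
table, global_counts = fit_local_map(supports, w=1)
print(predict(table, global_counts, [0, 1, 0, 0, 1], w=1))  # [0, 0, 1, 1, 0]
print(predict(table, global_counts, [1, 1, 1], w=1))        # [0, 0, 0] (back-off)
```

The second query contains only the window (1, 1, 1), which never appears in the supports, so every cell falls back to the most frequent output symbol. This is the low-coverage situation in which lookup-style methods lose accuracy.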

Models were trained for 50 epochs on the 95,317 training cases (batch size 768, max seq-len 272) at scales of approximately 0.1M, 1M, and 10M parameters.

6. Comparative Performance and Observations

Main results for large models (per-token accuracy, %):

Model/Regime	Test Interp.	Test Extra.
GPT-5 High (LLM)	62.3	48.1
Oracle Ensemble†	65.4	35.5
Transformer (TE)	58.0	32.4
de Bruijn (symb.)	52.5	29.8
1D CNN (TE)	52.7	29.0
LSTM (TE)	51.0	27.9
HRM (TE)	50.8	28.2
Transformer-ICL	51.0	28.3
TRM (TE)	48.7	26.4
Copycat/Random/MostFreq	≈30/32/50	≈18/18/28

†Oracle picks, per episode, between Transformer-TE and de Bruijn (upper bound).
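The per-token accuracy metric and the oracle selection can be sketched as follows. The aggregation used here (mean of per-episode accuracies) and the function names are assumptions; the benchmark may instead pool tokens across episodes.

```python
def token_accuracy(pred, target):
    """Fraction of target cells predicted exactly."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)


def mean_accuracy(preds, targets):
    """Average per-episode token accuracy."""
    return sum(token_accuracy(p, t) for p, t in zip(preds, targets)) / len(targets)


def oracle_accuracy(preds_a, preds_b, targets):
    """Upper bound: per episode, keep whichever model scored higher."""
    best = [
        max(token_accuracy(a, t), token_accuracy(b, t))
        for a, b, t in zip(preds_a, preds_b, targets)
    ]
    return sum(best) / len(best)


targets = [[0, 1, 1, 0], [1, 1, 0, 0]]
model_a = [[0, 1, 0, 0], [1, 1, 0, 0]]  # strong on episode 2
model_b = [[0, 1, 1, 0], [0, 0, 0, 0]]  # strong on episode 1

print(mean_accuracy(model_a, targets))             # 0.875
print(mean_accuracy(model_b, targets))             # 0.75
print(oracle_accuracy(model_a, model_b, targets))  # 1.0
```

Because the oracle consults the ground truth to choose, it is an upper bound on what any selector between the two models could achieve, not a deployable method.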

Key findings:

The encoder-only Transformer with task embeddings is the best performing open baseline (58.0%/32.4% on interpolation/extrapolation), outperforming the de Bruijn method.
Symbolic de Bruijn performs best under high coverage, but neural methods degrade more gracefully when the query is less covered by supports.
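GPT-5 High, the closed LLM, scores highest among single models on both splits (62.3%/48.1%), with the widest margin on the hardest, chaotic, low-coverage extrapolation tasks (48.1% versus 32.4% for Transformer-TE).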
An oracle ensemble that selects the best prediction per episode reveals a strong neuro-symbolic complementarity: symbolic methods dominate when much of the query overlaps with supports; neural models are more robust on low-coverage/high entropy tasks.
Enhanced recursive reasoning (TRM, HRM, Transformer-ACT) and inductive bias (CNN, LSTM) do not surpass the vanilla Transformer despite additional compute or structural bias.

7. Significance, Limitations, and Future Directions

CellARC provides an analytically controlled setting for rule discovery and abstraction, eliminating anthropocentric priors and allowing precisely tunable stress tests. Difficulty is quantifiable and broadly adjustable via $k$, $r$, rule family, $\lambda$, coverage, and per-cell entropy. The fixed token budget ensures fair small-model vs. symbolic comparisons and supports efficient iteration.

Distinct advantages include unlimited difficulty-controlled dataset generation, explicit task-space metadata for ablations (ICL vs. TE vs. ACT, scaling, meta-learning, anticipated 2D CA extensions), and transparent evaluation splits. A plausible implication is that LLM pretraining enables amortized search over combinatorially many rule structures, reflected in superior performance on the most chaotic regimes.

Leaderboard access and further details are available at https://cellarc.mireklzicar.com (Lžičař, 11 Nov 2025).

References (1)

CellARC: Measuring Intelligence with Cellular Automata (2025)