
CLRS-30: Algorithmic Benchmark

Updated 3 February 2026
  • CLRS-30 is a comprehensive benchmark suite that tests neural models on the step-by-step execution of 30 classical algorithms using graph-based data representations.
  • It employs an encode–process–decode paradigm to generate and assess detailed execution traces, allowing precise measurement of algorithmic reasoning and OOD generalization.
  • The benchmark has spurred innovations in neural architecture design, including advanced GNNs, memory-augmented methods, and open-book reasoning strategies.

The CLRS-30 Algorithmic Benchmark is a comprehensive suite for evaluating neural algorithmic reasoning, centered on systematic stepwise execution of classical algorithms. Designed around the encode–process–decode paradigm, it enables detailed assessment of neural models’ ability to learn and generalize complex algorithmic computations across a broad range of tasks. The benchmark is widely utilized as a gold standard for empirical and theoretical innovation in neural algorithmic reasoning, serving as a unifying testbed for graph neural networks (GNNs), dedicated architectures, and multimodal systems engaging with algorithmic data.

1. Scope and Design of the CLRS-30 Suite

CLRS-30 spans 30 canonical algorithms from the "Introduction to Algorithms" textbook (Cormen et al.), including sorting (Insertion Sort, Quicksort, Heapsort), searching (Binary Search, Minimum, Quickselect), graph algorithms (BFS, DFS, Bellman–Ford, Dijkstra, Prim/MST, SCC, Bridges), dynamic programming (LCS, Matrix Chain Order, Optimal BST), geometric (Graham Scan, Jarvis’s March, Segments Intersect), string matching (Naïve, KMP), and greedy methods (Activity Selector, Task Scheduling) (Veličković et al., 2022). Each task is specified in a graph-centric format:

  • Inputs: node/edge/global features encoding algorithm-specific data (e.g., vertex weights, array values, pattern strings).
  • Ground-truth execution trace: a complete time series of ‘hints’ capturing intermediate algorithmic states per node/edge/global at each step, together with the final output.
  • Encoding: All data is represented as graphs with associated feature tensors (node features $X_V \in \mathbb{R}^{n \times d_v}$, edge features $X_E \in \mathbb{R}^{n \times n \times d_e}$), compatible with a wide class of neural architectures.

Data is generated by running precise textbook implementations (with injected probes to collect intermediate state) on randomly sampled graph/array/string instances of variable size. Train/validation/test splits are provided, including out-of-distribution (OOD) evaluation on much larger input sizes, ensuring rigorous assessment of algorithmic generalization.
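The probe-and-trace mechanism described above can be sketched for a BFS reachability task: a textbook implementation is instrumented to snapshot per-node state (reachability flags and parent pointers) after every dequeue, yielding a hint trace alongside the final output. This is an illustrative sketch, not the benchmark's actual sampler code.

```python
from collections import deque

def bfs_with_probes(adj, source):
    """Run textbook BFS on an adjacency list, probing the reached/parent
    state after every dequeue to build a CLRS-style hint trace."""
    n = len(adj)
    reached = [False] * n
    parent = list(range(n))  # parents start as self-pointers
    reached[source] = True
    queue = deque([source])
    hints = [(reached[:], parent[:])]  # probe the initial state
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if not reached[v]:
                reached[v] = True
                parent[v] = u
                queue.append(v)
        hints.append((reached[:], parent[:]))  # probe after each step
    return hints, parent  # (execution trace, final output)
```

The per-step snapshots play the role of the benchmark's hints; the final `parent` array is the supervised output.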

2. Evaluation Protocol and Metrics

Models are tasked with predicting the full step-by-step execution of each algorithm, either in a recurrent (per-timestep) or equilibrium (fixed-point) manner. The principal metric is the micro-averaged F1 score across all classification probes (hints and outputs), computed per step and averaged over the task trajectory and instances. For regression targets (e.g., path lengths), mean squared error is used. OOD generalization is a central criterion: models are trained on modest input sizes (e.g., $n = 16$) and evaluated on much larger, structurally diverse graphs (e.g., $n = 64$ or higher) (Veličković et al., 2022).
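A minimal sketch of the pooled micro-F1 computation over binary probes, with predictions and targets pooled across all timesteps and positions before the F1 ratio is taken (the benchmark's actual implementation also handles categorical and pointer probes):

```python
def micro_f1(pred_trace, true_trace):
    """Micro-averaged F1 over binary probe predictions, pooled across
    all timesteps and probe positions. Illustrative sketch only."""
    tp = fp = fn = 0
    for pred_step, true_step in zip(pred_trace, true_trace):
        for p, t in zip(pred_step, true_step):
            if p and t:
                tp += 1        # true positive
            elif p and not t:
                fp += 1        # false positive
            elif t and not p:
                fn += 1        # false negative
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because counts are pooled before the ratio is formed, long trajectories and large instances contribute proportionally more probes, which is the defining property of micro-averaging.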

The benchmark permits evaluation under various settings:

  • Hint-supervised: models receive supervision at every intermediate step of the algorithm.
  • No-hint: models are trained only on input-output pairs, with or without auxiliary self-supervised objectives (Rodionov et al., 2023).
  • Single-task and multi-task: learning one algorithm at a time or all 30 jointly.

3. Model Architectures and Algorithmic Alignment

CLRS-30 has driven systematic examination of architectural alignment between classical algorithms and neural modules:

  • Message-Passing GNNs: Baseline models use node/edge/global updates with sum or max aggregation. The pointer graph network (PGN) and Triplet-GMPNN introduce gating and three-way edge reasoning for improved expressivity (Ibarz et al., 2022).
  • Edge-centric and DP-aligned GNNs: Explicit alignment with dynamic programming (DP) is formalized using polynomial spans and semiring tensor formulations, yielding architectural variants such as $V^3$-GNNs and deep relational networks that mirror DP’s aggregation over triples (e.g., Floyd–Warshall) (Dudzik et al., 2022, Yu et al., 27 Jan 2026).
  • Recurrent Aggregators: LSTM-based sequential aggregation exploits natural orderings in list-based tasks, notably excelling on sorting and selection (Heapsort, Quickselect) (Xu et al., 2024).
  • Deep Equilibrium Models (DEAR): Fixed-point solvers replace T-step rollouts, finding the algorithm’s equilibrium output directly and enabling significant inference speedup without reliance on the ground-truth step count (Georgiev et al., 2024).
  • Architectures with explicit memory: Neural Priority Queues (differentiable analogues of classical PQs) and external memory augmentations address long-range dependencies and data structure imitation (Jain et al., 2023).
  • Attention and higher-order relational models: Triplet Edge Attention (TEA) and hybrid GNN–Transformer models (TransNAR) expand reasoning capacity via pair/triplet attention and MPNN–Transformer bridges (Jung et al., 2023, Bounsi et al., 2024).
  • Evolutionary Neural Architecture Search: Automated code-level search over architectural micro-choices via LMs generates variants that outperform hand-designed models on a majority of tasks (Chen et al., 2023).
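The three-way edge reasoning that Triplet-GMPNN learns can be illustrated by its hand-coded analogue: one Floyd–Warshall-style relaxation in which every edge $(i, j)$ aggregates over all intermediate nodes $k$. The learned model replaces the fixed $(\min, +)$ operations below with neural messages; this is a sketch of the alignment, not the architecture itself.

```python
import numpy as np

def triplet_relax_step(h_edge):
    """One step of three-way (triplet) edge reasoning on an (n, n) edge
    tensor, mirroring the Floyd-Warshall relaxation
    d[i, j] = min(d[i, j], min_k(d[i, k] + d[k, j]))."""
    # Broadcast to an (i, k, j) tensor of candidate path lengths.
    candidates = h_edge[:, :, None] + h_edge[None, :, :]
    relaxed = candidates.min(axis=1)  # aggregate over intermediate node k
    return np.minimum(h_edge, relaxed)
```

Running this step $n$ times on a weight matrix reproduces all-pairs shortest paths; a DP-aligned GNN applies the same triple-indexed computation pattern with learned update functions.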

4. Out-of-Distribution Generalization and Benchmarking

CLRS-30’s OOD regime reveals unique generalization challenges not seen in vision or language settings:

  • Standard deep learning augmentations and validation metrics are often uninformative; models easily saturate ID accuracy but exhibit substantial OOD failures (sometimes 0–20% accuracy on held-out sizes) (Mahdavi et al., 2022).
  • Input representation and data generation significantly influence OOD performance. Remedies include random scalar positional encodings (RSI), increased and diversified sampling (e.g., K-regular, two-community graphs), and perspective-invariant augmentations (Mahdavi et al., 2022).
  • OOD performance varies dramatically by algorithm family—string-matching and combinatorially complex graph tasks (KMP, Kruskal MST, SCC) remain most challenging for all major architectures (Jung et al., 2023, Veličković et al., 2022).
  • Recent architectures such as FloydNet, built on learnable global DP refinement, attain near-perfect OOD generalization (>99%) across the full suite, dramatically surpassing prior SOTA message-passing models (Yu et al., 27 Jan 2026).
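The random-scalar positional-encoding remedy can be sketched as follows: sorted uniform samples replace the deterministic positions $i/n$, so relative order is preserved but the model cannot latch onto absolute position values tied to the training size. Function names here are illustrative, not taken from the cited code.

```python
import random

def random_scalar_positions(n, rng=random):
    """Random scalar positional encodings: n uniform samples in [0, 1),
    sorted so that order information survives while the absolute values
    no longer reveal the instance size seen at training time."""
    return sorted(rng.random() for _ in range(n))
```

At test time on larger inputs, the encodings are drawn from the same [0, 1) range as in training, removing one source of distribution shift.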

5. Advanced Learning Paradigms and Multitask Reasoning

The benchmark supports exploration of multi-task, multitree, and memory-augmented learning strategies:

  • Generalist processors: Triplet-GMPNN and related architectures show that hard parameter sharing enables a single processor to stably execute all 30 algorithms with minimal accuracy drop versus specialists, provided appropriate representation and training regimes (Ibarz et al., 2022).
  • Branching-model search (AutoBRANE): Layer-wise learned task branching using gradient-affinity SDP relaxations yields interpretable task hierarchy and +3–4% multitask accuracy gains while reducing resource usage (Li et al., 30 Nov 2025).
  • Open-Book Reasoning: Attention-based retrieval over auxiliary examples at inference significantly boosts accuracy (avg. ~83%) and allows interpretability of inter-task structural affinity (Li et al., 2024).
  • Reinforcement learning and imitation: Casting algorithm execution as a Markov Decision Process (MDP) with GNN-based policies provides solution validity by construction, supports multiple correct outputs, and extends to NP-hard settings (Schutz et al., 23 Sep 2025).
  • Unsupervised/self-supervised training: Contrastive augmentation over equivalence classes and trajectory regularization allows high OOD performance in the absence of intermediate hints, often matching or surpassing hint-supervised baselines (Rodionov et al., 2023).
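The MDP framing can be illustrated on a toy sorting task in which actions are adjacent swaps of out-of-order pairs: because only inversion-removing swaps are legal, every reachable state is a valid permutation and every terminal state is sorted, giving solution validity by construction. This toy setting is illustrative and not the cited paper's exact formulation.

```python
def valid_actions(arr):
    """Legal actions: indices i of adjacent out-of-order pairs
    (arr[i] > arr[i+1]). Each action removes at least one inversion,
    so any policy terminates in a sorted permutation."""
    return [i for i in range(len(arr) - 1) if arr[i] > arr[i + 1]]

def step(arr, i):
    """Apply action i (swap positions i and i+1); done when sorted."""
    assert i in valid_actions(arr), "invalid action"
    nxt = arr[:i] + [arr[i + 1], arr[i]] + arr[i + 2:]
    return nxt, not valid_actions(nxt)

def rollout(arr, policy):
    """Run a policy (a function from state to action) to termination."""
    done = not valid_actions(arr)
    while not done:
        arr, done = step(arr, policy(arr))
    return arr
```

In the neural setting, `policy` would be a GNN scoring the legal actions; the environment's action masking is what supplies validity by construction.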

6. Impact, Extensions, and Analysis

The benchmark has spurred a range of theoretical, empirical, and cross-domain advances:

  • Algorithmic inductive biases: Fine-grained alignment between neural updates and symbolic dynamics (e.g., DP recursion, edge-centric triplets, memory modules) is critical for sample efficiency and OOD robustness (Dudzik et al., 2022).
  • Expressiveness: FloydNet demonstrates that learned DP-style refinement at the all-pairs tensor level attains 3-WL/2-FWL expressiveness, strictly exceeding standard GNNs’ graph distinguishing power (Yu et al., 27 Jan 2026).
  • Data-structure imitation: Augmenting GNNs with task-matched differentiable data structures (PQ, stack, deque) is effective both for algorithmically structured and real-world data (e.g., molecular property benchmarks, where CLRS-pretrained modules yield up to +6% gains) (Wu et al., 24 Oct 2025).
  • Foundation models: Evaluation of LLMs with tool use demonstrates that few-shot prompting with code interpreters can surpass GNN specialists on most CLRS tasks, prompting a reassessment of what constitutes valid generalization (McLeish et al., 2024).

Further open directions include: scalable approximations to global DP-style architectures; multimodal and symbolic–neural hybrids; automated architecture generation and span discovery from algorithmic specifications; and more discriminative OOD tasks that probe beyond textbook algorithmic execution.

7. Resources and Reproducibility

The benchmark is implemented in open-source code (Python, JAX/Haiku) with reproducible data generation, reference algorithms, and full documentation, available at https://github.com/deepmind/clrs (Veličković et al., 2022). Splits, pipelines, and evaluation scripts align with published baselines, facilitating rapid implementation of new architectures, data regimes, and research ideas. The design and community adoption of CLRS-30 position it as the central platform for rigorous, unified investigation in neural algorithmic reasoning research.
