CLRS-Text Benchmark for Algorithmic Reasoning
- CLRS-Text Benchmark is a standardized dataset generator that converts classical algorithm execution traces into structured text, facilitating token-level evaluation.
- It extends the original CLRS benchmark by serializing inputs, execution traces, and outputs for thirty textbook algorithms, supporting both in-distribution and OOD testing.
- Empirical results show that hybrid models like TransNAR improve length generalization and output shape accuracy, addressing LLM limitations in algorithmic reasoning.
CLRS-Text Benchmark is a standardized procedural dataset generator designed to evaluate the algorithmic reasoning capabilities of machine learning models, especially LLMs, by transforming execution traces of classical algorithms into natural-language text format. CLRS-Text extends the CLRS Algorithmic Reasoning Benchmark, which itself consolidates graph-based trajectory datasets for thirty textbook algorithms drawn from Cormen, Leiserson, Rivest & Stein’s “Introduction to Algorithms” (Markeeva et al., 6 Jun 2024). Its primary role is to provide a uniform and extensible framework for benchmarking generalist model performance across algorithm categories, supporting both in-distribution and out-of-distribution (OOD) generalization assessment.
1. Formal Definition and Core Principles
CLRS-Text samples instances of polynomial-time algorithms, producing a textual representation comprising (a) serialized inputs, (b) an internal algorithmic trace prompt, and (c) the final output. Given a task A (where A ranges over the set of 30 supported algorithms), an input size n, and a randomizer r (algorithm-specific hyperparameters), the generator outputs:
- Prompt: Serialized input variables with descriptive names and bracketed values (arrays, matrices, scalars).
- Trace: The evolution of a selected main trace variable through each step.
- Output: Final value or structure (array, scalar, matrix) as a compact bracketed text representation.
All algorithms operate within polynomial time constraints, and CLRS-Text exposes only a single main trace variable per sequence, keeping context length manageable even for long traces. TextConverter routines in Python process raw arrays into bracketed textual forms, ensuring deterministic and reproducible sample formatting amenable to token-level modeling (Markeeva et al., 6 Jun 2024).
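The following minimal sketch illustrates what such a bracketed converter could look like; `to_bracketed` is a hypothetical helper for exposition, not the actual CLRS-Text converter API, and the precise formatting rules (precision, separators) may differ in the released code.

```python
import numpy as np

def to_bracketed(value) -> str:
    """Render a scalar, 1-D array, or 2-D matrix as compact bracketed text.

    Hypothetical stand-in for the CLRS-Text TextConverter routines.
    """
    arr = np.asarray(value)
    if arr.ndim == 0:                       # scalar -> "3.5"
        return format(arr.item(), "g")
    if arr.ndim == 1:                       # array -> "[a b c]"
        return "[" + " ".join(format(x, "g") for x in arr.tolist()) + "]"
    if arr.ndim == 2:                       # matrix -> "[[..] [..]]"
        return "[" + " ".join(to_bracketed(row) for row in arr) + "]"
    raise ValueError("only scalars, vectors, and matrices are supported")

# Deterministic, reproducible textual forms
print(to_bracketed(np.array([0.25, 0.5, 0.75])))   # [0.25 0.5 0.75]
print(to_bracketed(np.eye(2, dtype=int)))          # [[1 0] [0 1]]
```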
2. Data Generation, Trace Serialization, and Construction Pipeline
Data acquisition in CLRS-Text proceeds by dynamically sampling inputs for each algorithm at a specified size and distribution, executing the canonical Python implementation of the textbook pseudocode, and recording the evolution of the main trace variable alongside the final output. The graph-based representations of the original CLRS benchmark are linearized to text, combining:
- Input serialization (“varname: [values],” etc.)
- Execution trace prompts (“initial_trace:”, then “trace | pred:” placeholders for predicted steps)
- Final output targets (“final_output | pred:”)
This schema abstracts the underlying graph topology, but preserves the full information flow by reconstructing node and edge features from textual inputs when required for hybrid models (Bounsi et al., 13 Jun 2024). New algorithms may be registered in the CLRS-Text codebase by specifying input samplers, execution functions, main trace choice, and formatting converter routines, inheriting CLRS’s modular architecture.
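To make the schema concrete, here is a sketch of how a single textual sample might be assembled from the three fields above, reusing the hypothetical `to_bracketed` helper from the earlier sketch. The field labels follow the schema described above, but the exact layout and the insertion sort values are illustrative rather than copied from the released dataset.

```python
def build_sample(task_name, inputs, trace_steps, final_output) -> str:
    """Assemble one textual sample: serialized inputs, trace steps,
    and the final output target, following the schema sketched above."""
    lines = [f"{task_name}:"]
    for name, value in inputs.items():                 # input serialization
        lines.append(f"{name}: {to_bracketed(value)}")
    lines.append(f"initial_trace: {to_bracketed(trace_steps[0])}")
    for step in trace_steps[1:]:                       # predicted trace steps
        lines.append(f"trace | pred: {to_bracketed(step)}")
    lines.append(f"final_output | pred: {to_bracketed(final_output)}")
    return "\n".join(lines)

# Illustrative insertion sort instance of size n = 4
sample = build_sample(
    task_name="insertion_sort",
    inputs={"key": [3, 1, 4, 2]},
    trace_steps=[[3, 1, 4, 2], [1, 3, 4, 2], [1, 3, 4, 2], [1, 2, 3, 4]],
    final_output=[1, 2, 3, 4],
)
print(sample)
```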
3. Catalog of Supported Algorithms
CLRS-Text covers thirty distinct algorithms spanning sorting, selection, divide-and-conquer, greedy strategies, dynamic programming, graph traversal and optimization, string matching, and geometric computations. Each is accompanied by formal problem definition, input/output specifications, and unique trace variable selection. Examples include:
| Algorithm | Category | Main Trace Variable (Textual) |
|---|---|---|
| Insertion Sort | Sorting | Array after each insertion |
| Binary Search | Search | Midpoints probed |
| Matrix Chain Order | Dynamic Programming | DP pointer-matrix after each step |
| Dijkstra | Graph Shortest Paths | Distance array updates |
| Graham Scan | Geometry (Convex Hull) | Stack after each point |
| Naïve Matcher | String Matching | Current shift index |
Each task is sampled across varying input sizes n; traces and outputs are always rendered in deterministic bracketed form, facilitating direct token-level modeling and output verification (Markeeva et al., 6 Jun 2024).
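As an illustration of the "main trace variable" column above, the sketch below records the insertion sort trace (the array after each insertion). It mirrors the canonical textbook pseudocode but is not the benchmark's own implementation.

```python
def insertion_sort_with_trace(key):
    """Run insertion sort and record the array after each insertion,
    i.e. the main trace variable listed in the table above."""
    a = list(key)
    trace = [list(a)]              # initial state of the trace variable
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
        trace.append(list(a))      # array after inserting element i
    return a, trace

output, trace = insertion_sort_with_trace([3, 1, 4, 2])
# trace == [[3, 1, 4, 2], [1, 3, 4, 2], [1, 3, 4, 2], [1, 2, 3, 4]]
```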
4. Evaluation Methodology and Metrics
Benchmark evaluation is performed via next-token cross-entropy loss on both trace prediction and final output, employing token-level models (e.g., Gemma 2B). Core metrics include:
- Exact-match accuracy: Fraction of samples where predicted output array or scalar exactly matches ground truth.
- Shape Score: Binary indicator of output tensor shape correctness.
- Parse Score: Fraction of outputs which parse as valid numeric/list syntax.
- CLRS Score: Element-wise match rate (%) between predicted and real output values, set to zero if shape mismatch occurs.
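A sketch of how these four scores might be computed from bracketed output strings follows; the parsing and scoring details are assumptions for exposition, not the benchmark's reference implementation (in particular, this version flattens outputs and so only checks 1-D shapes).

```python
import numpy as np

def parse_bracketed(text):
    """Parse bracketed text like '[1 2 3]' or '3.5' into a numpy array.
    Returns None if the text is not valid numeric/list syntax."""
    try:
        tokens = text.replace("[", " ").replace("]", " ").split()
        return np.array([float(t) for t in tokens])
    except ValueError:
        return None

def score_sample(pred_text, true_text):
    """Return (parse, shape, clrs, exact) scores for one prediction."""
    pred, true = parse_bracketed(pred_text), parse_bracketed(true_text)
    parse = float(pred is not None)
    if pred is None:
        return parse, 0.0, 0.0, 0.0
    shape = float(pred.shape == true.shape)
    # CLRS Score: element-wise match rate, forced to zero on shape mismatch
    clrs = float(np.mean(pred == true)) if shape else 0.0
    exact = float(shape and np.array_equal(pred, true))
    return parse, shape, clrs, exact

print(score_sample("[1 2 3 4]", "[1 2 3 4]"))   # (1.0, 1.0, 1.0, 1.0)
print(score_sample("[1 2 4 3]", "[1 2 3 4]"))   # (1.0, 1.0, 0.5, 0.0)
```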
Experiments use both zero-shot and two-shot inference protocols; two-shot refers to prompting with exemplars of the same task prior to test queries. Results are stratified by algorithm, size bucket, and positional encoding regime (Bounsi et al., 13 Jun 2024, Markeeva et al., 6 Jun 2024).
5. Empirical Findings and Model Limitations
Fine-tuning LLMs on CLRS-Text yields near-perfect accuracy for in-distribution sizes across most algorithms. Models using Randomized Positional Encodings (RPE) exhibit improved interpolation and mild extrapolation compared to standard learned positional embeddings, aligning with prior graph-based results (Bounsi et al., 13 Jun 2024, Markeeva et al., 6 Jun 2024). However, all variants display marked performance degradation under significant extrapolation (input sizes 2–4× beyond those seen in training), with outputs collapsing to chance level for large n. A plausible implication is that sequential, autoregressive decoding constrains parallel algorithmic reasoning over large structured states.
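The idea behind Randomized Positional Encodings, as relevant here, is that training positions are drawn as a sorted random subset of a much larger index range, so the model already encounters large position indices at training time. The sampling routine below is a minimal sketch of that recipe, not the exact scheme used in the cited experiments.

```python
import numpy as np

def randomized_positions(seq_len, max_len, rng):
    """Sample an ordered subset of position indices from [0, max_len),
    exposing the model to indices far larger than the training length."""
    positions = rng.choice(max_len, size=seq_len, replace=False)
    return np.sort(positions)

rng = np.random.default_rng(0)
# Training sequence of length 8, positions drawn from a range of 2048
print(randomized_positions(8, 2048, rng))   # eight increasing indices in [0, 2048)
```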
TransNAR, a hybrid of a Transformer and a GNN-based Neural Algorithmic Reasoner (NAR), demonstrates significant gains on CLRS-Text, especially for length generalization and output shape correctness. For example, on out-of-distribution Graham Scan instances (size 14), TransNAR achieves a CLRS Score of 0.25–0.30 versus 0.0 for the Transformer baseline, an absolute OOD improvement of up to roughly 30 percentage points on that task. This hybrid architecture remedies common LLM failure modes on algorithmic text: brittleness to unseen lengths and shape-mismatch hallucinations (Bounsi et al., 13 Jun 2024).
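A simplified sketch of the cross-attention pattern behind this kind of hybrid: token embeddings from the language model attend over node embeddings produced by a pretrained NAR, letting textual predictions condition on graph-executed algorithmic state. Dimensions, module layout, and class names below are illustrative assumptions, not the published TransNAR architecture.

```python
import torch
import torch.nn as nn

class TokenToNodeCrossAttention(nn.Module):
    """Token embeddings (queries) attend over NAR node embeddings (keys/values)."""

    def __init__(self, d_token: int, d_node: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=d_token, kdim=d_node, vdim=d_node,
            num_heads=n_heads, batch_first=True)

    def forward(self, token_h: torch.Tensor, node_h: torch.Tensor) -> torch.Tensor:
        # token_h: (batch, n_tokens, d_token); node_h: (batch, n_nodes, d_node)
        attended, _ = self.attn(query=token_h, key=node_h, value=node_h)
        return token_h + attended   # residual update of the token stream

# Illustrative shapes: 32 tokens cross-attending to 8 graph nodes
layer = TokenToNodeCrossAttention(d_token=256, d_node=128)
tokens = torch.randn(2, 32, 256)
nodes = torch.randn(2, 8, 128)
print(layer(tokens, nodes).shape)   # torch.Size([2, 32, 256])
```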
6. Extensibility, Open Challenges, and Research Directions
CLRS-Text is designed for extensibility: additional algorithms require only a registry entry, trace variable selection, and a converter function implementation, leveraging the CLRS modular pipeline (a hypothetical registration sketch follows the list below). Open empirical challenges include:
- Developing LLM architectures that robustly extrapolate in input size n: current models are limited by positional encoding and sequential prediction constraints.
- Embedding explicit index hints and progressive cross-attention from intermediate NAR states to address “index-search” algorithm brittleness.
- Distilling hybrid transformer-GNN knowledge into unimodal LLMs.
- Advancing structure-aware attention, hybrid parallel prediction, and better positional encoding schemes to close the “extrapolation gap.”
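The registration sketch referenced above shows what adding a new task might involve: an input sampler, a runner that produces the main trace variable, and an entry tying them together. All names here (`TASK_REGISTRY`, `register_task`, the sampler and runner functions) are hypothetical and do not correspond to the actual CLRS-Text API.

```python
import numpy as np

TASK_REGISTRY = {}

def register_task(name, sampler, runner):
    """Attach an input sampler and a trace-producing runner to a task name."""
    TASK_REGISTRY[name] = {"sampler": sampler, "runner": runner}

def sample_binary_search_input(n, rng):
    """Input sampler: a sorted integer array of size n and a target drawn from it."""
    keys = np.sort(rng.integers(0, 100, size=n))
    target = rng.choice(keys)
    return {"keys": keys, "target": target}

def run_binary_search(inputs):
    """Runner returning (trace, output); main trace variable: the midpoint probed."""
    keys, target = inputs["keys"], inputs["target"]
    lo, hi, trace = 0, len(keys) - 1, []
    while lo <= hi:
        mid = (lo + hi) // 2
        trace.append(mid)
        if keys[mid] == target:
            return trace, mid
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return trace, -1

register_task("binary_search", sample_binary_search_input, run_binary_search)
```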
Potential applications include standardized comparison of text-executing LLMs, targeted algorithmic fine-tuning, and research on algorithmic generalization, especially for tasks requiring large-scale parallel variable updates (Markeeva et al., 6 Jun 2024, Bounsi et al., 13 Jun 2024).
7. Impact and Integration with Related Benchmarks
CLRS-Text occupies a critical niche at the intersection of graph-centered algorithmic reasoning (CLRS, SALSA-CLRS) and text-based LLM evaluation. By mirroring CLRS-30’s catalog, but transforming execution to compact token representations, CLRS-Text enables direct, rigorous assessment of LLM reasoning over classical algorithmic domains. It provides a procedural, extensible framework suitable both for fair cross-publication comparison and for driving innovation in neural algorithmic reasoning methodologies (Veličković et al., 2022, Minder et al., 2023).
A plausible implication is that benchmarks like CLRS-Text will continue to play a central role in the development and evaluation of generalist reasoning models, as bridging the gap between parallel, graph-based solvers and sequential, text-centric architectures remains a core challenge for the field.