CLRS-Text Benchmark for Algorithmic Reasoning
- CLRS-Text Benchmark is a standardized dataset generator that converts classical algorithm execution traces into structured text, facilitating token-level evaluation.
- It extends the original CLRS benchmark by serializing inputs, execution traces, and outputs for thirty textbook algorithms, supporting both in-distribution and OOD testing.
- Empirical results show that hybrid models like TransNAR improve length generalization and output shape accuracy, addressing LLM limitations in algorithmic reasoning.
CLRS-Text Benchmark is a standardized procedural dataset generator designed to evaluate the algorithmic reasoning capabilities of machine learning models, especially LLMs, by transforming execution traces of classical algorithms into natural-language text format. CLRS-Text extends the CLRS Algorithmic Reasoning Benchmark, which itself consolidates graph-based trajectory datasets for thirty textbook algorithms drawn from Cormen, Leiserson, Rivest & Stein’s “Introduction to Algorithms” (Markeeva et al., 6 Jun 2024). Its primary role is to provide a uniform and extensible framework for benchmarking generalist model performance across algorithm categories, supporting both in-distribution and out-of-distribution (OOD) generalization assessment.
1. Formal Definition and Core Principles
CLRS-Text samples instances of polynomial-time algorithms, producing a textual representation comprising (a) serialized inputs, (b) an internal algorithmic trace prompt, and (c) the final output. Given a task A (where A ranges over the set of 30 supported algorithms), an input size n, and a randomizer r (algorithm-specific hyperparameters), the generator outputs:
- Prompt: Serialized input variables with descriptive names and bracketed values (arrays, matrices, scalars).
- Trace: The evolution of a selected main trace variable through each step.
- Output: Final value or structure (array, scalar, matrix) as a compact bracketed text representation.
All algorithms operate within polynomial time constraints, and CLRS-Text exposes only a single main trace variable per sequence, keeping context length manageable even for long traces. TextConverter routines in Python process raw arrays into bracketed textual forms, ensuring deterministic and reproducible sample formatting amenable to token-level modeling (Markeeva et al., 6 Jun 2024).
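The following minimal sketch illustrates what such a bracketed converter could look like; `to_bracketed` is a hypothetical helper for exposition, not the actual CLRS-Text converter API, and the precise formatting rules (precision, separators) may differ in the released code.

```python
import numpy as np

def to_bracketed(value) -> str:
    """Render a scalar, 1-D array, or 2-D matrix as compact bracketed text.

    Hypothetical stand-in for the CLRS-Text TextConverter routines.
    """
    arr = np.asarray(value)
    if arr.ndim == 0:                       # scalar -> "3.5"
        return format(arr.item(), "g")
    if arr.ndim == 1:                       # array -> "[a b c]"
        return "[" + " ".join(format(x, "g") for x in arr.tolist()) + "]"
    if arr.ndim == 2:                       # matrix -> "[[..] [..]]"
        return "[" + " ".join(to_bracketed(row) for row in arr) + "]"
    raise ValueError("only scalars, vectors, and matrices are supported")

# Deterministic, reproducible textual forms
print(to_bracketed(np.array([0.25, 0.5, 0.75])))   # [0.25 0.5 0.75]
print(to_bracketed(np.eye(2, dtype=int)))          # [[1 0] [0 1]]
```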
2. Data Generation, Trace Serialization, and Construction Pipeline
Data acquisition in CLRS-Text proceeds by dynamically sampling inputs for each algorithm at a specified size and distribution, executing the canonical Python implementation of the textbook pseudocode, and recording the evolution of the main trace variable alongside the final output. The graph-based representations of the original CLRS benchmark are linearized to text, combining:
- Input serialization (“varname: [values],” etc.)
- Execution trace prompts (“initial_trace:”, then “trace | pred:” placeholders for predicted steps)
- Final output targets (“final_output | pred:”)
This schema abstracts the underlying graph topology, but preserves the full information flow by reconstructing node and edge features from textual inputs when required for hybrid models (Bounsi et al., 13 Jun 2024). New algorithms may be registered in the CLRS-Text codebase by specifying input samplers, execution functions, main trace choice, and formatting converter routines, inheriting CLRS’s modular architecture.
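To make the schema concrete, here is a sketch of how a single textual sample might be assembled from the three fields above, reusing the hypothetical `to_bracketed` helper from the earlier sketch. The field labels follow the schema described above, but the exact layout and the insertion sort values are illustrative rather than copied from the released dataset.

```python
def build_sample(task_name, inputs, trace_steps, final_output) -> str:
    """Assemble one textual sample: serialized inputs, trace steps,
    and the final output target, following the schema sketched above."""
    lines = [f"{task_name}:"]
    for name, value in inputs.items():                 # input serialization
        lines.append(f"{name}: {to_bracketed(value)}")
    lines.append(f"initial_trace: {to_bracketed(trace_steps[0])}")
    for step in trace_steps[1:]:                       # predicted trace steps
        lines.append(f"trace | pred: {to_bracketed(step)}")
    lines.append(f"final_output | pred: {to_bracketed(final_output)}")
    return "\n".join(lines)

# Illustrative insertion sort instance of size n = 4
sample = build_sample(
    task_name="insertion_sort",
    inputs={"key": [3, 1, 4, 2]},
    trace_steps=[[3, 1, 4, 2], [1, 3, 4, 2], [1, 3, 4, 2], [1, 2, 3, 4]],
    final_output=[1, 2, 3, 4],
)
print(sample)
```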
3. Catalog of Supported Algorithms
CLRS-Text covers thirty distinct algorithms spanning sorting, selection, divide-and-conquer, greedy strategies, dynamic programming, graph traversal and optimization, string matching, and geometric computations. Each is accompanied by formal problem definition, input/output specifications, and unique trace variable selection. Examples include:
| Algorithm | Category | Main Trace Variable (Textual) |
|---|---|---|
| Insertion Sort | Sorting | Array after each insertion |
| Binary Search | Search | Midpoints probed |
| Matrix Chain Order | Dynamic Programming | DP pointer-matrix after each step |
| Dijkstra | Graph Shortest Paths | Distance array updates |
| Graham Scan | Geometry (Convex Hull) | Stack after each point |
| Naïve Matcher | String Matching | Current shift index |
Each task is sampled across varying input sizes n; traces and outputs are always rendered in deterministic bracketed form, facilitating direct token-level modeling and output verification (Markeeva et al., 6 Jun 2024).
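As an illustration of the "main trace variable" column above, the sketch below records the insertion sort trace (the array after each insertion). It mirrors the canonical textbook pseudocode but is not the benchmark's own implementation.

```python
def insertion_sort_with_trace(key):
    """Run insertion sort and record the array after each insertion,
    i.e. the main trace variable listed in the table above."""
    a = list(key)
    trace = [list(a)]              # initial state of the trace variable
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
        trace.append(list(a))      # array after inserting element i
    return a, trace

output, trace = insertion_sort_with_trace([3, 1, 4, 2])
# trace == [[3, 1, 4, 2], [1, 3, 4, 2], [1, 3, 4, 2], [1, 2, 3, 4]]
```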
4. Evaluation Methodology and Metrics
Benchmark evaluation is performed via next-token cross-entropy loss on both trace prediction and final output, employing token-level models (e.g., Gemma 2B). Core metrics include:
- Exact-match accuracy: Fraction of samples where predicted output array or scalar exactly matches ground truth.
- Shape Score: Binary indicator of output tensor shape correctness.
- Parse Score: Fraction of outputs which parse as valid numeric/list syntax.
- CLRS Score: Element-wise match rate (%) between predicted and real output values, set to zero if shape mismatch occurs.
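A sketch of how these four scores might be computed from bracketed output strings follows; the parsing and scoring details are assumptions for exposition, not the benchmark's reference implementation (in particular, this version flattens outputs and so only checks 1-D shapes).

```python
import numpy as np

def parse_bracketed(text):
    """Parse bracketed text like '[1 2 3]' or '3.5' into a numpy array.
    Returns None if the text is not valid numeric/list syntax."""
    try:
        tokens = text.replace("[", " ").replace("]", " ").split()
        return np.array([float(t) for t in tokens])
    except ValueError:
        return None

def score_sample(pred_text, true_text):
    """Return (parse, shape, clrs, exact) scores for one prediction."""
    pred, true = parse_bracketed(pred_text), parse_bracketed(true_text)
    parse = float(pred is not None)
    if pred is None:
        return parse, 0.0, 0.0, 0.0
    shape = float(pred.shape == true.shape)
    # CLRS Score: element-wise match rate, forced to zero on shape mismatch
    clrs = float(np.mean(pred == true)) if shape else 0.0
    exact = float(shape and np.array_equal(pred, true))
    return parse, shape, clrs, exact

print(score_sample("[1 2 3 4]", "[1 2 3 4]"))   # (1.0, 1.0, 1.0, 1.0)
print(score_sample("[1 2 4 3]", "[1 2 3 4]"))   # (1.0, 1.0, 0.5, 0.0)
```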
Experiments use both zero-shot and two-shot inference protocols; two-shot refers to prompting with exemplars of the same task prior to test queries. Results are stratified by algorithm, size bucket, and positional encoding regime (Bounsi et al., 13 Jun 2024, Markeeva et al., 6 Jun 2024).
5. Empirical Findings and Model Limitations
Fine-tuning LLMs on CLRS-Text yields near-perfect accuracy for in-distribution sizes across most algorithms. Models using Randomized Positional Encodings (RPE) exhibit improved interpolation and mild extrapolation compared to standard learned positional embeddings, aligning with prior graph-based results (Bounsi et al., 13 Jun 2024, Markeeva et al., 6 Jun 2024). However, all variants display marked performance degradation under significant extrapolation (input sizes 2–4× beyond those seen in training), with outputs collapsing to chance level for large n. A plausible implication is that sequential, autoregressive decoding constrains parallel algorithmic reasoning over large structured states.
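The idea behind Randomized Positional Encodings, as relevant here, is that training positions are drawn as a sorted random subset of a much larger index range, so the model already encounters large position indices at training time. The sampling routine below is a minimal sketch of that recipe, not the exact scheme used in the cited experiments.

```python
import numpy as np

def randomized_positions(seq_len, max_len, rng):
    """Sample an ordered subset of position indices from [0, max_len),
    exposing the model to indices far larger than the training length."""
    positions = rng.choice(max_len, size=seq_len, replace=False)
    return np.sort(positions)

rng = np.random.default_rng(0)
# Training sequence of length 8, positions drawn from a range of 2048
print(randomized_positions(8, 2048, rng))   # eight increasing indices in [0, 2048)
```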
TransNAR, a hybrid of a Transformer and a GNN-based Neural Algorithmic Reasoner (NAR), demonstrates significant gains on CLRS-Text, especially for length generalization and output shape correctness. For example, on out-of-distribution Graham Scan instances (size 14), TransNAR achieves a CLRS Score of 0.25–0.30 versus 0.0 for the Transformer baseline, an absolute OOD improvement of up to roughly 30 percentage points on that task. This hybrid architecture remedies common LLM failure modes on algorithmic text: brittleness to unseen lengths and shape-mismatch hallucinations (Bounsi et al., 13 Jun 2024).
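A simplified sketch of the cross-attention pattern behind this kind of hybrid: token embeddings from the language model attend over node embeddings produced by a pretrained NAR, letting textual predictions condition on graph-executed algorithmic state. Dimensions, module layout, and class names below are illustrative assumptions, not the published TransNAR architecture.

```python
import torch
import torch.nn as nn

class TokenToNodeCrossAttention(nn.Module):
    """Token embeddings (queries) attend over NAR node embeddings (keys/values)."""

    def __init__(self, d_token: int, d_node: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=d_token, kdim=d_node, vdim=d_node,
            num_heads=n_heads, batch_first=True)

    def forward(self, token_h: torch.Tensor, node_h: torch.Tensor) -> torch.Tensor:
        # token_h: (batch, n_tokens, d_token); node_h: (batch, n_nodes, d_node)
        attended, _ = self.attn(query=token_h, key=node_h, value=node_h)
        return token_h + attended   # residual update of the token stream

# Illustrative shapes: 32 tokens cross-attending to 8 graph nodes
layer = TokenToNodeCrossAttention(d_token=256, d_node=128)
tokens = torch.randn(2, 32, 256)
nodes = torch.randn(2, 8, 128)
print(layer(tokens, nodes).shape)   # torch.Size([2, 32, 256])
```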
6. Extensibility, Open Challenges, and Research Directions
CLRS-Text is designed for extensibility: additional algorithms require only a registry entry, trace variable selection, and a converter function implementation, leveraging the CLRS modular pipeline (a hypothetical registration sketch follows the list below). Open empirical challenges include:
- Developing LLM architectures that robustly extrapolate in input size n: current models are limited by positional encoding and sequential prediction constraints.
- Embedding explicit index hints and progressive cross-attention from intermediate NAR states to address “index-search” algorithm brittleness.
- Distilling hybrid transformer-GNN knowledge into unimodal LLMs.
- Advancing structure-aware attention, hybrid parallel prediction, and better positional encoding schemes to close the “extrapolation gap.”
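The registration sketch referenced above shows what adding a new task might involve: an input sampler, a runner that produces the main trace variable, and an entry tying them together. All names here (`TASK_REGISTRY`, `register_task`, the sampler and runner functions) are hypothetical and do not correspond to the actual CLRS-Text API.

```python
import numpy as np

TASK_REGISTRY = {}

def register_task(name, sampler, runner):
    """Attach an input sampler and a trace-producing runner to a task name."""
    TASK_REGISTRY[name] = {"sampler": sampler, "runner": runner}

def sample_binary_search_input(n, rng):
    """Input sampler: a sorted integer array of size n and a target drawn from it."""
    keys = np.sort(rng.integers(0, 100, size=n))
    target = rng.choice(keys)
    return {"keys": keys, "target": target}

def run_binary_search(inputs):
    """Runner returning (trace, output); main trace variable: the midpoint probed."""
    keys, target = inputs["keys"], inputs["target"]
    lo, hi, trace = 0, len(keys) - 1, []
    while lo <= hi:
        mid = (lo + hi) // 2
        trace.append(mid)
        if keys[mid] == target:
            return trace, mid
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return trace, -1

register_task("binary_search", sample_binary_search_input, run_binary_search)
```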
Potential applications include standardized comparison of text-executing LLMs, targeted algorithmic fine-tuning, and research on algorithmic generalization, especially for tasks requiring large-scale parallel variable updates (Markeeva et al., 6 Jun 2024, Bounsi et al., 13 Jun 2024).
7. Impact and Integration with Related Benchmarks
CLRS-Text occupies a critical niche at the intersection of graph-centered algorithmic reasoning (CLRS, SALSA-CLRS) and text-based LLM evaluation. By mirroring CLRS-30’s catalog, but transforming execution to compact token representations, CLRS-Text enables direct, rigorous assessment of LLM reasoning over classical algorithmic domains. It provides a procedural, extensible framework suitable both for fair cross-publication comparison and for driving innovation in neural algorithmic reasoning methodologies (Veličković et al., 2022, Minder et al., 2023).
A plausible implication is that benchmarks like CLRS-Text will continue to play a central role in the development and evaluation of generalist reasoning models, as bridging the gap between parallel, graph-based solvers and sequential, text-centric architectures remains a core challenge for the field.