AlgBench: Algorithm Reasoning Benchmark
- AlgBench is an algorithm-centric benchmark that isolates true algorithmic reasoning in LRMs, moving beyond memorization to assess structural and optimization skills.
- It features a contamination-free dataset with over 3,000 original problems spanning 27 fundamental algorithms, organized by structural category and difficulty tier.
- Empirical results highlight high accuracy on non-optimized tasks while exposing deficiencies in global-optimized and heuristic-optimized reasoning, including issues like strategic over-shifts.
AlgBench is an expert-designed benchmark for evaluating large reasoning models (LRMs) under an algorithm-centric paradigm. It is motivated by the limitations of existing reasoning benchmarks—which tend to focus on problem-solution memorization or generic code-generation—and seeks to isolate and assess models’ true mastery of classical algorithmic principles. By constructing over 3,000 original problems spanning 27 fundamental algorithms, AlgBench provides a contamination-free platform with a comprehensive taxonomy and stratified difficulty, facilitating rigorous measurement of both structural and optimization-oriented algorithmic reasoning across state-of-the-art LRMs (Sun et al., 8 Jan 2026).
1. Conceptual Foundation and Motivation
The primary objective of AlgBench is to answer whether LRMs genuinely internalize algorithmic reasoning or merely memorize data distributions of problem-solution pairs. Prevailing benchmarks such as MATH500, AIME, and LiveCodeBench advance verification of mathematical and code-generation ability but fail to probe procedural and abstract algorithm-centric reasoning. AlgBench addresses this gap by:
- Transitioning from a problem-centric to an algorithm-centric evaluation, isolating individual algorithms such as prefix sums, dynamic programming variants, Dijkstra’s, and A* search.
- Ensuring contamination-free evaluation through ACM-level expert authorship, with no overlap with public competitive-programming datasets (e.g., Codeforces, LeetCode).
- Taxonomizing algorithms to expose structural and optimization-related generalization limits.
This design facilitates the detection of both strengths and fundamental bottlenecks in recent LRMs' algorithmic reasoning (Sun et al., 8 Jan 2026).
2. Taxonomy of Algorithmic Categories
AlgBench introduces a dual-axis taxonomy reflecting both structural properties and optimization strategies:
- Euclidean-structured: Algorithms defined over 1D, indexable structures (e.g., arrays), leveraging continuous indices or intervals. Examples: Difference Array (ID), Prefix Sum (PS), Binary Search (BS).
- Non-Euclidean-structured: Algorithms on topologies where connectivity supersedes geometric distance, such as trees or graphs. Examples: Tree Diameter (TDG), Bipartite Matching (BGM), Network Flow (NF).
- Non-optimized: Procedures lacking complexity-reducing optimizations; includes brute-force search and canonical traversals (BFS, DFS).
- Local-optimized: Strategies that rely on myopic or local decisions, usually via greedy or relaxation steps. Examples: Greedy algorithms, SSSP (Dijkstra, Bellman–Ford), Minimum Spanning Tree (Kruskal, Prim), Difference Constraints.
- Global-optimized: Dynamic programming algorithms that maintain state tables for optimal global solutions across overlapping subproblems. Examples include Linear DP (LDP), Tree DP (TDP), Bitmask DP (BLDP), Multi-source Shortest Path (Floyd–Warshall), LCA preprocessing.
- Heuristic-optimized: Search methods leveraging admissible heuristic functions to combine actual and anticipated cost. Examples: A* (AS), Iterative Deepening A* (IDAS) (Sun et al., 8 Jan 2026).
Each problem is designed to require exactly one algorithmic strategy, preventing conflation of methodologically distinct skills.
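The heuristic-optimized category can be illustrated with a minimal A* sketch. The grid, unit move costs, and Manhattan-distance heuristic below are illustrative assumptions, not problems drawn from the benchmark; the Manhattan heuristic never overestimates the true remaining cost on a 4-connected grid, so it is admissible and A* returns an optimal path length.

```python
import heapq

def a_star(grid, start, goal):
    """A* on a 4-connected grid; '#' cells are walls, each move costs 1.
    Priority is f(n) = g(n) + h(n) with an admissible Manhattan heuristic,
    so the returned cost is optimal."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_heap = [(h(start), 0, start)]      # entries are (f, g, node)
    best_g = {start: 0}
    while open_heap:
        f, g, node = heapq.heappop(open_heap)
        if node == goal:
            return g                        # optimal path cost
        if g > best_g.get(node, float("inf")):
            continue                        # stale heap entry, skip
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(open_heap, (ng + h((nr, nc)), ng, (nr, nc)))
    return None                             # goal unreachable

grid = ["....",
        ".##.",
        "...."]
print(a_star(grid, (0, 0), (2, 3)))  # → 5
```

Replacing the heuristic with `h = lambda p: 0` degrades this to Dijkstra's algorithm on the same grid, which is one way to see why the taxonomy treats heuristic-optimized search as a distinct skill from local-optimized relaxation.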
3. Dataset Construction and Structure
AlgBench’s dataset comprises 3,000+ original problems, manually authored and stratified as follows:
- Algorithm Isolation: Each instance is explicitly solvable by a single algorithm, eliminating ambiguity.
- Difficulty Tiers: For each algorithm, problems are partitioned into easy, medium, and hard, reflecting time complexity, space complexity, and state-space difficulty.
- Contamination Prevention: Problems are novel, with no analogues in public online judges.
- Prompt Standardization: All inputs employ LaTeX-formatted mathematical statements for clarity and rigor, e.g., explicit DP recurrences such as $dp[i] = \min_{j < i}\,(dp[j] + c(j, i))$, or heuristic constraints for A* such as $f(n) = g(n) + h(n)$ with admissibility $h(n) \le h^{*}(n)$.
This ensures that each item explicitly encodes the corresponding algebraic or combinatorial structure (Sun et al., 8 Jan 2026).
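As an illustration of the kind of explicit recurrence such prompts encode, consider the generic linear-DP recurrence $dp[i] = \max(dp[i-1],\, dp[i-2] + a[i])$ for the best sum over non-adjacent elements; the problem and values below are a standard textbook example, not an item from the benchmark.

```python
def max_nonadjacent_sum(a):
    """Linear DP realizing dp[i] = max(dp[i-1], dp[i-2] + a[i]):
    the maximum sum over subsets of a with no two adjacent picks."""
    prev2, prev1 = 0, 0              # dp[i-2] and dp[i-1], rolled to O(1) space
    for x in a:
        prev2, prev1 = prev1, max(prev1, prev2 + x)
    return prev1

print(max_nonadjacent_sum([3, 2, 5, 10, 7]))  # → 15 (picks 3, 5, 7)
```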
4. Evaluation Protocol and Metrics
The primary metric is Pass@1 accuracy: a model’s single highest-probability response is correct if it matches the ground truth. Evaluation incorporates:
- Difficulty Normalization: Z-score normalization per model and task, $z = (x - \mu)/\sigma$, rescaled to $[0, 1]$ for performance comparison.
- Model Pool: Benchmarking covers 25 major models, including Gemini-3-Pro, DeepSeek-v3.2-Speciale, GPT-o3, Qwen3-235B, and others.
- Prompting Constraints: Models are instructed to use specific algorithms and code execution is disabled to focus evaluation on reasoning, not tool integration (Sun et al., 8 Jan 2026).
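A minimal sketch of the scoring pipeline described above, under the assumption that the rescaling is min-max to $[0, 1]$ after z-scoring; the sample scores are invented for illustration.

```python
import statistics

def pass_at_1(responses, ground_truth):
    """Pass@1: fraction of tasks whose single response exactly matches
    the ground-truth answer."""
    return sum(r == g for r, g in zip(responses, ground_truth)) / len(ground_truth)

def normalize(scores):
    """Z-score the raw scores, then min-max rescale to [0, 1] so the
    best score maps to 1.0 and the worst to 0.0."""
    mu, sigma = statistics.mean(scores), statistics.pstdev(scores)
    z = [(s - mu) / sigma for s in scores]
    lo, hi = min(z), max(z)
    return [(v - lo) / (hi - lo) for v in z]

raw = [0.92, 0.69, 0.49]   # hypothetical per-category Pass@1 accuracies
print(normalize(raw))      # best category → 1.0, worst → 0.0
```

Because z-scoring and min-max rescaling are both affine, the final $[0,1]$ values preserve the relative spacing of the raw accuracies; the z-score step matters only when comparing across tasks with different means and variances.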
5. Empirical Findings
Performance across six categories for frontier models is summarized below:
| Category | DeepSeek-v3.2-S | Gemini-3-Pro | GPT-o3 |
|---|---|---|---|
| Euclidean-structured | 0.88 | 0.83 | 0.75 |
| Non-optimized | 0.92 | 0.89 | 0.90 |
| Local-optimized | 0.69 | 0.63 | 0.68 |
| Non-Euclidean | 0.70 | 0.70 | 0.69 |
| Global-optimized | 0.49 | 0.43 | 0.45 |
| Heuristic-optimized | 0.49 | 0.30 | 0.39 |
Key observations:
- LRMs achieve high accuracy on non-optimized and Euclidean-structured tasks, but performance drops to roughly 63–70% on non-Euclidean and local-optimized tasks.
- Notably, accuracy on global-optimized (dynamic programming) and heuristic-optimized tasks is under 50%, indicating a persistent deficiency in global state reasoning (Sun et al., 8 Jan 2026).
6. Error Analysis: Strategic Over-Shifts
A salient error mode identified is the “strategic over-shift.” Models initially construct correct algorithmic solutions, particularly in DP tasks, but then abandon these strategies prematurely—typically when required to emit low-entropy (deterministic, necessary) tokens such as closing brackets or numerical constants. This phenomenon is characterized quantitatively:
- The token entropy at position $t$, $H_t = -\sum_{v} p_t(v) \log p_t(v)$, is analyzed to reveal that indispensable low-entropy tokens coincide with abrupt strategy changes.
- Strategic over-shifts are interpreted as a byproduct of reinforcement learning fine-tuning regimes that penalize low-entropy output uniformly, inadvertently discouraging models from completing valid procedural constructs (Sun et al., 8 Jan 2026).
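The per-position entropy diagnostic can be sketched as follows. The toy distributions are illustrative; an actual analysis would use the model's real next-token distributions at each position of a generated solution.

```python
import math

def token_entropy(probs):
    """Shannon entropy H_t = -sum_v p_t(v) * log p_t(v) of a single
    next-token distribution, in nats (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A near-deterministic position (e.g. a forced closing bracket in a DP
# table expression) has near-zero entropy; an open strategic choice
# spreads mass over many tokens and has high entropy.
forced = [0.999, 0.0005, 0.0005]
open_choice = [0.25, 0.25, 0.25, 0.25]
print(token_entropy(forced) < 0.01)   # True: forced token, low entropy
print(token_entropy(open_choice))     # ln 4 ≈ 1.386: genuine choice point
```

Under the paper's interpretation, an RL objective that uniformly penalizes low-entropy output would push the model away from exactly the `forced`-style positions that completing a valid construct requires.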
7. Implications and Future Directions
AlgBench’s outcomes have several immediate implications for LRM development:
- Limitations of Problem-Centric RL: Current fine-tuning paradigms focus on surface-level solution correctness rather than the procedural or structural understanding of algorithms. This leads to failure modes such as strategic over-shifts and limited scalability to challenging algorithm classes.
- Algorithm-Centric Training: Explicit algorithm-centric training is recommended, emphasizing reasoning about the conditions and proof obligations of classical algorithms (e.g., optimal substructure in DP, admissibility in heuristic search).
- Agentic or Tool-Guided Reasoning: Augmenting LRMs with external or auxiliary agentic frameworks that can support completion through low-entropy syntactic steps.
- Benchmark Expansion: Proposals include adding algorithms such as segment trees or Aho–Corasick, quantifying category-wise difficulty, and scaling to support robust pre-training or fine-tuning regimes (Sun et al., 8 Jan 2026).
A plausible implication is that overcoming strategic over-shifting and closing the global-optimized gap will require new training signals that distinguish indispensable low-entropy tokens from high-entropy, less-structural outputs.
8. Relationship to Related Benchmarks
AlgBench’s algorithmic focus situates it uniquely among active learning and symbolic manipulation benchmarks. While ALdataset and ALBench address data selection strategies and object detection, respectively, and ASyMOB targets symbolic math manipulation, none probe the core procedural abstractions in algorithmic reasoning. BenchNGS, focused on NGS alignment benchmarking, raises analogous issues of structural reproducibility but in the context of sequence mapping rather than generic algorithmic reasoning (Zhan et al., 2020, Feng et al., 2022, Shalyt et al., 28 May 2025, Rahman et al., 2015).
By providing a structured, contamination-free, and algorithm-centric test bed, AlgBench sets a foundation for quantitatively characterizing progress in LRM algorithmic reasoning and systematically illuminating where further innovations in architecture, training protocols, and agentic augmentation are most needed (Sun et al., 8 Jan 2026).