Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

Published 8 Apr 2026 in cs.CL and cs.CY | (2604.06799v1)

Abstract: Algebraic reasoning remains one of the most informative stress tests for LLMs, yet current benchmarks provide no mechanism for attributing failure to a specific cause. When a model fails an algebraic problem, a single accuracy score cannot reveal whether the expression was too deeply nested, the operator too uncommon, the intermediate state count too high, or the dependency chain too long. Prior work has studied individual failure modes in isolation, but no framework has varied each complexity factor independently under strict experimental control. No prior system has offered automatic generation and verification of problems of increasing complexity to track model progress over time. We introduce a nine-dimension algebraic complexity framework in which each factor is varied independently while all others are held fixed, with problem generation and verification handled by a parametric pipeline requiring no human annotation. Each dimension is grounded in a documented LLM failure mode and captures a structurally distinct aspect of algebraic difficulty, including expression nesting depth, simultaneous intermediate result count, sub-expression complexity, operator hardness, and dependent reasoning chain length. We evaluated seven instruction-tuned models spanning 8B to 235B parameters across all nine dimensions and find that working memory is the dominant scale-invariant bottleneck. Every model collapses between 20 and 30 parallel branches regardless of parameter count, pointing to a hard architectural constraint rather than a solvable capacity limitation. Our analysis further identifies a minimal yet diagnostically sufficient subset of five dimensions that together span the full space of documented algebraic failure modes, providing a complete complexity profile of a model's algebraic reasoning capacity.


Authors (4)

Summary

The paper introduces a multi-dimensional framework that isolates nine algebraic complexity dimensions to diagnose LLM reasoning failures.
It reveals a scale-invariant working memory ceiling and exponential error compounding in sequential reasoning tasks across various models.
The study demonstrates that targeted profiling using a minimal set of five dimensions can guide architectural improvements and benchmark progress.

Diagnosing Algebraic Reasoning Failures in LLMs: A Multi-Dimensional Framework

Motivation and Framework Overview

Algebraic reasoning constitutes a decisive testbed for the cognitive and symbolic abilities of LLMs, yet standard evaluation practices rely on aggregate accuracy scores over heterogeneous problem sets, thereby occluding mechanistic sources of failure. This paper introduces a structural framework that operationalizes algebraic difficulty across nine orthogonal dimensions—syntactic length, tree depth, operator hardness, working memory load, compositional branching, solution ambiguity, counting load, sequential chain length, and numeric magnitude—each directly mapped to documented failure modes in the literature (2604.06799). Unlike prior work focused on single dimensions or uncontrolled blends, the proposed system allows parametric variation of each complexity factor under strict isolation, yielding interpretable accuracy curves and identification of universal architectural bottlenecks.

A fully automated generator creates, verifies, and formats algebraic expressions, ensuring coverage and scalability. The design enables dynamic, extensible benchmarking: increasing the difficulty on any axis requires only a parameter adjustment rather than a corpus redesign, supporting longitudinal evaluation as models mature.
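The "parameter adjustment rather than corpus redesign" idea can be sketched as below. This is a minimal illustration, not the authors' pipeline: the `ProblemSpec` fields, their baseline values, and the `sweep` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProblemSpec:
    # Hypothetical knobs; names and baseline values are illustrative only.
    tree_depth: int = 2
    parallel_branches: int = 2
    ops_per_branch: int = 1
    chain_length: int = 1
    max_operand: int = 9


def sweep(base: ProblemSpec, axis: str, levels: list[int]) -> list[ProblemSpec]:
    """Vary one axis across `levels` while holding every other field at its baseline."""
    return [replace(base, **{axis: level}) for level in levels]


# Raising the difficulty on one axis is a change to `levels`, nothing else.
specs = sweep(ProblemSpec(), "parallel_branches", [5, 10, 20, 30])
```

Because each spec differs from the baseline in exactly one field, any change in accuracy across the sweep can be attributed to that axis alone.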

Core Dimensions and Their Empirical Manifestations

Working Memory (D4)

The study’s strongest finding is that transformer self-attention imposes a scale-invariant working memory ceiling. Across seven instruction-tuned models (8B–235B parameters), all models collapse abruptly at 20–30 parallel branches, regardless of parameter count. This behavior is consistent with theoretical results (e.g., [gong2023working], [markeeva2024clrs]), indicating that transformer architectures lack explicit register mechanisms to sustain concurrent partial results. Neither increased parameter count nor additional training data mitigates this bottleneck.
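One plausible shape for a working-memory probe is a sum of independent branches, each of which must be evaluated and held before the final addition. The summary does not specify the paper's exact problem template, so the function name, operand range, and expression form below are assumptions made for illustration.

```python
import random


def make_parallel_problem(k: int, seed: int = 0) -> tuple[str, int]:
    """Return an expression with k independent branches and its verified answer."""
    rng = random.Random(seed)
    # Each branch is a single small product, so per-branch difficulty stays fixed
    # while the number of partial results to hold grows with k.
    pairs = [(rng.randint(1, 9), rng.randint(1, 9)) for _ in range(k)]
    expression = " + ".join(f"({a} * {b})" for a, b in pairs)
    answer = sum(a * b for a, b in pairs)  # ground truth computed from the parameters
    return expression, answer


expr, answer = make_parallel_problem(k=4)
```

The answer is computed from the sampled operands rather than parsed back from the string, which keeps verification independent of formatting.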

Tree Depth (D2) and Compositional Branching (D5)

The memory demand of tree depth scales exponentially: a depth-$d$ tree requires $2^d$ simultaneous partial results. Larger models tolerate greater depths but ultimately succumb to the same structural constraint. Compositional branching (D5) isolates local sub-expression complexity within branches; model-specific divergences, such as Llama 3 8B failing at only 3 operations per branch, indicate that D4 and D5 are independent axes of difficulty. The contrast between parallel branch count (D4) and per-branch complexity (D5) is sharp, supporting the argument for multidimensional profiling.
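The $2^d$ figure can be checked on a small example. The sketch below assumes a full binary expression tree, in which the number of leaves at depth $d$ is $2^d$; the `build_tree` helper and its choice of operators are illustrative, not taken from the paper.

```python
import random


def build_tree(depth: int, rng: random.Random) -> tuple[str, int, int]:
    """Return (expression, value, leaf_count) for a full binary tree of the given depth."""
    if depth == 0:
        leaf = rng.randint(1, 9)
        return str(leaf), leaf, 1
    left_expr, left_val, left_leaves = build_tree(depth - 1, rng)
    right_expr, right_val, right_leaves = build_tree(depth - 1, rng)
    op = rng.choice("+-")
    value = left_val + right_val if op == "+" else left_val - right_val
    return f"({left_expr} {op} {right_expr})", value, left_leaves + right_leaves


expr, value, leaves = build_tree(3, random.Random(0))
```

At depth 3 the tree has 8 leaves, and each additional level doubles that count, which is why modest increases in depth translate into large increases in the number of values to track.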

Sequential Chain Length (D8) and Counting Load (D7)

Sequential chain length induces error compounding; accuracy degrades multiplicatively with each dependent reasoning step. GPT-4o Mini displays a step-function collapse at chain length 5, and all models fail completely by step 12. Counting tasks (D7) reveal the widest model divergence: Claude 3.5 Haiku sustains perfect accuracy ($K = 300$) by recognizing repeated addition patterns and switching strategies, while Llama 3 8B fails at $K = 25$ due to tokenization-induced cardinality tracking limits ([zhang2024tokenization], [chang2024]). Crucially, performance on counting problems is orthogonal to structural reasoning measures; a model’s profile on D2, D4, D5, and D8 cannot predict counting ability.
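Multiplicative degradation has a simple idealized form: if every step succeeds independently with the same probability $p$, a chain of $n$ dependent steps succeeds with probability $p^n$. The sketch below uses that assumption; the per-step value of 0.9 is an arbitrary illustration and not a measurement from the paper.

```python
def chain_accuracy(per_step_accuracy: float, chain_length: int) -> float:
    """Expected accuracy of a chain whose steps succeed independently."""
    return per_step_accuracy ** chain_length


# Illustrative per-step accuracy only; real models need not have a constant p.
for n in (1, 5, 12):
    print(n, round(chain_accuracy(0.9, n), 3))
```

This smooth decay is a baseline for comparison. A step-function collapse such as the one reported for GPT-4o Mini departs from it, which suggests a threshold effect rather than uniform per-step error.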

Secondary Dimensions and Interaction Effects

Syntactic length (D1) and numeric magnitude (D9) are shown to be proxy or amplifying variables rather than independent predictors. Failures at extreme expression lengths invariably reflect underlying tree depth or sequential chain limits, and magnitude difficulties only manifest under operators with quadratic digit interaction (e.g., multiplication, D3 × D9 interaction). Solution ambiguity (D6), measured as the number of valid solution paths, is subsumed by operator familiarity and pattern recognition, both adequately indexed by operator hardness (D3).
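The "quadratic digit interaction" of multiplication can be made concrete with schoolbook arithmetic: adding two numbers touches each digit column once, while multiplying them pairs every digit of one operand with every digit of the other. The function below is an illustrative count under that schoolbook model (carries are ignored) and is not part of the paper's framework.

```python
def digit_operations(a: int, b: int) -> dict[str, int]:
    """Count single-digit operations for schoolbook addition and multiplication."""
    digits_a, digits_b = len(str(abs(a))), len(str(abs(b)))
    return {
        "addition": max(digits_a, digits_b),    # one column sum per digit position
        "multiplication": digits_a * digits_b,  # every digit pair is multiplied once
    }


small = digit_operations(12, 34)
large = digit_operations(123456, 654321)
```

Going from 2-digit to 6-digit operands triples the addition work but multiplies the multiplication work by nine, which is consistent with magnitude mattering mainly under multiplication-like operators.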

The framework acknowledges several interpretable correlations; for example, D1 and D8 correlate at extreme lengths, and D9 amplifies D3-centric failures. Despite this, the independence of key axes is empirically reinforced through divergent model-specific failure curves, notably between D4 and D5.

Diagnostic Sufficiency and Minimal Profiling Set

Through comprehensive evaluation, the authors demonstrate that five dimensions jointly span all documented algebraic failure modes: tree depth (D2), working memory (D4), compositional branching (D5), counting load (D7), and sequential chain length (D8). This minimal set, which requires only 250 problems for full profiling, is diagnostically sufficient. The remaining dimensions are derivable as interaction terms or proxies, except operator hardness (D3), whose catastrophic floor lies above the suite ceiling and would require extension into calculus-level tasks for complete coverage.

Implications and Future Directions

The architectural universality of working memory and sequential reasoning limits has pronounced implications: transformer-based LLMs, irrespective of scale or training data, will remain structurally incapable of exceeding fixed memory ceilings without fundamental changes. This points toward memory-augmented transformer designs, explicit register modules, or alternative paradigms for symbolic computation. The orthogonality of counting failures, rooted in tokenization, marks an unresolved challenge in representation learning.

The dynamism and granularity of the framework enable robust tracking of model progress and architectural improvements. Longitudinal benchmarking can inform targeted fine-tuning and hybrid system design. Furthermore, the planned comparison with human cognitive failure patterns may yield insights into commonality and divergence between biological and artificial reasoning, guiding both cognitive modeling and AI robustness research.

Conclusion

The nine-dimensional algebraic complexity framework provides a rigorous, diagnostic approach to understanding and benchmarking LLM reasoning failures. The scale-invariant constraint on working memory capacity, the severe compounding of sequential reasoning errors, and the independence of counting failures underscore that aggregate accuracy metrics are insufficient for evaluating mathematical cognition in LLMs. A minimal subset of five complexity axes offers a practical, interpretable structure for regular model evaluation and development. The benchmark is fully generative and extensible, ensuring continued relevance in the face of rapid model evolution. The findings delineate the boundaries of current architectures and lay the groundwork for both practical improvements and theoretical inquiry in LLM algebraic reasoning (2604.06799).
