
Effective Reasoning Chains Reduce Intrinsic Dimensionality

Published 9 Feb 2026 in cs.CL, cs.AI, and cs.LG (arXiv:2602.09276v1)

Abstract: Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of LLMs on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalization remain poorly understood. While current explanations often point to increased test-time computation or structural guidance, establishing a consistent, quantifiable link between these factors and generalization remains challenging. In this work, we identify intrinsic dimensionality as a quantitative measure for characterizing the effectiveness of reasoning chains. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a given task. By keeping the model architecture fixed and varying the task formulation through different reasoning strategies, we demonstrate that effective reasoning strategies consistently reduce the intrinsic dimensionality of the task. Validating this on GSM8K with Gemma-3 1B and 4B, we observe a strong inverse correlation between the intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data. Our findings suggest that effective reasoning chains facilitate learning by better compressing the task using fewer parameters, offering a new quantitative metric for analyzing reasoning processes.

Summary

  • The paper presents a novel metric quantifying the minimal parameter count (intrinsic dimensionality) needed to cross performance thresholds.
  • It demonstrates a strong inverse correlation between intrinsic dimensionality and generalization accuracy, with a Spearman coefficient of up to 0.93.
  • Using LoRA-based fine-tuning and diverse reasoning strategies, the study provides actionable insights for optimizing LLM training and robustness.

Quantitative Analysis of Reasoning Chains via Intrinsic Dimensionality

Motivation and Problem Statement

The paper "Effective Reasoning Chains Reduce Intrinsic Dimensionality" (2602.09276) addresses a fundamental challenge in evaluating and optimizing reasoning strategies in LLMs: the lack of a principled, quantitative metric for assessing the effectiveness of different reasoning formulations. Previous work largely relied on qualitative hypotheses (e.g., structural guidance, increased test-time computation) or unstable quantitative proxies (e.g., trajectory length, token perplexity), resulting in inconsistent interpretations of generalization capacity and performance. By leveraging information-theoretic tools, particularly the concept of intrinsic dimensionality, the authors propose a robust metric to characterize the learnability and generalization afforded by reasoning chains. This offers actionable insight into data formulation, model alignment, and training optimization.

Definition and Measurement of Intrinsic Dimensionality

Intrinsic dimensionality quantifies the minimum number of trainable parameters required to achieve a specified performance threshold on a given task, under a fixed model architecture. Rather than varying the model architecture or pretraining setup (Li et al., 2018; Aghajanyan et al., 2021), this paper fixes the model and manipulates the nature of supervision through diverse reasoning strategies—altering the outputs provided during training. The measurement employs the LoRA framework (Hu et al., 2022) for parameter-efficient fine-tuning, systematically sweeping LoRA ranks and target modules to determine the minimal parameter configuration that crosses the chosen accuracy threshold. This operationalizes intrinsic dimensionality in a way directly relevant to reasoning strategy effectiveness.
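The sweep described above can be sketched as follows. This is a minimal illustration, not the authors' code: `finetune_and_eval` stands in for an actual LoRA fine-tuning and evaluation run, and the per-module parameter count (two low-rank factors of size d×r each) and Gemma-like dimensions are assumptions for illustration.

```python
# Hypothetical sketch of estimating intrinsic dimensionality:
# find the smallest LoRA parameter budget whose fine-tuned model
# crosses a chosen accuracy threshold on the task.

def lora_param_count(rank: int, n_layers: int, d_model: int) -> int:
    # Each adapted weight matrix gains two low-rank factors, A (d x r)
    # and B (r x d), so roughly 2 * d_model * rank parameters per layer.
    return 2 * d_model * rank * n_layers

def intrinsic_dimensionality(finetune_and_eval, ranks, threshold=0.8,
                             n_layers=26, d_model=1152):
    """Return the smallest swept LoRA parameter count reaching `threshold`."""
    for rank in sorted(ranks):
        accuracy = finetune_and_eval(rank)  # fine-tune at this rank, then eval
        if accuracy >= threshold:
            return lora_param_count(rank, n_layers, d_model)
    return None  # threshold never reached within the swept ranks

# Toy stand-in where accuracy saturates with rank; a real run would
# fine-tune the fixed model on the strategy's training data.
mock_eval = lambda r: min(0.95, 0.5 + 0.1 * r)
dim = intrinsic_dimensionality(mock_eval, ranks=[1, 2, 4, 8, 16])
```

In a real measurement the sweep would also vary which modules receive adapters (attention vs. MLP projections), keeping the base model frozen throughout so that only the supervision format changes across strategies.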

Experimental Design

The empirical analysis is conducted on the GSM8K dataset (Cobbe et al., 2021) and its OOD stress-test variants (Mirzadeh et al., 2025; Shi et al., 2023; Gao et al., 2023), using two sizes of Gemma LLMs (Gemma-3 1B and 4B). The evaluated reasoning strategies span direct answer (NoCoT), filler augmentations, concise and long CoT, code-based (Executed PoT, Simulated PoT), decompositional (Plan and Solve), and structurally varied (Critical CoT, High Review Ratio CoT) formulations. Each strategy yields a distinct training dataset, and models are fine-tuned under LoRA configurations to compute intrinsic dimensionality. Generalization is assessed on in-domain and OOD test splits, and several baselines (trajectory length, token-level perplexity, and full-sequence KL divergence) are compared.

Numerical Results and Comparative Analysis

The principal finding is a strong inverse correlation between intrinsic dimensionality and generalization performance: reasoning strategies that reduce the intrinsic dimensionality of the task consistently achieve higher accuracy on both in-distribution and OOD splits. For Gemma-3 4B, the Spearman correlation between intrinsic dimensionality and overall accuracy is 0.93, substantially outperforming token perplexity (0.82), response length (0.31), and KL divergence (-0.17). These results are robust to threshold selection (across 70–90% accuracy), and are mirrored in the smaller Gemma-3 1B model (correlation 0.75). Notably, the code-based Executed PoT strategy achieves both the lowest intrinsic dimensionality and the strongest OOD accuracy, empirically validating the compressibility and robustness of execution-driven reasoning formulations.
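The rank-correlation statistic behind these comparisons is straightforward to compute. The sketch below implements Spearman's coefficient from scratch (with tie-averaged ranks) on illustrative numbers; the data points are invented to show the inverse relationship, not the paper's actual measurements.

```python
# Spearman rank correlation, as used to relate intrinsic dimensionality
# to accuracy across reasoning strategies. Pure-Python sketch.

def rank(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented data: lower intrinsic dimensionality pairs with higher
# accuracy, yielding a perfectly inverse rank correlation of -1.0.
intrinsic_dims = [1e5, 3e5, 6e5, 9e5, 2e6]
accuracies     = [0.91, 0.85, 0.80, 0.72, 0.60]
rho = spearman(intrinsic_dims, accuracies)  # -1.0 here
```

With real measurements the magnitude of the coefficient (e.g., the reported 0.93 for Gemma-3 4B) is what distinguishes intrinsic dimensionality from weaker predictors such as response length.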

Additional analysis reveals that effective reasoning chains are more efficiently compressed by larger models: Gemma-3 4B requires fewer parameters (lower intrinsic dimensionality) to reach high performance for compressible strategies, while ineffective reasoning chains (e.g., distractor-laden, filler text) exhibit higher intrinsic dimensionality, especially for larger models. This pattern indicates that reasoning quality, not sequence length or dataset size, governs learnability.

Implications and Future Directions

Theoretical implications are grounded in the Minimum Description Length (MDL) principle: effective reasoning chains facilitate better conditional data compression, rendering learning more tractable and generalizable. By offering a quantifiable and distribution-agnostic metric, intrinsic dimensionality enables principled selection and construction of reasoning data, potentially guiding annotation schemes, regularization methods, and adaptive alignment protocols for future LLM development.
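The MDL framing can be made concrete with the standard two-part code (textbook MDL notation, not an equation reproduced from the paper): the total description length of the data D decomposes into the cost of encoding a hypothesis H plus the cost of the data given that hypothesis.

```latex
L(D) \;=\; \min_{H}\Big[\underbrace{L(H)}_{\text{model (adapter) bits}} \;+\; \underbrace{L(D \mid H)}_{\text{data given the model}}\Big]
```

In this reading, a reasoning strategy that lowers intrinsic dimensionality shrinks the L(H) term needed to fit the task, which is the compression view the authors invoke.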

Practically, intrinsic dimensionality measurement requires computational sweeps across LoRA ranks and modules, which may limit immediate deployment but motivates the exploration of surrogate metrics or more efficient approximations. The result that execution-based reasoning leads to lower intrinsic dimensionality and better OOD robustness suggests that future research should focus on hybrid reasoning strategies combining code execution, decomposition, and critical review for maximally compressible supervision.

Further avenues include validating these trends in other post-training settings (e.g., bootstrapped self-improvement, RL-based methods), analyzing coherence across reasoning trajectories, and developing scalable evaluation protocols for large-scale and multi-task settings.

Conclusion

This paper establishes intrinsic dimensionality as a highly predictive quantitative metric for reasoning chain effectiveness in LLMs, outperforming traditional baselines and aligning with information-theoretic generalization principles. The results implicate compressibility—not merely structural guidance or inference-time computation—as the driver of improved performance, and suggest that intrinsic dimensionality should be a core consideration for both empirical and theoretical work in reasoning-oriented AI. Future directions include finding computationally tractable proxies and exploiting this insight for more robust, generalizable, and efficient model training.
