Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

Published 14 Apr 2026 in cs.LG and cs.CL | (2604.12426v1)

Abstract: We investigate whether transformers use their depth adaptively across tasks of increasing difficulty. Using a controlled multi-hop relational reasoning task based on family stories, where difficulty is determined by the number of relationship hops that must be composed, we monitor (i) how predictions evolve across layers via early readouts (the logit lens) and (ii) how task-relevant information is integrated across tokens via causal patching. For pretrained models, we find some limited evidence for adaptive depth use: some larger models need fewer layers to arrive at plausible answers for easier tasks, and models generally use more layers to integrate information across tokens as chain length increases. For models finetuned on the task, we find clearer and more consistent evidence of adaptive depth use, with the effect being stronger for less constrained finetuning regimes that do not preserve general language modeling abilities.

Summary

  • The paper demonstrates that transformers adaptively allocate network depth based on the complexity of relational reasoning tasks.
  • Using the logit lens and causal patching, the study shows that larger models resolve harder queries at deeper layers, while easier queries become decodable after fewer layers.
  • Finetuning strategies critically impact adaptive depth use, with fully finetuned models excelling in specialized tasks at the expense of general language modeling.

Adaptive Depth Utilization in Transformers: A Detailed Analysis on Multi-Hop Relational Reasoning

Background and Motivation

The effective utilization of network depth in transformers has been a divisive topic. Recent large-scale empirical work (“Do LLMs use their depth efficiently?” [Csordás et al., 2025]) argues that the later layers of transformers contribute minimally to new computation, while theoretical works hold that the expressivity and success of transformers on complex algorithmic and reasoning tasks are predicated on substantial network depth. This study directly interrogates these competing hypotheses through controlled empirical analyses, focusing on whether transformers allocate and exploit depth in accordance with task complexity, a property here termed "adaptive depth use".

To achieve precise measurement and evaluation, the authors employ the CLUTRR benchmark, a synthetic diagnostic suite in which relational reasoning difficulty is systematically governed by the length of the required reasoning chain (i.e., the number of relationship “hops”). Investigations target both open-weight pretrained transformers and models finetuned for this compositional reasoning challenge, with analysis centering on the logit lens and causal patching to probe depth-specific prediction and integration dynamics.

Experimental Framework

Task Specification:

The experimental focus is the CLUTRR family relations task, with difficulty modulated via the number of explicit relationship compositions needed to deduce the correct answer. Importantly, all questions are posed as single-token next-word prediction problems, thus ensuring that all critical computation and reasoning are completed in a single forward pass.

Model Set:

A comprehensive lineup of models is considered: GPT-2, Pythia, Phi, Qwen2/2.5, and LLaMA families, with sizes ranging from 120M to 14B parameters for pretrained experiments, and various GPT-2/Pythia variants for finetuning.

Analytic Methods:

  • Logit Lens: Probes the decodability of correct answers from intermediate hidden states by projecting these states with the model's language modeling head, allowing quantification of correctness and semantic “readiness” at each layer.
  • Causal Patching: Assesses the depth/timestep at which token information is integrated, employing controlled interventions and counterfactual substitutions in the reasoning chain, monitoring at which network stage the model’s prediction becomes contingent on the modified relation.
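The logit-lens readout can be made concrete with a schematic stand-in for a transformer. Everything below (layer count, shapes, random weights) is an illustrative assumption, not the paper's models; the point is only the mechanic of projecting each layer's residual stream through the final LM head and tracking per-layer entropy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained transformer: n_layers residual blocks plus
# an unembedding matrix W_U (the language-modeling head). All shapes and
# weights here are illustrative, not taken from the paper.
d_model, vocab, n_layers = 16, 50, 6
W_U = rng.normal(size=(d_model, vocab))
blocks = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]

def logit_lens(x):
    """Project the residual stream after every block through the final LM head."""
    readouts = []
    for W in blocks:
        x = x + x @ W                      # residual update of one block
        logits = x @ W_U                   # early readout with the shared head
        probs = np.exp(logits - logits.max())
        readouts.append(probs / probs.sum())
    return readouts

x0 = rng.normal(size=d_model)
per_layer = logit_lens(x0)

# Per-layer entropy over the answer distribution: the quantity one would
# track to locate where predictions sharpen and later "diffuse".
entropies = [-(p * np.log(p)).sum() for p in per_layer]
print([round(h, 2) for h in entropies])
```

In the paper's setting the analogous readout is taken from a real pretrained model, and correctness of the single-token answer at each depth replaces this toy's entropy curve.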

Key Empirical Findings

Pretrained Models

Decodability and Early Processing

Across all examined pretrained models, hidden representations become highly decodable (i.e., the correct relational answer can be extracted with high probability) after roughly two-thirds of network depth. The final layers are observed to "diffuse" or redistribute predictions—entropy over the output answer space decreases in middle layers, then increases in later layers. This aligns with recent work suggesting that LLMs undergo multiple processing phases, with initial layers dedicated to feature synthesis and later layers engaged in distribution calibration [Lad et al., 2024; Lv et al., 2024].

Depth Adaptation to Task Difficulty

Larger models manifest a nontrivial degree of adaptive depth use on this task: for easier (shorter hop) queries, plausible predictions are achieved using fewer layers, whereas harder queries see a rightward (later-layer) shift in the rise of correctness and semantic decodability. Smaller models do not show strong gradations in depth usage with problem difficulty, implying model scaling is a factor in unlocking adaptive computation.

Information Integration Across Tokens

Causal patching reveals that for longer reasoning chains, models begin cross-token information mixing at earlier depths, with information about earlier relationships injected into the residual stream sooner. Integration of final relation information at the output token also starts earlier for harder tasks, suggesting a flexible partition of depth budget for information propagation and normalization.
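A minimal activation-patching loop can illustrate this procedure. The toy model below (random mixing weights, a three-position "sequence", a corrupted first token) is a hypothetical sketch of the technique, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pos, d, n_layers = 3, 8, 4
W = [rng.normal(scale=0.3, size=(d, d)) for _ in range(n_layers)]
mix = [rng.normal(scale=0.3, size=(n_pos, n_pos)) for _ in range(n_layers)]
readout = rng.normal(size=d)

def run(x, patch=None):
    """Run the toy stack; patch = (layer, pos, vector) overwrites one position."""
    x = x.copy()
    acts = []
    for i in range(n_layers):
        x = x + np.tanh(mix[i] @ x @ W[i])  # cross-position mixing + nonlinearity
        if patch is not None and patch[0] == i:
            x[patch[1]] = patch[2]
        acts.append(x.copy())
    return x[-1] @ readout, acts            # prediction read at the last position

clean = rng.normal(size=(n_pos, d))
corrupt = clean.copy()
corrupt[0] = corrupt[0] + rng.normal(scale=2.0, size=d)  # swap one "relation"

clean_out, clean_acts = run(clean)
corrupt_out, _ = run(corrupt)

# Patch the clean activation back into the corrupted run at position 0,
# layer by layer; the depth at which recovery falls off marks where that
# token's information has already been routed to other positions.
recovery = []
for L in range(n_layers):
    out, _ = run(corrupt, patch=(L, 0, clean_acts[L][0]))
    recovery.append(float((out - corrupt_out) / (clean_out - corrupt_out)))

# Patching position 0 after the final layer cannot change the
# last-position readout, so the last recovery value is exactly zero.
print(recovery[-1])
```

This is the same logic behind the directional asymmetry reported later: once a relation's information has spread across positions, a single-position patch can no longer revert the answer.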

Finetuned Models

Specialization and Generalization Dynamics

Both LoRA-adapted and fully finetuned models robustly solve the CLUTRR tasks in-domain. However, models finetuned only via LoRA preserve general language modeling capabilities and generalize better to longer reasoning chains, whereas fully finetuned models (where all parameters are updated) lose generic language modeling ability and fail to length-generalize beyond the training distribution.
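The mechanical difference between the two regimes can be sketched as a low-rank correction on a frozen weight, in the spirit of the LoRA parameterization; the rank, scale, and shapes below are illustrative assumptions, not the paper's training code:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, alpha = 16, 2, 8.0           # hidden size, adapter rank, scaling (illustrative)
W = rng.normal(size=(d, d))        # frozen pretrained weight: never updated
A = rng.normal(scale=0.01, size=(r, d))  # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, zero-initialized

def forward(x, use_adapter=True):
    out = x @ W.T
    if use_adapter:
        out = out + (alpha / r) * (x @ A.T @ B.T)  # low-rank correction
    return out

x = rng.normal(size=d)
# Because B starts at zero, the adapted model initially matches the base
# model exactly; full finetuning, by contrast, moves W itself and can
# overwrite the behavior that supports general language modeling.
print(np.allclose(forward(x), forward(x, use_adapter=False)))
```

Only A and B receive gradients in a LoRA run, which is one intuition for why the base model's layerwise behavior, and its language modeling competence, is largely preserved.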

Differentiated Adaptive Depth Use

  • Full Finetuning: Strong, clear adaptive depth use is observed—correctness for harder tasks arises at later layers, and more of the total depth is devoted to information mixing and refinement. The order of relationship integration into the final answer is distinctly structured (first relation integrated last, last relation integrated first).
  • LoRA-Only Finetuning: Depth use is minimally adapted to difficulty—the depth at which answers are decodable does not shift with task hardness, and the overall layer utilization profile is almost indistinguishable from that of the base model.

Cross-token Information Routing and Attention

Analysis of attention and recovery trajectories indicates that fully finetuned models allocate more representational resources for integrating complex relationship types (father/uncle) relative to simple sibling relations. Patching experiments demonstrate directional asymmetry: information for more complex relationships propagates widely, making it harder to “revert” the answer by single-position patching.

Implications and Theoretical Considerations

The study provides concrete evidence for conditional adaptivity in transformer depth use: pretrained large models and, more emphatically, finetuned models can recruit additional layers for harder compositional reasoning problems. However, the constraint of simultaneously preserving general language modeling seems to limit the amount of “free” depth available for such dynamic allocation, especially when only lightweight parameter adaptation (LoRA) is used.

This raises several notable theoretical and practical points:

  • Depth as a computational bottleneck: While average layer usage may appear redundant due to mandatory detokenization and feature engineering functions in early layers, transformers indeed exploit surplus depth for problem-specific computation when trained or adapted appropriately.
  • Finetuning and catastrophic forgetting: Aggressive finetuning proceeds at the expense of generic competence and generalization, a tradeoff intrinsic to current supervised adaptation regimes.
  • Architectural and training-design leverage: Since family models with similar transformer architectures reach semantically meaningful representations at different depths, training data and tuning recipes likely exert a substantial effect on residual stream alignment and, by extension, the division of labor across depth.

Prospects for Future Research

Future developments should examine whether and how adaptive-depth allocation can emerge in more unconstrained, real-world reasoning settings—particularly for tasks not amenable to single-token answer curation. The development and evaluation of architectures enabling dynamic, input-conditional computational depth (e.g., universal/looped transformers [Dehghani et al., 2018; Fan et al., 2024]) is a logical next frontier. Further, understanding and controlling representational alignment across layer depth, perhaps by training objectives or regularization, could enable finer control over the structural stages of inference and information flow.

Conclusion

This paper demonstrates, via precise single-token reasoning tasks and targeted analytical tools, that transformers possess conditional adaptive depth use that scales with task complexity, especially after task-specific finetuning. The findings partially reconcile prior conflicting empirical results and reinvigorate theoretical perspectives positing the crucial role of depth in compositional reasoning. The analysis also foregrounds the sharp constraints imposed by multitask competence retention and raises targeted questions about how architectural choices and training paradigms might further unlock dynamic, problem-adaptive computation in transformer-based models.


Reference: “Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task” (2604.12426)
