Block-Multihead Recurrence in ViTs

Updated 13 April 2026
  • The paper demonstrates that pretrained ViTs exhibit a block-recurrent structure, where a few recurrent blocks approximate deep layers and maintain high accuracy.
  • It formalizes the Block-Recurrent Hypothesis by tying parameters across contiguous phases, effectively reducing representational complexity.
  • Empirical analysis reveals token-specific dynamics and low-rank convergence, highlighting benefits in model compression, interpretability, and inference speed.

Block-multihead recurrence refers to the phenomenon in Vision Transformers (ViTs) whereby the stack of transformer layers in a deep model can be well approximated by a small number of “blocks” applied repeatedly in sequence. Each block is structurally identical to a standard ViT layer, but its parameters are tied across multiple depth steps. The idea is formalized as the Block-Recurrent Hypothesis (BRH) and operationalized through models such as Recurrent Approximations to Phase-structured TransfORmers (Raptor), which show that pretrained ViTs exhibit strong block-recurrent structure and can be described by compact, dynamically interpretable programs with low representational complexity (Jacobs et al., 23 Dec 2025).

1. Block-Recurrent Hypothesis: Formal Definition

The Block-Recurrent Hypothesis (BRH) posits that a pretrained Vision Transformer of depth $L$ with intermediate activations $f_{\ell}(x) \in \mathbb{R}^{T\times d}$ for layers $\ell = 0, \dots, L$ can be accurately rewritten using only $k \ll L$ distinct blocks $\mathcal{B}_1, \dots, \mathcal{B}_k$, each applied recurrently. In this formulation, depth is partitioned into contiguous “phases,” each implemented by one of the recurrent blocks:

$$f_\ell(x) \approx \left(\mathcal{B}_k^{(n_k)} \circ \cdots \circ \mathcal{B}_1^{(n_1)}\right)(x)$$

Here, $\mathcal{B}_j^{(n_j)}$ denotes block $\mathcal{B}_j$ applied $n_j$ times in sequence, with $n_1+\dots+n_k = L$. Each layer thus shares its parameters with the other layers in the same phase.
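The composition above can be illustrated with a minimal sketch. Scalar functions stand in for full ViT layers, and the names `expand_schedule` and `run_program` as well as the two toy blocks are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
from typing import Callable, List, Sequence

# Toy stand-in: a real block would map a T x d activation tensor to another one.
Block = Callable[[float], float]


def expand_schedule(counts: Sequence[int]) -> List[int]:
    """Map each layer 1..L to the index of the block that implements it."""
    schedule: List[int] = []
    for block_index, n in enumerate(counts):
        schedule.extend([block_index] * n)
    return schedule


def run_program(blocks: Sequence[Block], counts: Sequence[int], x: float) -> List[float]:
    """Apply block j exactly n_j times, in order, and keep every intermediate state."""
    if len(blocks) != len(counts):
        raise ValueError("need one repetition count per block")
    activations = [x]  # index 0 holds the input, playing the role of f_0
    for block_index in expand_schedule(counts):
        activations.append(blocks[block_index](activations[-1]))
    return activations


# Two hypothetical blocks standing in for B_1 and B_2, with n_1 = 3 and n_2 = 2.
blocks = [lambda h: h + 1.0, lambda h: 0.5 * h]
trajectory = run_program(blocks, counts=[3, 2], x=0.0)
```

Here `trajectory` has six entries (the input plus one state per layer), so a depth-5 program is described by only two distinct blocks and two repetition counts.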

An equivalent recurrent description for layer indices $\ell = 1, \dots, L$ is:

$$f_\ell(x) \approx \mathcal{B}_{j(\ell)}\big(f_{\ell-1}(x)\big),$$

where $j(\ell)$ denotes the phase that contains layer $\ell$.

The $\varepsilon$-BRH refinement demands that the tied-block program matches all intermediate activations of the original model within a Frobenius-norm tolerance $\varepsilon$.
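A tolerance check of this kind might be sketched as follows. The source does not say whether the tolerance is absolute or relative to the activation norm; this sketch assumes an absolute per-layer bound, and the names `frobenius_distance`, `satisfies_tolerance`, and `eps` are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def frobenius_distance(a: Matrix, b: Matrix) -> float:
    """Frobenius norm of a - b for two matrices of the same shape."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def satisfies_tolerance(teacher: Sequence[Matrix], student: Sequence[Matrix], eps: float) -> bool:
    """True if every intermediate activation matches within eps (assumed absolute)."""
    if len(teacher) != len(student):
        raise ValueError("teacher and student must expose the same number of layers")
    return all(frobenius_distance(t, s) <= eps for t, s in zip(teacher, student))


# Hypothetical activations for two layers of a model with T = 2 tokens and d = 2.
teacher_acts = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
student_acts = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.3], [0.4, 2.0]]]
within_loose = satisfies_tolerance(teacher_acts, student_acts, eps=0.6)
within_tight = satisfies_tolerance(teacher_acts, student_acts, eps=0.4)
```

The point of checking every layer, rather than only the final output, is that the tied program is required to follow the original model's whole trajectory through depth.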

2. Architecture of the Block-Multihead Recurrent Module

Each recurrent block $\mathcal{B}_j$ is architecturally identical to a standard ViT layer. Parameters are independent across blocks, but within a block they are tied and reused across the depth steps of its phase. The canonical forward pass for $\mathcal{B}_j$, with input $X \in \mathbb{R}^{T\times d}$, is:

  1. Layer Normalization: $\tilde{X} = \mathrm{LN}(X)$
  2. Multi-Head Self-Attention (MHSA) with Residual:

$$[Q, K, V] = \tilde{X} W_{QKV}, \qquad \mathrm{head}_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^\top}{\sqrt{d_h}}\right) V_h$$

$$X' = X + \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W_O$$

  3. Layer Normalization and MLP with Residual:

$$\tilde{X}' = \mathrm{LN}(X')$$

$$\mathcal{B}_j(X) = X' + \sigma(\tilde{X}' W_1)\, W_2$$

Here, $d_h$ is the per-head dimensionality, $W_{QKV}$ generates queries, keys, and values for all heads, $W_O$ is the attention output projection, $\sigma$ is the MLP nonlinearity, and $W_1, W_2$ are the MLP weights. The per-block computational cost is identical to that of an untied ViT layer.
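The forward pass can be written out as a dependency-free sketch of a generic pre-norm ViT layer. This is an assumed standard layout (bias-free projections, LayerNorm without learned scale or shift, GELU in the MLP), not the paper's implementation, and the names `vit_block`, `w_qkv`, `w_o`, `w_1`, and `w_2` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def layer_norm(x: Matrix, eps: float = 1e-6) -> Matrix:
    """Normalize each token (row) to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def gelu(v: float) -> float:
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def attention_head(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    d_h = len(q[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_h) for kj in k] for qi in q]
    return matmul([softmax(r) for r in scores], v)


def vit_block(x: Matrix, w_qkv: Matrix, w_o: Matrix, w_1: Matrix, w_2: Matrix,
              num_heads: int) -> Matrix:
    d = len(x[0])
    d_h = d // num_heads
    qkv = matmul(layer_norm(x), w_qkv)  # T x 3d: queries, keys, values side by side
    heads = []
    for h in range(num_heads):
        lo, hi = h * d_h, (h + 1) * d_h
        q = [row[lo:hi] for row in qkv]
        k = [row[d + lo:d + hi] for row in qkv]
        v = [row[2 * d + lo:2 * d + hi] for row in qkv]
        heads.append(attention_head(q, k, v))
    concat = [sum((head[t] for head in heads), []) for t in range(len(x))]
    x = add(x, matmul(concat, w_o))  # attention residual
    hidden = [[gelu(v) for v in row] for row in matmul(layer_norm(x), w_1)]
    return add(x, matmul(hidden, w_2))  # MLP residual


def zeros(rows: int, cols: int) -> Matrix:
    return [[0.0] * cols for _ in range(rows)]


# With all weights at zero, both residual branches vanish and the block is the identity.
tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 3.0, -2.0, 1.0]]
params = dict(w_qkv=zeros(4, 12), w_o=zeros(4, 4), w_1=zeros(4, 8), w_2=zeros(8, 4))
once = vit_block(tokens, num_heads=2, **params)
twice = vit_block(once, num_heads=2, **params)  # the same tied parameters, reused
```

Reusing `params` for the second call is the whole of the tying: the function and its cost per application are those of an ordinary layer, and only the number of distinct parameter sets shrinks.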

3. Empirical Evidence and Phase Structure

Analysis of between-layer representational similarity matrices for standard ViTs reveals that activations are highly similar within contiguous regions, suggesting the presence of a few computational “phases.” Empirically, block-recurrent surrogates constructed via the Raptor formulation effectively leverage this phase structure. For instance, a Raptor model trained using only two blocks ($k = 2$) can recover 96% of the linear probe accuracy of DINOv2-trained ImageNet-1k ViTs, at an equivalent computational budget.
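To make the idea of reading phases off a similarity matrix concrete, here is a toy sketch. The source does not specify the similarity measure or how phases are extracted, so cosine similarity on flattened activations and a greedy threshold split are stand-ins, and `similarity_matrix`, `contiguous_phases`, and the threshold value are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def similarity_matrix(layers: Sequence[Vector]) -> List[List[float]]:
    return [[cosine(a, b) for b in layers] for a in layers]


def contiguous_phases(sim: List[List[float]], threshold: float) -> List[List[int]]:
    """Greedy split: open a new phase when a layer drifts from the phase's first layer."""
    phases = [[0]]
    for layer in range(1, len(sim)):
        if sim[phases[-1][0]][layer] >= threshold:
            phases[-1].append(layer)
        else:
            phases.append([layer])
    return phases


# Hypothetical flattened activations for five layers: three similar, then two similar.
layers = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.2], [0.1, 1.0], [0.0, 1.0]]
phases = contiguous_phases(similarity_matrix(layers), threshold=0.9)
```

On this toy input the split yields two phases, covering layers 0 to 2 and layers 3 to 4, which mirrors the block-diagonal pattern described above.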

Additionally, the presence of stochastic depth during training increases both the expressivity and the emergence of block-recurrent structure, facilitating Raptor surrogates that reliably track the original ViT’s intermediate activations (Jacobs et al., 23 Dec 2025).

4. The Raptor Training Protocol

“Raptor” is the term assigned to the block-tied student approximations of pretrained ViTs. Raptor models are trained in two stages:

  • Teacher-Forcing Stage: The Raptor student is driven to reproduce all intermediate layer activations of the teacher network, with block parameters updated via losses targeting individual layers’ outputs.
  • Phase Discovery: The sequence of blocks and their repetition counts $n_1, \dots, n_k$ are fitted to minimize the discrepancy between Raptor’s outputs and those of the untied teacher, subject to the $\varepsilon$-tolerance constraint.

This protocol yields block-recurrent surrogates whose stepwise activations closely shadow those of the teacher ViT, enabling direct correspondence between interpretive structure in the tied model and the underlying teacher.
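The two stages can be illustrated on a deliberately tiny problem. In this sketch each "block" is a single scalar multiplier, teacher forcing is done by closed-form least squares, and the phase boundary for two blocks is found by exhaustive search. These are stand-ins chosen for brevity: the source does not describe the optimizer or the search procedure, and `fit_tied_scalar` and `best_two_phase_split` are hypothetical names.

```python
from typing import List, Optional, Sequence, Tuple


def fit_tied_scalar(inputs: Sequence[float], targets: Sequence[float]) -> Tuple[float, float]:
    """Least-squares w for target ~ w * input over one phase; returns (w, squared error)."""
    denom = sum(h * h for h in inputs)
    w = sum(h * t for h, t in zip(inputs, targets)) / denom
    err = sum((w * h - t) ** 2 for h, t in zip(inputs, targets))
    return w, err


def best_two_phase_split(teacher: Sequence[float]) -> Optional[Tuple[int, List[float], float]]:
    """Try every boundary n_1; the teacher's state at layer l-1 is the input for layer l."""
    depth = len(teacher) - 1  # teacher[0] is the input, teacher[1..depth] are layer outputs
    best: Optional[Tuple[int, List[float], float]] = None
    for n1 in range(1, depth):  # the first phase covers layers 1..n1
        w1, e1 = fit_tied_scalar(teacher[0:n1], teacher[1:n1 + 1])
        w2, e2 = fit_tied_scalar(teacher[n1:depth], teacher[n1 + 1:depth + 1])
        if best is None or e1 + e2 < best[2]:
            best = (n1, [w1, w2], e1 + e2)
    return best


# Hypothetical teacher: three layers that double the state, then two that halve it.
teacher = [1.0, 2.0, 4.0, 8.0, 4.0, 2.0]
n1, weights, error = best_two_phase_split(teacher)
```

Because each tied block is fitted against the teacher's own intermediate states rather than the student's, the per-layer errors decouple, which is what makes a simple boundary search workable in this toy setting.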

5. Dynamical Interpretability and Trajectory Analysis

Block-multihead recurrence permits the application of dynamical systems analysis to ViTs, providing mechanistic insight into their operation. Observed dynamics include:

  • Directional convergence: Model trajectories cluster into class-dependent angular basins, with robust self-correcting behavior under small perturbations.
  • Token-specific dynamics: The [cls] token exhibits sharp, late-stage reorientations, while patch tokens display strong late coherence toward their mean direction.
  • Rank collapse: Late in depth, model updates collapse to low-rank structure, consistent with convergence toward low-dimensional attractors.

These findings reveal that a compact, recurrent program underlies ViT computation, supporting a low-complexity, principled understanding of depth-wise information processing (Jacobs et al., 23 Dec 2025).
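Two of the diagnostics listed above can be sketched with plain cosine similarities. The exact measures used in the paper are not given in this summary, so `alignment_to_final` (directional convergence of one token across depth) and `mean_direction_coherence` (agreement of patch tokens with their mean direction) are hypothetical helpers operating on made-up trajectories.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def alignment_to_final(trajectory: Sequence[Vector]) -> List[float]:
    """Cosine between a token's state at each depth and its final state."""
    final = trajectory[-1]
    return [cosine(state, final) for state in trajectory]


def mean_direction_coherence(tokens: Sequence[Vector]) -> float:
    """Average cosine between each token and the mean of all tokens at one depth."""
    dim = len(tokens[0])
    mean = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    return sum(cosine(tok, mean) for tok in tokens) / len(tokens)


# Hypothetical [cls] trajectory that rotates toward its final direction.
cls_trajectory = [[1.0, 0.0], [1.0, 0.5], [0.5, 1.0], [0.0, 1.0]]
alignment = alignment_to_final(cls_trajectory)

# Hypothetical patch tokens at an early and a late depth.
early_patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
late_patches = [[1.0, 0.9], [0.9, 1.0], [1.0, 1.0]]
early_coherence = mean_direction_coherence(early_patches)
late_coherence = mean_direction_coherence(late_patches)
```

A curve of `alignment` that stays flat and then rises sharply would correspond to a late reorientation, while a rising coherence value across depth would correspond to patch tokens aligning with their mean direction.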

6. Broader Implications and Connections

Block-multihead recurrence offers a principled account of how transformer architectures use depth and connects naturally to dynamical systems theory. The recurrent formulation arises from architectural and training factors, such as stochastic depth, and yields a compact, interpretable model class that retains most of the original ViT's accuracy. A plausible implication is that block-recurrent surrogates could facilitate inference speedups, model compression, and theoretical analysis of representation dynamics. The framework also provides a basis for further study of phase-wise computation in deep vision models.
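The compression implication can be made concrete with a back-of-the-envelope parameter count. The dimensions below (depth 12, width 768, MLP ratio 4) are assumed ViT-B-style values chosen for illustration, not figures from the paper, and the count ignores biases, LayerNorm, and the embedding and head layers.

```python
def layer_params(d: int, mlp_ratio: int = 4) -> int:
    """Weight-matrix parameters in one ViT layer (no biases, no LayerNorm)."""
    attention = 3 * d * d + d * d   # QKV projection plus output projection
    mlp = 2 * d * (mlp_ratio * d)   # the two MLP weight matrices
    return attention + mlp


depth, width, num_blocks = 12, 768, 2  # assumed ViT-B-style depth and width, k = 2

untied_params = depth * layer_params(width)     # one parameter set per layer
tied_params = num_blocks * layer_params(width)  # one parameter set per block
ratio = untied_params / tied_params
```

Under these assumptions the tied program stores a factor of `depth / num_blocks` fewer layer weights. The number of sequential block applications is unchanged, so this count speaks to storage, not to per-forward-pass compute.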

References (1)
