
Raptor: Recurrence in Phase-structured Transformers

Updated 24 December 2025
  • The paper introduces Raptor, a compact surrogate for ViTs that uses recurrent, parameter-tied blocks to closely approximate full teacher activations.
  • It employs phase identification via representational similarity analysis to segment transformer depth into contiguous, functional blocks.
  • Empirical results on benchmarks like ImageNet-1K demonstrate near-equivalence to full ViTs, highlighting practical gains in dynamical interpretability and efficiency.

Recurrent Approximations to Phase-structured TransfORmers (Raptor) are compact dynamical surrogates of pretrained Vision Transformers (ViTs) that instantiate the Block-Recurrent Hypothesis (BRH): the computation across $L$ transformer layers can be closely approximated by the sequential application of only $k \ll L$ distinct, parameter-tied blocks, each repeated for a contiguous depth interval ("phase"). Raptor models are constructed by identifying these phases using representational similarity analysis and training tied-block surrogates to replicate all intermediate activations of the teacher ViT. This approach reveals an inherent low-complexity recurrent structure, provides a principled mechanism for dynamical interpretability, and demonstrates empirical near-equivalence to full-capacity ViTs across a suite of vision benchmarks (Jacobs et al., 23 Dec 2025).

1. Mathematical Formulation of the Block-Recurrent Hypothesis

The BRH provides a rewriting of a depth-$L$ ViT $f_L$:

  • Given input $x$ (patch and cls embeddings) as $a_0(x) \in \mathbb{R}^{T \times d}$, let the standard layerwise computation be $a_\ell(x) = f_\ell(a_{\ell-1}(x))$ for $\ell = 1, \dots, L$.
  • BRH asserts existence of $k \ll L$ tied blocks $B_1, \ldots, B_k$ and repetition counts $n_1, \ldots, n_k$ (with $\sum_j n_j = L$) such that:

$f_L(x) = \underbrace{B_k \circ \cdots \circ B_k}_{n_k\text{ times}} \circ \cdots \circ \underbrace{B_1 \circ \cdots \circ B_1}_{n_1\text{ times}} \bigl(a_0(x)\bigr).$

  • The recurrent procedure is expressed as $h^{(0)} = a_0(x)$, $h^{(t+1)} = B_{j(t)}(h^{(t)})$, where $j(t)$ is the index of the phase containing depth $t+1$, with $h^{(L)}$ closely matching $a_L(x)$.

The formal $\varepsilon$-BRH condition: $\mathbb{E}_{x\sim\mathcal{P}}\left\| a_\ell(x) - (B_k^{n_k} \circ \cdots \circ B_1^{n_1})(a_0(x)) \right\|_F \le \varepsilon$, where $\sum_{j=1}^k n_j = \ell$ and $k \ll \ell$.

This framework motivates the construction of Raptor surrogates as tied-block models that mirror the canonical ViT computation, enabling a compact recurrent approximation.
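To make this rewriting concrete, the following sketch rolls out a toy tied-block surrogate in PyTorch. The block definition, dimensions, and phase schedule are illustrative assumptions rather than the paper's implementation; only the control flow of applying $k$ tied blocks for $n_1, \dots, n_k$ consecutive steps corresponds to the BRH as stated above.

```python
# Minimal sketch (not the paper's code): rolling out k parameter-tied blocks
# according to a phase schedule (n_1, ..., n_k), as stated by the BRH.
import torch
import torch.nn as nn

class TiedBlock(nn.Module):
    """Stand-in for one Raptor block; a real block would mirror a ViT layer
    (attention + MLP). A single residual linear map keeps the sketch short."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.proj(self.norm(h))      # one "layer" of depth

def raptor_rollout(a0: torch.Tensor, blocks, schedule):
    """Apply block j for n_j consecutive steps and return all intermediate states.

    a0:       initial token embeddings a_0(x), shape (T, d)
    blocks:   k tied blocks B_1..B_k
    schedule: repetition counts n_1..n_k with sum(schedule) == L
    """
    states, h = [a0], a0
    for block, n_reps in zip(blocks, schedule):
        for _ in range(n_reps):
            h = block(h)                        # h^(t+1) = B_j(h^(t)) within phase j
            states.append(h)
    return states                               # states[L] approximates a_L(x)

# Toy usage: L = 12 layers approximated by k = 3 tied blocks with phases 4 + 5 + 3.
T, d = 197, 768                                 # token count x embedding dim (illustrative)
trajectory = raptor_rollout(torch.randn(T, d), [TiedBlock(d) for _ in range(3)], [4, 5, 3])
print(len(trajectory) - 1)                      # -> 12 steps of effective depth
```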

2. Phase Identification by Representational Similarity

The phase structure underlying BRH is discovered by layerwise representational similarity analysis:

  • Compute similarity matrix $S \in \mathbb{R}^{L \times L}$, where for layers $i, j$, token matrices $A_i, A_j$ are compared by:

$S_{ij} = \cos(A_i, A_j) = \frac{\langle A_i, A_j \rangle_F}{\|A_i\|_F \|A_j\|_F}$

Alternatives include CKA and SVCCA. Across all studied ViTs (DINOv2, CLIP, etc.), $S$ displays a pronounced block-diagonal structure indicative of a few contiguous phases (see Fig. 1).

  • To segment depth, a contiguous max-cut is solved by dynamic programming:

$g(i, j) = \frac{1}{(j-i+1)^2} \sum_{p=i}^{j} \sum_{q=i}^{j} S_{pq}$

Segments $[b_1, e_1], \ldots, [b_k, e_k]$ are selected to maximize $\sum_t g(b_t, e_t)$ for given $k$, subject to a minimum block length $m$.

  • Backtracking recovers phase boundaries (Fig. 2), which specify the schedule of Raptor block repetition.

This phase-aware segmentation captures regions of functional homogeneity and underpins the recurrent parameter tying in Raptor construction.
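A minimal sketch of this phase-identification pipeline, assuming the teacher's layer activations are stacked into an array of shape (L, T, d). The cosine-similarity matrix follows the formula above, and the dynamic program maximizes the sum of within-segment scores $g(i, j)$ over contiguous segments of minimum length $m$; this is an illustrative reconstruction, not the authors' released code.

```python
# Illustrative reconstruction of phase identification: cosine similarity between
# layerwise activations, then a contiguous max-cut solved by dynamic programming.
import numpy as np

def similarity_matrix(acts: np.ndarray) -> np.ndarray:
    """acts: (L, T, d) stacked layer activations A_1..A_L.
    Returns S with S[i, j] = <A_i, A_j>_F / (||A_i||_F ||A_j||_F)."""
    L = acts.shape[0]
    flat = acts.reshape(L, -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    return (flat @ flat.T) / (norms @ norms.T)

def segment_score(S: np.ndarray, i: int, j: int) -> float:
    """g(i, j): mean similarity within the contiguous block of layers i..j (inclusive)."""
    return S[i:j + 1, i:j + 1].mean()

def contiguous_max_cut(S: np.ndarray, k: int, m: int = 1):
    """Split layers 0..L-1 into k contiguous segments of length >= m,
    maximizing the sum of within-segment scores; boundaries via backtracking."""
    L = S.shape[0]
    best = np.full((k + 1, L + 1), -np.inf)  # best[t, e]: best score covering layers < e with t segments
    back = np.zeros((k + 1, L + 1), dtype=int)
    best[0, 0] = 0.0
    for t in range(1, k + 1):
        for end in range(t * m, L + 1):
            for start in range((t - 1) * m, end - m + 1):
                cand = best[t - 1, start] + segment_score(S, start, end - 1)
                if cand > best[t, end]:
                    best[t, end], back[t, end] = cand, start
    segments, end = [], L                    # backtrack [(b_1, e_1), ..., (b_k, e_k)], 0-indexed inclusive
    for t in range(k, 0, -1):
        start = back[t, end]
        segments.append((start, end - 1))
        end = start
    return segments[::-1]

# Toy usage with random activations standing in for a 12-layer ViT.
S = similarity_matrix(np.random.randn(12, 197, 768))
print(contiguous_max_cut(S, k=3, m=2))       # e.g. [(0, 3), (4, 8), (9, 11)]
```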

3. Training of Raptor Surrogates

Raptor models are trained to replicate the full sequence of teacher ViT activations using $k$ tied blocks obtained from phase analysis:

  • Two-stage hybrid loss:
    • Teacher-forcing (TF): each block $B_j$ is optimized to predict within its depth segment using ground-truth (teacher) inputs.
    • Autoregressive (AR) rollout: blocks are chained for closed-loop prediction of activations.
  • For input $x$, the student's prediction $\hat{a}_\ell(x)$ at layer $\ell$ is supervised by the AR rollout loss:

$\mathcal{L}_{\mathrm{AR},h}(x) = \sum_{\ell=1}^{h} \|\hat{a}_\ell(x) - a_\ell(x)\|_F^2$

  • TF stage loss:

$\mathcal{L}_{\mathrm{TF}}(x) = \sum_{\ell \,\in\, \text{block } j} \|B_j(a_{\ell-1}(x)) - a_\ell(x)\|_F^2$

  • Final optimization combines these, annealing from TF to AR, with regularization:

$\mathcal{L}_{\mathrm{total}}(x) = \lambda\,\mathcal{L}_{\mathrm{TF}}(x) + (1-\lambda)\,\mathcal{L}_{\mathrm{AR},L}(x) + \Omega(\theta)$

  • Training utilizes ImageNet-1K, AdamW optimizer, cosine learning rate decay, token-type weighted losses, and phase splits from contiguous max-cut (Appendix A.1).

This protocol induces strong recurrent structure and ensures maximal fidelity between student and teacher activation trajectories.
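The sketch below illustrates how the two supervision signals and their annealed combination can be written down, assuming a teacher that exposes all intermediate activations as a list and a mapping from layer index to phase. Function names, the annealing variable, and the regularizer weight are placeholders for illustration; only the loss structure mirrors the description above.

```python
# Illustrative sketch of the hybrid teacher-forcing (TF) / autoregressive (AR) objective.
import torch
import torch.nn.functional as F

def tf_loss(blocks, teacher_acts, phase_of_layer):
    """Teacher forcing: block B_j predicts a_l from the *teacher's* a_{l-1}
    for every layer l inside its own phase."""
    loss = 0.0
    for l in range(1, len(teacher_acts)):
        pred = blocks[phase_of_layer[l]](teacher_acts[l - 1])
        loss = loss + F.mse_loss(pred, teacher_acts[l], reduction="sum")
    return loss

def ar_loss(blocks, teacher_acts, phase_of_layer, horizon):
    """Autoregressive rollout: the student consumes its *own* predictions,
    and every intermediate state up to the horizon is supervised."""
    h, loss = teacher_acts[0], 0.0           # start from a_0(x)
    for l in range(1, horizon + 1):
        h = blocks[phase_of_layer[l]](h)     # closed-loop prediction \hat{a}_l(x)
        loss = loss + F.mse_loss(h, teacher_acts[l], reduction="sum")
    return loss

def total_loss(blocks, teacher_acts, phase_of_layer, lam, wd=0.0):
    """lam anneals from 1 (pure TF) toward 0 (pure AR) over training;
    wd * sum ||theta||^2 stands in for the regularizer Omega(theta)."""
    L = len(teacher_acts) - 1
    reg = wd * sum(p.pow(2).sum() for b in blocks for p in b.parameters())
    return lam * tf_loss(blocks, teacher_acts, phase_of_layer) \
         + (1 - lam) * ar_loss(blocks, teacher_acts, phase_of_layer, horizon=L) \
         + reg
```

In the paper's setup, details such as token-type weighting, the AdamW/cosine schedule, and the TF-to-AR annealing would be handled by the surrounding training loop.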

4. Empirical Performance and Ablation Analysis

Raptor surrogates achieve substantial accuracy and trajectory fidelity relative to full ViTs:

| Model | ImageNet-1K Top-1 (%) | % of DINOv2-B |
|---|---|---|
| Raptor $k=2$ | $81.2 \pm 0.2$ | $\approx 96$ |
| Raptor $k=3$ | $83.0 \pm 0.1$ | — |
| Raptor $k=4$ | $83.2 \pm 0.1$ | — |
| DINOv2 ViT-S | $80.9$ | — |
| DINOv2 ViT-B (teacher) | $84.5$ | $100$ |

Metrics for ADE20k (mIoU) and NYUv2 (RMSE) reflect similar trends (Table 1). Cosine similarity per layer remains above $0.7$, evidencing high-fidelity activation matching (Fig. 12). Ablation experiments reveal that AR supervision is necessary for a meaningful fit (TF alone reaches only $\approx 4\%$), while additional architectural and loss refinements incrementally improve accuracy (Table 2).

Causal interventions confirm phase uniqueness: swapping layers within a block preserves performance, while swapping layers across blocks degrades accuracy (Fig. 13).
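A hedged sketch of the swap intervention just described, assuming the teacher's layers are available as an indexable list and that an external `evaluate` callable returns a validation metric for a given layer ordering; both are placeholders, not an interface from the paper.

```python
# Illustrative layer-swap intervention: intra-phase swaps should leave accuracy
# roughly unchanged, while inter-phase swaps should degrade it (cf. Fig. 13).
import copy
import random

def swap_layers(layers, i, j):
    """Return a copy of the layer list with layers i and j exchanged."""
    out = copy.deepcopy(list(layers))
    out[i], out[j] = out[j], out[i]
    return out

def swap_experiment(layers, phases, evaluate, n_trials=10):
    """phases: list of (start, end) layer indices per phase (inclusive, length >= 2 each).
    Returns the mean metric under intra-phase vs. inter-phase swaps."""
    intra, inter = [], []
    for _ in range(n_trials):
        b, e = random.choice(phases)
        i, j = random.sample(range(b, e + 1), 2)        # two layers from the same phase
        intra.append(evaluate(swap_layers(layers, i, j)))
        (b1, e1), (b2, e2) = random.sample(phases, 2)   # one layer from each of two phases
        inter.append(evaluate(swap_layers(layers, random.randint(b1, e1), random.randint(b2, e2))))
    return sum(intra) / n_trials, sum(inter) / n_trials
```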

5. Dynamical Interpretability and Discrete Flow on the Sphere

Depth in ViTs and Raptor is interpreted as a discrete dynamical system on the unit sphere, separating activation norm and direction:

  • Directional convergence into class-dependent angular basins:
    • Alignment $\gamma_\ell = \langle \hat{x}_\ell, \hat{x}_L \rangle$ saturates near $1$ in late layers (Fig. 6).
    • PCA projections demonstrate collapse of trajectories into compact basins per class (Fig. 6).
    • Perturbation experiments: injected error at intermediate layer self-corrects, suggesting weak contraction on the sphere (Fig. 7).
  • Token-specific dynamical signatures and phase transitions:
    • Per-layer angular speed $s_\ell = \arccos\langle \hat{x}_{\ell+1}, \hat{x}_\ell \rangle$ quantifies rotation rate.
    • Register (“reg”) tokens show early stability, patch tokens transition mid-depth, cls tokens reorient sharply only in the late phase (Fig. 8).
    • Across phase boundaries, angular speed resets, demarcating functional transitions.
    • Sensitivity to perturbation depth differs: patch token deviation decays with increasing depth; cls token sensitivity grows toward final phase (Fig. 7B).
  • Collapse to low-rank updates in late layers:
    • The stable rank $r_s(U_\ell)$ of the angular update matrix $U_\ell$ drops steadily; in the late phase, $r_s \approx 6$.
    • Coherence $\kappa_\ell$ rises in the final blocks; DMD eigenvalues for group-averaged states lie near $+1$ on the real axis, and cls tokens maintain the longest memory (Fig. 9).

These findings indicate the emergence of low-dimensional attractors and self-correcting, phase-specific dynamics.
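These diagnostics reduce to simple operations on the normalized activation trajectory. Below is a minimal sketch, assuming a trajectory array of shape (L+1, T, d); the alignment and angular-speed definitions follow the quantities named above, while the construction of $U_\ell$ is approximated here by the difference of unit-normalized states, since its exact form is not reproduced in this summary.

```python
# Minimal sketch of the directional diagnostics: alignment to the final state,
# per-layer angular speed, and stable rank of a directional update matrix.
import numpy as np

def unit(v, eps=1e-12):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def alignment(traj):
    """gamma_l = <x_hat_l, x_hat_L>, averaged over tokens; traj: (L+1, T, d)."""
    xhat = unit(traj)
    return np.einsum("ltd,td->lt", xhat, xhat[-1]).mean(axis=1)

def angular_speed(traj):
    """s_l = arccos <x_hat_{l+1}, x_hat_l>, averaged over tokens."""
    xhat = unit(traj)
    cos = np.clip(np.einsum("ltd,ltd->lt", xhat[1:], xhat[:-1]), -1.0, 1.0)
    return np.arccos(cos).mean(axis=1)

def stable_rank(traj, l):
    """r_s(U_l) = ||U_l||_F^2 / ||U_l||_2^2, with U_l taken here as the
    directional update x_hat_{l+1} - x_hat_l over all tokens (an assumption)."""
    xhat = unit(traj)
    s = np.linalg.svd(xhat[l + 1] - xhat[l], compute_uv=False)
    return float((s ** 2).sum() / (s[0] ** 2))

# Toy usage on a random trajectory standing in for teacher or Raptor activations.
traj = np.random.randn(13, 197, 768)
print(alignment(traj)[-3:], angular_speed(traj)[:3], stable_rank(traj, 10))
```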

6. Complexity Bounds and Dynamical Systems Analysis

Block-recurrence in Raptor and ViTs yields a strongly compressed descriptive and computational complexity:

  • Parameter sharing in $k$ tied blocks exchanges redundancy for iterative reuse.
  • Levin’s time-bounded complexity is explicitly bounded:

$K_{\mathrm{Levin}}(f_L) \le \sum_{j=1}^k DL(\theta(B_j)) + O(k\log L) + \log R(f_L) + O(1)$

where $DL(\cdot)$ is description length and $R(\cdot)$ is runtime (Appendix D).

  • Viewing transformer depth as discrete flow trajectories on the sphere facilitates dynamical systems analysis with methods such as DMD, stability theory, and attractor geometry.
  • Raptor surrogates enable mechanistic interpretability and foundation model verification via compact recurrent “programs” extracted directly from large-scale ViTs.

A plausible implication is that the extraction and training of Raptor architectures may provide an efficient pathway for slicing and interrogating the functional organization of next-generation vision backbones.
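As a back-of-the-envelope illustration of the compression the bound formalizes, the snippet below compares raw parameter counts for $L$ independent layers against $k$ tied blocks of the same width; the layer sizes are generic ViT-B-style numbers used for illustration, not figures taken from the paper.

```python
# Rough, assumed numbers: descriptive cost of L untied ViT-B-style layers
# versus k tied Raptor blocks of the same per-block size.
D = 768                                          # embedding dimension
PARAMS_PER_LAYER = 4 * D * D + 2 * D * (4 * D)   # attention (Q, K, V, O) + MLP with 4x expansion; biases/norms omitted
L, k = 12, 3

untied = L * PARAMS_PER_LAYER
tied = k * PARAMS_PER_LAYER                      # plus O(k log L) bits for the repetition schedule itself
print(f"untied: {untied / 1e6:.1f}M, tied: {tied / 1e6:.1f}M, ratio: {untied / tied:.1f}x")
```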

7. Context, Impact, and Future Directions

Raptor operationalizes the Block-Recurrent Hypothesis for foundation models, demonstrating that the apparent depth and complexity of ViTs can be recast as a small, phase-structured recurrent dynamical system without compromising functional capacity (Jacobs et al., 23 Dec 2025). The empirical results verify that this structure is both reusable and interpretable, laying groundwork for further mechanistic study of transformer neural networks. The dynamical systems viewpoint supports rigorous investigation of stability, attractor formation, and information flow in deep architectures.

A plausible implication is the existence of related block-recurrent or phase-aware surrogates for other transformer architectures across modalities. This suggests future prospects for highly compressive, interpretable, and resilient model distillations in both vision and broader deep learning contexts.
