Raptor: Recurrent Approximations in Vision Transformers

Updated 16 June 2026
  • The paper introduces the Block-Recurrent Hypothesis, showing that recurring phase structures can compress deep Vision Transformers into a few recurrent, weight-shared blocks.
  • The methodology employs dynamic programming to partition transformer layers and utilizes teacher-forcing combined with autoregressive loss to align student and teacher activations.
  • Empirical evaluations demonstrate that Raptor recovers up to 98% of baseline accuracy on ImageNet-1k at unchanged computational cost, and offers new tools for dynamical interpretability.

Recurrent Approximations to Phase-structured Transformers (Raptor) compress deep Vision Transformers (ViTs) by exploiting recurring computational phases: a depth-$L$ ViT is represented as a composition of $k \ll L$ parameter-tied blocks, applied recurrently according to a phase structure. Raptor offers a practical realization of the Block-Recurrent Hypothesis (BRH), yielding both efficient surrogates and new tools for dynamical interpretability within ViTs (Jacobs et al., 23 Dec 2025).

1. Block-Recurrent Hypothesis and Motivation

The Block-Recurrent Hypothesis posits that, while standard ViTs use $L$ distinct residual Transformer blocks, trained models frequently exhibit contiguous phases along depth, as evidenced by block-diagonal patterns in representational similarity matrices. These phases are hypothesized to be not only similar in representation but functionally equivalent, such that the computations across a sequence of layers may be reconstructed using a much smaller set of recurrently applied blocks.

The $\epsilon$-BRH formally defines this property: for a pretrained ViT $f$ of depth $L$, there exist $k$ blocks $B_1,\ldots,B_k$ and integers $n_1,\ldots,n_k$ (with $n_1+\ldots+n_k=L$) such that for all images $x$,

$$\left\| f(x) - \left(B_k^{(n_k)} \circ \cdots \circ B_1^{(n_1)}\right)(x) \right\| \le \epsilon,$$

where $B_j^{(n_j)}$ denotes application of block $B_j$ for $n_j$ successive layers using shared parameters. This property mandates recovery of all intermediate activations, not merely final outputs, which distinguishes it from degenerate bottlenecked or single-block solutions.
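The recurrent application of shared blocks can be sketched as follows. This is a minimal illustration, not the paper's implementation: blocks are plain Python callables on lists of floats, and the names `apply_block_recurrent`, `blocks`, `counts`, `halve`, and `shift` are hypothetical. The function returns every intermediate activation, because the hypothesis constrains all layers and not only the output.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def apply_block_recurrent(blocks: Sequence[Block],
                          counts: Sequence[int],
                          h0: Vector) -> List[Vector]:
    """Apply block j for counts[j] successive steps, reusing its parameters.

    Returns the activation after every step (sum(counts) entries), so the
    result can be compared with a teacher layer by layer.
    """
    if len(blocks) != len(counts):
        raise ValueError("need one repeat count per block")
    activations = []
    h = h0
    for block, n in zip(blocks, counts):
        for _ in range(n):
            h = block(h)  # same callable each step, hence shared weights
            activations.append(h)
    return activations


# Toy stand-ins for two phases (assumed, for illustration only).
def halve(h: Vector) -> Vector:
    return [0.5 * v for v in h]


def shift(h: Vector) -> Vector:
    return [v + 1.0 for v in h]


# Depth 5 built from k = 2 blocks with repeat counts (2, 3).
trajectory = apply_block_recurrent([halve, shift], [2, 3], [4.0, 8.0])
```

Here the depth-5 trajectory is produced by only two distinct callables, which is the sense in which $k \ll L$ blocks can stand in for $L$ layers.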

The emergence of such phase structure motivates Raptor: a constructive, recurrent surrogate scheme that distills a pretrained ViT into $k \ll L$ functionally recurrent, weight-shared blocks.

2. Mathematical Formulation and Training Methodology

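Given an input image $x$ with patch encoder output $h_0(x)$, let $h_\ell(x)$ denote the $\ell$-th layer transformer output. The layer depth is partitioned into $k$ contiguous phases of lengths $n_1,\ldots,n_k$ by maximally block-diagonalizing the layer-layer cosine similarity matrix $S \in \mathbb{R}^{L \times L}$ using a max-cut dynamic programming procedure.

The Raptor surrogate introduces $k$ weight-shared blocks $B_1,\ldots,B_k$, each architecturally matching its ViT counterparts. The activation at layer $\ell$ for the student is defined as:

$$\hat{h}_\ell(x) = B_j\big(\hat{h}_{\ell-1}(x)\big),$$

where $\ell$ falls into the $j$-th phase and $\hat{h}_0(x) = h_0(x)$.

The training objective combines two strategies, where $j(\ell)$ is the index of the phase containing layer $\ell$:

  • Teacher-forcing (TF): each block receives the teacher's previous activation as input.

$$\mathcal{L}_{\mathrm{TF}} = \sum_{\ell=1}^{L} \big\| B_{j(\ell)}\big(h_{\ell-1}(x)\big) - h_\ell(x) \big\|^2$$

  • Autoregressive (AR): each block receives the student's own previous activation as input.

$$\mathcal{L}_{\mathrm{AR}} = \sum_{\ell=1}^{L} \big\| \hat{h}_\ell(x) - h_\ell(x) \big\|^2$$

Overall, the loss is:

$$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{TF}} + (1-\lambda)\,\mathcal{L}_{\mathrm{AR}} + \Omega,$$

with $\lambda$ annealed from $1$ toward $0$ in early epochs and $\Omega$ representing regularization (e.g., weight decay).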
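The difference between the teacher-forcing and autoregressive objectives can be made concrete with a small sketch. It rests on assumptions that are not confirmed by the source: a squared Euclidean distance, a convex combination `lam * TF + (1 - lam) * AR`, and no regularization term. The names `raptor_losses`, `phase_of_layer`, and `lam` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (assumed matching criterion)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def raptor_losses(teacher: Sequence[Vector],
                  blocks: Sequence[Block],
                  phase_of_layer: Sequence[int],
                  lam: float) -> Tuple[float, float, float]:
    """Return (TF loss, AR loss, combined loss) for one input.

    teacher[0] is the patch-encoder output and teacher[l] the output of
    layer l. phase_of_layer[l - 1] is the block index used at layer l.
    """
    depth = len(teacher) - 1
    tf_loss = 0.0
    ar_loss = 0.0
    student = teacher[0]  # the student starts from the same encoder output
    for layer in range(1, depth + 1):
        block = blocks[phase_of_layer[layer - 1]]
        # Teacher forcing: the block sees the teacher's previous activation.
        tf_loss += sq_dist(block(teacher[layer - 1]), teacher[layer])
        # Autoregressive: the block sees the student's own previous output.
        student = block(student)
        ar_loss += sq_dist(student, teacher[layer])
    combined = lam * tf_loss + (1.0 - lam) * ar_loss  # regularizer omitted
    return tf_loss, ar_loss, combined


# Toy teacher that doubles a scalar per layer; one imperfect shared block.
teacher_acts = [[1.0], [2.0], [4.0], [8.0]]
tf, ar, total = raptor_losses(teacher_acts, [lambda h: [1.5 * v for v in h]],
                              [0, 0, 0], lam=0.5)
```

In this toy case the per-step error is the same under both objectives at the first layer, but the AR loss grows faster because the student's errors compound along depth, which is why a pure teacher-forced fit does not guarantee a good rollout.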

3. Practical Training Procedure and Phase Discovery

Phase boundaries are identified by maximizing intra-block layer similarity via dynamic programming, which runs in time polynomial in the depth $L$ and returns the optimal contiguous block partition.
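A contiguous partition of this kind can be found with a standard segmentation dynamic program. The sketch below is an illustration under an assumed objective (the sum over segments of within-segment similarity divided by segment length); the exact criterion used by Raptor is not reproduced here, and the names `partition_phases` and `segment_score` are hypothetical.

```python
from typing import List, Sequence, Tuple


def prefix_sums(S: Sequence[Sequence[float]]) -> List[List[float]]:
    """2D prefix sums so any square block of S can be summed in O(1)."""
    n = len(S)
    P = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            P[i + 1][j + 1] = S[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P


def segment_score(P: List[List[float]], a: int, b: int) -> float:
    """Assumed score of layers a..b-1: block sum divided by block length."""
    total = P[b][b] - P[a][b] - P[b][a] + P[a][a]
    return total / (b - a)


def partition_phases(S: Sequence[Sequence[float]], k: int) -> List[Tuple[int, int]]:
    """Split layers 0..L-1 into k contiguous half-open ranges [a, b)."""
    n = len(S)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of layers")
    P = prefix_sums(S)
    neg = float("-inf")
    best = [[neg] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):          # number of segments used so far
        for b in range(c, n + 1):      # end of the c-th segment
            for a in range(c - 1, b):  # start of the c-th segment
                if best[c - 1][a] == neg:
                    continue
                cand = best[c - 1][a] + segment_score(P, a, b)
                if cand > best[c][b]:
                    best[c][b] = cand
                    back[c][b] = a
    bounds = []
    b = n
    for c in range(k, 0, -1):          # walk the back-pointers
        a = back[c][b]
        bounds.append((a, b))
        b = a
    return bounds[::-1]


# Toy similarity: layers 0-1 form one phase, layers 2-5 another.
S = [[1.0 if (i < 2) == (j < 2) else 0.1 for j in range(6)] for i in range(6)]
phases = partition_phases(S, 2)  # -> [(0, 2), (2, 6)]
```

With prefix sums each candidate segment is scored in constant time, so this sketch costs on the order of $kL^2$ operations for $L$ layers and $k$ phases.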

The Raptor training pipeline consists of two stages:

  1. Block-wise Pretraining: Each block is initially trained as a multi-layer student with both TF and AR losses, using AdamW with weight decay, learning-rate warmup followed by cosine decay, batch size 64, and $\lambda$ annealed during the early epochs. Per-group token weights (e.g., separate weights for cls, register, and patch tokens) can be introduced to emphasize specific token groups.
  2. End-to-End Surrogate Assembly: All blocks are composed into a full $L$-layer recurrent model and trained end-to-end with pure AR loss (i.e., $\lambda = 0$) for 20 epochs, keeping the same optimizer and token weights.
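The two schedules mentioned in the pipeline, learning-rate warmup with cosine decay and annealing of the teacher-forcing weight, can be sketched as plain functions. The linear shape of the anneal, the function names `lr_at` and `lam_at`, and all numeric arguments in the example are assumptions chosen for illustration, not values from the paper.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int,
          peak_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def lam_at(epoch: float, anneal_epochs: float) -> float:
    """Teacher-forcing weight: assumed linear anneal from 1 to 0, then 0."""
    if anneal_epochs <= 0:
        return 0.0  # pure autoregressive training, as in the second stage
    return max(0.0, 1.0 - epoch / anneal_epochs)


# Hypothetical settings: 100 steps with 10 warmup steps, 4 annealing epochs.
warm = lr_at(0, 100, 10, peak_lr=1.0)
peak = lr_at(10, 100, 10, peak_lr=1.0)
half_tf = lam_at(2, 4)
```

Setting `anneal_epochs` to zero reproduces the end-to-end stage, where only the autoregressive term remains.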

For evaluation, the backbone can be frozen and shallow probe heads trained on image classification (ImageNet-1k), semantic segmentation (ADE20k), or depth estimation (NYUv2).

Small-scale experiments (CIFAR-100) reveal that increasing the teacher's stochastic depth rate enhances layer similarity, improves Raptor fit, and increases both teacher and student final accuracy. R-squared matching quantifies alignment between student and teacher token embeddings at each layer as a function of this rate.
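An R-squared score between student and teacher embeddings at one layer can be computed as below. This sketch uses the standard coefficient of determination pooled over all tokens and coordinates; whether the paper pools in exactly this way is an assumption, and the name `r_squared` is hypothetical.

```python
from typing import List, Sequence


def r_squared(student: Sequence[Sequence[float]],
              teacher: Sequence[Sequence[float]]) -> float:
    """Coefficient of determination of student vs. teacher embeddings.

    Both arguments hold one embedding (a list of floats) per token for a
    single layer. A value of 1.0 means a perfect match; 0.0 means the
    student does no better than predicting the teacher's mean value.
    """
    flat_t: List[float] = [v for tok in teacher for v in tok]
    flat_s: List[float] = [v for tok in student for v in tok]
    if len(flat_t) != len(flat_s):
        raise ValueError("student and teacher shapes differ")
    mean_t = sum(flat_t) / len(flat_t)
    ss_res = sum((s - t) ** 2 for s, t in zip(flat_s, flat_t))
    ss_tot = sum((t - mean_t) ** 2 for t in flat_t)
    if ss_tot == 0.0:
        raise ValueError("teacher embeddings have zero variance")
    return 1.0 - ss_res / ss_tot


teacher_tokens = [[1.0, 2.0], [3.0, 4.0]]
exact = r_squared(teacher_tokens, teacher_tokens)
mean_only = r_squared([[2.5, 2.5], [2.5, 2.5]], teacher_tokens)
```

Evaluating this score at every layer gives a per-depth fit curve for a given student.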

4. Empirical Evaluation and Causal Analysis

Empirical investigations reveal that off-the-shelf ViTs (e.g., DINOv2, CLIP, plain supervised) consistently exhibit block-diagonal layer similarity matrices. The identified phases correlate strongly with the compressibility achievable via Raptor: random (non-phase-aligned) partitions perform significantly worse than optimal phase-aligned partitions.

Performance Metrics

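| Backbone | Task (metric) | Baseline | Raptor ($k=2$) | Raptor ($k=3$) | Raptor ($k=4$) |
|---|---|---|---|---|---|
| DINOv2-Base | ImageNet-1k (top-1 accuracy) | 84.5% | 81.2% (96% of baseline) | 83.0% (98% of baseline) | 83.2% |
| DINOv2-Base | ADE20k (mIoU) | 47.5 | - | 43.0 | - |
| DINOv2-Base | NYUv2 (RMSE, lower is better) | 0.578 | - | 0.618 | - |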

Using only $k=2$ blocks, Raptor recovers 96% of the top-1 linear probe accuracy of DINOv2-Base on ImageNet-1k at equivalent computational cost; $k=3$ recovers 98%. Cosine alignment per layer remains above 0.7 for every $k$ evaluated.

Causal swapping experiments validate the functional distinctness of phases: replacing a layer with another from the same recurrent block maintains top-1 accuracy, while cross-phase swaps severely degrade performance, confirming that discovered phases correspond to genuinely distinct computations.

Computational cost remains unchanged: a $k$-block Raptor incurs the same FLOPs as an $L$-layer ViT, since block $B_j$ is applied for $n_j$ steps with shared parameters and $n_1+\ldots+n_k=L$.
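The distinction between compute and parameter count can be illustrated with simple arithmetic. The sketch counts only the weight matrices of a standard Transformer block (attention projections and a two-layer MLP with expansion ratio 4), ignoring biases and LayerNorm. The phase lengths `(4, 4, 4)` are a hypothetical example, and the function names are assumptions.

```python
from typing import Dict, Sequence


def block_weight_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Weight-matrix parameters of one block (biases and norms omitted)."""
    attention = 4 * d_model * d_model          # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model    # expansion and contraction
    return attention + mlp


def compare_costs(depth: int, phase_lengths: Sequence[int],
                  d_model: int) -> Dict[str, int]:
    """Contrast a depth-L teacher with a k-block recurrent surrogate."""
    if sum(phase_lengths) != depth:
        raise ValueError("phase lengths must sum to the teacher depth")
    per_block = block_weight_params(d_model)
    return {
        "teacher_block_params": depth * per_block,
        "raptor_block_params": len(phase_lengths) * per_block,
        "teacher_block_applications": depth,
        "raptor_block_applications": sum(phase_lengths),
    }


# A 12-layer, width-768 backbone split into three hypothetical phases.
costs = compare_costs(depth=12, phase_lengths=(4, 4, 4), d_model=768)
```

The number of block applications, and hence the FLOPs, is identical, while the number of distinct block parameters shrinks by the factor $k/L$.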

5. Dynamical Interpretability Insights

Raptor enables detailed dynamical systems analysis of ViT computation by treating the depth index $\ell$ as discrete time. Several interpretability results characterize the evolution of token embeddings:

  • Directional Convergence: Normalized token directions (each embedding divided by its norm) approach class-dependent angular basins, with the corresponding alignment curve displaying S-shaped convergence to 1 over depth. Principal component projections show trajectories for distinct ImageNet classes clustering into tight angular basins.
  • Token-specific Angular Speeds: The angular speed, i.e. the angle between a token's direction at successive layers, is low and stable for register tokens, moderate for patch tokens, and spikes for the cls token in late phases (aggregation). Max-cut phase boundaries correspond to abrupt changes in these dynamics.
  • Self-correction Under Perturbations: Injecting small Gaussian noise at intermediate layers results in log-linear contraction of the deviation from the unperturbed trajectory for patch tokens, but accumulating error for cls tokens in the final phase (consistent with their readout role).
  • Low-rank Collapse and Coherence: The stable rank of angular update matrices drops from approximately 20 to 6 over depth. The coherence of patch tokens increases towards unity, indicating late-layer collective movement and convergence to low-dimensional attractors.
  • Dynamic Mode Decomposition (DMD): For each token group, DMD of normalized group mean trajectories yields spectral modes with eigenvalues just inside the unit circle, suggesting weak contraction and dominant rotational dynamics. The cls modes are closest to +1, reflecting long memory, while patch groups are more contractive and rotational.
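Two of the quantities in the list above, angular speed and stable rank, have simple definitions that can be sketched directly. The code below is an illustration on generic inputs: it does not reproduce how the paper builds its angular update matrices, and the names `angular_speeds` and `stable_rank` are hypothetical. The spectral norm is estimated by power iteration, which assumes the starting vector is not orthogonal to the top singular direction.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def angular_speeds(trajectory: Sequence[Sequence[float]]) -> List[float]:
    """Angle in radians between a token's direction at successive layers."""
    dirs = [normalize(h) for h in trajectory]
    speeds = []
    for a, b in zip(dirs, dirs[1:]):
        cos = sum(x * y for x, y in zip(a, b))
        speeds.append(math.acos(max(-1.0, min(1.0, cos))))  # clamp rounding
    return speeds


def stable_rank(A: Sequence[Sequence[float]], iters: int = 200) -> float:
    """Squared Frobenius norm divided by squared spectral norm."""
    rows, cols = len(A), len(A[0])
    fro2 = sum(x * x for row in A for x in row)
    v = normalize([1.0 / (j + 1) for j in range(cols)])  # deterministic start
    sigma2 = 0.0
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(A[i][j] * Av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        sigma2 = norm  # |A^T A v| tends to the top eigenvalue of A^T A
        v = [x / norm for x in w]
    if sigma2 == 0.0:
        raise ValueError("spectral norm estimate is zero")
    return fro2 / sigma2


speeds = angular_speeds([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
rank_one = stable_rank([[1.0, 2.0], [2.0, 4.0]])
identity = stable_rank([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
```

A stable rank near 1 indicates that updates are dominated by a single direction, which is the sense in which a drop over depth signals low-rank collapse.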

6. Practical Application, Limitations, and Extensions

Raptor can be applied to any pretrained ViT with standard residual Transformer blocks. Practitioners first compute a layer-layer cosine similarity matrix over a small validation set and apply contiguous max-cut dynamic programming, typically yielding a small number of phases ($k \ll L$). If strong block-diagonal structure is present, Raptor is likely to succeed.
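The first step, building the layer-layer cosine similarity matrix, can be sketched as follows. How activations are pooled into one feature vector per layer (for example, concatenating or averaging over validation images and tokens) is an assumption left to the caller, and the name `layer_similarity` is hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def layer_similarity(layer_features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the L x L matrix of cosine similarities between layers.

    layer_features[l] is one flat feature vector summarizing layer l,
    with the same pooling applied to every layer.
    """
    n = len(layer_features)
    return [[cosine(layer_features[i], layer_features[j]) for j in range(n)]
            for i in range(n)]


# Toy features: layers 0 and 1 point the same way, layer 2 is orthogonal.
sim = layer_similarity([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
```

In the toy matrix the first two layers form a high-similarity block that is separated from the third, which is the block-diagonal pattern a practitioner would look for before attempting the distillation.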

Recommended training follows the two-stage TF $\to$ AR distillation with $\lambda$ annealing, the AdamW optimizer with weight decay, learning rate warmup and cosine schedule, batch size 64, and 20–40 epochs. For foundation-scale models, the backbone should be frozen and only probe heads fine-tuned on downstream tasks.

Limitations include a residual accuracy penalty (in the DINOv2-Base results above, roughly 1.3 to 3.3 points of top-1 accuracy on ImageNet-1k, depending on $k$); bridging this gap may require non-autonomous recurrence (e.g., explicit depth encodings) or partial sharing (e.g., block-specific adapters). Application to very deep or irregular-phase networks can be challenging if representation phases are ambiguous or heavily overlapping.

Possible extensions include:

  • Integrating explicit depth encodings, such as a depth-scale MLP, to enable non-autonomous recurrence.
  • Augmenting each block with small, block-specific adapter layers to capture finer inter-phase distinctions.
  • Applying the methodology to language transformers or hybrid vision-language backbones.
  • Utilizing Raptor surrogates for formal verification and for systematic rollout sampling in interpretability studies.

7. Implications and Future Directions

The Raptor framework offers a scalable, mechanistically grounded route for compressing and interpreting ViTs by functionally decomposing depth into genuinely recurrent computation. Empirical validation across standard vision models establishes the presence of robust phase structure and the feasibility of accurate recurrent surrogates. This suggests a path toward dynamical systems-theoretic analyses, algorithmic complexity benchmarking, and systematized interpretability rooted in the identified low-complexity recurrent programs. Further exploration of non-autonomous recurrence and extension to non-visual domains are promising areas for future research (Jacobs et al., 23 Dec 2025).
