Raptor: Recurrent Approximations in Vision Transformers
- The paper introduces the Block-Recurrent Hypothesis, showing that recurring phase structures can compress deep Vision Transformers into a few recurrent, weight-shared blocks.
- The methodology employs dynamic programming to partition transformer layers and utilizes teacher-forcing combined with autoregressive loss to align student and teacher activations.
- Empirical evaluations demonstrate that Raptor recovers up to 98% of baseline accuracy on ImageNet-1k while maintaining computational cost and offering new tools for dynamical interpretability.
Recurrent Approximations to Phase-structured Transformers (Raptor) provide a mechanism for compressing deep Vision Transformers (ViTs) by exploiting recurring computational phases, enabling the representation of a depth-$L$ ViT as a composition of $k \ll L$ parameter-tied blocks recurrently applied according to a phase structure. Raptor offers a practical realization of the Block-Recurrent Hypothesis (BRH), yielding both efficient surrogates and new tools for dynamical interpretability within ViTs (Jacobs et al., 23 Dec 2025).
1. Block-Recurrent Hypothesis and Motivation
The Block-Recurrent Hypothesis posits that, while standard ViTs utilize distinct residual-Transformer blocks, trained models frequently exhibit contiguous phases along depth, as evidenced by block-diagonal patterns in representational similarity matrices. These phases are hypothesized to be not only similar in representation but functionally equivalent, such that the computations across a sequence of layers may be reconstructed using a much smaller set of recurrently applied blocks.
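The $\varepsilon$-BRH formally defines this property: for a pretrained ViT of depth $L$ with patch embedding $z_0(x)$ and layer activations $z_\ell(x)$, there exist $k \ll L$ blocks $B_1, \dots, B_k$ and integers $n_1, \dots, n_k$ (with $\sum_j n_j = L$) such that for all images $x$ and every layer $\ell = n_1 + \cdots + n_{j-1} + t$ with $1 \le t \le n_j$,
$$\left\| z_\ell(x) - \big(B_j^{(t)} \circ B_{j-1}^{(n_{j-1})} \circ \cdots \circ B_1^{(n_1)}\big)(z_0(x)) \right\| \le \varepsilon,$$
where $B_j^{(n)}$ denotes application of block $B_j$ for $n$ successive layers using shared parameters. This property mandates recovery of all intermediate activations, not merely final outputs, thus distinguishing it from degenerate bottlenecked or single-block solutions.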
The emergence of such phase structure motivates Raptor: a constructive, recurrent surrogate scheme that distills a pretrained ViT into $k$ functionally recurrent, weight-shared blocks.
2. Mathematical Formulation and Training Methodology
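Given an input image $x$ with patch encoder output $z_0$, let $z_\ell$ denote the $\ell$th layer transformer output. The layer depth is partitioned into $k$ contiguous phases of lengths $n_1, \dots, n_k$ by maximally block-diagonalizing the layer-layer cosine similarity matrix $S$ using a max-cut dynamic programming procedure.
The Raptor surrogate introduces $k$ weight-shared blocks $B_1, \dots, B_k$, each architecturally matching its ViT counterparts. The activation at layer $\ell$ for the student is defined as:
$$\hat{z}_\ell = B_{j(\ell)}(\hat{z}_{\ell-1}),$$
where $\ell$ falls into the $j(\ell)$th phase and $\hat{z}_0 = z_0$.
The training objective combines two strategies:
- Teacher-forcing (TF), in which each block receives the teacher's previous activation:
$$\mathcal{L}_{\mathrm{TF}} = \sum_{\ell=1}^{L} \big\| B_{j(\ell)}(z_{\ell-1}) - z_\ell \big\|^2$$
- Autoregressive (AR), in which each block receives the student's own previous activation:
$$\mathcal{L}_{\mathrm{AR}} = \sum_{\ell=1}^{L} \big\| \hat{z}_\ell - z_\ell \big\|^2$$
Overall, the loss is:
$$\mathcal{L} = \lambda\, \mathcal{L}_{\mathrm{TF}} + (1 - \lambda)\, \mathcal{L}_{\mathrm{AR}} + \mathcal{R},$$
with $\lambda$ annealed from $1$ toward $0$ over the early epochs and $\mathcal{R}$ representing regularization (e.g., weight decay).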
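To make the difference between the two loss terms concrete, here is a minimal NumPy sketch of a student rollout. It is an illustration under stated assumptions, not the authors' implementation: the helper names `phase_schedule` and `raptor_losses` are hypothetical, blocks are arbitrary callables, errors are averaged with a mean squared error, and the regularization term is left to the optimizer.

```python
import numpy as np

def phase_schedule(lengths):
    """Phase index (0-based) used at each of the L = sum(lengths) layers."""
    return [j for j, n in enumerate(lengths) for _ in range(n)]

def raptor_losses(blocks, lengths, teacher_acts, lam):
    """Return (mixed, tf, ar) losses for one input.

    blocks       : list of k callables, one weight-shared block per phase
    lengths      : phase lengths n_1..n_k (assumed known)
    teacher_acts : list [z_0, ..., z_L] of teacher activations
    lam          : weight on the teacher-forcing term, in [0, 1]
    """
    schedule = phase_schedule(lengths)
    if len(schedule) != len(teacher_acts) - 1:
        raise ValueError("phase lengths must sum to the teacher depth")
    tf = ar = 0.0
    z_hat = teacher_acts[0]  # the student starts from the patch embedding z_0
    for ell, j in enumerate(schedule, start=1):
        target = teacher_acts[ell]
        # Teacher forcing: the block sees the teacher's previous activation.
        tf += np.mean((blocks[j](teacher_acts[ell - 1]) - target) ** 2)
        # Autoregressive: the block sees the student's own previous output.
        z_hat = blocks[j](z_hat)
        ar += np.mean((z_hat - target) ** 2)
    tf, ar = tf / len(schedule), ar / len(schedule)
    return lam * tf + (1.0 - lam) * ar, tf, ar

# Toy teacher that really is block-recurrent: two linear maps, lengths 2 and 3.
rng = np.random.default_rng(0)
mats = [0.9 * np.eye(4), np.roll(np.eye(4), 1, axis=0)]
blocks = [lambda z, M=M: z @ M.T for M in mats]
lengths = [2, 3]
acts = [rng.normal(size=(5, 4))]  # 5 tokens, width 4
for j in phase_schedule(lengths):
    acts.append(blocks[j](acts[-1]))
mixed, tf, ar = raptor_losses(blocks, lengths, acts, lam=0.5)
```

In this toy case the student reproduces the teacher exactly, so both terms vanish. With an imperfect student the AR term also accumulates drift from earlier layers, which is why it is the stricter of the two.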
3. Practical Training Procedure and Phase Discovery
Phase boundaries are identified by maximizing intra-block layer similarity via dynamic programming, whose cost is polynomial in the number of layers, ensuring optimal contiguous block partitioning.
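The partitioning step can be sketched as a standard contiguous-segmentation dynamic program. The exact objective used in the paper is not given here, so the score below is an assumption: each phase contributes its within-phase similarity mass divided by its length, which rewards blocks that are internally homogeneous. The function names are hypothetical, and this version costs $O(kL^2)$ time.

```python
import numpy as np

def cosine_similarity_matrix(acts):
    """acts: (L, D) array with one flattened activation vector per layer."""
    unit = acts / np.linalg.norm(acts, axis=1, keepdims=True)
    return unit @ unit.T

def partition_phases(S, k):
    """Split layers 0..L-1 into k contiguous phases and return their lengths.

    Maximizes the sum over phases of (within-phase similarity / phase length);
    this score is an assumed stand-in for the paper's objective.
    """
    L = S.shape[0]
    # 2-D prefix sums so that any square block sum costs O(1).
    P = np.zeros((L + 1, L + 1))
    P[1:, 1:] = S.cumsum(axis=0).cumsum(axis=1)

    def score(a, b):  # phase covering layers a..b-1
        total = P[b, b] - P[a, b] - P[b, a] + P[a, a]
        return total / (b - a)

    best = np.full((k + 1, L + 1), -np.inf)  # best[j, b]: j phases cover b layers
    back = np.zeros((k + 1, L + 1), dtype=int)
    best[0, 0] = 0.0
    for j in range(1, k + 1):
        for b in range(j, L + 1):
            for a in range(j - 1, b):
                cand = best[j - 1, a] + score(a, b)
                if cand > best[j, b]:
                    best[j, b] = cand
                    back[j, b] = a
    # Walk the back-pointers from the last layer to recover the boundaries.
    bounds = [L]
    for j in range(k, 0, -1):
        bounds.append(int(back[j, bounds[-1]]))
    bounds.reverse()
    return [bounds[i + 1] - bounds[i] for i in range(k)]

# Synthetic block-diagonal similarity: phases of 3, 5 and 4 layers.
S = np.full((12, 12), 0.2)
for start, n in [(0, 3), (3, 5), (8, 4)]:
    S[start:start + n, start:start + n] = 1.0
lengths = partition_phases(S, 3)
```

On a real model, `S` would come from `cosine_similarity_matrix` applied to per-layer activations averaged over a validation set, and `k` would be chosen by inspecting how the best score changes as `k` grows.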
The Raptor training pipeline consists of two stages:
- Block-wise Pretraining: Each block is initially trained as a multi-layer student, with both TF and AR loss, using AdamW with weight decay, learning rate warmup followed by cosine decay, batch size 64, and the TF/AR mixing weight $\lambda$ annealed over the course of the stage. Token weighting (e.g., separate loss weights for cls, register, and patch tokens) can be introduced to optimize specific token groups.
- End-to-End Surrogate Assembly: All blocks are composed into a full $L$-layer recurrent model, trained end-to-end with pure AR loss (i.e., $\lambda = 0$) for 20 epochs, maintaining the optimizer and token weights.
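The two schedules mentioned above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the function names, the linear shape of the warmup and of the $\lambda$ anneal, and all numeric settings in the example call are hypothetical rather than taken from the paper.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def lam_at(epoch, anneal_epochs):
    """Teacher-forcing weight: 1 at the start, linearly down to 0, then held."""
    return max(0.0, 1.0 - epoch / anneal_epochs)

# Hypothetical settings, for illustration only.
schedule = [lr_at(s, total_steps=100, warmup_steps=10, peak_lr=1e-3)
            for s in range(101)]
lams = [lam_at(e, anneal_epochs=5) for e in range(8)]
```

Once `lam_at` reaches zero the objective is purely autoregressive, which matches the end-to-end assembly stage.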
For evaluation, the backbone can be frozen and shallow probe heads trained on image classification (ImageNet-1k), semantic segmentation (ADE20k), or depth estimation (NYUv2).
Small-scale experiments (CIFAR-100) reveal that increasing the teacher's stochastic depth rate enhances layer similarity, improves Raptor fit, and increases both teacher and student final accuracy. R-squared matching quantifies alignment between student and teacher token embeddings at each layer as a function of the stochastic depth rate.
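A per-layer R-squared of this kind can be computed as below. The paper's exact definition is not given in this text, so this is one plausible reading: the residual between student and teacher embeddings is compared against the teacher's variance around its per-layer token mean. The function name and array layout are assumptions.

```python
import numpy as np

def r_squared_per_layer(teacher, student):
    """teacher, student: arrays of shape (L, N, D) = (layers, tokens, width).

    Returns an array of L values; 1 means a perfect match, 0 means the student
    is no better than predicting the teacher's mean token at that layer.
    """
    resid = ((teacher - student) ** 2).sum(axis=(1, 2))
    centered = teacher - teacher.mean(axis=1, keepdims=True)
    total = (centered ** 2).sum(axis=(1, 2))
    return 1.0 - resid / total

rng = np.random.default_rng(0)
teacher = rng.normal(size=(6, 10, 8))
noisy_student = teacher + 0.1 * rng.normal(size=teacher.shape)
scores = r_squared_per_layer(teacher, noisy_student)
```

Plotting `scores` against layer index for teachers trained at different stochastic depth rates would reproduce the kind of comparison described above.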
4. Empirical Evaluation and Causal Analysis
Empirical investigations reveal that off-the-shelf ViTs (e.g., DINOv2, CLIP, plain supervised) consistently exhibit block-diagonal layer similarity matrices. The identified phases correlate strongly with the compressibility achievable via Raptor; random (non-phase-aligned) partitions perform significantly worse than optimal phase-aligned partitions.
Performance Metrics
| Backbone | Task | Baseline | Raptor ($k=2$) | Raptor ($k=3$) | Raptor ($k=4$) |
|---|---|---|---|---|---|
| DINOv2-Base | ImageNet-1k | 84.5% | 81.2% (96%) | 83.0% (98%) | 83.2% |
| DINOv2-Base | ADE20k mIoU | 47.5 | — | 43.0 | — |
| DINOv2-Base | NYUv2 RMSE | 0.578 | — | 0.618 | — |
Using only $k=2$ blocks, Raptor recovers 96% of the top-1 linear probe accuracy of DINOv2-Base on ImageNet-1k at equivalent computational cost; $k=3$ recovers 98%. Per-layer cosine alignment remains above 0.7 throughout.
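The recovery percentages follow directly from the ImageNet-1k row of the table, dividing each Raptor accuracy (taken in column order) by the 84.5% baseline:

$$\frac{81.2}{84.5} \approx 0.961, \qquad \frac{83.0}{84.5} \approx 0.982, \qquad \frac{83.2}{84.5} \approx 0.985.$$

The first two ratios round to the 96% and 98% quoted above; the third column has no percentage in the table, and the same calculation puts it at roughly 98.5%.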
Causal swapping experiments validate the functional distinctness of phases: replacing a layer with another from the same recurrent block maintains top-1 accuracy, while cross-phase swaps severely degrade performance, confirming that discovered phases correspond to genuinely distinct computations.
Computational cost remains unchanged: a $k$-block Raptor incurs the same FLOPs as an $L$-layer ViT, since block $j$ is applied for $n_j$ steps with shared parameters and $\sum_j n_j = L$.
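The accounting behind this statement is simple enough to write down. The sketch below assumes every block has the same parameter count and per-application cost, which holds when the student blocks match the teacher's architecture; the function name and the numbers in the example call are hypothetical.

```python
def raptor_cost(lengths, params_per_block, flops_per_block):
    """Compare an L-layer teacher with a k-block weight-shared surrogate."""
    k, depth = len(lengths), sum(lengths)
    return {
        "teacher_params": depth * params_per_block,
        "raptor_params": k * params_per_block,  # one weight set per phase
        "teacher_flops": depth * flops_per_block,
        # Each block still runs once per layer of its phase.
        "raptor_flops": sum(n * flops_per_block for n in lengths),
    }

# Hypothetical per-block figures, for illustration only.
cost = raptor_cost([4, 4, 4], params_per_block=7_000_000,
                   flops_per_block=1_400_000_000)
```

The surrogate stores fewer distinct weights (a factor of $k/L$ in the block stack) but performs the same number of block applications, so the saving is in parameters, not in compute.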
5. Dynamical Interpretability Insights
Raptor enables detailed dynamical systems analysis of ViT computation by treating depth index $\ell$ as discrete time. Several interpretability results characterize the evolution of token embeddings:
- Directional Convergence: Normalized token directions $z_\ell / \|z_\ell\|$ approach class-dependent angular basins, with the cosine similarity between the direction at layer $\ell$ and the final direction displaying S-shaped convergence to 1. Principal component projections show trajectories for distinct ImageNet classes clustering into tight angular basins.
- Token-specific Angular Speeds: The angular speed, i.e., the angle between a token's directions at consecutive layers, is low and stable for register tokens; moderate for patch tokens; but spikes for the cls token in late phases (aggregation). Max-cut phase boundaries correspond to abrupt changes in these dynamics.
- Self-correction Under Perturbations: Injecting small Gaussian noise at intermediate layers results in log-linear contraction of the deviation from the unperturbed trajectory for patch tokens, yet accumulated error for cls tokens in the final phase (consistent with their readout role).
- Low-rank Collapse and Coherence: The stable rank (squared Frobenius norm divided by squared spectral norm) of angular update matrices drops from approximately 20 to 6 over depth. Patch tokens' coherence increases towards unity, indicating late-layer collective movement and convergence to low-dimensional attractors.
- Dynamic Mode Decomposition (DMD): For each token group, DMD of normalized group mean trajectories yields spectral modes with eigenvalues just inside the unit circle, suggesting weak contraction and dominant rotational dynamics. The cls modes are closest to +1, reflecting long memory, while patch groups are more contractive and rotational.
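Three of the diagnostics in this list have compact NumPy forms, sketched below on a synthetic trajectory. These are generic textbook versions, not the paper's code: the function names are hypothetical, the stable rank is applied to an arbitrary matrix, and the DMD variant is the plain least-squares ("exact") one without rank truncation.

```python
import numpy as np

def angular_speed(traj):
    """traj: (T, N, D) token states over depth.

    Returns a (T-1, N) array of angles (radians) between each token's
    directions at consecutive depth steps.
    """
    u = traj / np.linalg.norm(traj, axis=-1, keepdims=True)
    cos = np.clip((u[1:] * u[:-1]).sum(axis=-1), -1.0, 1.0)
    return np.arccos(cos)

def stable_rank(M):
    """Squared Frobenius norm over squared spectral norm."""
    s = np.linalg.svd(M, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)

def dmd_eigenvalues(path):
    """path: (T, D) trajectory. Fit x_{t+1} ~ A x_t by least squares and
    return the eigenvalues of A."""
    X, Y = path[:-1].T, path[1:].T  # columns are consecutive states
    A = Y @ np.linalg.pinv(X)
    return np.linalg.eigvals(A)

# Synthetic 2-D trajectory: rotate by 0.1 rad and shrink by 5% per step.
theta, r = 0.1, 0.95
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
path = [np.array([1.0, 0.0])]
for _ in range(9):
    path.append(r * R @ path[-1])
path = np.stack(path)                # (10, 2)

speeds = angular_speed(path[:, None, :])  # treat it as a single token
eigs = dmd_eigenvalues(path)
```

For this trajectory the angular speed is constant at the rotation angle and the DMD eigenvalue moduli equal the contraction factor, i.e., they sit just inside the unit circle, which is the signature described for weakly contractive, rotational dynamics.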
6. Practical Application, Limitations, and Extensions
Raptor can be applied to any pretrained ViT with standard residual Transformer blocks. Practitioners first compute a layer-layer cosine similarity matrix over a small validation set and apply contiguous max-cut dynamic programming, typically yielding a small number of phases. If strong block-diagonal structure is present, Raptor is likely to succeed.
Recommended training follows the two-stage TF → AR distillation with annealing of the TF/AR mixing weight, the AdamW optimizer, learning rate warmup and cosine schedule, batch size 64, and 20–40 epochs. For foundation-scale models, the backbone should be frozen and only probe heads fine-tuned on downstream tasks.
Limitations include a residual accuracy penalty (in the ImageNet-1k results above, between 1.3 and 3.3 points of top-1 accuracy depending on the number of blocks); bridging this gap may require non-autonomous recurrence (e.g., explicit depth encodings) or partial sharing (e.g., block-specific adapters). Application to very deep or irregular-phase networks can be challenging if representation phases are ambiguous or heavily overlapping.
Possible extensions include:
- Integrating explicit depth encodings, such as a depth-scale MLP, to enable non-autonomous recurrence.
- Augmenting each block with small, block-specific adapter layers to capture finer inter-phase distinctions.
- Applying the methodology to language transformers or hybrid vision-language backbones.
- Utilizing Raptor surrogates for formal verification and for systematic rollout sampling in interpretability studies.
7. Implications and Future Directions
The Raptor framework offers a scalable, mechanistically grounded route for compressing and interpreting ViTs by functionally decomposing depth into genuinely recurrent computation. Empirical validation across standard vision models establishes the presence of robust phase structure and the feasibility of accurate recurrent surrogates. This suggests a path toward dynamical systems-theoretic analyses, algorithmic complexity benchmarking, and systematized interpretability rooted in the identified low-complexity recurrent programs. Further exploration of non-autonomous recurrence and extension to non-visual domains are promising areas for future research (Jacobs et al., 23 Dec 2025).