Abstract: CRATE, a white-box transformer architecture designed to learn compressed and sparse representations, offers an intriguing alternative to standard vision transformers (ViTs) due to its inherent mathematical interpretability. Despite extensive investigations into the scaling behaviors of language and vision transformers, the scalability of CRATE remains an open question, which this paper aims to address. Specifically, we propose CRATE-$\alpha$, featuring strategic yet minimal modifications to the sparse coding block in the CRATE architecture, together with a light training recipe designed to improve scalability. Through extensive experiments, we demonstrate that CRATE-$\alpha$ scales effectively with larger model sizes and datasets. For example, our CRATE-$\alpha$-B substantially outperforms the prior best CRATE-B model on ImageNet classification by 3.7%, achieving an accuracy of 83.2%. When scaling further, our CRATE-$\alpha$-L obtains an ImageNet classification accuracy of 85.1%. Notably, these performance improvements are achieved while preserving, and potentially even enhancing, the interpretability of the learned models: the token representations of increasingly larger trained CRATE-$\alpha$ models yield increasingly higher-quality unsupervised object segmentation of images. The project page is https://rayjryang.github.io/CRATE-alpha/.
The paper introduces CRATE-α, a white-box transformer that improves scalability while preserving interpretability via modifications to the sparse coding block.
The overcomplete dictionary alone yields a 5.3% performance gain, with a decoupled dictionary and residual connections contributing a further 2.0% and 0.7%, respectively.
Experiments on ImageNet and DataComp1B demonstrate significant accuracy improvements and compute-efficient scaling across model sizes.
An Overview of the CRATE-α Model for Scalable and Interpretable Vision Transformers
This paper addresses an important gap in the literature concerning the scalability and interpretability of vision transformers. The authors introduce CRATE-α, a white-box transformer architecture that builds on the earlier CRATE model. The focus of CRATE-α is to enhance both mathematical interpretability and performance at scale, achieved through specific modifications to the sparse coding block and an optimized training recipe.
Introduction
While transformers have proven remarkably effective across various domains, including NLP and computer vision, their design is often empirical and lacks rigorous mathematical grounding. The CRATE model introduced a white-box architecture that employs unrolled optimization for sparse rate reduction, offering a more interpretable alternative to standard transformers. However, questions about the scalability of CRATE remained unanswered. This paper aims to close that gap by proposing CRATE-α, which incorporates minimal yet strategic modifications to improve scalability without sacrificing interpretability.
CRATE-α Architecture
The CRATE-α variant is designed to address specific constraints observed in the original CRATE model, particularly in its ISTA (Iterative Shrinkage-Thresholding Algorithm) block. Three key modifications are introduced:
Overparameterized Sparse Coding Block: The adoption of an overcomplete dictionary $D \in \mathbb{R}^{d \times Cd}$, where $C > 1$, improves the expressiveness of the sparse coding block. This change alone leads to a 5.3% improvement in performance.
Decoupled Dictionary: Introducing a decoupled dictionary, so that the dictionary used to decode the sparse codes is no longer tied to the one used to encode them, further boosts performance by 2.0%.
Residual Connection: Adding a residual connection contributes an additional 0.7% improvement and further improves the model's scaling behavior.
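Taken together, the three modifications can be illustrated with a small sketch. The NumPy code below is a minimal, illustrative rendering rather than the paper's exact implementation: it assumes a single non-negative ISTA-style step with an overcomplete encoding dictionary `D1`, a decoupled decoding dictionary `D2`, and a residual connection; the function and variable names are hypothetical.

```python
import numpy as np

def sparse_coding_block(z, D1, D2, lam=0.1, step=0.1):
    """Sketch of a CRATE-alpha-style sparse coding block.

    z  : (n_tokens, d)   input token representations
    D1 : (d, C * d)      overcomplete encoding dictionary (C > 1)
    D2 : (d, C * d)      decoupled decoding dictionary
    """
    # One ISTA-like proximal step: correlate tokens with the dictionary,
    # then soft-threshold and clamp at zero to obtain sparse codes.
    codes = np.maximum(step * (z @ D1) - lam, 0.0)  # (n_tokens, C * d)
    # Decode with the decoupled dictionary and add a residual connection.
    return z + codes @ D2.T
```

Because C > 1, the code layer is wider than the token dimension, which is what overparameterizes the block; tying `D2 = D1` would recover a coupled dictionary.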
Experimental Evaluation
Scaling from Base to Large
To explore scalability, the authors conducted extensive pre-training on ImageNet-21K and fine-tuning on ImageNet-1K. The results are compelling:
CRATE-α-B/32: Achieves 76.5% top-1 accuracy on ImageNet-1K.
CRATE-α-L/8: Reaches a top-1 accuracy of 85.1%, significantly outperforming the original CRATE models.
The improved training-loss trends indicate that CRATE-α benefits from increased model capacity, overcoming the diminishing returns observed in the original CRATE models.
Scaling to Huge
To further test scalability, the authors utilized the DataComp1B dataset within a vision-language pre-training paradigm similar to CLIP:
CRATE-CLIPA-L/14: Achieves 69.2% zero-shot top-1 accuracy on ImageNet-1K after the pre-training stage.
CRATE-CLIPA-H/14: Improves zero-shot top-1 accuracy to 72.3%, demonstrating strong scalability even with massive data.
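CLIP-style zero-shot classification, as used in this evaluation, scores each image against text embeddings of class-name prompts and picks the best match. A minimal sketch of that scoring step (illustrative only; it assumes pre-computed, unnormalized image and text features):

```python
import numpy as np

def zero_shot_classify(image_feats, text_feats):
    """image_feats: (n_images, d)  image embeddings.
    text_feats : (n_classes, d) embeddings of class-name prompts.
    Returns the predicted class index for each image."""
    # L2-normalize so the dot product becomes cosine similarity.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    # Assign each image the class whose prompt embedding is closest.
    return (img @ txt.T).argmax(axis=1)
```

No classifier head is trained: swapping in a new set of class prompts immediately yields a new classifier, which is what makes the evaluation "zero-shot".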
Compute-Efficient Scaling
The authors also explore efficient scaling strategies to reduce computational requirements without compromising performance:
Pre-training CRATE-L/32 and fine-tuning as CRATE-L/14 yields a comparable top-1 accuracy of 83.7% while reducing compute by over 70%.
Similarly, CRATE-L/32 pre-training followed by CRATE-L/8 fine-tuning achieves 84.2% top-1 accuracy, consuming just 10% of the computational resources needed to pre-train CRATE-L/8 directly.
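Fine-tuning a model pre-trained at patch size 32 with a smaller patch size changes the number of patch tokens (e.g., a 7×7 grid becomes 16×16 at 224×224 input), so the learned positional embeddings must be resized. The paper's exact adaptation procedure is not reproduced here; a common recipe for ViT-style models is bilinear interpolation of the 2D positional-embedding grid, sketched below (names are illustrative):

```python
import numpy as np

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly resize square-grid positional embeddings.
    pos_embed: (old_grid**2, d)  ->  returns (new_grid**2, d)."""
    d = pos_embed.shape[1]
    grid = pos_embed.reshape(old_grid, old_grid, d)
    # Sample coordinates in the source grid (corners aligned).
    c = np.linspace(0.0, old_grid - 1.0, new_grid)
    lo = np.floor(c).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    w = c - lo  # fractional part, shape (new_grid,)
    # Interpolate along rows, then along columns.
    rows = grid[lo] * (1 - w)[:, None, None] + grid[hi] * w[:, None, None]
    out = rows[:, lo] * (1 - w)[None, :, None] + rows[:, hi] * w[None, :, None]
    return out.reshape(new_grid * new_grid, d)
```

All other weights transfer unchanged, which is why a short fine-tuning run at the smaller patch size can recover most of the accuracy of training at that patch size from scratch.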
Semantic Interpretability
The quality of feature representations learned by CRATE-α was validated using MaskCut for unsupervised object segmentation on COCO val2017:
The learned token representations exhibit clear semantic structure, outperforming baseline models and retaining high interpretability as model size increases.
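MaskCut builds a patch-affinity graph from the model's token features and extracts object masks via normalized cuts. The following is a heavily simplified, illustrative sketch of that core idea, a single spectral bipartition of patch features into two groups; it is not the full MaskCut algorithm, and the threshold `tau` and the epsilon connectivity term are assumptions:

```python
import numpy as np

def spectral_bipartition(feats, tau=0.2, eps=1e-3):
    """feats: (n_patches, d) token features for one image.
    Returns a boolean mask over patches from a normalized-cut-style
    split (second eigenvector of the normalized graph Laplacian)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    # Thresholded cosine affinities; eps keeps the graph connected.
    W = np.where(sim > tau, sim, 0.0) + eps
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns ascending eigenvalues; index 1 is the Fiedler vector.
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    return fiedler > np.median(fiedler)
```

The intuition behind the interpretability result is that when token features of the same object are similar and those of different objects are not, this spectral split recovers object boundaries without any segmentation supervision.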
Discussion
While CRATE-α offers significant performance and interpretability gains, its high computational cost remains a barrier to broader adoption. Nevertheless, the proposed compute-efficient strategies provide a promising direction for future research. Potential applications in vision-language models and other downstream tasks could further amortize pre-training costs, making this an impactful advancement for the field.
Conclusion
CRATE-α represents a significant step toward scalable and interpretable vision transformers. By addressing the limitations of the original CRATE model through thoughtful architectural modifications and training strategies, this work provides crucial insights into the potential of large-scale, interpretable models in computer vision. Future work will likely explore further scaling and optimization techniques, broadening the deployment of interpretable machine-learning models.