ResiDual Transformer Alignment with Spectral Decomposition (2411.00246v2)
Abstract: When examined through the lens of their residual streams, a puzzling property emerges in transformer networks: residual contributions (e.g., attention heads) sometimes specialize in specific tasks or input attributes. In this paper, we analyze this phenomenon in vision transformers, focusing on the spectral geometry of residuals, and explore its implications for modality alignment in vision-language models. First, we link it to the intrinsically low-dimensional structure of visual head representations, zooming into their principal components and showing that they encode specialized roles across a wide variety of input data distributions. Then, we analyze the effect of head specialization in multimodal models, focusing on how improved alignment between text and specialized heads impacts zero-shot classification performance. This specialization-performance link consistently holds across diverse pre-training data, network sizes, and objectives, demonstrating a powerful new mechanism for boosting zero-shot classification through targeted alignment. Ultimately, we translate these insights into actionable terms by introducing ResiDual, a technique for spectral alignment of the residual stream. Much like panning for gold, it lets the noise from irrelevant unit principal components (i.e., attributes) wash away to amplify task-relevant ones. Remarkably, this dual perspective on modality alignment yields fine-tuning-level performance on different data distributions while modelling an extremely interpretable and parameter-efficient transformation, as we extensively show on 70 pre-trained network-dataset combinations (7 models, 10 datasets).
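To make the abstract's description concrete, the following is a minimal PyTorch sketch of the general idea it outlines: decompose each residual unit's contribution (e.g., an attention head's output) into its principal components and learn per-component gains that amplify task-relevant directions while letting irrelevant ones wash out. This is an illustrative sketch only, not the paper's released implementation; the class name `ResiDualSketch`, the use of SVD for the per-unit PCA, the number of components `k`, and the learnable `gains` are all assumptions made here for exposition.

```python
import torch
import torch.nn as nn


class ResiDualSketch(nn.Module):
    """Hypothetical sketch of spectral reweighting of residual-unit contributions.

    Each unit (e.g., attention head) gets a PCA basis estimated from its
    contributions on a reference set; a learnable gain per principal component
    then rescales the projections before the units are summed back into the
    residual stream.
    """

    def __init__(self, reference_outputs: torch.Tensor, k: int = 32):
        # reference_outputs: (num_units, num_samples, dim) pre-computed unit
        # contributions on a reference set; assumes num_samples >= k.
        super().__init__()
        num_units, _, dim = reference_outputs.shape
        bases = []
        for u in range(num_units):
            # Per-unit principal directions via SVD of the centered contributions.
            x = reference_outputs[u] - reference_outputs[u].mean(0, keepdim=True)
            _, _, vh = torch.linalg.svd(x, full_matrices=False)
            bases.append(vh[:k])                                # (k, dim)
        self.register_buffer("bases", torch.stack(bases))       # (num_units, k, dim)
        # Learnable per-component gains: the spectral alignment parameters.
        self.gains = nn.Parameter(torch.ones(num_units, k))

    def forward(self, unit_outputs: torch.Tensor) -> torch.Tensor:
        # unit_outputs: (num_units, batch, dim) residual contributions for a batch.
        coeffs = torch.einsum("ubd,ukd->ubk", unit_outputs, self.bases)       # project
        rewt = torch.einsum("ubk,uk,ukd->ubd", coeffs, self.gains, self.bases)  # rescale
        return rewt.sum(0)                                       # (batch, dim) aligned stream
```

Under these assumptions, one would pre-compute per-head contributions with a frozen backbone, instantiate the module on a reference set, and fit only `gains` (a few parameters per head) with the downstream zero-shot or contrastive objective against the text class embeddings.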