Frozen Transformers in Language Models Are Effective Visual Encoder Layers (2310.12973v2)

Published 19 Oct 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: This paper reveals that LLMs, despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a simple yet previously overlooked strategy -- employing a frozen transformer block from pre-trained LLMs as a constituent encoder layer to directly process visual tokens. Our work pushes the boundaries of leveraging LLMs for computer vision tasks, significantly departing from conventional practices that typically necessitate a multi-modal vision-language setup with associated language prompts, inputs, or outputs. We demonstrate that our approach consistently enhances performance across a diverse range of tasks, encompassing pure 2D and 3D visual recognition tasks (e.g., image and point cloud classification), temporal modeling tasks (e.g., action recognition), non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g., 2D/3D visual question answering and image-text retrieval). Such improvements are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and OPT) and different LLM transformer blocks. We additionally propose the information filtering hypothesis to explain the effectiveness of pre-trained LLMs in visual encoding -- the pre-trained LLM transformer blocks discern informative visual tokens and further amplify their effect. This hypothesis is empirically supported by the observation that the feature activation, after training with LLM transformer blocks, exhibits a stronger focus on relevant regions. We hope that our work inspires new perspectives on utilizing LLMs and deepening our understanding of their underlying mechanisms. Code is available at https://github.com/ziqipang/LM4VisualEncoding.

Insights into "Frozen Transformers in Language Models Are Effective Visual Encoder Layers"

The paper "Frozen Transformers in Language Models Are Effective Visual Encoder Layers" presents a nuanced exploration of LLMs as visual encoders, independent of conventional multi-modal frameworks. Through a comprehensive evaluation across a diverse set of visual tasks, the authors demonstrate that a frozen transformer block from a pre-trained LLM significantly enhances visual encoding performance. This is achieved without any language inputs or prompts, marking a clear departure from typical vision-language models, which require multi-modal integration.

In their experiments, the authors consider a variety of visual tasks, including 2D and 3D recognition, temporal modeling, non-semantic tasks like motion forecasting, and even multi-modal tasks such as 2D/3D visual question answering. The key design is the insertion of a pre-trained LLM transformer block into an existing visual encoder as an additional feature-processing layer, bridging textual knowledge and visual representations. The authors argue that this leverages the rich semantic priors encapsulated in LLMs, which prove adept at discerning and amplifying informative visual tokens despite never having been exposed to visual data during pre-training.
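
To make the mechanism concrete, below is a minimal PyTorch sketch of this design, assuming a generic visual backbone that outputs patch tokens. The class name FrozenLMBlockHead, the dimensions, and the stand-in nn.TransformerEncoderLayer (used here in place of an actual LLaMA/OPT decoder block) are illustrative assumptions, not the authors' implementation; see the official repository for the real code.

```python
# Sketch: append a frozen transformer block (a stand-in for an LLM block) on top
# of visual tokens, bridged by trainable linear projections.
import torch
import torch.nn as nn

class FrozenLMBlockHead(nn.Module):
    """Processes visual tokens with a frozen LM transformer block via trainable projections."""
    def __init__(self, vis_dim: int, lm_dim: int, lm_block: nn.Module):
        super().__init__()
        self.proj_in = nn.Linear(vis_dim, lm_dim)    # trainable: visual dim -> LM dim
        self.proj_out = nn.Linear(lm_dim, vis_dim)   # trainable: LM dim -> visual dim
        self.lm_block = lm_block
        for p in self.lm_block.parameters():         # keep the LM block frozen
            p.requires_grad = False

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_visual_tokens, vis_dim) from a ViT-style backbone
        x = self.proj_in(tokens)
        x = self.lm_block(x)
        return self.proj_out(x)

# Stand-in for a pre-trained LLM transformer block; in practice this would be
# a decoder layer loaded from LLaMA or OPT weights.
lm_block = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
head = FrozenLMBlockHead(vis_dim=384, lm_dim=1024, lm_block=lm_block)

visual_tokens = torch.randn(2, 197, 384)  # e.g. ViT-S/16: 196 patch tokens + [CLS]
out = head(visual_tokens)
print(out.shape)                          # torch.Size([2, 197, 384])
```

Only the two projections (and the rest of the visual model) are updated during training; the LM block itself stays frozen.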

A significant contribution of the paper is the proposal of the "information filtering hypothesis". This hypothesis suggests that the effectiveness of LLM transformer blocks in visual encoding lies in their ability to filter and amplify informative visual tokens. By highlighting relevant regions in the visual field with heightened feature activation, these transformers guide the model towards more semantically meaningful representations. This hypothesis is empirically supported by experiments showing a pronounced concentration on relevant visual regions after integration of the LLM transformers.
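
As a rough illustration of how such a focus could be inspected (not the authors' exact visualization procedure), one can map per-token feature magnitudes back onto the patch grid; the function name and shapes below are assumptions.

```python
# Sketch: turn per-token activation magnitudes into a heatmap over image patches.
import torch

def token_activation_map(features: torch.Tensor, grid: int = 14) -> torch.Tensor:
    """features: (num_patches, dim) patch features, excluding any [CLS] token."""
    magnitudes = features.norm(dim=-1)  # L2 norm per visual token
    magnitudes = (magnitudes - magnitudes.min()) / (magnitudes.max() - magnitudes.min() + 1e-6)
    return magnitudes.reshape(grid, grid)  # normalized heatmap over the patch grid

patch_features = torch.randn(196, 384)   # e.g. 14x14 grid of ViT-S/16 patch features
heatmap = token_activation_map(patch_features)
print(heatmap.shape)                     # torch.Size([14, 14])
```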

Experimentally, the authors show that performance improves consistently across tasks when the frozen blocks are integrated. For instance, image classification benchmarks show notable gains in both standard accuracy and robustness to corruptions and adversarial examples. These gains are further corroborated in point cloud recognition, video-based action recognition, and motion forecasting, underscoring the versatility and robustness of the proposed method.

Moreover, the paper examines the scalability of this approach, showing that the benefits of incorporating LLM transformer blocks become pronounced only at sufficient model scale, such as that of the LLaMA and OPT models studied. Additional analyses probe the influence of LLM transformer depth, revealing that different layers impart distinct enhancements, with the final transformer blocks often yielding the best results across tasks.

Despite the promising results, the paper maintains a critical perspective, acknowledging that while the hypothesis offers a robust framework for understanding the benefits of frozen transformers, further inquiry is warranted to delineate the roles of individual network layers and the dynamics of the training process. The ongoing exploration and experimental validation are expected to deepen the understanding of LLMs in visual tasks, potentially catalyzing new avenues of research in AI.

In conclusion, this paper provides a thought-provoking step forward in the exploration of LLMs for visual data processing, challenging existing paradigms of vision-language integration. The authors' approach invites a reevaluation of how advances in language modeling can be applied symbiotically to computer vision tasks, suggesting broader implications for multimodal learning and representation theory in AI research.

Authors (4)
  1. Ziqi Pang
  2. Ziyang Xie
  3. Yunze Man
  4. Yu-Xiong Wang
Citations (18)