Insights into "Frozen Transformers in LLMs Are Effective Visual Encoder Layers"
The paper "Frozen Transformers in LLMs Are Effective Visual Encoder Layers" presents a nuanced exploration of LLMs as visual encoders, independent of conventional multi-modal frameworks. Through a comprehensive evaluation across a diverse set of visual tasks, the authors demonstrate that a single frozen transformer block from a pre-trained LLM significantly enhances visual encoding performance. This is achieved without any language inputs or prompts, a clear departure from typical vision-language models, which depend on multi-modal integration.
In their experiments, the authors consider a variety of visual tasks, including 2D and 3D recognition, temporal modeling, non-semantic tasks such as motion forecasting, and multi-modal tasks such as 2D/3D visual question answering. The key design is the insertion of a pre-trained LLM transformer block into an existing visual encoder as a feature-processing layer, bridging the gap between textual knowledge and visual representation. The authors argue that this leverages the rich semantic priors encapsulated within LLMs, which prove adept at discerning and amplifying informative visual tokens even though they were never exposed to visual data during pre-training. A minimal sketch of this wiring appears below.
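As an illustration, the following is a minimal PyTorch sketch of the design under stated assumptions: the class name, the dimensions, and the use of nn.TransformerEncoderLayer as a stand-in for a real frozen LLM block (e.g. a LLaMA decoder layer) are choices made here for clarity, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FrozenLLMVisualLayer(nn.Module):
    """Appends a frozen transformer block to the output of a visual encoder.

    Hypothetical sketch: `llm_block` stands in for a pre-trained LLM layer;
    here an untrained nn.TransformerEncoderLayer is used purely to show the
    wiring. Trainable linear projections align the visual feature width with
    the LLM width and map it back afterwards.
    """

    def __init__(self, vis_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj_in = nn.Linear(vis_dim, llm_dim)
        self.llm_block = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=32, batch_first=True
        )
        self.proj_out = nn.Linear(llm_dim, vis_dim)
        # Freeze the LLM block; only the projections (and the visual encoder
        # upstream of this module) receive gradient updates.
        for p in self.llm_block.parameters():
            p.requires_grad = False

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, vis_dim), e.g. ViT patch tokens
        x = self.proj_in(visual_tokens)
        x = self.llm_block(x)
        return self.proj_out(x)  # back to (batch, num_tokens, vis_dim)


if __name__ == "__main__":
    layer = FrozenLLMVisualLayer()
    tokens = torch.randn(2, 197, 768)   # e.g. ViT-B/16: 196 patches + [CLS]
    print(layer(tokens).shape)          # torch.Size([2, 197, 768])
```

The frozen block adds representational capacity drawn from language pre-training while keeping the number of trainable parameters close to that of the original visual encoder plus two linear layers.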
A significant contribution of the paper is the "information filtering hypothesis". It posits that the effectiveness of LLM transformer blocks in visual encoding stems from their ability to filter and amplify informative visual tokens. By highlighting relevant regions of the visual field with heightened feature activation, these blocks steer the model toward more semantically meaningful representations. The hypothesis is supported empirically by visualizations showing a pronounced concentration of activation on relevant visual regions once the LLM transformer is integrated; a simple way to quantify this is sketched below.
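To make the hypothesis concrete, here is a minimal sketch, not taken from the paper's code, of one way to measure per-token feature activation before and after the frozen block; the L2-norm proxy and the function name are assumptions for illustration.

```python
import torch

def token_activation_map(features: torch.Tensor) -> torch.Tensor:
    """Per-token activation magnitude, normalized per image to [0, 1].

    features: (batch, num_tokens, dim) visual tokens from the encoder.
    Returns:  (batch, num_tokens) scores; higher values indicate tokens the
              network emphasizes. The scores can be reshaped into the patch
              grid and overlaid on the input image for inspection.
    """
    mags = features.norm(dim=-1)                     # L2 norm per token
    mags = mags - mags.amin(dim=1, keepdim=True)     # shift minimum to 0
    return mags / mags.amax(dim=1, keepdim=True).clamp_min(1e-6)
```

Comparing such maps for encoder outputs with and without the frozen LLM block is one way to check whether activation concentrates on semantically relevant regions, in the spirit of the paper's visualizations.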
Experimentally, the authors show that performance improves consistently across tasks when the frozen block is integrated. For instance, image classification benchmarks show notable gains in both standard accuracy and robustness to noise and adversarial examples. Similar improvements appear in point cloud recognition, video-based action recognition, and motion forecasting, underscoring the versatility and robustness of the proposed method.
Moreover, the paper examines the scalability of the approach, showing that the benefits of incorporating LLM transformers become pronounced only at sufficient model scale, as with LLaMA and OPT models. Further analysis highlights the influence of transformer depth: different layers impart distinct gains, with the final transformer blocks often yielding the best results across tasks. A sketch of selecting such a layer follows.
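For concreteness, the snippet below sketches, under assumptions, how one might pull out and freeze a specific decoder block from a pre-trained checkpoint with the Hugging Face transformers library; the checkpoint id is only an example, and the attribute path shown applies to LLaMA-style models (other families, such as OPT, expose their blocks under a different path).

```python
from transformers import AutoModel

# Example checkpoint id; any LLaMA-style causal LM checkpoint would do.
llm = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf")

block_idx = -1                       # final blocks often work best per the paper's analysis
llm_block = llm.layers[block_idx]    # LLaMA-style models store decoder blocks in .layers

# Freeze the chosen block so only the surrounding visual encoder and the
# projection layers are trained.
for p in llm_block.parameters():
    p.requires_grad = False
```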
Despite the promising results, the paper maintains a critical perspective, acknowledging that while the hypothesis offers a useful framework for understanding the benefits of frozen transformers, further inquiry is needed to delineate the roles of individual network layers and the dynamics of training. Continued exploration and experimental validation should deepen the understanding of LLMs in visual tasks and may open new avenues of research.
In conclusion, the paper takes a thought-provoking step forward in applying LLMs to visual data processing, challenging existing paradigms of vision-language integration. The authors' approach invites a reevaluation of how advances in language modeling can be applied to computer vision, with broader implications for multimodal learning and representation learning in AI research.