- The paper presents a novel visual expert module that deeply fuses image and text features within a pretrained language model.
- It demonstrates superior performance on 14 cross-modal benchmark tests, outperforming or matching state-of-the-art models.
- CogVLM offers an open-source, efficient solution that preserves language capabilities while integrating rich visual information.
Introduction to Visual Language Models (VLMs)
Visual language models (VLMs) have emerged as powerful tools for understanding and generating content across both visual and textual domains. These models can tackle tasks such as image captioning, visual question answering (VQA), visual grounding, and more. VLMs also exhibit in-context learning, and their performance on downstream tasks tends to improve as model size scales.
Challenges in Training VLMs
Training high-performance VLMs that retain their language capabilities while gaining visual understanding is a complex task. The common approach is a 'shallow alignment' strategy: a frozen pretrained vision encoder is connected to an LLM through a small trainable module, such as a linear projection or a Q-Former, that maps image features into the LLM's input embedding space. Such models converge quickly, but they fall short of models in which vision and language components are trained jointly, because visual and linguistic features are only combined at the input rather than deeply integrated inside the language model. Deeply fusing these features while retaining NLP capabilities remains a key challenge in the field.
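To make the shallow-alignment idea concrete, here is a minimal sketch of such a bridge module in PyTorch. It is illustrative only: the class name, dimensions, and the use of a plain linear projection are assumptions, not the exact architecture of any particular model.

```python
import torch
import torch.nn as nn

class ShallowAlignmentConnector(nn.Module):
    """Illustrative 'shallow alignment' bridge: a small trainable projection that
    maps frozen vision-encoder patch features into the LLM's embedding space.
    The vision encoder and LLM stay frozen; only this module is trained."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen vision encoder
        # returns: (batch, num_patches, llm_dim) "visual tokens" that are prepended
        # to the text embeddings before they enter the frozen LLM.
        return self.proj(patch_features)

# Example: 257 patch tokens mapped into a 4096-d LLM embedding space.
connector = ShallowAlignmentConnector()
visual_tokens = connector(torch.randn(1, 257, 1024))
print(visual_tokens.shape)  # torch.Size([1, 257, 4096])
```

Because the fused information enters only at the input layer, every deeper layer of the LLM still processes visual tokens with weights tuned purely for text, which is the gap CogVLM targets.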
Introducing CogVLM
CogVLM addresses the deep fusion challenge by adding a 'visual expert module' inside the layers of a pretrained LLM. In each transformer layer, this module provides a separate set of QKV projection matrices and a separate feedforward (MLP) block that are applied only to image tokens, while text tokens continue to use the original pretrained weights. Because each token is processed by exactly one set of weights, the model gains rich visual-language feature integration without increasing per-token computation, and its original behavior on text-only inputs is preserved. CogVLM delivers strong results on 14 cross-modal benchmarks, outperforming or matching state-of-the-art alternatives.
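The sketch below illustrates the routing idea for the attention projections. It is a simplified, hypothetical rendering of the visual-expert concept, not CogVLM's actual implementation; the class name, dimensions, and token layout are assumptions.

```python
import torch
import torch.nn as nn

class VisualExpertQKV(nn.Module):
    """Sketch of the visual-expert idea: image-token positions are routed through
    newly added QKV projections, while text-token positions keep the original
    (pretrained) language-model projections."""

    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        # Original LLM projection (would hold frozen pretrained weights).
        self.text_qkv = nn.Linear(hidden_dim, 3 * hidden_dim)
        # Newly added, trainable projection for image tokens.
        self.image_qkv = nn.Linear(hidden_dim, 3 * hidden_dim)

    def forward(self, hidden_states: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        # image_mask:    (batch, seq_len) boolean, True where the token is an image token
        qkv_text = self.text_qkv(hidden_states)
        qkv_image = self.image_qkv(hidden_states)
        # For clarity this sketch computes both projections and selects per token.
        # An efficient implementation applies each projection only to its own token
        # positions, keeping per-token FLOPs equal to the original text-only model.
        return torch.where(image_mask.unsqueeze(-1), qkv_image, qkv_text)

# Example: a sequence of 257 image tokens followed by 32 text tokens.
layer = VisualExpertQKV()
x = torch.randn(1, 289, 4096)
mask = torch.zeros(1, 289, dtype=torch.bool)
mask[:, :257] = True
print(layer(x, mask).shape)  # torch.Size([1, 289, 12288])
```

A parallel split applies to the feedforward block: image tokens pass through an added MLP while text tokens use the original one, so a text-only prompt follows exactly the same computation path as the unmodified LLM.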
Benefits and Future Directions
CogVLM is open-source, which is significant because most preceding VLMs are proprietary, limiting both research and application development. The model is suited to research and commercial use alike, and its release is expected to contribute substantially to advances in visual understanding. Future VLM development may explore improved training alignment, reinforcement learning from human feedback (RLHF), and strategies to reduce hallucination in generated content. As the field continues to evolve, CogVLM establishes a strong foundation for multimodal AI.