CogVLM: Visual Expert for Pretrained Language Models (2311.03079v2)

Published 6 Nov 2023 in cs.CV

Abstract: We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of LLM, CogVLM bridges the gap between the frozen pretrained LLM and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM.

Citations (345)

Summary

  • The paper presents a novel visual expert module that deeply fuses image and text features within a pretrained language model.
  • It demonstrates superior performance on 14 cross-modal benchmark tests, outperforming or matching state-of-the-art models.
  • CogVLM offers an open-source, efficient solution that preserves language capabilities while integrating rich visual information.

Introduction to Visual Language Models (VLMs)

Visual language models (VLMs) have emerged as versatile tools capable of understanding and generating content across both visual and textual domains. They can tackle tasks such as image captioning, visual question answering (VQA), and visual grounding. VLMs also exhibit in-context learning, and their performance on downstream tasks improves as model size scales.

Challenges in Training VLMs

Training high-performance VLMs that retain their language capabilities while incorporating visual understanding is a complex task. The common approach is a 'shallow alignment' strategy: a pretrained vision encoder is connected to a frozen LLM through a small trainable mapping module that projects image features into the LLM's input space. Such models converge quickly, but they do not reach the performance of models in which vision and language components are trained jointly, because the visual and language features are never deeply integrated. Deeply fusing these features while retaining NLP capabilities remains a key challenge in the field.
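
For concreteness, here is a minimal sketch of the shallow-alignment pattern described above, assuming a PyTorch setting; the module, names, and dimensions are illustrative assumptions rather than the recipe of any specific system. Only the small projection is trained, while the vision encoder and the LLM stay frozen.

```python
import torch
import torch.nn as nn

class ShallowAlignmentAdapter(nn.Module):
    """Illustrative sketch of 'shallow alignment': only this projection is
    trained; the vision encoder and the language model remain frozen.
    Dimensions are assumptions chosen for clarity, not from any paper."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Small trainable MLP mapping image patch features into the
        # LLM's token-embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: [batch, num_patches, vision_dim]
        # returns pseudo image tokens: [batch, num_patches, llm_dim]
        return self.proj(image_features)

# Usage sketch: projected image tokens are prepended to the text token
# embeddings before the sequence enters the frozen language model.
adapter = ShallowAlignmentAdapter()
image_features = torch.randn(1, 256, 1024)  # from a frozen vision encoder
text_embeddings = torch.randn(1, 32, 4096)  # from the frozen LLM's embedding table
llm_inputs = torch.cat([adapter(image_features), text_embeddings], dim=1)
```

The point of the sketch is that all vision-language interaction happens at the input interface; the LLM's internal attention and FFN weights never see image-specific parameters, which is the gap CogVLM targets.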

Introducing CogVLM

CogVLM addresses the deep-fusion challenge by adding a 'visual expert module' inside the layers of a pretrained LLM. In each attention and FFN layer, the visual expert contributes separate transformation matrices and a dedicated feed-forward network for image features, enabling rich visual-language feature integration without increasing the computational load. Because text tokens continue to flow through the original weights, the LLM's behavior on text-only inputs is fully retained. CogVLM achieves strong results on 14 cross-modal benchmarks, outperforming or matching state-of-the-art alternatives.
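
As a hedged illustration of this idea (not the authors' implementation; the repository linked above holds the real code), the sketch below routes image-token positions through their own QKV and FFN weights inside a single transformer layer, while text tokens keep the original language-model weights, so text-only behavior is unchanged. Class names, shapes, and the mask-based routing are illustrative assumptions; layer norms and causal masking are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExpertLayer(nn.Module):
    """Simplified sketch of a transformer layer with a visual expert.
    Image tokens use their own trainable QKV and FFN weights; text tokens
    use the original (frozen) language-model weights."""

    def __init__(self, dim: int = 4096, n_heads: int = 32):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        # Original language-model weights (frozen during multimodal training).
        self.qkv_text = nn.Linear(dim, 3 * dim)
        self.ffn_text = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Visual expert: separate QKV and FFN weights for image positions.
        self.qkv_image = nn.Linear(dim, 3 * dim)
        self.ffn_image = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, dim]; image_mask: [batch, seq_len], True at image tokens.
        B, T, D = x.shape
        mask = image_mask.unsqueeze(-1)
        # Each position goes through the QKV projection matching its modality.
        # (A real implementation would index rather than compute both branches.)
        qkv = torch.where(mask, self.qkv_image(x), self.qkv_text(x))
        q, k, v = qkv.chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        # Ordinary attention over the mixed image+text sequence (deep fusion).
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + self.out_proj(attn.transpose(1, 2).reshape(B, T, D))
        # The FFN is routed the same way: expert MLP for image tokens only.
        x = x + torch.where(mask, self.ffn_image(x), self.ffn_text(x))
        return x

# Usage sketch: the first 256 positions are image tokens, the rest are text.
layer = VisualExpertLayer()
x = torch.randn(1, 288, 4096)
image_mask = torch.zeros(1, 288, dtype=torch.bool)
image_mask[:, :256] = True
out = layer(x, image_mask)  # an all-False mask would exercise only the original weights
```

The design choice this illustrates is that attention itself mixes image and text tokens in every layer (deep fusion), while the parameter routing leaves the pretrained text path untouched.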

Benefits and Future Directions

CogVLM is open-source, which is significant because most preceding high-performing VLMs are closed-source, limiting research and application development. The model is suitable for both research and commercial use, and its release is expected to contribute substantially to advances in visual understanding. Future VLM development may explore directions such as improved training alignment, reinforcement learning from human feedback (RLHF), and strategies to reduce hallucination in generated content. As the field continues to evolve, CogVLM provides a strong foundation for further multimodal AI progress.
