CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations (2402.04236v2)

Published 6 Feb 2024 in cs.CV and cs.CL

Abstract: Vision-Language Models (VLMs) have demonstrated their broad effectiveness thanks to extensive training in aligning visual instructions to responses. However, such training of conclusive alignment leads models to ignore essential visual reasoning, further resulting in failures in meticulous visual problems and unfaithful responses. Drawing inspiration from human cognition in solving visual problems (e.g., marking, zoom in), this paper introduces Chain of Manipulations, a mechanism that enables VLMs to solve problems step-by-step with evidence. After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) with results (e.g., boxes, image) actively without involving external tools, while also allowing users to trace error causes. We study the roadmap to implement this mechanism, including (1) a flexible design of manipulations upon extensive analysis, (2) an efficient automated data generation pipeline, (3) a compatible VLM architecture capable of multi-turn multi-image, and (4) a model training process for versatile capabilities. With the design, we also manually annotate 6K high-quality samples for the challenging graphical mathematical problems. Our trained model, CogCoM, equipped with this mechanism with 17B parameters, achieves state-of-the-art performance across 9 benchmarks from 4 categories, demonstrating the effectiveness while preserving the interpretability. Our code, model weights, and collected data are publicly available at https://github.com/THUDM/CogCoM.

Introduction

In AI research, the ability to align visual data with linguistic information is crucial, particularly for Vision-Language Models (VLMs), which are used in tasks like visual question answering and image captioning. However, the conventional approach of training VLMs often yields models that skip over intricate visual reasoning or fail to attend to fine visual details. To address this issue, the paper introduces a mechanism named Chain of Manipulations (CoM), which fosters a deeper interaction between visual evidence and linguistic reasoning.

Chain of Manipulations Mechanism

The core idea behind CoM is to enable VLMs to interpret visual data through a series of operations, or "manipulations", which are either inherent abilities gained through prior training or acquired by imitating human cognitive behaviors. This mechanism guides VLMs through a step-by-step process of evidence collection and reasoning that draws on details within the visual input. For instance, a model might first locate a particular object within an image before zooming in for finer detail or extracting text.
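
To make this concrete, the sketch below shows one way a CoM reasoning trace might be represented as data, following the locate-then-zoom-then-read pattern described above. The manipulation names echo the paper's description, but the data structures, field names, and example values are illustrative rather than CogCoM's actual interface.

```python
# A minimal sketch of a Chain of Manipulations (CoM) trace, assuming each step
# records which intrinsic operation was invoked, what it was applied to, and the
# evidence it returned. Values below are illustrative, not model output.
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Step:
    manipulation: str   # which intrinsic operation the model invokes
    argument: str       # what the operation is applied to
    result: Any         # returned evidence: boxes, a new image reference, text

@dataclass
class CoMTrace:
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""

# Example trace for a detail-oriented question: locate -> zoom -> read.
trace = CoMTrace(question="What is written on the small street sign?")
trace.steps.append(Step("grounding", "the small street sign", [(412, 103, 478, 141)]))
trace.steps.append(Step("crop_and_zoom", "box_0, ratio=2", "image_1"))
trace.steps.append(Step("ocr", "image_1", "ELM ST"))
trace.answer = "ELM ST"

# The chain itself can be inspected afterwards to trace where an error arose.
for s in trace.steps:
    print(f"{s.manipulation}({s.argument}) -> {s.result}")
print("answer:", trace.answer)
```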

Data Synthesis and Model Training

To harness CoM, the researchers devised a data synthesis algorithm that uses a combination of linguistic and visual annotators, such as LLMs and visual recognition tools, to create chains of reasoning from available image-question-answer datasets. After synthesizing these chains, a traversal process is applied to extract the feasible paths that lead to the correct answers. A model named CogCoM, a general 17B VLM, was developed on a compatible memory-based architecture that supports this kind of multi-turn, multi-image learning. The training incorporated these CoM chains to bolster the model's capabilities.
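
The following sketch illustrates the traversal idea under the assumption that the synthesized reasoning chains form a tree of candidate steps, from which only root-to-leaf paths ending in the correct answer are kept as training samples. The tree layout and field names are hypothetical, not the paper's actual pipeline.

```python
# Depth-first traversal over a tree of synthesized reasoning steps, keeping only
# the paths whose terminal answer matches the gold answer. A hypothetical sketch
# of the "extract feasible paths" step, not the paper's implementation.
from typing import Dict, List, Optional

def collect_feasible_paths(node: Dict, gold_answer: str,
                           prefix: Optional[List[Dict]] = None) -> List[List[Dict]]:
    """Return every root-to-leaf path that ends in the gold answer."""
    prefix = (prefix or []) + [node]
    children = node.get("children", [])
    if not children:  # leaf: keep the path only if it reaches the correct answer
        return [prefix] if node.get("answer") == gold_answer else []
    paths: List[List[Dict]] = []
    for child in children:
        paths.extend(collect_feasible_paths(child, gold_answer, prefix))
    return paths

# Toy tree: two candidate groundings, only one of which yields the correct answer.
tree = {
    "step": "grounding('the price tag')",
    "children": [
        {"step": "crop_and_zoom(box_0)", "children": [
            {"step": "ocr(image_1)", "answer": "$4.99", "children": []}]},
        {"step": "crop_and_zoom(box_1)", "children": [
            {"step": "ocr(image_2)", "answer": "$3.50", "children": []}]},
    ],
}
feasible = collect_feasible_paths(tree, gold_answer="$4.99")
print(len(feasible), "feasible path(s) kept")  # 1
```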

Experimental Outcomes

CogCoM's training involved a mix of instruction-tuning, grounding, and detailed-captioning datasets together with CoM chains. The model was evaluated on nine benchmarks spanning four categories of capabilities and displayed state-of-the-art performance across the board. More importantly, it exhibited robustness against hallucination and maintained competitive performance even with limited training steps. The paper also introduces a testbed of meticulous visual problems with a keypoint-aware metric that assesses the correctness of reasoning paths, on which CogCoM outperformed existing models.
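
As a rough illustration of what a keypoint-aware metric could look like, the sketch below credits a generated reasoning path for covering annotated key evidence in addition to producing the correct answer. The exact formulation used in the paper is not reproduced here; the function name, weighting, and string-matching scheme are all assumptions.

```python
# A hypothetical keypoint-aware score: blend the fraction of annotated keypoints
# mentioned in the reasoning path with exact-match answer correctness.
from typing import List

def keypoint_aware_score(predicted_steps: List[str], predicted_answer: str,
                         keypoints: List[str], gold_answer: str,
                         answer_weight: float = 0.5) -> float:
    """Weighted blend of keypoint coverage and final-answer correctness."""
    text = " ".join(predicted_steps).lower()
    covered = sum(1 for kp in keypoints if kp.lower() in text)
    coverage = covered / len(keypoints) if keypoints else 0.0
    correct = 1.0 if predicted_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return answer_weight * correct + (1 - answer_weight) * coverage

score = keypoint_aware_score(
    predicted_steps=["locate the price tag", "zoom into the tag", "read the text"],
    predicted_answer="$4.99",
    keypoints=["price tag", "zoom"],
    gold_answer="$4.99",
)
print(f"score = {score:.2f}")  # 1.00 when all keypoints are covered and the answer is correct
```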

Concluding Thoughts

This paper marks a significant step forward in enhancing VLMs' ability to perform faithful visual reasoning. The CoM mechanism shows promising potential for guiding VLMs through detailed and logical visual processing, akin to human cognition. While the method would benefit from greater diversity in its linguistic reasoning steps and from more accurate visual tools, it nonetheless offers a compelling approach to visual data interpretation and reasoning in AI models.

Authors (11)
  1. Ji Qi (61 papers)
  2. Ming Ding (219 papers)
  3. Weihan Wang (20 papers)
  4. Yushi Bai (31 papers)
  5. Qingsong Lv (10 papers)
  6. Wenyi Hong (14 papers)
  7. Bin Xu (192 papers)
  8. Lei Hou (127 papers)
  9. Juanzi Li (144 papers)
  10. Yuxiao Dong (119 papers)
  11. Jie Tang (302 papers)
Citations (15)

GitHub

  1. GitHub - THUDM/CogCoM (155 stars)