
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding (2411.18363v2)

Published 27 Nov 2024 in cs.CV

Abstract: Perception and understanding are two pillars of computer vision. While multimodal LLMs (MLLM) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities, e.g. the state-of-the-art model Qwen2-VL only achieves a 43.9 recall rate on the COCO dataset, limiting many tasks requiring the combination of perception and understanding. In this work, we aim to bridge this perception gap from both model designing and data development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM, allowing it to output the corresponding box indices to represent its detection results, turning the regression task into a retrieval-based task that LLM handles more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset which possesses multiple granularities to support the joint training of perception and understanding. After standard two-stage training, ChatRex demonstrates strong perception capabilities while preserving multimodal understanding performance. The combination of these two capabilities simultaneously unlocks many attractive applications, demonstrating the complementary roles of both perception and understanding in MLLM. Code is available at \url{https://github.com/IDEA-Research/ChatRex}.

Insights into ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

The paper introduces ChatRex, a multimodal LLM (MLLM) designed to strengthen both perception and understanding in multimodal AI systems. ChatRex targets the gap between these two capabilities in existing MLLMs, which often excel at visual understanding tasks yet underperform on perception tasks such as object detection. The proposed solution combines architectural and data-centric innovations to close this gap.

Model Architecture and Design Innovations

ChatRex diverges from traditional MLLMs by adopting a decoupled perception design. In contrast to existing models that have the LLM directly predict bounding box coordinates, ChatRex employs a Universal Proposal Network (UPN) to feed candidate object boxes into the LLM, which then outputs the indices of the relevant boxes. This design recasts the regression task as a retrieval task, a formulation that LLMs handle more proficiently. Each proposal is represented by RoI-aligned features enriched with positional embeddings before being passed to the LLM, sidestepping the error propagation and quantization issues that hamper token-based coordinate prediction.
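To make the retrieval formulation concrete, the sketch below scores each proposal embedding against the LLM's hidden state and selects a box index. This is an illustrative simplification under assumed tensor shapes: the paper has the LLM emit box indices directly, whereas here the selection is shown as dot-product scoring; the module and dimension names are hypothetical.

```python
import torch
import torch.nn as nn

# Minimal sketch: instead of regressing box coordinates, score each
# proposal embedding against the LLM's hidden state and retrieve an index.
# Shapes and module names are illustrative, not ChatRex's actual code.

class BoxRetrievalHead(nn.Module):
    def __init__(self, hidden_dim: int, box_dim: int):
        super().__init__()
        self.box_proj = nn.Linear(box_dim, hidden_dim)   # project RoI + position features

    def forward(self, llm_hidden: torch.Tensor, box_feats: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (B, hidden_dim) state where a box index is to be emitted
        # box_feats:  (B, N, box_dim) features for N proposals from the proposal network
        keys = self.box_proj(box_feats)                  # (B, N, hidden_dim)
        logits = torch.einsum("bd,bnd->bn", llm_hidden, keys)
        return logits                                    # argmax over N = retrieved box index

head = BoxRetrievalHead(hidden_dim=4096, box_dim=256)
hidden = torch.randn(1, 4096)
boxes = torch.randn(1, 100, 256)                         # 100 candidate proposals
print(head(hidden, boxes).argmax(dim=-1))                # index of the selected proposal
```

Framed this way, detection becomes a classification over a finite candidate set rather than continuous coordinate regression, which is the intuition behind the paper's claim that retrieval suits LLMs better.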

Moreover, ChatRex's dual-encoder setup, with two vision encoders operating at different resolutions, lets the model combine high-resolution visual detail with a coarser global view. This integration yields a richer, more detailed feature map and significantly sharper perceptual acuity. The visual tokens are fused through a gated convolution, which trims redundancy and computational overhead while providing a compact yet comprehensive input for multimodal tasks.
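A minimal sketch of gated token fusion follows. The channel sizes and the exact gating form are assumptions for illustration, not ChatRex's actual architecture; the point is only the general mechanism of blending two resolution streams with a learned per-location gate.

```python
import torch
import torch.nn as nn

# Illustrative sketch of gated fusion between low- and high-resolution
# vision tokens. Channel sizes and the gating form are assumptions.

class GatedTokenFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, low_res: torch.Tensor, high_res: torch.Tensor) -> torch.Tensor:
        # low_res, high_res: (B, C, H, W) feature maps; high_res is assumed
        # to be resampled to the low-res grid before fusion.
        g = self.gate(torch.cat([low_res, high_res], dim=1))  # per-location gate in [0, 1]
        fused = g * high_res + (1 - g) * low_res              # keep detail only where useful
        return fused.flatten(2).transpose(1, 2)               # (B, H*W, C) visual tokens

fuse = GatedTokenFusion(channels=1024)
tokens = fuse(torch.randn(1, 1024, 24, 24), torch.randn(1, 1024, 24, 24))
print(tokens.shape)  # torch.Size([1, 576, 1024])
```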

Data Development and Training Methodology

A significant facet of ChatRex lies in its data innovation, encapsulated in the Rexverse-2M dataset. Constructed by a fully automated data engine, the dataset comprises image-region-text annotation triplets at multiple granularities, tailored for joint training on perception and understanding tasks. The pipeline grounds noun phrases extracted from image-level captions to object boxes, then generates region-specific captions for those boxes. This multi-granularity synthesis supplies supervision for both detection precision and contextual understanding.
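The schematic below traces that pipeline end to end. All four model calls are stubbed placeholders standing in for the captioning, phrase-extraction, and grounding models such an engine would chain; the names and dummy outputs are hypothetical.

```python
from dataclasses import dataclass

# Schematic of the automated annotation pipeline described above. The four
# model calls are stubs; in a real engine each would be a captioning,
# parsing, or grounding model. All names here are hypothetical.

def caption_image(image): return "a dog chasing a ball on grass"
def extract_noun_phrases(caption): return ["a dog", "a ball"]
def ground_phrase(image, phrase): return [(10, 20, 110, 180)]      # dummy box
def caption_region(image, box): return f"close-up of the region at {box}"

@dataclass
class RegionAnnotation:
    box: tuple    # (x1, y1, x2, y2)
    phrase: str   # noun phrase the box was grounded from
    caption: str  # region-level description

def annotate(image) -> dict:
    image_caption = caption_image(image)                 # image-level granularity
    regions = []
    for phrase in extract_noun_phrases(image_caption):   # condition grounding on the caption
        for box in ground_phrase(image, phrase):         # phrase -> candidate boxes
            regions.append(RegionAnnotation(box, phrase, caption_region(image, box)))
    return {"caption": image_caption, "regions": regions}

print(annotate(image=None))
```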

The two-stage training regimen, beginning with alignment training and followed by visual instruction tuning, balances the optimization of perception and understanding within ChatRex. In addition, the dataset's fine-grained annotations help reduce hallucinations, a frequent failure mode in high-level visual understanding, and improve interaction by anchoring dialogue responses to perceptible objects.
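The configuration sketch below captures the common shape of such a two-stage schedule. Which modules are frozen at each ChatRex stage is an assumption here; only the overall pattern, aligning the projection layers first and instruction-tuning afterwards, reflects the regimen the paper describes at a high level.

```python
# Two-stage MLLM schedule as plain config dicts. Module names and the
# frozen/trainable split per stage are assumptions, not ChatRex's exact setup.

STAGES = [
    {
        "name": "alignment",
        "data": "image/region-caption pairs (e.g. Rexverse-2M granularities)",
        "trainable": ["vision_projector", "box_token_projector"],
        "frozen": ["vision_encoders", "llm"],
    },
    {
        "name": "visual_instruction_tuning",
        "data": "perception + understanding instruction data",
        "trainable": ["vision_projector", "box_token_projector", "llm"],
        "frozen": ["vision_encoders"],
    },
]

def set_trainable(model, stage: dict) -> None:
    # Freeze everything, then re-enable only the stage's trainable modules.
    for p in model.parameters():
        p.requires_grad = False
    for name in stage["trainable"]:
        for p in getattr(model, name).parameters():
            p.requires_grad = True
```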

Evaluation and Implications

Evaluation on common and long-tailed object detection benchmarks such as COCO and LVIS demonstrates ChatRex's effectiveness at addressing traditional MLLM deficiencies: the model attains markedly higher recall and precision, validating the merit of its retrieval-based design. Its performance on referring expression comprehension tasks further reinforces the value of coupling perception with understanding.
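For reference, the sketch below shows the kind of box-recall computation behind such comparisons: a ground-truth box counts as recalled if some prediction matches it at or above an IoU threshold. This is a simplified greedy matcher, not the official COCO evaluator (which scores per category across multiple IoU thresholds), and the 0.5 threshold is an assumption.

```python
# Simplified box recall: a ground-truth box is recalled if an unused
# prediction overlaps it with IoU >= thresh. Greedy one-to-one matching.

def iou(a, b) -> float:
    # Boxes as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def recall(gt_boxes, pred_boxes, thresh: float = 0.5) -> float:
    matched, used = 0, set()
    for g in gt_boxes:
        best_iou, best_j = 0.0, -1
        for j, p in enumerate(pred_boxes):
            if j in used:
                continue
            v = iou(g, p)
            if v > best_iou:
                best_iou, best_j = v, j
        if best_j >= 0 and best_iou >= thresh:
            matched += 1
            used.add(best_j)
    return matched / max(len(gt_boxes), 1)

print(recall([(0, 0, 10, 10)], [(1, 1, 10, 10)]))  # high overlap -> 1.0
```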

The paper refrains from exaggerated claims, offering a measured analysis of the model's performance. At the same time, it emphasizes that integrating perception and understanding unlocks advanced applications in domains such as autonomous vehicles, robotic vision, and interactive AI systems. Through these joint capabilities, ChatRex points toward more reliable and contextually aware multimodal systems.

Future Prospects

The advancements presented by ChatRex underscore a critical evolution towards more holistic AI models capable of nuanced perception and detailed understanding. Continued research might aim to refine the balance of these faculties further, perhaps by introducing more sophisticated fusion techniques or exploring additional modalities. Furthermore, expanding the dataset to cover more diverse scenarios and edge cases could further solidify the model's robustness and adaptability, broadening its real-world applicability.

In summary, ChatRex stands as a promising development in the trajectory of MLLMs, tackling intrinsic deficiencies through a well-conceived amalgamation of architectural restructuring and data innovation. As AI continues to evolve, the principles demonstrated herein may well set a precedent for future investigations into the symbiosis of perception and understanding in artificial intelligence.

Authors (8)
  1. Qing Jiang (30 papers)
  2. Gen Luo (32 papers)
  3. Yuqin Yang (5 papers)
  4. Yuda Xiong (4 papers)
  5. Yihao Chen (40 papers)
  6. Zhaoyang Zeng (29 papers)
  7. Tianhe Ren (25 papers)
  8. Lei Zhang (1689 papers)