DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
The paper "DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception," authored by Xiaotong Li et al., introduces a novel approach to enhance the perceptual capabilities of Multimodal LLMs (MLLMs) through hyper-detailed image descriptions. This approach is predicated on integrating various specialized vision experts to construct a high-quality image-text dataset named DenseFusion-1M. The work addresses the significant challenge of acquiring detailed and comprehensive multimodal datasets, which are essential for training MLLMs to accurately perceive and interpret diverse visual information.
Introduction and Motivation
Existing MLLMs have made considerable progress in multimodal understanding and reasoning by combining large vision models with LLMs. Their performance, however, is constrained by the limited availability of high-quality, detailed image-text datasets: conventional caption engines do not produce the detailed, accurate annotations required for comprehensive visual perception. The paper proposes Perceptual Fusion, a low-budget yet effective caption engine that integrates multiple vision experts to generate dense and accurate image descriptions.
Methodology
The methodology involves two stages: data pre-processing and perceptual fusion. The authors selected 1 million highly representative, high-resolution images from the LAION dataset; paired with the generated hyper-detailed descriptions, these images form DenseFusion-1M. They then devised a Perceptual Fusion strategy that combines the outputs of multiple vision experts, covering image tagging, object detection, text recognition, and world knowledge.
Data Pre-Processing
The data pre-processing phase involves:
- High-Resolution Image Selection: Filtering images with a minimum short-edge resolution of 448 pixels to ensure rich visual content.
- Semantic Clustering and De-duplication: Using k-means clustering on image features extracted via EVA-CLIP and removing semantic duplicates within clusters to maintain diverse and high-quality data.
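A minimal sketch of these two steps is given below, assuming EVA-CLIP features have already been extracted for each candidate image. The cluster count, similarity threshold, and helper names are illustrative assumptions; only the 448-pixel short-edge rule comes from the paper.

```python
# Illustrative sketch of the pre-processing stage: high-resolution filtering,
# k-means clustering on (precomputed) EVA-CLIP features, and within-cluster
# de-duplication. Hyperparameters other than the 448-pixel rule are made up.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def keep_high_resolution(paths, min_short_edge=448):
    """Keep images whose shorter side is at least `min_short_edge` pixels."""
    kept = []
    for p in paths:
        with Image.open(p) as img:
            w, h = img.size
        if min(w, h) >= min_short_edge:
            kept.append(p)
    return kept

def select_diverse(features, paths, n_clusters=1000, sim_threshold=0.95):
    """Cluster L2-normalized features and drop near-duplicates inside each cluster."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    selected = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        kept_feats = []
        for i in idx:
            # Keep an image only if it is not too similar to anything already kept.
            if all(feats[i] @ k < sim_threshold for k in kept_feats):
                kept_feats.append(feats[i])
                selected.append(paths[i])
    return selected
```

At the scale of LAION the clustering would likely be run with an approximate library such as FAISS, but the sketch captures the filter-cluster-deduplicate logic described above.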
Perceptual Fusion
The Perceptual Fusion pipeline integrates multiple vision experts:
- Image Tagging: Utilizing RAM++ for scene-level understanding.
- Object Detection: Employing EVA-02 for closed-set detection and OWL-ViTv2 for open-set detection, covering a wide range of object categories.
- Text Recognition: Leveraging OCR models to capture textual information within images.
- World Knowledge: Incorporating context and background information from LAION's short captions.
These expert outputs are supplied as supplementary information to the caption engine. GPT-4V, prompted with this information, first produces detailed captions for a seed subset of images; those captions are then used to train an efficient caption engine based on LLaVA-1.6, which generates descriptions for the full set of 1 million images.
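The sketch below illustrates how these expert outputs could be merged into a single instruction for the caption engine. The data structure, field names, and prompt wording are illustrative assumptions, not the paper's exact template.

```python
# Illustrative assembly of vision-expert outputs into one prompt for the
# caption engine. Fields and wording are assumptions, not the paper's format.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ExpertOutputs:
    tags: List[str]                             # image tagging (e.g. RAM++)
    detections: List[Tuple[str, List[float]]]   # (label, [x1, y1, x2, y2]) boxes
    ocr_text: List[str]                         # text snippets read from the image
    short_caption: str                          # original LAION caption (world knowledge)

def build_fusion_prompt(e: ExpertOutputs) -> str:
    """Merge the expert hints into one instruction for the caption engine."""
    det_lines = "\n".join(f"- {label}: {box}" for label, box in e.detections)
    return (
        "Describe the image in as much detail as possible, covering objects, "
        "attributes, spatial relations, visible text, and background knowledge.\n"
        f"Scene tags: {', '.join(e.tags)}\n"
        f"Detected objects (with boxes):\n{det_lines}\n"
        f"Text found in the image: {' | '.join(e.ocr_text)}\n"
        f"Original caption: {e.short_caption}"
    )

# Example with made-up expert outputs:
prompt = build_fusion_prompt(ExpertOutputs(
    tags=["street", "bicycle", "storefront"],
    detections=[("bicycle", [120.0, 200.0, 380.0, 560.0])],
    ocr_text=["OPEN 24 HOURS"],
    short_caption="a bike parked outside a shop",
))
print(prompt)
```

The same kind of merged prompt, paired with the image itself, would condition both GPT-4V when it writes the seed captions and the LLaVA-1.6-based engine when it describes the remaining images.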
Dataset Description
The DenseFusion-1M dataset comprises 1 million hyper-detailed image-text pairs. The descriptions provide comprehensive annotations covering object attributes, spatial relations, text within the image, and world knowledge; each averages about 190 words and 11 sentences, substantially enriching the data available for multimodal training.
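As a quick illustration of how such statistics are computed, the snippet below averages word and sentence counts over a list of caption strings; the toy caption and the rough sentence splitter are placeholders, not part of the dataset or its tooling.

```python
# Toy computation of average caption length; the sample caption is made up
# and the sentence splitter is only a rough heuristic.
import re

def caption_stats(captions):
    """Average word count and (roughly split) sentence count per caption."""
    words = [len(c.split()) for c in captions]
    sents = [len([s for s in re.split(r"[.!?]+\s+", c) if s.strip()]) for c in captions]
    return sum(words) / len(words), sum(sents) / len(sents)

avg_words, avg_sents = caption_stats([
    "A red bicycle leans against a brick storefront. The sign above reads OPEN 24 HOURS.",
])
print(f"avg words: {avg_words:.1f}, avg sentences: {avg_sents:.1f}")
```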
Experiments and Results
The authors conducted extensive experiments to validate the effectiveness of DenseFusion-1M on vision-language benchmarks including VQAv2, GQA, and TextVQA. Models trained with DenseFusion-1M outperformed comparable state-of-the-art MLLMs, particularly on tasks requiring fine-grained visual perception. Training at higher input resolutions further amplified these benefits, yielding substantial gains in text recognition and detailed image perception.
Implications and Future Work
The DenseFusion-1M dataset sets a new standard for multimodal datasets, enabling MLLMs to achieve better vision-language alignment through detailed and accurate annotations. The implications of this work are significant for areas requiring meticulous visual understanding, such as autonomous driving, medical imaging, and advanced human-computer interaction.
Future work could explore the integration of additional vision experts and the application of DenseFusion-1M in broader contexts. The potential for enhancing conditional image generation tasks also warrants investigation, as demonstrated by the initial qualitative results in the paper.
Conclusion
The DenseFusion-1M dataset represents a substantial contribution to the field of multimodal perception. By integrating diverse vision experts and generating hyper-detailed image descriptions, this work provides a robust foundation for training advanced MLLMs. The detailed methodological approach and the significant improvements demonstrated across multiple benchmarks highlight the value and potential of this innovative dataset.