Efficient Multimodal Learning from Data-centric Perspective
The paper "Efficient Multimodal Learning from Data-centric Perspective" addresses the limitations associated with Multimodal LLMs (MLLMs), particularly focusing on overcoming the computational costs associated with training and deploying such systems. Multimodal models have demonstrated superior capabilities in visual understanding and reasoning, yet their widespread use is inhibited due to resource constraints. This work introduces Bunny, a family of lightweight MLLMs that strive to maintain performance while reducing computational demands through a data-centric approach.
Overview of Bunny
Bunny challenges the conventional reliance on scale, under which larger models are expected to perform better simply because of their greater capacity. Instead, Bunny achieves competitive performance by training on more informative, condensed data rather than by enlarging the model. The framework uses a modular architecture with interchangeable vision and language backbones, supporting the integration of various lightweight pre-trained models. Its key components include:
- Vision Encoders: Options such as SigLIP and EVA-CLIP, known for their efficiency in language-image alignment.
- Language Backbones: Options include Phi-1.5, StableLM-2, and Phi-2, which represent state-of-the-art lightweight LLMs.
- Cross-modality Projector: A mechanism that maps visual features into the language model's embedding space so the two modalities can be learned jointly; see the sketch after this list.
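To make the projector concrete, the following is a minimal sketch of a LLaVA-style two-layer MLP that maps vision-encoder patch features into the language model's token-embedding space. The class name, layer widths, and GELU activation are illustrative assumptions, not Bunny's exact configuration.

```python
import torch
import torch.nn as nn

class CrossModalityProjector(nn.Module):
    """Projects vision-encoder patch features into the language model's embedding space.

    Hypothetical dimensions: a SigLIP-style encoder emitting 1152-d patch features
    and a Phi-2-scale language model using 2560-d token embeddings.
    """
    def __init__(self, vision_dim: int = 1152, text_dim: int = 2560):
        super().__init__()
        # Two-layer MLP with GELU, a common design in lightweight MLLMs.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        # Returns visual "tokens" shaped like language embeddings,
        # ready to be concatenated with text token embeddings.
        return self.proj(image_features)

# Usage: project dummy SigLIP-like features for a batch of 2 images.
projector = CrossModalityProjector()
visual_tokens = projector(torch.randn(2, 729, 1152))
print(visual_tokens.shape)  # torch.Size([2, 729, 2560])
```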
Data-centric Approach
A pivotal aspect of Bunny is its focus on data rather than on model architecture alone. Through a careful dataset condensation process, Bunny selects high-quality, informative subsets from large data pools such as LAION-2B. The process clusters image and text embeddings and then selects samples according to their distances from the cluster centroids, yielding the Bunny-pretrain-LAION-2M dataset. The result is a compact pre-training corpus that preserves the coverage of the original pool while being far cheaper to train on.
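As a rough illustration of the selection step, the hypothetical sketch below clusters precomputed embeddings with k-means and keeps the samples nearest each centroid. The function name, cluster count, and per-cluster budget are assumptions for demonstration; the actual Bunny pipeline operates at LAION-2B scale and its exact selection criteria may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def condense_by_clustering(embeddings: np.ndarray, n_clusters: int = 100,
                           keep_per_cluster: int = 50, seed: int = 0) -> np.ndarray:
    """Return indices of a condensed subset of `embeddings`.

    Simplified centroid-distance selection: cluster the (image or text)
    embeddings with k-means, then keep the samples closest to each centroid
    as the most representative ones.
    """
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    labels = kmeans.fit_predict(embeddings)
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        # Distance of every cluster member to its centroid.
        dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
        # Keep the samples nearest the centroid (most prototypical for the cluster).
        selected.extend(members[np.argsort(dists)[:keep_per_cluster]])
    return np.array(sorted(selected))

# Usage: condense 10,000 synthetic 512-d embeddings down to roughly 5,000 samples.
rng = np.random.default_rng(0)
subset = condense_by_clustering(rng.standard_normal((10_000, 512)).astype(np.float32))
print(len(subset))
```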
Experimental Results
Empirical evaluation across multiple benchmarks shows that Bunny models perform strongly relative to both similarly sized competitors and much larger MLLMs. In particular, Bunny-3B, which pairs SigLIP with Phi-2, outperforms models such as LLaVA-v1.5-13B on benchmarks including MMBench and SEED-Bench despite having a fraction of the parameters. These results indicate that the data-centric approach can compensate for reduced model size, demonstrating strong data efficiency.
Implications and Future Directions
The implications of this research are profound for both practical deployment and theoretical understanding of AI models:
- Practical Deployment: The reduction in computational demands facilitates broader accessibility and implementation of robust multimodal models in various applications and on diverse platforms, including resource-limited environments.
- Theoretical Exploration: This work opens new avenues in foundation model development by prioritizing data quality and selection over sheer scale. The findings could lead to more refined paradigms in model optimization and data utilization.
Future research could extend Bunny's modular framework with additional backbone options and refine the dataset condensation techniques to further improve robustness. Such advances would broaden the practical applications of MLLMs and reduce their dependence on large-scale computational infrastructure.
Conclusion
Ultimately, Bunny illustrates the potential of efficient, data-centric approaches to multimodal learning. By focusing on the quality and strategic selection of training data, it represents a significant step toward scalable, high-performance AI systems that challenge the traditional reliance on model size and computational power in favor of careful data efficiency.