The Synergy between Data and Multi-Modal LLMs: A Survey from Co-Development Perspective
This survey systematically examines the interplay between multi-modal data and Multi-Modal LLMs (MLLMs). It posits that the development trajectories of data and MLLMs are intertwined rather than parallel, arguing for a co-development approach. The paper reviews recent work illustrating how data-centric approaches can enhance MLLM capabilities and how MLLMs can reciprocally enrich data curation. The survey is detailed and well structured, covering both theoretical insights and practical implications, and reflects the authors' extensive research in the field.
Data Contributions for Scaling MLLMs
The initial part of the paper addresses the scalability of MLLMs, focusing on data acquisition, augmentation, and diversity. As MLLMs scale, so does their demand for training data. The survey covers various data acquisition strategies, including web scraping, merging existing datasets, manual curation, and employing well-trained models such as GPT-4V for automatic data generation. These approaches serve different training stages of MLLMs, such as pretraining the encoders and projectors and fine-tuning for specific tasks.
Data augmentation techniques, particularly those leveraging LLMs and MLLMs, are highlighted for their ability to enhance dataset diversity and balance. For example, using LLMs to rewrite text descriptions can significantly improve text diversity while maintaining semantic integrity. Additionally, specific data-centric methods for imbalanced datasets, such as generating negative samples to balance classes, are discussed in detail.
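To make the rewriting idea concrete, below is a minimal sketch of how such an augmentation step could be organized. The `rewrite_caption` helper is a hypothetical stand-in for an actual LLM call, and the near-duplicate filter is a simple illustrative choice rather than a method prescribed by the survey.

```python
# Minimal sketch of LLM-based caption rewriting for data augmentation.
# `rewrite_caption` is a hypothetical placeholder for a real LLM call;
# the similarity filter is one simple way to drop near-duplicate rewrites.
from difflib import SequenceMatcher


def rewrite_caption(caption: str, n_variants: int = 3) -> list[str]:
    """Placeholder for an LLM paraphrasing call (hypothetical)."""
    # In practice this would prompt an LLM, e.g.
    # "Rewrite the following image caption in a different style,
    #  keeping its meaning: ..."
    return [f"{caption} (variant {i})" for i in range(n_variants)]


def augment_captions(captions: list[str], max_similarity: float = 0.9) -> list[str]:
    """Expand a caption set with paraphrases, skipping near-duplicates."""
    augmented = list(captions)
    for caption in captions:
        for variant in rewrite_caption(caption):
            # Keep a rewrite only if it differs enough from the original,
            # so the augmented set actually gains textual diversity.
            if SequenceMatcher(None, caption, variant).ratio() < max_similarity:
                augmented.append(variant)
    return augmented


if __name__ == "__main__":
    print(augment_captions(["a dog playing in the snow"]))
```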
The survey also explores strategies for improving the scaling effectiveness of datasets. These include data condensation methods (data deduplication, filtering of low-quality samples, and coreset construction) that reduce redundancy and increase information density. Effective data mixture approaches are outlined to mitigate or exploit distribution biases, optimizing data proportions at both the dataset and batch levels. The importance of data packing techniques for long-context support and better pretraining convergence is also emphasized. Finally, cross-modal alignment techniques, predominantly CLIP score-based filtering and text-centric anchoring, are discussed as ways to ensure correct matching between modalities in a dataset.
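As an illustration of the CLIP score-based alignment checks mentioned above, the sketch below scores image-caption pairs with an off-the-shelf CLIP model and keeps only pairs above a threshold. The model name and threshold are illustrative choices, not values fixed by the survey.

```python
# Minimal sketch of CLIP-score filtering for image-text alignment.
# Model name and threshold are illustrative; the survey does not fix them.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())


def filter_pairs(pairs, threshold: float = 0.25):
    """Keep only (image, caption) pairs whose CLIP score clears the threshold."""
    return [(img, cap) for img, cap in pairs if clip_score(img, cap) >= threshold]
```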
Data Contributions for Usability Enhancement
Beyond scaling, the usability of MLLMs is crucial for practical applications. The paper categorizes enhancement techniques for instruction responsiveness, reasoning abilities, ethical considerations, and evaluation benchmarks. Instruction responsiveness can be improved through prompt design, high-quality in-context learning (ICL) data, and human-behavior alignment datasets. These methods guide MLLMs to better understand and follow human instructions.
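As a small illustration of how multi-modal ICL data is typically assembled, the sketch below interleaves a few worked image-caption demonstrations before the query. The `<image:...>` placeholder convention and the instruction wording are illustrative assumptions, not a format taken from the survey.

```python
# Minimal sketch of assembling a multi-modal in-context learning (ICL) prompt:
# a few worked image-caption examples are interleaved before the query image.
# The <image:...> placeholder convention and instruction text are illustrative.

def build_icl_prompt(examples: list[tuple[str, str]], query_instruction: str) -> str:
    """Interleave (image placeholder, answer) demonstrations before the query."""
    parts = ["Follow the examples, then answer for the last image."]
    for image_path, answer in examples:
        parts.append(f"<image:{image_path}>\nAnswer: {answer}")
    parts.append(f"<image:query>\n{query_instruction}\nAnswer:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [("cat.jpg", "A cat sleeping on a sofa."),
             ("dog.jpg", "A dog fetching a ball in a park.")]
    print(build_icl_prompt(demos, "Describe the image in one sentence."))
```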
The survey identifies data-centric approaches to fortify MLLMs' reasoning abilities, covering single-hop and multi-hop reasoning with a particular focus on Chain-of-Thought (CoT) techniques. Ethical considerations, such as data toxicity and privacy, are extensively discussed. The paper reviews data-centric attack and defense strategies against toxic content and outlines privacy-preserving techniques like differentially private training and federated learning.
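To ground the privacy-preserving techniques mentioned above, here is a minimal sketch of the DP-SGD idea: per-example gradient clipping followed by calibrated Gaussian noise. The hyperparameters are illustrative, and production systems would typically use a dedicated library such as Opacus together with explicit privacy-budget accounting.

```python
# Minimal sketch of differentially private training (DP-SGD):
# per-example gradient clipping followed by calibrated Gaussian noise.
# Hyperparameters are illustrative only.
import torch
from torch import nn


def dp_sgd_step(model: nn.Module, loss_fn, batch_x, batch_y,
                lr: float = 0.1, clip_norm: float = 1.0, noise_mult: float = 1.1):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # 1) Compute and clip each example's gradient separately.
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        scale = torch.clamp(clip_norm / (total_norm + 1e-6), max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale

    # 2) Add Gaussian noise calibrated to the clipping norm, then average and step.
    batch_size = len(batch_x)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_mult * clip_norm
            p -= lr * (s + noise) / batch_size
```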
Comprehensive benchmarks are crucial for evaluating MLLMs' performance across various dimensions—understanding, generation, retrieval, and reasoning. The survey lists numerous benchmarks and provides insights into their contributions to assessing and improving MLLMs systematically.
Model Contributions for Multi-Modal Data
In addition to how data enhances MLLMs, the paper explores the reverse direction: how models can contribute to data. This includes roles such as data creator, mapper, filter, and evaluator. MLLMs can generate data, refine existing data through transformations such as summarization and annotation, filter data based on quality assessments, and evaluate data to provide feedback on quality and ethics.
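As a sketch of the filter and evaluator roles, the snippet below scores candidate samples with a judge model and keeps those above a threshold. The `query_mllm_judge` function is a hypothetical placeholder for whatever MLLM is actually used, and the rubric prompt and score threshold are illustrative.

```python
# Minimal sketch of using an MLLM as a data filter / evaluator.
# `query_mllm_judge` is a hypothetical placeholder for a real model call;
# the rubric prompt and score threshold are illustrative choices.
import re

RUBRIC = (
    "Rate the following image caption for factual accuracy and detail "
    "on a scale from 1 to 10. Reply with the number only.\nCaption: {caption}"
)


def query_mllm_judge(image_path: str, prompt: str) -> str:
    """Placeholder for an MLLM judge call (hypothetical)."""
    return "7"  # a real judge would return its rating as text


def judge_score(image_path: str, caption: str) -> float:
    """Parse a 1-10 quality score out of the judge model's reply."""
    reply = query_mllm_judge(image_path, RUBRIC.format(caption=caption))
    match = re.search(r"\d+(\.\d+)?", reply)
    return float(match.group()) if match else 0.0


def filter_samples(samples, threshold: float = 6.0):
    """Keep (image_path, caption) pairs the judge rates at or above threshold."""
    return [s for s in samples if judge_score(*s) >= threshold]
```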
The paper highlights the potential of models to serve as data scientists, automating tasks like navigation, extraction, analysis, and visualization of multi-modal data. These capabilities reduce labor-intensive efforts and provide new insights for dataset curation and analysis.
Future Directions
The authors outline a roadmap for future research, emphasizing the need for infrastructural advancements to support data-model co-development and proposing several promising directions. These include enhancing automated data discovery, modality-compatibility detection, and knowledge transfer among models. The paper also discusses the potential of self-boosted development paradigms, where MLLMs iteratively improve both themselves and their training data in an autonomous cycle.
Implications and Conclusion
The survey highlights the symbiotic relationship between multi-modal data and MLLMs, showing that advances in one can significantly propel the other. The comprehensive compilation of methods and future directions provides valuable guidance for researchers and developers in the field of AI. This paper is essential reading for anyone interested in the cutting-edge development of large-scale multi-modal AI models, presenting a meticulous review and a visionary outlook on the future of AI.