
Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection (2410.10636v2)

Published 14 Oct 2024 in cs.LG and cs.AI

Abstract: Visual instruction datasets from various distributors are released at different times and often contain a significant number of semantically redundant text-image pairs, depending on their task compositions (i.e., skills) or reference sources. This redundancy greatly limits the efficient deployment of continually adaptable multimodal LLMs, hindering their ability to refine existing skills and acquire new competencies over time. We reframe the problem of lifelong Instruction Tuning (LiIT) via data selection, where the model automatically selects beneficial samples to learn from earlier and new datasets based on the current state of acquired knowledge in the model. We propose Adapt-$\infty$, a new multi-way and adaptive data selection approach that dynamically balances sample efficiency and effectiveness during LiIT. We first construct pseudo-skill clusters by grouping gradient-based sample vectors. Next, we select the best-performing data selector for each skill cluster from a pool of selector experts, including our newly proposed scoring function, Image Grounding score. This data selector samples a subset of the most important samples from each skill cluster for training. To prevent the continuous increase in the size of the dataset pool during LiIT, we introduce a cluster-wise permanent data pruning strategy to remove the most semantically redundant samples from each cluster, keeping computational requirements manageable. We validate the effectiveness and efficiency of Adapt-$\infty$ over a sequence of multimodal instruction tuning datasets with various tasks, including (Knowledge) VQA, multilingual, grounding, reasoning, language-only, and multi-image comprehension. Training with samples selected by Adapt-$\infty$ alleviates catastrophic forgetting, especially for rare tasks, and promotes forward transfer across the continuum using only a fraction of the original data.

Summary

  • The paper introduces Adapt-$\infty$, a dynamic data selection framework that addresses redundancy and optimizes multimodal instruction tuning for lifelong learning.
  • It employs pseudo-task clustering and multi-way selection with an innovative Image Grounding score to prioritize the most informative, visually grounded samples.
  • The approach significantly mitigates catastrophic forgetting and boosts forward skill transfer, achieving over 100% relative gains in training efficiency.

Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection

The paper "Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection" presents an innovative approach to improving the adaptability and proficiency of Multimodal LLMs (MLLMs). The authors address the challenges of Lifelong Instruction Tuning (LiIT) by introducing a new strategy for dynamic data selection called Adapt-$\infty$, which fosters efficient training on multimodal datasets that are iteratively updated with new data over time.

Problem and Methodology

The primary challenge highlighted is the redundancy present in large, sequentially released visual instruction datasets. This redundancy inhibits MLLMs from refining previously acquired skills while integrating new capabilities. The authors propose Adapt-$\infty$ to tackle this inefficiency by selectively curating data from both existing and new datasets based on relevance to the model's current state.

Adapt-$\infty$ operates through a series of systematic steps:

  1. Pseudo-task Clustering: The authors use gradient-based sample vectors to construct pseudo-skill clusters. This technique categorizes data samples into groups that represent similar skills, helping to preserve skill diversity during training.
  2. Multi-way Data Selection: The framework evaluates and selects the most informative sample subset for each pseudo-skill cluster using a combination of scoring functions. A novel scoring function, the Image Grounding score, is introduced to measure the influence of visual information on sample perplexity, effectively prioritizing visually grounded samples.
  3. Cluster-wise Data Pruning: To manage computational resources during LiIT, a data pruning strategy is implemented to eliminate semantically redundant samples from each cluster, ensuring that the dataset remains a balanced and efficient representation.

Results and Implications

Empirical validation on multiple multimodal instruction tuning datasets, such as VQA and multilingual tasks, demonstrates Adapt-$\infty$'s ability to enhance forward skill transfer and mitigate catastrophic forgetting using only a fraction of the original data. The method achieves significant improvements in training efficiency, with greater than 100% relative gains in preserving and extending comprehensive skill sets.
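The pseudo-task clustering step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each sample has already been reduced to a gradient-based feature vector, and uses a plain k-means with farthest-point initialization as a stand-in for whatever clustering the authors actually apply.

```python
import numpy as np

def pseudo_skill_clusters(grad_feats, k, iters=50):
    """Group samples into k pseudo-skill clusters by running k-means
    over gradient-based sample vectors (illustrative stand-in for the
    paper's clustering step)."""
    # Farthest-point initialization: start from sample 0, then repeatedly
    # add the sample farthest from all centroids chosen so far.
    centroids = [grad_feats[0]]
    for _ in range(k - 1):
        dist = np.min(
            [np.linalg.norm(grad_feats - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(grad_feats[dist.argmax()])
    centroids = np.array(centroids)
    for _ in range(iters):
        # Assign every sample to its nearest centroid.
        d = np.linalg.norm(grad_feats[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute centroids; leave a centroid unchanged if its cluster empties.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = grad_feats[labels == j].mean(axis=0)
    return labels
```

Each resulting cluster then acts as a proxy for one "skill", so that the later selection and pruning stages can operate per skill rather than over the pooled dataset.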
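The Image Grounding score described in step 2 can be illustrated as the drop in the response's negative log-likelihood when the image is provided versus withheld. The two NLL inputs here are hypothetical model evaluations, and the exact functional form in the paper may differ; this only captures the stated intuition that higher scores mark samples whose answers genuinely depend on the image.

```python
import math

def image_grounding_score(nll_with_image, nll_without_image):
    """Sketch of the Image Grounding score: how much conditioning on the
    image lowers the response's negative log-likelihood (equivalently,
    its log-perplexity). Higher = more visually grounded."""
    return nll_without_image - nll_with_image

# A visually grounded sample: the answer is far more likely given the image.
grounded = image_grounding_score(nll_with_image=1.2, nll_without_image=4.8)
# A sample answerable from text alone: the image barely helps.
ungrounded = image_grounding_score(nll_with_image=2.0, nll_without_image=2.1)
```

Ranking a cluster by this score would push text-only-answerable pairs to the bottom, which is how the selector prioritizes visually grounded samples.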
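The cluster-wise permanent pruning of step 3 can likewise be sketched with a simple greedy rule: within one cluster, repeatedly drop one member of the most similar remaining pair until a target fraction survives. The cosine-similarity criterion and the `keep_ratio` parameter are illustrative assumptions, not the paper's exact redundancy measure.

```python
import numpy as np

def prune_redundant(embs, keep_ratio=0.5):
    """Sketch of cluster-wise pruning: greedily remove the most
    semantically redundant samples (by cosine similarity of their
    embeddings) until keep_ratio of the cluster remains."""
    n = len(embs)
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
    keep = list(range(n))
    target = max(1, int(n * keep_ratio))
    while len(keep) > target:
        # Find the most similar remaining pair and drop one of its members.
        sub = sim[np.ix_(keep, keep)]
        i, j = np.unravel_index(sub.argmax(), sub.shape)
        keep.pop(max(i, j))
    return sorted(keep)
```

Applying this per cluster keeps the dataset pool from growing without bound across LiIT stages while retaining one representative of each near-duplicate group.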

The research provides a robust framework for sustaining lifelong learning in AI models, specifically in handling evolving multimodal content. By integrating adaptive data selection and skill retention mechanisms, Adapt-$\infty$ supports scalable and efficient training paradigms that could inspire future work in continual learning methodologies for multimodal AI systems.

Future Directions

Speculation on future developments indicates a potential extension of Adapt-$\infty$ to address even broader datasets and more complex tasks. There may also be an exploration into further enhancing scoring functions and clustering mechanisms to refine the model's ability to discern sample importance dynamically.

In summary, the paper introduces a well-founded approach to advancing the capabilities of MLLMs in lifelong learning scenarios, offering both theoretical innovation and practical improvements. Adapt-$\infty$ provides a significant contribution to scalable AI models that can adapt to constant information influxes, setting a promising direction for future AI research and application.