
Data curation via joint example selection further accelerates multimodal learning (2406.17711v1)

Published 25 Jun 2024 in cs.LG and cs.AI

Abstract: Data curation is an essential component of large-scale pretraining. In this work, we demonstrate that jointly selecting batches of data is more effective for learning than selecting examples independently. Multimodal contrastive objectives expose the dependencies between data and thus naturally yield criteria for measuring the joint learnability of a batch. We derive a simple and tractable algorithm for selecting such batches, which significantly accelerates training beyond individually-prioritized data points. As performance improves by selecting from larger super-batches, we also leverage recent advances in model approximation to reduce the associated computational overhead. As a result, our approach--multimodal contrastive learning with joint example selection (JEST)--surpasses state-of-the-art models with up to 13$\times$ fewer iterations and 10$\times$ less computation. Essential to the performance of JEST is the ability to steer the data selection process towards the distribution of smaller, well-curated datasets via pretrained reference models, exposing the level of data curation as a new dimension for neural scaling laws.


Summary

  • The paper introduces JEST, a joint example selection method that enhances batch-based data curation for efficient multimodal learning.
  • It utilizes learnability scoring derived from both learner and reference model losses to select highly informative data batches.
  • Experimental results demonstrate state-of-the-art efficiency with up to 90% fewer FLOPs compared to conventional methods.

Data Curation via Joint Example Selection Further Accelerates Multimodal Learning

This paper presents an approach to data curation for large-scale multimodal learning, introducing a method termed Joint Example Selection (JEST). The central thesis is that curating data in batches, rather than as individual examples, substantially improves multimodal learning efficiency. The technique leverages contrastive objectives to reveal dependencies between data points, enabling the selection of the most "learnable" batches based on their joint characteristics.

Key Contributions and Methods

The authors propose JEST, an algorithm that selects highly learnable sub-batches from much larger super-batches by scoring the data with model-based criteria. These criteria are derived from the contrastive losses and the alignment of data points within a batch. By combining the learner's loss with a pretrained reference model's loss (learnability scoring), JEST prioritizes examples that are challenging for the current model yet deemed high quality by the reference model.
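The learnability score described above can be sketched as a simple difference of per-example losses. The function name and toy values below are illustrative, not the paper's implementation:

```python
import numpy as np

def learnability_scores(learner_losses, reference_losses):
    """Score = learner loss - reference loss.

    High learner loss   -> example is still hard for the model in training.
    High reference loss -> example is likely low quality or noisy.
    Examples that are hard for the learner but easy for the pretrained
    reference model therefore score highest."""
    return learner_losses - reference_losses

# Toy illustration with per-example contrastive losses.
learner = np.array([2.0, 0.5, 3.0])    # current model's losses
reference = np.array([0.3, 0.4, 2.9])  # pretrained reference's losses
scores = learnability_scores(learner, reference)
# Example 0 (hard for the learner, easy for the reference) scores highest.
```

Scoring by the learner's loss alone would up-weight noisy examples; subtracting the reference loss filters those out, which is why the reference model encodes the curation signal.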

Core Innovations:

  1. Learnability Scoring: This metric combines the difficulty of examples for the learner and their ease for the reference model, ensuring the chosen examples are both informative and of high quality.
  2. Batch-level Selection: By focusing on the joint characteristics of batches, the method captures the interactions between examples, which individual selection methods miss.
  3. Efficient Implementation: The authors introduce an efficient scoring mechanism leveraging online model approximations, specifically using FlexiViT architecture to reduce computational overhead.
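Because the contrastive loss couples the examples in a batch, the sub-batch must be built jointly rather than by an independent top-k. The paper does this with a blocked sampling scheme; the greedy chunk-wise variant below is a simplified stand-in, and the pairwise score matrix is a hypothetical surrogate for the conditional contrastive scores:

```python
import numpy as np

def joint_select(scores_matrix, batch_size, n_chunks=2):
    """Greedy sketch of joint example selection.

    Builds the sub-batch in chunks, rescoring remaining candidates
    conditioned on the examples already chosen.

    scores_matrix: (N, N) array; entry [i, j] stands in for the
    contribution of candidate j to the joint learnability of a batch
    that already contains example i (a hypothetical surrogate for
    the paper's contrastive conditional scores)."""
    n = scores_matrix.shape[0]
    chunk = batch_size // n_chunks
    selected = []
    remaining = list(range(n))
    for _ in range(n_chunks):
        if selected:
            # Condition on the partial batch: average each candidate's
            # pairwise score against everything already selected.
            cond = scores_matrix[np.ix_(selected, remaining)].mean(axis=0)
        else:
            # First chunk: fall back to unconditional (diagonal) scores.
            cond = np.diag(scores_matrix)[remaining]
        top = np.argsort(cond)[-chunk:]
        picked = [remaining[i] for i in top]
        selected.extend(picked)
        remaining = [i for i in remaining if i not in picked]
    return selected

# Toy super-batch of 8 candidates, selecting a sub-batch of 4.
M = np.arange(64, dtype=float).reshape(8, 8)
sub_batch = joint_select(M, batch_size=4, n_chunks=2)
```

The chunk-wise conditioning is what distinguishes this from independent selection: each later chunk can pick examples that complement, rather than duplicate, the ones already in the batch.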

Experimental Results

The empirical evaluations demonstrate that JEST drastically accelerates learning in multimodal tasks compared to standard independent data selection methods. Notably, the following points stand out:

  • Training Efficiency: JEST achieves state-of-the-art performance using approximately 10 times fewer FLOPs and 13 times fewer training iterations than traditional methods. For instance, the JEST++ variant, using WebLI-curated++ as its reference dataset, outperforms SigLIP models trained on 40 billion examples while seeing only 4 billion.
  • Data Quality Bootstrapping: Using reference models trained on small, well-curated datasets to guide learning on much larger, uncurated datasets showcases robust scaling potential. The results indicate a significant decoupling between the performance of the reference model and that of the JEST-trained model, underscoring the method's effectiveness at bootstrapping data quality.
  • FLOP Efficiency: Flexi-JEST achieves notable reductions in computational cost, with the best variants performing comparably to state-of-the-art models but with up to 90% fewer FLOPs.

Practical Implications and Future Directions

The implications of this research are profound for both practical and theoretical aspects of AI and machine learning. On a practical level, the ability to significantly reduce the computational resources required for training large-scale models makes this approach highly attractive for industry applications, where resources are often a limiting factor. Furthermore, the method's robustness and scalability could facilitate more accessible and sustainable AI developments.

Theoretically, the concept of learnability scoring and joint example selection opens new avenues for optimizing data curation strategies, potentially influencing how future AI models are trained. Here are some potential future developments in AI based on this research:

  1. Dynamic Data Curation: Extending the concept of learnability to dynamic data environments could lead to models that continuously optimize their training data in real-time, adapting to new information as it becomes available.
  2. Enhanced Model Interpretability: Understanding the characteristics of highly learnable batches might provide insights into the intricacies of model learning and the inherent structure of data, potentially leading to better interpretability of AI models.
  3. Cross-Domain Applications: The principles of joint example selection could be adapted to other domains beyond multimodal learning, such as reinforcement learning, where the quality of sampled experiences significantly impacts learning efficiency.

Conclusion

The paper presents a significant advancement in data curation for AI, providing empirical evidence that joint example selection markedly enhances learning efficiency in multimodal settings. By leveraging learnability scoring and batch-level selection, the proposed JEST method not only accelerates training but also improves the practicality of deploying large-scale AI models. These results point towards a future where data curation is closely aligned with the demands of the learning process, maximizing the potential of AI across various applications.
