
Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic (2404.07177v1)

Published 10 Apr 2024 in cs.LG

Abstract: Vision-Language Models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets. In recent times, data curation has gained prominence with several works developing strategies to retain 'high-quality' subsets of 'raw' scraped data. For instance, the LAION public dataset retained only 10% of the total crawled data. However, these strategies are typically developed agnostic of the available compute for training. In this paper, we first demonstrate that making filtering decisions independent of training compute is often suboptimal: the limited high-quality data rapidly loses its utility when repeated, eventually requiring the inclusion of 'unseen' but 'lower-quality' data. To address this quality-quantity tradeoff ($\texttt{QQT}$), we introduce neural scaling laws that account for the non-homogeneous nature of web data, an angle ignored in existing literature. Our scaling laws (i) characterize the $\textit{differing}$ 'utility' of various quality subsets of web data; (ii) account for how utility diminishes for a data point at its 'nth' repetition; and (iii) formulate the mutual interaction of various data pools when combined, enabling the estimation of model performance on a combination of multiple data pools without ever jointly training on them. Our key message is that data curation $\textit{cannot}$ be agnostic of the total compute that a model will be trained for. Our scaling laws allow us to curate the best possible pool for achieving top performance on Datacomp at various compute budgets, carving out a pareto-frontier for data curation. Code is available at https://github.com/locuslab/scaling_laws_data_filtering.


Summary

  • The paper presents neural scaling laws that model the diminishing returns of high-quality data in VLM training.
  • It demonstrates that aggressive filtering enhances performance at low compute budgets, while broader data inclusion benefits high compute budgets.
  • The study offers practical methodologies to adjust data curation strategies based on computational constraints for optimal model training.

Scaling Laws for Data Filtering: Adapting to Computational Constraints in Vision-Language Model Training

Introduction to Quality-Quantity Tradeoff in Data Curation

The effective training of Vision-Language Models (VLMs) hinges on the curation of the underlying datasets. Recent methodologies emphasize stratifying web-scraped data into “high-quality” subsets for model training. This paper illuminates a critical dynamic in this process: the quality-quantity tradeoff (QQT). QQT captures the fact that a limited pool of high-quality data rapidly loses utility as it is repeated during training, so that beyond a certain compute budget the inclusion of “unseen” but lower-quality data becomes the better choice. This phenomenon underscores the necessity of curating data pools in tandem with the available training compute, challenging the prevailing compute-agnostic approaches to filtering.

Theoretical Underpinnings of Data Utility and Decay

The paper introduces neural scaling laws that account for the non-homogeneous nature of web data, an aspect ignored in prior scaling-law work. These scaling laws enable:

  • Characterization of the differential utility of web data subsets.
  • Quantification of the diminishing utility of data upon repetition.
  • Estimation of model performance across combinations of data pools without necessitating joint training on them.

This framework posits that the utility of a data point decreases not only with the total amount of data already seen but also with each repetition of that point, and it models this decay explicitly rather than treating every seen sample as equally informative.
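As an illustrative sketch (using assumed notation rather than the paper's exact equations), one can assign each quality pool $p$ a base utility $b_p$ and a repetition decay factor $\delta_p \in (0, 1)$, so that a sample from pool $p$ seen for the $k$-th time contributes with effective utility $b_p^{(k)} = b_p \, \delta_p^{\,k-1}$. The error after $n$ samples can then be modeled as a saturating power law, $y(n) \approx y_{\min} + a \, n^{-\bar{b}}$, where the effective exponent $\bar{b}$ is a sample-weighted combination of the decayed utilities of every pool drawn from. Under such a parameterization, predicting performance on a mixture of pools reduces to combining each pool's separately fitted $(b_p, \delta_p)$, which is what allows a combination to be evaluated without ever training jointly on it.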

Empirical Evaluation and Observations

Empirical investigations validate the theoretical model. The paper partitions the DataComp pool into subsets by estimated data quality and trains models on them under a range of compute budgets. Key observations include:

  • At low compute budgets, aggressive filtering to retain only high-quality data yields superior performance.
  • Under high compute budgets, the strategy reverses: broader data pools become preferable, because the limited high-quality data is repeated so often that its utility decays rapidly.

These findings are instrumental in illustrating the need for compute-aware strategies in data filtering, challenging conventional practices that favor static, quality-centric curation methods.
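To make the compute dependence concrete, the following toy Python sketch applies a repetition-discounted power law of the kind sketched above to two candidate pools. The pool sizes, utilities, decay rates, and error floor are entirely hypothetical values chosen only to reproduce the qualitative crossover described above; this is not the paper's fitted model or released code.

```python
# Toy illustration of the quality-quantity tradeoff (QQT).
# All parameters below are made-up numbers, not the paper's fitted values.
POOLS = {
    "aggressive filter (small, high-quality pool)": dict(size=1e7, b=0.30, delta=0.5),
    "broad filter (large, mixed-quality pool)":     dict(size=1e8, b=0.27, delta=0.5),
}

def effective_samples(size, delta, n_total):
    """Samples seen, with the k-th repetition of the pool discounted by delta**(k-1)."""
    n_eff, remaining, k = 0.0, n_total, 0
    while remaining > 0:
        chunk = min(size, remaining)
        n_eff += chunk * delta ** k
        remaining -= chunk
        k += 1
    return n_eff

def predicted_error(pool, n_total, a=1.0, floor=0.15):
    """Saturating power-law error curve: floor + a * n_eff**(-b)."""
    n_eff = effective_samples(pool["size"], pool["delta"], n_total)
    return floor + a * n_eff ** (-pool["b"])

# Sweep the total number of samples seen (a proxy for training compute)
# and pick the pool with the lower predicted error at each budget.
for budget in [1e7, 1e8, 4e8, 1.6e9]:
    scores = {name: predicted_error(p, budget) for name, p in POOLS.items()}
    best = min(scores, key=scores.get)
    report = ", ".join(f"{name}: {err:.4f}" for name, err in scores.items())
    print(f"budget {budget:9.0e} -> {report} | best: {best}")
```

Running this sketch, the small high-quality pool wins at the smallest budgets, while the larger pool overtakes it once the budget forces many repetitions of the filtered data, mirroring the crossover reported on DataComp.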

Implications and Future Directions

The implications of this research are twofold. Practically, it equips practitioners with a methodology to tune their data curation strategies based on available computational resources, optimizing model performance. Theoretically, it sets a new direction for future inquiries into scaling laws for VLMs, especially in the context of heterogeneous and limited web data.

Prospective avenues for extending this work include testing whether these scaling laws hold across vastly differing data pool sizes and incorporating differences in data diversity when mixing pools of different qualities. Accounting for batch size effects in contrastive training settings could further refine the laws' applicability.

Conclusion

The research provides a paradigm shift in how data curation is approached for training large-scale VLMs, highlighting the interaction between data quality, quantity, and computational budget. By challenging the data-agnostic notions of quality in data filtering, this work paves the way for more nuanced and effective strategies in leveraging web-scale datasets for AI training.