
Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic (2404.07177v1)

Published 10 Apr 2024 in cs.LG

Abstract: Vision-Language Models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets. In recent times, data curation has gained prominence with several works developing strategies to retain 'high-quality' subsets of 'raw' scraped data. For instance, the LAION public dataset retained only 10% of the total crawled data. However, these strategies are typically developed agnostic of the available compute for training. In this paper, we first demonstrate that making filtering decisions independent of training compute is often suboptimal: the limited high-quality data rapidly loses its utility when repeated, eventually requiring the inclusion of 'unseen' but 'lower-quality' data. To address this quality-quantity tradeoff ($\texttt{QQT}$), we introduce neural scaling laws that account for the non-homogeneous nature of web data, an angle ignored in existing literature. Our scaling laws (i) characterize the $\textit{differing}$ 'utility' of various quality subsets of web data; (ii) account for how utility diminishes for a data point at its 'nth' repetition; and (iii) formulate the mutual interaction of various data pools when combined, enabling the estimation of model performance on a combination of multiple data pools without ever jointly training on them. Our key message is that data curation $\textit{cannot}$ be agnostic of the total compute that a model will be trained for. Our scaling laws allow us to curate the best possible pool for achieving top performance on Datacomp at various compute budgets, carving out a pareto-frontier for data curation. Code is available at https://github.com/locuslab/scaling_laws_data_filtering.


Summary

  • The paper presents neural scaling laws that model the diminishing returns of high-quality data in VLM training.
  • It demonstrates that aggressive filtering enhances performance at low compute budgets, while broader data inclusion benefits high compute budgets.
  • The study offers practical methodologies to adjust data curation strategies based on computational constraints for optimal model training.

Scaling Laws for Data Filtering: Adapting to Computational Constraints in Vision-Language Model Training

Introduction to Quality-Quantity Tradeoff in Data Curation

The effective training of Vision-Language Models (VLMs) hinges on the curation of the underlying datasets. Recent methodologies emphasize stratifying web-scraped data into “high-quality” subsets for model training. This paper illuminates a critical dynamic in this process: the quality-quantity tradeoff (QQT). QQT captures the fact that a limited pool of high-quality data rapidly loses utility as it is repeated during training, so that beyond a certain compute budget the inclusion of “unseen” but lower-quality data becomes the better choice. This phenomenon underscores the necessity of curating data pools in tandem with the available training compute, challenging the prevailing compute-agnostic approaches to filtering.

Theoretical Underpinnings of Data Utility and Decay

The paper introduces neural scaling laws that account for the non-homogeneous nature of web data, an aspect ignored in prior scaling-law work. These scaling laws enable:

  • Characterization of the differential utility of web data subsets.
  • Quantification of the diminishing utility of data upon repetition.
  • Estimation of model performance across combinations of data pools without necessitating joint training on them.

This framework posits that the utility of a data point decreases not only with the total amount of data already seen but also with each repetition of that point, and it models this decay explicitly rather than treating every seen sample as equally informative.
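As an illustrative sketch (using assumed notation rather than the paper's exact equations), one can assign each quality pool $p$ a base utility $b_p$ and a repetition decay factor $\delta_p \in (0, 1)$, so that a sample from pool $p$ seen for the $k$-th time contributes with effective utility $b_p^{(k)} = b_p \, \delta_p^{\,k-1}$. The error after $n$ samples can then be modeled as a saturating power law, $y(n) \approx y_{\min} + a \, n^{-\bar{b}}$, where the effective exponent $\bar{b}$ is a sample-weighted combination of the decayed utilities of every pool drawn from. Under such a parameterization, predicting performance on a mixture of pools reduces to combining each pool's separately fitted $(b_p, \delta_p)$, which is what allows a combination to be evaluated without ever training jointly on it.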

Empirical Evaluation and Observations

Empirical investigations validate the theoretical model. The paper partitions the DataComp pool into subsets by estimated data quality and trains models on them under a range of compute budgets. Key observations include:

  • At low compute budgets, aggressive filtering to retain only high-quality data yields superior performance.
  • Under high compute budgets, the strategy reverses: broader data pools become preferable, because the limited high-quality data is repeated so often that its utility decays rapidly.

These findings are instrumental in illustrating the need for compute-aware strategies in data filtering, challenging conventional practices that favor static, quality-centric curation methods.
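To make the compute dependence concrete, the following toy Python sketch applies a repetition-discounted power law of the kind sketched above to two candidate pools. The pool sizes, utilities, decay rates, and error floor are entirely hypothetical values chosen only to reproduce the qualitative crossover described above; this is not the paper's fitted model or released code.

```python
# Toy illustration of the quality-quantity tradeoff (QQT).
# All parameters below are made-up numbers, not the paper's fitted values.
POOLS = {
    "aggressive filter (small, high-quality pool)": dict(size=1e7, b=0.30, delta=0.5),
    "broad filter (large, mixed-quality pool)":     dict(size=1e8, b=0.27, delta=0.5),
}

def effective_samples(size, delta, n_total):
    """Samples seen, with the k-th repetition of the pool discounted by delta**(k-1)."""
    n_eff, remaining, k = 0.0, n_total, 0
    while remaining > 0:
        chunk = min(size, remaining)
        n_eff += chunk * delta ** k
        remaining -= chunk
        k += 1
    return n_eff

def predicted_error(pool, n_total, a=1.0, floor=0.15):
    """Saturating power-law error curve: floor + a * n_eff**(-b)."""
    n_eff = effective_samples(pool["size"], pool["delta"], n_total)
    return floor + a * n_eff ** (-pool["b"])

# Sweep the total number of samples seen (a proxy for training compute)
# and pick the pool with the lower predicted error at each budget.
for budget in [1e7, 1e8, 4e8, 1.6e9]:
    scores = {name: predicted_error(p, budget) for name, p in POOLS.items()}
    best = min(scores, key=scores.get)
    report = ", ".join(f"{name}: {err:.4f}" for name, err in scores.items())
    print(f"budget {budget:9.0e} -> {report} | best: {best}")
```

Running this sketch, the small high-quality pool wins at the smallest budgets, while the larger pool overtakes it once the budget forces many repetitions of the filtered data, mirroring the crossover reported on DataComp.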

Implications and Future Directions

The implications of this research are twofold. Practically, it equips practitioners with a methodology to tune their data curation strategies based on available computational resources, optimizing model performance. Theoretically, it sets a new direction for future inquiries into scaling laws for VLMs, especially in the context of heterogeneous and limited web data.

Prospective avenues for extending this work include testing whether these scaling laws hold across vastly differing data pool sizes and incorporating differences in data diversity when mixing pools of different qualities. Accounting for batch size effects in contrastive training settings could further refine the laws' applicability.

Conclusion

The research provides a paradigm shift in how data curation is approached for training large-scale VLMs, highlighting the interaction between data quality, quantity, and computational budget. By challenging the data-agnostic notions of quality in data filtering, this work paves the way for more nuanced and effective strategies in leveraging web-scale datasets for AI training.