Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic (2404.07177v1)
Abstract: Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets. In recent times, data curation has gained prominence, with several works developing strategies to retain 'high-quality' subsets of 'raw' scraped data. For instance, the LAION public dataset retained only 10% of the total crawled data. However, these strategies are typically developed agnostic of the compute available for training. In this paper, we first demonstrate that making filtering decisions independent of training compute is often suboptimal: the limited high-quality data rapidly loses its utility when repeated, eventually requiring the inclusion of 'unseen' but 'lower-quality' data. To address this quality-quantity tradeoff ($\texttt{QQT}$), we introduce neural scaling laws that account for the non-homogeneous nature of web data, an angle ignored in existing literature. Our scaling laws (i) characterize the $\textit{differing}$ 'utility' of various quality subsets of web data; (ii) account for how utility diminishes for a data point at its $n$-th repetition; and (iii) formulate the mutual interaction of various data pools when combined, enabling the estimation of model performance on a combination of multiple data pools without ever jointly training on them. Our key message is that data curation $\textit{cannot}$ be agnostic of the total compute that a model will be trained for. Our scaling laws allow us to curate the best possible pool for achieving top performance on Datacomp at various compute budgets, carving out a Pareto frontier for data curation. Code is available at https://github.com/locuslab/scaling_laws_data_filtering.
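To make the quality-quantity tradeoff concrete, below is a minimal toy sketch of the idea. It is not the paper's fitted scaling law: the geometric repetition decay `delta`, the power-law exponent `alpha`, the function names, and all numeric parameters are illustrative assumptions. It only shows, qualitatively, how a pool's utility might shrink with each repetition and why the best data mix can depend on the training budget.

```python
# Toy illustration of the quality-quantity tradeoff (QQT). The functional
# form, the geometric repetition decay `delta`, and the power-law exponent
# `alpha` are illustrative assumptions, NOT the paper's fitted scaling law.

def pool_utility_at_epoch(b: float, delta: float, k: int) -> float:
    """Hypothetical diminishing utility: the pool's initial utility b is
    scaled down by a factor delta each time the pool is repeated."""
    return b * delta ** (k - 1)

def estimated_error(pools, total_samples: float,
                    e0: float = 1.0, alpha: float = 0.5) -> float:
    """Estimate downstream error after training on a mixture of pools.

    pools: list of (size, b, delta) tuples, one per quality bucket.
    Each sample seen is weighted by its pool's decayed utility at the epoch
    in which it is seen; error then follows a simple power law in the
    accumulated 'effective' sample count.
    """
    pool_size = sum(size for size, _, _ in pools)
    remaining_epochs = total_samples / pool_size   # may be fractional
    effective, k = 0.0, 1
    while remaining_epochs > 1e-9:
        frac = min(1.0, remaining_epochs)          # partial final epoch
        for size, b, delta in pools:
            effective += frac * size * pool_utility_at_epoch(b, delta, k)
        remaining_epochs -= frac
        k += 1
    return e0 * effective ** (-alpha)

# A small high-quality pool vs. the same pool combined with a larger,
# lower-quality pool, evaluated at three training budgets (samples seen).
high = (10_000_000, 1.0, 0.5)   # high utility, loses value quickly when repeated
low  = (40_000_000, 0.3, 0.9)   # lower utility, but decays slowly
for budget in (10_000_000, 100_000_000, 1_000_000_000):
    print(f"budget={budget:>13,d}  "
          f"high-only={estimated_error([high], budget):.2e}  "
          f"high+low={estimated_error([high, low], budget):.2e}")
```

With these made-up parameters, the aggressively filtered (high-quality-only) pool gives lower error at the smallest budget, but once the high-quality data must be repeated many times, mixing in the 'lower-quality' pool wins: the compute-dependent curation behavior the abstract describes.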