
Effective pruning of web-scale datasets based on complexity of concept clusters (2401.04578v2)

Published 9 Jan 2024 in cs.CV

Abstract: Utilizing massive web-scale datasets has led to unprecedented performance gains in machine learning models, but also imposes outlandish compute requirements for their training. In order to improve training and data efficiency, we here push the limits of pruning large-scale multimodal datasets for training CLIP-style models. Today's most effective pruning method on ImageNet clusters data samples into separate concepts according to their embedding and prunes away the most prototypical samples. We scale this approach to LAION and improve it by noting that the pruning rate should be concept-specific and adapted to the complexity of the concept. Using a simple and intuitive complexity measure, we are able to reduce the training cost to a quarter of regular training. By filtering from the LAION dataset, we find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs. More specifically, we are able to outperform the LAION-trained OpenCLIP-ViT-B32 model on ImageNet zero-shot accuracy by 1.1 p.p. while only using 27.7% of the data and training compute. Despite a strong reduction in training cost, we also see improvements on ImageNet dist. shifts, retrieval tasks and VTAB. On the DataComp Medium benchmark, we achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.

Introduction

Training on massive multimodal datasets has driven significant gains in model performance, but it also carries heavy compute requirements and environmental costs. To improve data efficiency, recent research has focused on dataset pruning, which selects a subset of the original dataset for training, with the goal of substantially reducing computational cost while maintaining or even improving model performance.

Data Efficiency and Pruning

In large-scale datasets such as LAION, which can contain billions of examples, identifying and removing redundant or less informative data can speed up training and improve data efficiency. The most effective prior method on ImageNet, Self-Supervised-Prototypes Pruning (SSP-Pruning), clusters data samples by their embeddings and discards the most prototypical examples, those closest to the cluster centers. The paper proposes a more nuanced variant that adapts the pruning rate to each cluster's complexity, so that simple concepts are pruned more aggressively than complex ones.
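
The prototype-based pruning step can be illustrated with a minimal sketch. It assumes the embeddings, cluster labels, and centroids have already been computed (for example by k-means), and it uses cosine similarity to the centroid as the measure of how prototypical a sample is. The function names and the `keep_fraction` parameter are hypothetical and are not taken from the paper's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ssp_prune(embeddings, labels, centroids, keep_fraction):
    """Return sorted indices of samples kept after prototype pruning.

    Within each cluster, samples are ranked by similarity to the centroid
    and the most prototypical ones (highest similarity) are discarded.
    The same keep_fraction is applied to every cluster (an assumption of
    this sketch; the summarized method varies it per cluster).
    """
    kept = []
    for cluster_id, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == cluster_id]
        # Least prototypical first: lowest similarity to the centroid.
        members.sort(key=lambda i: cosine_similarity(embeddings[i], centroid))
        n_keep = math.ceil(keep_fraction * len(members))
        kept.extend(members[:n_keep])
    return sorted(kept)
```

Calling `ssp_prune(embeddings, labels, centroids, 0.5)` would keep roughly the half of each cluster that lies farthest from its centroid. Using one fixed fraction for all clusters is exactly the limitation that a complexity-aware pruning rate is meant to address.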

Research Contributions and Methodology

The researchers make several contributions. They scale SSP-Pruning to web-scale datasets and introduce a pruning criterion based on the complexity of concepts within the data. Compared with previous methods, their approach performs better on a range of benchmarks while cutting training cost to roughly a quarter of regular training. For instance, their model exceeds the LAION-trained OpenCLIP-ViT-B/32 model in ImageNet zero-shot accuracy by 1.1 percentage points while using only 27.7% of the data and training compute.

Central to their methodology is Density-Based Pruning (DBP), which selects a smaller, high-quality subset from a web-scale dataset. DBP estimates the complexity of each cluster from two quantities: the average intra-cluster distance, which measures variation within a cluster, and the inter-cluster distance, which measures how far the cluster lies from other clusters. The result is a pruned dataset that better preserves the diversity and complexity of the original data, leading to more balanced and efficient training.
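
As a rough illustration of how intra-cluster and inter-cluster distances could drive a per-cluster sampling budget, the sketch below scores each cluster and splits a total budget accordingly. Several details are assumptions made for illustration and are not confirmed by the text above: combining the two distances as a product, measuring inter-cluster distance as the mean distance to all other centroids, turning scores into shares with a softmax and a `temperature` parameter, and handling cluster-size caps with a simple redistribution loop. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_complexity(embeddings, labels, centroids):
    """Score each cluster as mean intra-cluster distance times mean
    distance to the other centroids (the product is an assumption)."""
    scores = []
    for j, centroid in enumerate(centroids):
        members = [e for e, lab in zip(embeddings, labels) if lab == j]
        d_intra = sum(euclidean(e, centroid) for e in members) / len(members)
        others = [c for k, c in enumerate(centroids) if k != j]
        # With a single cluster there is no inter-cluster distance; use 1.0.
        d_inter = (sum(euclidean(centroid, c) for c in others) / len(others)
                   if others else 1.0)
        scores.append(d_intra * d_inter)
    return scores


def allocate_budget(scores, cluster_sizes, budget, temperature=1.0):
    """Split `budget` samples across clusters with softmax weights,
    never assigning a cluster more samples than it contains."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    quota = [0.0] * len(scores)
    open_ids = set(range(len(scores)))
    remaining = float(min(budget, sum(cluster_sizes)))
    while remaining > 1e-9 and open_ids:
        total = sum(weights[j] for j in open_ids)
        spent = 0.0
        for j in list(open_ids):
            take = min(remaining * weights[j] / total,
                       cluster_sizes[j] - quota[j])
            quota[j] += take
            spent += take
            if cluster_sizes[j] - quota[j] <= 1e-9:
                open_ids.discard(j)  # cluster is full; redistribute the rest
        remaining -= spent
    return [math.floor(q + 1e-9) for q in quota]
```

The per-cluster quotas returned by `allocate_budget` could then be used in place of a single global keep fraction, so that tight, well-separated clusters give up more samples than spread-out ones. Because the quotas are rounded down, their sum can fall slightly short of the requested budget.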

Experiments and Results

The team's experiments support the approach. The pruning pipeline consists of deduplication, CLIP-score filtering (which scores how well each image matches its text), and finally DBP, which selects the data subset. Applying this pipeline to the LAION-CAT-440M dataset yields smaller, curated subsets, and models trained on them outperform existing baselines on the ImageNet benchmark at a fraction of the original computational cost. The method also achieves state-of-the-art ImageNet zero-shot accuracy on the DataComp Medium benchmark, along with a competitive average zero-shot accuracy across 38 evaluation tasks.
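
The CLIP-score filtering stage can be sketched as follows, assuming image and text embeddings have already been produced by a CLIP-style model and that the score is the cosine similarity between the two. The function name and the `threshold` parameter are hypothetical; the summary does not say which threshold the authors use.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def clip_score_filter(image_embeddings, text_embeddings, threshold):
    """Return indices of image-text pairs whose similarity score is at
    least `threshold`; pairs scoring lower are treated as mismatched."""
    return [
        i
        for i, (img, txt) in enumerate(zip(image_embeddings, text_embeddings))
        if cosine_similarity(img, txt) >= threshold
    ]
```

In a full pipeline this filter would run after deduplication and before the cluster-based selection, so that clustering only sees pairs whose captions plausibly describe their images.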

Conclusion

The research shows that careful dataset pruning can make model training markedly more efficient. With DBP, models reach higher performance on ImageNet zero-shot classification and related tasks while training on significantly smaller datasets. The lower computational overhead makes state-of-the-art research more feasible for groups with limited resources, including academic labs, and points toward more sustainable and accessible AI development through better use of data.

Authors (6)
  1. Amro Abbas
  2. Evgenia Rusak
  3. Kushal Tirumala
  4. Wieland Brendel
  5. Kamalika Chaudhuri
  6. Ari S. Morcos