Dataset Growth (2405.18347v2)

Published 28 May 2024 in cs.LG

Abstract: Deep learning benefits from the growing abundance of available data. Meanwhile, handling this growing scale efficiently has become a challenge. Publicly available data come from different sources of varying quality, and manually cleaning out noise and redundancy is impractical at today's data scale. Techniques for cleaning and selecting collected data exist. However, these methods are mainly designed for offline settings and target only one of the two problems, cleanliness or redundancy. In practice, data grow exponentially and exhibit both problems. This leads to repeated data curation with sub-optimal efficiency. To tackle this challenge, we propose InfoGrowth, an efficient online algorithm for data cleaning and selection, yielding a growing dataset that stays up to date while accounting for cleanliness and diversity. InfoGrowth can improve data quality and efficiency on both single-modal and multi-modal tasks, with an efficient and scalable design. Its framework makes it practical for real-world data engines.
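
The abstract does not describe how InfoGrowth works, so the following is not the paper's algorithm. It is only a minimal sketch of what "online" selection with a redundancy check could look like: samples arrive one at a time as embedding vectors, and a sample is kept only if it is not too similar to anything already kept. The class name `GrowingDataset`, the method `offer`, the cosine-similarity criterion, and the threshold of 0.95 are all assumptions made for illustration. The sketch covers redundancy only and does no noise cleaning.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


class GrowingDataset:
    """Hypothetical online filter (not InfoGrowth itself): keeps a sample
    only if it is not too similar to any sample already kept."""

    def __init__(self, redundancy_threshold: float = 0.95) -> None:
        # Assumed value; a real system would tune this.
        self.redundancy_threshold = redundancy_threshold
        self.kept: List[List[float]] = []

    def offer(self, embedding: Sequence[float]) -> bool:
        """Process one incoming sample; return True if it was added."""
        # Linear scan for clarity; an approximate nearest-neighbor
        # index would be the usual replacement at scale.
        for existing in self.kept:
            if cosine(embedding, existing) >= self.redundancy_threshold:
                return False  # near-duplicate of a kept sample, skip it
        self.kept.append(list(embedding))
        return True


# Toy stream of 2-D "embeddings" arriving one at a time.
dataset = GrowingDataset(redundancy_threshold=0.95)
stream = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
decisions = [dataset.offer(e) for e in stream]
print(decisions)  # [True, False, True, True]
```

Each sample is examined once on arrival and the dataset is never re-curated from scratch, which is the property that separates an online setting from the offline methods the abstract contrasts it with.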
