Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding (2312.05328v4)

Published 8 Dec 2023 in cs.AI

Abstract: Power-law scaling indicates that large-scale training with uniform sampling is prohibitively slow. Active learning methods aim to increase data efficiency by prioritizing learning on the most relevant examples. Despite their appeal, these methods have yet to be widely adopted since no one algorithm has been shown to a) generalize across models and tasks b) scale to large datasets and c) yield overall FLOP savings when accounting for the overhead of data selection. In this work we propose a method which satisfies these three properties, leveraging small, cheap proxy models to estimate "learnability" scores for datapoints, which are used to prioritize data for the training of much larger models. As a result, our models require 46% and 51% fewer training updates and up to 25% less total computation to reach the same performance as uniformly trained visual classifiers on JFT and multimodal models on ALIGN. Finally, we find our data-prioritization scheme to be complementary with recent data-curation and learning objectives, yielding a new state-of-the-art in several multimodal transfer tasks.

Active Learning Accelerates Large-Scale Visual Understanding

The paper by Evans et al. at Google DeepMind proposes and validates a method for improving the efficiency of training large-scale visual models via active learning. The approach is particularly relevant under power-law scaling, where massive datasets demand immense computational resources yet yield only incremental performance improvements. The authors address a critical gap by developing a procedure that generalizes across models and tasks, scales to large datasets, and delivers net computational savings, a combination not previously achieved by active learning methods.

Overview of Methodology

The authors introduce an approach that leverages small, cheap proxy models to compute "learnability" scores, which dictate data prioritization during the training of much larger models. This stands in contrast to traditional uniform sampling: an active data-selection process evaluates and prioritizes data points according to their utility for learning. Concretely, the proxy models score each data point using notions of "difficulty" and "learnability", operationalized through the example's loss under the current learner and under a reference model.
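For concreteness, the prioritization step can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it assumes the learnability score of an example is its loss under the current learner minus its loss under a reference model, and the function and variable names (`learnability_scores`, `select_prioritized_batch`, `keep_fraction`) are hypothetical.

```python
import numpy as np

def learnability_scores(learner_losses, reference_losses):
    # Learnability-style score: examples with high loss under the current
    # learner but low loss under a trained reference model are prioritized,
    # i.e. examples that are still informative but unlikely to be noise.
    return learner_losses - reference_losses

def select_prioritized_batch(candidates, learner_losses, reference_losses, keep_fraction=0.5):
    # Score a large candidate "super-batch" with cheap proxy models and keep
    # only the top-scoring fraction for the expensive learner update.
    scores = learnability_scores(learner_losses, reference_losses)
    k = max(1, int(keep_fraction * len(candidates)))
    top_idx = np.argsort(scores)[-k:]  # indices of the k highest scores
    return [candidates[i] for i in top_idx]

# Toy usage: 8 candidate examples, keep the most "learnable" half.
rng = np.random.default_rng(0)
candidates = list(range(8))
learner_losses = rng.uniform(0.5, 2.0, size=8)    # losses from a small learner proxy
reference_losses = rng.uniform(0.1, 1.0, size=8)  # losses from a pretrained reference proxy
print(select_prioritized_batch(candidates, learner_losses, reference_losses))
```

In the paper's setting it is the small proxy models, not the large learner, that supply these losses, which is what keeps the selection overhead low enough to yield overall FLOP savings.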

Key Results

The paper presents compelling numerical evidence: the proposed method required 46% and 51% fewer training updates, and up to 25% less total computation, to match the performance of uniformly trained visual classifiers on JFT and multimodal models on ALIGN, respectively. This indicates substantial gains in data efficiency alongside reduced computational cost. Moreover, the data-prioritization scheme is complementary to recent data-curation techniques and learning objectives, securing new state-of-the-art results on several multimodal transfer tasks.

Implications and Future Directions

The findings are significant in both practical and theoretical contexts. Practically, the reduction in computational cost without compromising model performance aligns with the growing demand for resource-efficient machine learning. Theoretically, the work lends credence to the potential of dynamic data selection to circumvent the limitations imposed by power-law scaling. Future research might explore the applicability of these active learning frameworks to other domains, such as LLMs and multimodal architectures that integrate additional modalities like audio or video. How well such prioritization generalizes to dynamically evolving datasets in practical settings is another promising direction for further work.

In conclusion, the paper offers a robust method for improving the computational efficiency of large-scale model training through strategic data prioritization guided by learnability scores, addressing the persistent challenge of data- and compute-intensive AI model training.

Authors (6)
  1. Talfan Evans (6 papers)
  2. Shreya Pathak (12 papers)
  3. Hamza Merzic (10 papers)
  4. Jonathan Schwarz (12 papers)
  5. Ryutaro Tanno (36 papers)
  6. Olivier J. Henaff (20 papers)