Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding (2312.05328v4)
Abstract: Power-law scaling indicates that large-scale training with uniform sampling is prohibitively slow. Active learning methods aim to increase data efficiency by prioritizing learning on the most relevant examples. Despite their appeal, these methods have yet to be widely adopted, since no single algorithm has been shown to a) generalize across models and tasks, b) scale to large datasets, and c) yield overall FLOP savings when accounting for the overhead of data selection. In this work we propose a method that satisfies all three properties, leveraging small, cheap proxy models to estimate "learnability" scores for datapoints, which are used to prioritize data for the training of much larger models. As a result, our models require 46% and 51% fewer training updates and up to 25% less total computation to reach the same performance as uniformly trained visual classifiers on JFT and multimodal models on ALIGN, respectively. Finally, we find our data-prioritization scheme to be complementary to recent data-curation and learning objectives, yielding a new state of the art on several multimodal transfer tasks.
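The abstract does not spell out how "learnability" is computed, so the following is only a minimal sketch of the selection loop it describes, under the assumption that an example's learnability is scored as the gap between a small learner proxy's loss and a small pretrained reference proxy's loss, and that the top-scoring fraction of each candidate "super-batch" is kept for the large model's update. All function and parameter names (e.g. `learnability_scores`, `select_prioritized_batch`, `keep_fraction`) are illustrative, not the paper's API.

```python
import numpy as np


def learnability_scores(learner_losses: np.ndarray,
                        reference_losses: np.ndarray) -> np.ndarray:
    """Score each example by how much the small learner proxy still struggles
    on it relative to a small pretrained reference proxy. A high score means
    the example is currently hard for the learner but evidently learnable,
    since the reference model already fits it well."""
    return learner_losses - reference_losses


def select_prioritized_batch(super_batch_ids: np.ndarray,
                             scores: np.ndarray,
                             keep_fraction: float = 0.25) -> np.ndarray:
    """Keep only the top-scoring fraction of a large candidate super-batch;
    just these selected examples are used for the expensive large-model
    training step, which is where the FLOP savings come from."""
    k = max(1, int(keep_fraction * len(super_batch_ids)))
    top_k = np.argsort(scores)[-k:]
    return super_batch_ids[top_k]


# Illustrative usage: in practice the per-example losses would come from
# cheap forward passes of the two small proxy models over the super-batch.
rng = np.random.default_rng(0)
super_batch_ids = np.arange(1024)                      # candidate example ids
learner_losses = rng.uniform(0.0, 5.0, size=1024)      # proxy learner losses
reference_losses = rng.uniform(0.0, 5.0, size=1024)    # proxy reference losses

scores = learnability_scores(learner_losses, reference_losses)
selected = select_prioritized_batch(super_batch_ids, scores, keep_fraction=0.25)
# `selected` holds the example ids that would be fed to the large model's update.
```

The design intuition, as described in the abstract, is that scoring is done entirely with small, cheap proxy models, so the per-example selection overhead stays far below the cost of a training step for the much larger target model.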