Soft Random Sampling: A Theoretical and Empirical Analysis (2311.12727v2)
Abstract: Soft random sampling (SRS) is a simple yet effective approach to the efficient training of large-scale deep neural networks on massive data. In each epoch, SRS selects a subset uniformly at random, with replacement, from the full data set. In this paper, we conduct a theoretical and empirical analysis of SRS. First, we analyze its sampling dynamics, including data coverage and occupancy. Next, we investigate its convergence with non-convex objective functions and derive the convergence rate. Finally, we analyze its generalization performance. We empirically evaluate SRS on image recognition with CIFAR10 and on automatic speech recognition with Librispeech and an in-house payload data set to demonstrate its effectiveness. Compared with existing coreset-based data selection methods, SRS offers a better accuracy-efficiency trade-off. In particular, on real-world industrial-scale data sets, it proves to be a powerful training strategy, delivering significant speedups and competitive performance at almost no additional computing cost.
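As a rough illustration of the per-epoch sampling step described above, the following minimal Python sketch draws a with-replacement subset of the training indices at the start of each epoch. The helper name `srs_indices`, the selection ratio of 0.5, and the PyTorch `DataLoader`/`Subset` wiring are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from torch.utils.data import DataLoader, Subset

def srs_indices(num_examples: int, ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Draw floor(ratio * num_examples) indices uniformly at random WITH replacement."""
    k = int(ratio * num_examples)
    return rng.integers(low=0, high=num_examples, size=k)

# Hypothetical per-epoch usage inside an ordinary training loop:
# rng = np.random.default_rng(seed=0)
# for epoch in range(num_epochs):
#     idx = srs_indices(len(train_set), ratio=0.5, rng=rng)   # fresh subset every epoch
#     loader = DataLoader(Subset(train_set, idx.tolist()), batch_size=64, shuffle=True)
#     for batch in loader:
#         ...  # standard SGD/Adam step on the sampled subset only
```

Because the draw is with replacement, some examples appear multiple times in an epoch while others are skipped, which is exactly the coverage and occupancy behavior the paper analyzes.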