A Negative Result on Gradient Matching for Selective Backprop (2312.05021v1)

Published 8 Dec 2023 in cs.LG, cs.AI, and math.OC

Abstract: With increasing scale in model and dataset size, the training of deep neural networks becomes a massive computational burden. One approach to speed up the training process is Selective Backprop. For this approach, we perform a forward pass to obtain a loss value for each data point in a minibatch. The backward pass is then restricted to a subset of that minibatch, prioritizing high-loss examples. We build on this approach, but seek to improve the subset selection mechanism by choosing the (weighted) subset which best matches the mean gradient over the entire minibatch. We use the gradients w.r.t. the model's last layer as a cheap proxy, resulting in virtually no overhead beyond the forward pass. At the same time, for our experiments we add a simple random-selection baseline which has been absent from prior work. Surprisingly, we find that both the loss-based and the gradient-matching strategies fail to consistently outperform the random baseline.
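To make the procedure concrete, below is a minimal PyTorch sketch of one training step with the three selection strategies mentioned in the abstract (loss-based, gradient-matching on a last-layer proxy, and random). This is not the authors' implementation: the function names (`selective_backprop_step`, `greedy_gradient_matching`, `last_layer_grad_proxy`), the greedy matching-pursuit-style selection, and all hyperparameters are illustrative assumptions, and the sketch selects an unweighted subset rather than the weighted subset described in the paper.

```python
# Illustrative sketch of Selective Backprop with three subset-selection
# strategies; assumes a classification model trained with cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F


def last_layer_grad_proxy(logits, targets):
    # For cross-entropy, the gradient of the loss w.r.t. the logits is
    # softmax(logits) - one_hot(targets), so a per-example gradient proxy
    # can be read off the forward pass at essentially no extra cost.
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    return probs - one_hot  # shape: (batch, num_classes)


def greedy_gradient_matching(grad_proxy, k):
    # Greedy, matching-pursuit-style stand-in for gradient matching:
    # repeatedly pick the example whose proxy gradient best aligns with the
    # residual of the minibatch mean gradient, then deflate the residual.
    residual = grad_proxy.mean(dim=0).clone()
    taken = torch.zeros(grad_proxy.size(0), dtype=torch.bool)
    chosen = []
    for _ in range(k):
        scores = grad_proxy @ residual
        scores[taken] = float("-inf")
        idx = int(torch.argmax(scores))
        chosen.append(idx)
        taken[idx] = True
        g = grad_proxy[idx]
        residual = residual - (residual @ g) / (g @ g + 1e-12) * g
    return torch.tensor(chosen)


def selective_backprop_step(model, optimizer, x, y, k, strategy="grad_match"):
    # Cheap full-batch forward pass (no autograd graph) to score examples.
    with torch.no_grad():
        logits = model(x)
        losses = F.cross_entropy(logits, y, reduction="none")

        if strategy == "loss":          # loss-based Selective Backprop
            idx = torch.topk(losses, k).indices
        elif strategy == "grad_match":  # last-layer gradient-matching heuristic
            idx = greedy_gradient_matching(last_layer_grad_proxy(logits, y), k)
        else:                           # random-selection baseline
            idx = torch.randperm(x.size(0))[:k]

    # Forward + backward only on the selected subset.
    optimizer.zero_grad()
    F.cross_entropy(model(x[idx]), y[idx]).backward()
    optimizer.step()
    return idx


if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(128, 3, 32, 32), torch.randint(0, 10, (128,))
    for strategy in ("loss", "grad_match", "random"):
        selective_backprop_step(model, opt, x, y, k=32, strategy=strategy)
```

In this sketch the subset is re-forwarded with autograd enabled, so the saving comes from running the expensive backward pass on only k of the original examples; the scoring forward pass is done under `torch.no_grad()`.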

