
Multiple importance sampling for stochastic gradient estimation (2407.15525v1)

Published 22 Jul 2024 in cs.LG and stat.ML

Abstract: We introduce a theoretical and practical framework for efficient importance sampling of mini-batch samples for gradient estimation from single and multiple probability distributions. To handle noisy gradients, our framework dynamically evolves the importance distribution during training by utilizing a self-adaptive metric. Our framework combines multiple, diverse sampling distributions, each tailored to specific parameter gradients. This approach facilitates the importance sampling of vector-valued gradient estimation. Rather than naively combining multiple distributions, our framework involves optimally weighting data contribution across multiple distributions. This adapted combination of multiple importance yields superior gradient estimates, leading to faster training convergence. We demonstrate the effectiveness of our approach through empirical evaluations across a range of optimization tasks like classification and regression on both image and point cloud datasets.

Authors (5)
  1. Corentin Salaün (6 papers)
  2. Xingchang Huang (7 papers)
  3. Iliyan Georgiev (21 papers)
  4. Niloy J. Mitra (83 papers)
  5. Gurprit Singh (18 papers)

Summary

Multiple Importance Sampling for Stochastic Gradient Estimation: A Technical Overview

The paper "Multiple importance sampling for stochastic gradient estimation" introduces a novel technique designed to enhance the efficiency and accuracy of gradient estimation in optimization tasks by leveraging importance sampling (IS) and multiple importance sampling (MIS) methodologies.

Introduction

Accurate and efficient gradient estimation remains a key challenge for stochastic gradient descent (SGD): mini-batch sampling introduces noise into the gradient estimates. Traditional ways to mitigate this noise include adaptive mini-batch sizing, momentum-based techniques, and conventional importance sampling. This paper extends the latter with a framework that integrates MIS for vector-valued gradient estimation by combining multiple sampling distributions.
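For reference, a single-distribution importance-sampled mini-batch gradient takes the standard form below (our notation; the paper's own formulas are not reproduced in this summary):

```latex
% Draw B indices i_b from a distribution p over the N training samples and
% reweight each per-sample gradient by 1/(N p_{i_b}); the estimate of the
% mean-loss gradient then remains unbiased under non-uniform sampling.
\widehat{\nabla L}(\theta)
  = \frac{1}{B} \sum_{b=1}^{B} \frac{\nabla_\theta \ell_{i_b}(\theta)}{N\, p_{i_b}},
  \qquad i_b \sim p .
```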

Key Contributions

The contributions of the paper can be summarized as follows:

  1. Efficient IS Algorithm: An IS algorithm whose importance distribution evolves dynamically during training via a self-adaptive metric, reducing the overhead common in existing IS methods.
  2. MIS Estimator for Vector-Valued Gradients: An MIS estimator suited to vector-valued gradient estimation, in contrast to traditional IS schemes that rely on a single scalar importance value per sample.
  3. Optimal Weight Computation: A practical approach for computing combination weights that maximize gradient-estimation quality, based on principles from optimal MIS (OMIS).
  4. Empirical Validation: Extensive empirical evaluations demonstrating superior performance over traditional SGD and other IS methods such as DLIS.

Methodology

Mini-Batch IS

The proposed mini-batch IS algorithm (Algorithm 1) maintains and updates a set of per-sample importance values dynamically during training. The importance function is derived from the output layer gradients, thus providing a computationally efficient approximation without needing additional forward passes for each sample.
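A minimal sketch of this idea is shown below (our illustration, not the paper's exact Algorithm 1; the `loss_and_grads` callback, the update rate `alpha`, and other names are assumptions):

```python
import numpy as np

def sample_minibatch(importance, batch_size, rng):
    """Draw a mini-batch with probability proportional to per-sample importance."""
    p = importance / importance.sum()
    idx = rng.choice(len(importance), size=batch_size, replace=True, p=p)
    return idx, p[idx]

def is_sgd_step(params, data, labels, importance, loss_and_grads,
                batch_size=32, lr=1e-2, alpha=0.1, rng=None):
    """One importance-sampled SGD step with a self-adaptive importance update.

    `loss_and_grads` is an assumed callback returning per-sample parameter
    gradients (shape [B, D]) and per-sample output-layer gradient norms
    (shape [B]).
    """
    if rng is None:
        rng = np.random.default_rng()
    idx, q = sample_minibatch(importance, batch_size, rng)
    grads, out_grad_norms = loss_and_grads(params, data[idx], labels[idx])
    # Reweight by 1/(N q_i) so the estimate of the mean gradient stays
    # unbiased under non-uniform sampling.
    w = 1.0 / (len(importance) * q)
    grad_est = (w[:, None] * grads).mean(axis=0)
    params = params - lr * grad_est
    # Self-adaptive metric: blend the fresh output-layer gradient norms into
    # the stored importance of the samples visited in this step.
    importance[idx] = (1 - alpha) * importance[idx] + alpha * out_grad_norms
    return params, importance
```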

MIS for Vector-Valued Gradients

MIS is adapted to vector-valued gradient estimation by combining multiple importance sampling distributions. The estimator (Eq. 10 in the paper) weights the contributions from the different distributions according to their utility, with the weights computed via OMIS. This combination theoretically reduces estimation variance and speeds up convergence compared to single-distribution IS.
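The paper's exact Eq. 10 is not reproduced in this summary; as a sketch, a weighted MIS combination of J sampling distributions over the N training samples takes the standard form

```latex
% n_j samples are drawn from each distribution p_j; the weighting functions
% w_j sum to one at every sample, which keeps the estimator unbiased, and
% OMIS chooses them to minimize the estimator's variance.
\widehat{\nabla L}(\theta)
  = \sum_{j=1}^{J} \frac{1}{n_j} \sum_{b=1}^{n_j}
    w_j(i_{j,b}) \, \frac{\nabla_\theta \ell_{i_{j,b}}(\theta)}{N\, p_j(i_{j,b})},
  \qquad \sum_{j=1}^{J} w_j(i) = 1 \;\;\text{for every sample } i .
```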

Practical Algorithmic Implementation

The practical implementation of OMIS (Algorithm 3) involves sampling from multiple distributions and solving a linear system to compute optimal weights. Momentum-based accumulation of the linear system components ensures stability and efficacy, even with a limited number of samples per distribution.
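A minimal sketch of this weight-solve step, following the description above (our construction, not the paper's exact Algorithm 3; the assembly of `A_batch` and `b_batch` from the drawn samples is left to an assumed estimation routine):

```python
import numpy as np

class OMISWeightSolver:
    """Momentum-accumulated linear system A w = b whose solution gives the
    per-distribution combination weights for the MIS gradient estimator."""

    def __init__(self, num_dists, momentum=0.9, ridge=1e-6):
        self.A = np.zeros((num_dists, num_dists))
        self.b = np.zeros(num_dists)
        self.momentum = momentum
        self.ridge = ridge

    def update(self, A_batch, b_batch):
        # Exponential moving average of the per-step estimates stabilizes the
        # system when only a few samples are drawn from each distribution.
        m = self.momentum
        self.A = m * self.A + (1 - m) * A_batch
        self.b = m * self.b + (1 - m) * b_batch

    def solve(self):
        # A small ridge term keeps the (num_dists x num_dists) solve well posed.
        A = self.A + self.ridge * np.eye(len(self.b))
        return np.linalg.solve(A, self.b)

# Per training step: estimate A_batch and b_batch from the samples drawn from
# each distribution, then
#   solver.update(A_batch, b_batch)
#   weights = solver.solve()  # used to combine per-distribution gradient estimates
```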

Experimental Results

The experiments demonstrate the effectiveness of the proposed methods across various tasks:

  • Polynomial Regression: OMIS matches the convergence of exact-gradient descent while using significantly fewer samples per mini-batch than classical SGD.
  • Classification Tasks: On datasets like MNIST, CIFAR-10, and CIFAR-100, the proposed IS and OMIS methods achieve comparable or superior classification accuracy and loss reduction compared to DLIS and other baselines. Equal-time evaluations reveal a computational advantage due to lower overhead.
  • Point Cloud Classification: OMIS significantly outperforms other methods in classification accuracy, demonstrating the utility of tailored vector-valued gradient estimation.
  • Image Regression: The OMIS method outperforms other techniques in terms of image fidelity and loss, as depicted in visual results on a 2D image regression task.

Implications and Future Work

This research has both practical and theoretical implications. Practically, the proposed IS and OMIS methods can be applied to a range of machine learning tasks, particularly those involving high-dimensional parameter spaces, improving convergence speed and gradient-estimation accuracy while maintaining computational efficiency.

Theoretically, this work opens avenues for further exploration of MIS strategies tailored to model architectures beyond sequential models, particularly transformer-based networks. Future work could involve dynamically optimizing the sampling distributions, extending the framework to more complex architectures, and integrating adaptive sampling strategies that are robust to noise in the importance estimates.

Conclusion

The paper presents a substantial improvement in gradient estimation for SGD by combining multiple importance sampling distributions. These innovations reduce noise in the gradient estimates, accelerate convergence, and advance prior IS methodologies, laying the groundwork for future exploration of dynamic MIS strategies across broader machine learning applications.
