
Improving Normalization with the James-Stein Estimator (2312.00313v1)

Published 1 Dec 2023 in cs.CV and cs.LG

Abstract: Stein's paradox is a foundational result in high-dimensional statistics: the sample mean, traditionally treated as the default estimator, may not be the most efficient choice in higher dimensions. The James-Stein estimator improves on it by shrinking the sample means toward a more central mean vector. In this paper, we first establish that normalization layers in deep learning use inadmissible estimators for mean and variance. We then introduce a novel method that employs the James-Stein estimator to improve the estimation of mean and variance within normalization layers. We evaluate our method on several computer vision tasks: image classification, semantic segmentation, and 3D object classification. Across these evaluations, our improved normalization layers consistently yield higher accuracy on all tasks without extra computational burden. Moreover, since many shrinkage estimators outperform the traditional estimator, we also study two other prominent shrinkage estimators, Ridge and LASSO, and provide visualizations that intuitively demonstrate the impact of shrinkage on the estimated layer statistics. Finally, we study the effect of regularization and batch size on our modified batch normalization; these studies show that our method is less sensitive to batch size and regularization and improves accuracy under various setups.
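To make the idea concrete, here is a minimal NumPy sketch of the general approach rather than the authors' implementation: the per-channel batch means are shrunk toward their grand mean with a positive-part James-Stein estimator before normalizing. The function names (james_stein_shrink, js_batch_norm) and the plug-in noise variance (average per-channel variance divided by batch size) are illustrative assumptions, and the paper's companion treatment of the variance estimate is omitted for brevity.

    import numpy as np

    def james_stein_shrink(means, mean_noise_var):
        # Positive-part James-Stein shrinkage of a vector of sample means
        # toward their grand mean; needs at least 4 channels (d - 3 > 0).
        d = means.shape[0]
        grand_mean = means.mean()
        centered = means - grand_mean
        sq_norm = float(np.sum(centered ** 2)) + 1e-12  # guard against /0
        factor = max(0.0, 1.0 - (d - 3) * mean_noise_var / sq_norm)
        return grand_mean + factor * centered

    def js_batch_norm(x, eps=1e-5):
        # Batch-norm-style normalization over x of shape (N, C), with the
        # per-channel means replaced by their James-Stein-shrunk estimates.
        n, _ = x.shape
        sample_means = x.mean(axis=0)
        sample_vars = x.var(axis=0)
        # Illustrative plug-in: each sample mean has variance ~ sigma^2 / N.
        mean_noise_var = float(sample_vars.mean()) / n
        js_means = james_stein_shrink(sample_means, mean_noise_var)
        return (x - js_means) / np.sqrt(sample_vars + eps)

    # Usage: 32 samples, 8 channels with randomly offset true means.
    rng = np.random.default_rng(0)
    x = rng.normal(loc=rng.normal(size=8), scale=1.0, size=(32, 8))
    y = js_batch_norm(x)
    print(y.mean(axis=0))  # near zero, not exactly: the means were shrunk

Because the shrunk means are slightly biased toward the grand mean, the normalized outputs are not exactly zero-mean per channel; the paper's claim is that this bias is outweighed by the reduced estimation variance, which is consistent with the reported robustness to small batch sizes.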
