
Scale Equalization for Multi-Level Feature Fusion (2402.01149v1)

Published 2 Feb 2024 in cs.CV

Abstract: Deep neural networks have exhibited remarkable performance in a variety of computer vision fields, especially in semantic segmentation tasks. Their success is often attributed to multi-level feature fusion, which enables them to understand both global and local information from an image. However, we find that multi-level features from parallel branches are on different scales. This scale disequilibrium is a universal and unwanted flaw that leads to detrimental gradient descent, thereby degrading performance in semantic segmentation. We show that scale disequilibrium is caused by bilinear upsampling, a finding supported by both theoretical and empirical evidence. Based on this observation, we propose injecting scale equalizers to achieve scale equilibrium across multi-level features after bilinear upsampling. The proposed scale equalizers are easy to implement, applicable to any architecture, and hyperparameter-free; they incur no extra computational cost and guarantee scale equilibrium for any dataset. Experiments show that adopting scale equalizers consistently improves mIoU across various target datasets, including ADE20K, PASCAL VOC 2012, and Cityscapes, as well as various decoder choices, including UPerHead, PSPHead, ASPPHead, SepASPPHead, and FCNHead.
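
The abstract's claim that interpolation-based upsampling changes feature scale can be illustrated with a small experiment. The sketch below is not the authors' implementation. It uses a 1-D linear interpolation as a stand-in for bilinear upsampling, and it assumes a scale equalizer can be modeled as standardization by a mean and standard deviation. The function names `linear_upsample_2x` and `equalize` are hypothetical, and the statistics are computed from the sample itself, which may differ from how the paper obtains them.

```python
import random
import statistics


def linear_upsample_2x(x):
    """Roughly 2x upsampling of a 1-D signal by linear interpolation.

    Hypothetical 1-D analogue of bilinear upsampling: each inserted
    sample is the average of its two neighbours.
    """
    out = []
    for i in range(len(x) - 1):
        out.append(x[i])
        out.append(0.5 * (x[i] + x[i + 1]))  # averaging shrinks variance
    out.append(x[-1])
    return out


def equalize(x):
    """Assumed form of a scale equalizer: shift to zero mean, unit std."""
    mu = statistics.fmean(x)
    sigma = statistics.pstdev(x)
    return [(v - mu) / sigma for v in x]


random.seed(0)

# Two "branches" with the same unit-variance statistics before fusion.
low_res = [random.gauss(0.0, 1.0) for _ in range(4096)]
high_res = [random.gauss(0.0, 1.0) for _ in range(8191)]

# Upsample the low-resolution branch so both have equal length.
up = linear_upsample_2x(low_res)

var_high = statistics.pvariance(high_res)
var_up = statistics.pvariance(up)
var_up_eq = statistics.pvariance(equalize(up))

print("high-res branch variance:    ", var_high)
print("upsampled branch variance:   ", var_up)
print("after equalizing upsampled:  ", var_up_eq)
```

For independent unit-variance inputs, each interpolated midpoint has variance 1/2, so the upsampled branch is expected to have a variance near 3/4 while the native high-resolution branch stays near 1. Standardizing the upsampled branch restores unit variance, which is the kind of equilibrium the abstract describes.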

