Frequency-Adaptive Dilated Convolution for Semantic Segmentation (2403.05369v6)
Abstract: Dilated convolution, which expands the receptive field by inserting gaps between its consecutive elements, is widely employed in computer vision. In this study, we propose three strategies to improve individual phases of dilated convolution from the view of spectrum analysis. Departing from the conventional practice of fixing a global dilation rate as a hyperparameter, we introduce Frequency-Adaptive Dilated Convolution (FADC), which dynamically adjusts dilation rates spatially based on local frequency components. Subsequently, we design two plug-in modules to directly enhance effective bandwidth and receptive field size. The Adaptive Kernel (AdaKern) module decomposes convolution weights into low-frequency and high-frequency components, dynamically adjusting the ratio between these components on a per-channel basis. By increasing the high-frequency part of convolution weights, AdaKern captures more high-frequency components, thereby improving effective bandwidth. The Frequency Selection (FreqSelect) module optimally balances high- and low-frequency components in feature representations through spatially variant reweighting. It suppresses high frequencies in the background to encourage FADC to learn a larger dilation, thereby increasing the receptive field for an expanded scope. Extensive experiments on segmentation and object detection consistently validate the efficacy of our approach. The code is publicly available at https://github.com/Linwei-Chen/FADC.
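The AdaKern idea of splitting convolution weights into low- and high-frequency parts and rebalancing them can be illustrated with a small sketch. The abstract does not specify how the split is computed, so the version below rests on an assumption: the low-frequency part is taken to be the kernel's mean value (its constant component) and the high-frequency part is the zero-mean residual. The function names `split_kernel` and `reweight_kernel` and the scalars `lam_low` and `lam_high` are hypothetical. In FADC the ratio is adjusted dynamically per channel, whereas here it is passed in as two fixed numbers for a single 2D kernel.

```python
def split_kernel(kernel):
    """Split a 2D kernel (list of rows) into a constant part and a zero-mean residual.

    Assumption: the "low-frequency" part is the kernel mean and the
    "high-frequency" part is whatever remains after subtracting it.
    """
    flat = [w for row in kernel for w in row]
    mean = sum(flat) / len(flat)
    low = [[mean for _ in row] for row in kernel]
    high = [[w - mean for w in row] for row in kernel]
    return low, high


def reweight_kernel(kernel, lam_low, lam_high):
    """Recombine the two parts with hypothetical scalar weights lam_low and lam_high."""
    low, high = split_kernel(kernel)
    return [
        [lam_low * l + lam_high * h for l, h in zip(low_row, high_row)]
        for low_row, high_row in zip(low, high)
    ]


if __name__ == "__main__":
    kernel = [[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]]
    # lam_low = lam_high = 1 reproduces the original kernel;
    # lam_high > 1 amplifies the zero-mean (high-frequency) part.
    for row in reweight_kernel(kernel, lam_low=1.0, lam_high=2.0):
        print(row)
```

With `lam_low = lam_high = 1` the kernel is unchanged, so the reweighting can only shift the balance between the two parts. Raising `lam_high` strengthens the kernel's response to rapid local variation while leaving its mean response untouched, which is the sense in which increasing the high-frequency part of the weights can widen the effective bandwidth.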