Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space Models (2403.17902v3)
Abstract: The landscape of computational building blocks of efficient image restoration architectures is dominated by a combination of convolutional processing and various attention mechanisms. However, convolutional filters, while efficient, are inherently local and therefore struggle to model long-range dependencies in images. In contrast, attention excels at capturing global interactions between arbitrary image regions, but suffers from a quadratic cost in image dimension. In this work, we propose Serpent, an efficient architecture for high-resolution image restoration that combines recent advances in state space models (SSMs) with multi-scale signal processing in its core computational block. SSMs, originally introduced for sequence modeling, can maintain a global receptive field with favorable linear scaling in input size. We propose a novel hierarchical architecture, inspired by traditional signal processing principles, that converts the input image into a collection of sequences and processes them in a multi-scale fashion. Our experimental results demonstrate that Serpent can achieve reconstruction quality on par with state-of-the-art techniques while requiring orders of magnitude less compute (up to a $150\times$ reduction in FLOPs) and up to $5\times$ less GPU memory, all while maintaining a compact model size. The efficiency gains achieved by Serpent are especially notable at high image resolutions.
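To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not Serpent's actual architecture: a diagonal SSM recurrence that processes a sequence in time linear in its length (unlike attention's quadratic cost), and a toy multi-scale pipeline that flattens an image into a raster-order sequence at each scale. All function names and parameter values here are illustrative placeholders, not learned weights from the paper.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Linear-time scan of a diagonal state space model over a 1-D sequence.

    x: (L, D) input sequence; a, b, c: (D,) per-channel parameters.
    Recurrence: h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t.
    Cost is O(L * D), linear in sequence length, and each output depends on
    the entire prefix of the sequence (a global receptive field).
    """
    L, D = x.shape
    h = np.zeros(D)
    y = np.empty_like(x)
    for t in range(L):
        h = a * h + b * x[t]
        y[t] = c * h
    return y

def multiscale_restore(img, num_scales=3):
    """Toy multi-scale processing in the spirit described by the abstract:
    at each scale, flatten the image into a raster-order sequence, run the
    SSM scan, reshape back, then downsample by 2 for the next coarser scale.
    The parameters below are fixed placeholders chosen for stability."""
    outputs = []
    x = img
    for _ in range(num_scales):
        H, W, C = x.shape
        seq = x.reshape(H * W, C)            # image -> sequence
        a = np.full(C, 0.9)                  # stable decay (|a| < 1)
        b = np.full(C, 0.1)
        c = np.ones(C)
        y = ssm_scan(seq, a, b, c).reshape(H, W, C)
        outputs.append(y)
        x = x[::2, ::2]                      # 2x spatial downsampling
    return outputs
```

Because the per-step state `h` is a fixed-size vector, memory does not grow with sequence length either, which is the property that makes SSM-based blocks attractive at high image resolutions.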