Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space Models (2403.17902v3)

Published 26 Mar 2024 in eess.IV, cs.CV, and cs.LG

Abstract: The landscape of computational building blocks of efficient image restoration architectures is dominated by a combination of convolutional processing and various attention mechanisms. However, convolutional filters, while efficient, are inherently local and therefore struggle with modeling long-range dependencies in images. In contrast, attention excels at capturing global interactions between arbitrary image regions, but suffers from a quadratic cost in image dimension. In this work, we propose Serpent, an efficient architecture for high-resolution image restoration that combines recent advances in state space models (SSMs) with multi-scale signal processing in its core computational block. SSMs, originally introduced for sequence modeling, can maintain a global receptive field with favorable linear scaling in input size. We propose a novel hierarchical architecture inspired by traditional signal processing principles that converts the input image into a collection of sequences and processes them in a multi-scale fashion. Our experimental results demonstrate that Serpent achieves reconstruction quality on par with state-of-the-art techniques while requiring orders of magnitude less compute (up to a $150\times$ reduction in FLOPs) and up to $5\times$ less GPU memory, all while maintaining a compact model size. The efficiency gains achieved by Serpent are especially notable at high image resolutions.
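To make the core computational idea concrete, below is a minimal PyTorch sketch of the approach the abstract describes: flatten the image into a 1-D sequence, mix it with a linear-time state space scan, and repeat at multiple scales in a hierarchy. All names (`SimpleDiagonalSSM`, `SerpentBlockSketch`, `TwoScaleSketch`) and design details here are illustrative assumptions, not the authors' implementation; in particular, the naive Python loop stands in for the fast structured scans (S4/Mamba-style) used in practice.

```python
# Hypothetical sketch (not the authors' code): image -> sequence -> SSM scan
# -> image, nested in a two-scale hierarchy with pixel-(un)shuffle resampling.
import torch
import torch.nn as nn


class SimpleDiagonalSSM(nn.Module):
    """Toy diagonal state space layer: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    Real structured SSMs use learned continuous-time parameterizations and
    fast scan algorithms; this loop only illustrates the linear-time,
    global-receptive-field property."""

    def __init__(self, dim: int):
        super().__init__()
        self.a = nn.Parameter(torch.full((dim,), 0.9))  # per-channel decay
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim)
        h = torch.zeros(x.shape[0], x.shape[2], device=x.device)
        ys = []
        for t in range(x.shape[1]):  # O(L) scan over the sequence
            h = self.a * h + self.b * x[:, t]
            ys.append(self.c * h)
        return torch.stack(ys, dim=1)


class SerpentBlockSketch(nn.Module):
    """Hypothetical core block: raster-scan the image, mix it with an SSM."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ssm = SimpleDiagonalSSM(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W) -> sequence of length H*W
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)       # (batch, H*W, channels)
        seq = seq + self.ssm(self.norm(seq))     # residual SSM mixing
        return seq.transpose(1, 2).reshape(b, c, h, w)


class TwoScaleSketch(nn.Module):
    """Two-level hierarchy: process at full and half resolution."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.fine = SerpentBlockSketch(dim)
        self.down = nn.PixelUnshuffle(2)          # dim -> 4*dim, H/2 x W/2
        self.coarse = SerpentBlockSketch(4 * dim)
        self.up = nn.PixelShuffle(2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fine(x)
        return x + self.up(self.coarse(self.down(x)))  # coarse branch as residual


if __name__ == "__main__":
    net = TwoScaleSketch(dim=16)
    img = torch.randn(1, 16, 32, 32)
    print(net(img).shape)  # torch.Size([1, 16, 32, 32])
```

Running the script prints `torch.Size([1, 16, 32, 32])`, confirming the coarse branch returns to full resolution. The point of the sketch is that every operation is linear in the number of pixels, in contrast to the quadratic cost of global attention.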
