MambaVC: Learned Visual Compression with Selective State Spaces
Abstract: Learned visual compression is an important and active task in multimedia. Existing approaches have explored various CNN- and Transformer-based designs to model content distribution and eliminate redundancy, but balancing efficacy (i.e., the rate-distortion trade-off) against efficiency remains a challenge. Recently, state-space models (SSMs) have shown promise due to their long-range modeling capacity and efficiency. Inspired by this, we take the first step in exploring SSMs for visual compression. We introduce MambaVC, a simple, strong, and efficient compression network based on SSMs. MambaVC develops a visual state space (VSS) block with a 2D selective scanning (2DSS) module that serves as the nonlinear activation function after each downsampling step, which helps capture informative global contexts and enhances compression. On compression benchmark datasets, MambaVC achieves superior rate-distortion performance with lower computational and memory overheads. Specifically, it outperforms CNN and Transformer variants by 9.3% and 15.6% on Kodak, respectively, while reducing computation by 42% and 24%, and saving 12% and 71% of memory. MambaVC shows even greater improvements on high-resolution images, highlighting its potential and scalability in real-world applications. We also provide a comprehensive comparison of different network designs, underscoring MambaVC's advantages. Code is available at https://github.com/QinSY123/2024-MambaVC.
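To make the idea of scanning a 2D feature map for global context more concrete, here is a minimal NumPy sketch. It is not the MambaVC implementation: the function names (`scan_orders`, `linear_recurrence`, `scan_2d`) are hypothetical, the use of four traversal directions is an assumption, and a fixed scalar `decay` stands in for the input-dependent (selective) SSM parameters that a real 2DSS module would learn.

```python
import numpy as np

def scan_orders(h, w):
    """Return four traversal orders over an h x w grid (assumed set of directions)."""
    idx = np.arange(h * w).reshape(h, w)
    row_major = idx.reshape(-1)      # left-to-right, top-to-bottom
    col_major = idx.T.reshape(-1)    # top-to-bottom, left-to-right
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

def linear_recurrence(seq, decay):
    """Toy state update s_t = decay * s_{t-1} + x_t (fixed decay, NOT selective)."""
    out = np.empty_like(seq)
    state = np.zeros(seq.shape[1], dtype=seq.dtype)
    for t in range(seq.shape[0]):
        state = decay * state + seq[t]
        out[t] = state
    return out

def scan_2d(x, decay=0.9):
    """Scan an (H, W, C) map along four directions and average the results."""
    x = np.asarray(x, dtype=np.float64)
    h, w, c = x.shape
    flat = x.reshape(h * w, c)
    merged = np.zeros_like(flat)
    for order in scan_orders(h, w):
        scanned = linear_recurrence(flat[order], decay)
        merged[order] += scanned     # write each result back to its pixel position
    return (merged / 4.0).reshape(h, w, c)

# Usage: a single impulse at the center of the map reaches every position.
impulse = np.zeros((5, 5, 1))
impulse[2, 2, 0] = 1.0
response = scan_2d(impulse)
```

In this toy setting, the forward and reversed traversals together let a single pixel influence all other positions in one pass, which is the kind of global context the abstract refers to. A convolution of the same depth would only reach a local neighborhood.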