MambaVC: Learned Visual Compression with Selective State Spaces (2405.15413v3)

Published 24 May 2024 in eess.IV, cs.CV, cs.IT, and math.IT

Abstract: Learned visual compression is an important and active task in multimedia. Existing approaches have explored various CNN- and Transformer-based designs to model content distribution and eliminate redundancy, where balancing efficacy (i.e., rate-distortion trade-off) and efficiency remains a challenge. Recently, state-space models (SSMs) have shown promise due to their long-range modeling capacity and efficiency. Inspired by this, we take the first step to explore SSMs for visual compression. We introduce MambaVC, a simple, strong and efficient compression network based on SSM. MambaVC develops a visual state space (VSS) block with a 2D selective scanning (2DSS) module as the nonlinear activation function after each downsampling, which helps to capture informative global contexts and enhances compression. On compression benchmark datasets, MambaVC achieves superior rate-distortion performance with lower computational and memory overheads. Specifically, it outperforms CNN and Transformer variants by 9.3% and 15.6% on Kodak, respectively, while reducing computation by 42% and 24%, and saving 12% and 71% of memory. MambaVC shows even greater improvements with high-resolution images, highlighting its potential and scalability in real-world applications. We also provide a comprehensive comparison of different network designs, underscoring MambaVC's advantages. Code is available at https://github.com/QinSY123/2024-MambaVC.
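The abstract's central architectural idea is to place a visual state space (VSS) block, containing a 2D selective scanning (2DSS) module, after each downsampling stage, where it acts as the nonlinear activation. The sketch below is a minimal, illustrative PyTorch rendering of that idea under stated assumptions, not the authors' implementation: the `Toy2DSS` module, its two-direction row-major scan, and all layer names and hyperparameters (`state_dim`, GroupNorm as a LayerNorm stand-in) are assumptions made here for brevity; MambaVC builds on VMamba-style blocks with a four-direction cross-scan and an optimized selective-scan kernel.

```python
# Illustrative sketch only (not the authors' code): a simplified VSS-style block
# with a toy 2-direction selective scan, applied after strided-conv downsampling,
# mirroring the role the abstract describes for the 2DSS module.
import torch
import torch.nn as nn
import torch.nn.functional as F


def selective_scan(x, delta, A, B, C):
    """Sequential (slow but readable) selective scan.

    x:     (Bsz, L, D)  input sequence
    delta: (Bsz, L, D)  input-dependent step sizes
    A:     (D, N)       state transition (negative real parts)
    B, C:  (Bsz, L, N)  input-dependent input/output projections
    returns (Bsz, L, D)
    """
    Bsz, L, D = x.shape
    N = A.shape[-1]
    h = x.new_zeros(Bsz, D, N)
    ys = []
    for t in range(L):
        dA = torch.exp(delta[:, t, :, None] * A)          # discretized transition
        dB = delta[:, t, :, None] * B[:, t, None, :]       # discretized input map
        h = dA * h + dB * x[:, t, :, None]                  # recurrent state update
        ys.append((h * C[:, t, None, :]).sum(-1))           # read-out
    return torch.stack(ys, dim=1)


class Toy2DSS(nn.Module):
    """Toy 2D selective scan: row-major forward and backward scans, averaged."""

    def __init__(self, dim, state_dim=16):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(dim, state_dim))
        self.proj = nn.Linear(dim, dim + 2 * state_dim)      # produces delta, B, C
        self.state_dim = state_dim

    def forward(self, x):                                    # x: (Bsz, C, H, W)
        Bsz, C, H, W = x.shape
        seq = x.flatten(2).transpose(1, 2)                   # (Bsz, H*W, C)
        delta, Bm, Cm = self.proj(seq).split(
            [C, self.state_dim, self.state_dim], dim=-1)
        delta = F.softplus(delta)
        A = -torch.exp(self.A_log)                           # keep the scan stable
        y_fwd = selective_scan(seq, delta, A, Bm, Cm)
        y_bwd = selective_scan(seq.flip(1), delta.flip(1), A,
                               Bm.flip(1), Cm.flip(1)).flip(1)
        y = 0.5 * (y_fwd + y_bwd)
        return y.transpose(1, 2).reshape(Bsz, C, H, W)


class VSSBlock(nn.Module):
    """Downsample, then apply the 2DSS module as the nonlinearity (per the abstract)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.norm = nn.GroupNorm(1, out_ch)                  # stand-in for LayerNorm
        self.ss2d = Toy2DSS(out_ch)

    def forward(self, x):
        x = self.down(x)
        return x + self.ss2d(self.norm(x))                   # residual VSS stage


if __name__ == "__main__":
    img = torch.randn(1, 3, 64, 64)
    block = VSSBlock(3, 32)
    print(block(img).shape)                                  # torch.Size([1, 32, 32, 32])
```

In a full compression network, several such downsample-plus-VSS stages would form the analysis transform of a learned image-compression autoencoder; the entropy model and synthesis (decoder) path, which the paper also requires, are omitted from this sketch.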
