
VL-Mamba: Exploring State Space Models for Multimodal Learning (2403.13600v1)

Published 20 Mar 2024 in cs.CV

Abstract: Multimodal LLMs (MLLMs) have attracted widespread interest and have rich applications. However, the attention mechanism inherent in their Transformer architecture has quadratic complexity in sequence length, resulting in expensive computational overhead. Therefore, in this work, we propose VL-Mamba, a multimodal LLM based on state space models, which have been shown to have great potential for long-sequence modeling with fast inference and linear scaling in sequence length. Specifically, we first replace the transformer-based backbone language model, such as LLaMA or Vicuna, with the pre-trained Mamba language model. Then, we empirically explore how to effectively apply the 2D vision selective scan mechanism for multimodal learning, as well as combinations of different vision encoders and variants of pretrained Mamba language models. Extensive experiments on diverse multimodal benchmarks show the competitive performance and effectiveness of the proposed VL-Mamba and demonstrate the great potential of applying state space models to multimodal learning tasks.
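For readers unfamiliar with why state space models scale linearly in sequence length, the sketch below shows a simplified single-channel SSM recurrence of the kind that underlies Mamba-style selective scans. It is a minimal illustration in NumPy under stated assumptions: the diagonal dynamics, the toy zero-order-hold/Euler discretization, and all variable names are illustrative choices, not the paper's implementation, which uses input-dependent parameters and a hardware-aware parallel scan.

```python
import numpy as np

def ssm_scan(x, A, B, C, delta):
    """Minimal single-channel SSM recurrence (illustrative sketch only).

    x:     (L,)  input sequence
    A:     (N,)  diagonal continuous-time state dynamics
    B, C:  (N,)  input / output projections
    delta: (L,)  per-step discretization step sizes
                 (in Mamba these are input-dependent, i.e. "selective")

    Returns y of shape (L,), computed in O(L * N) time -- linear in
    sequence length, unlike the O(L^2) cost of self-attention.
    """
    L, N = x.shape[0], A.shape[0]
    h = np.zeros(N)          # hidden state
    y = np.empty(L)
    for t in range(L):
        # Discretize the continuous SSM for this step.
        A_bar = np.exp(delta[t] * A)   # zero-order-hold on diagonal A
        B_bar = delta[t] * B           # simple Euler-style approximation
        h = A_bar * h + B_bar * x[t]   # state update
        y[t] = np.dot(C, h)            # readout
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, N = 16, 4
    y = ssm_scan(
        x=rng.normal(size=L),
        A=-np.abs(rng.normal(size=N)),  # negative eigenvalues keep the state stable
        B=rng.normal(size=N),
        C=rng.normal(size=N),
        delta=np.full(L, 0.1),
    )
    print(y.shape)  # (16,)
```

The recurrence visits each token once and carries a fixed-size state, which is the property the abstract appeals to when motivating the replacement of the quadratic-cost attention backbone; extending this idea to images (the paper's 2D vision selective scan) amounts to choosing scan orders over the patch grid rather than a single left-to-right pass.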

Authors (8)
  1. Yanyuan Qiao (20 papers)
  2. Zheng Yu (29 papers)
  3. Longteng Guo (31 papers)
  4. Sihan Chen (39 papers)
  5. Zijia Zhao (17 papers)
  6. Mingzhen Sun (10 papers)
  7. Qi Wu (323 papers)
  8. Jing Liu (526 papers)
Citations (39)