Matryoshka Multimodal Models (2405.17430v2)

Published 27 May 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed large number of visual tokens and then feed them into an LLM. However, this design causes an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While token pruning/merging methods do exist, they produce a single length output for each image and do not afford flexibility in trading off information density vs. efficiency. Inspired by the concept of Matryoshka Dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) One can explicitly control the visual granularity per test instance during inference, e.g., adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M3 provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) Our approach provides a foundation to explore the best trade-off between performance and visual token length at the sample level, where our investigation reveals that a large gap exists between the oracle upper bound and current fixed-scale representations.

Matryoshka Multimodal Models: Enhancing Efficiency and Flexibility in Visual Token Representation

Large Multimodal Models (LMMs) have demonstrated significant progress in visual-linguistic reasoning tasks. Traditional models like LLaVA embed input images into a fixed number of visual tokens for subsequent processing by an LLM. However, this approach leads to inefficiencies, especially in dense visual contexts such as high-resolution images and long videos. The authors propose Matryoshka Multimodal Models (M3), which represent visual content with nested sets of visual tokens capturing information from coarse to fine granularities. This summary explores the methodology, implications, and performance of M3, providing insights into its contributions and potential future developments.

Methodology

The central innovation of M3 is the representation of visual content as multiple nested sets of visual tokens, enabling explicit control over the visual granularity at inference time. This methodology draws inspiration from Matryoshka dolls, in which larger dolls enclose progressively smaller ones. Specifically, M3 modifies the visual token generation process by pooling tokens in a hierarchical manner, producing token sets of varying granularity that can be selectively used based on the complexity of the visual input.
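The nested sets can be built by average-pooling the encoder's spatial token grid to progressively coarser resolutions. The sketch below illustrates this idea in PyTorch; the 24x24 grid (576 tokens, as produced by a CLIP ViT-L/14 encoder at 336px resolution in LLaVA-style models) and the use of adaptive average pooling are assumptions chosen to reproduce the paper's reported scales of 576, 144, 36, 9, and 1 tokens, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def nested_visual_tokens(tokens: torch.Tensor, grid_sizes=(24, 12, 6, 3, 1)):
    """Build coarse-to-fine nested token sets from a flat sequence of visual tokens.

    tokens: (B, N, D) visual features from the image encoder, where N = g*g is a
            square grid (e.g. 576 = 24*24 for CLIP ViT-L/14 at 336px, as in LLaVA).
    Returns a dict mapping token count -> (B, n, D) pooled tokens.
    """
    B, N, D = tokens.shape
    g = int(N ** 0.5)
    assert g * g == N, "expected a square token grid"
    grid = tokens.transpose(1, 2).reshape(B, D, g, g)         # (B, D, g, g)

    scales = {}
    for s in grid_sizes:
        pooled = F.adaptive_avg_pool2d(grid, output_size=s)   # average-pool to s x s
        scales[s * s] = pooled.flatten(2).transpose(1, 2)     # back to (B, s*s, D)
    return scales

# Granularity can then be chosen per test instance at inference time:
feats = torch.randn(1, 576, 1024)   # stand-in for CLIP visual features
nested = nested_visual_tokens(feats)
coarse = nested[9]                  # 9 tokens for a simple image
fine = nested[576]                  # all 576 tokens for a dense scene
```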

The training objective is straightforward yet effective: maximize the likelihood of the predicted tokens matching the ground-truth answers, averaged over all scales of visual tokens. No learnable parameters are added beyond those already in the visual encoder and LLM; instead, the existing architecture is optimized to accommodate and leverage the hierarchical token representations.
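For concreteness, the scale-averaged objective might look like the following sketch. The `llm(...)` interface here is hypothetical, standing in for any LLaVA-style model that prepends visual tokens to the text sequence and returns a standard cross-entropy language-modeling loss; the point is simply that the same loss is computed at every scale and averaged, with no new parameters.

```python
import torch

def m3_training_loss(llm, text_inputs, labels, nested_tokens):
    """Average the next-token prediction loss over all visual-token scales.

    nested_tokens: dict mapping token count -> (B, n, D) visual tokens,
                   e.g. {576: ..., 144: ..., 36: ..., 9: ..., 1: ...}.
    llm: hypothetical wrapper returning the usual LM cross-entropy loss when
         the given visual tokens are prepended to the text sequence.
    """
    losses = []
    for n_tokens, vis in nested_tokens.items():
        out = llm(visual_tokens=vis, text_inputs=text_inputs, labels=labels)
        losses.append(out.loss)              # standard language-modeling loss
    return torch.stack(losses).mean()        # objective averaged over scales
```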

Experimental Evaluation

The performance of M3 was evaluated on several benchmarks focusing on both image and video understanding tasks. Notably, the results demonstrated that M3 achieved comparable or superior performance to existing models while offering significant efficiency gains. For instance, in the MMBench evaluation, M3 with 9 tokens per image performed on par with models using far more tokens, such as Qwen-VL-Chat with 256 tokens.

In video understanding tasks, M3 showcased its ability to maintain performance while reducing the number of tokens. Interestingly, certain video tasks benefited from the compact representation offered by M3, where models using fewer tokens outperformed those using the full token set.

Implications and Future Directions

The implications of M3 span both practical and theoretical dimensions. Practically, the ability to adjust the granularity of visual tokens dynamically allows for more efficient deployment of LMMs, particularly in resource-constrained environments. This flexibility is especially valuable for applications involving high-resolution images or long videos, where traditional fixed-length models are inefficient.

Theoretically, M3 highlights the potential of hierarchical representations in enhancing model performance and efficiency. It provides a foundation for further exploration into adaptive token length strategies and the underlying biases in visual benchmarks. The significant gap between models using the full token set and the oracle upper bound suggests that there is considerable room for optimization, potentially through the development of sophisticated token length predictors.

Conclusions

The introduction of M3 marks a significant step forward in the efficient representation of visual content within LMMs. The model's ability to dynamically adjust visual granularity during inference offers both improved efficiency and competitive performance, and the results across various benchmarks affirm the robustness and flexibility of M3. Future research can build on these findings to further optimize token usage and extend the principles of hierarchical representation to other domains, such as text and dense vision tasks.

Authors (4)
  1. Mu Cai
  2. Jianwei Yang
  3. Jianfeng Gao
  4. Yong Jae Lee