Matryoshka Query Transformer for Large Vision-Language Models (2405.19315v2)

Published 29 May 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m <= M latent query tokens and train the model using only these first m tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576. Reducing to 16 tokens (8x less TFLOPs) only sacrifices performance by 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6%, respectively. Our exploration of the trade-off between accuracy and computational cost brought about by the number of visual tokens facilitates future research to achieve the best of both worlds.

Flexibly Adapting Visual Token Budgets: An Analysis of Matryoshka Query Transformers in Vision-Language Models

Overview

The paper introduces the Matryoshka Query Transformer (MQT) to address the challenge of fixed visual token budgets in Large Vision-Language Models (LVLMs). Traditional LVLMs encode every image into the same fixed number of visual tokens, which makes them inefficient when computational constraints vary across applications. MQT instead allows the number of visual tokens to be chosen flexibly at inference time, considerably improving computational efficiency while maintaining robust performance.

Matryoshka Query Transformer (MQT)

Inspiration and Concept

Inspired by Matryoshka Representation Learning, MQT employs a query transformer whose number of output visual tokens can be adjusted at inference. During each training step, the model randomly selects a number m up to a predefined maximum M and trains with only the first m latent query tokens, discarding the rest. This yields a Matryoshka-like nested structure: the first m tokens always form a usable representation on their own, so the earliest tokens, which are used under every budget, carry the most broadly shared information.
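
A minimal sketch of this per-step selection in PyTorch-style code; the uniform choice of m, the tensor shapes, and the variable names are assumptions for illustration, not the authors' exact recipe:

```python
import torch

# Keep only the first m of the M learnable latent queries for this training step.
# The paper specifies a random m <= M; the uniform distribution here is an assumption.
M, d = 256, 1024
latent_queries = torch.nn.Parameter(torch.randn(M, d) * 0.02)

def select_query_prefix(queries: torch.Tensor) -> torch.Tensor:
    m = torch.randint(1, queries.size(0) + 1, (1,)).item()  # random budget m <= M
    return queries[:m]                                       # discard the remaining M - m queries

active_queries = select_query_prefix(latent_queries)         # shape (m, d), fed to the query transformer
```

Because only a prefix is ever kept, the same learned queries support every budget from 1 to M without retraining.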

Technical Implementation

The implementation integrates MQT with the LLaVA vision-language model, yielding MQT-LLaVA. Training proceeds in two stages: an initial alignment stage followed by adaptive training with a randomly varying number of visual tokens. With this setup, MQT-LLaVA encodes an image into any chosen number of visual tokens at inference, up to a maximum of 256, as opposed to the fixed 576 tokens of LLaVA-1.5.
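
The compression step can be pictured as a cross-attention resampler between learnable queries and the frozen vision encoder's patch embeddings. The sketch below is an illustration under assumed dimensions and a single attention layer (the released model may use different depths, dimensions, and projectors):

```python
import torch
import torch.nn as nn

class MatryoshkaQueryTransformer(nn.Module):
    """Illustrative resampler: m <= 256 latent queries cross-attend to the 576
    patch embeddings of a frozen image encoder, producing m visual tokens
    projected into the LLM's embedding space. All dimensions are assumptions."""

    def __init__(self, vis_dim=1024, hidden_dim=1024, llm_dim=4096,
                 max_queries=256, num_heads=16):
        super().__init__()
        self.latent_queries = nn.Parameter(torch.randn(max_queries, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                                kdim=vis_dim, vdim=vis_dim,
                                                batch_first=True)
        self.proj = nn.Linear(hidden_dim, llm_dim)    # map into the LLM token space

    def forward(self, patch_embeds: torch.Tensor, m: int) -> torch.Tensor:
        # patch_embeds: (batch, 576, vis_dim) from the frozen image encoder
        queries = self.latent_queries[:m].unsqueeze(0).expand(patch_embeds.size(0), -1, -1)
        visual_tokens, _ = self.cross_attn(queries, patch_embeds, patch_embeds)
        return self.proj(visual_tokens)                # (batch, m, llm_dim)

mqt = MatryoshkaQueryTransformer()
dummy_patches = torch.randn(2, 576, 1024)              # stand-in for CLIP ViT patch features
tokens_16 = mqt(dummy_patches, m=16)                   # same weights, a much smaller token budget
```

Because the queries form a nested prefix, a single trained checkpoint serves every budget from 2 to 256 tokens at inference.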

Empirical Performance

Strong Numerical Results

MQT-LLaVA, with a maximum of 256 visual tokens, achieves performance on par with or better than LLaVA-1.5 across 11 benchmarks. Reducing the token count to 16 (an 8x reduction in TFLOPs) costs only about 2.4 points on MMBench. On ScienceQA and MMMU, even 2 visual tokens suffice, with drops of just 3% and 6%, respectively.
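
As a rough sanity check on the reported compute savings, LLM prefill cost scales approximately linearly with prompt length (about 2·P FLOPs per token for a P-parameter decoder, ignoring attention-matrix terms). The 7B parameter count and 60-token text prompt below are assumptions chosen purely for illustration:

```python
# Back-of-envelope prefill cost: ~2 * params * tokens FLOPs, attention terms ignored.
# The 7B parameter count and 60 text tokens are illustrative assumptions.
PARAMS = 7e9
TEXT_TOKENS = 60

def prefill_tflops(visual_tokens: int) -> float:
    return 2 * PARAMS * (visual_tokens + TEXT_TOKENS) / 1e12

baseline = prefill_tflops(576)                         # LLaVA-1.5's fixed visual budget
for m in (256, 64, 16, 2):
    print(f"{m:>3} visual tokens: ~{prefill_tflops(m):5.1f} TFLOPs, "
          f"~{baseline / prefill_tflops(m):.1f}x cheaper than 576 tokens")
```

Under these assumptions the 16-token setting works out to roughly an 8x reduction relative to 576 tokens, consistent with the figure quoted above.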

Performance-Efficiency Trade-Offs

The paper finds that different tasks have varying dependencies on the number of visual tokens:

  • High Token Requirement: Tasks such as VQAv2, GQA, and MMBench require more tokens for optimal performance due to their need for detailed visual understanding.
  • Low Token Requirement: Other tasks, including ScienceQA and MME Cognition, maintain robust performance with significantly fewer tokens, suggesting that in these contexts, the LLM's reasoning capabilities overshadow the need for detailed visual tokens.

The flexible adaptation of visual token budgets enables significant computational savings without notable performance trade-offs, particularly for tasks demanding less fine-grained visual detail.
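
In practice this means a single checkpoint can be served with a per-task token budget. The snippet below is purely hypothetical: the budget values, the model handle, and the generate(..., num_visual_tokens=...) signature are stand-ins, not the released MQT-LLaVA API.

```python
# Hypothetical per-task budgets; values chosen for illustration, not from the paper's tables.
TASK_BUDGETS = {
    "vqav2": 256,      # detail-heavy visual QA keeps the full budget
    "mmbench": 64,
    "scienceqa": 2,    # reasoning-heavy tasks tolerate aggressive compression
}

def answer(model, image, question, task):
    m = TASK_BUDGETS.get(task, 256)                    # default to the maximum budget
    return model.generate(image=image, prompt=question, num_visual_tokens=m)
```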

Implications and Future Research

Practical Impact

The proposed MQT-LLaVA model is highly versatile, making it applicable across diverse computational environments, from resource-constrained mobile devices to high-performance servers. The ability to dynamically adjust visual token budgets allows for real-time processing in applications with varying computational constraints.

Theoretical Contributions

The nested Matryoshka-like structure presents a novel means of organizing and efficiently utilizing visual tokens in LVLMs. This approach could influence future LVLM architectures, encouraging ongoing research into adaptive token strategies that further optimize computational efficiency and performance.

Speculative Future Directions

Looking forward, the principles established by MQT could be applied to other modalities beyond images, potentially influencing video and 3D data processing. Further exploration into the balance between the information density of visual tokens and computational cost stands to benefit the development of more scalable and resource-efficient models.

Conclusion

The Matryoshka Query Transformer is a meaningful step toward removing the rigidity of fixed visual token budgets in LVLMs. By enabling adaptive visual token counts at inference, MQT delivers substantial computational savings while preserving strong performance across varied vision-language tasks. This advance points toward even more adaptable and efficient vision-language models in the future.

References (48)
  1. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, volume 35, pages 23716–23736. Curran Associates, Inc.
  2. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. ArXiv preprint.
  3. Token merging: Your vit but faster. In The Eleventh International Conference on Learning Representations.
  4. Making large multimodal models understand arbitrary visual prompts. In IEEE Conference on Computer Vision and Pattern Recognition.
  5. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195.
  6. Vicuna: An opensource chatbot impressing gpt-4 with 90% chatgpt quality. ArXiv preprint.
  7. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886.
  8. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766.
  9. Instructblip: Towards general-purpose vision-language models with instruction tuning.
  10. Adaptive token sampling for efficient vision transformers. In European Conference on Computer Vision.
  11. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.
  12. Sparseformer: Sparse visual recognition via limited latent tokens. In The Twelfth International Conference on Learning Representations.
  13. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR).
  14. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR.
  15. 3d-llm: Injecting the 3d world into large language models. NeurIPS.
  16. Bliva: A simple multimodal llm for better handling of text-rich visual questions. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024.
  17. Language is not all you need: Aligning perception with language models.
  18. Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR.
  19. IDEFICS. 2023. Introducing idefics: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics.
  20. Phi-2: The surprising power of small language models. Microsoft Research Blog.
  21. Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization. arXiv preprint arXiv:2402.03161.
  22. Reformer: The efficient transformer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  23. Matformer: Nested transformer for elastic inference. arXiv preprint arXiv:2310.07707.
  24. Matryoshka representation learning. In Advances in Neural Information Processing Systems, volume 35, pages 30233–30249. Curran Associates, Inc.
  25. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proc. of ICML.
  26. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML.
  27. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043.
  28. Evaluating object hallucination in large vision-language models. In Proc. of EMNLP.
  29. EVit: Expediting vision transformers via token reorganizations. In International Conference on Learning Representations.
  30. Improved baselines with visual instruction tuning. ArXiv preprint.
  31. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc.
  32. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281.
  33. Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS.
  34. OpenAI. 2023. Gpt-4 technical report.
  35. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824.
  36. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
  37. Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems, volume 34, pages 13937–13949. Curran Associates, Inc.
  38. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626.
  39. Large language models as generalizable policies for embodied tasks. ICLR.
  40. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  41. mplug-owl: Modularization empowers large language models with multimodality. ArXiv preprint.
  42. A-ViT: Adaptive tokens for efficient vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  43. Mm-vet: Evaluating large multimodal models for integrated capabilities. In International conference on machine learning. PMLR.
  44. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, Singapore. Association for Computational Linguistics.
  45. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2978–2988.
  46. Tinyllava: A framework of small-scale large multimodal models.
  47. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
  48. Llava-phi: Efficient multi-modal assistant with small language model.
Authors (6)
  1. Wenbo Hu (55 papers)
  2. Zi-Yi Dou (33 papers)
  3. Liunian Harold Li (19 papers)
  4. Amita Kamath (8 papers)
  5. Nanyun Peng (205 papers)
  6. Kai-Wei Chang (292 papers)