Libra: Building Decoupled Vision System on Large Language Models (2405.10140v1)

Published 16 May 2024 in cs.CV

Abstract: In this work, we introduce Libra, a prototype model with a decoupled vision system built on a large language model (LLM). The decoupled vision system separates inner-modal modeling from cross-modal interaction, yielding dedicated visual information modeling and effective cross-modal comprehension. Libra is trained through discrete auto-regressive modeling on both vision and language inputs. Specifically, we incorporate a routed visual expert with a cross-modal bridge module into a pretrained LLM, routing the vision and language flows during attention computation to enable different attention patterns for inner-modal modeling and cross-modal interaction. Experimental results demonstrate that Libra's dedicated design achieves a strong multimodal LLM (MLLM) baseline that rivals existing works in the image-to-text scenario with merely 50 million training samples, providing a new perspective for future multimodal foundation models. Code is available at https://github.com/YifanXu74/Libra.
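
The routing mechanism described in the abstract can be sketched in a few lines of PyTorch. The snippet below is only an illustration of the stated idea, assuming a per-token modality mask, separate "expert" projections for vision and language tokens, and a small MLP bridge applied to visual keys and values; the class name, routing mask, and bridge design are assumptions made for this example, not the authors' implementation (see the repository linked above for the actual code).

import torch
import torch.nn as nn
import torch.nn.functional as F


class RoutedVisualExpertAttention(nn.Module):
    """Sketch of modality-routed attention: language tokens use the LLM's own
    projections, vision tokens use parallel 'visual expert' projections, and a
    bridge module adapts visual keys/values before cross-modal attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Language expert: stands in for the pretrained LLM's attention weights.
        self.qkv_lang = nn.Linear(dim, 3 * dim)
        self.out_lang = nn.Linear(dim, dim)
        # Routed visual expert: parallel projections applied only to vision tokens.
        self.qkv_vis = nn.Linear(dim, 3 * dim)
        self.out_vis = nn.Linear(dim, dim)
        # Cross-modal bridge: adapts visual keys/values seen by language queries.
        self.bridge = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_vision: (batch, seq) bool mask marking vision tokens.
        b, s, d = x.shape
        vis = is_vision.unsqueeze(-1).to(x.dtype)
        # Route every token through its modality's expert projections.
        qkv = self.qkv_vis(x) * vis + self.qkv_lang(x) * (1 - vis)
        q, k, v = qkv.chunk(3, dim=-1)
        # Pass visual keys/values through the bridge before they mix with language tokens.
        k = self.bridge(k) * vis + k * (1 - vis)
        v = self.bridge(v) * vis + v * (1 - vis)

        def heads(t):  # (b, s, d) -> (b, num_heads, s, head_dim)
            return t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

        attn = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
        out = attn.transpose(1, 2).reshape(b, s, d)
        # The output projection is routed by modality as well.
        return self.out_vis(out) * vis + self.out_lang(out) * (1 - vis)


# Toy usage: 16 vision tokens followed by 8 language tokens in one sequence.
layer = RoutedVisualExpertAttention(dim=512, num_heads=8)
tokens = torch.randn(1, 24, 512)
mask = torch.tensor([[True] * 16 + [False] * 8])
print(layer(tokens, mask).shape)  # torch.Size([1, 24, 512])

In this toy setup, language tokens keep the pretrained projections while vision tokens receive dedicated parameters, so the two modalities interact only through the shared attention map rather than shared projection weights; this is one plausible reading of the abstract's "different attention patterns" for inner-modal modeling and cross-modal interaction.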
