
X-VILA: Cross-Modality Alignment for Large Language Model (2405.19335v1)

Published 29 May 2024 in cs.CV, cs.CL, and cs.LG

Abstract: We introduce X-VILA, an omni-modality model designed to extend the capabilities of LLMs by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA that exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source.
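The abstract outlines the architecture at a high level: modality-specific encoders are projected into the LLM's input space, LLM outputs are projected into the conditioning space of diffusion decoders, and a visual embedding highway routes encoder features directly to the visual decoder to counter the visual information loss the authors identify. The Python sketch below illustrates that wiring under assumed dimensions; the module names, projection layers, and additive fusion are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of the any-to-any pipeline described in the abstract.
# Module names, embedding widths, and the fusion scheme are assumptions
# for illustration, not the actual X-VILA code.
import torch
import torch.nn as nn

D_ENC, D_LLM, D_DIFF = 1024, 4096, 768  # assumed feature widths


class ModalityInputProjector(nn.Module):
    """Maps features from a modality-specific encoder (image/video/audio)
    into the LLM's input token space."""
    def __init__(self, d_enc: int = D_ENC, d_llm: int = D_LLM):
        super().__init__()
        self.proj = nn.Linear(d_enc, d_llm)

    def forward(self, enc_feats: torch.Tensor) -> torch.Tensor:
        # enc_feats: (batch, n_tokens, d_enc)
        return self.proj(enc_feats)


class ModalityOutputProjector(nn.Module):
    """Maps LLM output hidden states into the conditioning space of a
    diffusion decoder that generates the target modality."""
    def __init__(self, d_llm: int = D_LLM, d_diff: int = D_DIFF):
        super().__init__()
        self.proj = nn.Linear(d_llm, d_diff)

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(llm_hidden)


class VisualEmbeddingHighway(nn.Module):
    """Sketch of a visual embedding highway: carries encoder features
    directly to the visual decoder, bypassing the LLM bottleneck, and fuses
    them with the LLM-derived conditioning (here, by simple addition)."""
    def __init__(self, d_enc: int = D_ENC, d_diff: int = D_DIFF):
        super().__init__()
        self.bridge = nn.Linear(d_enc, d_diff)

    def forward(self, enc_feats: torch.Tensor, llm_cond: torch.Tensor) -> torch.Tensor:
        return llm_cond + self.bridge(enc_feats)


if __name__ == "__main__":
    batch, n_vis_tokens = 2, 16
    image_feats = torch.randn(batch, n_vis_tokens, D_ENC)    # stand-in encoder output
    llm_hidden = torch.randn(batch, n_vis_tokens, D_LLM)     # stand-in LLM output states

    to_llm = ModalityInputProjector()
    to_decoder = ModalityOutputProjector()
    highway = VisualEmbeddingHighway()

    llm_inputs = to_llm(image_feats)          # visual tokens fed to the LLM
    cond = to_decoder(llm_hidden)             # conditioning for a diffusion decoder
    cond = highway(image_feats, cond)         # re-inject visual detail via the highway
    print(llm_inputs.shape, cond.shape)
```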

Authors (11)
  1. Hanrong Ye (17 papers)
  2. De-An Huang (45 papers)
  3. Yao Lu (212 papers)
  4. Zhiding Yu (94 papers)
  5. Wei Ping (51 papers)
  6. Andrew Tao (40 papers)
  7. Jan Kautz (215 papers)
  8. Song Han (155 papers)
  9. Dan Xu (120 papers)
  10. Pavlo Molchanov (70 papers)
  11. Hongxu Yin (49 papers)
Citations (15)