OneLLM: One Framework to Align All Modalities with Language (2312.03700v1)

Published 6 Dec 2023 in cs.CV, cs.AI, cs.CL, cs.LG, and cs.MM

Abstract: Multimodal LLMs (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM

Analysis of OneLLM: A Unified Framework for Multimodal Language Alignment

The paper "OneLLM: One Framework to Align All Modalities with Language" presents a sophisticated approach to multimodal LLMs (MLLMs). This research introduces OneLLM, a unified model designed to comprehend and integrate eight distinct modalities, including image, audio, video, point cloud, and others, with language. The paper navigates the complexities of multimodal learning by proposing a novel architecture and training methodology.

Key Contributions

OneLLM handles diverse modalities with a single universal multimodal encoder, a universal projection module (UPM), and an LLM. The model leverages the strengths of pretrained models such as CLIP-ViT and LLaMA2, demonstrating robust performance across varied benchmarks.

Architecture

The architecture is characterized by:

  • Lightweight Modality Tokenizers: Each modality is processed by a dedicated lightweight tokenizer that converts raw inputs into token sequences, keeping per-modality overhead small despite the variability across input types.
  • Universal Encoder and Projection Module: A frozen CLIP-ViT model serves as the universal encoder, highlighting the transferability of pretrained vision models, while the UPM mixes multiple projection experts through a dynamic routing mechanism (a minimal sketch follows this list).
  • LLM Integration: Building on LLaMA2 provides the language understanding and generation capabilities needed to align visual, auditory, and other sensory inputs with language.
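
The UPM can be pictured as a small mixture of projection experts with per-token soft routing. The PyTorch sketch below is an illustrative reading of that idea, not the authors' released implementation; the class and parameter names, expert count, and feature dimensions are assumptions.

```python
import torch
import torch.nn as nn


class UniversalProjection(nn.Module):
    """Mixture of projection experts with dynamic (soft) routing."""

    def __init__(self, dim_in: int, dim_out: int, num_experts: int = 3):
        super().__init__()
        # Each expert is a small MLP mapping encoder features to the LLM embedding space.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim_in, dim_out), nn.GELU(), nn.Linear(dim_out, dim_out))
            for _ in range(num_experts)
        ])
        # The router predicts per-token mixing weights over the experts.
        self.router = nn.Linear(dim_in, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim_in) features from the frozen universal encoder.
        weights = torch.softmax(self.router(x), dim=-1)             # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], -1)  # (B, T, dim_out, E)
        # Soft mixture: weight each expert's projection per token and sum over experts.
        return (expert_out * weights.unsqueeze(-2)).sum(dim=-1)     # (B, T, dim_out)


# Example: project frozen CLIP-ViT features (assumed 1024-d) into an LLM
# embedding space (assumed 4096-d, as in LLaMA2-7B) before feeding the LLM.
upm = UniversalProjection(dim_in=1024, dim_out=4096)
multimodal_tokens = upm(torch.randn(2, 256, 1024))  # -> (2, 256, 4096)
```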

Multimodal Alignment Strategy

The authors implement a progressive alignment strategy: the model is first trained on image-text data and then extended to the remaining modalities in stages, stabilizing representations and mitigating bias toward any single modality. This approach ensures that new modalities are aligned without degrading previously learned capabilities.
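
Read as pseudocode, the staged pipeline described in the abstract might look like the sketch below. The helper functions (train_image_projection, build_upm, align_modality), the expert count, and the exact modality ordering are hypothetical placeholders, not the paper's training script.

```python
# Stage 1: align images with language by training an image projection module
# on image-text pairs (vision encoder and LLM kept frozen).
image_proj = train_image_projection(vision_encoder, llm, data="image_text_pairs")

# Stage 2: build the universal projection module (UPM) by mixing copies of the
# trained image projection as experts under a dynamic routing mechanism.
upm = build_upm(experts=[image_proj] * 3, routing="dynamic")

# Stage 3: progressively align the remaining modalities through the UPM;
# keeping earlier modalities in the training mix helps avoid overwriting
# previously learned alignments.
for modality in ["video", "audio", "point_cloud", "depth_normal", "imu", "fmri"]:
    align_modality(upm, llm, modality, replay_previous=True)
```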

Instruction Tuning

The paper introduces a comprehensive multimodal instruction dataset of roughly 2M items spanning image, audio, video, point cloud, depth/normal map, IMU, and fMRI data, which significantly enhances the model's ability to generate multimodal captions, answer questions, and perform reasoning tasks. The dataset is designed to fully exploit OneLLM's interaction capabilities across all supported modalities.
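
To make the dataset's role concrete, a single instruction-tuning record could be imagined roughly as below; this schema (field names, file path, and conversation format) is an illustrative assumption, not the released dataset's actual format.

```python
# One hypothetical record from a multimodal instruction-tuning set.
sample = {
    "modality": "audio",                # one of the eight supported modalities
    "input": "clips/door_knock.wav",    # hypothetical path to the raw signal
    "conversation": [
        {"role": "user",
         "content": "What sound is in this clip, and what might have caused it?"},
        {"role": "assistant",
         "content": "A knocking sound, most likely someone rapping on a wooden door."},
    ],
}
```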

Experimental Evaluation

OneLLM's effectiveness is validated on 25 benchmarks, spanning tasks like VQA, captioning, reasoning, and more. The model demonstrates competitive performance, often surpassing existing specialist and generalist multimodal models.

  • Vision Tasks: The model shows strong results in VQA and image captioning, nearly rivaling some vision-specific models.
  • Audio and Video Tasks: OneLLM handles both audio-text and video-text tasks effectively, showcasing its versatility in temporal and auditory processing.
  • Emergent Capabilities: The integration of point cloud, depth/normal map, IMU, and fMRI data demonstrates OneLLM’s potential in less-explored areas, such as motion analysis and brain activity interpretation.

Implications and Future Work

OneLLM’s design suggests a scalable direction for research in MLLMs, where a unified framework could accommodate even more modalities. This architecture reduces the need for separate modality-specific encoders and designs, potentially simplifying future research and applications.

However, challenges remain, notably the need for extensive, quality datasets for non-visual modalities and improved methodologies for handling high-resolution and long-sequence data. Future work might focus on fine-grained understanding and expanding modality support with minimal additional resources.

In sum, OneLLM represents a significant step towards versatile, unified models capable of comprehensive multimodal understanding, opening avenues for more integrated AI systems capable of complex real-world applications.

Authors (9)
  1. Jiaming Han
  2. Kaixiong Gong
  3. Yiyuan Zhang
  4. Jiaqi Wang
  5. Kaipeng Zhang
  6. Dahua Lin
  7. Yu Qiao
  8. Peng Gao
  9. Xiangyu Yue