Large Multimodal Models: Notes on CVPR 2023 Tutorial (2306.14895v1)

Published 26 Jun 2023 in cs.CV

Abstract: This tutorial note summarizes the presentation on "Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4", a part of the CVPR 2023 tutorial on "Recent Advances in Vision Foundation Models". The tutorial consists of three parts. We first introduce the background on recent GPT-like large models for vision-and-language modeling to motivate the research in instruction-tuned large multimodal models (LMMs). As a pre-requisite, we describe the basics of instruction tuning in LLMs, which is further extended to the multimodal space. Lastly, we illustrate how to build the minimum prototype of multimodal GPT-4 like models with open-source resources, and review the recently emerged topics.

Analysis of Large Multimodal Models: Exploration and Evolution in the Field

The tutorial paper titled "Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4," presented as part of the CVPR 2023 tutorial series, offers a detailed account of techniques and methodologies for developing large multimodal models (LMMs). These models extend LLMs into the multimodal domain by integrating image understanding with conventional text processing. The tutorial focuses in particular on instruction-tuned LMMs, inspired by OpenAI's GPT-4, whose multimodal variant extends GPT-style language capabilities to visual inputs.
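
To make the architecture concrete, the following is a minimal, hedged sketch of the wiring used by LLaVA-style prototypes discussed in the tutorial: a frozen vision encoder, a trainable projection into the LLM's embedding space, and a language model that consumes the projected visual tokens alongside the instruction text. All module choices and dimensions below are illustrative stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ToyLMM(nn.Module):
    """Illustrative LLaVA-style wiring; every component is a small stand-in."""

    def __init__(self, vision_dim=64, llm_dim=128, vocab_size=1000):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g., a CLIP ViT); kept frozen.
        self.vision_encoder = nn.Linear(3 * 32 * 32, vision_dim)
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        # Trainable projection mapping visual features into the LLM embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-in for a pretrained decoder-only LLM (e.g., Vicuna/LLaMA).
        self.token_embed = nn.Embedding(vocab_size, llm_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image, instruction_ids):
        vis = self.vision_encoder(image.flatten(1))     # (B, vision_dim)
        vis_tok = self.projector(vis).unsqueeze(1)      # (B, 1, llm_dim) visual prefix
        txt_tok = self.token_embed(instruction_ids)     # (B, T, llm_dim)
        hidden = self.backbone(torch.cat([vis_tok, txt_tok], dim=1))
        return self.lm_head(hidden)                     # token logits, (B, 1+T, V)

model = ToyLMM()
logits = model(torch.randn(2, 3, 32, 32), torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 9, 1000])
```

In real prototypes such as LLaVA or MiniGPT-4, the pretrained encoder and LLM are reused largely as-is, and training concentrates on the projection (and optionally the LLM), which is what keeps building such systems feasible with open-source resources.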

Core Content and Contributions

The tutorial is structured in three parts: the motivation and background for multimodal GPT-like models, the basics of instruction tuning in LLMs, and the construction of GPT-4-like multimodal prototypes from open-source resources. Throughout, instruction tuning, which teaches a model to follow natural-language (and, here, image-grounded) instructions, is highlighted as the key advance driving these multimodal models.
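
As an illustration of what multimodal instruction tuning operates on, below is a sketch of a single training sample in the conversational format popularized by LLaVA-style data. The field names, example content, and the masking helper are assumptions for exposition, not the paper's exact schema.

```python
# One visual instruction-tuning sample: an image plus a multi-turn conversation.
# The "<image>" placeholder marks where projected visual tokens are spliced in.
sample = {
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this scene?"},
        {"from": "gpt", "value": "A man is ironing clothes on a board attached to a moving taxi."},
    ],
}

def loss_mask(conversations):
    """During training, the loss is applied only to assistant ('gpt') turns;
    the image and human turns serve purely as conditioning context."""
    return [turn["from"] == "gpt" for turn in conversations]

print(loss_mask(sample["conversations"]))  # [False, True]
```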

Key Numerical Results: Open prototypes such as LLaVA and MiniGPT-4 have shown that they can, in certain respects, perform on par with proprietary models: LLaVA attains a relative score of up to 85.1% against GPT-4 on visual chat (multimodal instruction-following) evaluations. Moreover, combining LLaVA with GPT-4 pushes science question answering (ScienceQA) to a new state of the art (SoTA) of 92.53% accuracy. These results underscore the prototypes' competitiveness against state-of-the-art models.
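
For context on how the 85.1% figure is typically obtained, the sketch below shows the GPT-4-as-judge relative scoring recipe used in LLaVA-style visual chat evaluation: the judge rates the candidate's and GPT-4's answers to the same questions, and the reported number is the ratio of total scores. The rating scale and the toy numbers are assumptions for illustration.

```python
def relative_score(candidate_ratings, reference_ratings):
    """Relative score (%) of a candidate model against a reference (e.g., GPT-4),
    given per-question ratings produced by a judge model on the same scale."""
    assert len(candidate_ratings) == len(reference_ratings)
    return 100.0 * sum(candidate_ratings) / sum(reference_ratings)

# Toy example: three questions rated 1-10 by the judge for each model.
print(round(relative_score([7, 8, 8], [9, 9, 9]), 1))  # 85.2
```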

Implications and Speculations

The tutorial notes highlight the transition from fine-tuning models on predefined task datasets to an instruction-tuning paradigm, which yields greater adaptability and usability in real-world applications and marks a shift toward more versatile AI systems. By building on open-source projects, the community has begun closing the gap between existing capabilities and what GPT-4-equivalent functionality requires. These efforts amount to a democratization of capabilities previously confined to large-scale proprietary models from industrial labs.

Meanwhile, the open-source movement, evidenced by projects such as LLaMA and its derivatives, is essential for driving future advances free of the constraints of proprietary models. These projects improve model accessibility and create fertile ground for community-driven research and innovation.

Despite these positive steps, challenges remain in scaling model capabilities. The computational demands and resource-intensive nature of highly capable multimodal models are underscored by examples from the OpenAI GPT-4 technical report, which highlights the persistent gap open-source models face in reaching full parity with GPT-4 in its most powerful, large-scale configurations.

Future Directions

The paper closes with reflections on sustainable future directions, suggesting that the community balance evolving current models against developing methods that lower computational barriers, which in turn could broaden model accessibility and ease of use. It also encourages both scaling models further to unlock stronger properties and exploring new capabilities to map out further possibilities within multimodal AI.

In conclusion, the tutorial paper offers an inside look into current methodologies, achievements, and ongoing challenges in the development of instruction-tuned large multimodal models, sketching a roadmap for future inquiries and breakthrough prospects in multimodal artificial intelligence.

References (67)
  1. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
  2. Openflamingo, March 2023.
  3. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  4. X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160, 2023.
  5. Together Computer. Redpajama-data: An open source recipe to reproduce llama training dataset, 2023.
  6. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  7. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
  8. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  9. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
  10. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
  11. Openllama: An open reproduction of llama, May 2023.
  12. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.
  13. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
  14. The false promise of imitating proprietary llms. arXiv preprint arXiv:2305.15717, 2023.
  15. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.
  16. Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023.
  17. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023.
  18. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.
  19. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023.
  20. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  21. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
  22. M3it: A large-scale dataset towards multi-modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387, 2023.
  23. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
  24. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  25. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023.
  26. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 2022.
  27. Cheap and quick: Efficient vision-language instruction tuning for large language models. arXiv preprint arXiv:2305.15023, 2023.
  28. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207, 2023.
  29. Dataperf: Benchmarks for data-centric ai development. arXiv preprint arXiv:2207.10062, 2022.
  30. Metavl: Transferring in-context learning ability from language models to vision-language models. arXiv preprint arXiv:2306.01311, 2023.
  31. Embodiedgpt: Vision-language pre-training via embodied chain of thought. arXiv preprint arXiv:2305.15021, 2023.
  32. OpenAI. ChatGPT. https://openai.com/blog/chatgpt/, 2022.
  33. OpenAI. GPT-4 technical report. https://arxiv.org/abs/2303.08774, 2023.
  34. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  35. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
  36. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023.
  37. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
  38. Language models are unsupervised multitask learners. OpenAI blog, 2019.
  39. ShareGPT. https://sharegpt.com/, 2023.
  40. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
  41. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023.
  42. Pathasst: Redefining pathology through generative foundation ai assistant for pathology. arXiv preprint arXiv:2305.15072, 2023.
  43. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  44. MosaicML NLP Team. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023. Accessed: 2023-03-28.
  45. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  46. Attention is all you need. In NeurIPS, 2017.
  47. Vicuna. Vicuna: An open-source chatbot impressing GPT-4 with 90%* chatgpt quality. https://vicuna.lmsys.org/, 2023.
  48. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.
  49. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023.
  50. How far can camels go? exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751, 2023.
  51. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
  52. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705, 2022.
  53. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
  54. Instruction-vit: Multi-modal prompts for instruction learning in vit. arXiv preprint arXiv:2305.00201, 2023.
  55. Toward understanding wordart: Corner-guided transformer for scene text recognition, 2022.
  56. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023.
  57. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. arXiv preprint arXiv:2212.10773, 2022.
  58. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  59. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. arXiv preprint arXiv:2306.06687, 2023.
  60. Contextual object detection with multimodal large language models. arXiv preprint arXiv:2305.18279, 2023.
  61. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities. arXiv preprint arXiv:2305.11000, 2023.
  62. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
  63. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415, 2023.
  64. On evaluating adversarial robustness of large vision-language models. arXiv preprint arXiv:2305.16934, 2023.
  65. Chatbridge: Bridging modalities with large language model as a language catalyst. arXiv preprint arXiv:2305.16103, 2023.
  66. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023.
  67. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023.
Authors (1)
  1. Chunyuan Li