
Efficient Multimodal Large Language Models: A Survey (2405.10739v2)

Published 17 May 2024 in cs.CV and cs.AI

Abstract: In the past year, Multimodal LLMs (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, especially in edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, we summarize the timeline of representative efficient MLLMs, research state of efficient structures and strategies, and the applications. Finally, we discuss the limitations of current efficient MLLM research and promising future directions. Please refer to our GitHub repository for more details: https://github.com/lijiannuist/Efficient-Multimodal-LLMs-Survey.

Efficient Multimodal LLMs: A Comprehensive Survey

Setting the Stage for Efficient MLLMs

Multimodal LLMs (MLLMs) have shown impressive abilities in tasks like visual question answering and visual understanding. Yet their large model sizes and the high cost of training and inference have limited broader adoption. This survey provides an in-depth review of efficient MLLMs, particularly in light of their potential for edge-computing scenarios, focusing on lightweight models that retain strong performance while using far fewer resources.

Architecture: Breaking It Down

Core Components

Efficient MLLMs follow the basic framework of conventional MLLMs but are designed to reduce computational cost. The architecture can generally be divided into three main parts (a minimal sketch of the full pipeline follows the list):

  1. Vision Encoder: Processes visual inputs.
  2. LLM: Handles multimodal signals and reasoning.
  3. Vision-Language Projector: Bridges the two modalities.
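
To make the division of labor concrete, here is a minimal PyTorch sketch of this generic pipeline; the class name, dimensions, and wiring (ToyMLLM, vision_dim, llm_dim) are illustrative assumptions, not code from any surveyed model.

```python
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    """Illustrative-only composition of the three components; names are placeholders."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.vision_encoder = vision_encoder      # 1. Vision Encoder: image -> patch features
        self.projector = nn.Sequential(           # 3. Vision-Language Projector: map into LLM space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                            # 2. LLM: reasons over text + projected visual tokens

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vis_feats = self.vision_encoder(images)   # (B, N_patches, vision_dim)
        vis_tokens = self.projector(vis_feats)    # (B, N_patches, llm_dim)
        # Prepend visual tokens to the text embeddings and decode as usual.
        return self.llm(torch.cat([vis_tokens, text_embeds], dim=1))
```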

Key Approaches

Multiple Vision Encoders: Combining different vision encoders offers a diverse range of visual representations, enhancing the model's understanding of visual data. Models like Cobra integrate DINOv2 and SigLIP for better performance.
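
As a rough illustration of this idea, the sketch below simply concatenates per-patch features from two encoders along the channel dimension before projection; the function name and the assumption of a shared patch grid are illustrative, since each model (Cobra included) defines its own fusion.

```python
import torch

def fuse_encoders(dino_feats: torch.Tensor, siglip_feats: torch.Tensor) -> torch.Tensor:
    """Concatenate per-patch features from two vision encoders along the channel dimension.

    dino_feats:   (B, N, D1)  e.g. DINOv2 patch embeddings
    siglip_feats: (B, N, D2)  e.g. SigLIP patch embeddings (same patch grid assumed)
    returns:      (B, N, D1 + D2) fused features, fed to the vision-language projector
    """
    assert dino_feats.shape[:2] == siglip_feats.shape[:2], "encoders must share the patch grid"
    return torch.cat([dino_feats, siglip_feats], dim=-1)
```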

Lightweight Vision Encoder: Approaches like ViTamin focus on creating smaller vision models without sacrificing accuracy. This makes them suitable for tasks with high-resolution requirements.

Vision-Language Projector: Most efficient MLLMs use simple MLPs, while others, following BLIP-2, introduce transformer-based modules such as the Q-Former, which uses a small set of learnable latent queries to extract compact visual features.
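
The two projector styles can be sketched as follows. The LatentQueryProjector is a heavily simplified stand-in for BLIP-2's Q-Former (which is a full BERT-style transformer with learnable queries), so the layer shapes and names here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """LLaVA-style projector: a small MLP mapping each visual token into the LLM space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        return self.net(vis_feats)                 # (B, N, vision_dim) -> (B, N, llm_dim)

class LatentQueryProjector(nn.Module):
    """Q-Former-like projector (simplified): a fixed set of learnable queries cross-attends to
    the visual features, so the LLM sees num_queries tokens regardless of image resolution."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048,
                 num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.out = nn.Linear(vision_dim, llm_dim)

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, vis_feats, vis_feats)
        return self.out(attended)                  # (B, num_queries, llm_dim)
```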

Small LLM: Efficient MLLMs typically use LLMs with fewer than 3 billion parameters to save resources while maintaining strong performance; Phi-2 and Gemma-2B are representative examples.

Vision Token Compression: Techniques such as token pruning/merging and multi-scale information fusion reduce the computational load imposed by high-resolution visual inputs. Methods such as the compression module in LLaVA-UHD help balance detailed perception with efficiency.
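
A common compression primitive is merging adjacent patch tokens before they reach the LLM; the 2x2 average-pooling sketch below is a generic illustration of this idea, not LLaVA-UHD's specific compression module.

```python
import torch
import torch.nn.functional as F

def compress_vision_tokens(vis_tokens: torch.Tensor, grid_size: int, stride: int = 2) -> torch.Tensor:
    """Merge each stride x stride block of patch tokens into one token by averaging.

    vis_tokens: (B, grid_size * grid_size, D) patch tokens laid out row-major
    returns:    (B, (grid_size // stride) ** 2, D) -- 4x fewer tokens when stride=2
    """
    B, N, D = vis_tokens.shape
    assert N == grid_size * grid_size, "token count must match the patch grid"
    x = vis_tokens.view(B, grid_size, grid_size, D).permute(0, 3, 1, 2)   # (B, D, H, W)
    x = F.avg_pool2d(x, kernel_size=stride, stride=stride)
    return x.flatten(2).transpose(1, 2)                                   # (B, N / stride^2, D)
```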

Efficient Structures: MoE-based models like MoE-LLaVA improve the efficiency-performance trade-off by activating only a subset of experts per token. Meanwhile, methods like VTW (Visual Tokens Withdrawal) accelerate inference by dropping visual tokens in the deeper layers of the LLM.
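
The token-withdrawal idea can be illustrated with a toy decoder loop that drops the visual positions after a chosen layer; the layer index, token layout, and function name below are assumptions for illustration, not VTW's exact recipe.

```python
import torch
import torch.nn as nn

def forward_with_token_withdrawal(layers: nn.ModuleList, hidden: torch.Tensor,
                                  num_vision_tokens: int, withdraw_after: int = 16) -> torch.Tensor:
    """Run decoder layers, removing the (leading) visual tokens after `withdraw_after` layers.

    hidden: (B, num_vision_tokens + num_text_tokens, D); visual tokens are assumed to come first.
    """
    for i, layer in enumerate(layers):
        hidden = layer(hidden)                         # each layer: (B, T, D) -> (B, T, D)
        if i + 1 == withdraw_after:
            hidden = hidden[:, num_vision_tokens:, :]  # later layers attend to text tokens only
    return hidden
```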

Training: From Scratch to Fine-Tuned

Pre-Training

Pre-training usually relies on large datasets of image-caption pairs to build initial multimodal representations. Efficient strategies include multi-stage pre-training that uses different image resolutions at different stages for optimal performance.

Instruction-Tuning

Instruction-tuning fine-tunes the models using task-specific datasets, including curated conversations and instructions. Approaches like LaVIN manage to reduce training cost significantly while retaining high performance across tasks.

Diverse Training Steps

Some efficient models, like TinyGPT-V, employ multi-stage training processes to iteratively refine their capabilities from basic understanding to advanced multi-task learning.

Parameter-Efficient Transfer Learning

Methods like MemVP inject visual information as prompts into the LLM's feed-forward (memory) layers rather than feeding long sequences of visual tokens to the input, significantly reducing the computational burden during both training and inference.
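
MemVP's memory-space mechanism is not reproduced here; as a generic illustration of parameter-efficient transfer learning (freeze the backbone, train a small number of added parameters), the sketch below shows a LoRA-style low-rank adapter wrapped around a frozen linear layer.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA-style adapter: freeze the original weight, learn a low-rank update A @ B."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():               # pretrained weights stay frozen
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A @ self.lora_B) * self.scale
```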

Data and Benchmarks: The Backbone of Performance

Pre-Training Data

Datasets like CC595k and LAION provide extensive image-text pairs necessary for building robust initial models. However, high-quality, fine-grained datasets created with the help of models like GPT-4V offer better performance but come at a higher cost.

Instruction-Tuning Data

Instruction-tuning datasets are derived from a mix of task-specific and general-purpose data, aimed at refining the models' responsiveness to various instructions.

Benchmarks

Performance is evaluated using established benchmarks like VQA and GQA, where efficient models often show competitive results against larger counterparts. This highlights the success of efficient architectures in maintaining high-quality outputs.

Applications: Spanning Domains

Biomedical Analysis

Efficient MLLMs like MoE-TinyMed have found applications in medical scenarios, providing strong performance with fewer parameters. Models such as LLaVA-Rad outperform larger models in generating radiology reports.

Document Understanding

Efficient models such as TinyChart combine fine-grained perception strategies (e.g., visual token merging) with program-of-thoughts learning to enhance chart and document understanding, paving the way for applications that require detailed visual and textual analysis.

Video Comprehension

Methods like Video-LLaVA unify image and video representations in a shared language feature space before projection, enabling efficient processing of multiple frames in video-understanding tasks.

Final Thoughts and Future Directions

Efficient MLLMs are making strides in various fields by balancing performance and resource consumption. However, there's always room for improvement. Broadening input and output modalities, enhancing zero-shot capabilities, and developing embodied agents are some promising future directions that could further establish efficient MLLMs as versatile tools in AI.

In summary, the paper comprehensively surveys the landscape of efficient MLLMs, pointing to robust strategies and promising avenues that hold the potential to bring advanced AI capabilities into practical, resource-constrained environments.

References (197)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  3. mplug-owl: Modularization empowers large language models with multimodality, 2023.
  4. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023.
  5. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
  6. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023.
  7. Visual instruction tuning. In NeurIPS, 2023.
  8. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
  9. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
  10. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  11. Vitamin: Designing scalable vision models in the vision-language era, 2024.
  12. BRAVE: Broadening the visual encoding of vision-language models. arXiv preprint arXiv:2404.07204, 2024.
  13. Cobra: Extending mamba to multi-modal large language model for efficient inference. arXiv preprint arXiv:2403.14520, 2024.
  14. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935, 2024.
  15. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
  16. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
  17. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.
  18. Vl-mamba: Exploring state space models for multimodal learning. arXiv preprint arXiv:2403.13600, 2024.
  19. Honeybee: Locality-enhanced projector for multimodal llm. arXiv preprint arXiv:2312.06742, 2023.
  20. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
  21. Llava-phi: Efficient multi-modal assistant with small language model. arXiv preprint arXiv:2401.02330, 2024.
  22. Imp: An empirical study of multimodal small language models, 2024.
  23. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289, 2024.
  24. Efficient multimodal learning from data-centric perspective. arXiv preprint arXiv:2402.11530, 2024.
  25. Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024.
  26. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024.
  27. Small language model meets with reinforced vision vocabulary. arXiv preprint arXiv:2401.12503, 2024.
  28. Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv preprint arXiv:2312.16862, 2023.
  29. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024.
  30. Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611, 2024.
  31. Llava-gemma: Accelerating multimodal foundation models with a compact language model. arXiv preprint arXiv:2404.01331, 2024.
  32. A comprehensive overhaul of multimodal assistant with small language models. arXiv preprint arXiv:2403.06199, 2024.
  33. Minicpm-v 2.0: An efficient end-side mllm with strong ocr and understanding capabilities. https://github.com/OpenBMB/MiniCPM-V, 2024.
  34. Deepseek-vl: Towards real-world vision-language understanding, 2024.
  35. Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images, 2024.
  36. Texthawk: Exploring efficient fine-grained perception of multimodal large language models. arXiv preprint arXiv:2404.09204, 2024.
  37. Tinychart: Efficient chart understanding with visual token merging and program-of-thoughts learning. arXiv preprint arXiv:2404.16635, 2024.
  38. Plug-and-play grounding of reasoning in multimodal large language models. arXiv preprint arXiv:2403.19322, 2024.
  39. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd. arXiv preprint arXiv:2404.06512, 2024.
  40. When do we not need larger vision models? arXiv preprint arXiv:2403.13043, 2024.
  41. Llava-prumerge: Adaptive token reduction for efficient large multimodal models, 2024.
  42. Madtp: Multimodal alignment-guided dynamic token pruning for accelerating vision-language transformer, 2024.
  43. Mova: Adapting mixture of vision experts to multimodal context, 2024.
  44. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
  45. On speculative decoding for multimodal large language models, 2024.
  46. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models, 2024.
  47. Boosting multimodal large language models with visual tokens withdrawal for rapid inference, 2024.
  48. What matters when building vision-language models?, 2024.
  49. Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023.
  50. Cheap and quick: Efficient vision-language instruction tuning for large language models. Advances in Neural Information Processing Systems, 36, 2024.
  51. Hyperllava: Dynamic visual and language expert tuning for multimodal large language models, 2024.
  52. Not all attention is needed: Parameter and computation efficient transfer learning for multi-modal large language models. arXiv preprint arXiv:2403.15226, 2024.
  53. Memory-space visual prompting for efficient vision-language fine-tuning, 2024.
  54. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  55. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
  56. To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023.
  57. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  58. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
  59. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
  60. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024.
  61. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
  62. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
  63. Training small multimodal models to bridge biomedical competency gap: A case study in radiology imaging, 2024.
  64. Moe-tinymed: Mixture of experts for tiny medical large vision-language models, 2024.
  65. Monkey: Image resolution and text label are important things for large multi-modal models, 2024.
  66. Hrvda: High-resolution visual document assistant. arXiv preprint arXiv:2404.06918, 2024.
  67. mplug-2: A modularized multi-modal foundation model across text, image and video, 2023.
  68. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding, 2024.
  69. Llama-vid: An image is worth 2 tokens in large language models, 2023.
  70. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.
  71. Karmavlm: A family of high efficiency and powerful visual language model. https://github.com/thomas-yanxin/KarmaVLM, 2024.
  72. Moondream: tiny vision language model. https://github.com/vikhyat/moondream, 2024.
  73. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  74. Phi-2: The surprising power of small language models. Microsoft Research Blog, 2023.
  75. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
  76. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  77. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  78. Gemma: Open models based on gemini research and technology, 2024.
  79. Qwen technical report, 2023.
  80. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
  81. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022.
  82. Tinyllama: An open-source small language model, 2024.
  83. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.
  84. DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
  85. Textbooks are all you need ii: phi-1.5 technical report, 2023.
  86. Phi-3 technical report: A highly capable language model locally on your phone, 2024.
  87. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  88. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.
  89. Mixtral of experts, 2024.
  90. Llama 2: Open foundation and fine-tuned chat models, 2023.
  91. Elysium: Exploring object-level perception in videos via mllm, 2024.
  92. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023.
  93. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
  94. A survey on efficient vision transformers: Algorithms, techniques, and performance benchmarking. IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1–20, 2024.
  95. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
  96. Efficientformer: Vision transformers at mobilenet speed. Advances in Neural Information Processing Systems, 35:12934–12949, 2022.
  97. Rethinking vision transformers for mobilenet size and speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16889–16900, 2023.
  98. Autoformer: Searching transformers for visual recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12270–12280, 2021.
  99. Nasvit: Neural architecture search for efficient vision transformers with gradient conflict-aware supernet training. ICLR Proceedings 2022, 2022.
  100. Training-free transformer architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10894–10903, 2022.
  101. Uninet: Unified architecture search with convolution, transformer, and mlp. In European Conference on Computer Vision, pages 33–49. Springer, 2022.
  102. Learning to merge tokens in vision transformers. arXiv preprint arXiv:2202.12015, 2022.
  103. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937–13949, 2021.
  104. Sepvit: Separable vision transformer. arXiv preprint arXiv:2203.15380, 2022.
  105. Cap: Correlation-aware pruning for highly-accurate sparse vision models. Advances in Neural Information Processing Systems, 36, 2024.
  106. Cait: Triple-win compression towards high accuracy, fast inference, and favorable transferability for vits. arXiv preprint arXiv:2309.15755, 2023.
  107. Width & depth pruning for vision transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3143–3151, 2022.
  108. Lu Yu and Wei Xiang. X-pruner: explainable pruning for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24355–24363, 2023.
  109. Vision transformer pruning. arXiv preprint arXiv:2104.08500, 2021.
  110. Patch slimming for efficient vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12165–12174, 2022.
  111. Spvit: Enabling faster vision transformers via latency-aware soft token pruning. In European conference on computer vision, pages 620–640. Springer, 2022.
  112. Vision transformer slimming: Multi-dimension searching in continuous optimization space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4931–4941, 2022.
  113. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
  114. Tinyvit: Fast pretraining distillation for small vision transformers. In European Conference on Computer Vision, pages 68–85. Springer, 2022.
  115. m2mkd: Module-to-module knowledge distillation for modular transformers. arXiv preprint arXiv:2402.16918, 2024.
  116. Learning efficient vision transformers via fine-grained manifold distillation. Advances in Neural Information Processing Systems, 35:9164–9175, 2022.
  117. Minivit: Compressing vision transformers with weight multiplexing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12145–12154, 2022.
  118. Dearkd: Data-efficient early knowledge distillation for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12052–12062, 2022.
  119. Co-advise: Cross inductive bias distillation. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 16773–16782, 2022.
  120. Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization. In European conference on computer vision, pages 191–207. Springer, 2022.
  121. Towards accurate post-training quantization for vision transformer. In Proceedings of the 30th ACM International Conference on Multimedia, pages 5380–5388, 2022.
  122. Noisyquant: Noisy bias-enhanced post-training activation quantization for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20321–20330, 2023.
  123. Quantformer: Learning extremely low-precision vision transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  124. Bit-shrinking: Limiting instantaneous sharpness for improving post-training quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16196–16205, 2023.
  125. Q-vit: Accurate and fully quantized low-bit vision transformer. Advances in neural information processing systems, 35:34451–34463, 2022.
  126. Tervit: An efficient ternary vision transformer. arXiv preprint arXiv:2201.08050, 2022.
  127. Bivit: Extremely compressed binary vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5651–5663, 2023.
  128. Packqvit: Faster sub-8-bit vision transformers via full and packed quantization on the mobile. Advances in Neural Information Processing Systems, 36, 2024.
  129. Binaryvit: pushing binary vision transformers towards convolutional models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4664–4673, 2023.
  130. Boost vision transformer with gpu-friendly sparsity and quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22658–22668, 2023.
  131. Auto-vit-acc: An fpga-aware automatic acceleration framework for vision transformer with mixed-scheme quantization. In 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL), pages 109–116. IEEE, 2022.
  132. Crossformer++: A versatile vision transformer hinging on cross-scale attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  133. Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704, 2021.
  134. Adaptive token sampling for efficient vision transformers. In European Conference on Computer Vision, pages 396–414. Springer, 2022.
  135. Unified visual transformer compression. arXiv preprint arXiv:2203.08243, 2022.
  136. Chasing sparsity in vision transformers: An end-to-end exploration. Advances in Neural Information Processing Systems, 34:19974–19988, 2021.
  137. A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  138. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  139. Model quantization and hardware acceleration for vision transformers: A comprehensive survey. arXiv preprint arXiv:2405.00314, 2024.
  140. Post-training quantization for vision transformer. Advances in Neural Information Processing Systems, 34:28092–28103, 2021.
  141. Binaryvit: Towards efficient and accurate binary vision transformers. arXiv preprint arXiv:2305.14730, 2023.
  142. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023.
  143. Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.
  144. Funnel-transformer: Filtering out sequential redundancy for efficient language processing. Advances in neural information processing systems, 33:4271–4282, 2020.
  145. Set transformer: A framework for attention-based permutation-invariant neural networks. In International conference on machine learning, pages 3744–3753. PMLR, 2019.
  146. Linformer: Self-attention with linear complexity, 2020.
  147. Masked language modeling for proteins via linearly scalable long-context transformers. arXiv preprint arXiv:2006.03555, 2020.
  148. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
  149. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
  150. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023.
  151. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
  152. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982–22994, 2022.
  153. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933, 2023.
  154. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. arXiv preprint arXiv:2205.05638, 2022.
  155. Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning. arXiv preprint arXiv:2308.03303, 2023.
  156. Dylora: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. arXiv preprint arXiv:2210.07558, 2022.
  157. Full parameter fine-tuning for large language models with limited resources. arXiv preprint arXiv:2306.09782, 2023.
  158. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36:53038–53075, 2023.
  159. Efficient large language models: A survey, 2024.
  160. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  161. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  162. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3558–3568, 2021.
  163. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24, 2011.
  164. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  165. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  166. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
  167. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  168. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
  169. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021.
  170. Multimodal c4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems, 36, 2024.
  171. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36, 2024.
  172. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  173. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  174. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020.
  175. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. arXiv preprint arXiv:1505.00855, 2015.
  176. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
  177. Sharegpt. https://sharegpt.com/, 2023.
  178. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
  179. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
  180. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016.
  181. Mitigating hallucination in large multi-modal models via robust instruction tuning. In The Twelfth International Conference on Learning Representations, 2023.
  182. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019.
  183. LAION. Gpt-4v dataset. https://huggingface.co/datasets/laion/gpt4v-dataset, 2023.
  184. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023.
  185. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
  186. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  187. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
  188. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
  189. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023.
  190. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
  191. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
  192. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
  193. Towards generalist biomedical ai. NEJM AI, 1(3):AIoa2300138, 2024.
  194. Big self-supervised models advance medical image classification. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3478–3488, 2021.
  195. Chartllama: A multimodal llm for chart understanding and generation, 2023.
  196. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
  197. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
Authors (13)
  1. Yizhang Jin
  2. Jian Li
  3. Yexin Liu
  4. Tianjun Gu
  5. Kai Wu
  6. Zhengkai Jiang
  7. Muyang He
  8. Bo Zhao
  9. Xin Tan
  10. Zhenye Gan
  11. Yabiao Wang
  12. Chengjie Wang
  13. Lizhuang Ma