Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (2403.18814v1)

Published 27 Mar 2024 in cs.CV, cs.AI, and cs.CL

Abstract: In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE LLMs from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.

Enhancing Vision Language Models with Mini-Gemini: A Dive into Multi-Modality, High Resolution, and Data Quality

Overview of Mini-Gemini

Mini-Gemini introduces a simple yet effective framework for enhancing Vision Language Models (VLMs) along three axes: high-resolution visual tokens, high-quality data, and any-to-any workflows. By integrating an additional high-resolution visual encoder, the framework refines the visual tokens with fine-grained detail without increasing their count, avoiding the cost of a longer visual token sequence. A carefully constructed dataset tailored for image comprehension and reasoning-based generation further broadens the operational scope of VLMs. Mini-Gemini is instantiated on a range of dense and Mixture of Experts (MoE) LLMs from 2B to 34B parameters and achieves leading results on several zero-shot benchmarks.

Technical Insights

Dual Vision Encoders and High-Resolution Image Processing

Mini-Gemini's architecture incorporates dual vision encoders that together improve the resolution and quality of visual cues. The low-resolution encoder produces the visual tokens that are fed to the LLM, while the high-resolution encoder supplies fine-grained regional features used to refine those tokens. This dual-encoder system, inspired by the Gemini constellation, processes high-resolution images efficiently without burdening the framework with an excessive number of visual tokens; a minimal sketch of the refinement step follows.
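To make the idea concrete, below is a minimal PyTorch sketch of such a refinement step, assuming cross-attention from the low-resolution tokens (queries) to the high-resolution features of the corresponding image region (keys and values). The module name, linear projections, tensor shapes, and residual connection are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class PatchInfoMining(nn.Module):
    """Refine low-resolution visual tokens with high-resolution cues via
    cross-attention while keeping the visual token count unchanged.
    Illustrative sketch; dimensions and module names are assumptions."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # queries from low-res tokens
        self.k_proj = nn.Linear(dim, dim)  # keys from high-res patch features
        self.v_proj = nn.Linear(dim, dim)  # values from high-res patch features
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, low_res_tokens, high_res_patches):
        # low_res_tokens:   (B, N, C)    one token per coarse image patch
        # high_res_patches: (B, N, M, C) M fine-grained features per coarse patch
        q = self.q_proj(low_res_tokens).unsqueeze(2)            # (B, N, 1, C)
        k = self.k_proj(high_res_patches)                       # (B, N, M, C)
        v = self.v_proj(high_res_patches)                       # (B, N, M, C)
        attn = (q @ k.transpose(-1, -2)) / k.shape[-1] ** 0.5   # (B, N, 1, M)
        mined = (attn.softmax(dim=-1) @ v).squeeze(2)           # (B, N, C)
        # Residual keeps the original low-res semantics; token count stays N.
        return low_res_tokens + self.out_proj(mined)


if __name__ == "__main__":
    B, N, M, C = 2, 576, 16, 1024
    refiner = PatchInfoMining(dim=C)
    tokens = refiner(torch.randn(B, N, C), torch.randn(B, N, M, C))
    print(tokens.shape)  # torch.Size([2, 576, 1024])
```

Because each low-resolution token only attends to the high-resolution features of its own region, the sequence length seen by the LLM never grows, which is the efficiency property the prose above describes.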

Enhanced Data Quality

The paper underscores the importance of high-quality data in improving the performance of VLMs. Mini-Gemini leverages a meticulously constructed dataset from various public sources, focusing on image comprehension, text and image generation, and reasoning. The inclusion of high-quality responses and task-oriented instructions significantly contributes to the model's enhanced understanding and generation capabilities.
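As one concrete illustration, an instruction-tuning record in such a dataset could be structured as an image paired with a LLaVA-style conversation; the field names, file paths, and the `<gen>` tag below are purely hypothetical and do not reflect the authors' exact schema.

```python
# Hypothetical LLaVA-style instruction-tuning records; field names, paths,
# and the <gen>...</gen> tag are illustrative assumptions, not the paper's schema.

# A comprehension/reasoning sample: an image plus a task-oriented instruction
# and a high-quality response.
comprehension_sample = {
    "id": "chart_000123",
    "image": "images/charts/000123.png",
    "conversations": [
        {"from": "human", "value": "<image>\nSummarize the chart and explain the overall trend."},
        {"from": "gpt", "value": "The chart tracks monthly revenue, which rises steadily after March ..."},
    ],
}

# A reasoning-based generation sample: the answer is a text-to-image prompt
# wrapped in a tag so it can later be routed to an image generator.
generation_sample = {
    "id": "gen_000042",
    "conversations": [
        {"from": "human", "value": "Design a poster for an outdoor jazz night."},
        {"from": "gpt", "value": "<gen>vintage jazz night poster, warm stage lighting, art-deco typography</gen>"},
    ],
}
```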

Expanding VLM Functions

At the heart of Mini-Gemini is an any-to-any workflow that accepts image and text inputs and produces image and text outputs. This flexibility comes from the visual token enhancement pipeline described above combined with off-the-shelf generative models: for image outputs, the VLM reasons over the request and emits a text prompt that is handed to a text-to-image diffusion model (SDXL in the paper). The approach not only improves comprehension performance but also enables reasoning-based image and text generation; a sketch of this routing step is shown below.
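A minimal sketch of how such output routing could look is given below, assuming the VLM wraps generation prompts in a special tag and that a diffusers-style SDXL pipeline is available. The tag name, helper function, and return format are illustrative assumptions rather than the repository's actual interface.

```python
import re

# Hypothetical tag the VLM is instructed to wrap text-to-image prompts in.
GEN_TAG = re.compile(r"<gen>(.*?)</gen>", re.DOTALL)


def route_output(vlm_response: str, diffusion_pipe=None):
    """Route a VLM response: plain text is returned as-is; if the model emitted
    a generation prompt, forward it to a text-to-image model such as SDXL.
    Sketch only; the tag format and pipeline interface are assumptions."""
    match = GEN_TAG.search(vlm_response)
    if match is None or diffusion_pipe is None:
        return {"type": "text", "content": vlm_response}
    prompt = match.group(1).strip()
    image = diffusion_pipe(prompt).images[0]  # diffusers-style pipeline call
    return {"type": "image", "prompt": prompt, "content": image}


# Example usage (requires the `diffusers` library and an SDXL checkpoint):
# from diffusers import StableDiffusionXLPipeline
# pipe = StableDiffusionXLPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-xl-base-1.0")
# result = route_output(vlm_response, diffusion_pipe=pipe)
```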

Empirical Validation and Performance

Extensive experiments demonstrate Mini-Gemini's strong performance across a range of zero-shot benchmarks. The framework consistently outperforms existing open models and even surpasses developed private models on challenging benchmarks such as MMB (MMBench) and MMMU. These results highlight Mini-Gemini's capabilities on advanced multi-modal tasks and attest to its potential as a robust framework for VLM research.

Future Directions and Theoretical Implications

The introduction of Mini-Gemini opens new avenues for improving the performance and applicability of Vision Language Models. Its scalable architecture, combined with the focus on high-resolution visual tokens and high-quality data, provides a strong reference point for future work. The analysis of high-resolution image processing and data quality also offers practical insight into how VLMs can be optimized. As the community continues to push the boundaries of generative AI, Mini-Gemini marks a notable step toward fully realizing the potential of multi-modality in AI models.

Concluding Remarks

Mini-Gemini represents a significant advancement in the field of Vision Language Models, showcasing the vital role of high-resolution visual processing, high-quality data, and flexible any-to-any workflows. Its strong performance across a breadth of benchmarks highlights the effectiveness of this approach. As the field moves forward, Mini-Gemini's contributions offer a solid foundation for further innovation in multi-modal understanding and generation.

Authors (8)
  1. Yanwei Li (36 papers)
  2. Yuechen Zhang (14 papers)
  3. Chengyao Wang (7 papers)
  4. Zhisheng Zhong (20 papers)
  5. Yixin Chen (126 papers)
  6. Ruihang Chu (18 papers)
  7. Shaoteng Liu (16 papers)
  8. Jiaya Jia (162 papers)
Citations (150)