NVLM: Open Frontier-Class Multimodal LLMs (2409.11402v2)

Published 17 Sep 2024 in cs.CL, cs.AI, cs.CV, cs.LG, and cs.MM

Abstract: We introduce NVLM 1.0, a family of frontier-class multimodal LLMs that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we release the model weights at https://huggingface.co/nvidia/NVLM-D-72B and will open-source the training code for the community soon.
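
Since the abstract points to publicly released weights at https://huggingface.co/nvidia/NVLM-D-72B, a minimal loading sketch is included below. It assumes the checkpoint follows the standard Hugging Face transformers remote-code pattern; the exact image preprocessing and chat interface are documented on the model card and omitted here.

```python
# Minimal sketch (not from the paper) of loading the released NVLM-D-72B
# weights, assuming the repository ships a transformers-compatible
# remote-code implementation. See the model card for preprocessing details.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "nvidia/NVLM-D-72B"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,   # 72B parameters: plan for multi-GPU or offloading
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()
```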

NVLM: Frontier-Class Multimodal LLMs

The paper, "NVLM: Open Frontier-Class Multimodal LLMs," introduces NVLM 1.0, a family of multimodal LLMs that achieve state-of-the-art performance on vision-language tasks while maintaining, and even improving, the text-only capabilities of their LLM backbones. The paper situates its contributions within the rapidly advancing area of multimodal LLMs, models that process and link multiple forms of data such as text and images. NVLM 1.0 is compared against prominent models in this domain and stands out for its architecture, training-data quality, and methodological advances.

Summary of Contributions

The paper makes several key contributions:

  1. Comprehensive Model Design:
    • NVLM 1.0 includes three distinct architectural types: NVLM-D (Decoder-only), NVLM-X (Cross-attention), and NVLM-H (Hybrid).
    • The design details of and trade-offs between these approaches are analyzed in depth. The cross-attention model, NVLM-X, and the novel hybrid model, NVLM-H, are highlighted for their efficient processing of high-resolution image inputs and their strong multimodal reasoning capabilities; a minimal sketch contrasting the decoder-only and cross-attention fusion styles is given after this list.
  2. Training Efficiency and Performance:
    • The paper addresses the training efficiency challenge by introducing a 1-D tile-tagging mechanism for high-resolution images, bolstering the performance on OCR and multimodal reasoning tasks.
    • It underscores that dataset quality and task diversity are more crucial than sheer scale during pretraining, and combines these insights to craft a highly curated dataset for optimal performance.
  3. Enhanced Multimodal and Text-Only Performance:
    • NVLM 1.0 enhances both vision-language and text-only performance. Remarkable improvements in math and coding benchmarks are seen, attributed to the integration of high-quality text-only datasets and multimodal training data.
    • The models' open-access nature and transparency in architecture and training datasets mark significant steps towards democratizing AI research and development.
  4. Comparison with Leading Models:
    • The NVLM-D 72B model demonstrates superior performance on VQAv2 and OCRBench and acts as a competitive alternative to proprietary models such as GPT-4o, without degrading text-only performance.
    • The hybrid NVLM-H model showcases exceptional performance in multimodal reasoning benchmarks, such as MMMU and MathVista, balancing computational efficiency with robust reasoning capabilities.
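
The architectural comparison above can be illustrated with a minimal, self-contained sketch. It is not the authors' code: module names, dimensions, and the gating scheme are illustrative, and the real models use a shared InternViT vision encoder with an MLP projector (NVLM-D) or gated cross-attention layers (NVLM-X).

```python
# Illustrative sketch of the two fusion styles compared in the paper:
# decoder-only (image tokens joined into the LLM input sequence) versus
# cross-attention (text hidden states attend to image features).
import torch
import torch.nn as nn

d_model = 1024

class DecoderOnlyFusion(nn.Module):
    """NVLM-D style: project vision features and treat them as ordinary tokens."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)  # stands in for the MLP projector

    def forward(self, text_emb, image_feats):
        image_tokens = self.proj(image_feats)
        # The decoder self-attends over the concatenated (longer) sequence.
        return torch.cat([image_tokens, text_emb], dim=1)

class CrossAttentionFusion(nn.Module):
    """NVLM-X style: text hidden states attend to image features via gated cross-attention."""
    def __init__(self):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gated residual, starts closed

    def forward(self, text_hidden, image_feats):
        attended, _ = self.xattn(text_hidden, image_feats, image_feats)
        # Image information enters via a gated residual; the decoder's sequence
        # length (and hence its cost) is unchanged.
        return text_hidden + torch.tanh(self.gate) * attended

text = torch.randn(2, 32, d_model)     # (batch, text_len, dim)
images = torch.randn(2, 256, d_model)  # (batch, image_tokens, dim)
print(DecoderOnlyFusion()(text, images).shape)    # torch.Size([2, 288, 1024])
print(CrossAttentionFusion()(text, images).shape) # torch.Size([2, 32, 1024])
```

NVLM-H, per the paper, combines the two paths: thumbnail tokens are fed to the decoder directly while the high-resolution tiles are consumed through gated cross-attention, trading sequence length against reasoning quality.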

Strong Numerical Results

The paper provides detailed empirical evaluations across various benchmarks:

  • OCRBench: NVLM-D 72B achieves the highest score of 853, surpassing leading models, including GPT-4o.
  • VQAv2: NVLM-D 72B scores an impressive 85.4, again leading among the compared models.
  • MMMU (Validation Set): NVLM-H 72B achieves the top score of 60.2 among competitive models.

In text-only tasks, the integration of high-quality datasets during supervised fine-tuning is pivotal. For example, NVLM-D 72B improves average text-benchmark accuracy by 4.3 points over its LLM backbone, highlighting the model's robustness in the text domain.

Theoretical and Practical Implications

The results imply several theoretical advancements and practical applications:

  • Theoretical Advancements:
    • The hybrid architecture NVLM-H introduces a novel approach to achieving better computational efficiency and multimodal reasoning. This blend of cross-attention and decoder-only methodologies may pave the way for future multimodal model designs.
    • The tile-tagging mechanism for dynamic high-resolution input represents a refinement in handling diverse visual inputs efficiently, potentially influencing future image-processing techniques in multimodal LLMs; a sketch of the tagging idea is given after this list.
  • Practical Applications:
    • NVLM models can be applied in diverse scenarios requiring robust understanding and reasoning with both text and visual data, such as document analysis, automated image and text-based reporting, and educational tools.
    • The open-source nature of NVLM-1.0 makes it a valuable asset for the research community, fostering further innovation and development in multimodal AI technologies.
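
The 1-D tile-tagging design mentioned above can be sketched as follows. This is an illustration of the idea rather than the released implementation: a dynamically tiled high-resolution image yields a global thumbnail plus several tiles, and a text tag is inserted before each tile's image tokens so the decoder can tell tiles apart in the flattened 1-D sequence. The tag strings and tile counts here are illustrative.

```python
# Illustrative sketch of 1-D tile tagging for dynamic-high-resolution input.
from typing import List

def tag_tiles(tile_token_blocks: List[List[str]]) -> List[str]:
    """Flatten per-tile token blocks into one sequence with 1-D tile tags."""
    sequence: List[str] = []
    for k, block in enumerate(tile_token_blocks):
        # Tile 0 plays the role of the global thumbnail; the rest are
        # high-resolution tiles tagged by their 1-D index.
        tag = "<tile_global_thumbnail>" if k == 0 else f"<tile_{k}>"
        sequence.append(tag)      # text-based tag marking where the tile begins
        sequence.extend(block)    # followed by that tile's image tokens
    return sequence

# Example: a thumbnail plus four tiles, three placeholder tokens each.
tiles = [[f"img_{k}_{i}" for i in range(3)] for k in range(5)]
print(tag_tiles(tiles)[:8])
# ['<tile_global_thumbnail>', 'img_0_0', 'img_0_1', 'img_0_2', '<tile_1>', 'img_1_0', 'img_1_1', 'img_1_2']
```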

Future Directions

The research opens several avenues for future exploration:

  • Scaling and Optimization: Continuously scaling NVLM with even more diverse datasets and optimized architectures could further enhance its performance, particularly in less-explored multimodal nuances.
  • Extension to Other Modalities: While the current focus is on text and vision, extending such architectures to encompass speech, video, and other modalities could significantly broaden the application range of NVLM models.
  • Fine-Grained Multimodal Reasoning: Investigating more fine-grained multimodal reasoning tasks, such as those involving complex interactions between text and images, can help better understand and improve NVLM's capabilities.

Conclusion

NVLM 1.0 stands out as an advanced, innovative contribution to the domain of multimodal LLMs, bridging the gap between research and practical application. The methodologies and insights provided pave the way for future advancements, making NVLM-1.0 not just a competitive model but also a foundational framework for future multimodal LLM research and development.

Authors (10)
  1. Wenliang Dai (24 papers)
  2. Nayeon Lee (28 papers)
  3. Boxin Wang (28 papers)
  4. Zihan Liu (102 papers)
  5. Jon Barker (26 papers)
  6. Tuomas Rintamaki (4 papers)
  7. Mohammad Shoeybi (60 papers)
  8. Bryan Catanzaro (123 papers)
  9. Wei Ping (51 papers)
  10. Zhuolin Yang (18 papers)
Citations (14)