Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (2409.17146v1)

Published 25 Sep 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and demo are available at https://molmo.allenai.org.

Authors (51)
  1. Matt Deitke (11 papers)
  2. Christopher Clark (27 papers)
  3. Sangho Lee (25 papers)
  4. Rohun Tripathi (6 papers)
  5. Yue Yang (146 papers)
  6. Jae Sung Park (35 papers)
  7. Mohammadreza Salehi (26 papers)
  8. Niklas Muennighoff (56 papers)
  9. Kyle Lo (73 papers)
  10. Luca Soldaini (62 papers)
  11. Jiasen Lu (32 papers)
  12. Taira Anderson (3 papers)
  13. Erin Bransom (11 papers)
  14. Kiana Ehsani (31 papers)
  15. Huong Ngo (2 papers)
  16. YenSung Chen (2 papers)
  17. Ajay Patel (17 papers)
  18. Mark Yatskar (38 papers)
  19. Chris Callison-Burch (102 papers)
  20. Andrew Head (17 papers)

Summary

Molmo and PixMo: A Comprehensive Overview

The paper "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models" presents an extensive paper on the development of advanced open-weight multimodal LLMs (VLMs). The primary contributions of this work are the Molmo family of VLMs and the PixMo datasets, both created with an emphasis on transparency and openness while maintaining high performance in various tasks.

Key Contributions

  1. Molmo VLMs: Molmo models are designed to be state-of-the-art within the class of open-weight models. The authors introduce several models within the Molmo family, each based on a different LLM backbone (e.g., OLMo-7B-1024, OLMoE-1B-7B, Qwen2 7B, and Qwen2 72B). These models combine a vision encoder with a transformer-based LLM and are optimized for generating detailed image captions and performing diverse visual reasoning tasks (a minimal sketch of this vision-to-language wiring follows this list).
  2. PixMo Datasets: The innovation in data collection is critical to the success of Molmo models. The authors collected the PixMo dataset entirely from human annotators using speech-based descriptions, creating a diverse and high-quality dataset of detailed image captions. This approach avoids the reliance on synthetic data generated by proprietary models, a common practice in recent open VLM developments.
  3. Open Release: The authors emphasize the open release of their models, datasets, and training pipelines. This level of transparency is designed to foster further research and development in the community, addressing the current gap in foundational knowledge on building VLMs from scratch.
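
To make the vision-to-language wiring in item 1 concrete, the following is a minimal PyTorch-style sketch of the kind of connector such models use to map vision-encoder patch features into the LLM's embedding space. The class name, layer sizes, and MLP structure here are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Illustrative connector: project vision-encoder patch features into
    the LLM's embedding space so they can be consumed as soft tokens."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# Toy usage with random tensors standing in for a real vision encoder and LLM.
connector = VisionLanguageConnector(vision_dim=1024, llm_dim=4096)
fake_patches = torch.randn(2, 576, 1024)   # e.g. a 24x24 patch grid per image
soft_tokens = connector(fake_patches)      # ready to prepend to text embeddings
print(soft_tokens.shape)                   # torch.Size([2, 576, 4096])
```

In models of this kind, the projected patch tokens are typically concatenated with the text token embeddings before being passed to the LLM backbone.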

Architecture and Training

The architecture of Molmo models follows a straightforward design combining a vision encoder with an LLM. The training process is divided into two main stages:

  1. Caption Generation: In the first stage, the vision encoder and LLM are jointly trained on the PixMo-Cap dataset, focusing on generating detailed image captions. This dataset was collected through a meticulous process where human annotators described images in detail by speaking, followed by transcription and quality improvement by a language-only LLM.
  2. Supervised Fine-Tuning: In the second stage, the models are fine-tuned on a mixture of academic datasets and newly collected PixMo datasets. These datasets encompass a wide range of tasks, from real-world user queries to specialized tasks like clock reading and document question answering. The PixMo-Points dataset is particularly noteworthy for introducing 2D pointing data, enabling models to interact more naturally with visual content by pointing and counting (a small parsing sketch follows this list).
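
As an illustration of how the 2D pointing output mentioned above might be consumed downstream (e.g., for counting), the sketch below parses a hypothetical XML-style point markup and converts the points to pixel coordinates. The tag format and the percentage-coordinate convention are assumptions made for illustration, not specifications taken from the paper.

```python
import re

# Hypothetical markup such as: '<point x="25.0" y="40.0" alt="mug">mug</point>'
# Coordinates are assumed to be percentages of the image width and height.
POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>')

def extract_points(text: str, width: int, height: int) -> list[tuple[float, float]]:
    """Parse point tags from model output and map them to pixel coordinates."""
    points = []
    for x_pct, y_pct in POINT_RE.findall(text):
        points.append((float(x_pct) / 100.0 * width, float(y_pct) / 100.0 * height))
    return points

output = '<point x="25.0" y="40.0" alt="mug">mug</point> <point x="70.0" y="42.5" alt="mug">mug</point>'
pts = extract_points(output, width=640, height=480)
print(len(pts), pts)   # 2 [(160.0, 192.0), (448.0, 204.0)]
```

Counting then reduces to the number of parsed points, which is one way pointing data can ground answers to "how many" questions.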

Evaluation and Results

The performance of Molmo models was rigorously evaluated using both academic benchmarks and human preference rankings. Key highlights include:

  • Academic Benchmarks: Molmo models were tested on 11 commonly used benchmarks, performing admirably across the board. The Molmo-72B model, based on Qwen2 72B, achieved the highest scores among open-weight models and ranked second in human preference evaluations, narrowly outperformed by GPT-4o.
  • Human Evaluation: Using an extensive Elo ranking system based on human preferences, the Molmo family demonstrated competitive performance, with the MolmoE-1B model nearly matching GPT-4V and other Molmo models comfortably outperforming several proprietary systems (a minimal sketch of the Elo update behind such rankings follows this list).
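
For context on the human evaluation, an Elo ranking aggregates pairwise preference votes into scalar ratings. The sketch below shows the textbook Elo expected-score and update rule; the K-factor, scale, and starting ratings are illustrative defaults and may not match the paper's exact fitting procedure.

```python
def elo_expected(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Expected score of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))

def elo_update(r_a: float, r_b: float, outcome_a: float, k: float = 32.0):
    """Update both ratings after one human comparison.
    outcome_a is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie."""
    e_a = elo_expected(r_a, r_b)
    r_a_new = r_a + k * (outcome_a - e_a)
    r_b_new = r_b + k * ((1.0 - outcome_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Two models start at 1000; a human prefers model A's answer in one matchup.
print(elo_update(1000.0, 1000.0, outcome_a=1.0))  # (1016.0, 984.0)
```

Repeating this update over many pairwise votes yields a relative ranking of the compared models.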

Implications and Future Directions

The contributions of this paper have several implications for the field of AI and VLM research:

  • Openness and Reproducibility: By providing open weights, data, training, and evaluation code, the authors set a new standard for transparency in VLM research. This openness is expected to enhance reproducibility, foster innovation, and accelerate the development of advanced multimodal systems.
  • Data Collection Innovation: The speech-based data collection methodology could influence future dataset creation practices, emphasizing human-in-the-loop approaches to obtain high-quality training data without synthetic dependencies.
  • Application Potential: The demonstrated capabilities of Molmo models, particularly in natural interaction through pointing and counting, suggest significant application potential in diverse fields such as robotics, user interfaces, and real-world visual question answering.

Looking forward, the authors' release plan promises continuous updates and further detailed reports, alongside new datasets and improved model weights. This ongoing commitment will likely contribute to the sustained advancement in the capabilities and accessibility of open VLMs.

In summary, this paper represents a significant step towards building high-performance, open-weight VLMs, addressing critical challenges in the field, and setting a benchmark for future research initiatives. The Molmo and PixMo contributions not only advance technical capabilities but also embody a commitment to openness and community-driven progress in AI.
