
Pixtral 12B (2410.07073v2)

Published 9 Oct 2024 in cs.CV and cs.CL

Abstract: We introduce Pixtral-12B, a 12-billion-parameter multimodal LLM. Pixtral-12B is trained to understand both natural images and documents, achieving leading performance on various multimodal benchmarks, surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size, and does not compromise on natural language performance to excel in multimodal tasks. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Pixtral 12B substantially outperforms other open models of similar sizes (Llama-3.2 11B & Qwen-2-VL 7B). It also outperforms much larger open models like Llama-3.2 90B while being 7x smaller. We further contribute an open-source benchmark, MM-MT-Bench, for evaluating vision-LLMs in practical scenarios, and provide detailed analysis and code for standardized evaluation protocols for multimodal LLMs. Pixtral-12B is released under Apache 2.0 license.

Citations (6)

Summary

  • The paper introduces Pixtral 12B, a 12-billion-parameter model that processes both images and text using an innovative RoPE-2D vision encoder and a long context window.
  • The paper demonstrates that Pixtral 12B outperforms larger models like Llama-3.2 90B in multimodal reasoning on benchmarks such as MM-MT-Bench.
  • The paper proposes standardized evaluation protocols with explicit prompting and flexible parsing to ensure consistent and fair assessments across multimodal tasks.

Pixtral 12B: A Comprehensive Overview

The paper introduces Pixtral 12B, a 12-billion-parameter multimodal LLM designed to process both natural images and text effectively. The work is notable because Pixtral outperforms significantly larger models on multimodal benchmarks while maintaining strong performance on text-only tasks. Pixtral thus represents a substantial step forward in integrating multimodal capabilities without diminishing text reasoning ability.

Key Features

Pixtral 12B incorporates a host of innovations that contribute to its impressive performance:

  • Vision Encoder: A novel encoder trained from scratch with RoPE-2D position encoding enables Pixtral to process images at their native resolution and aspect ratio. This flexibility lets it operate across settings ranging from low-latency inference to high-resolution, fine-grained understanding tasks (see the sketch following this list).
  • Long Context Window: The model can handle 128K tokens, accommodating multiple images in context, which enhances its utility in complex conversation scenarios.
  • Benchmark Performance: When evaluated on the newly introduced MM-MT-Bench, Pixtral surpasses other open models within its range and even outperforms larger models like Llama-3.2 90B in certain tasks.
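
To make the RoPE-2D idea concrete, here is a minimal PyTorch sketch of 2D rotary position embeddings applied to a grid of image patches. The row/column split of the feature dimension, the interleaved-pair rotation, and the frequency base are illustrative assumptions; the released encoder's exact configuration may differ.

```python
import torch

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor,
            base: float = 10000.0) -> torch.Tensor:
    """Rotate patch features by their (row, col) position in the patch grid.

    x:    (num_patches, dim) patch embeddings, dim divisible by 4
    rows: (num_patches,) row index of each patch
    cols: (num_patches,) column index of each patch
    """
    half = x.shape[-1] // 2  # first half encodes rows, second half columns
    freqs = 1.0 / (base ** (torch.arange(0, half, 2).float() / half))

    def rotate(v, pos):
        angles = pos[:, None] * freqs[None, :]       # (patches, half/2)
        cos, sin = angles.cos(), angles.sin()
        v1, v2 = v[..., 0::2], v[..., 1::2]          # interleaved pairs
        return torch.stack([v1 * cos - v2 * sin,
                            v1 * sin + v2 * cos], dim=-1).flatten(-2)

    return torch.cat([rotate(x[..., :half], rows.float()),
                      rotate(x[..., half:], cols.float())], dim=-1)

# Works for any grid shape, e.g. a 4x6 patch grid with 8-dim features:
H, W, D = 4, 6, 8
grid = torch.stack(torch.meshgrid(torch.arange(H), torch.arange(W),
                                  indexing="ij"), dim=-1).reshape(-1, 2)
out = rope_2d(torch.randn(H * W, D), grid[:, 0], grid[:, 1])
```

Because positions are absolute grid indices rather than embeddings learned for one fixed grid, the same encoder handles any height-by-width patch layout, which is what permits native-resolution, native-aspect-ratio inputs.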

Comparative Evaluation

Pixtral 12B’s performance was validated against a series of benchmarks, demonstrating its capabilities:

  • Multimodal Reasoning: Pixtral excels compared to both open-source models such as Llama-3.2 11B and Qwen2-VL 7B, and closed models like Claude-3 Haiku and Gemini-1.5 Flash 8B.
  • Text-Only Tasks: The model also posts strong results on standard text benchmarks such as MATH and HumanEval, confirming that its multimodal training does not come at the cost of language performance.

Evaluation Protocols

The paper highlights the variability and inconsistency of existing multimodal evaluation protocols. The authors address this by pairing explicit prompts, which state the required answer format, with flexible parsing of model outputs, yielding a more standardized and fair comparison across models. A sketch of the parsing idea follows.
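
The paper's own parser is not reproduced here, so the snippet below is only a hedged illustration of what "flexible parsing" can mean in practice: extract a final answer from free-form output instead of demanding an exact-match reply. The regexes and normalization are assumptions, not the authors' code.

```python
import re

def parse_answer(response: str) -> str | None:
    """Pull a final answer out of a model's free-form response."""
    # Prefer an explicit "Final answer: X" pattern, the kind of format an
    # explicit prompt would request.
    m = re.search(r"final answer\s*[:\-]?\s*([A-Za-z0-9./\-]+)",
                  response, flags=re.IGNORECASE)
    if m:
        return m.group(1).strip(".")
    # Otherwise fall back to the last standalone choice letter or number.
    tokens = re.findall(r"\b([A-E]|\d+(?:\.\d+)?)\b", response)
    return tokens[-1] if tokens else None

assert parse_answer("Reasoning... Final answer: B.") == "B"
assert parse_answer("The total is therefore 42") == "42"
```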

Architectural Innovations

Pixtral 12B pairs a multimodal decoder adapted from Mistral Nemo 12B with the new Pixtral-ViT vision encoder, which accepts variable image sizes. This configuration supports complex multimodal reasoning and moves seamlessly between single-image and multi-image settings.
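
The paper describes the encoder as tokenizing an image into 16x16 patches, inserting a break token after each patch row and an end token after the image; the helper below reconstructs the resulting per-image token budget under that description. The exact special-token names and rounding behavior are assumptions of this sketch.

```python
import math

PATCH = 16  # vision-encoder patch size described in the paper

def image_token_count(height: int, width: int) -> int:
    """Sequence length an image contributes under the row-break/end scheme."""
    rows = math.ceil(height / PATCH)
    cols = math.ceil(width / PATCH)
    return rows * (cols + 1) + 1  # +1 per row for a break token, +1 to end

# A 1024x1024 image costs 64 * (64 + 1) + 1 = 4161 tokens, while a 512x256
# thumbnail costs only 32 * (16 + 1) + 1 = 545, so users trade resolution
# against the 128K-token context budget.
print(image_token_count(1024, 1024), image_token_count(512, 256))
```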

Practical Implications and Future Directions

The release of Pixtral 12B under an Apache 2.0 license opens new avenues for practical applications in multimodal AI, ranging from enhanced virtual assistants to sophisticated image- and document-understanding pipelines. The associated open-source benchmark, MM-MT-Bench, establishes a new standard for evaluating multimodal models in practical scenarios.

Looking forward, Pixtral 12B sets a precedent for future developments in AI that can seamlessly integrate multiple forms of data while optimizing performance and scalability.

In summary, Pixtral 12B represents a significant advancement in multimodal language modeling, with its innovative architecture and robust performance setting a new benchmark in the field.

Tweets

This paper has been mentioned in 17 tweets and received 1779 likes.

HackerNews

  1. Pixtral 12B Technical Report (1 point, 0 comments)