LLaVA-OneVision: Easy Visual Task Transfer
The paper "LLaVA-OneVision: Easy Visual Task Transfer" explores the development of large multimodal models (LMMs) that operate effectively across single-image, multi-image, and video scenarios. The authors present LLaVA-OneVision, a new family of open LMMs distinguished by their versatility and their ability to transfer tasks across these visual modalities.
Overview
LLaVA-OneVision consolidates insights and techniques from the LLaVA-NeXT blog series. It aims to push the performance boundaries of open LMMs through a unified approach to data curation, modeling, and visual representation. The model architecture connects the vision encoder to the LLM through a minimalist connection module, facilitating strong transfer learning across different modalities.
Contributions
The paper makes several noteworthy contributions:
- Development of Large Multimodal Models: The authors develop LLaVA-OneVision, which pushes the performance boundaries of open LMMs in single-image, multi-image, and video scenarios.
- Emerging Capabilities with Task Transfer: The design allows capabilities learned in one scenario to transfer to another; notably, strong video understanding emerges largely from image-based training.
- Open-source Efforts: To support community efforts, the authors release the generated multimodal instruction data, the codebase, model checkpoints, and a visual chat demo.
Model Architecture
LLaVA-OneVision employs Qwen-2 as the LLM due to its strong language capabilities, SigLIP as the vision encoder, and a 2-layer MLP as the projector to map visual features into the word embedding space. The model processes a variety of visual inputs, including single images, multiple images, and video sequences, with strategies to balance computational resources and performance.
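As a rough illustration, the sketch below (in PyTorch) shows how a 2-layer MLP projector can map vision-encoder features into the LLM's word-embedding space. The hidden sizes (1152 for a SigLIP-class encoder, 3584 for a Qwen2-7B-class LLM) and the GELU activation are assumptions for illustration, not the paper's verified configuration.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features into the LLM's
    word-embedding space. Hidden sizes are illustrative assumptions."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_visual_tokens, vision_dim)
        return self.proj(visual_features)  # (batch, num_visual_tokens, llm_dim)

# The projected visual tokens are concatenated with the text token embeddings
# and fed to the LLM as a single sequence.
```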
Visual Representations
A key innovation is the AnyRes strategy, which scales input resolution and the number of visual tokens to balance performance and compute across visual scenarios. The strategy adapts the visual representation to the input: a single high-resolution image is allotted many tokens, while a multi-frame video spreads a comparable token budget across frames, with fewer tokens per frame.
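The sketch below illustrates the token-budgeting idea behind an AnyRes-style representation: tile the image into crops at the encoder's native resolution, count tokens for a base overview image plus the crops, and shrink each crop's feature grid when a token cap would be exceeded. The helper name `anyres_token_count` and the specific numbers (384-pixel crops, 729 tokens per crop, the overall cap) are illustrative assumptions rather than the paper's exact settings.

```python
import math

def anyres_token_count(width: int, height: int,
                       crop_size: int = 384,
                       tokens_per_crop: int = 729,
                       max_tokens: int = 7290) -> int:
    """Estimate the visual token count under an AnyRes-style tiling."""
    # Cover the image with crops at the vision encoder's native resolution.
    grid_w = math.ceil(width / crop_size)
    grid_h = math.ceil(height / crop_size)
    # One downscaled base (overview) image plus all high-resolution crops.
    total = (1 + grid_w * grid_h) * tokens_per_crop
    # If the budget is exceeded, shrink each crop's feature grid
    # (e.g. via bilinear interpolation) to fit under the cap.
    if total > max_tokens:
        side = max(1, int(math.sqrt(max_tokens / (1 + grid_w * grid_h))))
        total = (1 + grid_w * grid_h) * side * side
    return total

# Example usage:
print(anyres_token_count(1920, 1080))  # many crops -> fewer tokens per crop
print(anyres_token_count(336, 336))    # one crop -> overview + crop at full budget
```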
Data Curation
The paper emphasizes the importance of high-quality knowledge and visual instruction tuning data. The authors curate large datasets from multiple sources while prioritizing quality over quantity. The high-quality knowledge data includes re-captioned descriptions and OCR data, while the visual instruction tuning data spans single-image, multi-image, and video scenarios.
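For illustration only, the grouping below sketches how these two data pools might be organized; the category names are hypothetical placeholders, not the paper's actual dataset list.

```python
# Hypothetical grouping of the two data pools described above.
HIGH_QUALITY_KNOWLEDGE = {
    "recaptioned_descriptions": "detailed captions regenerated by a strong captioner",
    "ocr": "document and text-heavy images with transcriptions",
}
VISUAL_INSTRUCTION_TUNING = {
    "single_image": ["general VQA", "documents and charts", "math and reasoning"],
    "multi_image": ["multi-image comparison", "interleaved instructions"],
    "video": ["video captioning", "video QA"],
}
```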
Training Strategies
The training process is divided into three stages (a minimal sketch follows the list):
- Language-Image Alignment: Aligns the visual features with the word embedding space of the LLM.
- High-Quality Knowledge Learning: Integrates new, high-quality data into the LMM.
- Visual Instruction Tuning: Teaches the model to perform a diverse set of visual tasks through instruction tuning.
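Below is a minimal sketch of this staged curriculum, assuming a Hugging Face-style model that returns a next-token-prediction loss; the module names and the choice of which parts are frozen at each stage (e.g. updating only the projector during alignment) are assumptions for illustration, not the authors' exact recipe.

```python
# Stage definitions; names and frozen/trainable splits are illustrative.
STAGES = [
    # Stage 1: language-image alignment -- typically only the projector is updated.
    {"name": "language_image_alignment", "trainable": ("projector",)},
    # Stage 1.5: high-quality knowledge learning -- the full model is updated.
    {"name": "high_quality_knowledge", "trainable": ("vision_encoder", "projector", "llm")},
    # Stage 2: visual instruction tuning on single-image, multi-image, and video data.
    {"name": "visual_instruction_tuning", "trainable": ("vision_encoder", "projector", "llm")},
]

def run_stage(model, stage, dataloader, optimizer):
    # Freeze everything except the modules listed for this stage.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(stage["trainable"])
    for batch in dataloader:
        loss = model(**batch).loss      # standard next-token prediction loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```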
Experimental Results
Evaluations using LMMs-Eval show that LLaVA-OneVision achieves strong performance across a wide array of single-image, multi-image, and video benchmarks. The largest model variant (72B parameters) is competitive with or superior to other open-source models and to proprietary models such as GPT-4V, particularly on complex tasks that require visual reasoning.
Conclusions and Future Directions
LLaVA-OneVision represents a significant step toward versatile LMMs capable of effective task transfer across visual modalities. The integration of high-quality data, a flexible visual representation strategy, and a minimalist architecture enables strong performance on varied tasks. Looking forward, the authors point to further gains from scaling data and model size and from adopting stronger LLMs. The open-source release should also facilitate future developments and applications in the broader AI community.