LLaVA-OneVision: Easy Visual Task Transfer
Abstract: We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
Glossary
- Ablation: An experimental analysis where components or settings are systematically removed or varied to assess their impact on performance. "Please see our detailed ablations of visual representation in~\cite{li2024llavanext-ablations}."
- AnyRes: A visual encoding scheme that processes images at arbitrary resolutions by dividing them into crops suited for the encoder; a minimal crop-splitting sketch appears after this glossary. "AnyRes for handling high-resolution images"
- Auto-regressive: A modeling approach where each token is predicted conditioned on previously generated tokens in a sequence. "For the conditionals in~\eqref{eq:auto_regressive}, we explicitly add $\mathbf{X}_v$ to emphasize the fact that the visual signal is grounded for all answers."
- Bilinear interpolation: A method for resampling or resizing data (e.g., feature maps) using linear interpolation in two dimensions; a token-reduction sketch appears after this glossary. "Bilinear interpolation is employed to reduce the number of tokens, allowing the consideration of a larger number of frames by reducing tokens per frame."
- Curriculum learning: A training strategy that presents tasks or examples in increasing order of difficulty to improve learning efficiency and stability. "We train the model via a curriculum learning principle, where training objectives and examples of increasing difficulty are observed in a stage-wise manner."
- Feature maps: Structured arrays of features produced by a vision encoder or convolutional layers that spatially represent visual information. "Only the base image resolution is considered and fed into the vision encoder to obtain feature maps, eliminating the need for multi-crop of high resolution image and thus saving computational resources"
- Grid features: Spatially organized features (often from transformer or CNN layers) sampled on a regular grid across the image. "The grid features before and after the last Transformer layer are considered in our experiments."
- Instruction tuning: Fine-tuning a model on datasets where inputs include instructions and desired responses to improve its ability to follow tasks. "Visual instruction tuning refers to the capability of an LMM to understand and act upon visual instructions."
- LLM: A neural network trained on large-scale text corpora to perform language understanding and generation tasks. "As a cost-efficient recipe, it is typically developed by connecting vision encoders with LLMs using a simple connection module."
- Large Multimodal Models (LMMs): Models that process and integrate multiple data modalities (e.g., text, images, video) for unified reasoning and generation. "We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series."
- Modality transfer: Leveraging capabilities learned in one data modality (e.g., images) to perform tasks in another (e.g., video). "The Video blog~\cite{zhang2024llavanext-video} shows that the image-only-trained LLaVA-NeXT model is surprisingly strong on video tasks with zero-shot modality transfer, due to the design of AnyRes to digest any vision signals as a sequence of images."
- Multimodal pre-training: Pre-training models on large datasets containing multiple modalities to build general cross-modal representations. "The web-scale public image-text data is often of low-quality, rendering the data scaling of multimodal pre-training less efficient."
- Optical Character Recognition (OCR): Technology for detecting and transcribing text from images or documents. "We used this text reading data along with the SynDOG EN/CN, to form the Document / OCR Data, totaling 1.1M samples."
- Projector: A learned module (often an MLP) that maps visual features into the LLM’s embedding space as token-like inputs; a minimal projector sketch appears after this glossary. "{\it Projector}. We consider a 2-layer MLP~\cite{liu2024improved} parameterized by $\theta$, yielding a sequence of visual tokens $\mathbf{H}_v = p(\mathbf{Z}_v)$."
- SigLIP: A vision encoder model trained with a sigmoid loss, used to extract visual features from images. "We consider the SigLIP~\cite{zhai2023sigmoid} as the visual encoder $g_{\psi}(\cdot)$, encoding an input image $\mathbf{X}_v$ into its visual feature $\mathbf{Z}_v = g(\mathbf{X}_v)$."
- Token (visual token): A discrete unit representing information; visual tokens are embeddings of image features passed to an LLM. "The maximum number of visual tokens across different scenarios is designed to be similar, ensuring balanced visual representations to accommodate cross-scenario capability transfer."
- Transfer learning: Adapting a model trained on one task or domain to another, leveraging previously learned representations. "Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities."
- Vision encoder: A neural network that converts raw images or frames into feature representations usable by downstream models. "Each frame of the video is resized to the base image resolution and processed by the vision encoder to generate feature maps."
- Vision Transformer (ViT): A transformer-based architecture for image understanding that treats image patches as tokens. "This principle is paramount due to the extensive knowledge stored within pre-trained LLMs and Vision Transformers (ViTs)."
- Word embedding space: The continuous vector space in which tokens of an LLM are represented. "to project image features into the word embedding space"
- Zero-shot modality transfer: Applying a model trained on one modality to another without additional fine-tuning, relying on generalization. "strong on video tasks with zero-shot modality transfer"
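The AnyRes entry above describes splitting a high-resolution image into base-resolution crops for the vision encoder. The snippet below is a minimal sketch of that idea, assuming a fixed base size and a simple rounding heuristic for the grid; the function name and grid-selection logic are illustrative assumptions (the released implementation chooses the grid from a predefined set of configurations).

```python
# Minimal sketch of an AnyRes-style split (names and grid-selection logic are
# illustrative assumptions, not the paper's exact implementation).
from PIL import Image


def anyres_crops(image: Image.Image, base_size: int = 384, max_grid: int = 3):
    """Split a high-resolution image into base-resolution crops plus a resized overview."""
    # Choose the number of crops per side, capped so the token budget stays bounded.
    cols = min(max_grid, max(1, round(image.width / base_size)))
    rows = min(max_grid, max(1, round(image.height / base_size)))

    # Resize so the image tiles exactly into a (rows x cols) grid of base-size crops.
    resized = image.resize((cols * base_size, rows * base_size))
    crops = [
        resized.crop((c * base_size, r * base_size, (c + 1) * base_size, (r + 1) * base_size))
        for r in range(rows)
        for c in range(cols)
    ]

    # A base-resolution overview of the whole image is kept alongside the crops.
    overview = image.resize((base_size, base_size))
    return [overview] + crops
```

Each returned crop is encoded independently by the vision encoder, so higher-resolution inputs simply yield more visual tokens rather than requiring a larger encoder.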
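The bilinear-interpolation entry notes that tokens per frame are reduced so more video frames fit within the context. Below is a minimal PyTorch sketch of that reduction; the tensor shapes, the SigLIP-style 27x27x1152 feature maps, and the output grid size are assumptions rather than the paper's exact values.

```python
# Minimal sketch of reducing visual tokens per frame with bilinear interpolation,
# so more frames fit within a fixed token budget (shapes are assumptions).
import torch
import torch.nn.functional as F


def reduce_frame_tokens(frame_features: torch.Tensor, out_side: int = 14) -> torch.Tensor:
    """Downsample per-frame feature maps from (T, H, W, C) to (T, out_side*out_side, C)."""
    t, h, w, c = frame_features.shape
    grid = frame_features.permute(0, 3, 1, 2)              # (T, C, H, W)
    grid = F.interpolate(grid, size=(out_side, out_side),  # bilinear resize of the feature map
                         mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(t, out_side * out_side, c)


# e.g. 32 frames of 27x27 feature maps -> 196 tokens per frame instead of 729
tokens = reduce_frame_tokens(torch.randn(32, 27, 27, 1152))
```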
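The Projector and SigLIP entries describe a 2-layer MLP that maps vision-encoder features $\mathbf{Z}_v$ into visual tokens $\mathbf{H}_v$ in the LLM's word embedding space. The sketch below assumes SigLIP-style 1152-dimensional features, a 4096-dimensional LLM embedding, and a GELU activation; these dimensions and the class name are assumptions, not the released code.

```python
# Minimal sketch of the 2-layer MLP projector mapping vision features Z_v into
# visual tokens H_v in the LLM's word embedding space (dimensions are assumptions).
import torch
import torch.nn as nn


class Projector(nn.Module):
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        # Two linear layers with a GELU in between, in the style of LLaVA's MLP projector.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, z_v: torch.Tensor) -> torch.Tensor:
        # z_v: (num_visual_tokens, vision_dim) features from the vision encoder (e.g. SigLIP)
        return self.mlp(z_v)  # H_v = p(Z_v), ready to interleave with text embeddings


h_v = Projector()(torch.randn(729, 1152))  # 729 patch features -> 729 visual tokens
```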