- The paper introduces a generative pretrained vision encoder that uses a caption-only objective to simplify multimodal pretraining.
- The paper optimizes training by retaining 25–35% of visual tokens and leveraging high-quality synthetic captions, resulting in significant efficiency gains.
- The paper demonstrates that discarding the contrastive loss in favor of generative captioning scales effectively to encoders with over 1B parameters and transfers across diverse multimodal tasks.
OpenVision 2: Generative Pretrained Visual Encoders for Multimodal Learning
Introduction
OpenVision 2 presents a streamlined approach to vision-language pretraining by discarding the text encoder and contrastive loss, relying exclusively on a generative captioning objective. This design is motivated by the need for computational efficiency and scalability in multimodal foundation models, particularly for researchers constrained by hardware resources. The architecture consists of a vision encoder (ViT variants) and a text decoder, trained to generate high-quality synthetic captions for images. The model leverages the ReCap-DataComp-1B v2 dataset, which provides diverse, grounded captions generated by LLaMA-3-powered LLaVA models conditioned on both images and alt-text.
Methodology
Architectural Simplification
OpenVision 2 eliminates the dual-branch pipeline of its predecessor, which required both contrastive and generative losses. The new pipeline is as follows:
- Image Encoding: Images are processed by a ViT-based encoder, producing a sequence of visual tokens.
- Token Masking: Approximately two-thirds of the visual tokens are randomly dropped, and only the retained tokens are passed to the decoder, reducing computational load and regularizing the encoder.
- Caption Generation: The remaining tokens are input to a text decoder, which autoregressively generates the paired synthetic caption.
This approach aligns the pretraining objective with downstream multimodal tasks (e.g., LLaVA), facilitating smoother transfer and reducing objective mismatch.
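As a rough illustration, the sketch below mirrors this three-step pipeline in PyTorch-style code. The module interfaces (a ViT returning [B, N, D] token sequences, a prefix-conditioned autoregressive decoder), the keep_ratio of 0.33, and the loss wiring are assumptions made for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionOnlyPretrainStep(nn.Module):
    """Minimal sketch: encode -> drop ~2/3 of visual tokens -> caption loss."""

    def __init__(self, vision_encoder: nn.Module, text_decoder: nn.Module,
                 keep_ratio: float = 0.33):
        super().__init__()
        self.vision_encoder = vision_encoder  # assumed to return [B, N, D] visual tokens
        self.text_decoder = text_decoder      # assumed prefix-conditioned autoregressive decoder
        self.keep_ratio = keep_ratio          # fraction of visual tokens retained (~25-35%)

    def forward(self, images: torch.Tensor, caption_ids: torch.Tensor) -> torch.Tensor:
        # 1) Image encoding: the ViT turns each image into a sequence of visual tokens.
        tokens = self.vision_encoder(images)                      # [B, N, D]
        batch, num_tokens, dim = tokens.shape

        # 2) Token masking: randomly keep roughly one third of the tokens and
        #    discard the rest before the decoder ever sees them.
        n_keep = max(1, int(num_tokens * self.keep_ratio))
        keep_idx = torch.rand(batch, num_tokens, device=tokens.device).argsort(dim=1)[:, :n_keep]
        kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))

        # 3) Caption generation: condition the decoder on the kept tokens (here
        #    passed as a prefix) and train with next-token prediction on the
        #    paired synthetic caption.
        logits = self.text_decoder(prefix=kept, input_ids=caption_ids[:, :-1])  # [B, T-1, V]
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               caption_ids[:, 1:].reshape(-1))
```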
Data and Training Strategies
- Synthetic Captioning: Training exclusively uses synthetic captions from ReCap-DataComp-1B v2, which are longer and more informative than raw web alt-text.
- CLIPA Curriculum: Pretraining is performed on low-resolution images, followed by brief high-resolution fine-tuning, yielding substantial speed-ups (see the token-count sketch after this list).
- Token Masking: Empirical ablations show that retaining 25–35% of visual tokens optimizes both efficiency and performance, especially on OCR and VQA benchmarks.
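The speed-up from the low-resolution curriculum is largely a matter of token counts, since a ViT's token count grows quadratically with image resolution. A back-of-the-envelope calculation, assuming a patch size of 14 and example resolutions (the actual schedule may differ):

```python
# Rough token-count arithmetic behind the low-resolution curriculum.
# The patch size of 14 and the listed resolutions are illustrative assumptions.
PATCH_SIZE = 14

def num_visual_tokens(resolution: int) -> int:
    """Number of non-overlapping patches a ViT produces for a square image."""
    return (resolution // PATCH_SIZE) ** 2

for res in (112, 224):
    print(f"{res}px -> {num_visual_tokens(res)} tokens")
# 112px -> 64 tokens, 224px -> 256 tokens: spending most of pretraining at low
# resolution processes ~4x fewer tokens per image, which compounds with the
# ~1/3 token retention applied before the decoder.
```

Only a brief final stage at full resolution is then needed to adapt the encoder to the larger token count.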
Comparison to Prior Work
- CapPa: OpenVision 2 improves upon CapPa through higher-quality captions, simpler fusion via plain token concatenation (sketched after this list), larger model and data scale, and pure autoregressive decoding.
- AIMv2: Unlike AIMv2, which blends pixel-level and text-level objectives and uses a prefix-ViT, OpenVision 2 employs a vanilla ViT and a caption-only objective, with more aggressive token masking and fully synthetic data.
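For the fusion difference mentioned above, the snippet below sketches what token-concatenation fusion amounts to; the shapes and dimensions are placeholders rather than the model's actual configuration.

```python
# Illustrative-only sketch of prefix-style fusion via token concatenation.
import torch

B, N_VIS, N_TXT, D = 2, 64, 32, 768          # placeholder shapes, not real config
visual_tokens = torch.randn(B, N_VIS, D)      # retained visual tokens from the encoder
caption_embeds = torch.randn(B, N_TXT, D)     # embedded caption tokens

# The visual tokens simply occupy the first positions of the decoder's input
# sequence; a causal mask over the caption positions then gives pure
# autoregressive decoding with no cross-attention layers or pooled image embedding.
decoder_input = torch.cat([visual_tokens, caption_embeds], dim=1)  # [B, N_VIS + N_TXT, D]
```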
Experimental Results
Multimodal Benchmarking
OpenVision 2 is evaluated under the LLaVA-1.5 and Open-LLaVA-Next frameworks on tasks including TextVQA, ChartQA, OCRBench, MME, SEED, SQA, GQA, and POPE. Key findings:
- Performance Parity: OpenVision 2 matches or slightly exceeds OpenVision and other CLIP-style models across most benchmarks, with particularly strong results on OCR-intensive tasks.
- Scalability: The model scales efficiently to over 1B parameters and 12.8B image-caption pairs, maintaining robust performance at larger resolutions and batch sizes.
- Efficiency Gains: Training time is reduced by 1.5–2× and memory usage by 1.8× compared to OpenVision, enabling batch sizes up to 8k on TPU v4-64.
Ablation Studies
- Caption Source: Models trained on synthetic captions outperform those trained on raw alt-text by significant margins (e.g., +5.1 on TextVQA, +53 on OCRBench).
- Token Masking Ratio: Retaining 25–35% of visual tokens (i.e., masking roughly two-thirds) yields optimal results, improving both efficiency and semantic representation.
Resource Requirements
- Hardware: All experiments use Google Cloud TPUs (v4-512 for training-time benchmarking, v4-64 for memory analysis).
- Batch Size: OpenVision 2 supports substantially larger batch sizes due to reduced memory footprint.
Implications and Future Directions
OpenVision 2 demonstrates that generative, caption-only pretraining is a viable and efficient alternative to contrastive learning for vision encoders in multimodal models. This challenges the prevailing assumption that CLIP-style contrastive objectives are necessary for scalable, general-purpose vision encoders in multimodal LLMs. The release of code, checkpoints, and the ReCap-DataComp-1B v2 corpus provides a foundation for further research into generative paradigms.
Potential future directions include:
- Exploring Hybrid Objectives: Investigating combinations of generative and contrastive losses for specialized tasks.
- Scaling to Diverse Modalities: Extending the generative approach to video, audio, and temporal data.
- Data-Centric Improvements: Further refining synthetic captioning strategies and leveraging multilingual or domain-specific corpora.
- Efficient Deployment: Adapting the architecture for resource-constrained environments and edge devices.
Conclusion
OpenVision 2 introduces a simplified, generative-only vision encoder pretraining paradigm that achieves competitive multimodal performance while substantially reducing computational cost and memory requirements. The work provides strong empirical evidence that caption-only objectives can rival contrastive methods, especially when paired with high-quality synthetic data and efficient training strategies. This paradigm shift opens new avenues for scalable, open, and efficient multimodal foundation models.