- The paper introduces a generative pretrained vision encoder that uses a caption-only objective to simplify multimodal pretraining.
- The paper optimizes training by retaining 25–35% of visual tokens and leveraging high-quality synthetic captions, resulting in significant efficiency gains.
- The paper demonstrates that discarding the contrastive loss in favor of generative captioning scales effectively to encoders with over 1B parameters and transfers across diverse multimodal tasks.
OpenVision 2: Generative Pretrained Visual Encoders for Multimodal Learning
Introduction
OpenVision 2 presents a streamlined approach to vision-language pretraining by discarding the text encoder and contrastive loss, relying exclusively on a generative captioning objective. This design is motivated by the need for computational efficiency and scalability in multimodal foundation models, particularly for researchers constrained by hardware resources. The architecture consists of a vision encoder (ViT variants) and a text decoder, trained to generate high-quality synthetic captions for images. The model leverages the ReCap-DataComp-1B v2 dataset, which provides diverse, grounded captions generated by LLaMA-3-powered LLaVA models conditioned on both images and alt-text.
Methodology
Architectural Simplification
OpenVision 2 eliminates the dual-branch pipeline of its predecessor, which required both contrastive and generative losses. The new pipeline is as follows:
- Image Encoding: Images are processed by a ViT-based encoder, producing a sequence of visual tokens.
- Token Masking: Approximately two-thirds of the visual tokens are randomly dropped, and only the retained tokens are passed to the decoder, reducing computational load and regularizing the encoder.
- Caption Generation: The remaining tokens are input to a text decoder, which autoregressively generates the paired synthetic caption.
This approach aligns the pretraining objective with downstream multimodal tasks (e.g., LLaVA), facilitating smoother transfer and reducing objective mismatch.
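As a rough illustration, the sketch below mirrors this three-step pipeline in PyTorch-style code. The module interfaces (a ViT returning [B, N, D] token sequences, a prefix-conditioned autoregressive decoder), the keep_ratio of 0.33, and the loss wiring are assumptions made for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionOnlyPretrainStep(nn.Module):
    """Minimal sketch: encode -> drop ~2/3 of visual tokens -> caption loss."""

    def __init__(self, vision_encoder: nn.Module, text_decoder: nn.Module,
                 keep_ratio: float = 0.33):
        super().__init__()
        self.vision_encoder = vision_encoder  # assumed to return [B, N, D] visual tokens
        self.text_decoder = text_decoder      # assumed prefix-conditioned autoregressive decoder
        self.keep_ratio = keep_ratio          # fraction of visual tokens retained (~25-35%)

    def forward(self, images: torch.Tensor, caption_ids: torch.Tensor) -> torch.Tensor:
        # 1) Image encoding: the ViT turns each image into a sequence of visual tokens.
        tokens = self.vision_encoder(images)                      # [B, N, D]
        batch, num_tokens, dim = tokens.shape

        # 2) Token masking: randomly keep roughly one third of the tokens and
        #    discard the rest before the decoder ever sees them.
        n_keep = max(1, int(num_tokens * self.keep_ratio))
        keep_idx = torch.rand(batch, num_tokens, device=tokens.device).argsort(dim=1)[:, :n_keep]
        kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))

        # 3) Caption generation: condition the decoder on the kept tokens (here
        #    passed as a prefix) and train with next-token prediction on the
        #    paired synthetic caption.
        logits = self.text_decoder(prefix=kept, input_ids=caption_ids[:, :-1])  # [B, T-1, V]
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               caption_ids[:, 1:].reshape(-1))
```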
Data and Training Strategies
- Synthetic Captioning: Training exclusively uses synthetic captions from ReCap-DataComp-1B v2, which are longer and more informative than raw web alt-text.
- CLIPA Curriculum: Pretraining is performed on low-resolution images, followed by brief high-resolution fine-tuning, yielding substantial speed-ups (see the token-count sketch after this list).
- Token Masking: Empirical ablations show that retaining 25–35% of visual tokens optimizes both efficiency and performance, especially on OCR and VQA benchmarks.
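The speed-up from the low-resolution curriculum is largely a matter of token counts, since a ViT's token count grows quadratically with image resolution. A back-of-the-envelope calculation, assuming a patch size of 14 and example resolutions (the actual schedule may differ):

```python
# Rough token-count arithmetic behind the low-resolution curriculum.
# The patch size of 14 and the listed resolutions are illustrative assumptions.
PATCH_SIZE = 14

def num_visual_tokens(resolution: int) -> int:
    """Number of non-overlapping patches a ViT produces for a square image."""
    return (resolution // PATCH_SIZE) ** 2

for res in (112, 224):
    print(f"{res}px -> {num_visual_tokens(res)} tokens")
# 112px -> 64 tokens, 224px -> 256 tokens: spending most of pretraining at low
# resolution processes ~4x fewer tokens per image, which compounds with the
# ~1/3 token retention applied before the decoder.
```

Only a brief final stage at full resolution is then needed to adapt the encoder to the larger token count.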
Comparison to Prior Work
- CapPa: OpenVision 2 improves upon CapPa through higher-quality captions, simpler fusion via plain token concatenation (sketched after this list), larger model and data scale, and pure autoregressive decoding.
- AIMv2: Unlike AIMv2, which blends pixel-level and text-level objectives and uses a prefix-ViT, OpenVision 2 employs a vanilla ViT and a caption-only objective, with more aggressive token masking and fully synthetic data.
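For the fusion difference mentioned above, the snippet below sketches what token-concatenation fusion amounts to; the shapes and dimensions are placeholders rather than the model's actual configuration.

```python
# Illustrative-only sketch of prefix-style fusion via token concatenation.
import torch

B, N_VIS, N_TXT, D = 2, 64, 32, 768          # placeholder shapes, not real config
visual_tokens = torch.randn(B, N_VIS, D)      # retained visual tokens from the encoder
caption_embeds = torch.randn(B, N_TXT, D)     # embedded caption tokens

# The visual tokens simply occupy the first positions of the decoder's input
# sequence; a causal mask over the caption positions then gives pure
# autoregressive decoding with no cross-attention layers or pooled image embedding.
decoder_input = torch.cat([visual_tokens, caption_embeds], dim=1)  # [B, N_VIS + N_TXT, D]
```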
Experimental Results
Multimodal Benchmarking
OpenVision 2 is evaluated under the LLaVA-1.5 and Open-LLaVA-Next frameworks on tasks including TextVQA, ChartQA, OCRBench, MME, SEED, SQA, GQA, and POPE. Key findings:
- Performance Parity: OpenVision 2 matches or slightly exceeds OpenVision and other CLIP-style models across most benchmarks, with particularly strong results on OCR-intensive tasks.
- Scalability: The model scales efficiently to over 1B parameters and 12.8B image-caption pairs, maintaining robust performance at larger resolutions and batch sizes.
- Efficiency Gains: Training time is reduced by 1.5–2× and memory usage by 1.8× compared to OpenVision, enabling batch sizes up to 8k on TPU v4-64.
Ablation Studies
- Caption Source: Models trained on synthetic captions outperform those trained on raw alt-text by significant margins (e.g., +5.1 on TextVQA, +53 on OCRBench).
- Token Masking Ratio: Retaining 25–35% of visual tokens (i.e., masking roughly two-thirds) yields optimal results, improving both efficiency and semantic representation.
Resource Requirements
- Hardware: All experiments use Google Cloud TPUs (v4-512 for training-time benchmarking, v4-64 for memory analysis).
- Batch Size: OpenVision 2 supports substantially larger batch sizes due to reduced memory footprint.
Implications and Future Directions
OpenVision 2 demonstrates that generative, caption-only pretraining is a viable and efficient alternative to contrastive learning for vision encoders in multimodal models. This challenges the prevailing assumption that CLIP-style contrastive objectives are necessary for scalable, general-purpose vision encoders in multimodal LLMs. The release of code, checkpoints, and the ReCap-DataComp-1B v2 corpus provides a foundation for further research into generative paradigms.
Potential future directions include:
- Exploring Hybrid Objectives: Investigating combinations of generative and contrastive losses for specialized tasks.
- Scaling to Diverse Modalities: Extending the generative approach to video, audio, and temporal data.
- Data-Centric Improvements: Further refining synthetic captioning strategies and leveraging multilingual or domain-specific corpora.
- Efficient Deployment: Adapting the architecture for resource-constrained environments and edge devices.
Conclusion
OpenVision 2 introduces a simplified, generative-only vision encoder pretraining paradigm that achieves competitive multimodal performance while substantially reducing computational cost and memory requirements. The work provides strong empirical evidence that caption-only objectives can rival contrastive methods, especially when paired with high-quality synthetic data and efficient training strategies. This paradigm shift opens new avenues for scalable, open, and efficient multimodal foundation models.