
OpenVision 2: Generative Visual Encoder

Updated 3 September 2025
  • OpenVision 2 is a generative visual encoder that simplifies multimodal learning by eliminating the traditional text encoder and contrastive loss.
  • Its architecture utilizes an autoregressive captioning loss and token masking to significantly reduce training time and memory usage while enabling larger backbones.
  • Benchmark evaluations on tasks like TextVQA demonstrate that OpenVision 2 achieves comparable or superior performance with improved efficiency over previous models.

OpenVision 2 is a family of generative pre-trained visual encoders for multimodal learning that introduces a significant simplification over previous architectures by eliminating the text encoder and associated contrastive loss. Designed for integration into vision-language systems and multimodal foundation models, OpenVision 2 relies exclusively on a captioning loss, yielding enhanced training efficiency and enabling scaling to substantially larger vision encoder backbones. Initial experiments demonstrate that despite this simplification, OpenVision 2 achieves performance that is on par with or surpasses the original OpenVision models across a suite of multimodal benchmarks, while substantially reducing both training time and memory requirements (Liu et al., 1 Sep 2025).

1. Architectural Simplification

OpenVision 2 deliberately removes the text encoder from the conventional two-tower architecture employed in vision-language pretraining frameworks such as CLIP and OpenVision. In prior vision-language models, the standard approach uses (i) a vision encoder that produces visual embeddings and (ii) a text encoder that produces textual embeddings, aligned through a contrastive loss. OpenVision 2 drops the text encoder and the contrastive objective, resulting in a two-module system: a vision encoder and a text decoder.

During pretraining, input images are tokenized by the vision encoder and the resulting visual tokens are fed into a text decoder responsible for generating captions. The sole objective is the autoregressive captioning (generative) loss

$$\mathcal{L}_{\text{caption}} = -\sum_{t} \log p(c_t \mid v, c_1, \dots, c_{t-1})$$

where $v$ is the set of visual tokens (possibly with random token masking for efficiency) and $c_t$ denotes the tokens of the target synthetic caption. This alignment with generative pretraining strategies directly mirrors modern multimodal approaches such as CapPa, AIMv2, and the LLaVA family.
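
A minimal PyTorch-style sketch of this objective is shown below. It assumes a `vision_encoder` that returns a sequence of visual tokens and a `text_decoder` that consumes them as a prefix; the names and interfaces are illustrative, not the released implementation.

```python
import torch.nn.functional as F

def captioning_loss(vision_encoder, text_decoder, images, caption_ids, pad_id=0):
    """Autoregressive captioning loss: visual tokens condition a text decoder
    that predicts each caption token from all preceding ones (teacher forcing)."""
    visual_tokens = vision_encoder(images)              # (B, N_vis, D)
    logits = text_decoder(prefix=visual_tokens,
                          tokens=caption_ids[:, :-1])   # (B, T-1, vocab)
    targets = caption_ids[:, 1:]                        # next-token targets
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1),
                           ignore_index=pad_id)
```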

2. Training Efficiency and Resource Requirements

The architectural modification leads to marked improvements in computational efficiency and scalability. Empirically, with a ViT-L/14 backbone at 224×224 resolution, OpenVision 2 reduces end-to-end training time from 83 hours to 57 hours—a 1.5× improvement—relative to the original OpenVision. Peak device memory drops from 24.5GB to 13.8GB (≈1.8× lower), supporting a fourfold increase in maximum batch size (from 2k to 8k).

At higher scales, such as with SoViT-400M/14 at 384×384 resolution, the wall-clock time drops from 241 hours to 121 hours and device memory reduces from 27.4GB to 14.5GB. These efficiency gains stem from the elimination of the contrastive pipeline and the implementation of masked visual token input, wherein two-thirds of encoder output tokens are randomly dropped before the caption decoder. This masking further decreases computation while regularizing the training procedure.
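
The token-masking step can be sketched in a few lines. The fraction kept below mirrors the reported two-thirds drop, but the exact sampling scheme is an assumption:

```python
import torch

def mask_visual_tokens(visual_tokens: torch.Tensor, keep_ratio: float = 1 / 3) -> torch.Tensor:
    """Randomly keep a subset of encoder output tokens before they reach the
    caption decoder; the remaining tokens are dropped for this training step."""
    batch, num_tokens, dim = visual_tokens.shape
    n_keep = max(1, int(num_tokens * keep_ratio))
    # Per-image random permutation of token indices; keep the first n_keep.
    keep_idx = torch.rand(batch, num_tokens, device=visual_tokens.device).argsort(dim=1)[:, :n_keep]
    return torch.gather(visual_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
```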

A consequence of the lower per-image FLOPs (e.g., from 271.75 to 208.90 for ViT-L/14 at 224 resolution) is that much larger vision encoder backbones become feasible. OpenVision 2 has been demonstrated with backbones exceeding one billion parameters (e.g., the g/14 variant with 1.01B parameters), substantially extending the scaling frontier.

3. Generative Training Paradigm

By relying exclusively on a captioning loss, OpenVision 2 adopts a strictly generative pretraining approach. The generative-only paradigm aligns pretraining and finetuning objectives for downstream multimodal tasks, notably reducing discrepancies that may arise from contrastive pretraining. This design is consistent with trends in recent vision-language models (e.g., CapPa, AIMv2, and autoregressive variants of LLaVA).

The generative loss is supported by high-quality synthetic captions, generated using the ReCap-DataComp-1B v2 pipeline, which provides more comprehensive supervision compared to noisy web captions historically used in vision-language pretraining. The model forgoes complex matching between image and textual modalities in the latent space, placing the burden of semantic grounding entirely on the generative modeling capacity of the encoder-decoder architecture.

This approach is not without trade-offs: discarding contrastive objectives may reduce the model’s explicit cross-modal alignment supervision, possibly affecting robustness in some zero-shot settings. However, OpenVision 2 demonstrates that, when paired with sufficiently informative text targets and architectural scalability, this is not a limiting factor for the majority of downstream benchmarks (Liu et al., 1 Sep 2025).

4. Benchmark Performance

OpenVision 2 is evaluated in integration with established multimodal frameworks such as LLaVA-1.5 and Open-LLaVA-Next, under both frozen-feature and fully end-to-end finetuning regimes. Benchmarks encompass TextVQA, ChartQA, OCR, MME, SEED, and SQA.

Reported results show that, for instance, with ViT-L/14 at 224×224 resolution, the model achieves 68.9 on TextVQA (versus 68.3 for the original OpenVision) and maintains parity on OCR (537 vs. 547), indicating negligible loss from the elimination of contrastive pretraining. Across a suite of eight to ten multimodal tasks, the performance is competitive with or superior to original OpenVision and other contemporary generative-only models such as CapPa and AIMv2.

Efficiency gains do not come at the cost of effectiveness; rather, the architecture enables investigation of larger-scale backbones and input resolutions without incurring prohibitive training times or device constraints.

5. Scalability and Model Variants

The computational and architectural efficiency of OpenVision 2 permits scaling to vision encoders in the 1B+ parameter range, a domain that was previously impractical with dual-tower contrastive approaches given hardware limitations. For example, the g/14 variant reaches 1.01B parameters while remaining within practical training budgets.

Training optimizations such as visual token masking and a per-stage curriculum (progressing from low to high input resolutions) further reinforce scalability. This design opens the possibility of future generative vision encoders surpassing current size and complexity limits, broadening their applicability from high-resource server deployment to integration into large multimodal models.
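
As an illustration of such a per-stage curriculum, the schedule below moves from low to high resolution while keeping the token-masking ratio fixed. The stage lengths, resolutions, and helper names are placeholders, not the paper's settings:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    resolution: int    # input image side length in pixels
    steps: int         # optimizer steps spent at this resolution
    keep_ratio: float  # fraction of visual tokens kept by masking

# Placeholder schedule: low-to-high resolution, reusing the same weights
# across stages; the paper's exact stage lengths are not reproduced here.
CURRICULUM = [
    Stage(resolution=112, steps=100_000, keep_ratio=1 / 3),
    Stage(resolution=224, steps=50_000, keep_ratio=1 / 3),
    Stage(resolution=384, steps=20_000, keep_ratio=1 / 3),
]

def train_stage(stage: Stage) -> None:
    """Hypothetical hook: rebuild the dataloader at stage.resolution and run
    stage.steps updates of the captioning loss with token masking applied."""
    ...

for stage in CURRICULUM:
    train_stage(stage)
```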

6. Implications for Multimodal Foundation Models

The generative-only paradigm exemplified by OpenVision 2 directly supports the trend of alignment between vision pretraining and downstream multimodal generative tasks. This paradigm reduces the mismatch between pretraining (caption generation) and downstream objectives (question answering, document interpretation, multi-turn dialog, etc.), which are themselves typically generative.

A plausible implication is that future vision encoder training for foundation models will continue to shift towards generative-only losses, leveraging large-scale synthetic text corpora for supervision and regularization strategies such as token masking for efficiency. This suggests greater homogeneity in training methods between pretraining and task-specific finetuning, and may reduce complexity in pipeline design and deployment.

7. Summary Table: Efficiency and Benchmark Comparison

| Model | Training Time (ViT-L/14 @ 224) | Peak Memory | TextVQA Score | Max Batch Size |
|---|---|---|---|---|
| OpenVision | 83 h | 24.5 GB | 68.3 | 2k |
| OpenVision 2 | 57 h | 13.8 GB | 68.9 | 8k |

The tabulated results illustrate the key empirical findings: OpenVision 2 achieves improved efficiency and comparable benchmark performance with larger practical batch sizes.


OpenVision 2 introduces and validates a generative-only visual encoder architecture that achieves state-of-the-art multimodal performance with dramatically improved training efficiency and scalability. Its simplifications and empirical validation position it as a compelling foundation for future research and deployment in vision-language foundation models (Liu et al., 1 Sep 2025).
