
Scaling Vision Pre-Training to 4K Resolution (2503.19903v1)

Published 25 Mar 2025 in cs.CV

Abstract: High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378 x 378 pixels) due to the quadratic cost of processing larger images. We introduce PS3 that scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of contrastive learning on global image representation, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 is able to both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or relevance to a text prompt. When applying PS3 to multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training such as AnyRes and S2 while using up to 4.3x fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to state of the arts, VILA-HD outperforms previous MLLMs such as NVILA and Qwen2-VL across multiple benchmarks and achieves better efficiency than latest token pruning approaches. Finally, we find current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 14.5% improvement over GPT-4o, and a 3.2% improvement and 2.96x speedup over Qwen2-VL.

Summary

Insights into Scaling Vision Pre-Training to 4K Resolution

The paper introduces PS3 (Pre-training with Scale-Selective Scaling), an approach that extends CLIP-style vision pre-training to 4K-resolution images at near-constant cost. It addresses the computational bottleneck of processing large images, whose cost grows at least quadratically with resolution and is dominated in ViTs by self-attention over ever-larger token grids, a constraint that has historically limited vision pre-training to much lower resolutions such as 378x378 pixels.

Key Contributions

The PS3 method leverages a localized approach to contrastive learning by focusing on salient regions of an image and aligning these to detailed captions, bypassing the expensive task of processing the entirety of high-resolution images. By doing so, PS3 significantly reduces the computational overhead required to pre-train vision models at 4K resolution, achieving up to a 79x efficiency gain relative to traditional global contrastive approaches like SigLIP.
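
To make this concrete, the following is a minimal sketch of a CLIP-style contrastive objective applied to (local region, local caption) pairs rather than whole images, which is how we read the localized pre-training described above. The function and tensor names are illustrative assumptions and do not reflect the released PS3 code.

# Minimal sketch of region-level contrastive pre-training (illustrative only;
# names and shapes are assumptions, not the released PS3 implementation).
import torch
import torch.nn.functional as F

def local_contrastive_loss(region_embeds: torch.Tensor,
                           caption_embeds: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style InfoNCE over (local region, local caption) pairs.

    region_embeds:  (N, D) embeddings of selected high-res crops.
    caption_embeds: (N, D) embeddings of the matching detailed captions.
    """
    region_embeds = F.normalize(region_embeds, dim=-1)
    caption_embeds = F.normalize(caption_embeds, dim=-1)
    logits = region_embeds @ caption_embeds.t() / temperature  # (N, N)
    targets = torch.arange(region_embeds.size(0), device=logits.device)
    # Symmetric loss: region-to-caption and caption-to-region.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random features standing in for encoder outputs.
regions = torch.randn(8, 512)   # e.g., crops pulled from a 4K image
captions = torch.randn(8, 512)  # their paired local captions
print(local_contrastive_loss(regions, captions).item())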

This selective processing enables PS3 both to encode the entire image at low resolution and to process selected local regions at high resolution, with regions chosen either top-down (guided by a text prompt) or bottom-up (based on visual saliency). When incorporated into multi-modal LLMs (MLLMs), the resulting model, VILA-HD, exhibits marked improvements in handling high-resolution visual information, outperforming baselines without high-resolution vision pre-training such as AnyRes and S² while requiring up to 4.3x fewer tokens.
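
A simplified sketch of how such region selection might work is given below: candidate patches are scored either by similarity to a text prompt (top-down) or by a saliency score (bottom-up), and the top-k are processed at high resolution. The scoring scheme and names here are our own assumptions, not the paper's actual selection module.

# Illustrative sketch of high-res patch selection, either bottom-up (saliency)
# or top-down (relevance to a text prompt). A simplification under our own
# assumptions, not the PS3 selection mechanism itself.
from typing import Optional
import torch

def select_patches(patch_feats: torch.Tensor,
                   k: int,
                   text_feat: Optional[torch.Tensor] = None,
                   saliency: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Return indices of the k patches to process at high resolution.

    patch_feats: (P, D) low-res features for P candidate patches.
    text_feat:   (D,) prompt embedding for top-down selection, or None.
    saliency:    (P,) saliency scores for bottom-up selection, or None.
    """
    if text_feat is not None:
        # Top-down: score patches by similarity to the prompt.
        scores = patch_feats @ text_feat
    elif saliency is not None:
        # Bottom-up: use saliency alone.
        scores = saliency
    else:
        raise ValueError("Need either a text prompt or saliency scores.")
    return torch.topk(scores, k=min(k, patch_feats.size(0))).indices

# Toy usage: pick 4 of 64 candidate patches given a prompt embedding.
feats = torch.randn(64, 512)
prompt = torch.randn(512)
print(select_patches(feats, k=4, text_feat=prompt))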

Empirical Performance

Empirical evaluations reveal that VILA-HD, equipped with a PS3 backbone, achieves superior performance and efficiency across multiple benchmarks compared to other MLLMs, including NVILA and Qwen2-VL. Particularly noteworthy is the development of a new benchmark, 4KPro, tailored for assessing models' competence in 4K-resolution image question answering. On 4KPro, VILA-HD demonstrates a substantial performance leap, improving by 14.5% over GPT-4o and achieving a 3.2% accuracy increase with nearly a threefold speedup compared to Qwen2-VL.

Theoretical and Practical Implications

The proposed method exhibits several appealing scaling properties. Because the number of selected high-resolution patches is held constant, input resolution can be scaled up with essentially no additional compute, offering a path toward more scalable and efficient high-resolution vision processing. The approach also allows trading test-time compute for performance by selecting more high-resolution patches, giving practitioners flexibility under different resource constraints.
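
The back-of-the-envelope sketch below illustrates the near-constant cost: the token count the MLLM sees is a fixed low-resolution grid plus a fixed budget of selected high-resolution patches, independent of the input resolution, whereas dense processing grows with the full pixel count. The specific patch size and token budget are illustrative assumptions rather than numbers from the paper.

# Illustrative arithmetic for the "scale resolution for free" property.
# Patch size, low-res size, and patch budget are assumed values, not
# measurements from the paper.

def tokens_processed(image_side: int,
                     patch_side: int = 14,
                     low_res_side: int = 378,
                     selected_hi_res_patches: int = 729) -> int:
    """Tokens = full low-res grid + a fixed budget of selected high-res patches.

    Note that image_side intentionally does not affect the count: the
    high-res portion is capped at a constant budget regardless of resolution.
    """
    low_res_tokens = (low_res_side // patch_side) ** 2
    return low_res_tokens + selected_hi_res_patches

for side in (756, 1512, 3024, 4032):  # up to roughly 4K
    dense = (side // 14) ** 2  # tokens needed if the full image were processed densely
    print(f"{side}px: selective={tokens_processed(side)} tokens, dense={dense}")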

From a theoretical standpoint, PS3 challenges the conventional paradigm of holistic image processing at higher resolutions, advocating an intelligent, region-focused approach instead. Practically, this selective mechanism opens up new possibilities for integrating high-resolution visual processing into real-world applications where computational resources are at a premium, such as autonomous vehicles or gaming environments.

Future Developments

Although PS3 marks a significant stride in high-resolution vision processing, its reliance on pre-computed bounding boxes and local captions in the pre-training data makes it a first-generation solution. Future research could explore how self-supervised learning mechanisms might reduce the need for such labeled data, aligning high-resolution perception with the evolving demands of artificial intelligence.

By addressing these challenges within the PS3 framework and exploring integration with additional modalities such as video, the method may advance further toward a universal approach to scalable vision pre-training. The versatility of PS3 across domains, particularly those demanding precision at high resolutions, positions it to play a substantial role in the evolution of AI vision systems.
