\modelname: Enhancing Vision Pretraining with Location-aware Capabilities
Introduction
Recent advances in vision pretraining have primarily focused on supervised learning over large-scale annotated datasets or on contrastive methods trained with web-mined image-text pairs. While effective for zero-shot transfer, these approaches often overlook the importance of spatial detail in visual understanding. To address this gap, we introduce \modelname{}, a novel visual pretraining method that integrates location-aware captioning into the pretraining phase, aiming for a comprehensive understanding that combines holistic image perception with precise location sensitivity. Our method uses a simple yet effective encoder-decoder architecture, enhancing the model's multitask learning capability with minimal computational overhead.
Location-aware Captioner Pretraining
\modelname{} integrates location information directly into visual representations by using an auto-regressive decoder for multitask pretraining. Specifically, it adds two location-aware tasks, referring expression prediction and grounded captioning, alongside standard image captioning. This approach leverages the multitask learning capability of decoders to process multiple location-conditioned tasks concurrently, allowing a shared encoder to learn refined visual representations that capture both global image understanding and detailed spatial awareness. By conditioning on task-specific prefixes, the model can flexibly switch between tasks, enriching the learned visual features with diverse contextual cues.
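To make the task conditioning concrete, the following is a minimal sketch of how the three location-conditioned tasks could be serialized into prefix/target sequences for the decoder. The prefix strings, the 1000-bin coordinate quantization, and the box-coordinate ordering are illustrative assumptions rather than the exact format used by \modelname{}.
\begin{verbatim}
# Illustrative sketch of multitask example construction for a
# location-aware captioning decoder. Prefix strings, the 1000-bin
# coordinate quantization, and the coordinate ordering are assumptions.

def quantize_box(box, num_bins=1000):
    """Map a normalized box (values in [0, 1]) to discrete location tokens."""
    return [f"<loc{int(round(v * (num_bins - 1)))}>" for v in box]

def make_example(task, caption, box=None):
    """Build (prefix, target) strings for one of the pretraining tasks."""
    if task == "cap":        # plain image captioning
        return "caption:", caption
    if task == "ref":        # referring expression: caption -> box
        return f"refer: {caption}", " ".join(quantize_box(box))
    if task == "grounded":   # grounded captioning: box -> caption
        return f"ground: {' '.join(quantize_box(box))}", caption
    raise ValueError(f"unknown task: {task}")

# The decoder is conditioned on `prefix` and trained to autoregressively
# predict `target`, with a single image encoder shared across all tasks.
prefix, target = make_example("ref", "a dog chasing a frisbee",
                              box=(0.12, 0.30, 0.85, 0.78))
print(prefix)   # refer: a dog chasing a frisbee
print(target)   # <loc120> <loc300> <loc849> <loc779>
\end{verbatim}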
Experimental Setup
Our experiments use a subset of the WebLI dataset, enhanced with pseudo-annotations for object locations generated by an OWL-ViT model. \modelname{} achieves superior performance on localization-sensitive tasks without compromising broader image understanding: it outperforms existing contrastive and generative pretraining methods on localization tasks while remaining competitive on holistic image understanding benchmarks, illustrating its capacity to interpret images at both general and detailed levels.
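For concreteness, the sketch below shows one way such pseudo box annotations could be produced with an off-the-shelf OWL-ViT checkpoint through the Hugging Face transformers interface. The checkpoint name, the text queries (e.g., phrases mined from alt-text), and the score threshold are illustrative assumptions, not the exact annotation pipeline used for our data.
\begin{verbatim}
# Hedged sketch: pseudo-labeling images with boxes via OWL-ViT using the
# Hugging Face `transformers` API. Checkpoint, queries, and threshold are
# illustrative, not the settings used to annotate WebLI.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def pseudo_annotate(image, queries, threshold=0.1):
    """Return (box, score, query) triples for detections above `threshold`."""
    inputs = processor(text=[queries], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes)[0]
    return [(box.tolist(), score.item(), queries[int(label)])
            for box, score, label in zip(results["boxes"],
                                         results["scores"],
                                         results["labels"])]

# Usage: queries could be phrases mined from the image's alt-text.
# annotations = pseudo_annotate(Image.open("example.jpg"),
#                               ["a dog", "a frisbee"])
\end{verbatim}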
Results and Discussion
\modelname{} outperforms standard captioning and contrastive pretraining models on tasks that demand precise spatial comprehension, such as referring expression comprehension and segmentation, showcasing its heightened sensitivity to object locations. At the same time, the model maintains robust performance on holistic image understanding tasks, reflecting the versatility of its visual representations. When coupled with a pretrained LLM, \modelname{} further extends to a range of vision-language tasks, consistently surpassing established baselines. These findings underscore the importance of incorporating location information into pretraining for achieving nuanced visual comprehension.
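As a rough illustration of this kind of interface, the sketch below projects patch features from a pretrained vision encoder into an LLM's token-embedding space and prepends them to the text embeddings. This is a generic multimodal wiring pattern, not the exact architecture we use, and the module names and dimensions are placeholders.
\begin{verbatim}
# Generic sketch: connecting a pretrained vision encoder to an LLM by
# linearly projecting patch features into the LLM embedding space and
# prepending them to the text embeddings. Names and dimensions are
# placeholders, not our configuration.
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features, text_embeddings):
        # patch_features:  (batch, num_patches, vision_dim) from the encoder
        # text_embeddings: (batch, num_text_tokens, llm_dim) from the LLM
        visual_tokens = self.proj(patch_features)
        return torch.cat([visual_tokens, text_embeddings], dim=1)

# Usage with dummy tensors standing in for real encoder/LLM outputs.
adapter = VisionToLLMAdapter()
fused = adapter(torch.randn(2, 256, 1024), torch.randn(2, 16, 4096))
print(fused.shape)  # torch.Size([2, 272, 4096])
\end{verbatim}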
Conclusion and Future Directions
By integrating location-aware pretraining tasks within an encoder-decoder framework, \modelname{} advances visual pretraining: it improves the model's understanding of spatial relationships in images and sets a new performance standard on location-sensitive tasks. Future work could extend \modelname{} to more complex spatial reasoning tasks or further improve its efficiency in multitask learning. Its success in balancing detailed spatial awareness with general image understanding holds promise for building more comprehensive and contextually aware AI systems.
Acknowledgments
This research was supported by Google DeepMind and benefited from discussions and insights from various experts in the field. The use of the {\tt big\_vision} codebase was instrumental in conducting the experiments.