\modelname: Enhancing Vision Pretraining with Location-aware Capabilities
Introduction
Recent advances in vision pretraining have primarily focused on supervised learning over large-scale annotated datasets or on contrastive methods trained with web-mined image-text pairs. While effective for zero-shot transfer, these approaches often overlook the importance of spatial detail in visual understanding. To address this gap, we introduce \modelname{}, a novel visual pretraining method that integrates location-aware captioning into the pretraining phase, aiming for a comprehensive understanding that combines holistic image perception with precise location sensitivity. Our method uses a simple yet effective encoder-decoder architecture, enhancing the model's multitask learning capability with minimal computational overhead.
Location-aware Captioner Pretraining
\modelname{} integrates location information directly into visual representations by using an auto-regressive decoder for multitask pretraining. Specifically, it adds two location-aware tasks, referring expression prediction and grounded captioning, alongside standard image captioning. This approach leverages the multitask learning capability of decoders to process multiple location-conditioned tasks concurrently, allowing a shared encoder to learn refined visual representations that capture both global image understanding and detailed spatial awareness. By conditioning on task-specific prefixes, the model can flexibly switch between tasks, enriching the learned visual features with diverse contextual cues.
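To make the task conditioning concrete, the following is a minimal sketch of how the three location-conditioned tasks could be serialized into prefix/target sequences for the decoder. The prefix strings, the 1000-bin coordinate quantization, and the box-coordinate ordering are illustrative assumptions rather than the exact format used by \modelname{}.
\begin{verbatim}
# Illustrative sketch of multitask example construction for a
# location-aware captioning decoder. Prefix strings, the 1000-bin
# coordinate quantization, and the coordinate ordering are assumptions.

def quantize_box(box, num_bins=1000):
    """Map a normalized box (values in [0, 1]) to discrete location tokens."""
    return [f"<loc{int(round(v * (num_bins - 1)))}>" for v in box]

def make_example(task, caption, box=None):
    """Build (prefix, target) strings for one of the pretraining tasks."""
    if task == "cap":        # plain image captioning
        return "caption:", caption
    if task == "ref":        # referring expression: caption -> box
        return f"refer: {caption}", " ".join(quantize_box(box))
    if task == "grounded":   # grounded captioning: box -> caption
        return f"ground: {' '.join(quantize_box(box))}", caption
    raise ValueError(f"unknown task: {task}")

# The decoder is conditioned on `prefix` and trained to autoregressively
# predict `target`, with a single image encoder shared across all tasks.
prefix, target = make_example("ref", "a dog chasing a frisbee",
                              box=(0.12, 0.30, 0.85, 0.78))
print(prefix)   # refer: a dog chasing a frisbee
print(target)   # <loc120> <loc300> <loc849> <loc779>
\end{verbatim}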
Experimental Setup
Our experiments use a subset of the WebLI dataset, enhanced with pseudo-annotations for object locations generated by an OWL-ViT model. \modelname{} achieves superior performance on localization-sensitive tasks without compromising broader image understanding: it outperforms existing contrastive and generative pretraining methods on localization tasks while remaining competitive on holistic image understanding benchmarks, illustrating its capacity to interpret images at both general and detailed levels.
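For concreteness, the sketch below shows one way such pseudo box annotations could be produced with an off-the-shelf OWL-ViT checkpoint through the Hugging Face transformers interface. The checkpoint name, the text queries (e.g., phrases mined from alt-text), and the score threshold are illustrative assumptions, not the exact annotation pipeline used for our data.
\begin{verbatim}
# Hedged sketch: pseudo-labeling images with boxes via OWL-ViT using the
# Hugging Face `transformers` API. Checkpoint, queries, and threshold are
# illustrative, not the settings used to annotate WebLI.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def pseudo_annotate(image, queries, threshold=0.1):
    """Return (box, score, query) triples for detections above `threshold`."""
    inputs = processor(text=[queries], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes)[0]
    return [(box.tolist(), score.item(), queries[int(label)])
            for box, score, label in zip(results["boxes"],
                                         results["scores"],
                                         results["labels"])]

# Usage: queries could be phrases mined from the image's alt-text.
# annotations = pseudo_annotate(Image.open("example.jpg"),
#                               ["a dog", "a frisbee"])
\end{verbatim}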
Results and Discussion
\modelname{} outperforms standard captioning and contrastive pretraining models on tasks that demand precise spatial comprehension, such as referring expression comprehension and segmentation, showcasing its heightened sensitivity to object locations. At the same time, the model maintains robust performance on holistic image understanding tasks, reflecting the versatility of its visual representations. When coupled with a pretrained LLM, \modelname{} further extends to a range of vision-language tasks, consistently surpassing established baselines. These findings underscore the importance of incorporating location information into pretraining for achieving nuanced visual comprehension.
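As a rough illustration of this kind of interface, the sketch below projects patch features from a pretrained vision encoder into an LLM's token-embedding space and prepends them to the text embeddings. This is a generic multimodal wiring pattern, not the exact architecture we use, and the module names and dimensions are placeholders.
\begin{verbatim}
# Generic sketch: connecting a pretrained vision encoder to an LLM by
# linearly projecting patch features into the LLM embedding space and
# prepending them to the text embeddings. Names and dimensions are
# placeholders, not our configuration.
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features, text_embeddings):
        # patch_features:  (batch, num_patches, vision_dim) from the encoder
        # text_embeddings: (batch, num_text_tokens, llm_dim) from the LLM
        visual_tokens = self.proj(patch_features)
        return torch.cat([visual_tokens, text_embeddings], dim=1)

# Usage with dummy tensors standing in for real encoder/LLM outputs.
adapter = VisionToLLMAdapter()
fused = adapter(torch.randn(2, 256, 1024), torch.randn(2, 16, 4096))
print(fused.shape)  # torch.Size([2, 272, 4096])
\end{verbatim}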
Conclusion and Future Directions
By integrating location-aware pretraining tasks within an encoder-decoder framework, \modelname{} advances visual pretraining: it improves the model's understanding of spatial relationships in images and sets a new performance standard on location-sensitive tasks. Future work could extend \modelname{} to more complex spatial reasoning tasks or further improve its efficiency in multitask learning. Its success in balancing detailed spatial awareness with general image understanding holds promise for building more comprehensive and contextually aware AI systems.
Acknowledgments
This research was supported by Google DeepMind and benefited from discussions and insights from various experts in the field. The use of the {\tt big\_vision} codebase was instrumental in conducting the experiments.