An Overview of PaliGemma: A Versatile Vision-Language Model
PaliGemma presents a noteworthy contribution to the field of vision-language models (VLMs). This model is an open, versatile 3 billion-parameter VLM designed for effective transfer to a broad spectrum of tasks. Developed by integrating the SigLIP-So400m vision encoder and the Gemma-2B language model, PaliGemma represents a convergence of advanced vision encoding and language modeling to achieve robust performance across diverse applications.
Architectural Foundation
PaliGemma builds upon the foundation laid by the PaLI series of vision-language models. It combines a SigLIP image encoder with a Gemma decoder-only language model. SigLIP-So400m, a shape-optimized ViT, serves as the vision encoder and benefits from large-scale contrastive pretraining. Language decoding is handled by the Gemma-2B model, which is pretrained but not instruction-tuned, ensuring flexible downstream adaptation. The two components are connected by a linear projection layer that maps SigLIP's output embeddings into Gemma's input embedding space.
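The interface between the two components is a single linear layer whose outputs are treated as additional input tokens for the language model. The sketch below illustrates this idea; the class name, exact dimensions, and token counts are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Illustrative sketch: project SigLIP patch embeddings into the
    language model's embedding space and prepend them to text tokens."""

    def __init__(self, vision_dim: int = 1152, text_dim: int = 2048):
        super().__init__()
        # Single linear projection bridging the two embedding spaces.
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, image_embeddings: torch.Tensor,
                text_embeddings: torch.Tensor) -> torch.Tensor:
        # image_embeddings: (batch, num_patches, vision_dim) from the vision encoder
        # text_embeddings:  (batch, num_text_tokens, text_dim) from the language model
        image_tokens = self.proj(image_embeddings)  # -> (batch, num_patches, text_dim)
        return torch.cat([image_tokens, text_embeddings], dim=1)

# Example: at 224px with patch size 14, the image yields 16 x 16 = 256 tokens.
connector = VisionLanguageConnector()
img = torch.randn(1, 256, 1152)
txt = torch.randn(1, 32, 2048)
print(connector(img, txt).shape)  # torch.Size([1, 288, 2048])
```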
Pretraining and Scaling
The pretraining strategy proceeds in stages, starting from publicly available unimodally pretrained checkpoints for the vision and language components. This is followed by extensive multimodal pretraining (Stage 1) on a carefully curated mixture of tasks. The model is initially trained at 224px resolution, and the resolution is then raised to 448px and 896px during short, focused pretraining stages (Stage 2). This methodology ensures that the model captures the fine-grained visual detail needed for high-resolution tasks.
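A common way to reuse a 224px checkpoint at a higher resolution is to interpolate the ViT's learned position embeddings to the larger patch grid. The sketch below shows that general technique under the stated patch sizes; it is an assumption for illustration, not necessarily the exact mechanism PaliGemma uses.

```python
import torch
import torch.nn.functional as F

def resize_position_embeddings(pos_emb: torch.Tensor,
                               old_grid: int, new_grid: int) -> torch.Tensor:
    """Bilinearly interpolate learned ViT position embeddings from an
    old_grid x old_grid patch layout to a new_grid x new_grid layout.

    pos_emb: (old_grid * old_grid, dim)
    returns: (new_grid * new_grid, dim)
    """
    dim = pos_emb.shape[-1]
    # (N, dim) -> (1, dim, old_grid, old_grid) so spatial interpolation applies.
    grid = pos_emb.reshape(old_grid, old_grid, dim).permute(2, 0, 1).unsqueeze(0)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bilinear", align_corners=False)
    return grid.squeeze(0).permute(1, 2, 0).reshape(new_grid * new_grid, dim)

# 224px -> 448px with patch size 14: the token grid grows from 16x16 to 32x32.
pos_224 = torch.randn(16 * 16, 1152)
pos_448 = resize_position_embeddings(pos_224, old_grid=16, new_grid=32)
print(pos_448.shape)  # torch.Size([1024, 1152])
```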
Key architectural decisions include a prefix-LM masking strategy: the image and prompt tokens (the prefix) attend to one another with full bidirectional attention, while the output suffix is generated auto-regressively under a causal mask. The model is also trained without freezing any components, which diverges from the common practice of keeping the image encoder frozen and enhances its ability to absorb broader visual and relational information during multimodal pretraining.
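Concretely, the prefix-LM setup can be expressed as an attention mask in which the prefix block is fully connected and the suffix rows are causal. The following sketch constructs such a mask; the function and token counts are illustrative assumptions.

```python
import torch

def prefix_lm_mask(prefix_len: int, suffix_len: int) -> torch.Tensor:
    """Boolean attention mask for a prefix-LM objective.

    True means 'may attend'. Prefix tokens (image + prompt) attend
    bidirectionally to each other; suffix (output) tokens attend to the
    whole prefix and causally to earlier suffix tokens.
    """
    total = prefix_len + suffix_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Prefix block: full bidirectional attention.
    mask[:prefix_len, :prefix_len] = True
    # Suffix rows: attend to the entire prefix...
    mask[prefix_len:, :prefix_len] = True
    # ...and causally within the suffix.
    mask[prefix_len:, prefix_len:] = torch.tril(
        torch.ones(suffix_len, suffix_len, dtype=torch.bool))
    return mask

# e.g. 256 image tokens + 8 prompt tokens as the prefix, 16 output tokens.
print(prefix_lm_mask(prefix_len=264, suffix_len=16).shape)  # torch.Size([280, 280])
```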
Empirical Evaluation
PaliGemma exhibits robust performance across nearly 40 diverse tasks, ranging from standard benchmarks like COCO Captions and VQAv2 to specialized domains such as remote sensing and video captioning. Strong numerical results underscore its efficacy: for instance, it achieves 141.9 CIDEr on COCOcap and 83.2% accuracy on VQAv2 at 224px resolution. Performance remains robust even with simplified hyper-parameters or limited transfer examples, demonstrating its value as a pragmatic base model for versatile applications.
Ablation Studies and Insights
Extensive ablation studies reveal several insights:
- Pretraining Duration: Longer multimodal pretraining (1 billion examples) ensures broad visual knowledge, while shorter durations significantly degrade performance.
- Training Objectives: A prefix-LM mask combined with supervision applied only to suffix tokens emerges as the best-performing setting (see the sketch after this list).
- Image Encoder Flexibility: While removing the image encoder in a Fuyu-style setup is less efficient in current configurations, it hints at a promising direction for future research.
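Supervising only the suffix tokens amounts to masking the prefix positions out of the loss so the cross-entropy objective applies only to the generated output. The sketch below shows one way to express that; names, shapes, and the approximate vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def suffix_only_loss(logits: torch.Tensor, targets: torch.Tensor,
                     prefix_len: int) -> torch.Tensor:
    """Cross-entropy loss computed only on suffix (output) tokens.

    logits:  (batch, seq_len, vocab_size)
    targets: (batch, seq_len) token ids; prefix positions are ignored.
    """
    batch, seq_len, vocab = logits.shape
    # Mark prefix positions with the ignore_index so they contribute no loss.
    masked_targets = targets.clone()
    masked_targets[:, :prefix_len] = -100
    return F.cross_entropy(logits.reshape(-1, vocab),
                           masked_targets.reshape(-1),
                           ignore_index=-100)

# Example: 264 prefix tokens (image + prompt) followed by 16 suffix tokens.
logits = torch.randn(2, 280, 256_000)   # Gemma's vocabulary is roughly 256k tokens
targets = torch.randint(0, 256_000, (2, 280))
print(suffix_only_loss(logits, targets, prefix_len=264))
```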
Implications and Future Directions
PaliGemma's development emphasizes creating a baseline VLM that is highly transferable rather than a zero-shot generalist. This approach enables the model to adapt efficiently to new tasks with minimal examples, broadening its utility in real-world applications such as robotics, autonomous navigation, and AI-driven data analysis in specialized fields. The model's ability to handle high-resolution images and complex visual queries points to practical implementations where detailed visual understanding is pivotal.
From a theoretical standpoint, PaliGemma offers an exemplar for integrating vision encoders with language models, challenging the common practice of freezing components during multimodal pretraining. Its use of task-specific fine-tuning and multi-resolution checkpoints opens future research avenues for optimizing VLM architectures tailored to specific applications.
In conclusion, PaliGemma represents a significant step in the design and scalability of vision-language models, offering both practical applications and theoretical advances. Its robust performance at a comparatively small parameter count positions it as a potent tool for future AI development in both academic research and industry applications.