Locality Alignment Improves Vision-Language Models (2410.11087v2)

Published 14 Oct 2024 in cs.CV

Abstract: Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors. We hypothesize that this is due to VLMs adopting pre-trained vision backbones, specifically vision transformers (ViTs) trained with image-level supervision and minimal inductive biases. Such models may fail to encode the class contents at each position in the image, and our goal is to resolve this with a vision backbone that effectively captures both local and global image semantics. Our main insight is that we do not require new supervision to learn this capability - pre-trained models contain significant knowledge of local semantics that we can extract and use for scalable self-supervision. We propose a new efficient post-training stage for ViTs called locality alignment and a novel fine-tuning procedure called MaskEmbed that uses a masked reconstruction loss to learn semantic contributions for each image patch. We first evaluate locality alignment with a vision-only benchmark, finding that it improves a model's performance at patch-level semantic segmentation, especially for strong backbones trained with image-caption pairs (e.g., CLIP and SigLIP). We then train a series of VLMs with and without locality alignment, and show that locality-aligned backbones improve performance across a range of benchmarks, particularly ones that involve spatial understanding (e.g., RefCOCO, OCID-Ref, TallyQA, VSR, AI2D). Overall, we demonstrate that we can efficiently learn local semantic extraction via a locality alignment stage, and that this procedure benefits VLM training recipes that use off-the-shelf vision backbones.

Summary

  • The paper introduces MaskEmbed as a post-training method to enhance local semantic extraction in vision-language models.
  • It applies a masked reconstruction loss to learn both local and global features, significantly improving spatial reasoning.
  • Experimental results on benchmarks like RefCOCO and TallyQA show notable gains, especially for models such as CLIP and SigLIP.

Locality Alignment Improves Vision-Language Models

The research presented in "Locality Alignment Improves Vision-Language Models" addresses a fundamental challenge in vision-language models (VLMs): enhancing spatial reasoning abilities. Current VLMs, despite their flexibility and widespread adoption, frequently encounter issues with spatial understanding, attributed to the limitations of their pre-trained Vision Transformer (ViT) backbones. These ViTs are typically trained with image-level supervision that inadequately captures local semantic details within images, which are crucial for spatial tasks.

Core Contributions and Methodology

The central contribution of the paper is the introduction of a locality alignment stage using a technique called MaskEmbed, which is positioned as a post-training step for ViTs. MaskEmbed utilizes a masked reconstruction loss to enable the extraction of both local and global semantics from images, without requiring additional annotated data. The approach leverages self-supervision to learn representations that disentangle local semantic information effectively, complementing existing VLM training pipelines.

The MaskEmbed process involves probing pre-trained models with various masked inputs to understand the local semantic contributions of each patch. By doing so, it refines the vision backbone, enabling improved semantic segmentation and spatial understanding in downstream tasks. The approach is computationally efficient, requiring less than 1% of the compute resources needed for the original pre-training of models like CLIP and SigLIP.
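A minimal PyTorch sketch of this masked-reconstruction objective is given below. It is an illustration under stated assumptions, not the authors' reference implementation: the teacher is assumed to be a frozen copy of the pre-trained ViT that returns per-patch tokens, the student is the fine-tuned copy whose patch embeddings are masked before a small transformer decoder reconstructs the teacher's masked-view output. Module names, shapes, and the pixel-masking helper are hypothetical.

```python
# Sketch of a MaskEmbed-style masked reconstruction objective (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskEmbedSketch(nn.Module):
    def __init__(self, teacher: nn.Module, student: nn.Module, embed_dim: int = 768):
        super().__init__()
        self.teacher = teacher.eval()              # frozen pre-trained ViT (assumed to return patch tokens)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.student = student                     # fine-tuned copy of the ViT
        # Lightweight decoder that reconstructs teacher outputs from masked patch embeddings.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, images: torch.Tensor, patch_mask: torch.Tensor) -> torch.Tensor:
        """images: (B, 3, H, W); patch_mask: (B, N) with 1 = keep, 0 = mask."""
        # Target: the frozen teacher's representation of the *masked* image.
        with torch.no_grad():
            target = self.teacher(self.apply_pixel_mask(images, patch_mask))   # (B, N, D)

        # Prediction: mask the student's patch embeddings of the *unmasked* image, then decode.
        patch_embeds = self.student(images)                                     # (B, N, D)
        masked_embeds = patch_embeds * patch_mask.unsqueeze(-1).float()
        pred = self.decoder(masked_embeds)                                      # (B, N, D)
        return F.mse_loss(pred, target)

    @staticmethod
    def apply_pixel_mask(images: torch.Tensor, patch_mask: torch.Tensor, patch_size: int = 16):
        # Upsample the patch-level mask to pixel resolution and zero out masked regions.
        B, _, H, W = images.shape
        grid = patch_mask.view(B, 1, H // patch_size, W // patch_size).float()
        grid = F.interpolate(grid, size=(H, W), mode="nearest")
        return images * grid
```

The key design choice in this sketch is that the mask is applied twice: at pixel level for the teacher's target and at embedding level for the student's prediction, which pushes each patch embedding to carry the semantics the teacher sees when only that region is visible.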

Experimental Validation

The paper presents extensive experiments using a vision-centric probing benchmark. These experiments demonstrate that locality-aligned ViTs substantially outperform their original, image-level supervised counterparts on patch-level semantic classification. Notably, the improvement is most pronounced for models such as CLIP and SigLIP, which are widely used as VLM backbones due to their strong baseline performance.
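The probing protocol can be pictured with the following hedged sketch: freeze the backbone, attach a shared linear head to every patch token, and train it against per-patch class labels (e.g., segmentation masks pooled to the patch grid). Function names, tensor shapes, and the data format are assumptions for illustration; the benchmark's exact protocol may differ.

```python
# Hedged sketch of patch-level semantic probing with a frozen vision backbone.
import torch
import torch.nn as nn

def probe_patch_semantics(backbone: nn.Module, loader, num_classes: int,
                          embed_dim: int = 768, epochs: int = 5, lr: float = 1e-3):
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)

    head = nn.Linear(embed_dim, num_classes)            # shared linear probe over patch tokens
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, patch_labels in loader:              # patch_labels: (B, N) long class ids
            with torch.no_grad():
                tokens = backbone(images)                 # (B, N, D) patch tokens (assumed output)
            logits = head(tokens)                         # (B, N, num_classes)
            loss = loss_fn(logits.flatten(0, 1), patch_labels.flatten())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```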

Further experiments involve training a series of VLMs with and without locality alignment, utilizing the Prismatic library. The findings indicate that locality alignment provides consistent improvements in benchmarks that require spatial comprehension, such as RefCOCO, OCID-Ref, and TallyQA. These enhancements are observed across different datasets and architectures, further validating the efficacy of the proposed method.
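For orientation, the sketch below shows the standard recipe by which a (locality-aligned) vision backbone plugs into a VLM: patch tokens are projected into the language model's embedding space and prepended to the text tokens. This is a generic illustration of the pattern that libraries such as Prismatic follow, not that library's actual API; all module names here are hypothetical, and the language model is assumed to accept precomputed input embeddings.

```python
# Generic vision-language wiring: vision backbone -> projector -> language model.
import torch
import torch.nn as nn

class SimpleVLM(nn.Module):
    def __init__(self, vision_backbone: nn.Module, language_model: nn.Module,
                 vision_dim: int = 768, lm_dim: int = 4096):
        super().__init__()
        self.vision_backbone = vision_backbone         # e.g., a locality-aligned SigLIP ViT
        # MLP projector that maps patch tokens into the language model's embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )
        self.language_model = language_model           # assumed to accept precomputed embeddings
                                                       # (e.g., HuggingFace-style `inputs_embeds`)

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        patch_tokens = self.vision_backbone(images)    # (B, N, vision_dim) patch tokens
        visual_embeds = self.projector(patch_tokens)   # (B, N, lm_dim)
        # Prepend projected visual tokens so the model attends over image and text jointly.
        inputs = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```

Because only the backbone changes, a locality-aligned encoder can be dropped into this recipe without touching the projector or the language model.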

Implications and Future Directions

The implications of this work are manifold. Practically, it suggests that enhancing the spatial reasoning capabilities of VLMs does not necessitate significant architectural modifications or data acquisition but can be achieved through efficient post-training refinement. Theoretically, it provides insights into how existing pre-trained models can be adapted to encode localized semantics more effectively, highlighting a gap in the current pre-training paradigms.

Looking forward, several promising avenues are suggested for future research. These include exploring locality alignment across a broader range of VLM architectures, experimenting with larger datasets for more diverse semantic extraction, and integrating this approach into the pre-training phases of even more advanced models. Additionally, combining locality alignment with other methods such as multi-crop features and higher image resolutions could yield further performance gains.

In conclusion, the research offers a valuable perspective on refining existing models to better handle spatial reasoning tasks, with implications that extend to various applications requiring nuanced image understanding. This work stands as a testament to the potential of leveraging existing knowledge within pre-trained models through innovative post-training strategies.