This paper introduces FLAIR (Fine-grained Language-informed Image Representations), a vision-language model (VLM) designed to overcome the limitations of models like CLIP in understanding fine-grained visual details. While CLIP excels at aligning images and text globally, it often fails to capture localized information, which hurts performance on tasks requiring detailed understanding.
Problem Addressed:
Standard VLMs like CLIP learn coarse-grained alignments between entire images and text captions. This global matching approach leads to a loss of local details, making it difficult to distinguish or localize specific objects, attributes, or regions mentioned in a text query (as illustrated in Figure 1).
FLAIR's Approach:
FLAIR aims to learn image representations that are sensitive to fine-grained textual descriptions. Key components include:
- Leveraging Detailed Captions: FLAIR utilizes datasets with long, descriptive captions (often synthetically generated by Multimodal LLMs - MLLMs). These captions provide rich semantic information about various aspects of an image, including specific objects, attributes, and spatial relationships.
- Diverse Caption Sampling: Instead of using the full long caption as a single unit, FLAIR samples multiple sub-captions for each image. Each sub-caption merges a varying number of sentences (e.g., 1 to 3), taken either consecutively or at random from the original long caption. This yields a mix of local (short sub-captions focusing on details) and global (longer sub-captions describing more of the scene) textual descriptions for each image; a sampling sketch follows this list.
- Text-Conditioned Attention Pooling: This is the core architectural innovation. FLAIR introduces an attention pooling mechanism in which the global text embedding $\mathbf{t}^{\text{g}}$ of a sub-caption acts as the query that pools information from the local image patch tokens. This produces a text-conditioned image representation $\mathbf{v}^{\text{tc}}$ tailored to the semantics of the input text query (see the pooling sketch after this list).
- In its single-head form, the pooling is $\mathbf{v}^{\text{tc}} = \mathrm{softmax}\!\left(\mathbf{t}^{\text{g}} \mathbf{V}^{\top} / \sqrt{d}\right) \mathbf{V}$, where $\mathbf{V}$ stacks the local image tokens and $d$ is the embedding dimension. In practice a multi-head attention layer is used, and a learnable empty token is added to allow the text query to attend to nothing if it is irrelevant to the image.
- Careful Negative Pair Selection: For the contrastive loss involving the text-conditioned image embeddings, FLAIR defines positive pairs as $\langle \mathbf{v}^{\text{tc}}_{i,i_k}, \mathbf{t}^{\text{g}}_{i_k} \rangle$ (image $i$, conditioned on its own $k$-th caption, compared with that same caption). Crucially, it defines negative pairs as $\langle \mathbf{v}^{\text{tc}}_{i,j_k}, \mathbf{t}^{\text{g}}_{j_k} \rangle$ (image $i$, conditioned on the $k$-th caption of a different image $j$, compared with that same caption). This avoids a shortcut in which the model could simply match the conditioning text against the comparison text, forcing it to use image information instead.
- Combined Loss Function: FLAIR uses a combination of two sigmoid-based contrastive losses (inspired by SigLIP, which handles multiple positives well):
- Text-Conditioned Sigmoid Loss ($\mathcal{L}_{\text{tc}}$): Aligns the text-conditioned image embedding $\mathbf{v}^{\text{tc}}$ with the corresponding text embedding $\mathbf{t}^{\text{g}}$. This loss drives the fine-grained alignment.
- Multi-Positive Sigmoid Loss ($\mathcal{L}_{\text{mp}}$): Aligns the global image embedding $\mathbf{v}^{\text{g}}$ with the embeddings of all its sampled sub-captions. This helps maintain coarse-grained understanding.
- The final loss is the sum of the two: $\mathcal{L} = \mathcal{L}_{\text{tc}} + \mathcal{L}_{\text{mp}}$.
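The sampling strategy can be illustrated with a short sketch. This is not the authors' code; the number of sub-captions, the sentence-count range, and the even split between consecutive and random sampling are illustrative assumptions.

```python
import random

def sample_sub_captions(long_caption, num_subcaptions=8, max_sentences=3):
    """Illustrative diverse caption sampling (all parameters are assumptions):
    draw several sub-captions from one long caption, each merging a varying
    number of sentences taken either consecutively or at random."""
    sentences = [s.strip() for s in long_caption.split(".") if s.strip()]
    sub_captions = []
    for _ in range(num_subcaptions):
        k = random.randint(1, min(max_sentences, len(sentences)))
        if random.random() < 0.5:
            # consecutive sentences: a contiguous span of the description
            start = random.randint(0, len(sentences) - k)
            chosen = sentences[start:start + k]
        else:
            # randomly chosen sentences from anywhere in the caption
            chosen = random.sample(sentences, k)
        sub_captions.append(". ".join(chosen) + ".")
    return sub_captions
```

To make the pooling and loss structure concrete, below is a minimal PyTorch-style sketch, not the released implementation: the embedding dimension, head count, batching scheme, and the learnable temperature/bias handling are assumptions. It shows the text-conditioned attention pooling with the learnable empty token, the text-conditioned sigmoid loss with the negative pairs described above, and the multi-positive sigmoid loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedPooling(nn.Module):
    """Sketch of text-conditioned attention pooling: the global text embedding
    queries the local image patch tokens plus a learnable 'empty' token."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.empty_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, text_global, image_tokens):
        # text_global:  (M, d)    global text embeddings used as queries
        # image_tokens: (M, N, d) local patch tokens acting as keys/values
        M = text_global.shape[0]
        kv = torch.cat([image_tokens, self.empty_token.expand(M, -1, -1)], dim=1)
        out, _ = self.attn(text_global.unsqueeze(1), kv, kv)  # (M, 1, d)
        return out.squeeze(1)                                 # v_tc: (M, d)


def flair_losses(image_tokens, v_global, t_global, pool, logit_scale, logit_bias):
    """
    image_tokens: (B, N, d) local patch tokens per image
    v_global:     (B, d)    global image embeddings
    t_global:     (B, K, d) embeddings of the K sampled sub-captions per image
    Returns the sum of the text-conditioned and multi-positive sigmoid losses.
    """
    B, K, d = t_global.shape
    captions = F.normalize(t_global.reshape(B * K, d), dim=-1)

    # SigLIP-style labels: +1 if caption m belongs to image i, else -1.
    labels = -torch.ones(B, B * K, device=captions.device)
    for i in range(B):
        labels[i, i * K:(i + 1) * K] = 1.0

    # Text-conditioned loss: condition every image on every caption in the batch,
    # then compare the pooled embedding with that same caption (loop kept for clarity).
    logits_tc = torch.stack([
        (F.normalize(pool(captions,
                          image_tokens[i].unsqueeze(0).expand(B * K, -1, -1)),
                     dim=-1) * captions).sum(-1)
        for i in range(B)
    ])                                                         # (B, B*K)
    loss_tc = -F.logsigmoid(labels * (logit_scale * logits_tc + logit_bias)).mean()

    # Multi-positive loss: global image embedding vs. all sub-captions in the batch.
    logits_mp = F.normalize(v_global, dim=-1) @ captions.T     # (B, B*K)
    loss_mp = -F.logsigmoid(labels * (logit_scale * logits_mp + logit_bias)).mean()

    return loss_tc + loss_mp
```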
Implementation Details:
- Base Model: Built upon OpenCLIP, using a ViT-B/16 image encoder and a standard text transformer encoder (77-token limit).
- Training Data: Pre-trained primarily on re-captioned datasets from DreamLIP (CC3M-recap, CC12M-recap, YFCC15M-recap, totaling 30M pairs). Ablations also show effectiveness on original CC3M and PixelProse datasets.
- Training Setup: Used the AdamW optimizer with a cosine-decay learning-rate schedule and batch sizes from 1k to 6k. Several sub-captions are sampled per image, each formed by randomly merging a small number of sentences from the long caption.
- Inference: For retrieval, the image embedding is conditioned on each candidate caption before computing similarity (see the sketch below). For segmentation, similarity is computed between the local image tokens and the class-name embeddings, either directly (FLAIR-CLIP style) or via the text-conditioned embeddings (FLAIR-TC style).
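As a concrete illustration of the retrieval protocol, here is a hedged sketch that reuses the TextConditionedPooling module from the earlier sketch; the function and argument names are placeholders, not the authors' API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def text_conditioned_retrieval_scores(image_tokens, caption_embs, pool):
    """Score every (image, caption) pair: condition each image's local tokens on
    the candidate caption, then compare the pooled embedding with that caption.
    image_tokens: (B_img, N, d); caption_embs: (B_txt, d); pool: TextConditionedPooling."""
    B_img, B_txt = image_tokens.shape[0], caption_embs.shape[0]
    t = F.normalize(caption_embs, dim=-1)
    scores = torch.empty(B_img, B_txt, device=caption_embs.device)
    for i in range(B_img):
        v_tc = pool(t, image_tokens[i].unsqueeze(0).expand(B_txt, -1, -1))
        scores[i] = (F.normalize(v_tc, dim=-1) * t).sum(-1)
    return scores  # (B_img, B_txt); rank per image (I2T) or per caption (T2I)
```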
Evaluation and Key Results:
FLAIR was evaluated on several tasks against CLIP, SigLIP, DreamLIP (trained on the same 30M data), and SOTA models trained on billions of pairs (OpenCLIP-2B, SigLIP-10B, etc.).
- Standard Retrieval (MSCOCO, Flickr30k): FLAIR significantly outperforms baselines trained on the same data (e.g., +7.9% T2I R@1 on COCO vs. DreamLIP) and surpasses or matches models trained on billions of pairs (e.g., FLAIR-30M beats SigLIP-10B).
- Fine-grained Retrieval (DOCCI-FG, IIW-FG - new benchmarks): FLAIR excels, demonstrating superior ability to match images with captions describing specific details. It outperforms DreamLIP-30M by 3.4%-7.8% R@1 and even beats SigLIP-10B.
- Long Retrieval (DCI, SV-1k/10k, Urban-1k): Despite using a standard 77-token text encoder, FLAIR-30M outperforms specialized long-caption models like Long-CLIP and LoTLIP (trained on 100M-400M data) on several benchmarks (e.g., +10.4% T2I R@1 on SV-1k vs LoTLIP). This is attributed to the text-conditioned pooling adapting to rich semantics and the diverse caption sampling strategy.
- Zero-Shot Semantic Segmentation: FLAIR shows massive improvements, boosting average mIoU by 14.4% over strong baselines like OpenCLIP-2B across multiple datasets (VOC20, ADE20k, etc.). This directly validates the improved localization of its local image tokens.
- Zero-Shot Classification: FLAIR performs comparably to DreamLIP when trained on 30M data but significantly lags behind models trained on billions of images. This suggests that while FLAIR's approach enhances fine-grained understanding, large data scale is still crucial for broad concept coverage needed in classification.
- Qualitative Analysis: Visualizations of the attention weights from the text-conditioned pooling show that FLAIR correctly focuses on the image regions corresponding to the specific objects or attributes mentioned in the text query (attention-map figures in the main paper and appendix). Token-similarity maps likewise confirm better localization than previous methods (teaser and appendix token-similarity figures).
Practical Implications:
- Provides a method to significantly improve the fine-grained understanding and localization capabilities of VLMs without needing billions of training samples, leveraging detailed synthetic captions instead.
- The text-conditioned attention pooling is a novel mechanism for dynamically adapting image representations based on textual context.
- Demonstrates strong performance gains in retrieval (especially fine-grained and long-text) and zero-shot segmentation, tasks directly benefiting from detailed alignment.
- The diverse caption sampling strategy is shown to be effective for handling both short and long text queries.
Limitations:
- Performance on global tasks like zero-shot classification still depends heavily on the scale of the image dataset, not just the quality of captions.
- Relies on the availability of long, descriptive captions, which often require generation using powerful MLLMs. However, ablations show some benefit even with standard captions.
In summary, FLAIR presents a practical and effective approach to infuse VLMs with fine-grained understanding by using detailed captions, diverse sampling, and a text-conditioned attention mechanism, achieving state-of-the-art results on several fine-grained tasks with significantly less data than previous large-scale models.