FLAIR: VLM with Fine-grained Language-informed Image Representations (2412.03561v1)

Published 4 Dec 2024 in cs.CV and cs.AI

Abstract: CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance on both existing multimodal retrieval benchmarks and our newly introduced fine-grained retrieval task, which evaluates vision-language models' ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information, including zero-shot semantic segmentation, outperforming models trained on billions of pairs. Code is available at https://github.com/ExplainableML/flair.

This paper introduces FLAIR (Fine-grained Language-informed Image Representations), a vision-language model (VLM) designed to overcome the limitations of models like CLIP in understanding fine-grained visual details. While CLIP excels at aligning images and text globally, it often struggles to capture localized information, hindering performance in tasks requiring detailed understanding.

Problem Addressed:

Standard VLMs like CLIP learn coarse-grained alignments between entire images and text captions. This global matching approach leads to a loss of local details, making it difficult to distinguish or localize specific objects, attributes, or regions mentioned in a text query (as illustrated in Figure 1).

FLAIR's Approach:

FLAIR aims to learn image representations that are sensitive to fine-grained textual descriptions. Key components include:

  1. Leveraging Detailed Captions: FLAIR utilizes datasets with long, descriptive captions (often synthetically generated by Multimodal LLMs - MLLMs). These captions provide rich semantic information about various aspects of an image, including specific objects, attributes, and spatial relationships.
  2. Diverse Caption Sampling: Instead of using the full long caption, FLAIR samples multiple ($K$) sub-captions for each image. Each sub-caption merges a varying number ($s$) of sentences (e.g., 1 to 3), taken either consecutively or randomly from the original long caption. This yields a mix of local (short sub-captions focusing on details) and global (longer sub-captions describing more of the scene) textual descriptions for each image; a sampling sketch is given after this list.
  3. Text-Conditioned Attention Pooling: This is the core architectural innovation. FLAIR introduces an attention pooling mechanism in which the global text embedding ($\mathbf{t}^{\text{g}}$) of a sub-caption acts as a query that pools information from the local image patch tokens ($\mathbf{v}^{\text{loc}}$). This produces a text-conditioned image representation ($\mathbf{v}^{\text{tc}}$) that is specifically tailored to the semantics of the input text query.
    • The pooling is defined as $\mathbf{v}^{\text{tc}} = f_{\text{AttnPool}}(\mathbf{t}^{\text{g}}, \mathbf{v}^{\text{loc}}) = \text{softmax}\!\left( \frac{\mathbf{t}^{\text{g}} W_q (\mathbf{v}^{\text{loc}} W_k)^T}{\sqrt{d}} \right) \mathbf{v}^{\text{loc}} W_v$. In practice a multi-head attention layer is used, and an empty token is added so that the text query can attend to nothing if it is irrelevant to the image (a sketch of this pooling follows the list).
  4. Careful Negative Pair Selection: For the contrastive loss involving the text-conditioned image embeddings, FLAIR defines positive pairs as $\langle \mathbf{v}^{\text{tc}}_{i,i_k}, \mathbf{t}^{\text{g}}_{i_k} \rangle$ (image $i$, conditioned on its own $k$-th caption, compared with that same caption). Crucially, it defines negative pairs as $\langle \mathbf{v}^{\text{tc}}_{i,j_k}, \mathbf{t}^{\text{g}}_{j_k} \rangle$ (image $i$, conditioned on the $k$-th caption of a different image $j$, compared with that same caption $j_k$). This avoids a shortcut where the model could simply match the conditioning text with the comparison text, forcing it to rely on image information.
  5. Combined Loss Function: FLAIR uses a combination of two sigmoid-based contrastive losses (inspired by SigLIP, which handles multiple positives well):
    • Text-Conditioned Sigmoid Loss ($\mathcal{L}^{\text{tcs}}$): Aligns the text-conditioned image embedding ($\mathbf{v}^{\text{tc}}_{i,j_k}$) with the corresponding text embedding ($\mathbf{t}^{\text{g}}_{j_k}$). This loss drives the fine-grained alignment.
    • Multi-Positive Sigmoid Loss ($\mathcal{L}^{\text{mps}}$): Aligns the global image embedding ($\mathbf{v}^{\text{g}}_i$) with all of its corresponding sampled sub-caption embeddings ($\mathbf{t}^{\text{g}}_{i_k}$). This helps maintain coarse-grained understanding.
    • The final loss is $\mathcal{L} = \frac{1}{2}(\mathcal{L}^{\text{tcs}} + \mathcal{L}^{\text{mps}})$; a sketch of the pairing logic follows the list.
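
To make the caption sampling (step 2) concrete, here is a minimal sketch. The helper name `sample_sub_captions`, the sentence splitting on periods, and the 50/50 choice between consecutive and random sampling are illustrative assumptions, not details from the released code.

```python
import random

def sample_sub_captions(long_caption: str, K: int = 8, S: int = 3,
                        consecutive_prob: float = 0.5):
    """Sample K sub-captions, each merging 1..S sentences of a long caption.

    Mixing consecutive runs and random subsets yields both local (short,
    detail-focused) and more global (multi-sentence) descriptions per image.
    Assumes the caption contains at least one sentence.
    """
    sentences = [s.strip() for s in long_caption.split(".") if s.strip()]
    sub_captions = []
    for _ in range(K):
        s = random.randint(1, min(S, len(sentences)))       # sentences to merge
        if random.random() < consecutive_prob:
            start = random.randint(0, len(sentences) - s)    # consecutive run
            chosen = sentences[start:start + s]
        else:
            chosen = random.sample(sentences, s)             # random subset
        sub_captions.append(". ".join(chosen) + ".")
    return sub_captions
```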
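
The text-conditioned attention pooling (step 3) can be sketched as a single cross-attention read-out. This simplified version uses one head and a zero-initialized empty token, whereas the paper uses a multi-head layer, so treat it as a reading of the formula above rather than the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedAttnPool(nn.Module):
    """Single-head sketch of f_AttnPool: the global text embedding queries
    the local image tokens to produce a text-conditioned image embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        # learnable "empty" token lets the text attend to nothing if irrelevant
        self.empty_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, t_global: torch.Tensor, v_local: torch.Tensor) -> torch.Tensor:
        # t_global: (B, D) global text embedding; v_local: (B, N, D) patch tokens
        B, N, D = v_local.shape
        v_local = torch.cat([v_local, self.empty_token.expand(B, -1, -1)], dim=1)
        q = self.w_q(t_global).unsqueeze(1)                 # (B, 1, D)
        k = self.w_k(v_local)                               # (B, N+1, D)
        v = self.w_v(v_local)                               # (B, N+1, D)
        attn = F.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)                        # (B, D) text-conditioned embedding
```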
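
The pairing logic of steps 4 and 5 can be sketched as follows. The tensor shapes, the `logit_scale`/`logit_bias` parameters, and the mean reduction are SigLIP-style assumptions and may differ from FLAIR's released training code.

```python
import torch
import torch.nn.functional as F

def flair_losses(v_tc, t_g, v_g, logit_scale=10.0, logit_bias=-10.0):
    """Sketch of the combined loss L = 0.5 * (L_tcs + L_mps).

    v_tc: (B, B, K, D)  embedding of image i conditioned on caption k of image j
    t_g:  (B, K, D)     global embeddings of the K sub-captions of each image
    v_g:  (B, D)        global image embeddings
    All embeddings are assumed to be L2-normalized.
    """
    B, _, K, _ = v_tc.shape
    # +1 on the diagonal (image i with its own captions), -1 everywhere else
    labels = (2.0 * torch.eye(B, device=v_tc.device).unsqueeze(-1) - 1.0).expand(B, B, K)

    # Text-conditioned sigmoid loss: v_tc[i, j, k] is scored against t_g[j, k],
    # i.e. the very caption it was conditioned on (a positive only if i == j).
    logits_tcs = (v_tc * t_g.unsqueeze(0)).sum(-1) * logit_scale + logit_bias
    loss_tcs = F.softplus(-labels * logits_tcs).mean()

    # Multi-positive sigmoid loss: the global image embedding is scored against
    # every sub-caption; all K captions of image i are positives for v_g[i].
    logits_mps = torch.einsum("id,jkd->ijk", v_g, t_g) * logit_scale + logit_bias
    loss_mps = F.softplus(-labels * logits_mps).mean()

    return 0.5 * (loss_tcs + loss_mps)
```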

Implementation Details:

  • Base Model: Built upon OpenCLIP, using a ViT-B/16 image encoder and a standard text transformer encoder (77-token limit).
  • Training Data: Pre-trained primarily on re-captioned datasets from DreamLIP (CC3M-recap, CC12M-recap, YFCC15M-recap, totaling 30M pairs). Ablations also show effectiveness on original CC3M and PixelProse datasets.
  • Training Setup: Used the AdamW optimizer, a cosine decay learning rate schedule, and batch sizes from 1k to 6k. Sampled $K=8$ sub-captions per image, each randomly merging $s=1$ to $S=3$ sentences.
  • Inference: For retrieval, image embeddings are conditioned on each candidate text before computing similarity (see the sketch after this list). For segmentation, similarity is computed between local image tokens ($\mathbf{v}^{\text{loc}}$) and class name embeddings ($\mathbf{t}^{\text{g}}$), either directly (FLAIR-CLIP style) or using the text-conditioned embeddings (FLAIR-TC style).
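
Below is a hedged sketch of the retrieval-time conditioning described above, reusing the `TextConditionedAttnPool` module from the earlier sketch. The loop over candidates and the cosine scoring are assumptions about how the conditioning is applied, not a verbatim reproduction of the repository code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def text_conditioned_retrieval(v_local, t_global, attn_pool):
    """Score every image against every candidate caption.

    v_local:  (N_img, N_tok, D) local image tokens from the vision encoder
    t_global: (N_txt, D)        global embeddings of the candidate captions
    attn_pool: a TextConditionedAttnPool module (see the earlier sketch)

    Returns an (N_img, N_txt) similarity matrix: each image is re-pooled
    conditioned on each candidate text before the similarity is computed.
    """
    n_img, n_txt = v_local.size(0), t_global.size(0)
    scores = torch.empty(n_img, n_txt)
    for j in range(n_txt):
        t = t_global[j].unsqueeze(0).expand(n_img, -1)   # broadcast caption j to all images
        v_tc = attn_pool(t, v_local)                     # (N_img, D) text-conditioned embeddings
        scores[:, j] = F.cosine_similarity(v_tc, t, dim=-1)
    return scores
```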

Evaluation and Key Results:

FLAIR was evaluated on several tasks against CLIP, SigLIP, DreamLIP (trained on the same 30M data), and SOTA models trained on billions of pairs (OpenCLIP-2B, SigLIP-10B, etc.).

  • Standard Retrieval (MSCOCO, Flickr30k): FLAIR significantly outperforms baselines trained on the same data (e.g., +7.9% T2I R@1 on COCO vs. DreamLIP) and surpasses or matches models trained on billions of pairs (e.g., FLAIR-30M beats SigLIP-10B).
  • Fine-grained Retrieval (DOCCI-FG, IIW-FG - new benchmarks): FLAIR excels, demonstrating superior ability to match images with captions describing specific details. It outperforms DreamLIP-30M by 3.4%-7.8% R@1 and even beats SigLIP-10B.
  • Long Retrieval (DCI, SV-1k/10k, Urban-1k): Despite using a standard 77-token text encoder, FLAIR-30M outperforms specialized long-caption models like Long-CLIP and LoTLIP (trained on 100M-400M data) on several benchmarks (e.g., +10.4% T2I R@1 on SV-1k vs LoTLIP). This is attributed to the text-conditioned pooling adapting to rich semantics and the diverse caption sampling strategy.
  • Zero-Shot Semantic Segmentation: FLAIR shows massive improvements, boosting average mIoU by 14.4% over strong baselines like OpenCLIP-2B across multiple datasets (VOC20, ADE20k, etc.). This directly validates the improved localization of its image tokens ($\mathbf{v}^{\text{loc}}$).
  • Zero-Shot Classification: FLAIR performs comparably to DreamLIP when trained on 30M data but significantly lags behind models trained on billions of images. This suggests that while FLAIR's approach enhances fine-grained understanding, large data scale is still crucial for broad concept coverage needed in classification.
  • Qualitative Analysis: Visualizations of the attention maps from $f_{\text{AttnPool}}(\cdot)$ show that FLAIR focuses on the image regions corresponding to the specific objects or attributes mentioned in the text query (see the attention-map figures in the paper). Token-wise similarity maps likewise confirm better localization than previous methods.

Practical Implications:

  • Provides a method to significantly improve the fine-grained understanding and localization capabilities of VLMs without needing billions of training samples, leveraging detailed synthetic captions instead.
  • The text-conditioned attention pooling is a novel mechanism for dynamically adapting image representations based on textual context.
  • Demonstrates strong performance gains in retrieval (especially fine-grained and long-text) and zero-shot segmentation, tasks directly benefiting from detailed alignment.
  • The diverse caption sampling strategy is shown to be effective for handling both short and long text queries.

Limitations:

  • Performance on global tasks like zero-shot classification still depends heavily on the scale of the image dataset, not just the quality of captions.
  • Relies on the availability of long, descriptive captions, which often require generation using powerful MLLMs. However, ablations show some benefit even with standard captions.

In summary, FLAIR presents a practical and effective approach to infuse VLMs with fine-grained understanding by using detailed captions, diverse sampling, and a text-conditioned attention mechanism, achieving state-of-the-art results on several fine-grained tasks with significantly less data than previous large-scale models.

Authors (5)
  1. Rui Xiao (18 papers)
  2. Sanghwan Kim (6 papers)
  3. Mariana-Iuliana Georgescu (27 papers)
  4. Zeynep Akata (144 papers)
  5. Stephan Alaniz (20 papers)