- The paper introduces BiVLC, a dataset extending evaluations to both image-to-text and text-to-image retrieval.
- It reveals that current multimodal models perform markedly worse on text-to-image retrieval than on image-to-text retrieval, and fall well short of human accuracy.
- Incorporating synthetic hard negative images in contrastive training improves performance on both SugarCrepe and BiVLC benchmarks.
BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval
Overview
The paper presents BiVLC, a novel benchmark designed to evaluate Vision-Language Compositionality (VLC) through bidirectional retrieval, covering both the image-to-text (I2T) and text-to-image (T2I) directions. Traditional VLC benchmarks such as SugarCrepe have primarily focused on the I2T retrieval challenge, where, given an image, a model must select the correct caption over hard negative distractors. BiVLC extends this setup by pairing each hard negative caption with a synthetic hard negative image generated from it, enabling a thorough evaluation of models in both the I2T and T2I directions.
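To make the construction concrete, the sketch below shows one way a hard negative image could be rendered from a hard negative caption with an off-the-shelf text-to-image model. The choice of the `diffusers` library and the `stabilityai/stable-diffusion-xl-base-1.0` checkpoint is an assumption for illustration, not a claim about the authors' exact generation pipeline.

```python
# Sketch: render a hard negative caption into a candidate hard negative image.
# Assumes the Hugging Face `diffusers` library and an SDXL checkpoint; the
# paper's actual generation and filtering setup may differ.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def generate_hard_negative(negative_caption: str, seed: int = 0):
    """Generate one image candidate for a hard negative caption."""
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt=negative_caption, generator=generator).images[0]
    return image  # candidates are later filtered by human annotators
```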
Human annotators validate the generated images, eliminating ill-formed examples and ensuring high quality. The dataset reveals significant performance disparities in current multimodal models, particularly poor performance in the T2I direction. The experiments demonstrate that including synthetic hard negative images in contrastive training improves performance on both the traditional SugarCrepe benchmark and the proposed BiVLC benchmark, although a notable gap to human performance remains, indicating that VLC is still an open challenge.
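The following sketch illustrates how a CLIP-style model can be evaluated in both directions on a single instance consisting of a positive image, a negative image, a positive caption, and a negative caption. The Hugging Face `transformers` CLIP checkpoint is an assumed stand-in for the models evaluated in the paper, and the per-instance decision rules are illustrative rather than the paper's exact protocol.

```python
# Sketch: score one BiVLC-style instance (image+, image-, caption+, caption-)
# in both retrieval directions with a CLIP-style dual encoder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_instance(pos_image: Image.Image, neg_image: Image.Image,
                   pos_caption: str, neg_caption: str) -> dict:
    inputs = processor(text=[pos_caption, neg_caption],
                       images=[pos_image, neg_image],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sim = model(**inputs).logits_per_image  # shape (2 images, 2 captions)
    # I2T: each image must prefer its own caption over the distractor.
    i2t = bool(sim[0, 0] > sim[0, 1]) and bool(sim[1, 1] > sim[1, 0])
    # T2I: each caption must prefer its own image over the distractor.
    t2i = bool(sim[0, 0] > sim[1, 0]) and bool(sim[1, 1] > sim[0, 1])
    return {"i2t": i2t, "t2i": t2i, "group": i2t and t2i}
```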
Experimental Contributions
The paper makes several contributions:
- The introduction of the BiVLC dataset, which extends SugarCrepe by adding negative images, thereby supporting both I2T and T2I retrieval directions.
- Validation through human annotation to filter out invalid or ambiguous instances, ensuring dataset integrity.
- Performance analyses showing that current models underperform in the T2I direction, an imbalance not observed in humans, who score similarly in both directions.
- Contrastive training with synthetic hard negative texts and images, which improves the state of the art on both the SugarCrepe and BiVLC benchmarks (see the training sketch after this list).
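The last contribution relies on contrastive training with explicit hard negatives. Below is a minimal sketch of one way to fold a hard negative image and a hard negative caption into a symmetric CLIP-style InfoNCE loss; the batching scheme, function name, and loss form are illustrative assumptions, not the authors' exact training recipe.

```python
# Sketch: symmetric contrastive loss with explicit hard negatives.
# img/txt are L2-normalised embeddings of shape (B, dim); hard_img/hard_txt
# hold one synthetic hard negative embedding per positive pair.
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(img, txt, hard_img, hard_txt,
                                         temperature=0.07):
    targets = torch.arange(img.size(0), device=img.device)

    # Image-to-text: positives on the diagonal, in-batch negatives, plus each
    # image's own hard negative caption appended as an extra logit column.
    logits_i2t = img @ txt.t() / temperature                              # (B, B)
    extra_i2t = (img * hard_txt).sum(dim=-1, keepdim=True) / temperature  # (B, 1)
    loss_i2t = F.cross_entropy(torch.cat([logits_i2t, extra_i2t], dim=1), targets)

    # Text-to-image: symmetric, with hard negative images appended.
    logits_t2i = txt @ img.t() / temperature
    extra_t2i = (txt * hard_img).sum(dim=-1, keepdim=True) / temperature
    loss_t2i = F.cross_entropy(torch.cat([logits_t2i, extra_t2i], dim=1), targets)

    return 0.5 * (loss_i2t + loss_t2i)
```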
The research identifies several noteworthy findings:
- Humans perform comparably in both I2T and T2I tasks, whereas current multimodal models show significantly worse performance in T2I retrieval.
- Bidirectional VLC is confirmed to be more difficult than I2T retrieval alone (see the metric sketch after this list).
- There is a lack of correlation between model performance on SugarCrepe and BiVLC, challenging the assumption that optimizing for unidirectional VLC generalizes to bidirectional scenarios.
- Training with hard negative images significantly boosts model performance, with the CLIP_TROHN-Img model showing particularly strong results in BiVLC.
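To make the comparison concrete, here is a small sketch that aggregates per-instance decisions (such as those returned by the `score_instance` sketch above) into I2T, T2I, and joint "group" accuracies; the exact metric definitions used in the paper may differ in detail. Because the group score requires an instance to be solved in both directions, it can never exceed either directional accuracy, which is one way to see why the bidirectional task is at least as hard as I2T retrieval alone.

```python
# Sketch: aggregate per-instance results into I2T, T2I, and group accuracies.
# Each result is a dict with boolean fields "i2t", "t2i", and "group".
from typing import Iterable

def aggregate(results: Iterable[dict]) -> dict:
    results = list(results)
    n = len(results)
    return {
        "i2t_acc": sum(r["i2t"] for r in results) / n,
        "t2i_acc": sum(r["t2i"] for r in results) / n,
        # group_acc <= min(i2t_acc, t2i_acc), since "group" requires both.
        "group_acc": sum(r["group"] for r in results) / n,
    }
```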
Practical and Theoretical Implications
The findings have both practical and theoretical implications:
- Practical: The creation of BiVLC sets a new standard for evaluating VLC, pushing the boundaries of how models are trained and tested in multimodal contexts. Models must now establish accurate image-text correspondences in both the I2T and T2I directions.
- Theoretical: The significant gap between model and human performance in T2I tasks points to uncharted territories in multimodal learning, suggesting that current model architectures and training paradigms need fundamental advancements.
Speculation on Future Developments in AI
Future work may look into several areas of improvement:
- Enhanced Generative Models: Improving the fidelity and diversity of synthetic image generation would likely reduce noise in training corpora such as TROHN-Img, the paper's synthetic image training set, enabling models to learn more robust visual-text correspondences.
- Bidirectional Training Approaches: Novel training techniques that simultaneously optimize for I2T and T2I tasks could bridge the performance gap seen in current models.
- Cross-lingual VLC: Extending benchmarks like BiVLC to multiple languages could provide a comprehensive understanding of VLC across different linguistic contexts.
BiVLC represents a significant step forward in multimodal research, challenging existing models and providing a robust framework for future innovations in Vision-Language Compositionality. The dataset, along with its foundational findings, lays the groundwork for advancing the capabilities of multimodal AI systems.