- The paper introduces BiVLC, a dataset extending evaluations to both image-to-text and text-to-image retrieval.
- It reveals that current multimodal models perform markedly worse on text-to-image retrieval than on image-to-text retrieval, and fall well short of human accuracy.
- Incorporating synthetic hard negative images in contrastive training improves performance on both SugarCrepe and BiVLC benchmarks.
BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval
Overview
The paper presents BiVLC, a novel benchmark designed to evaluate Vision-Language Compositionality (VLC) through bidirectional retrieval, covering both the image-to-text (I2T) and text-to-image (T2I) directions. Traditional VLC benchmarks such as SugarCrepe have primarily focused on the I2T retrieval challenge, where, given an image, a model must select the correct caption over hard negative distractors. BiVLC extends this setup by pairing each hard negative caption with a synthetic hard negative image generated from it, enabling a thorough evaluation of models in both the I2T and T2I directions.
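To make the construction concrete, the sketch below shows one way a hard negative image could be rendered from a hard negative caption with an off-the-shelf text-to-image model. The choice of the `diffusers` library and the `stabilityai/stable-diffusion-xl-base-1.0` checkpoint is an assumption for illustration, not a claim about the authors' exact generation pipeline.

```python
# Sketch: render a hard negative caption into a candidate hard negative image.
# Assumes the Hugging Face `diffusers` library and an SDXL checkpoint; the
# paper's actual generation and filtering setup may differ.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def generate_hard_negative(negative_caption: str, seed: int = 0):
    """Generate one image candidate for a hard negative caption."""
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt=negative_caption, generator=generator).images[0]
    return image  # candidates are later filtered by human annotators
```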
Human annotators validate the generated images, eliminating ill-formed examples and ensuring high quality. The dataset reveals significant performance disparities in current multimodal models, particularly poor performance in the T2I direction. The experiments demonstrate that including synthetic hard negative images in contrastive training improves performance on both the traditional SugarCrepe benchmark and the proposed BiVLC benchmark, although a notable gap to human performance remains, indicating that VLC is still an open challenge.
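The following sketch illustrates how a CLIP-style model can be evaluated in both directions on a single instance consisting of a positive image, a negative image, a positive caption, and a negative caption. The Hugging Face `transformers` CLIP checkpoint is an assumed stand-in for the models evaluated in the paper, and the per-instance decision rules are illustrative rather than the paper's exact protocol.

```python
# Sketch: score one BiVLC-style instance (image+, image-, caption+, caption-)
# in both retrieval directions with a CLIP-style dual encoder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_instance(pos_image: Image.Image, neg_image: Image.Image,
                   pos_caption: str, neg_caption: str) -> dict:
    inputs = processor(text=[pos_caption, neg_caption],
                       images=[pos_image, neg_image],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sim = model(**inputs).logits_per_image  # shape (2 images, 2 captions)
    # I2T: each image must prefer its own caption over the distractor.
    i2t = bool(sim[0, 0] > sim[0, 1]) and bool(sim[1, 1] > sim[1, 0])
    # T2I: each caption must prefer its own image over the distractor.
    t2i = bool(sim[0, 0] > sim[1, 0]) and bool(sim[1, 1] > sim[0, 1])
    return {"i2t": i2t, "t2i": t2i, "group": i2t and t2i}
```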
Experimental Contributions
The paper makes several contributions:
- The introduction of the BiVLC dataset, which extends SugarCrepe by adding negative images, thereby supporting both I2T and T2I retrieval directions.
- Validation through human annotation to filter out invalid or ambiguous instances, ensuring dataset integrity.
- Performance analyses showing that current models underperform in the T2I direction, an imbalance not observed in humans, who score similarly in both directions.
- Contrastive training with synthetic hard negative texts and images, which improves the state of the art on both the SugarCrepe and BiVLC benchmarks (see the training sketch after this list).
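The last contribution relies on contrastive training with explicit hard negatives. Below is a minimal sketch of one way to fold a hard negative image and a hard negative caption into a symmetric CLIP-style InfoNCE loss; the batching scheme, function name, and loss form are illustrative assumptions, not the authors' exact training recipe.

```python
# Sketch: symmetric contrastive loss with explicit hard negatives.
# img/txt are L2-normalised embeddings of shape (B, dim); hard_img/hard_txt
# hold one synthetic hard negative embedding per positive pair.
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(img, txt, hard_img, hard_txt,
                                         temperature=0.07):
    targets = torch.arange(img.size(0), device=img.device)

    # Image-to-text: positives on the diagonal, in-batch negatives, plus each
    # image's own hard negative caption appended as an extra logit column.
    logits_i2t = img @ txt.t() / temperature                              # (B, B)
    extra_i2t = (img * hard_txt).sum(dim=-1, keepdim=True) / temperature  # (B, 1)
    loss_i2t = F.cross_entropy(torch.cat([logits_i2t, extra_i2t], dim=1), targets)

    # Text-to-image: symmetric, with hard negative images appended.
    logits_t2i = txt @ img.t() / temperature
    extra_t2i = (txt * hard_img).sum(dim=-1, keepdim=True) / temperature
    loss_t2i = F.cross_entropy(torch.cat([logits_t2i, extra_t2i], dim=1), targets)

    return 0.5 * (loss_i2t + loss_t2i)
```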
The research identifies several noteworthy findings:
- Humans perform comparably in both I2T and T2I tasks, whereas current multimodal models show significantly worse performance in T2I retrieval.
- Bidirectional VLC is confirmed to be more difficult than I2T retrieval alone (see the metric sketch after this list).
- There is a lack of correlation between model performance on SugarCrepe and BiVLC, challenging the assumption that optimizing for unidirectional VLC generalizes to bidirectional scenarios.
- Training with hard negative images significantly boosts model performance, with the CLIP_TROHN-Img model showing particularly strong results in BiVLC.
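To make the comparison concrete, here is a small sketch that aggregates per-instance decisions (such as those returned by the `score_instance` sketch above) into I2T, T2I, and joint "group" accuracies; the exact metric definitions used in the paper may differ in detail. Because the group score requires an instance to be solved in both directions, it can never exceed either directional accuracy, which is one way to see why the bidirectional task is at least as hard as I2T retrieval alone.

```python
# Sketch: aggregate per-instance results into I2T, T2I, and group accuracies.
# Each result is a dict with boolean fields "i2t", "t2i", and "group".
from typing import Iterable

def aggregate(results: Iterable[dict]) -> dict:
    results = list(results)
    n = len(results)
    return {
        "i2t_acc": sum(r["i2t"] for r in results) / n,
        "t2i_acc": sum(r["t2i"] for r in results) / n,
        # group_acc <= min(i2t_acc, t2i_acc), since "group" requires both.
        "group_acc": sum(r["group"] for r in results) / n,
    }
```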
Practical and Theoretical Implications
The findings have both practical and theoretical implications:
- Practical: The creation of BiVLC sets a new standard for evaluating VLC, pushing the boundaries of how models are trained and tested in multimodal contexts. Models must now establish accurate image-text correspondences in both the I2T and T2I directions.
- Theoretical: The significant gap between model and human performance in T2I tasks points to uncharted territories in multimodal learning, suggesting that current model architectures and training paradigms need fundamental advancements.
Speculation on Future Developments in AI
Future work may look into several areas of improvement:
- Enhanced Generative Models: Improving the fidelity and diversity of synthetic image generation would likely reduce noise in training corpora such as TROHN-Img, the paper's synthetic image training set, enabling models to learn more robust visual-text correspondences.
- Bidirectional Training Approaches: Novel training techniques that simultaneously optimize for I2T and T2I tasks could bridge the performance gap seen in current models.
- Cross-lingual VLC: Extending benchmarks like BiVLC to multiple languages could provide a comprehensive understanding of VLC across different linguistic contexts.
BiVLC represents a significant step forward in multimodal research, challenging existing models and providing a robust framework for future innovations in Vision-Language Compositionality. The dataset, along with its foundational findings, lays the groundwork for advancing the capabilities of multimodal AI systems.