Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders
The paper, authored by Messina et al., presents a novel approach to cross-modal retrieval, focusing on image-sentence matching with a method termed Transformer Encoder Reasoning and Alignment Network (TERAN). The method tackles the inherent challenges of multi-modal matching by leveraging the transformer encoder architecture to achieve fine-grained alignment between the visual and textual modalities. TERAN's significance lies in preserving information richness by aligning individual image regions with the corresponding words of a sentence, while remaining efficient enough for large-scale retrieval systems.
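To make the alignment idea concrete, the sketch below (hypothetical PyTorch; it assumes a max-over-regions, sum-over-words pooling, one plausible choice among the pooling strategies the paper compares) shows how a fine-grained region-word similarity matrix collapses into a single image-sentence score:

```python
import torch
import torch.nn.functional as F

def alignment_score(regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    # regions: (n_regions, d) region features; words: (n_words, d) word features.
    # L2-normalize so the dot product below is a cosine similarity.
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    sim = regions @ words.t()            # (n_regions, n_words) alignment matrix
    # Pool the fine-grained matrix into one global image-sentence score:
    # for each word keep its best-matching region, then sum over words.
    return sim.max(dim=0).values.sum()
```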
TERAN is evaluated on the widely used MS-COCO and Flickr30k datasets, outperforming contemporary methods on both image and sentence retrieval tasks. The approach reportedly improves Recall@1 by 5.7% for image retrieval and 3.5% for sentence retrieval on the MS-COCO 1K test set. These gains suggest that precise region-word alignment translates directly into state-of-the-art retrieval performance.
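For reference, Recall@K measures the fraction of queries whose ground-truth match appears among the top K retrieved items. A minimal sketch (hypothetical NumPy, assuming one ground-truth item per query; datasets with several captions per image need the obvious generalization):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int = 1) -> float:
    # sim[i, j]: score of query i against gallery item j; the ground-truth
    # match for query i is assumed to be item i.
    top_k = np.argsort(-sim, axis=1)[:, :k]               # best-scoring items first
    hits = (top_k == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())
```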
Critically, TERAN addresses the scalability problem in cross-modal retrieval by keeping the visual and textual pipelines separate until the final alignment phase, just before loss computation. This design deliberately forgoes cross-attention mechanisms, which would prevent the independent feature extraction and indexing essential for large-scale retrieval. Within each pipeline, transformer encoders discover relationships among image regions and among sentence words through self-attention, and meaningful region-word alignments emerge from image-sentence level supervision alone, without explicit region-word annotations.
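The sketch below (hypothetical PyTorch; layer counts and dimensions are illustrative, not the paper's) captures this late-fusion layout:

```python
import torch.nn as nn

class TwoStreamEncoder(nn.Module):
    # Late-fusion sketch: one transformer encoder stack per modality,
    # with no cross-attention anywhere in the network.
    def __init__(self, d_model: int = 512, depth: int = 4, heads: int = 8):
        super().__init__()
        make = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, heads, batch_first=True),
            num_layers=depth)
        self.visual, self.textual = make(), make()

    def forward(self, regions, words):
        # Self-attention reasons within each modality independently.
        v = self.visual(regions)    # (B, n_regions, d_model)
        t = self.textual(words)     # (B, n_words, d_model)
        return v, t                 # modalities meet only at alignment/loss time
```

Because the image branch never consults the text, region features can be extracted once and stored in an index, which is precisely the property that makes the approach scale.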
The authors substantiate TERAN's capabilities through systematic evaluations, comparing it against a range of baselines and highlighting the superior numerical results achieved. Alongside the traditional Recall@K metrics, they employ the Normalized Discounted Cumulative Gain (NDCG) metric, using ROUGE-L and SPICE to grade the relevance of retrieved captions. This dual-metric evaluation provides a more nuanced view of TERAN's retrieval performance, particularly in scenarios that reward relevant rather than strictly exact matches.
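A minimal sketch of NDCG@K (hypothetical NumPy, using the common linear-gain DCG formulation; the paper's exact variant may differ), where each relevance grade would be a ROUGE-L or SPICE similarity between a retrieved caption and the query's ground-truth captions:

```python
import numpy as np

def ndcg_at_k(relevances: np.ndarray, k: int = 25) -> float:
    # relevances: graded relevance of retrieved items, in ranked order.
    rel = relevances[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    # Normalize by the DCG of the ideal (relevance-sorted) ranking.
    ideal = np.sort(relevances)[::-1][:k]
    idcg = float((ideal * discounts[:ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0
```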
TERAN's architecture, which merges visual and textual information only at the final alignment phase, paves the way for future research on efficient cross-modal retrieval systems. The paper suggests avenues for refining transformer-based multi-modal approaches to further improve retrieval accuracy and efficiency.
In their ablation studies, the authors explore model variations, including weight sharing between the two encoder stacks, pooling strategies, and the impact of different pre-trained language models, offering comprehensive insight into TERAN's operational mechanics. These experiments underscore TERAN's flexibility and adaptability across computational settings, further cementing its applicability to real-world scenarios.
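The weight-sharing variation, for instance, reduces to a simple construction choice; the sketch below (hypothetical PyTorch, reusing the two-stream layout sketched earlier) illustrates it:

```python
import torch.nn as nn

def build_streams(share_weights: bool, d_model: int = 512,
                  depth: int = 4, heads: int = 8):
    # Weight-sharing ablation: either a single encoder stack serves both
    # modalities, or each modality gets its own copy. Illustrative only.
    make = lambda: nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, heads, batch_first=True),
        num_layers=depth)
    visual = make()
    textual = visual if share_weights else make()
    return visual, textual
```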
Overall, this paper advances the field by addressing pivotal challenges in cross-modal information retrieval through innovative architectural choices and extensive empirical validation. The methods and findings presented have broad implications for the development of more sophisticated, scalable multi-modal systems, influencing both theoretical inquiries and practical applications in artificial intelligence.