Instance-aware Image and Sentence Matching with Selective Multimodal LSTM
The paper presents a novel approach to image and sentence matching built around a selective multimodal Long Short-Term Memory network (sm-LSTM). Its central contribution is an instance-aware methodology that combines a multimodal context-modulated attention scheme with a tailored LSTM network. The resulting framework targets the core challenge of the task: accurately measuring the visual-semantic similarity between an image and a sentence.
The crux of the problem in image-sentence matching is aggregating the local similarities that arise from individual instance pairs (i.e., objects in images and corresponding words in sentences) into a global similarity score. The sm-LSTM goes beyond previous works by selectively attending to salient image-sentence instance pairs, predicting instance-aware saliency maps for each modality. This selection is guided by a multimodal global context that serves as a reference for the attention scheme, sharpening the choice of salient instance pairs.
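To make the mechanism concrete, the following sketch shows one plausible form of context-modulated attention in PyTorch. The projection matrices `W_l`, `W_h`, `W_g` and the scoring vector `w` are illustrative placeholders rather than the paper's exact parameterization; the only assumption is that local candidates are scored against the previous hidden state and a global context vector before a softmax produces the saliency weights.

```python
import torch
import torch.nn.functional as F

def context_modulated_attention(local_feats, hidden, global_ctx, W_l, W_h, W_g, w):
    """Hypothetical sketch, not the authors' exact formulation.
    local_feats: (num_regions, d) local instance candidates;
    hidden: (d,) previous hidden state of the matching LSTM;
    global_ctx: (d,) multimodal global context vector;
    W_l, W_h, W_g: (d, d) projections; w: (d,) scoring vector."""
    # Score each local candidate, modulated by the hidden state and the global context.
    scores = torch.tanh(local_feats @ W_l + hidden @ W_h + global_ctx @ W_g) @ w
    # Instance-aware saliency weights over the local candidates.
    alpha = F.softmax(scores, dim=0)
    # Weighted sum selects the currently salient instance.
    attended = alpha @ local_feats
    return attended, alpha
```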
The sm-LSTM integrates this context-modulated attention with an LSTM network to capture local similarities at each timestep, sequentially aggregating these similarities into a comprehensive global similarity score. This design allows the sm-LSTM to dynamically select and weigh important local instances, mitigating the noise from irrelevant pairs that previous many-to-many approaches inadequately addressed.
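A minimal sketch of this sequential aggregation, assuming region-level image features and word-level sentence features of a shared dimensionality, might look as follows. The `GlobalSimilarity` module, its simple linear attention scorers, and the fixed three timesteps are assumptions for illustration, not the published sm-LSTM architecture; the point is only to show local instance selection at each step feeding an LSTM that accumulates a global score.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalSimilarity(nn.Module):
    """Illustrative aggregator: attend to one instance per modality at each
    timestep and let an LSTM accumulate the local evidence into a global score."""
    def __init__(self, dim, steps=3):
        super().__init__()
        self.steps = steps
        self.img_attn = nn.Linear(dim, 1)   # stand-in attention scorers; the paper
        self.txt_attn = nn.Linear(dim, 1)   # modulates these with multimodal context
        self.cell = nn.LSTMCell(2 * dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, img_feats, txt_feats):
        """img_feats: (num_regions, dim); txt_feats: (num_words, dim)."""
        h = img_feats.new_zeros(1, self.cell.hidden_size)
        c = img_feats.new_zeros(1, self.cell.hidden_size)
        for _ in range(self.steps):
            # Soft-select one salient instance per modality.
            a_img = F.softmax(self.img_attn(img_feats), dim=0)
            a_txt = F.softmax(self.txt_attn(txt_feats), dim=0)
            v = (a_img * img_feats).sum(dim=0)   # attended image instance
            s = (a_txt * txt_feats).sum(dim=0)   # attended sentence instance
            # The joint representation of the selected pair drives the LSTM update.
            h, c = self.cell(torch.cat([v, s]).unsqueeze(0), (h, c))
        return self.score(h).squeeze()           # global similarity score
```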
In terms of empirical evaluation, the paper demonstrates the efficacy of the sm-LSTM on the established Flickr30K and Microsoft COCO benchmarks. The proposed model achieves state-of-the-art performance on image annotation (sentence retrieval) and image retrieval, with notable improvements in recall at several cutoffs (R@1, R@5, R@10), outperforming several contemporary models, including those that rely on structured objectives or external text corpora. This underscores the effectiveness of selective instance pairing and the integrated attention mechanism for cross-modal similarity measurement.
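For reference, R@K in this setting is the percentage of queries whose ground-truth counterpart ranks within the top K retrieved candidates. A generic helper (not tied to the authors' evaluation code) could be written as:

```python
import numpy as np

def recall_at_k(similarity, gt_index, k):
    """similarity: (num_queries, num_candidates) matrix of matching scores;
    gt_index[i] is the ground-truth candidate index for query i."""
    ranks = np.argsort(-similarity, axis=1)                 # best candidate first
    hits = [gt_index[i] in ranks[i, :k] for i in range(len(gt_index))]
    return 100.0 * np.mean(hits)

# Toy usage: 3 image queries ranked against 5 candidate sentences.
sim = np.random.rand(3, 5)
print(recall_at_k(sim, gt_index=[0, 2, 4], k=1),
      recall_at_k(sim, gt_index=[0, 2, 4], k=5))
```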
The research indicates that leveraging both attention schemes and global context is crucial for improving the accuracy of image-sentence matching tasks. It also provides promising insights into the benefits of end-to-end trainable models in this domain, pointing towards potential future developments that could further refine instance-aware saliency prediction.
Future work could investigate more advanced formulations of context modulation within the attention framework, evaluate the approach on additional datasets and modalities, and fine-tune the pretrained CNN components end-to-end, which could bolster performance further. This line of research enriches the multimodal processing domain and points towards AI systems that better understand and integrate different types of data.