- The paper introduces a deep multimodal framework that jointly learns visual and textual representations by integrating a CNN with an LSTM.
- It employs a hinge-based triplet ranking loss to align image-text pairs in a shared embedding space, demonstrating improved Recall@K and median rank on standard benchmarks.
- Experimental evaluation on a novel Twitter dataset highlights the model’s robustness and adaptability for real-world cross-media retrieval.
Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval
Introduction
The integration of social networking platforms with artificial intelligence has created broad opportunities for cross-media retrieval tasks such as image-text matching. This work focuses on learning a visual-textual embedding space that matches images with their corresponding sentences, using a deep learning framework designed for joint vision and language representation learning. The objective is to develop a model that bridges the semantic gap between the image and text modalities, thereby improving image-text similarity inference.
Figure 1: Motivation/concept figure: given an image as input, the system automatically retrieves the closest textual description, illustrating the dataset's intended application.
Model Architecture
The paper introduces a multimodal neural network composed of a convolutional neural network (CNN) and a long short-term memory (LSTM) network, supplemented by fully-connected layers. This architecture facilitates the simultaneous learning of both vision and language representations. The network's two primary components are designed to transform visual and textual data into a shared feature space, enabling efficient cross-modal matching.
Image Representation: The image subnet uses a modified ResNet-50, pre-trained on ImageNet, as the feature extractor. A fully-connected layer then projects the resulting visual feature vector into the shared embedding space, matching the dimensionality of the text representation.
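As a rough illustration, the image subnet could be assembled as follows. This is a minimal sketch assuming a PyTorch implementation; the embedding dimension and class name are chosen for illustration rather than taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class ImageEncoder(nn.Module):
    """ResNet-50 backbone followed by a dense projection into the shared space."""
    def __init__(self, embed_dim=1024):
        super().__init__()
        backbone = models.resnet50(pretrained=True)             # ImageNet weights
        # Drop the classification head; keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(2048, embed_dim)                    # project into the shared space

    def forward(self, images):
        feats = self.backbone(images).flatten(1)                # (batch, 2048)
        return F.normalize(self.fc(feats), dim=1)               # unit-norm for cosine similarity
```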
Text Representation: Textual data is processed through a custom pipeline of tokenization, stemming, and one-hot encoding. An LSTM then captures semantic features from the encoded input and produces feature vectors compatible with the shared embedding space.
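A corresponding sketch of the text subnet, under the same assumptions (PyTorch; vocabulary size, hidden size, and embedding size are placeholders), might look like this. Note that an `nn.Embedding` lookup over token indices is mathematically equivalent to multiplying one-hot vectors by a dense weight matrix.

```python
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Token indices -> LSTM -> dense projection into the shared space."""
    def __init__(self, vocab_size, embed_dim=1024, hidden_dim=512):
        super().__init__()
        # Equivalent to a dense layer applied to one-hot encoded tokens.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, embed_dim)

    def forward(self, token_ids):
        x = self.embed(token_ids)                  # (batch, seq_len, hidden_dim)
        _, (h_n, _) = self.lstm(x)                 # final hidden state summarizes the sentence
        return F.normalize(self.fc(h_n[-1]), dim=1)
```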
Figure 2: Proposed end-to-end multimodal neural network architecture for learning the image and text representations.
Alignment Objective
The training process optimizes image-text similarity with a hinge-based triplet ranking loss. The loss pulls correct image-text pairs close together in the embedding space while pushing mismatched pairs apart by at least a fixed margin, encouraging a clear separation between positive and negative examples.
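A minimal sketch of such a loss, assuming L2-normalized embeddings, cosine similarity, and all in-batch mismatches used as negatives (the margin value below is illustrative, not taken from the paper):

```python
import torch

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based triplet ranking loss with in-batch negatives.

    img_emb, txt_emb: L2-normalized (batch, dim) tensors; row i of each is a matched pair.
    """
    scores = img_emb @ txt_emb.t()                    # cosine similarity matrix
    pos = scores.diag().view(-1, 1)                   # similarities of matched pairs

    # Image as anchor: every non-matching caption is a negative.
    cost_txt = (margin + scores - pos).clamp(min=0)
    # Caption as anchor: every non-matching image is a negative.
    cost_img = (margin + scores - pos.t()).clamp(min=0)

    # Exclude the positives themselves from the hinge terms.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()
```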
Experimental Evaluation
The model was evaluated on established benchmarks such as MS-COCO to validate its efficacy and generalization across datasets. Extensive experiments show that the proposed framework yields competitive results, surpassing several prior methods on sentence and image retrieval tasks. Retrieval accuracy was quantified with Recall@K and median rank, further highlighting the framework's effectiveness.
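These metrics are straightforward to compute from an image-text similarity matrix. The sketch below assumes one ground-truth caption per image, with matching pairs aligned along the diagonal, which simplifies the usual multi-caption MS-COCO protocol.

```python
import numpy as np

def retrieval_metrics(similarity, ks=(1, 5, 10)):
    """Recall@K and median rank for image-to-text retrieval.

    similarity: (N, N) array where similarity[i, j] scores image i against
    caption j, and caption i is the ground truth for image i.
    """
    n = similarity.shape[0]
    order = np.argsort(-similarity, axis=1)                       # captions ranked per image
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(n)])

    recall_at_k = {k: float(np.mean(ranks < k)) for k in ks}
    median_rank = float(np.median(ranks)) + 1                     # reported 1-indexed
    return recall_at_k, median_rank
```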
Data Collection
A novel dataset composed of tweets paired with images was collected from Twitter, presenting unique challenges due to the informal and varied nature of social media text. This dataset proved instrumental in demonstrating the model's robustness and adaptability to real-world image-text matching scenarios.
Conclusion
The paper introduces a multimodal architecture for image-text embedding and retrieval that advances cross-media retrieval methodology, and the new, diverse Twitter dataset strengthens the model's practical applicability. Future research may explore improved multimodal embeddings and broader application to other languages and real-world datasets.