Essay: Predicting Visual Features from Text for Image and Video Caption Retrieval
The paper "Predicting Visual Features from Text for Image and Video Caption Retrieval" by Jianfeng Dong, Xirong Li, and Cees G. M. Snoek introduces a novel approach to image and video caption retrieval that operates within the visual feature space, a departure from traditional cross-modal retrieval methods that rely on joint subspace representations. The key contribution of this work is Word2VisualVec, a deep neural network that predicts a visual feature representation directly from textual input. The authors argue that this approach benefits from the increasing effectiveness of deep learning-based visual features, allowing a simpler model that operates in the visual space rather than having to learn a more complex joint subspace.
Word2VisualVec was evaluated through extensive experimentation on several benchmark datasets, including Flickr8k and Flickr30k for images and the MSVD and TRECVID benchmarks for videos. The results indicate that Word2VisualVec outperforms traditional joint subspace methods for image caption retrieval, highlighting its potential as a state-of-the-art approach. Specifically, the model achieved improvements across most recall metrics (R@1, R@5, R@10), demonstrating that prediction in the visual space is a viable solution to the caption retrieval problem.
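For concreteness, the recall metrics reported above are typically computed as the fraction of queries for which a correct caption appears in the top-K ranked candidates. The sketch below is a generic illustration of that computation (assuming, for simplicity, a single ground-truth caption per query), not the authors' evaluation code.

```python
import numpy as np

def recall_at_k(ranked_indices, relevant_index, ks=(1, 5, 10)):
    """Recall@K for one query: 1 if the relevant caption is ranked in the top K."""
    return {k: int(relevant_index in ranked_indices[:k]) for k in ks}

def evaluate(rank_lists, relevant_indices, ks=(1, 5, 10)):
    """Average Recall@K over all queries, reported as percentages."""
    hits = {k: 0 for k in ks}
    for ranked, rel in zip(rank_lists, relevant_indices):
        for k, hit in recall_at_k(ranked, rel, ks).items():
            hits[k] += hit
    n = len(rank_lists)
    return {f"R@{k}": 100.0 * hits[k] / n for k in ks}
```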
The Word2VisualVec architecture combines multi-scale sentence vectorization techniques, encompassing bag-of-words, word2vec, and RNN (in particular, Gated Recurrent Unit, GRU) embeddings, to handle varying sentence lengths and complexities. A multi-layer perceptron (MLP) then maps the textual encoding into the chosen visual feature space, with features extracted from pre-trained convolutional neural networks such as GoogLeNet and ResNet. An important observation of the paper is that multi-scale sentence vectorization, combined with deeper ConvNet features, yields the best retrieval performance.
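The paper does not tie the architecture to a particular framework; the PyTorch sketch below only illustrates the overall shape of such a text-to-visual-feature model. The layer sizes, the mean-pooling of word embeddings, the use of the final GRU state, and the concatenation of the three sentence encodings are assumptions made for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class Word2VisualVecSketch(nn.Module):
    """Illustrative text-to-visual-feature predictor: multi-scale sentence
    vectorization (bag-of-words + mean-pooled word embeddings + GRU) followed
    by an MLP that maps the concatenated encoding into a ConvNet feature space."""

    def __init__(self, vocab_size, w2v_dim=500, gru_dim=1024,
                 hidden_dim=2048, visual_dim=2048):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, w2v_dim)   # word2vec-style embeddings
        self.gru = nn.GRU(w2v_dim, gru_dim, batch_first=True)
        text_dim = vocab_size + w2v_dim + gru_dim            # bow + word2vec + GRU
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, visual_dim),               # e.g. a 2048-D ResNet feature
        )

    def forward(self, word_ids, bow):
        # word_ids: (batch, seq_len) token indices; bow: (batch, vocab_size) counts
        emb = self.embedding(word_ids)                       # (batch, seq_len, w2v_dim)
        w2v_avg = emb.mean(dim=1)                            # mean-pooled word embeddings
        _, h_last = self.gru(emb)                            # final GRU hidden state
        sentence_vec = torch.cat([bow, w2v_avg, h_last.squeeze(0)], dim=1)
        return self.mlp(sentence_vec)                        # predicted visual feature
```

Training such a model would typically minimize a regression loss (for example, mean squared error) between the predicted vector and the ConvNet feature of the image or video the sentence describes.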
The research underscores the benefits of a one-way mapping from text into the visual modality, as opposed to learning a joint subspace: because the visual vectors predicted from candidate sentences can be precomputed, retrieval itself is efficient. This property becomes particularly beneficial on large-scale datasets where real-time retrieval is required.
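Because the mapping goes only one way, the predicted vectors for an entire sentence pool can be computed offline, leaving query time dominated by a single similarity search. The following minimal sketch assumes cosine similarity in the visual space and NumPy arrays; it is not the authors' implementation.

```python
import numpy as np

def rank_captions(image_feature, caption_vectors):
    """Rank candidate captions for a query image by cosine similarity between
    the image's ConvNet feature and the captions' precomputed predicted
    visual vectors."""
    img = image_feature / np.linalg.norm(image_feature)
    caps = caption_vectors / np.linalg.norm(caption_vectors, axis=1, keepdims=True)
    similarities = caps @ img                    # one dot product per candidate caption
    return np.argsort(-similarities)             # best-matching captions first

# caption_vectors can be predicted once, offline, for the whole sentence pool;
# at query time only feature extraction and a matrix-vector product remain.
```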
Furthermore, Word2VisualVec enables multi-modal query composition, demonstrated by adding and subtracting visual features predicted from textual inputs. This capability opens the door to more flexible query formulations in practical applications.
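Such composition can be sketched as simple vector arithmetic on predicted visual features. In the snippet below, the predict() method, the dim parameter, and the example phrases are hypothetical conveniences for illustration, not the paper's API.

```python
import numpy as np

def compose_query(model, add_texts, subtract_texts=(), dim=2048):
    """Illustrative query composition: predict a visual vector for each textual
    fragment, then add or subtract the predictions to form one query vector."""
    query = np.zeros(dim)
    for text in add_texts:
        query += model.predict(text)      # hypothetical predict(): text -> visual vector
    for text in subtract_texts:
        query -= model.predict(text)
    return query / (np.linalg.norm(query) + 1e-12)   # normalize for cosine ranking

# Example (hypothetical): emphasize one concept while suppressing another.
# query_vec = compose_query(w2vv, ["a man riding a horse"], ["beach"])
```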
While Word2VisualVec achieves impressive results, the authors note potential limitations and areas for future exploration, such as combining their retrieval system with generative models for cases where pre-defined sentences are not available. They also suggest integrating additional modalities and finer-grain visual features, which could potentially enhance retrieval accuracy and applicability across more diverse domains.
The implications of this research extend to various domains reliant on efficient and accurate multimedia retrieval systems. By bypassing the need for a shared latent space, Word2VisualVec simplifies the retrieval process and focuses computational resources on optimizing the text-to-visual feature mapping. Future developments in AI leveraging this framework could lead to more efficient retrieval systems, particularly when processing multimedia in time-sensitive environments is essential.
In summary, the paper provides compelling evidence supporting the use of visual spaces for caption retrieval tasks, setting a foundation for future research to explore further integration with advanced deep learning architectures and broader multimedia applications.