Grounding of Textual Phrases in Images by Reconstruction
The paper "Grounding of Textual Phrases in Images by Reconstruction" by Rohrbach et al. addresses the problem of localizing natural language phrases within images, a task that bridges computer vision and natural language processing. A central concern of the work is how to ground phrases under varying levels of supervision, from no phrase-box annotations at all to full annotation.
Methodology
The proposed model, called GroundeR, has two parts linked by an attention mechanism. In the first phase, the model attends to the image regions most relevant to a given phrase; in the second, it reconstructs the phrase from the attended regions. Because the reconstruction objective requires no phrase-box annotations, the approach can operate under three supervision scenarios: unsupervised, semi-supervised, and fully supervised.
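Concretely, the attention and reconstruction steps can be summarized as follows. The notation here is adapted for this summary: h is the encoded phrase, v_1 through v_N are CNN features of the N region proposals, f is a learned scoring network, and j* indexes an annotated ground-truth proposal.

```latex
\alpha_j = \frac{\exp\big(f(h, v_j)\big)}{\sum_{k=1}^{N} \exp\big(f(h, v_k)\big)},
\qquad
z = \sum_{j=1}^{N} \alpha_j v_j

L_{\text{rec}} = -\log P(\text{phrase} \mid z), \qquad
L_{\text{att}} = -\log \alpha_{j^\ast}, \qquad
L = L_{\text{att}} + \lambda\, L_{\text{rec}}
```

The reconstruction loss trains the attention weights indirectly: the phrase can only be reconstructed well from the attended feature z if the attention selects the right regions. When annotations are available, the attention loss supervises the weights directly, and the semi-supervised objective combines the two with a weighting factor lambda.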
- Attention Mechanism: At the core of the network, a soft-attention mechanism distributes weights over candidate region proposals according to their relevance to the phrase. A Long Short-Term Memory (LSTM) network encodes the phrase, and a Convolutional Neural Network (CNN) extracts visual features for each proposal (see the code sketch after this list).
- Reconstruction Loss: When direct supervision is unavailable, the model grounds phrases indirectly through reconstruction: the phrase can only be reconstructed accurately if the model attends to the right regions, so minimizing reconstruction error drives the attention toward the correct proposals.
- Supervision Modes: The training objective adapts to the availability of phrase-box annotations. Without annotations, only the reconstruction loss is optimized; in semi-supervised scenarios, a weighted combination of the reconstruction loss and an attention loss on the annotated subset is used; the fully supervised variant relies on the attention loss for the highest accuracy.
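The following is a minimal PyTorch sketch of this attention-plus-reconstruction pattern. It is an illustration under assumptions, not the authors' released implementation: the class and argument names (PhraseGrounder, visual_dim, lam, and so on) are invented for this example, the attention scorer is a small MLP, and single-layer LSTMs stand in for the paper's encoder and decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PhraseGrounder(nn.Module):
    """Attention over region proposals, plus a phrase-reconstruction decoder."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, visual_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Scores each proposal against the encoded phrase (assumed MLP scorer).
        self.att_mlp = nn.Sequential(
            nn.Linear(hidden_dim + visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        # The decoder reconstructs the phrase from the attended visual feature.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.word_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, phrase_tokens, region_feats):
        # phrase_tokens: (B, T) word ids; region_feats: (B, N, visual_dim)
        # holding CNN features for N box proposals per image.
        emb = self.embed(phrase_tokens)                       # (B, T, E)
        _, (h, _) = self.encoder(emb)                         # h: (1, B, H)
        h = h.squeeze(0)                                      # (B, H)
        n = region_feats.size(1)
        h_rep = h.unsqueeze(1).expand(-1, n, -1)              # (B, N, H)
        scores = self.att_mlp(
            torch.cat([h_rep, region_feats], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=1)                      # attention weights
        attended = torch.bmm(
            alpha.unsqueeze(1), region_feats).squeeze(1)      # (B, visual_dim)

        # Reconstruction with teacher forcing: the decoder is initialized from
        # the attended feature and re-predicts the phrase tokens.
        h0 = torch.tanh(self.visual_proj(attended)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(emb, (h0, c0))              # (B, T, H)
        return alpha, self.word_out(dec_out)                  # logits: (B, T, V)


def grounding_loss(alpha, logits, phrase_tokens, gt_box=None, lam=1.0):
    # Reconstruction loss: the decoder state at step t predicts token t+1.
    rec = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        phrase_tokens[:, 1:].reshape(-1),
    )
    if gt_box is None:                 # unsupervised: reconstruction only
        return rec
    # Supervised attention loss on annotated examples (gt_box: (B,) long
    # indices of the ground-truth proposal).
    att = F.nll_loss(torch.log(alpha + 1e-8), gt_box)
    if lam == 0.0:                     # fully supervised: attention loss only
        return att
    return att + lam * rec             # semi-supervised combination
```

At inference time the decoder is discarded: the proposal receiving the highest attention weight is returned as the predicted grounding for the phrase, which is then scored against the ground-truth box at an intersection-over-union threshold of 0.5, the paper's accuracy criterion.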
Results
Empirical evaluations on the Flickr30k Entities and ReferItGame datasets demonstrate the model's efficacy. Notably, the unsupervised variant achieves 28.94% accuracy on Flickr30k Entities using VGG-DET features, while the fully supervised model attains 47.81%, surpassing the state of the art at the time. The semi-supervised approach also improves markedly with little supervision: using only 3.12% of the available annotations already yields a clear accuracy gain over the unsupervised variant.
On the ReferItGame dataset, the fully supervised model achieves 26.93% accuracy, significantly outperforming previous methods such as SCRC, which records 17.93%.
Implications and Future Directions
The implications of this research are notable both practically and theoretically. Practically, the approach aids applications such as image captioning, interactive systems, and enhanced image search capabilities. Theoretically, it advances the understanding of multi-modal representation learning, particularly in integrating reconstruction-based objectives with attention mechanisms.
Future research directions include incorporating segmentation proposals instead of bounding boxes, enforcing sentence-level constraints during training, and modeling spatial relationships between phrases. Such extensions could further improve the precision and applicability of grounding techniques across domains.
Overall, the paper marks a significant advance in phrase grounding, providing a robust framework adaptable to varying supervision levels and setting a precedent for subsequent work on visual-linguistic grounding tasks.