Attention on Attention for Image Captioning
The paper "Attention on Attention for Image Captioning" introduces the Attention on Attention (AoA) module, an advancement in attention mechanisms employed within encoder-decoder frameworks for image captioning. This module is designed to address the limitations in the conventional attention mechanisms by improving the relevance determination between attention results and queries. The proposed AoA module is integrated into an image captioning model termed AoANet, which shows substantial improvements over previous methods.
Summary of the AoA Module
The AoA module extends traditional attention mechanisms, which simply return a weighted average of the encoded vectors to guide decoding. In such models, the decoder receives no signal about how relevant the attention result is to the query, which can mislead generation when the result is uninformative. The AoA module mitigates this by introducing an information vector and an attention gate.
The AoA module operates as follows:
- Generate an information vector and an attention gate: both are computed from the attention result and the current query via linear transformations. The information vector combines the newly attended information with the query context, while the attention gate is a sigmoid output that scores, channel by channel, how relevant that information is.
- Apply a second attention: the gate filters the information vector through element-wise multiplication, keeping only the useful information (the "attended information").
This attention over the attention result ensures that the decoder receives only contextually relevant information at each step, thereby refining the caption generation process.
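The PyTorch sketch below illustrates the idea. It is a minimal sketch under stated assumptions: the class and parameter names (AoAModule, d_model, n_heads) are ours, not from the authors' released code, and the inner attention is a standard nn.MultiheadAttention rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AoAModule(nn.Module):
    """Minimal sketch of Attention on Attention (AoA).

    Wraps a multi-head attention call and then re-weighs its result:
      information vector  i     = W_i [q; v_hat] + b_i
      attention gate      g     = sigmoid(W_g [q; v_hat] + b_g)
      attended info       i_hat = g * i   (element-wise)
    Names and layer shapes are illustrative, not the authors' code.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Linear maps over the concatenated [query; attention result].
        self.info = nn.Linear(2 * d_model, d_model)   # information vector
        self.gate = nn.Linear(2 * d_model, d_model)   # attention gate

    def forward(self, query, key, value):
        # Conventional attention: weighted average of the value vectors.
        v_hat, _ = self.attn(query, key, value)
        qv = torch.cat([query, v_hat], dim=-1)
        i = self.info(qv)                  # candidate information
        g = torch.sigmoid(self.gate(qv))   # per-channel relevance gate
        return g * i                       # "attention on attention"

# Toy usage: 36 region features attended by a single decoder query.
if __name__ == "__main__":
    aoa = AoAModule()
    feats = torch.randn(1, 36, 512)        # e.g. detected object features
    q = torch.randn(1, 1, 512)             # current decoding query
    print(aoa(q, feats, feats).shape)      # torch.Size([1, 1, 512])
```

The gate costs only one extra linear layer per attention call, which is why the module can be dropped into both the encoder and the decoder.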
AoANet: Integrating AoA into Image Captioning
AoANet incorporates the AoA module into both its encoder and decoder. The encoder refines the feature vectors of objects detected in the image with a refining module that combines multi-head self-attention and AoA, modeling inter-object relationships more effectively. In each refining layer, the feature vectors attend to one another, and AoA (taking the place of the usual Transformer feed-forward sub-layer) determines which parts of the attention result to keep.
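A rough sketch of such a refining layer is shown below. It reuses the AoAModule class from the previous sketch and adds a residual connection and layer normalization; the depth and width are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AoARefiner(nn.Module):
    """Sketch of the AoA encoder's refining module.

    Each layer applies multi-head *self*-attention over the object
    features and feeds the result through AoA (in place of the usual
    Transformer feed-forward block), with a residual connection and
    layer normalization. Assumes the AoAModule class from the earlier
    sketch; depth and dimensions are illustrative.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(AoAModule(d_model, n_heads) for _ in range(n_layers))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))

    def forward(self, feats):
        # feats: (batch, num_objects, d_model) region features
        for aoa, norm in zip(self.layers, self.norms):
            # Self-attention: features attend to one another, and AoA
            # keeps only the relevant part of each attention result.
            feats = norm(feats + aoa(feats, feats, feats))
        return feats
```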
The decoder in AoANet uses AoA to build the context vectors from which the next word is predicted. At each step, an LSTM processes the current word embedding together with the mean-pooled image features and the previous context vector; its hidden state then queries the refined image features through multi-head attention, and an AoA module filters the attention result into the new context vector, allowing the decoder to suppress irrelevant information and produce more accurate captions.
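The sketch below outlines one decoding step under the same assumptions. It reuses the AoAModule class from above; the vocabulary size, dimensions, and the exact way the mean-pooled features and previous context are combined are simplifications, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class AoADecoderStep(nn.Module):
    """Sketch of one AoANet decoding step (simplified).

    An LSTM cell consumes the current word embedding together with the
    mean-pooled image features and the previous context vector; its
    hidden state queries the refined features through AoA, and the
    resulting context vector scores the next word. Reuses AoAModule
    from the earlier sketch; sizes are illustrative.
    """

    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTMCell(2 * d_model, d_model)
        self.aoa = AoAModule(d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, word_ids, feats, ctx_prev, state):
        # word_ids: (batch,), feats: (batch, num_objects, d_model),
        # ctx_prev: (batch, d_model) context vector from the previous step.
        x = torch.cat([self.embed(word_ids), feats.mean(dim=1) + ctx_prev], dim=-1)
        h, c = self.lstm(x, state)
        # The LSTM hidden state acts as the query over the image features;
        # AoA filters the attention result into the new context vector.
        ctx = self.aoa(h.unsqueeze(1), feats, feats).squeeze(1)
        logits = self.out(ctx)              # scores for the next word
        return logits, ctx, (h, c)
```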
Experimental Results
AoANet significantly outperforms existing image captioning models on the MS COCO dataset. Key quantitative results:
- AoANet achieves a CIDEr-D score of 129.8 on the MS COCO “Karpathy” offline test split and 129.6 on the official online testing server.
- These scores set a new state of the art, improving on previous leading models by roughly 2-5% across evaluation metrics including BLEU, METEOR, ROUGE-L, and SPICE.
An ensemble of AoANet models pushes the result further, to a CIDEr-D score of 132.0, demonstrating that the AoA module is effective in both single-model and ensemble settings.
Implications and Future Work
Practically, AoANet delivers more accurate and context-aware image descriptions, which matters for automatic content generation, accessibility technologies, and visual search systems. Theoretically, the AoA module is a general refinement of attention mechanisms: gating the attention result by its relevance to the query carries over to tasks beyond image captioning, such as video captioning and other sequence-to-sequence problems.
Future developments could explore further optimizations and applications of AoA. Potential directions include:
- Extending AoA into multi-modal systems where text, audio, and visual inputs are combined.
- Investigating the integration of AoA modules into other attention-heavy domains, such as neural machine translation and sentiment analysis.
- Exploring hierarchical and nested AoA structures for dealing with more complex datasets.
In conclusion, the introduction of the AoA module and its integration into AoANet sets a benchmark in image captioning performance, paving the way for more sophisticated attention mechanisms in various AI applications.