Normalized and Geometry-Aware Self-Attention Network for Image Captioning
The paper advances image captioning by introducing a novel self-attention network architecture. The architecture incorporates two significant modifications to the conventional self-attention (SA) mechanism, both aimed at improving captioning performance: Normalized Self-Attention (NSA) and Geometry-aware Self-Attention (GSA).
Methodological Contributions
- Normalized Self-Attention (NSA): NSA addresses an internal covariate shift problem inside SA. Because the attention weights in SA are computed dynamically from each input, the distribution of the hidden activations drifts during training, destabilizing optimization. The authors therefore apply a normalization step inside the SA module itself: the input queries are normalized in the style of Instance Normalization (IN). Whereas Layer Normalization standardizes each position across its channels, the proposed normalization standardizes each channel across all positions of an instance, which is the appropriate choice when the effective layer parameters are instance-specific. This mitigates the distributional drift, stabilizes training, and improves generalization (see the NSA sketch after this list).
- Geometry-aware Self-Attention (GSA): The authors observe that conventional self-attention treats its input as an unordered set and therefore fails to capture the geometric relationships between image regions. GSA addresses this by incorporating relative geometry features, derived from the bounding boxes of detected objects, into the attention computation: the geometry relations are modeled as biases, so each attention score combines a content-based weight with a geometry-derived bias. This lets the attention distribution reflect spatial relationships such as relative position and scale, strengthening SA's ability to reason about object relations, which is crucial for comprehensively understanding visual inputs (a sketch of the additive-bias variant follows below).
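To make the NSA idea concrete, here is a minimal single-head PyTorch sketch. It is an illustration under assumptions, not the authors' implementation: the class name, the single-head simplification, and the epsilon constant are choices of this summary. The key line standardizes each query channel across the positions of an instance (Instance Normalization along the sequence axis), in contrast to LayerNorm's per-position statistics.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedSelfAttention(nn.Module):
    """Minimal single-head sketch of NSA: queries are instance-normalized
    (per channel, across the N input positions) before the dot-product
    attention, instead of the per-position statistics LayerNorm would use."""

    def __init__(self, d_model):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x):  # x: (batch, N, d_model)
        q = self.wq(x)
        # Instance Normalization over the position axis: each channel is
        # standardized using statistics of the whole instance (dim=1),
        # which removes the per-instance distributional drift of Q.
        q = (q - q.mean(dim=1, keepdim=True)) / (q.std(dim=1, keepdim=True) + 1e-6)
        k, v = self.wk(x), self.wv(x)
        attn = F.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        return attn @ v
```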
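The paper explores several variants of how geometry can modulate attention; the sketch below shows only a simple additive-bias formulation. The relative-geometry encoding (four log-ratios of box centers and sizes) follows common practice for object-region attention, and the helper names (`relative_geometry`, `geo_mlp`) and the MLP width are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_geometry(boxes):
    """boxes: (N, 4) as (cx, cy, w, h). Returns (N, N, 4) relative geometry
    features: translation- and scale-invariant log ratios between box pairs."""
    cx, cy, w, h = boxes.unbind(-1)
    dx = torch.log(torch.abs(cx[:, None] - cx[None, :]) / w[:, None] + 1e-3)
    dy = torch.log(torch.abs(cy[:, None] - cy[None, :]) / h[:, None] + 1e-3)
    dw = torch.log(w[:, None] / w[None, :])
    dh = torch.log(h[:, None] / h[None, :])
    return torch.stack([dx, dy, dw, dh], dim=-1)

class GeometryAwareAttention(nn.Module):
    """Sketch of an additive-bias GSA variant: a small MLP maps each pair's
    relative geometry to a scalar added to the content logits before softmax."""

    def __init__(self, d_model, d_geo=64):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.geo_mlp = nn.Sequential(nn.Linear(4, d_geo), nn.ReLU(), nn.Linear(d_geo, 1))
        self.scale = d_model ** -0.5

    def forward(self, x, boxes):  # x: (N, d_model), boxes: (N, 4)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        content = (q @ k.t()) * self.scale                             # (N, N)
        geo_bias = self.geo_mlp(relative_geometry(boxes)).squeeze(-1)  # (N, N)
        attn = F.softmax(content + geo_bias, dim=-1)
        return attn @ v
```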
Experimental Evaluations
The proposed methods are evaluated extensively on the MS-COCO image captioning dataset, where they establish a new state of the art. The full model, NG-SAN (Normalized and Geometry-aware Self-Attention Network), improves the best single-model CIDEr score on the official MS-COCO evaluation server from 125.5 to 128.6.
Further experiments test the generality of the two modules on video captioning, machine translation, and visual question answering. In each task, applying the relevant module (NSA or GSA) to a baseline model yields consistent improvements, confirming the versatility of the techniques.
Implications and Future Directions
Practically, integrating NSA and GSA into SA networks improves performance at minimal additional computational cost, which points to their utility in tasks beyond image captioning. Theoretically, normalization inside the attention mechanism, together with explicit geometric modeling, offers new insight into the design of attention-based architectures.
For future exploration, these advancements suggest room for further optimization and adaptation across domains that involve complex data representations. Exploring other normalization strategies inside attention mechanisms may yield additional gains, and NSA and GSA are likely to spur further strategies that exploit geometric and distributional structure to improve the expressiveness and training stability of neural networks.
In summary, the authors present a rigorous approach to improving SA networks, laying a foundation for subsequent work in AI-driven image analysis and natural language generation. The proposed methodologies are robust, broadly applicable enhancements to conventional self-attention frameworks.