Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing
Referring expression grounding is a vision-and-language task that requires accurate cross-modal alignments to locate objects in an image based on natural language descriptions. This paper introduces a cross-modal attention-guided erasing approach to improve the grounding process. Unlike previous models, which predominantly capture the most salient alignments between text and visual elements, this approach seeks to uncover and learn the complementary correspondences that are often overlooked.
The key idea is an erasing mechanism: during training, the dominant textual or visual features, as indicated by attention weights, are deliberately discarded. This yields harder training samples that push the model to seek additional evidence across both modalities, encouraging it to learn richer cross-modal correspondences.
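As a rough illustration of the core operation, the sketch below (a minimal PyTorch-style example, not the authors' implementation) zeroes out the single most-attended feature in a set, as indicated by its attention weight, to produce a harder training sample; the tensor shapes and the zeroing strategy are simplifying assumptions.

```python
import torch

def erase_most_attended(features: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
    """Drop the single most-attended feature per sample.

    features: (batch, n, d) word or region features of one modality.
    attn:     (batch, n)    attention weights over those features.
    """
    erased = features.clone()
    top = attn.argmax(dim=1)                            # index of the dominant feature
    erased[torch.arange(features.size(0)), top] = 0.0   # remove its evidence
    return erased
```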
The methodology is structured around three distinct erasing mechanisms, each sketched in code after this list:
- Image-aware Query Sentence Erasing: The model measures each word's importance from image-conditioned attention weights, and the most attended words are replaced with an "unknown" token. This preserves sentence structure while removing the influence of those words, encouraging the model to explore alternative alignments.
- Sentence-aware Subject Region Erasing: This mechanism erases the critical regions of the subject module's feature map, as determined by sentence-aware spatial attention, forcing the model to rediscover complementary regions rather than focusing only on the most discriminative areas.
- Sentence-aware Context Object Erasing: Applied to the location and relationship modules, this mechanism discards the most prominent context objects, as determined by sentence-aware attention, encouraging the model to leverage other regions or modules.
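The sketch below shows how the three mechanisms might look in code (again PyTorch-style and not the authors' implementation): the shapes, the unknown-token id, and the choice to erase a single word, cell, or object are simplifying assumptions, and the actual method may erase larger regions or use a different replacement scheme.

```python
import torch

UNK_TOKEN_ID = 0  # hypothetical id of the "unknown" token in the vocabulary

def erase_query_word(word_ids, word_attn):
    """Image-aware query sentence erasing: replace the most image-attended word
    with the unknown token, keeping sentence length intact.
    word_ids: (batch, T) token ids; word_attn: (batch, T) image-conditioned weights."""
    out = word_ids.clone()
    top = word_attn.argmax(dim=1)
    out[torch.arange(out.size(0)), top] = UNK_TOKEN_ID
    return out

def erase_subject_region(feat_map, spatial_attn):
    """Sentence-aware subject region erasing: zero the most sentence-attended
    cell of the subject feature map (a single cell here for simplicity).
    feat_map: (batch, C, H, W); spatial_attn: (batch, H, W)."""
    out = feat_map.clone()
    top = spatial_attn.flatten(1).argmax(dim=1)         # index into H*W
    h, w = top // feat_map.size(3), top % feat_map.size(3)
    out[torch.arange(out.size(0)), :, h, w] = 0.0
    return out

def erase_context_object(ctx_feats, ctx_attn):
    """Sentence-aware context object erasing: drop the most attended context
    object so the location/relationship modules must rely on other evidence.
    ctx_feats: (batch, K, d); ctx_attn: (batch, K)."""
    out = ctx_feats.clone()
    top = ctx_attn.argmax(dim=1)
    out[torch.arange(out.size(0)), top] = 0.0
    return out
```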
The paper argues that attention-guided erasing is preferable to alternatives such as random erasing or adversarially selecting regions: because attention mechanisms naturally emphasize the most salient features, the back-propagation signals reaching less prominent, but still relevant, features are weakened. By removing the dominant textual-visual pairs, the method acts as a structured regularizer that redirects learning toward a broader set of cross-modal interactions.
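Since the erased variants serve as additional hard training samples, a training step might combine them with the originals as sketched below. This is an assumed usage pattern, not the paper's code: the model interface, the generic grounding_loss, and the batch fields are hypothetical names, and the helpers erase_query_word and erase_subject_region come from the earlier sketch.

```python
def training_step(model, batch, optimizer):
    """One assumed training step: original sample plus attention-guided erased variants."""
    # Forward pass on the original sample; the model is assumed to also return
    # its word-level and spatial attention weights.
    scores, word_attn, spatial_attn = model(batch.word_ids, batch.feat_map)
    loss = grounding_loss(scores, batch.target)

    # Build harder samples guided by the attention from the first pass.
    erased_words = erase_query_word(batch.word_ids, word_attn)
    erased_feats = erase_subject_region(batch.feat_map, spatial_attn)

    # Apply the same objective to the erased variants; erasing happens only
    # during training, so inference is unchanged.
    scores_w, _, _ = model(erased_words, batch.feat_map)
    scores_v, _, _ = model(batch.word_ids, erased_feats)
    loss = loss + grounding_loss(scores_w, batch.target) + grounding_loss(scores_v, batch.target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```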
To validate the proposed approach, extensive experiments were conducted on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Across these datasets, the method achieved state-of-the-art performance, surpassing previous models by effectively capturing a wider array of textual and visual features required for accurate grounding.
Conceptually, the results suggest that augmenting general attention strategies with targeted erasing can deepen understanding in tasks that combine vision and language. Practically, erasing is applied only during training, so the model adds no complexity at inference time, keeping it scalable and efficient for real-world applications.
Future directions may explore automated refinement of erasing methods, potentially incorporating dynamic attention weights that evolve during the training process. Additionally, expanding upon the types of textual-visual relationships that can be learned through erasing could enhance model adaptability across a broader spectrum of multimodal tasks. This paper represents a substantial contribution to referring expression grounding by bridging gaps in understanding diverse cross-modal interactions through systematic feature erasing.