Discriminative Triad Matching and Reconstruction for Weakly Referring Expression Grounding (2106.04053v1)

Published 8 Jun 2021 in cs.CV and cs.MM

Abstract: In this paper, we tackle the weakly-supervised referring expression grounding task, for the localization of a referent object in an image according to a query sentence, where the mappings between image regions and queries are not available during the training stage. In traditional methods, an object region that best matches the referring expression is picked out, and then the query sentence is reconstructed from the selected region, where the reconstruction difference serves as the loss for back-propagation. The existing methods, however, conduct both the matching and the reconstruction approximately as they ignore the fact that the matching correctness is unknown. To overcome this limitation, a discriminative triad is designed here as the basis of the solution, through which a query can be converted into one or multiple discriminative triads in a very scalable way. Based on the discriminative triad, we further propose the triad-level matching and reconstruction modules which are lightweight yet effective for the weakly-supervised training, making it three times lighter and faster than the previous state-of-the-art methods. One important merit of our work is its superior performance despite the simple and neat design. Specifically, the proposed method achieves a new state-of-the-art accuracy when evaluated on RefCOCO (39.21%), RefCOCO+ (39.18%) and RefCOCOg (43.24%) datasets, which is 4.17%, 4.08% and 7.8% higher than the previous one, respectively.

Authors (5)

Mingjie Sun
Jimin Xiao
Eng Gee Lim
Si Liu
John Y. Goulermas

Citations (161)

Summary

Overview of Discriminative Triad Matching and Reconstruction in Weakly Referring Expression Grounding

Referring Expression Grounding (REG) is the task of localizing a particular object in an image based on a descriptive query sentence. It is inherently multimodal: the model must relate linguistic and visual information to pick out the referent. The task has traditionally been tackled with fully supervised methods, in which explicit mappings between sentences and image regions are available during training. Weakly-Supervised Referring Expression Grounding (WREG) removes those mappings from training, which makes the problem considerably harder.

The presented method advances the state of the art in WREG with a framework based on Discriminative Triad Matching and Reconstruction. The framework improves accuracy while remaining lightweight and scalable, as shown by evaluation on three benchmark datasets: RefCOCO, RefCOCO+, and RefCOCOg.

Methodological Innovations

Discriminative Triad Representation: The core innovation lies in encoding the queries as discriminative triads, each comprising a target unit, a reference unit, and a discriminative unit. This triad effectively captures the essential relational information, enabling a more structured and nuanced approach to matching expressions with visual data.
Triad-Level Matching Module: Departing from traditional sentence-level matching, this method employs a triad-level matching mechanism. This approach effectively breaks down complex queries into manageable parts that can be matched directly with pairs of candidate image regions. The matching process uses attention mechanisms to calculate the compatibility between the visual features and the linguistic representation of each unit within the triad.
Triad-Level Reconstruction: The paper proposes a novel reconstruction process that eschews traditional RNN-based sentence reconstructions in favor of a mechanism that only reconstructs individual triad units. This facilitates a more accurate and computationally efficient feedback loop, enhancing the reliability of the back-propagation loss for refining the model.
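
The three components above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Triad` class, the cosine-similarity scoring, the feature-difference `pair_feature`, and the cosine-based reconstruction loss are all assumptions chosen for readability, and the sketch assumes region features and unit embeddings already live in the same vector space (in practice this would require learned projections).

```python
import math
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


@dataclass
class Triad:
    """Hypothetical container: one embedding per triad unit."""
    target: Vector
    reference: Vector
    discriminative: Vector


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def difference(a: Vector, b: Vector) -> List[float]:
    """Assumed pairwise feature: element-wise difference of two regions."""
    return [x - y for x, y in zip(a, b)]


def match_triad(
    triad: Triad,
    regions: List[Vector],
    pair_feature: Callable[[Vector, Vector], Vector] = difference,
) -> Tuple[List[Tuple[int, int]], List[float]]:
    """Score every ordered (target, reference) region pair against the triad."""
    pairs = list(permutations(range(len(regions)), 2))
    scores = [
        cosine(regions[i], triad.target)
        + cosine(regions[j], triad.reference)
        + cosine(pair_feature(regions[i], regions[j]), triad.discriminative)
        for i, j in pairs
    ]
    return pairs, softmax(scores)  # attention weights over region pairs


def reconstruction_loss(triad, regions, pairs, weights) -> float:
    """Rebuild the target and reference units from attention-weighted regions."""
    dim = len(triad.target)
    tgt, ref = [0.0] * dim, [0.0] * dim
    for (i, j), w in zip(pairs, weights):
        for k in range(dim):
            tgt[k] += w * regions[i][k]
            ref[k] += w * regions[j][k]
    return (1 - cosine(tgt, triad.target)) + (1 - cosine(ref, triad.reference))


# Toy example: three 2-D region features and one triad.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
triad = Triad(target=[1.0, 0.0], reference=[0.0, 1.0], discriminative=[1.0, -1.0])
pairs, weights = match_triad(triad, regions)
best_pair = pairs[weights.index(max(weights))]
loss = reconstruction_loss(triad, regions, pairs, weights)
```

In this toy setup the pair `(0, 1)` receives the highest weight, because region 0 aligns with the target unit, region 1 with the reference unit, and their difference with the discriminative unit. The point of the design is that the loss only needs to reconstruct short unit embeddings, not a full sentence, which is where the summary locates the efficiency gain.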

Evaluative Performance

The proposed method outperforms prior WREG approaches on all three benchmarks. It reaches 39.21% accuracy on RefCOCO, 39.18% on RefCOCO+, and 43.24% on RefCOCOg, which the abstract reports as 4.17%, 4.08%, and 7.8% higher than the previous state of the art, respectively. The largest gain is on RefCOCOg.
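
As a quick sanity check on the figures quoted in the abstract, the snippet below backs out the accuracy of the previous best method on each dataset. It assumes the reported gains are absolute percentage points, which the abstract does not state explicitly; the variable names are illustrative.

```python
# Accuracies and gains as quoted in the abstract (percent).
new_accuracy = {"RefCOCO": 39.21, "RefCOCO+": 39.18, "RefCOCOg": 43.24}
reported_gain = {"RefCOCO": 4.17, "RefCOCO+": 4.08, "RefCOCOg": 7.8}

# Assumption: gains are absolute percentage points, so previous = new - gain.
implied_previous = {
    name: round(new_accuracy[name] - reported_gain[name], 2)
    for name in new_accuracy
}

for name, value in implied_previous.items():
    print(f"{name}: implied previous accuracy {value:.2f}%")
```

Under that assumption the earlier method sits near 35% on all three datasets, so the improvement is large relative to the baseline.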

The approach is also efficient: the abstract reports that it is three times lighter and faster than the previous state-of-the-art methods. Higher accuracy at a lower computational cost is what makes the result notable for WREG.

Implications and Future Directions

This research is relevant to building scalable WREG systems, with potential applications in interactive AI agents, autonomous systems, and human-computer interfaces where fast inference matters. Because training does not require region-query annotations, such models may be easier to adapt to new datasets for which those annotations do not exist.

Future work could improve the discriminative triad approach by using more sophisticated natural language processing to refine triad extraction, and by extending the method from static images to video and other dynamic contexts. The modular design of the framework, with separate matching and reconstruction modules, should make such extensions easier.

In conclusion, the paper contributes a simple and efficient framework for weakly-supervised visual grounding that improves accuracy on all three standard benchmarks while reducing model size and inference time.