Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation
The paper introduces the Multi-task Collaborative Network (MCN), a framework for jointly handling Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). Although the two tasks are traditionally treated separately in computer vision, both aim to identify a target referent from a linguistic expression and therefore share a closely intertwined objective. MCN's contribution is a framework that exploits this symbiosis between REC and RES while delivering significant improvements over state-of-the-art (SOTA) methods on benchmark datasets.
Major Contributions
The researchers designed MCN to achieve mutual reinforcement between REC and RES: RES aids the language-vision alignment of REC, while REC provides localization cues that help RES isolate the correct instance. The notable developments within MCN include:
- Consistency Energy Maximization (CEM): a loss term that keeps the REC and RES branches focused on the same visual region by maximizing the inter-task consistency energy (see the sketches after this list).
- Adaptive Soft Non-Located Suppression (ASNLS): a post-processing scheme that refines the RES prediction using the REC output, softly suppressing responses in regions the REC branch does not locate rather than discarding them outright, which avoids the rigidity of conventional hard-processing methods (also sketched below).
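The exact formulation of CEM is given in the paper; as a rough illustration, a minimal PyTorch-style sketch of the idea, assuming each branch exposes a spatial attention map (the names `rec_attn` and `res_attn` are hypothetical), could look like this:

```python
import torch
import torch.nn.functional as F

def cem_loss(rec_attn: torch.Tensor, res_attn: torch.Tensor) -> torch.Tensor:
    """Consistency Energy Maximization, sketched as negative cosine similarity
    between the spatial attention maps of the REC and RES branches
    (shape: [batch, H, W]). Driving the energy up pushes both branches to
    attend to the same region. Illustrative only; the paper's exact energy
    term may differ."""
    b = rec_attn.size(0)
    rec_flat = F.normalize(rec_attn.reshape(b, -1), dim=1)
    res_flat = F.normalize(res_attn.reshape(b, -1), dim=1)
    energy = (rec_flat * res_flat).sum(dim=1)  # per-sample consistency energy
    return -energy.mean()                      # minimizing this maximizes the energy
```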
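ASNLS can likewise be pictured as a box-guided soft reweighting of the RES mask. The sketch below is illustrative only: the inside/outside weights are placeholder values, not the paper's settings; the point is that pixels outside the REC-predicted box are attenuated rather than zeroed out, unlike hard cropping.

```python
import torch

def asnls(mask_logits: torch.Tensor, box_xyxy, inside_weight: float = 1.0,
          outside_weight: float = 0.3) -> torch.Tensor:
    """Adaptive Soft Non-Located Suppression, sketched as soft suppression of
    RES responses outside the REC-predicted box. `mask_logits` is an [H, W]
    map of segmentation scores; `box_xyxy` is (x1, y1, x2, y2) in pixels.
    Scores outside the box are down-weighted instead of removed, so a
    slightly loose box does not erase correct mask pixels. The weights are
    illustrative placeholders, not the paper's settings."""
    h, w = mask_logits.shape
    x1, y1, x2, y2 = (int(v) for v in box_xyxy)
    weights = torch.full((h, w), outside_weight, device=mask_logits.device,
                         dtype=mask_logits.dtype)
    weights[max(y1, 0):min(y2, h), max(x1, 0):min(x2, w)] = inside_weight
    return mask_logits * weights
```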
Experimental Insights
MCN was rigorously validated on three benchmark datasets: RefCOCO, RefCOCO+, and RefCOCOg. The results showed substantial gains, up to 7.13% for REC and 11.50% for RES over existing methods, marking a notable step forward in joint REC and RES modeling. The improvements illustrate how pixel-level detail from RES can sharpen bounding-box prediction in REC, and vice versa.
Additionally, the paper introduces a new metric, Inconsistency Error (IE), to quantify prediction conflicts, an issue inherent in multi-task systems. This measurement confirmed the effectiveness of CEM and ASNLS in reducing inter-task prediction discrepancies.
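The precise definition of IE is given in the paper; one hedged reading is the rate at which the two branches disagree about the referent, i.e. the fraction of samples where exactly one of the REC and RES predictions is judged correct. A minimal sketch under that assumption:

```python
from typing import Sequence

def inconsistency_error(rec_correct: Sequence[bool], res_correct: Sequence[bool]) -> float:
    """Fraction of samples where the REC and RES predictions conflict,
    i.e. exactly one of the two tasks gets the referent right.
    An assumed reading of the IE metric, for illustration only."""
    assert len(rec_correct) == len(res_correct)
    conflicts = sum(r != s for r, s in zip(rec_correct, res_correct))
    return conflicts / max(len(rec_correct), 1)
```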
Framework and Design
The MCN architecture incorporates a partially shared framework where the branches for REC and RES leverage a shared visual backbone and language encoder, but maintain distinct inference streams. This separation prevents performance degradation typically induced by homogeneous network designs.
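A rough PyTorch-style skeleton of such a partially shared design is shown below; every module name and the simple multiplicative fusion are placeholders for illustration, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class MCNSketch(nn.Module):
    """Partially shared joint model: one visual backbone and one language
    encoder feed two task-specific heads with separate inference streams.
    All module names and the naive fusion are placeholders for illustration."""

    def __init__(self, visual_backbone: nn.Module, language_encoder: nn.Module,
                 fuse_dim: int = 512):
        super().__init__()
        self.visual_backbone = visual_backbone    # shared image features [B, C, H, W]
        self.language_encoder = language_encoder  # shared expression features [B, C]
        self.rec_head = nn.Sequential(            # box branch: (x, y, w, h, confidence)
            nn.Conv2d(fuse_dim, fuse_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(fuse_dim, 5, 1))
        self.res_head = nn.Sequential(            # mask branch: per-pixel score
            nn.Conv2d(fuse_dim, fuse_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(fuse_dim, 1, 1))

    def forward(self, image: torch.Tensor, expression: torch.Tensor):
        vis = self.visual_backbone(image)                  # [B, fuse_dim, H, W]
        lang = self.language_encoder(expression)           # [B, fuse_dim]
        fused = vis * lang[:, :, None, None]               # naive multiplicative fusion
        return self.rec_head(fused), self.res_head(fused)  # distinct inference streams
```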
At its core, MCN is a language-centered collaborative network that offers the following advantages:
- Task Optimization: Each task receives feature maps at a scale suited to its needs, instance-level localization for REC and finer, pixel-level resolution for RES.
- Collaborative Learning: By linking the two tasks through mechanisms such as CEM and multi-scale feature fusion, MCN harmonizes their individual demands while letting each benefit from the other (a scale-splitting sketch follows this list).
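As a small illustration of the distinct-scales point, the following hedged sketch feeds the two heads feature maps at different resolutions; the specific scale values are assumptions, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def split_scales(fused: torch.Tensor, rec_size: int = 13, res_size: int = 52):
    """Give each branch a feature map at a resolution suited to it: a coarse
    grid for instance-level box regression (REC) and a finer grid for
    pixel-level mask prediction (RES). `fused` is [B, C, H, W]; the sizes
    are illustrative, not the paper's."""
    rec_feat = F.adaptive_avg_pool2d(fused, rec_size)                 # coarse
    res_feat = F.interpolate(fused, size=(res_size, res_size),
                             mode="bilinear", align_corners=False)    # fine
    return rec_feat, res_feat
```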
Implications and Future Directions
MCN represents a pivotal step toward a more integrated approach to visual-linguistic co-processing, offering insights that may extend to related task pairs such as object detection and instance segmentation. The framework points to a trajectory in which cross-task synergy becomes a central factor in network design, potentially steering future multi-task architectures toward explicit mutual-reinforcement mechanisms.
Going forward, further refinement of CEM and ASNLS, or exploration of richer interaction schemes between REC and RES, could yield additional gains in accuracy and efficiency. The evidence suggests that reducing prediction conflict is vital to the continued evolution of joint-learning approaches in computer vision.
In conclusion, MCN's design effectively bridges the gap between REC and RES, demonstrating how interconnected tasks can jointly enhance performance through robust architectural design and a notable reduction in prediction conflicts.