TiCoSS: Tightening the Coupling between Semantic Segmentation and Stereo Matching within A Joint Learning Framework (2407.18038v4)

Published 25 Jul 2024 in cs.CV and cs.RO

Abstract: Semantic segmentation and stereo matching, respectively analogous to the ventral and dorsal streams in our human brain, are two key components of autonomous driving perception systems. Addressing these two tasks with separate networks is no longer the mainstream direction in developing computer vision algorithms, particularly with the recent advances in large vision models and embodied artificial intelligence. The trend is shifting towards combining them within a joint learning framework, especially emphasizing feature sharing between the two tasks. The major contributions of this study lie in comprehensively tightening the coupling between semantic segmentation and stereo matching. Specifically, this study introduces three novelties: (1) a tightly coupled, gated feature fusion strategy, (2) a hierarchical deep supervision strategy, and (3) a coupling tightening loss function. The combined use of these technical contributions results in TiCoSS, a state-of-the-art joint learning framework that simultaneously tackles semantic segmentation and stereo matching. Through extensive experiments on the KITTI and vKITTI2 datasets, along with qualitative and quantitative analyses, we validate the effectiveness of our developed strategies and loss function, and demonstrate its superior performance compared to prior arts, with a notable increase in mIoU by over 9%. Our source code will be publicly available at mias.group/TiCoSS upon publication.

Summary

The paper introduces the TiCoSS framework that jointly learns semantic segmentation and stereo matching, achieving over a 9% mIoU improvement on the KITTI dataset.
It combines a Tightly-Coupled Gated Feature Fusion strategy, a Hierarchical Deep Supervision strategy, and a Coupling Tightening loss function to integrate geometric and contextual features.
Experimental results demonstrate enhanced handling of occlusions and complex boundaries, paving the way for future autonomous driving and multi-task learning research.

TiCoSS: Joint Learning Framework for Improved Semantic Segmentation and Stereo Matching

The paper "TiCoSS: Tightening the Coupling between Semantic Segmentation and Stereo Matching within A Joint Learning Framework" introduces a novel approach to enhance the synergy between semantic segmentation and stereo matching tasks, particularly for autonomous driving applications. The research builds upon the understanding that these two tasks are analogous to the ventral and dorsal streams in the human visual system, responsible for contextual and geometric scene understanding, respectively. Traditional methods have treated these tasks independently, limiting their potential for mutual information sharing. This paper presents a tightly-coupled framework, TiCoSS, which addresses this limitation through joint learning strategies aimed at maximizing the performance of both tasks.

Technical Contributions

The TiCoSS framework introduces three key innovations:

Tightly-Coupled Gated Feature Fusion (TGF) Strategy: This strategy enhances feature extraction by selectively integrating geometric features from disparity maps into contextual features from RGB images at each neural network layer. The gated approach effectively reduces noise and ensures that only relevant information is fused, preserving the quality of the segmentation task. The Selective Inheritance Gates (SIGs) are pivotal, allowing the model to differentiate and propagate the most informative features to subsequent layers.
Hierarchical Deep Supervision (HDS) Strategy: To counteract vanishing gradients and improve convergence, the authors propose HDS, in which the most detailed, highest-resolution fused features guide the auxiliary classifier branches. This strengthens the interaction between semantic and geometric information and improves gradient flow, which in turn yields more robust segmentation and disparity estimation.
Coupling Tightening (CT) Loss Function: The CT loss function strengthens the relationship between the two tasks at the output level. It adds a Disparity Inconsistency-Aware (DIA) loss and a Deep Supervision Consistency Constraint (DSCC) loss to existing stereo matching losses, so that each task's output constrains the other's.
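
As a rough illustration of the gating idea behind TGF, the sketch below fuses a geometric feature vector into a contextual one through a sigmoid gate. It is not the authors' SIG design, which this summary does not specify: the function names `sigmoid` and `gated_fuse`, the scalar parameters `w_c`, `w_g`, and `bias`, and the residual form `c + gate * g` are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fuse(context, geometry, w_c=1.0, w_g=1.0, bias=0.0):
    """Fuse geometric features into contextual features element by element.

    Hypothetical formulation (not taken from the paper):
        gate  = sigmoid(w_c * c + w_g * g + bias)
        fused = c + gate * g
    A gate near 0 suppresses the geometric feature; a gate near 1 passes it.
    """
    if len(context) != len(geometry):
        raise ValueError("feature vectors must have the same length")
    fused = []
    for c, g in zip(context, geometry):
        gate = sigmoid(w_c * c + w_g * g + bias)
        fused.append(c + gate * g)
    return fused


if __name__ == "__main__":
    context = [1.0, 2.0, -0.5]   # stand-in for RGB-branch features
    geometry = [0.5, -0.5, 3.0]  # stand-in for disparity-branch features
    print(gated_fuse(context, geometry))
```

With a strongly negative `bias` the gate closes and the contextual features pass through unchanged; with a strongly positive `bias` the geometric features are added in full. In a trained network the gate parameters would be learned, typically per channel, rather than fixed scalars.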

Experimental Results

The efficacy of the TiCoSS framework is evaluated through qualitative and quantitative experiments on the KITTI and vKITTI2 datasets. In semantic segmentation, TiCoSS surpasses previous state-of-the-art (SoTA) methods, improving mean Intersection over Union (mIoU) by over 9% on the KITTI dataset. The framework also improves disparity estimation, with lower average End-Point Error (EPE) and fewer disparity inconsistencies. These results indicate that the model produces fine-grained segmentation outputs while maintaining accurate depth perception, particularly in challenging scenarios involving occlusions and complex object boundaries.
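
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how mIoU and average EPE are commonly computed. It is not the paper's evaluation code. The function names are hypothetical, labels and disparities are given as flat lists, and treating a ground-truth disparity of 0 as "no measurement" is an assumption that follows the usual convention for sparse ground truth.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes present in pred or target."""
    ious = []
    for k in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == k and t == k)
        union = sum(1 for p, t in zip(pred, target) if p == k or t == k)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def end_point_error(pred_disp, gt_disp):
    """Average absolute disparity error over pixels with valid ground truth.

    Assumption: gt == 0 marks a pixel without a ground-truth measurement.
    """
    valid = [(p, g) for p, g in zip(pred_disp, gt_disp) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth disparities")
    return sum(abs(p - g) for p, g in valid) / len(valid)


if __name__ == "__main__":
    # Four pixels, two classes: class 0 IoU = 1/2, class 1 IoU = 2/3.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
    # The middle pixel has no ground truth and is ignored: (1 + 2) / 2.
    print(end_point_error([10.0, 12.0, 5.0], [11.0, 0.0, 7.0]))
```

Higher mIoU is better, whereas lower EPE is better, so an "improvement" in EPE means a reduction in its value.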

Implications and Future Work

The proposed framework represents a notable advancement in joint learning approaches, demonstrating the practical benefits of task synergy in computer vision applications. The innovations highlighted in this work, particularly in feature fusion and loss formulation, provide a blueprint for further research in multi-task learning. The authors suggest potential future directions, including extending the framework to semi-supervised or few-shot learning paradigms, which could alleviate the dependency on large annotated datasets. Additionally, optimizing computational complexity for real-time deployment in autonomous systems remains a critical area for ongoing development.

In conclusion, TiCoSS integrates semantic segmentation and stereo matching more tightly than prior joint learning frameworks and sets a new benchmark among them. Its reported results support the claim that tighter task coupling improves the performance of both tasks.