Self-Supervised Learning for Stereo Matching with Self-Improving Ability (1709.00930v1)

Published 4 Sep 2017 in cs.CV

Abstract: Existing deep-learning based dense stereo matching methods often rely on ground-truth disparity maps as the training signals, which are however not always available in many situations. In this paper, we design a simple convolutional neural network architecture that is able to learn to compute dense disparity maps directly from the stereo inputs. Training is performed in an end-to-end fashion without the need of ground-truth disparity maps. The idea is to use image warping error (instead of disparity-map residuals) as the loss function to drive the learning process, aiming to find a depth-map that minimizes the warping error. While this is a simple concept well-known in stereo matching, to make it work in a deep-learning framework, many non-trivial challenges must be overcome, and in this work we provide effective solutions. Our network is self-adaptive to different unseen imagery as well as to different camera settings. Experiments on KITTI and Middlebury stereo benchmark datasets show that our method outperforms many state-of-the-art stereo matching methods by a margin, and is at the same time significantly faster.

Citations (159)

Summary

  • The paper introduces a deep network that learns stereo matching without ground-truth, using image warping error as the main training signal.
  • The architecture features symmetric feature extraction, 3D convolutions for cost regularization, and a soft argmin for generating continuous disparity maps.
  • The self-improving ability enables online fine-tuning on new data, significantly enhancing performance across diverse environments.

This paper introduces SsSMnet, a deep convolutional neural network for dense stereo matching that learns entirely without ground-truth disparity maps. Instead of minimizing the difference between predicted and ground-truth disparities, it minimizes the photometric image warping error between the left and right stereo images. This self-supervised approach allows the network to be trained end-to-end using only stereo image pairs and enables a unique "self-improving" capability, where the network can adapt and fine-tune itself when presented with new, unseen data online.

Core Idea: Self-Supervision via Image Warping

The fundamental principle is to use the consistency between stereo views as the learning signal. Given a predicted disparity map for the right image, $d_R$, the left image $I_L$ can be warped to synthesize the right image $I_R'$.

$I_R'(u,v) = I_L(u + d_R(u,v),\, v)$

The difference between the synthesized right image $I_R'$ and the actual right image $I_R$ forms the basis of the loss function. A symmetric process is applied to synthesize the left image from the right. The network learns the function $f$ such that $d_L = f(I_L, I_R)$ and $d_R = f(I_R, I_L)$ minimize this warping error.
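
To make this concrete, below is a minimal NumPy sketch of warping a rectified left image with a right-view disparity map and measuring the resulting photometric error. The function names are illustrative (not from the paper), and the interpolation is a simple 1D linear sample along the scanline, standing in for the differentiable bilinear sampling the network actually uses.

```python
import numpy as np

def warp_with_disparity(src, disp):
    """Synthesize a view by sampling `src` along each scanline at positions
    shifted by `disp`, with linear interpolation (illustrative sketch).

    src:  (H, W) or (H, W, C) rectified image
    disp: (H, W) disparity map; e.g. I_R'(u, v) = I_L(u + d_R(u, v), v)
    """
    h, w = disp.shape
    u = np.clip(np.arange(w)[None, :] + disp, 0, w - 1)   # sampling positions
    u0 = np.floor(u).astype(int)
    u1 = np.clip(u0 + 1, 0, w - 1)
    frac = u - u0
    if src.ndim == 3:                                      # broadcast over channels
        frac = frac[..., None]
    rows = np.arange(h)[:, None]
    return (1 - frac) * src[rows, u0] + frac * src[rows, u1]

def warping_error(i_left, i_right, d_right):
    """Photometric error that drives the self-supervised training signal:
    synthesize I_R' from I_L using d_R and compare it with the real I_R."""
    i_right_synth = warp_with_disparity(i_left, d_right)
    return np.mean(np.abs(i_right_synth - i_right))
```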

Network Architecture

The network (Figure 3 in the paper) consists of five main modules, processing the stereo pair symmetrically:

  1. Feature Extraction:
    • Uses 18 convolutional layers (3x3 kernels) with residual connections every 3 layers to extract 64-dimensional feature maps from both $I_L$ and $I_R$.
    • Implementation: Crucially, weights are shared between the left and right feature extractors to ensure symmetric processing.
  2. Feature Volume Construction:
    • Instead of a traditional cost volume, a richer feature volume is built.
    • For left-to-right matching at pixel $(u,v)$ and disparity $d$, it concatenates the left feature vector $f_L(u,v)$ with the corresponding disparity-shifted right feature vector $f_R(u-d, v)$:

      $F^{LR}(u,v,d) = f_L(u,v) \oplus f_R(u-d,v)$, where $\oplus$ denotes concatenation along the feature dimension.

    • A symmetric volume $F^{RL}$ is constructed for right-to-left matching.
    • Dimensions: $\text{Height} \times \text{Width} \times (D_{max}+1) \times (2 \times \text{FeatureDimension})$, where $D_{max}$ is the maximum disparity considered (a sketch of the volume construction appears after this list).

  3. 3D Feature Matching (Regularization):
    • Processes the 4D feature volume using 3D convolutions to aggregate information spatially and across disparities.
    • Uses a Res-TDM (Residually connected Top-Down Module) (Figure 5):
      • Bottom-up Path: Applies stride-2 3D convolutions (3x3x3 kernels) to downsample the volume.
      • Top-down Path: Applies 3D deconvolutions to upsample back to the original resolution.
      • Residual Connections: Skip connections with residual blocks (two 3x3x3 stride-1 3D convolutions) link corresponding scales in the bottom-up and top-down paths.
    • Output: A regularized 3D volume ($H \times W \times (D_{max}+1)$), effectively representing the matching cost/probability for each disparity at each pixel after contextual aggregation.
  4. Soft Argmin:
    • Converts the 3D cost volume into a 2D disparity map in a differentiable way.
    • Calculates the expected disparity value by taking a weighted sum over disparities, where weights are determined by the softmax of the negative costs from the previous module:

      $d(u,v) = \sum_{d=0}^{D_{max}} d \times \sigma(-c_d(u,v))$

      where $c_d$ is the output of the Res-TDM module for disparity $d$, and $\sigma$ is the softmax function applied across the disparity dimension.

  5. Warping Loss Evaluation: This module is conceptual; the loss is calculated based on the output disparity maps and the input images.
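
Below is a minimal NumPy sketch of two of the modules above: the feature-volume concatenation and the soft-argmin readout. The 3D Res-TDM regularization in between is omitted, and the plain-NumPy formulation and function names are illustrative assumptions rather than the paper's TensorFlow implementation.

```python
import numpy as np

def build_feature_volume(f_left, f_right, max_disp):
    """Left-to-right feature volume: at (u, v, d), concatenate f_L(u, v)
    with the disparity-shifted f_R(u - d, v).
    f_left, f_right: (H, W, C) feature maps -> volume (H, W, max_disp + 1, 2C)."""
    h, w, c = f_left.shape
    vol = np.zeros((h, w, max_disp + 1, 2 * c), dtype=f_left.dtype)
    for d in range(max_disp + 1):
        shifted = np.zeros_like(f_right)
        shifted[:, d:, :] = f_right[:, : w - d, :]     # f_R(u - d, v), zero-padded
        vol[:, :, d, :] = np.concatenate([f_left, shifted], axis=-1)
    return vol

def soft_argmin(cost):
    """Differentiable disparity readout: expected disparity under a softmax
    over the negated costs.  cost: (H, W, D) -> disparity (H, W)."""
    c = -cost
    c -= c.max(axis=-1, keepdims=True)                 # numerically stable softmax
    p = np.exp(c)
    p /= p.sum(axis=-1, keepdims=True)
    d = np.arange(cost.shape[-1])
    return (p * d).sum(axis=-1)
```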

Loss Function

The self-supervised loss is a weighted sum of several components, applied symmetrically for left and right disparity predictions: $\mathcal{L} = \sum_{i \in \{u, s, c, m\}} \omega_i (\mathcal{L}_i^l + \mathcal{L}_i^r)$.

  1. Unary Photometric Loss ($\mathcal{L}_u$): Measures the difference between the original image and the warped image. It combines three terms (a code sketch of these loss terms follows at the end of this section):
    • Structural Similarity Index (SSIM): $\lambda_1 \frac{1-\mathcal{S}(I, I')}{2}$
    • L1 Pixel Difference: $\lambda_2 |I - I'|$
    • L1 Gradient Difference: $\lambda_3 |\nabla I - \nabla I'|$
    • Implementation: Uses bilinear sampling for image warping to ensure differentiability. Weights used: $\lambda_1 = 0.80$, $\lambda_2 = 0.15$, $\lambda_3 = 0.15$.
  2. Disparity Smoothness Loss ($\mathcal{L}_s$): Encourages locally smooth disparities, especially in textureless regions.
    • Uses a weighted Total Generalized Variation (TGV)-like term on the disparity map ($d_L$), penalized less at image edges (where disparity changes are expected):

      $\mathcal{L}_s^l = \frac{1}{N}\sum \left| \nabla^2_u d_L \right| e^{-\left| \nabla^2_u I_L \right|} + \left| \nabla^2_v d_L \right| e^{-\left| \nabla^2_v I_L \right|}$

  3. Loop Consistency Loss ($\mathcal{L}_c$): A key contribution for coupling the left and right disparity predictions.
    • Idea: Synthesize the left image $I_L'$ by warping $I_R$ using $d_R$. Synthesize another version $I_L''$ by warping $I_L$ to the right view using $d_L$, and then warping the result back to the left view using $d_R$.
    • Loss: Minimizes the difference $|I_L - I_L''|$. This ensures consistency between the forward and backward warping processes using both $d_L$ and $d_R$.
    • Implementation: This term is crucial for making the symmetric network structure effective.
  4. Maximum-Depth Heuristic Loss ($\mathcal{L}_m$): Regularizes textureless regions by encouraging smaller disparities (larger depths).
    • Loss: Minimizes the L1 norm of the predicted disparity map: $\mathcal{L}_m^l = \frac{1}{N}\sum \left| d_L \right|$.

Loss Weights: $\omega_c = 1$, $\omega_m = 0.001$. $\omega_s$ starts low ($\le 0.001$) to avoid trivial solutions (all pixels at maximum disparity) and can be increased to $0.1$ after initial convergence. $\omega_u$, the weight on the photometric term, is likely 1 (balancing the terms).
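
A minimal NumPy sketch of how the individual terms combine is given below. It covers the unary photometric, smoothness, and maximum-depth terms and the weighted left/right sum; the loop-consistency term is omitted (it reuses the double warping described above), the SSIM here uses global image statistics instead of the windowed SSIM the paper uses, grayscale images are assumed, and all names are illustrative.

```python
import numpy as np

def unary_loss(i, i_warp, lam=(0.80, 0.15, 0.15)):
    """SSIM + L1 pixel + L1 gradient terms (global-statistics SSIM as a simplification)."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mx, my = i.mean(), i_warp.mean()
    cov = ((i - mx) * (i_warp - my)).mean()
    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (i.var() + i_warp.var() + c2))
    gy, gx = np.gradient(i)                      # grayscale (H, W) image assumed
    gwy, gwx = np.gradient(i_warp)
    return (lam[0] * (1 - ssim) / 2
            + lam[1] * np.abs(i - i_warp).mean()
            + lam[2] * (np.abs(gx - gwx) + np.abs(gy - gwy)).mean())

def smoothness_loss(disp, img):
    """Edge-aware second-order smoothness on the disparity map."""
    d2u, d2v = np.diff(disp, n=2, axis=1), np.diff(disp, n=2, axis=0)
    i2u, i2v = np.diff(img, n=2, axis=1), np.diff(img, n=2, axis=0)
    return (np.abs(d2u) * np.exp(-np.abs(i2u))).mean() + \
           (np.abs(d2v) * np.exp(-np.abs(i2v))).mean()

def max_depth_loss(disp):
    """Maximum-depth heuristic: push textureless regions toward small disparity."""
    return np.abs(disp).mean()

def total_loss(per_term_left_right, weights):
    """Weighted sum over left/right pairs, mirroring L = sum_i w_i (L_i^l + L_i^r);
    both arguments are dicts keyed by term name, e.g. 'u', 's', 'm'."""
    return sum(weights[k] * (l + r) for k, (l, r) in per_term_left_right.items())
```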

Self-Improving Ability

  • Because training doesn't require ground truth, the network can continue learning (fine-tuning) on new stereo pairs encountered during deployment ("online tuning").
  • The same self-supervised loss function is used to update the network weights based on the new data.
  • Experiments show that a model trained on KITTI (outdoor driving) significantly improves its performance on Middlebury (indoor scenes) after only 100 iterations of online tuning on Middlebury data (Table 1).
  • The network can achieve reasonable performance even when trained from random initialization in about 1000-1500 iterations (Figures 7 and 8).
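
The mechanism itself is simply the training loop run at deployment time; a schematic sketch follows, assuming a hypothetical model object that maps a stereo pair to both disparity maps and exposes gradient/update hooks (this is not the paper's TensorFlow code).

```python
def online_tune(model, optimizer, stereo_stream, loss_fn, num_steps=100):
    """Keep minimizing the same self-supervised loss on stereo pairs that
    arrive at deployment time ("online tuning").

    model:         hypothetical callable (I_L, I_R) -> (d_L, d_R)
    optimizer:     hypothetical object with an apply(gradients) method
    stereo_stream: iterable of (I_L, I_R) pairs from the new environment
    loss_fn:       the self-supervised loss (warping, smoothness, ... terms)
    """
    for step, (i_left, i_right) in enumerate(stereo_stream):
        if step >= num_steps:              # ~100 iterations sufficed for KITTI -> Middlebury
            break
        d_left, d_right = model(i_left, i_right)
        loss = loss_fn(i_left, i_right, d_left, d_right)
        grads = model.gradients(loss)      # hypothetical hook: backprop through warping and losses
        optimizer.apply(grads)
    return model
```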

Implementation Details

  • Framework: TensorFlow.
  • Optimizer: RMSProp.
  • Learning Rate: $1 \times 10^{-3}$ initially, reduced to $1 \times 10^{-4}$ after 5000 iterations.
  • Training Data: Random crops ($256 \times 512$) from the KITTI raw dataset, with pixel values normalized to [0, 1]. No data augmentation.
  • Batch Size: 1 (due to GPU memory constraints).
  • Disparity Range: 160 pixels ($D_{max} = 159$).
  • Runtime: ~0.8 seconds for inference on a $384 \times 1280$ stereo pair; ~1.6 seconds with online tuning (on a Titan Xp GPU).
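
For reference, the reported settings collected into one (hypothetical) configuration block:

```python
# Hypothetical names; the values are the settings reported above.
TRAIN_CONFIG = {
    "framework": "TensorFlow",
    "optimizer": "RMSProp",
    "crop_size": (256, 512),     # random crops from the KITTI raw dataset
    "pixel_range": (0.0, 1.0),   # normalized inputs, no data augmentation
    "batch_size": 1,             # limited by GPU memory
    "max_disparity": 159,        # 160-pixel disparity range
}

def learning_rate(iteration):
    """Step schedule: 1e-3 initially, dropped to 1e-4 after 5000 iterations."""
    return 1e-3 if iteration < 5000 else 1e-4
```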

Applications and Significance

  • Provides a method for training high-performance deep stereo networks without expensive ground-truth data, making it applicable in scenarios where such data is unavailable.
  • The self-improving ability allows the network to adapt to new environments, lighting conditions, and camera setups encountered during deployment, potentially improving robustness in real-world applications like robotics and autonomous driving.
  • Achieves state-of-the-art results compared to both traditional and supervised deep learning methods at the time of publication, particularly demonstrating strong cross-dataset generalization when using online tuning.