Self-Supervised Learning for Stereo Matching with Self-Improving Ability (1709.00930v1)

Published 4 Sep 2017 in cs.CV

Abstract: Existing deep-learning based dense stereo matching methods often rely on ground-truth disparity maps as the training signals, which are however not always available in many situations. In this paper, we design a simple convolutional neural network architecture that is able to learn to compute dense disparity maps directly from the stereo inputs. Training is performed in an end-to-end fashion without the need of ground-truth disparity maps. The idea is to use image warping error (instead of disparity-map residuals) as the loss function to drive the learning process, aiming to find a depth-map that minimizes the warping error. While this is a simple concept well-known in stereo matching, to make it work in a deep-learning framework, many non-trivial challenges must be overcome, and in this work we provide effective solutions. Our network is self-adaptive to different unseen imagery as well as to different camera settings. Experiments on KITTI and Middlebury stereo benchmark datasets show that our method outperforms many state-of-the-art stereo matching methods by a margin, and is at the same time significantly faster.

Citations (159)

Summary

  • The paper introduces a deep network that learns stereo matching without ground-truth, using image warping error as the main training signal.
  • The architecture features symmetric feature extraction, 3D convolutions for cost regularization, and a soft argmin for generating continuous disparity maps.
  • The self-improving ability enables online fine-tuning on new data, significantly enhancing performance across diverse environments.

This paper introduces SsSMnet, a deep convolutional neural network for dense stereo matching that learns entirely without ground-truth disparity maps. Instead of minimizing the difference between predicted and ground-truth disparities, it minimizes the photometric image warping error between the left and right stereo images. This self-supervised approach allows the network to be trained end-to-end using only stereo image pairs and enables a unique "self-improving" capability, where the network can adapt and fine-tune itself when presented with new, unseen data online.

Core Idea: Self-Supervision via Image Warping

The fundamental principle is to use the consistency between stereo views as the learning signal. Given a predicted disparity map for the right image, $d_R$, the left image $I_L$ can be warped to synthesize the right image $I_R'$.

$I_R'(u,v) = I_L(u + d_R(u,v),\, v)$

The difference between the synthesized right image $I_R'$ and the actual right image $I_R$ forms the basis of the loss function. A symmetric process is applied to synthesize the left image from the right. The network learns the function $f$ such that $d_L = f(I_L, I_R)$ and $d_R = f(I_R, I_L)$ minimize this warping error.
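
To make this concrete, below is a minimal NumPy sketch of warping a rectified left image with a right-view disparity map and measuring the resulting photometric error. The function names are illustrative (not from the paper), and the interpolation is a simple 1D linear sample along the scanline, standing in for the differentiable bilinear sampling the network actually uses.

```python
import numpy as np

def warp_with_disparity(src, disp):
    """Synthesize a view by sampling `src` along each scanline at positions
    shifted by `disp`, with linear interpolation (illustrative sketch).

    src:  (H, W) or (H, W, C) rectified image
    disp: (H, W) disparity map; e.g. I_R'(u, v) = I_L(u + d_R(u, v), v)
    """
    h, w = disp.shape
    u = np.clip(np.arange(w)[None, :] + disp, 0, w - 1)   # sampling positions
    u0 = np.floor(u).astype(int)
    u1 = np.clip(u0 + 1, 0, w - 1)
    frac = u - u0
    if src.ndim == 3:                                      # broadcast over channels
        frac = frac[..., None]
    rows = np.arange(h)[:, None]
    return (1 - frac) * src[rows, u0] + frac * src[rows, u1]

def warping_error(i_left, i_right, d_right):
    """Photometric error that drives the self-supervised training signal:
    synthesize I_R' from I_L using d_R and compare it with the real I_R."""
    i_right_synth = warp_with_disparity(i_left, d_right)
    return np.mean(np.abs(i_right_synth - i_right))
```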

Network Architecture

The network (Figure 3 in the paper) consists of five main modules, processing the stereo pair symmetrically:

  1. Feature Extraction:
    • Uses 18 convolutional layers (3x3 kernels) with residual connections every 3 layers to extract 64-dimensional feature maps from both $I_L$ and $I_R$.
    • Implementation: Crucially, weights are shared between the left and right feature extractors to ensure symmetric processing.
  2. Feature Volume Construction:
    • Instead of a traditional cost volume, a richer feature volume is built.
    • For left-to-right matching at pixel $(u,v)$ and disparity $d$, it concatenates the left feature vector $f_L(u,v)$ with the corresponding disparity-shifted right feature vector $f_R(u-d, v)$:

      $F^{LR}(u,v,d) = f_L(u,v) \oplus f_R(u-d,v)$, where $\oplus$ denotes concatenation along the feature dimension.

    • A symmetric volume $F^{RL}$ is constructed for right-to-left matching.
    • Dimensions: $\text{Height} \times \text{Width} \times (D_{max}+1) \times (2 \times \text{FeatureDimension})$, where $D_{max}$ is the maximum disparity considered (a sketch of the volume construction appears after this list).

  3. 3D Feature Matching (Regularization):
    • Processes the 4D feature volume using 3D convolutions to aggregate information spatially and across disparities.
    • Uses a Res-TDM (Residually connected Top-Down Module) (Figure 5):
      • Bottom-up Path: Applies stride-2 3D convolutions (3x3x3 kernels) to downsample the volume.
      • Top-down Path: Applies 3D deconvolutions to upsample back to the original resolution.
      • Residual Connections: Skip connections with residual blocks (two 3x3x3 stride-1 3D convolutions) link corresponding scales in the bottom-up and top-down paths.
    • Output: A regularized 3D volume ($H \times W \times (D_{max}+1)$), effectively representing the matching cost/probability for each disparity at each pixel after contextual aggregation.
  4. Soft Argmin:
    • Converts the 3D cost volume into a 2D disparity map in a differentiable way.
    • Calculates the expected disparity value by taking a weighted sum over disparities, where weights are determined by the softmax of the negative costs from the previous module:

      $d(u,v) = \sum_{d=0}^{D_{max}} d \times \sigma(-c_d(u,v))$

      where $c_d$ is the output of the Res-TDM module for disparity $d$, and $\sigma$ is the softmax function applied across the disparity dimension.

  5. Warping Loss Evaluation: This module is conceptual; the loss is calculated based on the output disparity maps and the input images.
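
Below is a minimal NumPy sketch of two of the modules above: the feature-volume concatenation and the soft-argmin readout. The 3D Res-TDM regularization in between is omitted, and the plain-NumPy formulation and function names are illustrative assumptions rather than the paper's TensorFlow implementation.

```python
import numpy as np

def build_feature_volume(f_left, f_right, max_disp):
    """Left-to-right feature volume: at (u, v, d), concatenate f_L(u, v)
    with the disparity-shifted f_R(u - d, v).
    f_left, f_right: (H, W, C) feature maps -> volume (H, W, max_disp + 1, 2C)."""
    h, w, c = f_left.shape
    vol = np.zeros((h, w, max_disp + 1, 2 * c), dtype=f_left.dtype)
    for d in range(max_disp + 1):
        shifted = np.zeros_like(f_right)
        shifted[:, d:, :] = f_right[:, : w - d, :]     # f_R(u - d, v), zero-padded
        vol[:, :, d, :] = np.concatenate([f_left, shifted], axis=-1)
    return vol

def soft_argmin(cost):
    """Differentiable disparity readout: expected disparity under a softmax
    over the negated costs.  cost: (H, W, D) -> disparity (H, W)."""
    c = -cost
    c -= c.max(axis=-1, keepdims=True)                 # numerically stable softmax
    p = np.exp(c)
    p /= p.sum(axis=-1, keepdims=True)
    d = np.arange(cost.shape[-1])
    return (p * d).sum(axis=-1)
```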

Loss Function

The self-supervised loss is a weighted sum of several components, applied symmetrically for left and right disparity predictions: $\mathcal{L} = \sum_{i \in \{u, s, c, m\}} \omega_i (\mathcal{L}_i^l + \mathcal{L}_i^r)$.

  1. Unary Photometric Loss ($\mathcal{L}_u$): Measures the difference between the original image and the warped image. It combines three terms (a code sketch of these loss terms follows at the end of this section):
    • Structural Similarity Index (SSIM): $\lambda_1 \frac{1-\mathcal{S}(I, I')}{2}$
    • L1 Pixel Difference: $\lambda_2 |I - I'|$
    • L1 Gradient Difference: $\lambda_3 |\nabla I - \nabla I'|$
    • Implementation: Uses bilinear sampling for image warping to ensure differentiability. Weights used: $\lambda_1 = 0.80$, $\lambda_2 = 0.15$, $\lambda_3 = 0.15$.
  2. Disparity Smoothness Loss ($\mathcal{L}_s$): Encourages locally smooth disparities, especially in textureless regions.
    • Uses a weighted Total Generalized Variation (TGV)-like term on the disparity map ($d_L$), penalized less at image edges (where disparity changes are expected):

      $\mathcal{L}_s^l = \frac{1}{N}\sum \left| \nabla^2_u d_L \right| e^{-\left| \nabla^2_u I_L \right|} + \left| \nabla^2_v d_L \right| e^{-\left| \nabla^2_v I_L \right|}$

  3. Loop Consistency Loss ($\mathcal{L}_c$): A key contribution for coupling the left and right disparity predictions.
    • Idea: Synthesize the left image $I_L'$ by warping $I_R$ using $d_R$. Synthesize another version $I_L''$ by warping $I_L$ to the right view using $d_L$, and then warping the result back to the left view using $d_R$.
    • Loss: Minimizes the difference $|I_L - I_L''|$. This ensures consistency between the forward and backward warping processes using both $d_L$ and $d_R$.
    • Implementation: This term is crucial for making the symmetric network structure effective.
  4. Maximum-Depth Heuristic Loss ($\mathcal{L}_m$): Regularizes textureless regions by encouraging smaller disparities (larger depths).
    • Loss: Minimizes the L1 norm of the predicted disparity map: $\mathcal{L}_m^l = \frac{1}{N}\sum \left| d_L \right|$.

Loss Weights: $\omega_c = 1$, $\omega_m = 0.001$. $\omega_s$ starts low ($\le 0.001$) to avoid trivial solutions (all pixels at maximum disparity) and can be increased to $0.1$ after initial convergence. $\omega_u$, the weight on the photometric term, is likely 1 (balancing the terms).
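
A minimal NumPy sketch of how the individual terms combine is given below. It covers the unary photometric, smoothness, and maximum-depth terms and the weighted left/right sum; the loop-consistency term is omitted (it reuses the double warping described above), the SSIM here uses global image statistics instead of the windowed SSIM the paper uses, grayscale images are assumed, and all names are illustrative.

```python
import numpy as np

def unary_loss(i, i_warp, lam=(0.80, 0.15, 0.15)):
    """SSIM + L1 pixel + L1 gradient terms (global-statistics SSIM as a simplification)."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mx, my = i.mean(), i_warp.mean()
    cov = ((i - mx) * (i_warp - my)).mean()
    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (i.var() + i_warp.var() + c2))
    gy, gx = np.gradient(i)                      # grayscale (H, W) image assumed
    gwy, gwx = np.gradient(i_warp)
    return (lam[0] * (1 - ssim) / 2
            + lam[1] * np.abs(i - i_warp).mean()
            + lam[2] * (np.abs(gx - gwx) + np.abs(gy - gwy)).mean())

def smoothness_loss(disp, img):
    """Edge-aware second-order smoothness on the disparity map."""
    d2u, d2v = np.diff(disp, n=2, axis=1), np.diff(disp, n=2, axis=0)
    i2u, i2v = np.diff(img, n=2, axis=1), np.diff(img, n=2, axis=0)
    return (np.abs(d2u) * np.exp(-np.abs(i2u))).mean() + \
           (np.abs(d2v) * np.exp(-np.abs(i2v))).mean()

def max_depth_loss(disp):
    """Maximum-depth heuristic: push textureless regions toward small disparity."""
    return np.abs(disp).mean()

def total_loss(per_term_left_right, weights):
    """Weighted sum over left/right pairs, mirroring L = sum_i w_i (L_i^l + L_i^r);
    both arguments are dicts keyed by term name, e.g. 'u', 's', 'm'."""
    return sum(weights[k] * (l + r) for k, (l, r) in per_term_left_right.items())
```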

Self-Improving Ability

  • Because training doesn't require ground truth, the network can continue learning (fine-tuning) on new stereo pairs encountered during deployment ("online tuning").
  • The same self-supervised loss function is used to update the network weights based on the new data.
  • Experiments show that a model trained on KITTI (outdoor driving) significantly improves its performance on Middlebury (indoor scenes) after only 100 iterations of online tuning on Middlebury data (Table 1).
  • The network can achieve reasonable performance even when trained from random initialization in about 1000-1500 iterations (Figures 7 and 8).
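
The mechanism itself is simply the training loop run at deployment time; a schematic sketch follows, assuming a hypothetical model object that maps a stereo pair to both disparity maps and exposes gradient/update hooks (this is not the paper's TensorFlow code).

```python
def online_tune(model, optimizer, stereo_stream, loss_fn, num_steps=100):
    """Keep minimizing the same self-supervised loss on stereo pairs that
    arrive at deployment time ("online tuning").

    model:         hypothetical callable (I_L, I_R) -> (d_L, d_R)
    optimizer:     hypothetical object with an apply(gradients) method
    stereo_stream: iterable of (I_L, I_R) pairs from the new environment
    loss_fn:       the self-supervised loss (warping, smoothness, ... terms)
    """
    for step, (i_left, i_right) in enumerate(stereo_stream):
        if step >= num_steps:              # ~100 iterations sufficed for KITTI -> Middlebury
            break
        d_left, d_right = model(i_left, i_right)
        loss = loss_fn(i_left, i_right, d_left, d_right)
        grads = model.gradients(loss)      # hypothetical hook: backprop through warping and losses
        optimizer.apply(grads)
    return model
```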

Implementation Details

  • Framework: TensorFlow.
  • Optimizer: RMSProp.
  • Learning Rate: $1 \times 10^{-3}$ initially, reduced to $1 \times 10^{-4}$ after 5000 iterations.
  • Training Data: Random crops ($256 \times 512$) from the KITTI raw dataset, with pixel values normalized to [0, 1]. No data augmentation.
  • Batch Size: 1 (due to GPU memory constraints).
  • Disparity Range: 160 pixels ($D_{max} = 159$).
  • Runtime: ~0.8 seconds for inference on a $384 \times 1280$ stereo pair; ~1.6 seconds with online tuning (on a Titan Xp GPU).
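
For reference, the reported settings collected into one (hypothetical) configuration block:

```python
# Hypothetical names; the values are the settings reported above.
TRAIN_CONFIG = {
    "framework": "TensorFlow",
    "optimizer": "RMSProp",
    "crop_size": (256, 512),     # random crops from the KITTI raw dataset
    "pixel_range": (0.0, 1.0),   # normalized inputs, no data augmentation
    "batch_size": 1,             # limited by GPU memory
    "max_disparity": 159,        # 160-pixel disparity range
}

def learning_rate(iteration):
    """Step schedule: 1e-3 initially, dropped to 1e-4 after 5000 iterations."""
    return 1e-3 if iteration < 5000 else 1e-4
```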

Applications and Significance

  • Provides a method for training high-performance deep stereo networks without expensive ground-truth data, making it applicable in scenarios where such data is unavailable.
  • The self-improving ability allows the network to adapt to new environments, lighting conditions, and camera setups encountered during deployment, potentially improving robustness in real-world applications like robotics and autonomous driving.
  • Achieves state-of-the-art results compared to both traditional and supervised deep learning methods at the time of publication, particularly demonstrating strong cross-dataset generalization when using online tuning.