
Adaptive Unimodal Cost Volume Filtering for Deep Stereo Matching (1909.03751v2)

Published 9 Sep 2019 in cs.CV

Abstract: State-of-the-art deep learning based stereo matching approaches treat disparity estimation as a regression problem, where loss function is directly defined on true disparities and their estimated ones. However, disparity is just a byproduct of a matching process modeled by cost volume, while indirectly learning cost volume driven by disparity regression is prone to overfitting since the cost volume is under constrained. In this paper, we propose to directly add constraints to the cost volume by filtering cost volume with unimodal distribution peaked at true disparities. In addition, variances of the unimodal distributions for each pixel are estimated to explicitly model matching uncertainty under different contexts. The proposed architecture achieves state-of-the-art performance on Scene Flow and two KITTI stereo benchmarks. In particular, our method ranked the $1^{st}$ place of KITTI 2012 evaluation and the $4^{th}$ place of KITTI 2015 evaluation (recorded on 2019.8.20). The codes of AcfNet are available at: https://github.com/DeepMotionAIResearch/DenseMatchingBenchmark.

Citations (168)

Summary

  • The paper introduces adaptive unimodal cost volume filtering to directly constrain cost volumes and reduce overfitting in deep stereo matching.
  • Key technical contributions include adaptive variance estimation via a Confidence Estimation Network (CENet) and a novel Stereo Focal Loss to handle positive/negative disparity imbalance.
  • The proposed method achieves state-of-the-art performance, ranking first on KITTI 2012 and demonstrating significant improvements over baselines like PSMNet on Scene Flow.

Adaptive Unimodal Cost Volume Filtering for Deep Stereo Matching

The paper "Adaptive Unimodal Cost Volume Filtering for Deep Stereo Matching" by Zhang et al. addresses overfitting in deep stereo matching that arises when the cost volume is under-constrained. Most deep stereo networks treat disparity estimation as a regression problem and supervise only the final disparity, so the cost volume is learned indirectly. The authors instead constrain the cost volume directly: they supervise it with a unimodal distribution peaked at the true disparity, with a sharpness that adapts per pixel. The goal is a cost distribution that concentrates around the true disparity.
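To see why supervising only the regressed disparity can leave the cost volume under-constrained, consider the soft-argmin regression commonly used by networks such as GC-Net and PSMNet. The following is a minimal NumPy sketch, not the authors' code; the function name `soft_argmin`, the `(D, H, W)` layout, and the toy cost values are assumptions made for illustration. A cost volume with two equally good matches regresses to a disparity halfway between them, so a regression loss alone cannot tell a bimodal volume from a unimodal one.

```python
import numpy as np

def soft_argmin(cost, axis=0):
    """Expected disparity under softmax(-cost) taken along the disparity axis.

    cost: array with the disparity dimension on `axis`, e.g. shape (D, H, W).
    Returns (disparity map, probability volume).
    """
    logits = -cost
    logits = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=axis, keepdims=True)

    # Candidate disparities 0..D-1, shaped to broadcast along `axis`.
    shape = [1] * cost.ndim
    shape[axis] = cost.shape[axis]
    candidates = np.arange(cost.shape[axis], dtype=np.float64).reshape(shape)

    return (prob * candidates).sum(axis=axis), prob

# Toy single-pixel cost volume (assumed values): two equally good matches.
bimodal = np.full((9, 1, 1), 10.0)
bimodal[2] = 0.0
bimodal[6] = 0.0

# Unimodal volume with its single minimum at disparity 4.
unimodal = np.abs(np.arange(9, dtype=np.float64) - 4.0).reshape(9, 1, 1)

disp_bimodal, _ = soft_argmin(bimodal)
disp_unimodal, _ = soft_argmin(unimodal)
```

Both volumes regress to the same disparity even though only the second has a single peak there, which is the ambiguity that direct cost volume supervision is meant to remove.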

Key Contributions

  1. Unimodal Cost Volume Filtering: The authors filter the cost volume with unimodal distributions peaked at the true disparities. This direct supervision constrains cost volumes that disparity regression alone leaves under-constrained, which reduces overfitting.
  2. Adaptive Variance Estimation: To accommodate varying levels of confidence across different image regions, the proposed method estimates the variance of the unimodal distribution for each pixel. The variance estimation is handled by a Confidence Estimation Network (CENet), ensuring that the sharpness of the unimodal distribution reflects the uncertainty of pixel matching.
  3. Stereo Focal Loss: A novel Stereo Focal Loss addresses the imbalance between the single positive (true) disparity and the many negative disparities at each pixel. The focal weighting places more emphasis on positive disparities, so that the numerous negatives do not dominate the total loss.
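The three contributions above can be sketched together in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the released AcfNet implementation: the Laplacian-style softmax target, the mapping from confidence to variance (`scale * (1 - conf) + eps`), the focal exponent `alpha`, and all function names are assumptions chosen to match the description, and the confidence map is passed in directly rather than predicted by a network such as CENet.

```python
import numpy as np

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_to_sigma(conf, scale=1.0, eps=1.0):
    """Map a confidence in [0, 1] to a variance: high confidence -> small sigma.

    `scale` and `eps` are assumed hyperparameters; eps > 0 keeps sigma positive.
    """
    return scale * (1.0 - conf) + eps

def unimodal_target(d_gt, sigma, max_disp):
    """Unimodal distribution over disparities 0..max_disp-1, peaked at d_gt.

    d_gt, sigma: arrays of shape (H, W). Returns an array of shape (D, H, W).
    """
    d = np.arange(max_disp, dtype=np.float64).reshape(-1, 1, 1)
    return softmax(-np.abs(d - d_gt[None]) / sigma[None], axis=0)

def stereo_focal_loss(cost, target, alpha=5.0):
    """Cross-entropy between target and softmax(-cost), reweighted per disparity.

    The weight (1 - target)^(-alpha) grows with the target probability, so
    positive disparities count more than negatives. alpha = 0 recovers plain
    cross-entropy.
    """
    log_pred = np.log(softmax(-cost, axis=0) + 1e-12)
    weight = np.clip(1.0 - target, 1e-6, None) ** (-alpha)
    per_pixel = (weight * (-target * log_pred)).sum(axis=0)
    return float(per_pixel.mean())

# Toy 1x2 image (assumed values): a confident pixel and an uncertain one.
max_disp = 8
d_gt = np.array([[3.0, 5.0]])
conf = np.array([[0.9, 0.1]])
sigma = confidence_to_sigma(conf)
target = unimodal_target(d_gt, sigma, max_disp)

d = np.arange(max_disp, dtype=np.float64).reshape(-1, 1, 1)
good_cost = np.abs(d - d_gt[None])         # minimum at the true disparity
bad_cost = np.abs(d - (d_gt[None] + 3.0))  # minimum three pixels away

loss_good = stereo_focal_loss(good_cost, target)
loss_bad = stereo_focal_loss(bad_cost, target)
```

In this toy setup the confident pixel receives a sharper target than the uncertain one, and the cost volume whose minimum sits at the true disparity yields the lower loss.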

Numerical Results

The model outperforms existing methods on the KITTI benchmarks, ranking first on KITTI 2012 and fourth on KITTI 2015 (recorded on 2019.8.20). It also improves on PSMNet and GC-Net, the baseline models used for comparison.

On the Scene Flow dataset, the authors report improvements over PSMNet in both End-Point Error (EPE) and the percentage of pixels with an error greater than 3 pixels (3PE), in occluded and non-occluded areas alike. Notably, AcfNet achieves an EPE of 0.867 px, and its stronger results in occluded regions are attributed to its adaptive confidence modeling.
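For reference, the two metrics named above can be computed as follows. This is a minimal sketch based only on their definitions in the text (mean absolute disparity error, and the share of pixels whose error exceeds 3 pixels); the function names, the optional `valid` mask, and the toy values are assumptions, and official benchmark toolkits may apply additional rules.

```python
import numpy as np

def end_point_error(pred, gt, valid=None):
    """Mean absolute disparity error in pixels over the valid pixels."""
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    return float(np.abs(pred - gt)[valid].mean())

def three_pixel_error(pred, gt, valid=None, threshold=3.0):
    """Percentage of valid pixels whose absolute error exceeds `threshold`."""
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    return float(100.0 * (np.abs(pred - gt)[valid] > threshold).mean())

# Toy 2x2 disparity maps (assumed values): absolute errors are 0.5, 0, 0, 4.
pred = np.array([[10.0, 20.0], [30.0, 44.0]])
gt = np.array([[10.5, 20.0], [30.0, 40.0]])

epe_all = end_point_error(pred, gt)      # (0.5 + 0 + 0 + 4) / 4 = 1.125
pe3_all = three_pixel_error(pred, gt)    # 1 of 4 pixels exceeds 3 px = 25.0

# Restrict evaluation to a region, e.g. non-occluded pixels only.
non_occluded = np.array([[True, True], [True, False]])
epe_noc = end_point_error(pred, gt, non_occluded)
pe3_noc = three_pixel_error(pred, gt, non_occluded)
```

Passing a mask is how the same functions would report separate numbers for occluded and non-occluded areas.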

Theoretical and Practical Implications

This method shifts the focus of disparity estimation from the regressed output to the cost volume itself, through unimodal supervision and adaptive modeling of matching uncertainty. Using a per-pixel variance to adapt the supervision to varying photometric and geometric conditions helps explain and mitigate a weakness of end-to-end stereo networks: the cost volume is otherwise learned only indirectly. Practically, this matters in domains that rely on precise depth estimation, such as autonomous driving and augmented reality, where ambiguous disparity estimates can have serious consequences.

Future Directions

Adaptive cost volume filtering might inspire future research into shaping cost distributions with other probabilistic models, or into alternative methods of confidence estimation. Moreover, combining this framework with multi-sensor fusion could extend its applicability and robustness in challenging environmental conditions.

In essence, the Adaptive Unimodal Cost Volume Filtering framework by Zhang et al. improves the accuracy and generalization of deep stereo matching networks and suggests promising directions for future work in 3D visual perception.