HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching (2007.12140v5)

Published 23 Jul 2020 in cs.CV

Abstract: This paper presents HITNet, a novel neural network architecture for real-time stereo matching. Contrary to many recent neural network approaches that operate on a full cost volume and rely on 3D convolutions, our approach does not explicitly build a volume and instead relies on a fast multi-resolution initialization step, differentiable 2D geometric propagation and warping mechanisms to infer disparity hypotheses. To achieve a high level of accuracy, our network not only geometrically reasons about disparities but also infers slanted plane hypotheses allowing to more accurately perform geometric warping and upsampling operations. Our architecture is inherently multi-resolution allowing the propagation of information across different levels. Multiple experiments prove the effectiveness of the proposed approach at a fraction of the computation required by state-of-the-art methods. At the time of writing, HITNet ranks 1st-3rd on all the metrics published on the ETH3D website for two view stereo, ranks 1st on most of the metrics among all the end-to-end learning approaches on Middlebury-v3, ranks 1st on the popular KITTI 2012 and 2015 benchmarks among the published methods faster than 100ms.

Summary

The paper introduces HITNet with a fast multi-resolution initialization that bypasses the need for costly full 3D cost volumes.
It leverages a 2D disparity propagation mechanism with slanted plane support for precise geometric reasoning and reliable depth estimation.
At the time of writing, HITNet ranked 1st on KITTI 2012 and 2015 among published methods faster than 100 ms and 1st on most Middlebury-v3 metrics among end-to-end learning approaches, at a fraction of the computation required by state-of-the-art methods.

Overview of HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching

Stereo matching is a long-standing problem in computer vision because it underpins depth perception in applications such as autonomous driving and robotics. The paper "HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching" introduces HITNet, a neural network architecture for real-time stereo matching that avoids the computational cost of the full cost volumes and 3D convolutions used by many recent learned approaches.

HITNet does not explicitly build a full cost volume or apply 3D convolutions to it, a common but computationally expensive design in recent neural stereo networks. Instead, it combines a fast multi-resolution initialization step with differentiable 2D geometric propagation and warping mechanisms to infer disparity hypotheses.
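To make the warping step concrete, here is a minimal NumPy sketch of disparity-based warping for a rectified stereo pair. It is an illustration under stated assumptions, not the paper's implementation: the function name `warp_right_to_left`, the single-channel `(H, W)` layout, and the border clamping are all hypothetical choices. It relies only on the standard rectified-stereo convention that a left-image pixel at column x corresponds to the right-image column x - d.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Resample a right-view map so it aligns with the left view.

    right:     (H, W) array, e.g. one feature channel of the right image.
    disparity: (H, W) array of left-view disparities in pixels.
    Assumption: rectified pair, so matches lie on the same row at x - d.
    """
    h, w = disparity.shape
    # Sub-pixel sample positions in the right image, clamped to the borders.
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity
    xs = np.clip(xs, 0.0, w - 1.0)
    x0 = np.floor(xs).astype(np.int64)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    # Linear interpolation along x; it is piecewise linear in the disparity,
    # which is what lets gradients reach the disparity in a differentiable
    # framework.
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]
```

Comparing the warped right-view features with the left-view features at a hypothesised disparity gives a matching cost for that one hypothesis, so no volume over all candidate disparities has to be stored.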

Key Contributions and Methodology

The paper's main contributions are:

Fast Multi-resolution Initialization: The architecture features an initialization step that computes high-resolution matches using learned features without resorting to exhaustive computations of cost volumes. This approach maintains high accuracy with significantly reduced computational demands.
Efficient Disparity Propagation: HITNet propagates disparities in 2D and infers a slanted plane hypothesis alongside each disparity. These planes, used together with slanted support windows, let geometric warping and upsampling follow surfaces that are not fronto-parallel, which makes both operations more accurate.
End-to-end Learning Architecture: The entire pipeline, from initialization through propagation, is trained end to end, so learned features flow through every stage of the network.
State-of-the-art Performance with Reduced Computation: HITNet ranks at or near the top of the KITTI 2012, KITTI 2015, ETH3D, and Middlebury-v3 benchmarks at a fraction of the computational cost of existing methods.
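The role of slanted planes in upsampling can be sketched as follows. This is a hedged illustration rather than the authors' code: the function `expand_tiles`, the square tile of side `tile=4`, and the convention that each hypothesis stores a disparity `d` at the tile centre plus gradients `dx` and `dy` are assumptions made for the example. Each coarse hypothesis is expanded into per-pixel disparities by evaluating its plane at every pixel offset inside the tile.

```python
import numpy as np

def expand_tiles(d, dx, dy, tile=4):
    """Expand per-tile plane hypotheses into a per-pixel disparity map.

    d, dx, dy: (Ht, Wt) arrays holding, per tile, the disparity at the tile
               centre and its gradient along x and y (assumed convention).
    Returns an (Ht * tile, Wt * tile) disparity map.
    """
    ht, wt = d.shape
    # Pixel offsets measured from the tile centre, e.g. [-1.5, -0.5, 0.5, 1.5].
    offsets = np.arange(tile, dtype=np.float64) - (tile - 1) / 2.0
    # Broadcast to (Ht, tile, Wt, tile): rows vary with dy, columns with dx.
    full = (d[:, None, :, None]
            + dy[:, None, :, None] * offsets[None, :, None, None]
            + dx[:, None, :, None] * offsets[None, None, None, :])
    return full.reshape(ht * tile, wt * tile)
```

With `dx` and `dy` set to zero this reduces to nearest-neighbour upsampling of the disparity, which shows what the plane parameters add: on a slanted surface each pixel in a tile receives its own disparity instead of a shared constant.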

Theoretical and Practical Implications

On the theoretical side, HITNet departs from the heavy reliance on full 3D cost volumes and introduces slanted plane hypotheses into the warping used to predict disparities. Together these changes make the architecture markedly more efficient and offer a blueprint for future stereo networks that prioritize computational efficiency without sacrificing accuracy.

Practically, HITNet has clear implications for applications requiring rapid yet accurate depth estimation, such as in autonomous driving where latency is critical. The reduction in computational demand translates directly to faster processing times, enabling real-time applications to operate more effectively under constrained processing environments.

Future Directions

While HITNet has demonstrated significant progress, future research could focus on improving self-supervised learning methods and exploring self-distillation to further reduce the necessity of extensive ground truth data. There is also potential to investigate how the architecture could be scaled or adapted for broader applications in 3D perception beyond stereo matching, possibly integrating additional sensory inputs.

In summary, HITNet balances computational efficiency with high accuracy in real-time stereo matching. Beyond its benchmark results, the paper contributes techniques, notably fast multi-resolution initialization and slanted plane propagation, that may inform future work on real-time depth estimation.
