FoundationStereo: Zero-Shot Stereo Matching (2501.09898v4)
Abstract: Tremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization - a hallmark of foundation models in other computer vision tasks - remains challenging for stereo matching. We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation. Project page: https://nvlabs.github.io/FoundationStereo/
Summary
- The paper introduces the large-scale, diverse, self-curated FoundationStereo Dataset (FSD) with 1 million stereo pairs to train a model capable of zero-shot generalization across various domains.
- It proposes a novel architecture combining a Side-Tuning Adapter (STA) leveraging real-world priors from a frozen monocular model with Attentive Hybrid Cost Filtering (AHCF) for efficient cost processing and long-range context.
- FoundationStereo achieves state-of-the-art zero-shot performance on multiple benchmarks like KITTI, Middlebury, and ETH3D, often matching or exceeding methods that are fine-tuned on target datasets.
Introduction and Motivation
FoundationStereo (2501.09898) addresses the challenge of zero-shot generalization in deep learning-based stereo matching. While existing methods demonstrate strong performance on specific benchmarks, they often necessitate per-domain fine-tuning, limiting their applicability in diverse, unseen environments. This contrasts with foundation models in other vision tasks, which exhibit robust generalization. The primary motivation is to develop a stereo matching model capable of high accuracy across various domains (indoor, outdoor, driving scenarios, synthetic datasets) without fine-tuning on target-domain data, thereby establishing a foundation model for stereo depth estimation. Previous attempts at zero-shot stereo matching were often limited by the scale and diversity of training data (e.g., reliance on Scene Flow) or by architectural constraints that hindered effective cross-domain transfer.
The FoundationStereo Dataset (FSD)
A core contribution is the creation of a large-scale, diverse synthetic dataset, the FoundationStereo Dataset (FSD), specifically designed to foster generalization.
- Scale and Generation: FSD comprises 1 million stereo pairs, substantially larger than prior synthetic datasets used for stereo training. It was generated using NVIDIA Omniverse, employing RTX path-tracing to achieve high photorealism.
- Diversity and Realism: Significant effort was invested in ensuring diversity through extensive domain randomization. This includes variations in camera intrinsics (focal length), stereo baseline, camera perspectives, lighting conditions, and object configurations. Over 5,000 high-quality 3D assets were utilized across 12 large scene models, incorporating diverse textures and materials. The dataset includes structured scenes (indoor/outdoor) and randomized object arrangements ("flying objects"), deliberately incorporating challenging stereo phenomena like reflections, textureless surfaces, severe occlusions, and varying camera parameters.
- Iterative Self-Curation: To enhance dataset quality and mitigate ambiguities inherent in synthetic generation, an automated self-curation pipeline was implemented. An initial model trained on FSD is used to evaluate the dataset itself. Samples exhibiting high error rates (defined operationally as BP-2 > 60%, i.e., more than 60% of pixels with disparity error above 2 pixels) or identified ambiguities (e.g., texture repetition, problematic lighting) are flagged and replaced with newly generated samples. This iterative process (performed twice) refines both the dataset and the model's robustness concurrently.
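In outline, one round of this curation loop can be sketched as follows. This is a minimal sketch: `train_model` and `render_sample` are hypothetical stand-ins for the paper's actual training and Omniverse rendering infrastructure, while the BP-2 metric and 60% threshold follow the description above.

```python
import torch

BP2_THRESHOLD = 0.60   # flag samples where over 60% of pixels are bad (BP-2)
NUM_ROUNDS = 2         # the curation loop is performed twice

def bp2_error(pred_disp: torch.Tensor, gt_disp: torch.Tensor,
              valid: torch.Tensor) -> float:
    """Fraction of valid pixels whose disparity error exceeds 2 pixels."""
    bad = (pred_disp - gt_disp).abs() > 2.0
    return bad[valid].float().mean().item()

def self_curate(dataset, model):
    for _ in range(NUM_ROUNDS):
        model = train_model(model, dataset)        # hypothetical: train on current FSD
        for i, sample in enumerate(dataset):
            pred = model(sample.left, sample.right)
            if bp2_error(pred, sample.gt_disp, sample.valid) > BP2_THRESHOLD:
                # Ambiguous/hard sample: replace with a fresh randomized render.
                dataset[i] = render_sample()       # hypothetical Omniverse render
    return dataset, model
```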
Network Architecture
FoundationStereo incorporates several architectural innovations designed for scalability and effective feature processing, particularly for zero-shot generalization.
Side-Tuning Adapter (STA)
To bridge the sim-to-real gap and leverage powerful priors from models trained on large-scale real-world data, a Side-Tuning Adapter (STA) mechanism is employed.
- Mechanism: It adapts features from a frozen, pre-trained vision foundation model, specifically the ViT-Large based monocular depth estimator DepthAnythingV2-L. A lightweight CNN (EdgeNeXt-S) is trained in parallel ("side-tuning"). Intermediate features from the frozen ViT (extracted before its final prediction head) are downscaled via bilinear interpolation and concatenated channel-wise with features from the trainable CNN at the 1/4-resolution level (a fusion sketch follows this list).
- Purpose: This allows the stereo network to benefit from the rich semantic and geometric understanding encoded within the monocular foundation model, learned from vast amounts of real imagery, without incurring the cost of training the large ViT backbone. The trainable CNN learns to adapt these priors specifically for the stereo matching task.
- Application: STA is used for extracting multi-level unary features from the input stereo pair and also provides context features for the iterative refinement stage, using a distinct CNN structure for the latter. The paper empirically validates this specific side-tuning configuration against alternatives.
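The fusion step itself is simple to sketch in PyTorch. In this minimal sketch, `vit_backbone` and `cnn_backbone` are placeholders for the frozen DepthAnythingV2-L encoder (features taken before its prediction head) and the trainable EdgeNeXt-S branch, each assumed to map an image to a [B, C, h, w] feature map; any projection layers around the concatenation are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideTuningAdapter(nn.Module):
    """Sketch of the STA fusion: frozen monocular features are resized and
    concatenated with features from a trainable CNN branch at 1/4 resolution."""

    def __init__(self, vit_backbone: nn.Module, cnn_backbone: nn.Module):
        super().__init__()
        self.vit = vit_backbone.eval()
        for p in self.vit.parameters():
            p.requires_grad = False            # monocular priors stay frozen
        self.cnn = cnn_backbone                # lightweight, trained for stereo

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            vit_feat = self.vit(img)           # rich monocular features
        cnn_feat = self.cnn(img)               # stereo features at 1/4 resolution
        # Bilinearly resize frozen features onto the CNN's grid, then fuse by
        # channel-wise concatenation, as described above.
        vit_feat = F.interpolate(vit_feat, size=cnn_feat.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return torch.cat([vit_feat, cnn_feat], dim=1)
```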
Attentive Hybrid Cost Filtering (AHCF)
Effective aggregation of information within the cost volume is crucial, especially given the diverse scenes and potentially large disparity ranges targeted by a foundation model. AHCF combines several techniques for this purpose.
- Hybrid Cost Volume: Constructed at 1/4 resolution, it combines group-wise correlation (using 32 groups) and feature concatenation; a construction sketch follows this list. Group-wise correlation captures diverse matching signals effectively, while feature concatenation preserves the rich unary features derived from the STA module, including the adapted monocular priors.
- Axial-Planar Convolution (APC) Filtering: To manage the computational and memory costs of 3D convolutions on potentially large cost volumes (especially with large disparity search ranges), APC is used. It factorizes a standard 3D convolution (e.g., 3x3x3) into two sequential steps: a spatial convolution (K_s x K_s x 1) applied across the spatial dimensions of each disparity plane, followed by a disparity convolution (1 x 1 x K_d) applied along the disparity axis at each spatial location (see the sketch after this list). This decoupling allows a larger receptive field, particularly along the disparity dimension (K_d), within the hourglass network used for cost filtering, without incurring the cubic kernel cost of standard 3D convolutions.
- Disparity Transformer (DT): To explicitly model long-range dependencies across the disparity dimension, a Disparity Transformer module is introduced. The 4D cost volume is downsampled spatially and reshaped to (H/k × W/k, D, C), so that disparity D becomes the sequence-length dimension for the transformer blocks. Multi-head self-attention (implemented with FlashAttention for efficiency) is then applied along this disparity dimension, letting each spatial location attend to the entire disparity range. This global context reasoning helps resolve ambiguities in repetitive textures and large textureless regions. The output features from the DT are fused (via summation) with the features from the APC-based hourglass filtering network before disparity regression; a sketch of this module also follows this list.
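To make the hybrid volume concrete, the following sketch builds group-wise correlation and concatenation volumes at 1/4 resolution and stacks them along the channel axis. The loop-based construction is for clarity only and is an assumption, not the paper's optimized implementation.

```python
import torch

def groupwise_corr(fl: torch.Tensor, fr: torch.Tensor, groups: int) -> torch.Tensor:
    """Per-group inner product between left/right features: [B, G, H, W]."""
    B, C, H, W = fl.shape
    fl = fl.reshape(B, groups, C // groups, H, W)
    fr = fr.reshape(B, groups, C // groups, H, W)
    return (fl * fr).mean(dim=2)

def build_hybrid_volume(feat_l: torch.Tensor, feat_r: torch.Tensor,
                        max_disp: int, groups: int = 32) -> torch.Tensor:
    """Hybrid cost volume: group-wise correlation + feature concatenation.

    feat_l, feat_r: [B, C, H, W] unary features (C divisible by `groups`).
    Returns [B, G + 2C, D, H, W] with D = max_disp.
    """
    B, C, H, W = feat_l.shape
    corr = feat_l.new_zeros(B, groups, max_disp, H, W)
    cat = feat_l.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            corr[:, :, 0] = groupwise_corr(feat_l, feat_r, groups)
            cat[:, :, 0] = torch.cat([feat_l, feat_r], dim=1)
        else:
            # Right features shifted by d; pixels without a match stay zero.
            corr[:, :, d, :, d:] = groupwise_corr(feat_l[..., d:], feat_r[..., :-d], groups)
            cat[:, :, d, :, d:] = torch.cat([feat_l[..., d:], feat_r[..., :-d]], dim=1)
    return torch.cat([corr, cat], dim=1)
```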
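The APC factorization itself reduces to two Conv3d layers. A minimal sketch, with illustrative kernel sizes (note that PyTorch's Conv3d kernel layout is (D, H, W) for a [B, C, D, H, W] volume, so the spatial pass is written (1, ks, ks) and the disparity pass (kd, 1, 1)):

```python
import torch.nn as nn

class AxialPlanarConv3d(nn.Module):
    """Sketch of Axial-Planar Convolution: a dense 3D convolution factored
    into a spatial pass per disparity plane followed by a 1D pass along the
    disparity axis, enlarging the disparity receptive field cheaply."""

    def __init__(self, in_ch: int, out_ch: int, ks: int = 3, kd: int = 7):
        super().__init__()
        self.spatial = nn.Sequential(                 # K_s x K_s over H, W
            nn.Conv3d(in_ch, out_ch, kernel_size=(1, ks, ks),
                      padding=(0, ks // 2, ks // 2)),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True))
        self.disparity = nn.Sequential(               # K_d along the disparity axis
            nn.Conv3d(out_ch, out_ch, kernel_size=(kd, 1, 1),
                      padding=(kd // 2, 0, 0)),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):                             # x: [B, C, D, H, W]
        return self.disparity(self.spatial(x))
```

Compared with a dense ks x ks x kd kernel, the per-output cost drops from O(ks²·kd) to O(ks² + kd), which is what makes a long disparity kernel affordable.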
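Similarly, the DT's attention along the disparity axis can be sketched by reshaping the volume so each pixel becomes a length-D sequence. Here the downsampling factor, head count, and use of a stock TransformerEncoderLayer are assumptions; the paper uses FlashAttention, which recent PyTorch versions dispatch to automatically inside scaled-dot-product attention when available.

```python
import torch.nn as nn
import torch.nn.functional as F

class DisparityTransformer(nn.Module):
    """Sketch of self-attention along the disparity axis of a cost volume."""

    def __init__(self, channels: int, num_heads: int = 4, k: int = 4):
        super().__init__()
        self.k = k                                   # spatial downsampling factor
        self.block = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, dim_feedforward=2 * channels,
            batch_first=True, norm_first=True)

    def forward(self, volume):                       # volume: [B, C, D, H, W]
        B, C, D, H, W = volume.shape
        x = F.avg_pool3d(volume, kernel_size=(1, self.k, self.k))
        h, w = x.shape[-2:]
        # [B, C, D, h, w] -> [B*h*w, D, C]: each pixel attends over all disparities.
        x = x.permute(0, 3, 4, 2, 1).reshape(B * h * w, D, C)
        x = self.block(x)
        x = x.reshape(B, h, w, D, C).permute(0, 4, 3, 1, 2)
        # Upsample back to the full grid before fusing (by summation) with the
        # APC hourglass output.
        return F.interpolate(x, size=(D, H, W), mode="trilinear",
                             align_corners=False)
```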
Iterative Refinement
An initial disparity map is estimated from the filtered cost volume using a soft-argmin operation (sketched below). This estimate is then refined iteratively with GRU (Gated Recurrent Unit) blocks. Each GRU update step draws on motion features obtained by warping the right-image features with the current disparity estimate, features looked up from the filtered hybrid cost volume and a standard correlation volume, and context features provided by the STA module. Coarse-to-fine updates and an attention mechanism within the GRU stages further enhance the refinement, allowing the model to leverage multi-modal features for progressive accuracy improvement.
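The soft-argmin step is compact enough to show directly. A minimal sketch, assuming a single-channel filtered volume treated as matching scores (with raw costs one would negate before the softmax):

```python
import torch
import torch.nn.functional as F

def soft_argmin_disparity(volume: torch.Tensor) -> torch.Tensor:
    """Soft-argmin over a filtered volume [B, D, H, W]: a softmax along the
    disparity axis followed by the expectation over disparity indices."""
    B, D, H, W = volume.shape
    prob = F.softmax(volume, dim=1)                   # per-pixel disparity distribution
    disp = torch.arange(D, device=volume.device, dtype=prob.dtype)
    return (prob * disp.view(1, D, 1, 1)).sum(dim=1)  # [B, H, W] initial disparity
```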
Zero-Shot Generalization Performance
The combination of the large-scale, curated FSD dataset, the integration of real-world priors via STA, and the advanced AHCF architecture enables FoundationStereo to achieve strong zero-shot generalization across a wide range of benchmarks.
- Dataset Contribution: The FSD dataset's scale and diversity provide the model with exposure to varied conditions, preventing overfitting to specific synthetic characteristics and enhancing robustness to domain shifts. The self-curation process further improves data quality, leading to better generalization.
- Architectural Contribution: STA effectively mitigates the sim-to-real gap by injecting knowledge from DepthAnythingV2, improving performance on real-world images, particularly in challenging regions. AHCF, through APC and DT, provides the capacity to process large disparity ranges and long-range dependencies efficiently and effectively, crucial for handling diverse real-world scenes. APC manages computational cost for large receptive fields, while DT captures global disparity context.
- Results: FoundationStereo demonstrates state-of-the-art zero-shot performance on datasets like Middlebury, ETH3D, KITTI 2015, and Argoverse 1.1 Stereo. Notably, its zero-shot performance often matches or surpasses methods that are specifically fine-tuned on the target datasets. For instance, on the KITTI 2015 benchmark, it achieves a 1.84% F1-all error rate zero-shot, competitive with top fine-tuned methods. Ablation studies confirm the significant contributions of FSD, STA, APC, and DT to the overall zero-shot performance. The model also shows strong qualitative results on challenging in-the-wild images.
Implementation Considerations
- Model Components: The architecture leverages a frozen ViT-L (DepthAnythingV2) and trains a smaller CNN (EdgeNeXt-S based) alongside the cost processing and refinement networks. This side-tuning approach balances leveraging large pre-trained models with manageable training requirements for the stereo-specific parts.
- Computational Cost: While ViT-L inference adds computational overhead compared to purely CNN-based approaches, it is only performed once per image. The use of APC significantly reduces the memory and computation compared to standard 3D convolutions for cost filtering, making larger receptive fields feasible. The Disparity Transformer utilizes FlashAttention to mitigate the quadratic complexity of self-attention, making long-range context modeling practical.
- Deployment: As a zero-shot model, FoundationStereo eliminates the need for dataset collection and fine-tuning for new deployment domains, offering significant practical advantages in scenarios where target domain data is scarce or unavailable. The model predicts disparity up to a maximum range (e.g., 256 or 512 pixels, depending on configuration), which needs to be considered based on application requirements and input resolution. Inference requires sufficient GPU memory to accommodate the cost volume and network activations.
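For downstream use, metric depth follows from the predicted disparity via the standard rectified-stereo relation depth = focal · baseline / disparity. A minimal conversion sketch (the function name and masking policy are illustrative):

```python
import numpy as np

def disparity_to_depth(disp, focal_px: float, baseline_m: float,
                       max_disp: int = 256):
    """Convert disparity (pixels) to metric depth (meters) for a rectified
    stereo pair; pixels with (near-)zero disparity or beyond the model's
    configured search range are masked as invalid."""
    disp = np.asarray(disp, dtype=np.float32)
    valid = (disp > 1e-3) & (disp <= max_disp)
    depth = np.zeros_like(disp)
    depth[valid] = focal_px * baseline_m / disp[valid]
    return depth, valid
```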
Conclusion
FoundationStereo presents a significant advancement in stereo matching by achieving strong zero-shot generalization across diverse domains. This is accomplished through the synergistic combination of a large-scale, high-fidelity, self-curated synthetic dataset (FSD), the effective integration of monocular priors from a frozen vision foundation model via side-tuning (STA), and a sophisticated cost filtering architecture (AHCF) employing Axial-Planar Convolutions and a Disparity Transformer to handle large disparity ranges and long-range context efficiently. Its ability to perform competitively without domain-specific fine-tuning marks a step towards more robust and universally applicable stereo depth estimation systems.
Related Papers
- OpenStereo: A Comprehensive Benchmark for Stereo Matching and Strong Baseline (2023)
- An Improved RaftStereo Trained with A Mixed Dataset for the Robust Vision Challenge 2022 (2022)
- On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach (2018)
- Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data (2024)
- Mono2Stereo: Monocular Knowledge Transfer for Enhanced Stereo Matching (2024)