Fast-FoundationStereo: Rapid Stereo Matching

Updated 16 December 2025

Fast-FoundationStereo is a stereo matching architecture that fuses zero-shot robustness with real-time efficiency using compact CNN backbones and feature distillation.
It uses a modular pipeline with blockwise neural architecture search and structured pruning to achieve up to 24x speedup while maintaining near-teacher accuracy.
Large-scale pseudo-labeling and efficient pruning enable strong cross-domain generalization, making it ideal for robotics, autonomous driving, and dense 3D mapping.

Fast-FoundationStereo defines a new class of stereo matching architectures that combine the zero-shot robustness and cross-domain generalization of large-scale foundation models with real-time or near real-time computational efficiency. These methods address the long-standing trade-off in stereo vision between the accuracy and transferability of large foundation models and the speed and resource efficiency required for deployment in autonomous robotics, real-time perception, and large-scale mapping. The Fast-FoundationStereo pipeline is based on a modular, divide-and-conquer approach featuring feature distillation, architecture search, and structured pruning, augmented by large-scale pseudo-labeled data curation for domain generalization.

1. Divide-and-Conquer Acceleration Architecture

Fast-FoundationStereo integrates three key components to overcome the computational bottlenecks of foundation model-based stereo systems (Wen et al., 11 Dec 2025):

Knowledge Distillation: The FoundationStereo teacher network employs a hybrid backbone by fusing foundation-scale monocular depth features (DepthAnything V2, ViT) with CNN-based side-tuning into a 4-scale feature pyramid $\{\bar{f}_l(i), \bar{f}_r(i)\}$ . Fast-FoundationStereo replaces this with a compact, efficient CNN backbone (e.g., EdgeNeXt or MobileNetV2) trained via feature-level distillation:

$\mathcal{L}_{KD} = \sum_{i}\|\bar{f}_l(i) - f_l(i)\|_2^2 + \|\bar{f}_r(i) - f_r(i)\|_2^2$

This preserves monocular and stereo priors at order-of-magnitude lower inference cost.

Blockwise Neural Architecture Search (NAS): The cost aggregation/filtering module is decomposed into $N$ sequential blocks, each with a search space of candidate operations (e.g., 3D/2D conv, hourglass, transformer, excitation, variable depths). Each candidate block is trained to mimic the teacher via MSE loss and the global architecture is chosen via integer linear programming to optimize accuracy-latency tradeoff under a hard speedup constraint.
Structured Pruning of Iterative Refinement: The iterative ConvGRU-based disparity refinement module is pruned by leveraging a dependency graph over recurrent channels/layers and pruning low-importance components as determined by first-order Taylor analysis. Retraining with loss

$\mathcal{L}_{retrain} = \sum_{k=1}^K \gamma^{K-k} \|d_k - d^*\|_1 + \lambda\sum_{i=1}^L \|x_i - \bar{x}_i\|_2^2$

( $\gamma=0.9$ , $\lambda=0.1$ ) enables $>2\times$ speedup per iteration with negligible EPE degradation.

Collectively, this pipeline yields an overall $10\times$ to $24\times$ runtime speedup and $10\times$ compression over the teacher foundation model, with near-matching accuracy in zero-shot regimes.

2. Large-Scale Pseudo-Labeling and Generalization

Generalization to in-the-wild or out-of-distribution domains is achieved through an automatic pseudo-labeling strategy (Wen et al., 11 Dec 2025). Core steps include:

Curation of $1.4$ million rectified stereo pairs from the Stereo4D internet-scale dataset.
For each stereo pair, teacher FoundationStereo generates a disparity map ( $d^t$ ), and a monocular depth estimator (UniDepth V2) produces $z^m$ . Both are backprojected to point clouds with per-pixel surface normals.
Pixel-level cosine similarity $c = \langle n^t, n^m \rangle$ is used as a consistency mask; pixels with $c > \tau_c$ (typically $0.5$) are considered reliable, while sky regions are excluded via open-vocabulary segmentation.
Remaining pseudo-labels are used to train the student model, augmenting synthetic data for robust knowledge distillation.

This method achieves strong zero-shot transfer by capturing both the global structure from synthetic datasets and the diversity of real-world imagery.

3. Quantitative Benchmarks and Efficiency

Fast-FoundationStereo's performance spans multiple axes (Wen et al., 11 Dec 2025):

Efficiency: On NVIDIA 3090, single-model throughput is $49$ ms/frame (EdgeNeXt-S) or $21$ ms/frame (TensorRT), representing $10\times$ to $24\times$ speedup over FoundationStereo's baseline ($496$ ms/frame).
Accuracy: On standard benchmarks, zero-shot non-occluded error rates approach those of foundation models:
- Middlebury-Q BP-2: $2.12\%$
- ETH3D BP-2: $0.62\%$
- KITTI-15 D1: $3.43\%$
- These values are state-of-the-art among all real-time (≤ $50$ ms) stereo architectures.
Model Size: Student backbone ($3.4$–$7$M params), cost filtering ( $\sim8$ M), refinement ( $\sim2$ M) compared to $\sim200$ M for teacher.

The methods scale linearly up to high resolutions (e.g., 4K, 6K), with stable memory overhead and sub-second times (see also (Raza et al., 2021)).

4. Architectural Variants and Backbones

Fast-FoundationStereo supports a spectrum of design points depending on latency, memory, and accuracy requirements (Wen et al., 11 Dec 2025):

Backbone Choice: Student networks utilize resource-efficient CNNs (EdgeNeXt, MobileNetV2, EfficientNet) rather than vision transformers.
Cost Volume Aggregation: Blockwise NAS enables assignment of lightweight blocks (e.g., planar 3D conv, efficient transformers) in the cost aggregation stack, matched via per-block distillation.
Refinement Module: The ConvGRU-based iterative module retains teacher-level update structure but prunes redundant operations for inference cost reduction.

Hyperparameters and training regimes are optimized for each module: AdamW optimizer, batch sizes ($8$–$32$), $8$–$30$ epochs, gradual learning rate decay, and loss functions combining L1 and feature-based distillation.

5. Theoretical and Practical Context

Fast-FoundationStereo directly addresses the historic shortcomings of both classical fast stereo (piecewise planar/mesh, e.g., (Pillai et al., 2015)) and modern deep learning systems that incur high computational overheads for modest accuracy gains. Unlike displacement-invariant 2D conv architectures (Zhong et al., 2020) or fast signature-based 2D UNet models (Yee et al., 2019), Fast-FoundationStereo achieves both strong zero-shot cross-dataset generalization and real-time speed by leveraging foundation model inductive biases, large-scale pseudo-labeling, and systematic architectural compression.

Its plug-and-play modularity enables deployment in demanding time-critical applications:

Real-time robotics (dense SLAM/VO)
Autonomous driving 3D stacks (depth + detection + planning)
Augmented/virtual reality depth pipelines
Backbone for research in multi-view geometry where traditional 4D cost volumes are intractable

Compared to prior acceleration-optimized stereo pipelines:

Method	FPS	KITTI-15 D1	Model Size	Training Data
Fast-FoundationStereo	20–48	3.43%	3–7M + 8M	Synth + 1.4M pseudo
DICC (Zhong et al., 2020)	100	2.86%	2D UNet	Synthetic
Fast Deep Stereo (Yee et al., 2019)	48	3.08%	2D UNet	Synthetic/KITTI FT
iCFR/FRSNet (Raza et al., 2021)	16–60	1.40–2.80%	32–64C hourglass	Synthetic/KITTI FT

Fast-FoundationStereo is distinguished by achieving strong zero-shot cross-domain generalization characteristic of large-scale foundation models, which none of the prior sub-100 ms methods matched.

7. Limitations and Future Directions

A residual error gap of $\sim 0.5\%$ D1 remains to state-of-the-art non-real-time models (e.g., full FoundationStereo) in highly textured or high-resolution scenes (Wen et al., 11 Dec 2025). Occlusion reasoning is not explicitly handled in the student backbone, and refinement focuses on inlier confidence propagation. Further potential lies in integrating active left-right consistency during cost aggregation, advancing real-time uncertainty estimation, and extending the pseudo-labeling regime with additional self-supervised cues or active data selection.

The modular, distillation-accelerated paradigm of Fast-FoundationStereo defines a scalable template for compressing future foundation vision architectures into practical, deployable modules for real-time dense geometry perception.