RAFT-Stereo: Iterative Deep Stereo Matching

Updated 21 April 2026

RAFT-Stereo is a deep learning model for dense stereo correspondence that iteratively refines disparity estimates using a recurrent update operator.
It achieves memory-efficient matching by utilizing 2D convolutions on narrow cost slices instead of memory-intensive 3D cost volumes.
Extensions enhance accuracy by addressing frequency convergence, integrating sparse LiDAR guidance, and unifying stereo with flow and depth estimation.

RAFT-Stereo is a deep learning architecture for dense stereo correspondence that adapts the RAFT (Recurrent All-Pairs Field Transforms) paradigm to the rectified stereo matching domain. It introduces an efficient, recurrent iterative update operator that refines disparity estimates using multi-scale feature representations and reduces reliance on memory-intensive 3D cost volume regularization. RAFT-Stereo’s flexible architectural core forms the foundation for several subsequent advances in stereo and multi-view matching and has inspired numerous extensions addressing its limitations in frequency convergence, cross-modal fusion, and global context regulation.

1. Core Architecture and Iterative Update Mechanism

RAFT-Stereo receives a rectified stereo pair $(I_L, I_R) \in \mathbb{R}^{H \times W \times 3}$ and encodes both images via a shared CNN feature extractor to obtain feature maps $f_L$ and $f_R$ (Lipson et al., 2021, Lin et al., 3 Dec 2025). An all-pairs correlation volume

$C(u, v, d) = \langle f_L(u, v), f_R(u-d, v) \rangle$

is constructed for a range of disparities $d$ per scanline.
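A minimal NumPy sketch of this scanline correlation follows. It is an illustration, not the reference implementation: the function names, the toy feature shapes, and the unit-norm features are assumptions made so that the example is easy to check.

```python
import numpy as np

def correlation_volume(f_left, f_right):
    """All-pairs correlation along each scanline.

    f_left, f_right: (H, W, D) per-pixel feature vectors.
    Returns corr of shape (H, W, W) with
    corr[v, u, w] = <f_left[v, u], f_right[v, w]>,
    so C(u, v, d) from the text corresponds to corr[v, u, u - d].
    """
    return np.einsum("vuc,vwc->vuw", f_left, f_right)

def unit_norm(x):
    # Normalising makes a perfect match score exactly 1 (toy assumption).
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy data: the left features are the right features shifted by 3 pixels.
rng = np.random.default_rng(0)
f_right = unit_norm(rng.standard_normal((4, 16, 32)))
true_d = 3
f_left = np.roll(f_right, shift=true_d, axis=1)  # f_left[v, u] = f_right[v, u - 3]

corr = correlation_volume(f_left, f_right)
u = 10
recovered_d = u - int(np.argmax(corr[0, u]))
print(recovered_d)
```

Here the strongest response for left pixel u sits at right column u - d, so the argmax recovers the synthetic shift. Note that the volume grows as H x W x W, which is why it is typically built on downsampled feature maps.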

Disparity refinement proceeds by initializing $d_0 = 0$ and a context-based hidden state $h_0$. At each iteration $k$, the current disparity hypothesis $d_{k-1}$ is used to select a narrow cost slice $C_k$ from the correlation volume around the candidate disparity. A Gated Recurrent Unit (GRU) or ConvGRU, conditioned on the context features $c$ and the cost slice, predicts a residual update $\Delta d_k$: $(h_k, \Delta d_k) = \mathrm{ConvGRU}(h_{k-1}, C_k, c)$

$d_k = d_{k-1} + \Delta d_k$

After $N$ iterations (typically 22–32), $d_N$ is output as the final disparity field (Lipson et al., 2021). The network is fully differentiable and end-to-end trainable, with a sequential loss supervising each intermediate disparity estimate.
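The lookup-and-update loop described above can be sketched in NumPy as follows. The slice lookup follows the text; the update rule does not. In place of the learned ConvGRU, the hypothetical `toy_update` uses a soft-argmax over the slice, purely so the loop runs without trained weights. The radius, temperature, and iteration count are also illustrative assumptions.

```python
import numpy as np

def lookup_cost_slice(corr, disp, radius=4):
    """Sample corr[v, u, u - (disp + o)] for offsets o in [-radius, radius].

    corr: (H, W, W) scanline correlation volume; disp: (H, W) current disparity.
    Non-integer positions use linear interpolation; out-of-image samples are 0.
    """
    H, W, W2 = corr.shape
    offsets = np.arange(-radius, radius + 1)
    u = np.arange(W)[None, :, None]
    x = u - (disp[:, :, None] + offsets[None, None, :])  # (H, W, 2r+1)
    x0 = np.floor(x).astype(int)
    frac = x - x0

    def gather(idx):
        inside = (idx >= 0) & (idx < W2)
        vals = np.take_along_axis(corr, np.clip(idx, 0, W2 - 1), axis=2)
        return np.where(inside, vals, 0.0)

    return (1.0 - frac) * gather(x0) + frac * gather(x0 + 1), offsets

def toy_update(cost_slice, offsets, temperature=20.0):
    """Hypothetical stand-in for the learned ConvGRU: soft-argmax residual."""
    logits = temperature * cost_slice
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights * offsets).sum(axis=-1)  # residual update for this step

def refine(corr, num_iters=8, radius=4):
    disp = np.zeros(corr.shape[:2])  # d_0 = 0
    history = []
    for _ in range(num_iters):
        cost_slice, offsets = lookup_cost_slice(corr, disp, radius)
        disp = disp + toy_update(cost_slice, offsets)  # d_k = d_{k-1} + delta
        history.append(disp.copy())
    return history

# Toy data: unit-norm features, left view shifted by a true disparity of 3.
rng = np.random.default_rng(0)
f_right = rng.standard_normal((4, 32, 64))
f_right /= np.linalg.norm(f_right, axis=-1, keepdims=True)
f_left = np.roll(f_right, shift=3, axis=1)
corr = np.einsum("vuc,vwc->vuw", f_left, f_right)

history = refine(corr)
print(np.abs(history[-1][:, 8:] - 3.0).max())  # error away from the left border
```

Keeping every intermediate estimate in `history` mirrors the way training supervises each iteration rather than only the last one.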

Multilevel RAFT-Stereo versions maintain hidden states at several spatial resolutions, allowing rapid propagation across both fine and coarse scales (Lipson et al., 2021). All operations (feature extraction, cost computation, and update) are implemented with 2D convolutions, which keeps memory and compute requirements low relative to approaches that regularize a 3D cost volume.

2. Frequency Convergence Inconsistency and Wavelet-Based Remedies

A fundamental limitation identified in RAFT-Stereo is the convergence imbalance between low-frequency (smooth areas) and high-frequency (edges, thin structures) disparity regions during iterative refinement (Wei et al., 23 May 2025). Empirically, the End-Point Error (EPE) in high-frequency regions persists above 0.72 px after all iterations, while low-frequency EPE converges below 0.4 px. The root cause is the indiscriminate, joint optimization of all frequency components by the recurrent operator.

Wavelet-Stereo addresses this by explicit frequency band separation. A Haar Discrete Wavelet Transform (DWT) decomposes the input images into progressively downsampled sub-bands: $I_{LL}$ (low-frequency) and $I_{LH}, I_{HL}, I_{HH}$ (high-frequency, at multiple orientations). Separate feature extractors process each band: a U-shaped CNN for high-frequency, a standard context encoder for low-frequency content.
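To make the band separation concrete, here is a small NumPy sketch of an orthonormal Haar DWT applied recursively to the low-frequency band. It shows the decomposition only, not the Wavelet-Stereo feature extractors. The function names are assumptions, and so is the LH/HL naming, which differs between libraries.

```python
import numpy as np

def haar_dwt2(img):
    """One level of an orthonormal 2D Haar DWT on an (H, W) array (H, W even)."""
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # local average (low frequency)
    lh = (a + b - c - d) / 2.0  # row difference: responds to horizontal edges
    hl = (a - b + c - d) / 2.0  # column difference: responds to vertical edges
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, (lh, hl, hh)

def haar_pyramid(img, levels=2):
    """Apply the transform repeatedly to the low band (progressive downsampling)."""
    highs, low = [], img
    for _ in range(levels):
        low, high = haar_dwt2(low)
        highs.append(high)
    return low, highs

# A vertical step edge placed inside a 2x2 block lands in the column-difference band.
img = np.zeros((8, 8))
img[:, 3:] = 1.0
ll, (lh, hl, hh) = haar_dwt2(img)
low, highs = haar_pyramid(img, levels=2)
print(np.abs(hl).max(), np.abs(lh).max())
```

Because the transform is orthonormal, the four sub-bands together carry the same energy as the input, so the split loses no information.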

A new High-frequency Preservation Update Operator (HPU), composed of an Iterative Frequency Adapter (IFA; alternating cross-attention between frequency features) and a High-frequency Preservation LSTM, adapts and injects refined high-frequency cues into each update. This architecture enables simultaneous edge preservation and smooth-region refinement.

Quantitatively, this frequency-aware design narrows the EPE gap (e.g., 0.56 px for high-frequency, 0.35 px for low-frequency), improves leaderboard performance (e.g., a best KITTI 2012 Out-3 of 1.07%), and raises efficiency, reaching baseline RAFT-Stereo accuracy in one-quarter of the iterations (Wei et al., 23 May 2025).

3. Sparse LiDAR Fusion and Guided RAFT-Stereo

RAFT-Stereo’s architecture enables late or early fusion of external sparse depth guidance, e.g., from LiDAR, but vanilla injection of sparse disparity values into the initial disparity $d_0$ is ineffective when LiDAR is extremely sparse (hundreds of points) (Yoo et al., 26 Jul 2025). In such cases, cost-slice retrieval is dominated by zero disparity, and the limited guidance acts as high-frequency noise that is subsequently suppressed by the network’s recurrent operations.

The Guided RAFT-Stereo (GRAFT-Stereo) framework handles this via a depth pre-fill strategy:

For late fusion, sparse disparities are interpolated (row-wise nearest-neighbor or linear) or completed via a small U-Net, yielding a dense disparity prior.
For early fusion, pre-filled depth fields are projected to image coordinates, concatenated to image features, and input to the encoder. Only the top-$K$ most confident predictions are retained.

This pre-filling transforms the guidance signal, ensuring the iterative network can effectively propagate and refine guided depth. Empirically, GRAFT-Stereo achieves an absolute reduction in Bad1 error versus the best prior method under 300-point LiDAR, along with substantial improvements in both RMSE and MAE. Modularity is preserved: only minor architecture changes and no extra loss terms are required, and performance gains hold across varying sparsity and diverse datasets (Yoo et al., 26 Jul 2025).
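The row-wise interpolation variant of the pre-fill can be sketched as below. This is a minimal illustration, not GRAFT-Stereo's code: the function name `prefill_rows`, the choice to leave rows without any LiDAR sample at zero, and the constant extrapolation past the outermost samples are all assumptions.

```python
import numpy as np

def prefill_rows(sparse_disp, valid, mode="linear"):
    """Densify sparse disparities independently along each image row.

    sparse_disp: (H, W) disparities, meaningful only where valid is True.
    valid:       (H, W) boolean mask of LiDAR-derived samples.
    Rows with no valid sample stay at 0 (assumption for this sketch).
    """
    H, W = sparse_disp.shape
    dense = np.zeros((H, W), dtype=float)
    cols = np.arange(W)
    for v in range(H):
        xs = cols[valid[v]]  # sorted columns holding a sample
        if xs.size == 0:
            continue
        ys = sparse_disp[v, xs]
        if mode == "linear":
            # np.interp holds the end values constant outside [xs[0], xs[-1]].
            dense[v] = np.interp(cols, xs, ys)
        elif mode == "nearest":
            nearest = np.abs(cols[:, None] - xs[None, :]).argmin(axis=1)
            dense[v] = ys[nearest]
        else:
            raise ValueError(f"unknown mode: {mode}")
    return dense

# One row with two LiDAR hits (columns 2 and 6), one row with none.
sparse = np.zeros((2, 8))
mask = np.zeros((2, 8), dtype=bool)
sparse[0, 2], sparse[0, 6] = 10.0, 20.0
mask[0, 2] = mask[0, 6] = True

print(prefill_rows(sparse, mask, "linear")[0])
print(prefill_rows(sparse, mask, "nearest")[0])
```

The resulting dense field could then serve as the prior in place of the raw sparse samples, which is the step the text credits with letting the recurrent updates propagate the guidance.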

4. Extensions: Global Context, Multi-View, Unified Models, and Adversarial Robustness

Global Context Regulation

RAFT-Stereo’s local iterative model can struggle in ill-posed regions lacking geometric texture or exhibiting occlusions. The GREAT (Global Regulation and Excitation via Attention Tuning) framework addresses this by inserting three attention modules: Spatial Attention (captures scene-wide context within feature maps), Matching Attention (injects cross-view context along epipolar lines), and Volume Attention (modulates the raw cost volume per disparity slice) (Li et al., 19 Sep 2025). The result is a “globally tuned” cost landscape at each recurrent step.

Quantitative results show a reduction in EPE on Scene-Flow and consistent improvements on all major real-world benchmarks, with negligible runtime and parameter overhead versus baseline RAFT-Stereo (Li et al., 19 Sep 2025).

Multiview Extensions

CER-MVS (Cascaded Epipolar RAFT for Multiview Stereo) generalizes RAFT-Stereo to multiple calibrated images. It constructs epipolar cost volumes for each neighbor view, employs cost-volume cascades for memory efficiency, and fuses multi-resolution disparity maps with adaptive thresholds for robust point cloud reconstruction (Ma et al., 2022).

Unified Correspondence Models

Recent research demonstrates that RAFT-Stereo’s paradigm can be unified with flow and generic depth estimation within a single Transformer-based architecture. Cross-attention augmented features enable parameter-free 1D matching for stereo, and follow-up RAFT-style refinement steps further improve accuracy. These unified models outperform RAFT-Stereo in both error and speed, using fewer sequential updates and parameters (e.g., EPE=0.77 vs 0.86 on Scene-Flow for four refinements), highlighting strong architectural transferability (Xu et al., 2022).

Adversarial Vulnerability

RAFT-Stereo exhibits susceptibility to physical adversarial patch attacks. DepthVanish, an optimized stripe-texture patch, exploits periodic ambiguity in the cost volume and the recurrent update’s assumption of smooth residuals to drive predicted disparity to zero (infinite depth) over the attacked region. A clean D1 error of 2% rises to 89% under such attacks, confirmed in real-world experiments (Xing et al., 20 Jun 2025). Robustness remains an open challenge: adversarially trained models and detection heuristics are potential mitigations.
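The periodic ambiguity mentioned above can be shown with a one-dimensional toy example. This sketch is not the DepthVanish optimization and does not involve RAFT-Stereo itself. It only shows that a striped texture yields several equally good matches, under the assumptions of a sum-of-absolute-differences cost and circular shifts.

```python
import numpy as np

def sad_cost_curve(left_row, right_row, max_disp):
    """Sum of absolute differences per candidate disparity (circular shift)."""
    return np.array([
        np.abs(left_row - np.roll(right_row, c)).sum() for c in range(max_disp)
    ])

W, period, true_d = 32, 8, 2
# Stripes of width period // 2, so the pattern repeats every `period` pixels.
right_row = ((np.arange(W) // (period // 2)) % 2).astype(float)
left_row = np.roll(right_row, true_d)

costs = sad_cost_curve(left_row, right_row, max_disp=W)
minima = np.flatnonzero(costs == 0)
print(minima)
```

Every disparity that differs from the true one by a multiple of the stripe period matches perfectly, so the matching cost alone cannot single out the correct value.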

5. Benchmarks, Generalization, and Limitations

RAFT-Stereo demonstrates high accuracy across standard stereo datasets. On Middlebury, it achieved a 9.37% Bad 1px error, 29% lower than previous methods, and on ETH3D, 2.44% Bad 1px (Lipson et al., 2021). However, cross-dataset generalization is sensitive to implementation details and coordinate conventions. On ETH3D, an incorrect disparity range configuration led to systematic negative disparities, a catastrophic failure with EPE ≈ 26 px and a 98% error rate (Lin et al., 3 Dec 2025). Once the configuration is corrected, performance matches expectations.
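The metrics quoted above, together with a cheap check for the sign-convention failure, can be computed as in the following sketch. The function name, dictionary keys, and the negative-fraction check are illustrative assumptions rather than part of any benchmark toolkit.

```python
import numpy as np

def stereo_metrics(pred, gt, valid, tau=1.0):
    """End-point error, Bad-tau percentage, and fraction of negative predictions.

    pred, gt: (H, W) disparities in pixels; valid: (H, W) boolean ground-truth mask.
    """
    err = np.abs(pred - gt)[valid]
    return {
        "epe": float(err.mean()),
        "bad_pct": float(100.0 * (err > tau).mean()),
        # A large value here hints at a flipped sign or a wrong disparity range.
        "negative_frac": float((pred[valid] < 0).mean()),
    }

gt = np.full((4, 6), 12.0)
valid = np.ones_like(gt, dtype=bool)

good = stereo_metrics(gt + 0.5, gt, valid)
flipped = stereo_metrics(-gt, gt, valid)  # simulated sign-convention mistake
print(good)
print(flipped)
```

Running such a check on a handful of frames before a full evaluation is an inexpensive way to catch convention mismatches of the kind described above.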

In UAV forest environments, iterative models like RAFT-Stereo preserve fine detail (e.g., branches) but may produce more speckle and noise than foundation-model approaches; methods like DEFOM display superior smoothness and occlusion handling in highly unstructured or vegetation-dense domains (Lin et al., 3 Dec 2025).

Ablation studies indicate iterative models benefit from frequency-aware, attention-centric, or external depth-injected enhancements. Runtime and efficiency trade-offs are modulated by the number of recurrent steps (e.g., 7.6 FPS at 32 iterations, up to 26 FPS for real-time variants) (Lipson et al., 2021).

6. Training, Implementation, and Practical Considerations

RAFT-Stereo and its variants follow a broadly consistent training recipe: AdamW optimization, one-cycle or exponential learning rate schedules, and photometric/geometric data augmentations. Supervision is applied to every intermediate disparity estimate with exponentially weighted terms, so that earlier iterations contribute less than later ones, usually via $L_1$ or robust regression losses (Lipson et al., 2021, Wei et al., 23 May 2025).
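A minimal NumPy sketch of such a sequence loss is given below, assuming an absolute-error term per iteration. The decay factor `gamma = 0.9` is an illustrative assumption, and a real training setup would compute this in an autodiff framework rather than NumPy.

```python
import numpy as np

def sequence_loss(pred_disps, gt_disp, valid, gamma=0.9):
    """Exponentially weighted sum of per-iteration mean absolute errors.

    pred_disps: list of (H, W) predictions, ordered from first to last iteration.
    The last prediction gets weight 1; earlier ones get gamma, gamma**2, ...
    """
    n = len(pred_disps)
    total = 0.0
    for i, pred in enumerate(pred_disps):
        weight = gamma ** (n - 1 - i)
        total += weight * np.abs(pred - gt_disp)[valid].mean()
    return total

gt = np.ones((2, 3))
valid = np.ones_like(gt, dtype=bool)
preds = [np.zeros_like(gt), gt.copy()]  # first estimate off by 1 px, second exact
loss = sequence_loss(preds, gt, valid)
print(loss)
```

Weighting later iterations more heavily still gives early estimates a gradient signal while focusing the objective on the final output.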

Resource requirements remain modest: memory is constrained by the compact representation (2D convolutions on cost slices), allowing inference at high resolutions on consumer GPUs. Custom CUDA samplers for the cost lookup further accelerate deployment.

In practice, users should pay attention to dataset conventions, use augmentation for out-of-domain robustness, and consider uncertainty modeling when fusing external guidance. Public codebases are available for standard, wavelet-based, attention-augmented, and LiDAR-guided RAFT-Stereo variants (Lipson et al., 2021, Wei et al., 23 May 2025, Li et al., 19 Sep 2025, Yoo et al., 26 Jul 2025).

References:

(Lipson et al., 2021) RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching
(Wei et al., 23 May 2025) A Wavelet-based Stereo Matching Framework for Solving Frequency Convergence Inconsistency
(Yoo et al., 26 Jul 2025) Leveraging Sparse LiDAR for RAFT-Stereo: A Depth Pre-Fill Perspective
(Li et al., 19 Sep 2025) Global Regulation and Excitation via Attention Tuning for Stereo Matching
(Ma et al., 2022) Multiview Stereo with Cascaded Epipolar RAFT
(Xu et al., 2022) Unifying Flow, Stereo and Depth Estimation
(Xing et al., 20 Jun 2025) DepthVanish: Optimizing Adversarial Interval Structures for Stereo-Depth-Invisible Patches
(Lin et al., 3 Dec 2025) Generalization Evaluation of Deep Stereo Matching Methods for UAV-Based Forestry Applications