Multi-Scale RAFT for Dense Visual Correspondence
- Multi-Scale RAFT (MS-RAFT) is a dense visual correspondence framework that uses a hierarchical, coarse-to-fine strategy and multi-scale feature extraction to capture both large displacements and fine details.
- It leverages a Siamese U-Net for multi-resolution feature encoding and employs ConvGRU-based recurrent update blocks with learned convex upsampling to refine motion estimates.
- MS-RAFT demonstrates significant accuracy gains over single-scale RAFT in optical flow and 3D scene flow benchmarks, showing its effectiveness in real-world motion estimation tasks.
Multi-Scale RAFT (MS-RAFT) generalizes the Recurrent All-Pairs Field Transforms (RAFT) paradigm for dense visual correspondence by integrating classical hierarchical processing concepts, explicitly incorporating multi-scale feature extraction, hierarchical cost volumes, and coarse-to-fine optimization into RAFT’s recurrent, all-pairs matching backbone. MS-RAFT and its successors—including MS-RAFT+ and MS-RAFT-3D—demonstrate that classical hierarchical refinements remain effective when carefully synthesized with modern network-based correlation and update mechanisms for both optical flow and image-based scene flow (Jahedi et al., 2022, Jahedi et al., 2022, Schmid et al., 2 Jun 2025).
1. Hierarchical Multi-Scale Architecture
MS-RAFT employs a hierarchical, coarse-to-fine design in which estimation unfolds across a pyramid of spatial scales. For optical flow, MS-RAFT typically uses three scales at 1/16, 1/8, and 1/4 of the input resolution, and MS-RAFT+ adds a fourth scale at 1/2 resolution. At each scale, the recurrent network predicts a flow or motion field that is initialized to zero at the coarsest scale and upsampled from the previous scale at finer levels. Every scale refines its estimate with a fixed number of ConvGRU-based update iterations, followed by learned convex upsampling that initializes the estimate at the next-finer level (Jahedi et al., 2022, Jahedi et al., 2022, Schmid et al., 2 Jun 2025). Large displacements can thus be captured robustly at coarse scales, while fine structures are recovered at high resolution.
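The coarse-to-fine schedule described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `refine` is a hypothetical stand-in for the recurrent update block, nearest-neighbour upsampling replaces the learned convex upsampler, consecutive scales are assumed to differ by a factor of 2, and the image size is assumed to be divisible by the coarsest factor.

```python
import numpy as np

def upsample_flow_x2(flow):
    """Nearest-neighbour 2x upsampling (stand-in for learned upsampling).
    Vectors are doubled because one coarse pixel spans two fine pixels."""
    return 2.0 * flow.repeat(2, axis=0).repeat(2, axis=1)

def coarse_to_fine(shape_full, scales=(16, 8, 4), iters=4, refine=None):
    """shape_full: (H, W) of the input image.
    scales: downsampling factors ordered from coarse to fine.
    refine(flow, scale) -> residual update (hypothetical update block)."""
    if refine is None:
        refine = lambda flow, scale: np.zeros_like(flow)
    h, w = shape_full
    # Zero initialisation at the coarsest scale.
    flow = np.zeros((h // scales[0], w // scales[0], 2))
    for k, s in enumerate(scales):
        # Fixed number of recurrent refinement steps per scale.
        for _ in range(iters):
            flow = flow + refine(flow, s)
        # Propagate the estimate to the next-finer scale, if there is one.
        if k + 1 < len(scales):
            flow = upsample_flow_x2(flow)
    return flow

flow_quarter = coarse_to_fine((64, 128))
print(flow_quarter.shape)  # flow at 1/4 resolution: (16, 32, 2)
```

The point of the sketch is the control flow: each finer scale starts from an already plausible estimate, so its update steps only need to correct small residuals.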
In scene flow (MS-RAFT-3D), the backbone is a recurrent SE(3) field estimator (as in RAFT-3D), hierarchically nested across scales. The recurrent update block (for SE(3) and rigid motion embeddings) is identical at each scale, with upsampling of both SE(3) fields and embeddings to propagate state across the hierarchy (Schmid et al., 2 Jun 2025).
2. Multi-Scale Feature and Context Encoders
The feature encoders in MS-RAFT use a Siamese U-Net style backbone that generates multi-resolution features for each input frame. Successive conv–leaky-ReLU–downsampling blocks process each frame, yielding feature maps at each scale (1/16, 1/8, 1/4, and optionally 1/2). Feature enhancement at each level fuses upsampled coarser features with scale-specific detail using residual units (Jahedi et al., 2022). The context encoder, which conditions the update block, employs either a top-down hierarchy with learned channel widths per scale or a simplified sequence of residual blocks. Compared with earlier deep backbones (e.g., ResNet50 in RAFT-3D), this leads to significant model-size reductions together with improved accuracy (Schmid et al., 2 Jun 2025).
Empirically, the chosen channel widths and U-Net-based designs yield lower error than flat or untied context encoders (Schmid et al., 2 Jun 2025, Jahedi et al., 2022).
3. Hierarchical Cost Volume and Recurrent Update Block
At each scale, MS-RAFT constructs a local cost volume via inner-product (correlation) between feature vectors for small displacement windows. Unlike RAFT, which works only at a single fixed feature scale (typically 1/8), MS-RAFT generalizes the local cost window construction to all levels of the spatial pyramid (Jahedi et al., 2022). In MS-RAFT-3D, the 4D cost volume is built via all-pairs feature correlation at each scale, reflecting both optical flow and stereo cues for scene flow (Schmid et al., 2 Jun 2025).
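A local correlation cost volume of the kind described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the zero padding at the border, and the division by the square root of the channel count are assumptions made for this example.

```python
import numpy as np

def local_correlation(f1, f2, radius=1):
    """f1, f2: (H, W, C) feature maps of frame 1 and frame 2.
    Returns (H, W, (2*radius+1)**2) matching costs; the last axis enumerates
    displacements (dy, dx) in row-major order from -radius to +radius."""
    h, w, c = f1.shape
    # Zero-pad frame 2 so every displacement is defined at the border.
    pad = np.pad(f2, ((radius, radius), (radius, radius), (0, 0)))
    costs = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # shifted[y, x] equals f2[y + dy, x + dx] (zero outside the image).
            shifted = pad[radius + dy: radius + dy + h,
                          radius + dx: radius + dx + w]
            # Inner product between feature vectors, scaled by sqrt(C).
            costs.append((f1 * shifted).sum(axis=-1) / np.sqrt(c))
    return np.stack(costs, axis=-1)

rng = np.random.default_rng(0)
f1 = rng.normal(size=(6, 7, 8))
f1 /= np.linalg.norm(f1, axis=-1, keepdims=True)  # unit-norm features
f2 = np.roll(f1, shift=1, axis=1)                 # content moves one pixel right
cv = local_correlation(f1, f2, radius=1)
print(cv.shape)
```

Applying the same function to the feature maps of every pyramid level gives one local cost volume per scale; the window radius stays small because coarser levels already absorb large displacements.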
The update block at each scale is a ConvGRU-based module with shared parameters across pyramid levels. It takes as input the current flow or motion estimate, context features, correlation features (looked up from the local cost volume), and, in scene flow, additional embeddings (e.g., rigid-motion, confidence, and edge-weight estimates). In each recurrent step, corrections to the current estimate, embedding updates, and smoothing/regularization coefficients are derived from the updated hidden state.
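The gating logic of a ConvGRU step can be illustrated with a deliberately simplified cell. The sketch below assumes 1x1 convolutions (per-pixel matrix products) instead of spatial kernels and omits biases and output heads; the weight names are hypothetical and the input `x` stands for the concatenated flow, context, and correlation features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_gru_step(h, x, Wz, Wr, Wq):
    """One GRU update with 1x1 convolutions (a simplification).
    h: (H, W, Ch) hidden state; x: (H, W, Cx) input features.
    Wz, Wr, Wq: (Ch + Cx, Ch) weight matrices (hypothetical names)."""
    hx = np.concatenate([h, x], axis=-1)
    z = sigmoid(hx @ Wz)  # update gate: how much of the state to overwrite
    r = sigmoid(hx @ Wr)  # reset gate: how much old state feeds the candidate
    q = np.tanh(np.concatenate([r * h, x], axis=-1) @ Wq)  # candidate state
    return (1.0 - z) * h + z * q

rng = np.random.default_rng(0)
h = np.zeros((4, 5, 8))
x = rng.normal(size=(4, 5, 6))
Wz, Wr, Wq = (0.1 * rng.normal(size=(14, 8)) for _ in range(3))
h = conv_gru_step(h, x, Wz, Wr, Wq)
print(h.shape)
```

In a full update block, small convolutional heads would read the new hidden state to produce the flow correction and, for scene flow, the embedding updates; those heads are left out here.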
Convex upsampling modules, trained end-to-end, output non-negative spatial weights enabling sub-pixel-accurate, learned interpolation to finer scales, outperforming bilinear upsampling in accuracy (Jahedi et al., 2022, Jahedi et al., 2022).
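Convex upsampling can be sketched as follows: each fine-resolution pixel is a convex combination of the 3x3 coarse neighbourhood around its parent pixel. This is a minimal NumPy illustration under assumptions, not the authors' code: the `logits` array stands in for weights a network would predict, and edge padding at the border is a choice made for this example.

```python
import numpy as np

def convex_upsample(flow, logits, factor=2):
    """flow: (H, W, 2) coarse flow.
    logits: (H, W, 9, factor, factor) raw weights, one set of 9 per fine
    sub-pixel (here plain inputs; in a network they would be predicted).
    Returns (H*factor, W*factor, 2)."""
    h, w, _ = flow.shape
    # Softmax over the 9 neighbours: non-negative weights that sum to 1.
    e = np.exp(logits - logits.max(axis=2, keepdims=True))
    weights = e / e.sum(axis=2, keepdims=True)
    # Coarse flow rescaled to fine-pixel units, edge-padded for the border.
    pad = np.pad(flow, ((1, 1), (1, 1), (0, 0)), mode="edge") * factor
    # 3x3 neighbourhood of every coarse pixel: (H, W, 9, 2).
    neigh = np.stack([pad[dy:dy + h, dx:dx + w]
                      for dy in range(3) for dx in range(3)], axis=2)
    # Convex combination for each sub-pixel: (H, W, factor, factor, 2).
    up = np.einsum("hwkij,hwkc->hwijc", weights, neigh)
    # Interleave sub-pixels into the fine grid.
    return up.transpose(0, 2, 1, 3, 4).reshape(h * factor, w * factor, 2)

rng = np.random.default_rng(0)
coarse = rng.normal(size=(3, 4, 2))
fine = convex_upsample(coarse, rng.normal(size=(3, 4, 9, 2, 2)))
print(fine.shape)
```

Because the weights are non-negative and sum to one, every upsampled vector stays within the range spanned by its coarse neighbours, which is what lets the network sharpen motion boundaries without overshooting.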
4. Training Loss and Optimization
MS-RAFT uses a fully multi-scale, multi-iteration loss that supervises each predicted flow (or scene flow) field at every iteration of every scale, generalizing the “deep supervision” found in RAFT. For flow, one loss term is defined per scale and per iteration, and the terms are exponentially weighted to emphasize later scales and iterations. Fine-tuning incorporates a robust, sample-wise loss (see Jahedi et al., 2022, Jahedi et al., 2022).
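An exponentially weighted loss over all scales and iterations might look like the sketch below. It is not the formula from the cited papers: the decay factor `gamma`, the use of the Euclidean norm, and the flattening of scales and iterations into one ordered sequence are assumptions made purely for illustration.

```python
import numpy as np

def multiscale_sequence_loss(preds, target, gamma=0.8):
    """preds: list over scales (coarse to fine), each a list over iterations
    of (H, W, 2) flow fields already upsampled to the target resolution.
    target: (H, W, 2) ground-truth flow.
    gamma: hypothetical decay in (0, 1); later predictions weigh more."""
    # Order all predictions from first coarse iteration to last fine one.
    flat = [p for per_scale in preds for p in per_scale]
    n = len(flat)
    total = 0.0
    for k, p in enumerate(flat):
        weight = gamma ** (n - 1 - k)  # the final prediction gets weight 1
        # Mean per-pixel endpoint error of this prediction.
        total += weight * np.linalg.norm(p - target, axis=-1).mean()
    return total

target = np.zeros((4, 4, 2))
off_by_one = target + np.array([1.0, 0.0])  # endpoint error of 1 everywhere
preds = [[off_by_one, off_by_one], [off_by_one, off_by_one]]
print(multiscale_sequence_loss(preds, target))
```

With a weight below one, early coarse predictions still receive a gradient signal, but the optimization is dominated by the final iterations at the finest scale.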
Scene flow extends the target to (2D flow, depth-residual), adds weighted terms for confidence and reverse flows, and supervises via a generalized multi-scale formulation (Schmid et al., 2 Jun 2025).
Optimization relies on pretraining on synthetic datasets (e.g., FlyingChairs, FlyingThings3D), followed by mixed or dataset-specific fine-tuning, with no data augmentation beyond that of the original RAFT.
5. Key Empirical Results and Ablations
Extensive quantitative evaluation of MS-RAFT verifies substantial accuracy gains over single-scale RAFT and comparable baselines:
| Benchmark | RAFT | MS-RAFT | MS-RAFT+ |
|---|---|---|---|
| VIPER (Test) EPE | 0.58 | 0.49 | 0.41 |
| KITTI Fl-all (%) | 5.10 | 4.88 | 4.15 |
| Sintel Clean EPE | 1.43 | 1.37 | 1.23 |
| Sintel Final (small) | 0.55 | 0.47 | 0.42 |
MS-RAFT+ took first place overall among all methods in the Robust Vision Challenge 2022 (Jahedi et al., 2022). Notably, the largest relative improvements occur in non-occluded regions and on “small motion” subsets (Jahedi et al., 2022, Jahedi et al., 2022). The scene flow variants (MS-RAFT-3D, MS-RAFT-3D+) yield relative accuracy gains of 8.7% on KITTI and 65.8% on Spring over the previous state of the art, while reducing model size by ∼17 million parameters (Schmid et al., 2 Jun 2025).
Ablation studies confirm the necessity of multi-scale processing, learned upsampling, and U-Net feature encoders. Disabling bi-Laplacian embedding smoothing in scene flow nearly doubles error, underscoring the importance of spatial coherence priors (Schmid et al., 2 Jun 2025).
6. Implementation and Computational Trade-offs
A multi-scale configuration (typically S=3, with scaling factors 1/16, 1/8, and 1/4, plus 1/2 for MS-RAFT+) with partially shared update blocks is reported to give the best balance between parameter efficiency and prediction accuracy (Jahedi et al., 2022, Jahedi et al., 2022). MS-RAFT increases the parameter count (13.5 M vs 5.3 M for RAFT) and the inference latency (0.3 s/pair vs 0.09 s), which corresponds to roughly 3 rather than 11 frame pairs per second. MS-RAFT+ introduces on-demand cost computation with custom CUDA kernels to support a 1/2 resolution scale without prohibitive memory cost (Jahedi et al., 2022).
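A back-of-the-envelope estimate shows why memory becomes the bottleneck at 1/2 resolution. The sketch assumes a fully materialized all-pairs cost volume stored in float32 and a hypothetical 512x1024 input; neither figure comes from the cited papers.

```python
def all_pairs_bytes(height, width, scale, bytes_per_value=4):
    """Memory of a dense all-pairs cost volume at 1/scale resolution,
    assuming one float32 value per pixel pair (an assumption)."""
    n = (height // scale) * (width // scale)  # pixels per feature map
    return n * n * bytes_per_value            # one cost per pixel pair

for scale in (16, 8, 4, 2):
    gib = all_pairs_bytes(512, 1024, scale) / 2**30
    print(f"1/{scale} resolution: {gib:.4g} GiB")
```

Under these assumptions the volume takes 0.25 GiB at 1/8 resolution but 64 GiB at 1/2 resolution, since halving the scale factor multiplies the number of pixel pairs by 16. Computing costs only for the displacements that are actually looked up avoids storing this volume.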
The convex upsampler, shared across scales, is lightweight yet critical for accurate high-resolution estimation (Jahedi et al., 2022).
7. Extensions and Broader Impact
The MS-RAFT design paradigm has been successfully generalized to 3D scene flow (MS-RAFT-3D), where dense SE(3) fields are inferred over multi-scale hierarchies. The approach is shown to outperform previous scene flow methods by a large margin, demonstrating the continued utility and extensibility of hierarchical, recurrent, all-pairs estimation (Schmid et al., 2 Jun 2025).
MS-RAFT further demonstrates that blending classical multi-scale/hierarchical strategies with strong modern backbones (RAFT) yields state-of-the-art accuracy and improves generalization across datasets without the need for elaborate fine-tuning schedules (Jahedi et al., 2022, Jahedi et al., 2022). A plausible implication is that similar hybrid designs may be beneficial across other dense correspondence and motion estimation tasks.