RAFT Optical Flow Estimation
- Optical Flow (RAFT) is a deep learning technique that computes dense motion fields using an all-pairs correlation volume and recurrent refinement.
- Multi-scale extensions such as MS-RAFT and MS-RAFT+ integrate hierarchical features and robust upsampling to improve accuracy and efficiency.
- Advanced strategies like on-demand cost computation and block-sparse sampling drastically reduce memory usage and runtime, enhancing performance.
Optical flow estimation quantifies the dense motion field between two consecutive frames by computing the apparent displacement of image pixels. RAFT (Recurrent All-Pairs Field Transforms) constitutes a paradigm shift in modern learning-based flow estimation by leveraging an all-pairs correlation volume and iterative, recurrent refinement. Extensions such as Multi-Scale RAFT, MS-RAFT+, SEA-RAFT, PRAFlow, and efficient cost-volume sampling algorithms have further advanced accuracy, efficiency, scalability, and robustness. This article delineates the principles, architectural details, computational strategies, benchmark results, and recent innovations in the RAFT family and its multi-scale variants.
1. The RAFT Architecture: All-Pairs, Recurrent Regression
RAFT (Teed et al., 2020) establishes a foundational architecture for optical flow by encoding both input images into a compact feature space via a shared CNN. Features are computed at 1/8 of the input spatial resolution. The keystone is the dense 4D all-pairs correlation volume

$$C_{ijkl} = \sum_{h} g_\theta(I_1)_{ijh} \cdot g_\theta(I_2)_{klh}.$$

With $H$ and $W$ the downsampled feature dimensions, the volume has size $H \times W \times H \times W$, i.e., memory grows as $O((HW)^2)$. RAFT constructs a four-level correlation pyramid by average pooling the last two (target-image) dimensions of the correlation volume with kernel sizes $1$, $2$, $4$, and $8$.
This enables matching across both small and large displacements, circumventing coarse-to-fine warping.
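A minimal PyTorch sketch of the all-pairs volume and its pooled pyramid (tensor names, the $\sqrt{D}$ normalization, and shapes follow the description above; this is an illustration, not the reference implementation):

```python
import torch
import torch.nn.functional as F

def correlation_pyramid(fmap1, fmap2, num_levels=4):
    """Build the 4D all-pairs correlation volume and its pooled pyramid.

    fmap1, fmap2: feature maps of shape (B, D, H, W) at 1/8 resolution.
    Returns a list of `num_levels` volumes; level k has shape
    (B*H*W, 1, H/2^k, W/2^k), with source pixels folded into the batch axis.
    """
    B, D, H, W = fmap1.shape
    # All-pairs dot products: corr[b, i, j, k, l] = <fmap1[b, :, i, j], fmap2[b, :, k, l]>
    corr = torch.einsum('bdij,bdkl->bijkl', fmap1, fmap2) / D ** 0.5
    corr = corr.reshape(B * H * W, 1, H, W)

    pyramid = [corr]
    for _ in range(num_levels - 1):
        # Pool only the target-image dimensions (last two axes), as described above.
        corr = F.avg_pool2d(corr, kernel_size=2, stride=2)
        pyramid.append(corr)
    return pyramid

# Tiny usage example with random features.
f1, f2 = torch.randn(1, 32, 16, 24), torch.randn(1, 32, 16, 24)
print([p.shape for p in correlation_pyramid(f1, f2)])
```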
At each of $T$ iterations, RAFT refines its flow field by looking up local patches in the correlation pyramid around the current flow estimate. The update operator is a convolutional GRU that ingests the sampled cost patch, flow features, and per-pixel context from a separate encoder. The recurrence is

$$h_t = \mathrm{GRU}(h_{t-1}, x_t), \qquad f_{t+1} = f_t + \Delta f_t,$$

where $x_t$ concatenates the sampled correlation features, the current flow, and the context features, and the residual update $\Delta f_t$ is decoded from the hidden state $h_t$. Training is supervised by an exponentially weighted sum of endpoint errors over all iterates,

$$\mathcal{L} = \sum_{t=1}^{T} \gamma^{\,T-t} \,\| f_{gt} - f_t \|_1, \qquad \gamma < 1.$$
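A schematic sketch of the refinement loop and the sequence loss; `lookup` and `update_gru` are placeholders for RAFT's correlation lookup and ConvGRU modules, and `gamma=0.8` is a typical choice:

```python
import torch

def refine(flow, hidden, context, lookup, update_gru, num_iters=12):
    """Iterative refinement: sample cost patches around the current flow, update flow."""
    predictions = []
    for _ in range(num_iters):
        corr_feat = lookup(flow)                 # cost patches from the correlation pyramid
        hidden, delta = update_gru(hidden, context, corr_feat, flow)
        flow = flow + delta                      # residual flow update
        predictions.append(flow)
    return predictions

def sequence_loss(flow_predictions, flow_gt, gamma=0.8):
    """Exponentially weighted sum of L1 endpoint errors over all iterates."""
    T = len(flow_predictions)
    loss = 0.0
    for t, flow_t in enumerate(flow_predictions):
        loss = loss + gamma ** (T - t - 1) * (flow_t - flow_gt).abs().mean()
    return loss
```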
2. Multi-Scale Extensions: Hierarchical RAFT Variants
2.1. Multi-Scale RAFT (MS-RAFT)
MS-RAFT (Jahedi et al., 2022) integrates hierarchical concepts absent from vanilla RAFT: a partially shared coarse-to-fine pipeline spanning three scales (1/16, 1/8, 1/4), U-Net pyramid encoders for multi-scale features, hierarchical cost volumes, and a multi-scale, multi-iteration robust loss. At each scale $s$, features are produced, a (local) correlation pyramid is constructed, and the shared GRU block runs a fixed number of iterations. The coarse-scale flow is upsampled via a learned convex mask to initialize the next finer scale, as sketched below. Losses from all scales and iterations contribute via exponential weights.
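A coarse-to-fine driver in the spirit of MS-RAFT might look as follows; `run_raft_stage` stands in for the shared correlation/GRU stage, and bilinear interpolation stands in for the learned convex upsampler used between scales (a sketch under these assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def coarse_to_fine(features_per_scale, run_raft_stage, iters_per_scale=(4, 6, 8)):
    """Hierarchical estimation: each scale refines the upsampled coarser flow.

    features_per_scale: list of (fmap1, fmap2) pairs ordered coarse -> fine.
    run_raft_stage:     shared RAFT-style stage (correlation lookup + GRU iterations).
    """
    flow = None
    for (f1, f2), iters in zip(features_per_scale, iters_per_scale):
        B, _, H, W = f1.shape
        if flow is None:
            flow = torch.zeros(B, 2, H, W, device=f1.device)
        else:
            # Upsample the coarser estimate to the current resolution and
            # double the displacement vectors (resolution changes by 2x).
            flow = 2.0 * F.interpolate(flow, size=(H, W), mode='bilinear',
                                       align_corners=True)
        flow = run_raft_stage(f1, f2, flow, iters)
    return flow
```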
2.2. MS-RAFT+
MS-RAFT+ (Jahedi et al., 2022), which won the Robust Vision Challenge 2022, advances MS-RAFT by adding a finer fourth scale at half resolution (S₄) and utilizing on-demand cost computation to operate at this higher resolution without prohibitive memory. Correlation costs at each pixel $x$ and iteration are computed only in a local window centered at the current flow target $x + f(x)$:

$$C(x, y) = \big\langle g_\theta(I_1)(x),\, g_\theta(I_2)(y) \big\rangle, \qquad y \in \mathcal{N}_r\!\big(x + f(x)\big).$$

A shared convex combination upsampler (mask-based, non-bilinear) preserves fine motion boundaries and small-displacement details across all upsampling transitions. Training employs a mixed fine-tuning schedule with a robust sample-wise loss, fostering cross-benchmark generalization.
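An unoptimized sketch of on-demand cost computation: for every pixel, only the $(2r+1)^2$ dot products inside the window around $x + f(x)$ are evaluated (MS-RAFT+ realizes this with a dedicated implementation; the names and the bilinear sampling here are illustrative):

```python
import torch
import torch.nn.functional as F

def on_demand_correlation(fmap1, fmap2, flow, radius=4):
    """Correlation costs only in a (2r+1)^2 window around x + flow(x).

    fmap1, fmap2: (B, D, H, W) feature maps; flow: (B, 2, H, W).
    Returns per-pixel cost patches of shape (B, (2r+1)**2, H, W).
    """
    B, D, H, W = fmap1.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=flow.device),
                            torch.arange(W, device=flow.device), indexing='ij')
    base = torch.stack((xs, ys), dim=0).float()          # (2, H, W), (x, y) order
    costs = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            offset = torch.tensor([dx, dy], device=flow.device).view(1, 2, 1, 1)
            coords = base + flow + offset                 # target coordinates per pixel
            # Normalize to [-1, 1] and bilinearly sample the target features.
            gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
            gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
            grid = torch.stack((gx, gy), dim=-1)          # (B, H, W, 2)
            f2 = F.grid_sample(fmap2, grid, align_corners=True)
            costs.append((fmap1 * f2).sum(dim=1) / D ** 0.5)   # per-pixel dot product
    return torch.stack(costs, dim=1)
```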
2.3. PRAFlow
PRAFlow (Wan et al., 2020) uses a two-level pyramid (1/8, 1/4) for flow estimation, each with its own local cost volume and shared RAFT-GRU update block. The final flow is obtained via convex upsampling. PRAFlow demonstrates improved performance on small motion regions and occlusion boundaries compared to single-scale RAFT.
Comparative Table: Multi-Scale RAFT Variants
| Model | #Scales | Upsampling | Cost Computation | Robust Loss | Highlight |
|---|---|---|---|---|---|
| RAFT | 1 × 1/8 | Convex combo | Precompute all-pairs | L1 seq. | Standard, high efficiency |
| MS-RAFT | 3 (1/16–1/4) | Convex combo | Precompute local/pyramid | L1/L2 hybrid | Coarse-to-fine, U-Net fusion |
| MS-RAFT+ | 4 (to 1/2) | Convex combo | On-demand per-pixel | Robust | 1st RVC'22, best cross-benchmark |
| PRAFlow | 2 (1/8,1/4) | Convex combo | Local, per-stage | L2 (coarse/fine) | 2nd RVC'20, small motion/occlusion |
3. Computational Strategies: Correlation Volume Sampling
The quadratic size of the all-pairs volume, $O((HW)^2)$, limits RAFT-like models to moderate resolutions. On-demand sampling (as in MS-RAFT+) sidesteps the memory blowup by computing only the required dot products per iteration; however, this incurs a runtime overhead.
Efficient block-sparse sampling (Briedis et al., 22 May 2025) refines this by precomputing only the block subsets of the 4D volume that are actually sampled across all GRU iterations and pixels, exploiting the significant overlap between lookups. Blocks are constructed by partitioning the feature maps, with a per-iteration mask indicating the required block pairs. Batched block GEMM (BSR-SpMM) produces the required submatrices, which are cached and reused. This approach matches the mathematical operator of vanilla RAFT, yet provides up to 90% runtime reduction and 95% memory savings compared to full precomputation, and matches the memory footprint of on-demand approaches at dramatically higher speed.
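A simplified sketch of the bookkeeping step only, under the assumption that each lookup window spans at most two blocks per axis; the actual method relies on batched block-sparse GEMM kernels, which are not reproduced here:

```python
import torch

def required_block_mask(flow, lookup_radius, block, H2, W2):
    """Mark which (source-block, target-block) pairs the lookups will touch.

    flow: (B, 2, H, W) current flow estimate; block: block edge length.
    Returns a boolean mask over (source block, target block) pairs; only the
    marked pairs need their dot products computed.
    """
    B, _, H, W = flow.shape
    mask = torch.zeros((H // block) * (W // block),
                       (H2 // block) * (W2 // block),
                       dtype=torch.bool, device=flow.device)

    ys, xs = torch.meshgrid(torch.arange(H, device=flow.device),
                            torch.arange(W, device=flow.device), indexing='ij')
    src_block = ((ys // block) * (W // block) + xs // block).reshape(-1)
    for b in range(B):
        tx = (xs + flow[b, 0].round().long()).clamp(0, W2 - 1)
        ty = (ys + flow[b, 1].round().long()).clamp(0, H2 - 1)
        # Corners of each lookup window (sufficient if a window spans <= 2 blocks/axis).
        for dy in (-lookup_radius, lookup_radius):
            for dx in (-lookup_radius, lookup_radius):
                cx = (tx + dx).clamp(0, W2 - 1)
                cy = (ty + dy).clamp(0, H2 - 1)
                tgt_block = ((cy // block) * (W2 // block) + cx // block).reshape(-1)
                mask[src_block, tgt_block] = True
    return mask
```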
4. Upsampling Methods: Convex Combination and Sparse Interpolation
RAFT employs a boundary-aware convex combination upsampler. Each high-resolution flow value is a convex combination (non-negative weights summing to one) of its $3 \times 3$ coarse-grid neighbors:

$$f_{\text{hi}}(x) = \sum_{k \in \mathcal{N}_{3\times3}(x)} w_k(x)\, f_{\text{lo}}(k), \qquad w_k(x) \ge 0, \quad \sum_k w_k(x) = 1,$$

with the weights $w_k$ predicted by the network.
MS-RAFT+, MS-RAFT, and PRAFlow share this learned upsampler across scales for efficient flow refinement and sharp motion boundaries.
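A sketch of the convex combination upsampler following the published RAFT scheme (an upsampling factor of 8 and 3×3 coarse neighborhoods are assumed; `mask` is the weight tensor predicted by the network):

```python
import torch
import torch.nn.functional as F

def convex_upsample(flow, mask, factor=8):
    """Upsample flow by `factor` using learned convex combinations.

    flow: (B, 2, H, W) low-resolution flow.
    mask: (B, 9 * factor**2, H, W) unnormalized weights over each output
          pixel's 3x3 coarse-grid neighborhood.
    """
    B, _, H, W = flow.shape
    mask = mask.view(B, 1, 9, factor, factor, H, W)
    mask = torch.softmax(mask, dim=2)            # non-negative weights summing to one

    # 3x3 neighborhoods of the (magnitude-rescaled) coarse flow.
    up = F.unfold(factor * flow, kernel_size=3, padding=1)   # (B, 2*9, H*W)
    up = up.view(B, 2, 9, 1, 1, H, W)

    up = (mask * up).sum(dim=2)                  # convex combination per output pixel
    up = up.permute(0, 1, 4, 2, 5, 3)            # (B, 2, H, factor, W, factor)
    return up.reshape(B, 2, factor * H, factor * W)
```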
Normalized Convolution Upsampling (NCUP) (Eldesokey et al., 2021) further refines this process by formulating upsampling as sparse interpolation guided by confidence maps, solved via cascaded normalized convolution layers. NCUP sharply delineates objects, improves cross-dataset generalization, and uses 80% fewer parameters than RAFT's standard upsampler.
5. Robustness, Handling Outliers, and Generalization
MS-RAFT+’s robust sample-wise loss downweights outlier-rich samples, improving cross-domain robustness. The multi-scale, multi-iteration loss in MS-RAFT is critical; without it, error increases dramatically.
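Purely as an illustration of the idea, a sub-linear penalty of the form $(\|e\| + \epsilon)^q$ with $q < 1$ reduces the influence of samples dominated by large outlier errors; the exact loss and constants used in MS-RAFT/MS-RAFT+ are given in the original papers and may differ from this sketch:

```python
import torch

def robust_multiscale_loss(preds_per_scale, gts_per_scale, gamma=0.8, eps=0.01, q=0.7):
    """Exponentially weighted multi-scale loss with a sub-linear (robust) penalty.

    preds_per_scale: per scale, the list of per-iteration flow predictions (B, 2, H, W).
    gts_per_scale:   ground-truth flow resampled to each scale's resolution.
    eps and q are illustrative constants; q < 1 downweights large (outlier) errors.
    """
    loss = 0.0
    for preds, gt in zip(preds_per_scale, gts_per_scale):
        T = len(preds)
        for t, flow_t in enumerate(preds):
            epe = torch.norm(flow_t - gt, dim=1)       # per-pixel endpoint error
            loss = loss + gamma ** (T - t - 1) * ((epe + eps) ** q).mean()
    return loss
```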
SEA-RAFT (Wang et al., 23 May 2024) introduces direct regression of an initial flow, rigid-motion pre-training (on TartanAir data), and a mixture-of-Laplace likelihood, reducing the number of required refinement iterations and increasing both speed and cross-dataset accuracy. SEA-RAFT achieves substantial speedups and lower EPE at comparable model size.
Unsupervised methods such as SMURF (Stone et al., 2021), based on RAFT, exploit self-teaching, multi-frame sequence-aware losses, photometric warping, and occlusion inpainting, achieving substantial error reductions over prior unsupervised methods and outperforming multiple supervised networks.
6. Benchmarks and Quantitative Performance
MS-RAFT+ attained first place at RVC 2022, ranking first on VIPER, second on KITTI, Sintel, and Middlebury, with overall best accuracy (Jahedi et al., 2022). Compared to MS-RAFT, significant improvements were measured:
- KITTI Fl-all: $4.15$ vs $4.88$ (≈15% lower)
- Sintel Clean EPE: $1.232$ vs $1.374$ (≈10% lower)
- Middlebury EPE: $0.142$ vs $0.184$ (≈23% lower)
SEA-RAFT reached a $3.686\%$ 1px outlier rate and $0.363$ EPE on Spring, further improving speed and accuracy (Wang et al., 23 May 2024). Efficient block-sparse sampling (Briedis et al., 22 May 2025) enabled native 8K flow estimation on commodity GPUs, reducing correlation sampling runtime by up to 90% while maintaining accuracy.
7. Current Challenges and Future Directions
Recent avenues in the RAFT family include:
- Global matching and overlapping attention (GMFlowNet (Zhao et al., 2022)) for large displacement improvements
- Context-guided correlation volumes (CGCV (Li et al., 2022)) for robust matching under adverse visual conditions
- Attention-based feature localization and amorphous lookup operators (Ef-RAFT (Eslami et al., 1 Jan 2024)) to address large motions and repetitive regions with lightweight extensions
- Ultra-high-resolution deployment via block-sparse sampling and cascaded inference strategies (Briedis et al., 22 May 2025)
Persisting bottlenecks involve quadratic memory for naive correlation, efficiency of cost sampling at fine scale, occlusion modeling, and generalization across diverse visual domains. Promising future directions include correlation volume compression, adaptive pyramid depth selection, and cross-task generalization to related correspondence problems (stereo, scene flow, depth estimation).
The RAFT architecture, together with its multi-scale and efficient derivatives, forms the foundation of current state-of-the-art optical flow estimation, supporting accuracy, robustness, and scalability for both supervised and unsupervised pipelines across academic and practical benchmarks.