RAFT Optical Flow Estimation
- Optical Flow (RAFT) is a deep learning technique that computes dense motion fields using an all-pairs correlation volume and recurrent refinement.
- Multi-scale extensions such as MS-RAFT and MS-RAFT+ integrate hierarchical features and robust upsampling to improve accuracy and efficiency.
- Advanced strategies like on-demand cost computation and block-sparse sampling drastically reduce memory usage and runtime, enhancing performance.
Optical flow estimation quantifies the dense motion field between two consecutive frames by computing the apparent displacement of image pixels. RAFT (Recurrent All-Pairs Field Transforms) constitutes a paradigm shift in modern learning-based flow estimation by leveraging an all-pairs correlation volume and iterative, recurrent refinement. Extensions such as Multi-Scale RAFT, MS-RAFT+, SEA-RAFT, PRAFlow, and efficient cost-volume sampling algorithms have further advanced accuracy, efficiency, scalability, and robustness. This article delineates the principles, architectural details, computational strategies, benchmark results, and recent innovations in the RAFT family and its multi-scale variants.
1. The RAFT Architecture: All-Pairs, Recurrent Regression
RAFT (Teed et al., 2020) establishes a foundational architecture for optical flow by encoding both input images into a compact feature space via a shared CNN. Features are computed at 1/8 of the input spatial resolution. The keystone is the dense 4D all-pairs correlation volume

$$C_{ijkl} = \sum_{h} g_\theta(I_1)_{ijh} \cdot g_\theta(I_2)_{klh}.$$

With $H$ and $W$ the downsampled feature dimensions, the volume has size $H \times W \times H \times W$, i.e., memory grows as $O((HW)^2)$. RAFT constructs a four-level correlation pyramid by average pooling the last two (target-image) dimensions of the correlation volume with kernel sizes $1$, $2$, $4$, and $8$.
This enables matching across both small and large displacements, circumventing coarse-to-fine warping.
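A minimal PyTorch sketch of the all-pairs volume and its pooled pyramid (tensor names, the $\sqrt{D}$ normalization, and shapes follow the description above; this is an illustration, not the reference implementation):

```python
import torch
import torch.nn.functional as F

def correlation_pyramid(fmap1, fmap2, num_levels=4):
    """Build the 4D all-pairs correlation volume and its pooled pyramid.

    fmap1, fmap2: feature maps of shape (B, D, H, W) at 1/8 resolution.
    Returns a list of `num_levels` volumes; level k has shape
    (B*H*W, 1, H/2^k, W/2^k), with source pixels folded into the batch axis.
    """
    B, D, H, W = fmap1.shape
    # All-pairs dot products: corr[b, i, j, k, l] = <fmap1[b, :, i, j], fmap2[b, :, k, l]>
    corr = torch.einsum('bdij,bdkl->bijkl', fmap1, fmap2) / D ** 0.5
    corr = corr.reshape(B * H * W, 1, H, W)

    pyramid = [corr]
    for _ in range(num_levels - 1):
        # Pool only the target-image dimensions (last two axes), as described above.
        corr = F.avg_pool2d(corr, kernel_size=2, stride=2)
        pyramid.append(corr)
    return pyramid

# Tiny usage example with random features.
f1, f2 = torch.randn(1, 32, 16, 24), torch.randn(1, 32, 16, 24)
print([p.shape for p in correlation_pyramid(f1, f2)])
```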
At each of $T$ iterations, RAFT refines its flow field by looking up local patches in the correlation pyramid around the current flow estimate. The update operator is a convolutional GRU that ingests the sampled cost patch, flow features, and per-pixel context from a separate encoder. The recurrence is

$$h_t = \mathrm{GRU}(h_{t-1}, x_t), \qquad f_{t+1} = f_t + \Delta f_t,$$

where $x_t$ concatenates the sampled correlation features, the current flow, and the context features, and the residual update $\Delta f_t$ is decoded from the hidden state $h_t$. Training is supervised by an exponentially weighted sum of endpoint errors over all iterates,

$$\mathcal{L} = \sum_{t=1}^{T} \gamma^{\,T-t} \,\| f_{gt} - f_t \|_1, \qquad \gamma < 1.$$
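A schematic sketch of the refinement loop and the sequence loss; `lookup` and `update_gru` are placeholders for RAFT's correlation lookup and ConvGRU modules, and `gamma=0.8` is a typical choice:

```python
import torch

def refine(flow, hidden, context, lookup, update_gru, num_iters=12):
    """Iterative refinement: sample cost patches around the current flow, update flow."""
    predictions = []
    for _ in range(num_iters):
        corr_feat = lookup(flow)                 # cost patches from the correlation pyramid
        hidden, delta = update_gru(hidden, context, corr_feat, flow)
        flow = flow + delta                      # residual flow update
        predictions.append(flow)
    return predictions

def sequence_loss(flow_predictions, flow_gt, gamma=0.8):
    """Exponentially weighted sum of L1 endpoint errors over all iterates."""
    T = len(flow_predictions)
    loss = 0.0
    for t, flow_t in enumerate(flow_predictions):
        loss = loss + gamma ** (T - t - 1) * (flow_t - flow_gt).abs().mean()
    return loss
```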
2. Multi-Scale Extensions: Hierarchical RAFT Variants
2.1. Multi-Scale RAFT (MS-RAFT)
MS-RAFT (Jahedi et al., 2022) integrates hierarchical concepts absent from vanilla RAFT: a partially shared coarse-to-fine pipeline spanning three scales (1/16, 1/8, 1/4), U-Net pyramid encoders for multi-scale features, hierarchical cost volumes, and a multi-scale, multi-iteration robust loss. At each scale $s$, features are produced, a (local) correlation pyramid is constructed, and the shared GRU block runs a fixed number of iterations. The coarse-scale flow is upsampled via a learned convex mask to initialize the next finer scale, as sketched below. Losses from all scales and iterations contribute via exponential weights.
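A coarse-to-fine driver in the spirit of MS-RAFT might look as follows; `run_raft_stage` stands in for the shared correlation/GRU stage, and bilinear interpolation stands in for the learned convex upsampler used between scales (a sketch under these assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def coarse_to_fine(features_per_scale, run_raft_stage, iters_per_scale=(4, 6, 8)):
    """Hierarchical estimation: each scale refines the upsampled coarser flow.

    features_per_scale: list of (fmap1, fmap2) pairs ordered coarse -> fine.
    run_raft_stage:     shared RAFT-style stage (correlation lookup + GRU iterations).
    """
    flow = None
    for (f1, f2), iters in zip(features_per_scale, iters_per_scale):
        B, _, H, W = f1.shape
        if flow is None:
            flow = torch.zeros(B, 2, H, W, device=f1.device)
        else:
            # Upsample the coarser estimate to the current resolution and
            # double the displacement vectors (resolution changes by 2x).
            flow = 2.0 * F.interpolate(flow, size=(H, W), mode='bilinear',
                                       align_corners=True)
        flow = run_raft_stage(f1, f2, flow, iters)
    return flow
```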
2.2. MS-RAFT+
MS-RAFT+ (Jahedi et al., 2022), which won the Robust Vision Challenge 2022, advances MS-RAFT by adding a finer fourth scale at half resolution (S₄) and utilizing on-demand cost computation to operate at this higher resolution without prohibitive memory. Correlation costs at each pixel $x$ and iteration are computed only in a local window centered at the current flow target $x + f(x)$:

$$C(x, y) = \big\langle g_\theta(I_1)(x),\, g_\theta(I_2)(y) \big\rangle, \qquad y \in \mathcal{N}_r\!\big(x + f(x)\big).$$

A shared convex combination upsampler (mask-based, non-bilinear) preserves fine motion boundaries and small-displacement details across all upsampling transitions. Training employs a mixed fine-tuning schedule with a robust sample-wise loss, fostering cross-benchmark generalization.
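An unoptimized sketch of on-demand cost computation: for every pixel, only the $(2r+1)^2$ dot products inside the window around $x + f(x)$ are evaluated (MS-RAFT+ realizes this with a dedicated implementation; the names and the bilinear sampling here are illustrative):

```python
import torch
import torch.nn.functional as F

def on_demand_correlation(fmap1, fmap2, flow, radius=4):
    """Correlation costs only in a (2r+1)^2 window around x + flow(x).

    fmap1, fmap2: (B, D, H, W) feature maps; flow: (B, 2, H, W).
    Returns per-pixel cost patches of shape (B, (2r+1)**2, H, W).
    """
    B, D, H, W = fmap1.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=flow.device),
                            torch.arange(W, device=flow.device), indexing='ij')
    base = torch.stack((xs, ys), dim=0).float()          # (2, H, W), (x, y) order
    costs = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            offset = torch.tensor([dx, dy], device=flow.device).view(1, 2, 1, 1)
            coords = base + flow + offset                 # target coordinates per pixel
            # Normalize to [-1, 1] and bilinearly sample the target features.
            gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
            gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
            grid = torch.stack((gx, gy), dim=-1)          # (B, H, W, 2)
            f2 = F.grid_sample(fmap2, grid, align_corners=True)
            costs.append((fmap1 * f2).sum(dim=1) / D ** 0.5)   # per-pixel dot product
    return torch.stack(costs, dim=1)
```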
2.3. PRAFlow
PRAFlow (Wan et al., 2020) uses a two-level pyramid (1/8, 1/4) for flow estimation, each with its own local cost volume and shared RAFT-GRU update block. The final flow is obtained via convex upsampling. PRAFlow demonstrates improved performance on small motion regions and occlusion boundaries compared to single-scale RAFT.
Comparative Table: Multi-Scale RAFT Variants
| Model | #Scales | Upsampling | Cost Computation | Robust Loss | Highlight |
|---|---|---|---|---|---|
| RAFT | 1 × 1/8 | Convex combo | Precompute all-pairs | L1 seq. | Standard, high efficiency |
| MS-RAFT | 3 (1/16–1/4) | Convex combo | Precompute local/pyramid | L1/L2 hybrid | Coarse-to-fine, U-Net fusion |
| MS-RAFT+ | 4 (to 1/2) | Convex combo | On-demand per-pixel | Robust | 1st RVC'22, best cross-benchmark |
| PRAFlow | 2 (1/8,1/4) | Convex combo | Local, per-stage | L2 (coarse/fine) | 2nd RVC'20, small motion/occlusion |
3. Computational Strategies: Correlation Volume Sampling
The quadratic size of the all-pairs volume, $O((HW)^2)$, limits RAFT-like models to moderate resolutions. On-demand sampling (as in MS-RAFT+) sidesteps the memory blowup by computing only the required dot products per iteration; however, this incurs a runtime overhead.
Efficient block-sparse sampling (Briedis et al., 22 May 2025) refines this by precomputing only the block subsets of the 4D volume that are actually sampled across all GRU iterations and pixels, exploiting the significant overlap between lookups. Blocks are constructed by partitioning the feature maps, with a per-iteration mask indicating the required block pairs. Batched block GEMM (BSR-SpMM) produces the required submatrices, which are cached and reused. This approach matches the mathematical operator of vanilla RAFT, yet provides up to 90% runtime reduction and 95% memory savings compared to full precomputation, and matches the memory footprint of on-demand approaches at dramatically higher speed.
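A simplified sketch of the bookkeeping step only, under the assumption that each lookup window spans at most two blocks per axis; the actual method relies on batched block-sparse GEMM kernels, which are not reproduced here:

```python
import torch

def required_block_mask(flow, lookup_radius, block, H2, W2):
    """Mark which (source-block, target-block) pairs the lookups will touch.

    flow: (B, 2, H, W) current flow estimate; block: block edge length.
    Returns a boolean mask over (source block, target block) pairs; only the
    marked pairs need their dot products computed.
    """
    B, _, H, W = flow.shape
    mask = torch.zeros((H // block) * (W // block),
                       (H2 // block) * (W2 // block),
                       dtype=torch.bool, device=flow.device)

    ys, xs = torch.meshgrid(torch.arange(H, device=flow.device),
                            torch.arange(W, device=flow.device), indexing='ij')
    src_block = ((ys // block) * (W // block) + xs // block).reshape(-1)
    for b in range(B):
        tx = (xs + flow[b, 0].round().long()).clamp(0, W2 - 1)
        ty = (ys + flow[b, 1].round().long()).clamp(0, H2 - 1)
        # Corners of each lookup window (sufficient if a window spans <= 2 blocks/axis).
        for dy in (-lookup_radius, lookup_radius):
            for dx in (-lookup_radius, lookup_radius):
                cx = (tx + dx).clamp(0, W2 - 1)
                cy = (ty + dy).clamp(0, H2 - 1)
                tgt_block = ((cy // block) * (W2 // block) + cx // block).reshape(-1)
                mask[src_block, tgt_block] = True
    return mask
```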
4. Upsampling Methods: Convex Combination and Sparse Interpolation
RAFT employs a boundary-aware convex combination upsampler. Each high-resolution flow value is a convex combination (non-negative weights summing to one) of its $3 \times 3$ coarse-grid neighbors:

$$f_{\text{hi}}(x) = \sum_{k \in \mathcal{N}_{3\times3}(x)} w_k(x)\, f_{\text{lo}}(k), \qquad w_k(x) \ge 0, \quad \sum_k w_k(x) = 1,$$

with the weights $w_k$ predicted by the network.
MS-RAFT+, MS-RAFT, and PRAFlow share this learned upsampler across scales for efficient flow refinement and sharp motion boundaries.
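A sketch of the convex combination upsampler following the published RAFT scheme (an upsampling factor of 8 and 3×3 coarse neighborhoods are assumed; `mask` is the weight tensor predicted by the network):

```python
import torch
import torch.nn.functional as F

def convex_upsample(flow, mask, factor=8):
    """Upsample flow by `factor` using learned convex combinations.

    flow: (B, 2, H, W) low-resolution flow.
    mask: (B, 9 * factor**2, H, W) unnormalized weights over each output
          pixel's 3x3 coarse-grid neighborhood.
    """
    B, _, H, W = flow.shape
    mask = mask.view(B, 1, 9, factor, factor, H, W)
    mask = torch.softmax(mask, dim=2)            # non-negative weights summing to one

    # 3x3 neighborhoods of the (magnitude-rescaled) coarse flow.
    up = F.unfold(factor * flow, kernel_size=3, padding=1)   # (B, 2*9, H*W)
    up = up.view(B, 2, 9, 1, 1, H, W)

    up = (mask * up).sum(dim=2)                  # convex combination per output pixel
    up = up.permute(0, 1, 4, 2, 5, 3)            # (B, 2, H, factor, W, factor)
    return up.reshape(B, 2, factor * H, factor * W)
```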
Normalized Convolution Upsampling (NCUP) (Eldesokey et al., 2021) further refines this process by formulating upsampling as sparse interpolation guided by confidence maps, solved via cascaded normalized convolution layers. NCUP sharply delineates objects, improves cross-dataset generalization, and uses 80% fewer parameters than RAFT's standard upsampler.
5. Robustness, Handling Outliers, and Generalization
MS-RAFT+’s robust sample-wise loss downweights outlier-rich samples, improving cross-domain robustness. The multi-scale, multi-iteration loss in MS-RAFT is critical; without it, error increases dramatically.
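Purely as an illustration of the idea, a sub-linear penalty of the form $(\|e\| + \epsilon)^q$ with $q < 1$ reduces the influence of samples dominated by large outlier errors; the exact loss and constants used in MS-RAFT/MS-RAFT+ are given in the original papers and may differ from this sketch:

```python
import torch

def robust_multiscale_loss(preds_per_scale, gts_per_scale, gamma=0.8, eps=0.01, q=0.7):
    """Exponentially weighted multi-scale loss with a sub-linear (robust) penalty.

    preds_per_scale: per scale, the list of per-iteration flow predictions (B, 2, H, W).
    gts_per_scale:   ground-truth flow resampled to each scale's resolution.
    eps and q are illustrative constants; q < 1 downweights large (outlier) errors.
    """
    loss = 0.0
    for preds, gt in zip(preds_per_scale, gts_per_scale):
        T = len(preds)
        for t, flow_t in enumerate(preds):
            epe = torch.norm(flow_t - gt, dim=1)       # per-pixel endpoint error
            loss = loss + gamma ** (T - t - 1) * ((epe + eps) ** q).mean()
    return loss
```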
SEA-RAFT (Wang et al., 23 May 2024) introduces direct regression of an initial flow, rigid-motion pre-training (on TartanAir data), and a mixture-of-Laplace likelihood, reducing the number of required refinement iterations and increasing both speed and cross-dataset accuracy. SEA-RAFT achieves substantial speedups and lower EPE at comparable model size.
Unsupervised methods such as SMURF (Stone et al., 2021), based on RAFT, exploit self-teaching, multi-frame sequence-aware losses, photometric warping, and occlusion inpainting, achieving substantial error reductions over prior unsupervised methods and outperforming multiple supervised networks.
6. Benchmarks and Quantitative Performance
MS-RAFT+ attained first place at RVC 2022, ranking first on VIPER, second on KITTI, Sintel, and Middlebury, with overall best accuracy (Jahedi et al., 2022). Compared to MS-RAFT, significant improvements were measured:
- KITTI Fl-all: $4.15$ vs $4.88$ (≈15% lower)
- Sintel Clean EPE: $1.232$ vs $1.374$ (≈10% lower)
- Middlebury EPE: $0.142$ vs $0.184$ (≈23% lower)
SEA-RAFT reached a $3.686\%$ 1px outlier rate and $0.363$ EPE on Spring, further improving speed and accuracy (Wang et al., 23 May 2024). Efficient block-sparse sampling (Briedis et al., 22 May 2025) enabled native 8K flow estimation on commodity GPUs, reducing correlation sampling runtime by up to 90% while maintaining accuracy.
7. Current Challenges and Future Directions
Recent avenues in the RAFT family include:
- Global matching and overlapping attention (GMFlowNet (Zhao et al., 2022)) for large displacement improvements
- Context-guided correlation volumes (CGCV (Li et al., 2022)) for robust matching under adverse visual conditions
- Attention-based feature localization and amorphous lookup operators (Ef-RAFT (Eslami et al., 1 Jan 2024)) to address large motions and repetitive regions with lightweight extensions
- Ultra-high-resolution deployment via block-sparse sampling and cascaded inference strategies (Briedis et al., 22 May 2025)
Persisting bottlenecks involve quadratic memory for naive correlation, efficiency of cost sampling at fine scale, occlusion modeling, and generalization across diverse visual domains. Promising future directions include correlation volume compression, adaptive pyramid depth selection, and cross-task generalization to related correspondence problems (stereo, scene flow, depth estimation).
The RAFT architecture, together with its multi-scale and efficient derivatives, forms the foundation of current state-of-the-art optical flow estimation, supporting accuracy, robustness, and scalability for both supervised and unsupervised pipelines across academic and practical benchmarks.