
FlowIt: Hierarchical Transformer for Optical Flow

Updated 2 April 2026
  • FlowIt is a deep learning architecture for optical flow estimation that integrates a hierarchical transformer backbone with entropy-regularized optimal transport.
  • It employs a four-stage pipeline (CNN-FPN feature extraction, global matching, confidence-guided correction, and GRU-based iterative refinement) for precise flow estimation.
  • FlowIt achieves state-of-the-art accuracy on benchmarks like Sintel and KITTI, demonstrating superior cross-dataset zero-shot generalization and robust performance.

FlowIt is a deep learning architecture for optical flow estimation that addresses large pixel displacements by introducing a hierarchical transformer backbone and a global matching formulation based on entropy-regularized optimal transport. The architecture analytically derives occlusion and confidence maps, which are used for confidence-guided refinement, resulting in state-of-the-art accuracy and superior cross-dataset zero-shot generalization. FlowIt achieves top results on benchmarks such as Sintel and KITTI, as well as on challenging zero-shot settings across datasets including Spring and LayeredFlow (Safadoust et al., 30 Mar 2026).

1. Architectural Overview and Pipeline

FlowIt implements a four-stage processing pipeline:

  1. CNN-Based Feature Extraction and Feature Pyramid Network (FPN): Both frames are processed via a CNN and FPN to obtain multi-scale feature maps $\mathbf{f}_i^s$ at scales $s = 4, 8, 16, 32$.
  2. Global Matching via Optimal Transport: All-pairs matching is computed as a negative similarity cost volume, subsequently solved using entropy-regularized optimal transport (OT).
  3. Confidence-Guided Global Correction: The initial flow, occlusion, and confidence maps produced by OT are refined globally, using a U-Net conditioned on the confidence map.
  4. Iterative Local Refinement: A GRU-based iterative module, following the RAFT style, locally fine-tunes the flow, confidence, and occlusion estimates through several steps.

The backbone, called the Multi-Resolution Transformer (MRT), operates on FPN features. MRT blocks perform both within-scale (horizontal) self-attention and cross-scale (vertical) gated fusion, yielding context-rich features $\mathbf{g}_i$ at $1/4$ resolution that capture long-range dependencies and remain robust to large motions.
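To make the scales concrete, the following minimal sketch lists the feature-map sizes a four-level pyramid would produce for a 436 x 1024 frame (the Sintel resolution). The helper name `pyramid_shapes` and the choice of rounding up (as if the input were padded) are assumptions made for illustration and are not taken from the paper.

```python
import math

def pyramid_shapes(height, width, strides=(4, 8, 16, 32)):
    """Return {stride: (h, w)} for feature maps downsampled by each stride.

    Rounding up assumes the input is padded to a multiple of the stride.
    """
    return {s: (math.ceil(height / s), math.ceil(width / s)) for s in strides}

shapes = pyramid_shapes(436, 1024)
for stride, (h, w) in shapes.items():
    print(f"stride {stride:>2}: {h} x {w} = {h * w} positions")

# All-pairs matching at 1/4 resolution compares every position in frame 1
# with every position in frame 2, so the cost volume has (h * w)^2 entries.
h4, w4 = shapes[4]
cost_entries = (h4 * w4) ** 2
print("cost volume entries at 1/4 resolution:", cost_entries)
```

The quadratic growth of the all-pairs cost volume is one reason the matching stage runs at $1/4$ resolution rather than at full resolution.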

2. Global Matching via Entropy-Regularized Optimal Transport

The initial matching utilizes OT, formulated as:

$$\min_{\mathbf{T}\geq0} \sum_{i,j} C_{ij}\,T_{ij} - \varepsilon \sum_{i,j}T_{ij}\log T_{ij}$$

subject to $\mathbf{T} \mathbf{1} = \mathbf{r}$ and $\mathbf{T}^T \mathbf{1} = \mathbf{c}$,

where $C$ is the cost matrix (negative similarity over all pixel pairs across the two frames), $\varepsilon$ is the entropy regularization weight, and $\mathbf{r}$, $\mathbf{c}$ are uniform marginals (with occlusion “dustbins”). The Sinkhorn algorithm provides a GPU-accelerated iterative solution, yielding a probabilistic matching tensor $\mathbf{T}$, from which the initial estimates are derived:

  • Initial flow: derived from high-probability matches using a softmax in a local window,
  • Initial confidence map: local sum of the matching probabilities near peaks,
  • Initial occlusion map: total marginal mass, where low values indicate likely occlusion.
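The Sinkhorn solve described above can be sketched in a few lines of NumPy. This is a generic log-domain Sinkhorn iteration, not the authors' implementation: the function name, the constant `dustbin_cost`, the value of `eps`, and the choice of giving each dustbin enough mass to absorb every pixel of the other frame are all assumptions made for illustration.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def sinkhorn_with_dustbin(cost, eps=0.5, dustbin_cost=1.0, n_iters=100):
    """Entropy-regularized OT with one extra 'dustbin' row and column.

    cost: (n, m) matrix of negative similarities.
    Returns the (n + 1, m + 1) transport plan and both marginals.
    """
    n, m = cost.shape
    # Augment the cost with a dustbin row and column (assumed constant cost).
    cost_aug = np.full((n + 1, m + 1), dustbin_cost, dtype=np.float64)
    cost_aug[:n, :m] = cost
    # Uniform mass on real pixels; each dustbin can absorb the other frame.
    r = np.concatenate([np.ones(n), [float(m)]])
    c = np.concatenate([np.ones(m), [float(n)]])
    r, c = r / r.sum(), c / c.sum()
    log_k = -cost_aug / eps
    f = np.zeros(n + 1)  # log row scalings
    g = np.zeros(m + 1)  # log column scalings
    for _ in range(n_iters):
        f = np.log(r) - logsumexp(log_k + g[None, :], axis=1)
        g = np.log(c) - logsumexp(log_k + f[:, None], axis=0)
    plan = np.exp(log_k + f[:, None] + g[None, :])
    return plan, r, c

# Toy example: four features per frame, each matching its counterpart.
cost = -np.eye(4)
T, r, c = sinkhorn_with_dustbin(cost)
matches = T[:4, :4].argmax(axis=1)
print(matches)
```

Working in the log domain avoids overflow when `eps` is small, which is a common practical choice for Sinkhorn on GPUs as well.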

These analytically derived maps provide principled initial estimates, unlike typical heuristic or entirely learned alternatives.
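One possible reading of how the three maps follow from the matching probabilities is sketched below. The details are assumptions: the function name, the window `radius`, the interpretation of the local softmax as renormalizing probabilities inside a window around the best match, and the convention that each row of `T` (excluding the dustbin) sums to at most 1.

```python
import numpy as np

def maps_from_matching(T, height, width, radius=1):
    """Derive flow, confidence, and visibility from matching probabilities.

    T: (H*W, H*W) non-negative matrix; row i holds the probabilities that
       source pixel i matches each target pixel (dustbin column removed).
    """
    n = height * width
    ys, xs = np.divmod(np.arange(n), width)  # grid coordinates of flat indices
    flow = np.zeros((n, 2))
    confidence = np.zeros(n)
    visibility = T.sum(axis=1)  # low total mass suggests an occluded pixel
    for i in range(n):
        j = int(np.argmax(T[i]))
        window = (np.abs(ys - ys[j]) <= radius) & (np.abs(xs - xs[j]) <= radius)
        w = T[i] * window
        mass = w.sum()
        if mass <= 1e-12:
            continue  # no usable match: keep zero flow and zero confidence
        # Expected target position under the renormalized window weights.
        flow[i, 0] = (w * xs).sum() / mass - xs[i]
        flow[i, 1] = (w * ys).sum() / mass - ys[i]
        confidence[i] = mass  # probability mass concentrated near the peak
    shape = (height, width)
    return (flow.reshape(height, width, 2),
            confidence.reshape(shape),
            visibility.reshape(shape))

# Toy example: every pixel moves one step right; the last column has no match.
H, W = 2, 3
T = np.zeros((H * W, H * W))
for y in range(H):
    for x in range(W - 1):
        T[y * W + x, y * W + x + 1] = 0.9
flow, confidence, visibility = maps_from_matching(T, H, W)
```

In the toy example the last column receives zero visibility, which matches the idea that pixels leaving the frame send their mass to the dustbin.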

3. Confidence-Guided Correction and Local Refinement

Refinement operates in two sub-stages:

  • Global Confidence-Guided Correction: A binary mask is generated by thresholding the confidence map. High-confidence flow estimates are preserved, while low-confidence vectors are replaced by the output of a U-Net aggregation function conditioned on the confidence map. The updated flow therefore equals the initial flow wherever the mask is set and the U-Net output everywhere else.

  • Iterative Local Refinement: A compact RAFT-style GRU module predicts residuals for flow, confidence, and occlusion at each refinement step. Updates to the confidence and occlusion maps use logit accumulation, which keeps their values between 0 and 1: each residual is added to the logit of the current estimate, and the sum is mapped back through a sigmoid.
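The confidence-guided correction in the first bullet amounts to a masked blend of two flow fields. A minimal sketch follows, assuming a hypothetical threshold of 0.5 and a stand-in array `flow_corrected` in place of the U-Net output; neither value comes from the paper.

```python
import numpy as np

def confidence_guided_blend(flow_init, flow_corrected, confidence, threshold=0.5):
    """Keep confident flow vectors and replace the rest.

    flow_init, flow_corrected: (H, W, 2) arrays.
    confidence: (H, W) array with values between 0 and 1.
    """
    mask = (confidence >= threshold).astype(flow_init.dtype)
    m = mask[..., None]  # broadcast the mask over the two flow channels
    blended = m * flow_init + (1.0 - m) * flow_corrected
    return blended, mask

flow_init = np.array([[[1.0, 0.0], [9.0, 9.0]]])       # second vector is unreliable
flow_corrected = np.array([[[1.2, 0.1], [2.0, 0.0]]])  # stand-in for the U-Net output
confidence = np.array([[0.95, 0.10]])
blended, mask = confidence_guided_blend(flow_init, flow_corrected, confidence)
print(blended)
```

Because the mask is binary, confident vectors pass through unchanged, so the correction stage cannot degrade matches that the transport step already resolved.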

Three refinement steps were found to work best, followed by learned convex upsampling to the original resolution.
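Logit accumulation can be illustrated with a short sketch: residuals are summed in logit space and a sigmoid maps the result back, so the map stays a valid probability no matter how large the residuals are. The function names and the residual values below are made up for the example, and the GRU that would predict the residuals is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p, eps=1e-6):
    """Inverse sigmoid, clipped away from 0 and 1 for numerical safety."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)

def accumulate_logits(initial_map, residuals):
    """Add each step's residual to the running logit, then squash once."""
    z = logit(initial_map)
    for dz in residuals:
        z = z + dz
    return sigmoid(z)

confidence0 = np.array([0.2, 0.9])
# Three refinement steps with hypothetical residuals.
steps = [np.array([1.0, -0.5])] * 3
refined = accumulate_logits(confidence0, steps)
print(refined)
```

Adding residuals directly to the probabilities would require clamping after every step, which discards gradient information at the boundaries; working in logit space avoids that.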

This explicit utilization of analytic confidence and occlusion maps to steer the refinement distinguishes FlowIt from previous local or heuristic correction pipelines.

4. Benchmark Results and Generalization

FlowIt achieves state-of-the-art optical flow accuracy on widely used datasets. Experimental results include:

  • Sintel (test): XL model ranks 1st on Clean (EPE=0.93), with Final pass EPE=2.29.
  • KITTI (test): XL achieves Fl-all=3.81, best Non-Occluded=1.94.
  • Spring (zero-shot, train split): XL attains EPE=0.390, 1px=3.827%, outperforming all WAFT and SEA-RAFT variants.
  • LayeredFlow (zero-shot, 1/8 resolution): XL achieves EPE=8.01, with best 1px/3px/5px accuracy among competitors.

Cross-dataset zero-shot generalization: When trained only on FlyingChairs and FlyingThings, FlowIt (L) achieves Sintel Clean=0.90, Final=2.11, setting a new state-of-the-art. On Spring (train, zero-shot), FlowIt (XL) is also the top performer among methods not using extra data (Safadoust et al., 30 Mar 2026).
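For readers unfamiliar with the metrics quoted above, the sketch below computes them on a tiny synthetic example. It uses the standard definitions: EPE is the mean Euclidean distance between predicted and ground-truth flow, the 1px score is the percentage of pixels whose error exceeds 1 pixel, and Fl-all counts a pixel as an outlier when its error exceeds both 3 pixels and 5% of the ground-truth magnitude. The function names and test values are illustrative only.

```python
import numpy as np

def endpoint_error(pred, gt):
    """Per-pixel Euclidean distance between two (H, W, 2) flow fields."""
    return np.linalg.norm(pred - gt, axis=-1)

def epe(pred, gt):
    return float(endpoint_error(pred, gt).mean())

def px_outlier_rate(pred, gt, tau=1.0):
    """Percentage of pixels with endpoint error above tau pixels."""
    return float((endpoint_error(pred, gt) > tau).mean() * 100.0)

def fl_all(pred, gt):
    """Percentage of pixels with error > 3 px and > 5% of the GT magnitude."""
    err = endpoint_error(pred, gt)
    mag = np.linalg.norm(gt, axis=-1)
    return float(((err > 3.0) & (err > 0.05 * mag)).mean() * 100.0)

# Four pixels with horizontal ground-truth motion of 10, 10, 200, and 10 px.
gt = np.array([[[10.0, 0.0], [10.0, 0.0]],
               [[200.0, 0.0], [10.0, 0.0]]])
# Prediction errors of 0, 0.5, 4, and 5 px along x.
pred = gt + np.array([[[0.0, 0.0], [0.5, 0.0]],
                      [[4.0, 0.0], [5.0, 0.0]]])

print(epe(pred, gt), px_outlier_rate(pred, gt), fl_all(pred, gt))
```

The third pixel shows why Fl-all uses a relative term: a 4 px error on a 200 px motion is under 5% of the magnitude and is not counted as an outlier.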

5. Design Decisions and Ablation Insights

Empirical ablation demonstrates:

  • Larger model scale from L to XL yields significant EPE reduction (0.644→0.526 on Chairs val); S and M scales are similar.
  • Explicit supervision on confidence and occlusion maps improves flow accuracy relative to RAFT-style weightings.
  • Initializing from stereo-pretrained weights (following Min et al.) accelerates convergence and reduces EPE by approximately 2.2 px compared to random initialization.
  • Decoupling axis-wise (u/v) flow residuals is necessary for training stability; coupled updates result in divergence.
  • Three refinement steps yield optimum accuracy; more iterations provide marginal or negative returns.

Table: Model Scaling Effect on EPE (Chairs val)

Model   EPE
S       n/a
M       n/a
L       0.644
XL      0.526

n/a denotes a value not explicitly provided.
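As a quick worked example, the relative improvement from L to XL can be computed from the two reported values. The helper name is hypothetical, and only the two EPE figures come from the table.

```python
def relative_reduction(before, after):
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100.0

reduction = relative_reduction(0.644, 0.526)
print(f"L -> XL EPE reduction: {reduction:.1f}%")
```

This comes to roughly an 18% reduction in EPE on the Chairs validation split.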

6. Theoretical and Practical Significance

FlowIt's global matching via OT explicitly enforces mutual consistency and resolves ambiguity in long-range correspondences, addressing two central limitations of local-matching approaches. The MRT backbone provides a principled trade-off between high resolution and global contextual modelling at manageable memory costs. Analytically derived confidence and occlusion cues allow targeted information propagation, avoiding reliance on heuristic uncertainty estimators. The two-stage refinement leverages non-local high-confidence inpainting and local iterative fine-tuning, resulting in improved performance in both in-domain and cross-domain scenarios. These architectural choices underpin state-of-the-art accuracy and robust domain generalization, as evidenced by benchmark and zero-shot results (Safadoust et al., 30 Mar 2026).

References (1)
