Hierarchical Warping & Occlusion-Aware Noise Suppression
- The paper introduces a pyramidal coarse-to-fine architecture that efficiently captures large displacements and refines flow iteratively.
- It employs a novel sampling-based correlation layer that bypasses interpolation artifacts, effectively mitigating ghosting in feature warping.
- The method integrates explicit occlusion-aware cost reweighting within a shared decoder, yielding significant performance gains on benchmarks like Sintel and KITTI.
Hierarchical Warping and Occlusion-Aware Noise Suppression refers to architectural and algorithmic strategies for optical flow estimation networks, focused on addressing the challenges posed by feature warping artifacts (notably ghosting) and ambiguous matches in occluded regions. These methods are exemplified by the OAS-Net (Occlusion Aware Sampling Network), which replaces traditional warping-based correlation with a sampling-based alternative and integrates explicit occlusion-aware cost reweighting. This combination suppresses noise propagated by occlusions and interpolation, yielding superior flow estimates in challenging scenarios (Kong et al., 2021).
1. Pyramidal Coarse-to-Fine Architecture
Hierarchical (pyramidal) processing is foundational in contemporary optical flow estimation. In OAS-Net, a shared two-layer convolutional subnetwork recursively constructs 6-level feature pyramids for both input images, with each level halving the spatial resolution and increasing the channel count: [16, 32, 64, 96, 128, 160] for levels 1 through 6.
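A minimal PyTorch sketch of such a shared pyramid encoder is given below. Only the level count and the channel progression come from the description above; the kernel sizes, strides, activations, and RGB input are assumptions.

```python
import torch
import torch.nn as nn

class PyramidEncoder(nn.Module):
    """Builds a 6-level feature pyramid; the same weights are applied to both frames."""

    def __init__(self, channels=(16, 32, 64, 96, 128, 160)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3  # assumed RGB input
        for out_ch in channels:
            # Two convolutions per level; the first halves the spatial resolution.
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.LeakyReLU(0.1, inplace=True),
                nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
                nn.LeakyReLU(0.1, inplace=True),
            ))
            in_ch = out_ch

    def forward(self, img):
        feats = []
        x = img
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # feats[0] is level 1 (finest), feats[-1] is level 6 (coarsest)
        return feats

# The encoder is shared: the same instance processes both input frames.
encoder = PyramidEncoder()
img1 = torch.randn(1, 3, 384, 448)
img2 = torch.randn(1, 3, 384, 448)
pyr1, pyr2 = encoder(img1), encoder(img2)
```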
Flow is estimated progressively from coarse (level 6) to fine (level 1):
- At level $l$, the flow $F^{l+1}$ and occlusion map $O^{l+1}$ from the coarser level are upsampled by a factor of 2 ($\mathrm{up}(F^{l+1})$, $\mathrm{up}(O^{l+1})$).
- A matching cost volume is computed using sampling-based correlation (see Section 2).
- The raw cost volume $C^l$ and the upsampled occlusion map $\mathrm{up}(O^{l+1})$ feed into an occlusion-aware module, producing a reweighted cost volume $\tilde{C}^l$.
- A shared decoder operates on $\tilde{C}^l$, outputting a flow residual $\Delta F^l$ and an updated occlusion map $O^l$.
- The refined flow is $F^l = \mathrm{up}(F^{l+1}) + \Delta F^l$.
This multiscale design enables the system to efficiently capture large displacements and refine flow iteratively.
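Put together, the level-by-level refinement can be sketched as follows. This is PyTorch-style pseudocode under assumptions: `sampling_correlation`, `occlusion_reweight`, and `decoder` stand in for the modules of Sections 2, 4, and 5, the exact composition of the decoder input is not specified by the paper summary above, and the coarsest-level initialization is simplified.

```python
import torch
import torch.nn.functional as F

def estimate_flow(pyr1, pyr2, sampling_correlation, occlusion_reweight, decoder):
    """Coarse-to-fine refinement over 6-level feature pyramids (level 6 -> level 1)."""
    flow, occ = None, None
    for f1, f2 in zip(reversed(pyr1), reversed(pyr2)):  # start at the coarsest level
        b, _, h, w = f1.shape
        if flow is None:
            flow = f1.new_zeros((b, 2, h, w))  # no prior estimate at the coarsest level
            occ = f1.new_ones((b, 1, h, w))
        else:
            # Upsample the previous estimates by 2; flow vectors are rescaled accordingly.
            flow = 2.0 * F.interpolate(flow, size=(h, w), mode='bilinear', align_corners=False)
            occ = F.interpolate(occ, size=(h, w), mode='bilinear', align_corners=False)
        cost = sampling_correlation(f1, f2, flow)            # raw cost volume (Section 2)
        cost = occlusion_reweight(cost, occ)                 # occlusion-aware reweighting (Section 4)
        flow_res, occ = decoder(torch.cat((cost, flow), 1))  # shared decoder (Section 5)
        flow = flow + flow_res                               # residual refinement
    return flow, occ
```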
2. Sampling-Based Correlation Layer
The pivotal methodological innovation is the sampling-based correlation. Standard networks such as PWC-Net deploy feature warping—interpolating target features spatially according to the predicted flow—prior to local inner product correlation. OAS-Net, in contrast, eschews explicit warping altogether.
Correlation at each pixel $\mathbf{x}$ and displacement offset $\mathbf{d}$ (with $\mathbf{d}$ restricted to a fixed search window around zero) is computed as

$$C^l(\mathbf{x}, \mathbf{d}) = \big\langle \phi_1^l(\mathbf{x}),\; \phi_2^l\big(\mathbf{x} + \mathrm{up}(F^{l+1})(\mathbf{x}) + \mathbf{d}\big) \big\rangle,$$

where $\langle \cdot, \cdot \rangle$ denotes the channel-wise inner product, $\phi_1^l$ and $\phi_2^l$ are the level-$l$ feature maps of the two frames, and $\mathrm{up}(F^{l+1})$ is the upsampled flow estimate from the coarser level.
This process samples features from the predicted target locations plus a search window, but does not physically shift or interpolate the grid. Therefore, the operation avoids introducing interpolation artifacts and local inconsistencies.
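A simplified PyTorch sketch of sampling-based correlation is shown below: it queries the target features at flow-displaced positions with `grid_sample` instead of first building a warped feature map. The search radius, the normalization by the channel count, and the use of bilinear point sampling for sub-pixel positions are illustrative assumptions, not details fixed by the paper summary above.

```python
import torch
import torch.nn.functional as F

def sampling_correlation(f1, f2, flow, radius=4):
    """Cost volume C(x, d) = <f1(x), f2(x + flow(x) + d)> without warping the grid of f2."""
    b, c, h, w = f1.shape
    # Base sampling positions in pixel coordinates, displaced by the current flow estimate.
    ys, xs = torch.meshgrid(torch.arange(h, device=f1.device, dtype=f1.dtype),
                            torch.arange(w, device=f1.device, dtype=f1.dtype),
                            indexing='ij')
    base_x = xs + flow[:, 0]  # (b, h, w)
    base_y = ys + flow[:, 1]
    costs = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Normalize sample positions to [-1, 1] for grid_sample.
            gx = 2.0 * (base_x + dx) / max(w - 1, 1) - 1.0
            gy = 2.0 * (base_y + dy) / max(h - 1, 1) - 1.0
            grid = torch.stack((gx, gy), dim=-1)  # (b, h, w, 2)
            f2_sampled = F.grid_sample(f2, grid, mode='bilinear',
                                       padding_mode='zeros', align_corners=True)
            # Channel-wise inner product, normalized by the channel count.
            costs.append((f1 * f2_sampled).sum(dim=1, keepdim=True) / c)
    return torch.cat(costs, dim=1)  # (b, (2*radius+1)**2, h, w)
```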
3. Ghosting and Noise in Feature Warping
Feature warping has a known pathology: ghosting. When multiple source locations are mapped to the same warped target location (frequent in occlusions or fast motions), bilinear interpolation aggregates disparate pixel values, resulting in ambiguous, duplicated features (“ghosts”). This can corrupt cost volume construction and thus flow estimation.
Sampling-based correlation addresses this by querying target features independently at specified locations; there is no many-to-one mixing. The result is a cost volume intrinsically robust to aliasing and less affected by motion boundary artifacts. The methodology never physically alters the target feature grid, which precludes the formation of local ghosts.
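A toy 1-D NumPy illustration (not from the paper) of the mechanism: when the flow in an occluded region happens to point at the same target location as a visible pixel, backward warping duplicates the target feature in the warped map.

```python
import numpy as np

# A single bright feature (value 9) in the target frame.
f2 = np.array([0.0, 0.0, 9.0, 0.0, 0.0])

# Hypothetical flow: the occluded pixel at index 0 has an unreliable flow of +2,
# pointing at the same target position as the visible pixel at index 2 (flow 0).
flow = np.array([2.0, 0.0, 0.0, 0.0, 0.0])

# Backward warping: warped(x) = f2(x + flow(x)) (integer offsets for clarity).
idx = np.clip(np.arange(5) + flow.astype(int), 0, 4)
warped = f2[idx]
print(warped)  # [9. 0. 9. 0. 0.] -> the bright feature now appears twice (a "ghost")
```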
4. Occlusion-Aware Cost Volume Reweighting
Occluded regions are prone to unreliable matches, as true correspondences do not exist. OAS-Net introduces an explicit occlusion-awareness mechanism:
- Each pyramid level maintains an occlusion-awareness map $O^l$ estimating the non-occlusion likelihood; the map from the coarser level is upsampled for current use ($\mathrm{up}(O^{l+1})$).
- Complementary weights are defined: $W_{\mathrm{vis}} = \mathrm{up}(O^{l+1})$ and $W_{\mathrm{occ}} = 1 - \mathrm{up}(O^{l+1})$.
- The raw cost volume $C^l$ is reweighted to produce $W_{\mathrm{vis}} \odot C^l$ and $W_{\mathrm{occ}} \odot C^l$ via elementwise products.
- Two dedicated 2D convolutions ($\mathrm{conv}_{\mathrm{vis}}$, $\mathrm{conv}_{\mathrm{occ}}$) are applied to the two reweighted volumes, and their outputs are merged and passed through a leaky-ReLU activation to form the occlusion-aware cost volume $\tilde{C}^l$ (see the sketch after this list). This splitting enables the network to learn different matching filters for visible and occluded regions, akin to a learned self-attention mechanism over the cost volume.
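A PyTorch sketch of this reweighting module follows; merging by elementwise addition, the 3×3 kernel size, and keeping the cost-volume channel count fixed are assumptions not pinned down by the description above.

```python
import torch
import torch.nn as nn

class OcclusionAwareReweighting(nn.Module):
    """Reweights a raw cost volume with complementary occlusion weights (sketch)."""

    def __init__(self, cost_channels):
        super().__init__()
        # Separate filters for the "visible" and "occluded" parts of the cost volume.
        self.conv_vis = nn.Conv2d(cost_channels, cost_channels, kernel_size=3, padding=1)
        self.conv_occ = nn.Conv2d(cost_channels, cost_channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, cost, occ):
        # occ: non-occlusion likelihood in [0, 1], shape (b, 1, h, w).
        w_vis = occ          # weight for regions believed to be visible
        w_occ = 1.0 - occ    # complementary weight for likely-occluded regions
        cost_vis = self.conv_vis(w_vis * cost)
        cost_occ = self.conv_occ(w_occ * cost)
        return self.act(cost_vis + cost_occ)  # merge (addition assumed) + leaky ReLU
```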
5. Shared Decoder for Flow and Occlusion Estimation
For architectural compactness and consistency, the same decoder is shared across all pyramid levels. This module comprises an 8-layer U-shaped sequence of convolutions (channels: [128→128→128→128→128→96→64→32]), splitting into two prediction heads:
- A flow head, predicting the 2-channel residual flow $\Delta F^l$
- An occlusion head, outputting $O^l$ as a sigmoid map constrained to $[0, 1]$
Sharing the decoder reinforces hierarchical consistency and decreases network complexity.
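A sketch of such a shared decoder in PyTorch is below. The decoder input is treated as a single concatenated tensor, and the 3×3 kernels and the purely sequential trunk (rather than the U-shaped one described above) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class SharedDecoder(nn.Module):
    """Decoder shared across pyramid levels; emits a flow residual and an occlusion map."""

    def __init__(self, in_channels, hidden=(128, 128, 128, 128, 128, 96, 64, 32)):
        super().__init__()
        layers = []
        ch = in_channels
        for out_ch in hidden:
            layers += [nn.Conv2d(ch, out_ch, kernel_size=3, padding=1),
                       nn.LeakyReLU(0.1, inplace=True)]
            ch = out_ch
        self.trunk = nn.Sequential(*layers)
        self.flow_head = nn.Conv2d(ch, 2, kernel_size=3, padding=1)  # residual flow
        self.occ_head = nn.Conv2d(ch, 1, kernel_size=3, padding=1)   # occlusion logits

    def forward(self, x):
        h = self.trunk(x)
        flow_res = self.flow_head(h)
        occ = torch.sigmoid(self.occ_head(h))  # non-occlusion likelihood in [0, 1]
        return flow_res, occ

# Example: 81 cost channels (9x9 search window) concatenated with 2 flow channels (assumed input).
decoder = SharedDecoder(in_channels=83)
```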
6. Optimization and Learning
OAS-Net is trained using a multi-scale endpoint error loss (identical to PWC-Net's):

$$\mathcal{L} = \sum_{l} \alpha_l \sum_{\mathbf{x}} \big\| F^l(\mathbf{x}) - F^l_{\mathrm{gt}}(\mathbf{x}) \big\|_2$$

Here, $F^l_{\mathrm{gt}}$ is the downsampled ground-truth flow at level $l$ and the $\alpha_l$ are per-level weights. The occlusion map is learned implicitly: no explicit ground-truth masks, occlusion-specific losses, or regularizers are incorporated.
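A possible implementation of this multi-scale endpoint-error loss is sketched below; the choice of per-level weights, the rescaling of the downsampled ground truth, and summation rather than averaging over pixels are assumptions.

```python
import torch
import torch.nn.functional as F

def multiscale_epe_loss(flow_preds, flow_gt, weights):
    """Multi-scale endpoint-error loss: per-level L2 distance to downsampled ground truth.

    flow_preds: list of predicted flows, one per pyramid level.
    flow_gt:    full-resolution ground-truth flow, shape (b, 2, H, W).
    weights:    per-level loss weights alpha_l, same order as flow_preds.
    """
    loss = 0.0
    for pred, alpha in zip(flow_preds, weights):
        h, w = pred.shape[-2:]
        # Downsample the ground truth and rescale its vectors to the level's resolution.
        gt = F.interpolate(flow_gt, size=(h, w), mode='bilinear', align_corners=False)
        gt = torch.stack((gt[:, 0] * (w / flow_gt.shape[-1]),
                          gt[:, 1] * (h / flow_gt.shape[-2])), dim=1)
        epe = torch.sqrt(((pred - gt) ** 2).sum(dim=1) + 1e-9)  # per-pixel endpoint error
        loss = loss + alpha * epe.sum()
    return loss
```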
7. Empirical Performance and Impact
Ablation demonstrates the significance of both the sampling-based correlation and the occlusion module (EPE on Sintel Final / KITTI 2012):

| Correlation | Occlusion module | Sintel Final | KITTI 2012 |
|---|---|---|---|
| Warping | no | 4.05 | 4.62 |
| Warping | yes | 3.98 | 4.37 |
| Sampling | no | 3.86 | 4.44 |
| Sampling | yes | 3.79 | 4.11 |
Switching from warping to sampling alone yields a 4.7% drop in Sintel Final EPE. Incorporating occlusion awareness alone reduces KITTI 2012 EPE by 5.4%. Combining both yields the largest improvement: 6.4% (Sintel Final) and 11.0% (KITTI 2012).
On public benchmarks, OAS-Net (6.16M parameters, 0.03 s/frame) achieves:
- Sintel Clean test EPE: 3.65 (among best for lightweight networks)
- Sintel Final test EPE: 5.01 (comparable to PWC-Net/IRR-PWC)
- KITTI 2012 test EPE: 1.4 (ties state-of-the-art)
A plausible implication is that hierarchical warping avoidance combined with explicit occlusion-aware noise suppression constitutes an effective paradigm for robust and efficient optical flow estimation, particularly in lightweight network deployments and scenarios with significant occlusions and fast motions (Kong et al., 2021).