NCUP: Normalized Convolution Upsampler
- NCUP is a novel upsampling technique that frames the task as a sparse interpolation problem using forward mapping and normalized convolution.
- It integrates a lightweight weight estimation network with a U-Net architecture to selectively inpaint missing flow information while maintaining detail.
- Experimental results demonstrate that NCUP achieves state-of-the-art performance, reducing AEPE and preserving edge fidelity with significantly fewer parameters.
The Normalized Convolution UPsampler (NCUP) is a parameter-efficient, joint upsampling technique for optical flow estimation networks that formulates upsampling as a sparse interpolation problem and solves it using normalized convolutional neural networks. NCUP is designed to produce full-resolution optical flow predictions during training and inference, allowing for the preservation of fine-scale motion details while avoiding the blurring, edge leakage, and semantic region mixing commonly associated with standard bilinear interpolation. The method integrates with both coarse-to-fine and recurrent optical flow architectures, achieves state-of-the-art performance, and generalizes robustly across datasets (Eldesokey et al., 2021).
1. Mathematical Foundations of Normalized Convolution
The upsampling problem is framed as learning a mapping that transforms a low-resolution 2D flow field F_lr of size h×w×2 into a dense high-resolution field F_hr of size sh×sw×2, with support from ancillary guidance (e.g., image RGB or CNN features). Instead of traditional backward (e.g., bilinear) interpolation, NCUP applies a forward mapping by projecting each low-resolution sample to its nearest high-resolution integer coordinates for upsampling factor s.
This projection yields a sparse high-resolution grid F⁰ and an associated confidence (weight) map C⁰, which is nonzero only where samples are available. A cascade of normalized convolution layers then fills in missing entries. Each layer computes a confidence-weighted average over its receptive field:

F⁽ˡ⁺¹⁾ = W ∗ (C⁽ˡ⁾ ⊙ F⁽ˡ⁾) / (W ∗ C⁽ˡ⁾ + ε),   C⁽ˡ⁺¹⁾ = (W ∗ C⁽ˡ⁾) / ΣW,

where W is a learned kernel (typically 3×3) shared spatially and applied per channel, ∗ denotes convolution, ⊙ denotes elementwise multiplication, and ε is a small constant that stabilizes division in empty regions. This confidence-weighted interpolation framework maintains data fidelity and prevents estimator drift in regions with low confidence.
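The layer above can be illustrated in isolation. The sketch below is a minimal pure-Python, single-channel version, assuming a fixed (rather than learned) 3×3 box kernel for clarity:

```python
# Minimal sketch of one normalized-convolution layer on a 2D grid.
# Single channel, zero padding; `w` is a fixed 3x3 kernel here, whereas
# the paper learns it end-to-end.

EPS = 1e-8

def nconv2d(z, c, w):
    """Confidence-weighted 3x3 convolution.

    z : H x W list-of-lists of data values (e.g. one flow channel)
    c : H x W list-of-lists of confidences in [0, 1]
    w : 3 x 3 kernel
    Returns (z_out, c_out): interpolated data and propagated confidence.
    """
    H, W = len(z), len(z[0])
    wsum = sum(sum(row) for row in w)
    z_out = [[0.0] * W for _ in range(H)]
    c_out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            num = den = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W:
                        k = w[di + 1][dj + 1]
                        num += k * c[ii][jj] * z[ii][jj]
                        den += k * c[ii][jj]
            z_out[i][j] = num / (den + EPS)   # confidence-weighted average
            c_out[i][j] = den / wsum          # confidence propagation
    return z_out, c_out
```

Stacking such layers propagates values outward from high-confidence samples into the zero-confidence holes, while the propagated confidence records how well-supported each output pixel is.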
2. Sparse Interpolation Formulation
NCUP’s distinctive approach interprets upsampling as sparse interpolation. Bilinear interpolation, a backward mapping, guarantees full density at the cost of blurred edges and motion mixing across object boundaries. In contrast, the forward-map and sparse-interpolate scheme ensures that known low-res flow values align precisely with their correct high-res locations, leaving large holes with zero confidence for the network to inpaint. The normalized convolution mechanism uses the explicit confidence map to selectively guide interpolation, restricting information propagation to semantically coherent regions and adaptively respecting localized boundaries.
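A minimal sketch of the forward-mapping step, assuming the simplest placement rule (low-res sample (i, j) lands at high-res coordinate (s·i, s·j); the paper projects to the nearest integer high-res coordinate):

```python
# Forward mapping: each low-resolution sample is scattered onto a sparse
# high-resolution grid; everywhere else the confidence is zero and must
# be inpainted by the normalized-convolution cascade.

def forward_map(f_lr, s):
    """Scatter an h x w low-res field onto an (s*h) x (s*w) sparse grid.

    Returns (f_hr, conf): sparse high-res field and its confidence map.
    """
    h, w = len(f_lr), len(f_lr[0])
    H, W = s * h, s * w
    f_hr = [[0.0] * W for _ in range(H)]
    conf = [[0.0] * W for _ in range(H)]
    for i in range(h):
        for j in range(w):
            f_hr[s * i][s * j] = f_lr[i][j]   # known value at its true location
            conf[s * i][s * j] = 1.0          # full confidence at samples only
    return f_hr, conf
```

Only a 1/s² fraction of the high-resolution pixels receive a sample; the rest start with zero confidence, which is exactly the hole pattern the normalized-convolution cascade is trained to fill.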
3. NCUP Network Architecture
NCUP consists of two specialized submodules:
- (A) Weight Estimation Network: This lightweight pixelwise MLP takes as input a concatenation of the low-resolution flow and guidance (either RGB or deep features). It comprises two 1×1 convolutions with batch normalization and ReLU activations, followed by a final 1×1 convolution with a sigmoid activation. The output is a dense low-resolution confidence map with values in (0, 1), which is subsequently forward-mapped to the sparse high-resolution confidence map.
- Typical hidden channel widths differ between the RGB-guided and the deep-feature-guided variants.
- (B) Normalized-Convolution U-Net: This backbone operates on the high-resolution sparse grid, consuming the forward-mapped flow and its confidence map. It is a two-scale (one down, one up) U-Net in which every convolution is a normalized convolution, downsampling is performed via confidence pooling (not max-pooling), and skip connections link the two scales. The interpolation network contains only 224 parameters; combined with the weight estimation network, the upsampler totals approximately 2k parameters.
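Confidence pooling, used in place of max-pooling for downsampling inside the U-Net, can be sketched as follows. `conf_pool2x2` is a hypothetical realization (a confidence-weighted mean within each 2×2 block), not the paper's exact operator:

```python
# Hypothetical 2x2 confidence pooling: the pooled value is the
# confidence-weighted mean of each block, and the pooled confidence
# summarizes how much support the block had. A sketch, not the paper's
# exact definition.

EPS = 1e-8

def conf_pool2x2(z, c):
    """Downsample data z and confidence c (H x W, H and W even) by 2."""
    H, W = len(z), len(z[0])
    z_out, c_out = [], []
    for i in range(0, H, 2):
        z_row, c_row = [], []
        for j in range(0, W, 2):
            block = [(z[i + di][j + dj], c[i + di][j + dj])
                     for di in (0, 1) for dj in (0, 1)]
            csum = sum(cv for _, cv in block)
            z_row.append(sum(zv * cv for zv, cv in block) / (csum + EPS))
            c_row.append(csum / 4.0)   # average support in the block
        z_out.append(z_row)
        c_out.append(c_row)
    return z_out, c_out
```

Unlike max-pooling, empty blocks stay empty (zero value, zero confidence), so sparsity information survives the downsampling path.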
4. Integration into Optical Flow Networks
4.1 Coarse-to-Fine Networks (e.g., PWC-Net, FlowNetS)
In these pipelines, flow is typically predicted at ¼ resolution with multi-scale loss supervision and upsampled via bilinear interpolation at test time. With NCUP, the upsampling module is attached at ¼ resolution. At each training iteration, NCUP upsamples the ¼-resolution prediction to full resolution, and a new full-resolution supervision term (the error of the upsampled flow against the full-resolution ground truth, with its own weight) is added to the pyramid loss. This enables end-to-end training and propagates full-resolution gradients throughout the network.
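A minimal sketch of this combined objective, with a hypothetical full-resolution weight `alpha` (the paper's exact weighting is not reproduced here) and average endpoint error (EPE) as the per-term metric:

```python
# Sketch: augmenting a multi-scale pyramid loss with a full-resolution
# term computed on NCUP's upsampled output. `alpha` is a hypothetical
# weight, not the paper's value.

def epe(flow_pred, flow_gt):
    """Average endpoint error between flow fields given as flat lists
    of (u, v) vectors, one per pixel."""
    total = 0.0
    for (u1, v1), (u2, v2) in zip(flow_pred, flow_gt):
        total += ((u1 - u2) ** 2 + (v1 - v2) ** 2) ** 0.5
    return total / len(flow_gt)

def total_loss(pyramid_losses, flow_full_pred, flow_gt, alpha=0.5):
    # standard coarse-to-fine pyramid loss plus the new full-res term
    return sum(pyramid_losses) + alpha * epe(flow_full_pred, flow_gt)
```

Because the full-resolution term is differentiable through the upsampler, its gradients reach the feature extractor as well, which is what keeps the backbone sensitive to fine-scale detail.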
4.2 Recurrent Networks (e.g., RAFT)
RAFT’s published implementation uses a 3×3 “convex combination” upsampler (∼500k parameters). NCUP replaces this by forward-mapping from ⅛ to ¼ res followed by NCUP from ¼ to full, guided by both the low-res flow and the recurrent state. All other training details match the standard RAFT schedule.
5. Quantitative Performance and Comparison
NCUP demonstrates marked improvements in endpoint error at lower computational cost:
| Method | AEPE (PWC-Net, FlyingChairs) | Upsampler Parameter Count |
|---|---|---|
| Bilinear | 1.58 (−6.5%) | — |
| DJIF | 1.51 (−10.6%) | 56k |
| PAC | 1.50 (−11.2%) | 183k |
| ConvComb | 1.52 (−10.0%) | 44k |
| NCUP | 1.46 (−13.6%) | 2k |
On FlowNetS (FlyingChairs), NCUP achieves 2.13 AEPE (−15.8%) versus the 2.53 baseline. For RAFT trained on Chairs+Things, NCUP reduces KITTI AEPE from 5.04 to 4.83, with a 6.3% AEPE reduction on the Sintel Final pass. After full finetuning, NCUP improves Sintel Final test AEPE from 2.86 to 2.69, with virtually no impact on KITTI test scores, while the full network carries roughly 7.5% fewer parameters (the upsampler itself shrinks from ∼500k to ∼100k).
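The relative improvements quoted above are easy to sanity-check with simple arithmetic; note that the PWC-Net/FlyingChairs baseline AEPE is not stated explicitly, so it is inferred here from the bilinear row's −6.5% entry:

```python
# Sanity-check of the relative AEPE reductions quoted in the text.

def rel_drop(base, x):
    """Fractional reduction of x relative to base."""
    return (base - x) / base

# FlowNetS on FlyingChairs: 2.53 -> 2.13, quoted as -15.8%
flownets = rel_drop(2.53, 2.13)

# RAFT after finetuning, Sintel Final test: 2.86 -> 2.69
raft_sintel = rel_drop(2.86, 2.69)

# Implied PWC-Net baseline from bilinear's row: 1.58 = base * (1 - 0.065)
base = 1.58 / (1 - 0.065)           # roughly 1.69
ncup_pwc = rel_drop(base, 1.46)     # roughly 13.6%, matching the table
```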
Runtime overhead of NCUP is <5 ms/frame (1024×436 inputs, 1080 Ti), matching bilinear interpolation and outpacing DJIF and PAC when backpropagation is enabled. The following table summarizes key metrics:
| Upsampler | Params (PWC) | AEPE (Chairs, PWC) | Params (RAFT) | AEPE (Sintel Final, RAFT) |
|---|---|---|---|---|
| Bilinear | — | 1.58 | — | — |
| ConvComb | 44k | 1.52 | 500k | 2.86 |
| NCUP | 2k | 1.46 | 100k | 2.69 |
6. Ablation Studies and Learned Behaviors
Ablation reveals the following:
- Weight Estimation: Sigmoid activation yields the lowest AEPE (1.46); replacing it with Softplus increases AEPE to 1.48. Using full-resolution guidance for weight prediction degrades AEPE (1.75) and causes memory issues. The low-resolution flow is an essential input to the weight estimation network; omitting it worsens AEPE to 1.52.
- Interpolation Network: Adding an extra scale slightly worsens AEPE (1.49). Using max-pooling instead of confidence pooling gives only marginally higher AEPE (1.48).
- Loss Weighting: Best results use a moderate weight on the full-resolution loss term; significant deviations in either direction yield worse AEPE (∼1.48).
- Learned Weights: Predicted maps concentrate near image edges and fine structures, segmenting object regions and enabling edge-aware interpolation. In flat regions, weights are uniform and interpolation tends to averaging.
7. Mechanisms for Preserving Flow Detail
NCUP’s explicit use of confidence maps allows each normalized convolution layer to enforce data fidelity where confidence is high and rely on interpolation elsewhere. Its multi-scale U-Net structure introduces both local and moderately expansive receptive fields for correcting localized and extended flow artifacts. End-to-end integration ensures feature extraction remains sensitive to full-resolution supervision, channeling loss gradients via NCUP to optimize the entire network stack. This synergy yields sharper, artifact-minimal, and semantically faithful flow fields with consistent 4–14% reductions in endpoint error, while using orders of magnitude fewer parameters compared to prior upsamplers (Eldesokey et al., 2021).