Flow2Stereo: Unified Flow & Stereo Estimation
- The paper introduces a unified framework that uses geometric constraints to learn optical flow and stereo disparity jointly.
- It employs a single PWC-Net style architecture with shared weights and a two-stage distillation mechanism to improve robustness and efficiency.
- On the KITTI benchmarks it achieves state-of-the-art unsupervised results for flow and disparity, with sharper motion boundaries and depth discontinuities.
Flow2Stereo is a unified framework for the joint self-supervised learning of optical flow and stereo disparity, in which a single network is trained end-to-end and evaluated on standard benchmarks. The method rests on the insight that stereo matching can be interpreted as a special case of optical flow under rigid 3D geometry, and it uses this relationship to enforce geometric and photometric consistency across multi-view video. Flow2Stereo combines self-supervised losses with a two-stage, distillation-based proxy-task training regimen and achieves state-of-the-art results among unsupervised methods for stereo and flow estimation on the KITTI 2012 and KITTI 2015 datasets, even surpassing several supervised baselines (Liu et al., 2020).
1. Geometric Foundations: Unifying Flow and Stereo
Flow2Stereo models stereo disparity as a one-dimensional case of optical flow constrained by epipolar geometry within a rectified stereo video sequence. In this setting, the left and right images captured at the same time step exhibit disparities strictly along the horizontal direction, whereas temporal motion between consecutive frames of the same view is parameterized as 2D flow.
The method links optical flow and disparity through rigid-scene geometry: the difference between the horizontal flow components in the right and left views equals the temporal change in disparity.
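The relationship can be written out as a short derivation. The notation here is an assumption for illustration and is not taken from the source: $d_t$ and $d_{t+1}$ are the positive disparities of one scene point at the two time steps (left-image $x$ minus right-image $x$), $(u_l, v_l)$ and $(u_r, v_r)$ are its optical flow in the left and right views, and primes mark positions at the later time step.

$$
\begin{aligned}
x_r &= x_l - d_t, \qquad x_r' = x_l' - d_{t+1},\\
u_r - u_l &= (x_r' - x_r) - (x_l' - x_l) = d_t - d_{t+1},\\
v_r - v_l &= 0 .
\end{aligned}
$$

The last line holds because rectification keeps corresponding points on the same image row at both time steps.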
The approach generalizes multi-frame relationships into higher-order constraints:
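- Quadrilateral (4-frame) constraint: following a point from one image to the diagonally opposite image of the grid must give the same result whether the path goes through the stereo pair first and then through time, or through time first and then through the stereo pair.
- Triangle (3-frame) constraints: the direct correspondence between two images must equal the composition of the two correspondences that pass through a third image.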
By exploiting all twelve possible pairwise correspondences in a $2 \times 2$ (time $\times$ stereo) grid, Flow2Stereo supervises both flow and disparity estimation through rigid multiview consistency.
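The two constraints can be illustrated on a single scene point. The sketch below is not the paper's implementation: the view labels (`L_t`, `R_t`, `L_t1`, `R_t1`) and all function names are hypothetical, disparities are assumed positive, and the dense case (where the second correspondence is sampled at the warped pixel) is reduced to one point so that no interpolation is needed.

```python
from itertools import permutations


def positions(x_left, y, flow_left, disp_t, disp_t1):
    """Pixel positions of one scene point in the four rectified views."""
    u, v = flow_left
    return {
        "L_t": (x_left, y),
        "R_t": (x_left - disp_t, y),
        "L_t1": (x_left + u, y + v),
        "R_t1": (x_left + u - disp_t1, y + v),
    }


def correspondences(pos):
    """Displacement for every ordered pair of views (twelve in total)."""
    return {
        (a, b): (pos[b][0] - pos[a][0], pos[b][1] - pos[a][1])
        for a, b in permutations(pos, 2)
    }


def _add(a, b):
    return (a[0] + b[0], a[1] + b[1])


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1])


def quadrilateral_residual(w):
    """Stereo-then-time path minus time-then-stereo path (zero if consistent)."""
    stereo_first = _add(w[("L_t", "R_t")], w[("R_t", "R_t1")])
    time_first = _add(w[("L_t", "L_t1")], w[("L_t1", "R_t1")])
    return _sub(stereo_first, time_first)


def triangle_residual(w):
    """Direct L_t -> R_t1 correspondence minus the path through R_t."""
    composed = _add(w[("L_t", "R_t")], w[("R_t", "R_t1")])
    return _sub(w[("L_t", "R_t1")], composed)
```

For a geometrically consistent set of correspondences both residuals are zero; perturbing any single correspondence makes at least one of them non-zero, which is the signal a constraint loss would penalize.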
2. Unified Network Architecture
The architecture adopts a PWC-Net style encoder-decoder:
- Encoder: A six-level feature pyramid, reducing input to $1/64$ spatial resolution, using shared weights for all pairwise correspondences.
- Cost volume: At each scale, features of the second image are warped toward the first using the current estimate and then correlated with the features of the first image, yielding a 4D cost volume that is used to estimate flow or disparity.
- Decoder: Residual flow/disparity is predicted in a coarse-to-fine fashion with skip connections.
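The correlation step can be sketched with NumPy. This is an illustrative simplification rather than the network's actual layer: `cost_volume` and `max_disp` are hypothetical names, the batch axis and the preceding warping step are omitted, and the search window size is an arbitrary choice.

```python
import numpy as np


def cost_volume(feat_ref, feat_tgt, max_disp=4):
    """Correlate reference features with shifted target features.

    feat_ref, feat_tgt: arrays of shape (C, H, W).
    Returns an array of shape (D * D, H, W) with D = 2 * max_disp + 1.
    Channel dy * D + dx holds the correlation for the displacement
    (dy - max_disp, dx - max_disp) in (row, column) order.
    """
    _, h, w = feat_ref.shape
    # Zero-pad the target so every displacement stays inside the array.
    padded = np.pad(
        feat_tgt, ((0, 0), (max_disp, max_disp), (max_disp, max_disp))
    )
    size = 2 * max_disp + 1
    costs = []
    for dy in range(size):
        for dx in range(size):
            shifted = padded[:, dy:dy + h, dx:dx + w]
            # Mean over channels of the element-wise product.
            costs.append((feat_ref * shifted).mean(axis=0))
    return np.stack(costs, axis=0)
```

For a stereo pair, only the row with zero vertical displacement would be relevant, which is one way to see why disparity fits into the same machinery as flow.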
During training and evaluation, a single network computes all correspondence types (optical flow and stereo) by inputting the appropriate image pair; at inference, only the horizontal component ($u$) of the output is used for disparity, and both components ($u$, $v$) are used for flow.
All network weights are shared across the twelve ordered image pairs drawn from the set of four images (left and right views at two consecutive time steps), ensuring parameter and representation efficiency.
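A minimal sketch of this weight sharing, assuming a hypothetical callable `shared_model(src_image, dst_image)` that returns a `(u, v)` pair and hypothetical view labels; it only shows how one model serves all twelve ordered pairs and how stereo pairs keep just the horizontal component.

```python
from itertools import permutations

# Hypothetical labels: left/right view at time t and at time t+1.
VIEWS = ("L_t", "R_t", "L_t1", "R_t1")


def is_stereo_pair(src, dst):
    """True when both views share a time step, i.e. form a left/right pair."""
    return src.split("_")[1] == dst.split("_")[1]


def predict_all(shared_model, images):
    """Run the same model on every ordered pair of views."""
    outputs = {}
    for src, dst in permutations(VIEWS, 2):
        u, v = shared_model(images[src], images[dst])
        if is_stereo_pair(src, dst):
            outputs[(src, dst)] = {"u": u}  # disparity: horizontal only
        else:
            outputs[(src, dst)] = {"u": u, "v": v}  # full 2D correspondence
    return outputs
```

Four of the twelve pairs are stereo pairs (two directions at each time step); the remaining eight cross a time step and keep both components.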
3. Self-Supervised Loss Design
Flow2Stereo introduces multiple loss terms to enforce photometric and geometric consistency, applied exclusively to "confident" pixels as determined by a mask obtained from a forward-backward consistency check.
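- Photometric Consistency Loss: penalizes the difference between each reference image and the target image warped back to it by the predicted correspondence, using a robust penalty function evaluated only over confident pixels.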
- Quadrilateral and Triangle Constraint Losses: These encode multi-frame, cross-view geometric relationships (see Section 1), further regularizing correspondence predictions.
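The confidence mask and the photometric term can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: sampling is nearest-neighbour instead of bilinear, the forward-backward test follows a form commonly used in unsupervised flow work, and the thresholds `alpha1`, `alpha2` and the penalty constants `eps`, `q` are illustrative defaults that the source does not confirm.

```python
import numpy as np


def sample_nearest(field, flow):
    """Sample `field` (H, W, C) at p + flow(p) with nearest-neighbour lookup.

    Returns the sampled values and a boolean mask of in-bounds targets.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.rint(xs + flow[..., 0]).astype(int)
    yt = np.rint(ys + flow[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    sampled = field[np.clip(yt, 0, h - 1), np.clip(xt, 0, w - 1)]
    return sampled, inside


def confidence_mask(flow_fw, flow_bw, alpha1=0.01, alpha2=0.5):
    """Forward-backward check: forward flow plus backward flow at the target ~ 0."""
    bw_at_target, inside = sample_nearest(flow_bw, flow_fw)
    mismatch = np.sum((flow_fw + bw_at_target) ** 2, axis=-1)
    budget = alpha1 * (
        np.sum(flow_fw ** 2, axis=-1) + np.sum(bw_at_target ** 2, axis=-1)
    ) + alpha2
    return inside & (mismatch < budget)


def photometric_loss(img_ref, img_tgt, flow_fw, mask, eps=0.01, q=0.4):
    """Robust photometric difference averaged over confident pixels only."""
    warped, _ = sample_nearest(img_tgt, flow_fw)
    penalty = (np.abs(img_ref - warped) + eps) ** q
    per_pixel = penalty.mean(axis=-1)
    return float((per_pixel * mask).sum() / max(mask.sum(), 1))
```

Pixels whose forward target leaves the image, or whose forward and backward estimates disagree, drop out of the average and therefore contribute no gradient.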
The overall teacher-stage loss combines the photometric term with the quadrilateral and triangle constraint terms.
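One plausible way to write these terms is given below. The notation is assumed for illustration and is not taken from the source: $M$ is the confidence mask, $I_{\text{warp}}$ is the target image warped to the reference view, $\psi$ is a robust penalty with constants $\epsilon$ and $q$ left symbolic, and the weights $\lambda_{\text{quad}}$, $\lambda_{\text{tri}}$ are hypothetical placeholders for however the terms are balanced.

$$
\begin{aligned}
\psi(x) &= (|x| + \epsilon)^{q},\\
L_{\text{photo}} &= \frac{\sum_{p} \psi\big(I_{\text{ref}}(p) - I_{\text{warp}}(p)\big)\, M(p)}{\sum_{p} M(p)},\\
L_{\text{teacher}} &= L_{\text{photo}} + \lambda_{\text{quad}} L_{\text{quad}} + \lambda_{\text{tri}} L_{\text{tri}} .
\end{aligned}
$$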
In the second (student) stage, a proxy supervision loss encourages the student to mimic the teacher's confident predictions under more heavily augmented inputs.
No explicit local smoothness regularization proved necessary, as geometry and hard supervision provided sufficient inductive bias.
4. Proxy Tasks and Data Augmentation
A two-stage proxy-task regimen is central to Flow2Stereo’s approach:
- Stage 1 (Teacher): The network is taught using traditional photometric and geometric constraints with moderate augmentations (cropping, color jitter, noise, flipping).
- Stage 2 (Student): The student is presented with challenging, heavily augmented pairs:
  - Random cropping
  - Gaussian or salt-and-pepper noise
  - Random scale perturbation (downsampling/upsampling)
  - Occasional horizontal flips
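Here, the network must learn to predict correspondences under conditions where photometric constancy is violated, relying on the teacher's confident predictions as proxy labels, including in areas affected by occlusion, since no ground-truth labels are available. Formally, for an augmentation operator applied to the inputs, the student is trained so that its prediction on the augmented pair matches the teacher's confident prediction transformed by the same operator. This distillation enforces robustness beyond trivial solutions.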
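A minimal sketch of such a proxy loss, assuming the augmentation is a plain crop (so the teacher's flow values carry over unchanged inside the window) and assuming a robust penalty with illustrative constants `eps` and `q`; the function names are hypothetical and the exact loss form is not confirmed by the source.

```python
import numpy as np


def crop(array, top, left, height, width):
    """Cut the same spatial window out of an image, flow field, or mask."""
    return array[top:top + height, left:left + width]


def proxy_loss(student_flow, teacher_flow, teacher_mask, window, eps=0.01, q=0.4):
    """Penalize student/teacher disagreement on the teacher's confident pixels.

    student_flow: (h, w, 2) prediction on the cropped (augmented) input.
    teacher_flow: (H, W, 2) prediction on the original input.
    teacher_mask: (H, W) boolean confidence mask from the teacher stage.
    window: (top, left, height, width) of the crop given to the student.
    """
    target = crop(teacher_flow, *window)
    weight = crop(teacher_mask, *window).astype(float)
    penalty = ((np.abs(student_flow - target) + eps) ** q).sum(axis=-1)
    return float((penalty * weight).sum() / max(weight.sum(), 1.0))
```

Augmentations that change geometry, such as rescaling or flipping, would also require transforming the teacher's flow values (for example scaling them by the resize factor), which this crop-only sketch leaves out.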
5. Training Protocol
- Dataset: Raw multi-view data from KITTI 2012/2015, excluding selected frames (9–12).
- Optimization:
  - Teacher: Adam optimizer, learning rate halved every $50,000$ iterations, batch size $1$.
  - Student: heavier augmentations, batch size $4$, decayed learning rate.
- At inference: A single forward pass produces both flow and disparity estimates per image pair.
This schedule enables effective convergence, with the student learning to generalize from the teacher's outputs under challenging conditions.
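The teacher's step-wise decay can be expressed in one line. The function name is hypothetical, and the base learning rate is left as a caller-supplied argument because this section does not state it.

```python
def halved_learning_rate(base_lr, step, halve_every=50_000):
    """Learning rate after `step` iterations, halved every `halve_every` steps."""
    return base_lr * 0.5 ** (step // halve_every)
```

For example, with a base rate of `1.0` the function returns `1.0` up to iteration 49,999, `0.5` from iteration 50,000, and `0.25` from iteration 100,000.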
6. Benchmark Performance on KITTI Datasets
Flow2Stereo demonstrates competitive or superior performance on KITTI benchmarks relative to both unsupervised and supervised baselines.
Optical Flow (EPE = average endpoint error in pixels, Fl = percentage of outlier pixels)
| Method | KITTI 2012 (test: EPE/Fl) | KITTI 2015 (test: Fl) |
|---|---|---|
| PWC-Net (sup.) | 1.7 / 8.10% | 9.60% |
| FlowNet2 (sup.) | 1.8 / 8.80% | 10.41% |
| SelFlow (sup.) | 1.5 / 6.19% | 8.42% |
| DDFlow (unsup.) | 3.0 / 8.86% | 14.29% |
| SelFlow (unsup.) | 2.2 / 7.68% | 14.19% |
| Flow2Stereo | 1.7 / 7.63% | 11.10% |
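For reference, the two flow metrics can be computed as in the sketch below. It assumes the outlier rule commonly used for KITTI 2015 (a pixel counts as an outlier when its endpoint error exceeds both 3 px and 5% of the ground-truth magnitude); the function names are hypothetical, and the official benchmark code may differ in details such as boundary handling.

```python
import numpy as np


def endpoint_error(flow_pred, flow_gt):
    """Per-pixel Euclidean distance between predicted and true flow vectors."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1)


def epe_and_fl(flow_pred, flow_gt, valid):
    """Return (mean endpoint error, outlier percentage) over valid pixels.

    flow_pred, flow_gt: arrays of shape (H, W, 2).
    valid: boolean array of shape (H, W) marking pixels with ground truth.
    """
    err = endpoint_error(flow_pred, flow_gt)[valid]
    magnitude = np.linalg.norm(flow_gt, axis=-1)[valid]
    outlier = (err > 3.0) & (err > 0.05 * magnitude)
    return float(err.mean()), float(100.0 * outlier.mean())
```

The stereo D1 metric follows the same thresholding idea applied to scalar disparities instead of 2D flow vectors.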
Stereo Disparity (D1 = percentage of outlier pixels)
| Method | KITTI 2012 (test: D1) | KITTI 2015 (test: D1) |
|---|---|---|
| Godard et al. (unsup.) | – | – |
| SeqStereo (unsup.) | – | – |
| Guo et al. (semi.) | 6.45% | 7.06% |
| UnOS (unsup.) | 5.93% | 6.67% |
| Flow2Stereo | 5.11% | 6.61% |
Qualitative analysis illustrates sharper object boundaries and improved preservation of depth discontinuities in both flow and stereo, especially around motion boundaries and textureless/repetitive regions.
7. Limitations, Ablation Studies, and Future Extensions
Flow2Stereo’s geometric supervision is contingent upon rectified stereo projections and fixed baselines. Application to non-rigid or unrectified camera setups would require extension, such as learning fundamental matrices or more flexible geometric priors. Occluded regions remain a challenge; direct enforcement of geometry in these regions can, in some cases, slightly degrade performance.
Ablation results indicate that introducing quadrilateral and triangle losses reduces KITTI 2012 EPE from $1.06$ to $0.84$, a relative reduction of about 21%. Further, challenging proxy tasks produce a 35–40% decrease in Fl error, with augmentation strategies proving more critical than per-pixel confidence weighting.
Proposed directions include:
- Generalizing to dynamic or non-rigid multi-view setups through learned geometric models.
- Integrating semantic or panoptic segmentation to handle thin structures and dynamic occlusions.
- Leveraging longer temporal windows and multi-hop temporal consistency for improved stability and generalization.
Flow2Stereo demonstrates that a single PWC-Net-based network, trained with strong geometric constraints and rigorous self-supervision that includes challenging proxy-task augmentation, can match or outperform several fully supervised models in dense correspondence estimation without access to ground-truth data (Liu et al., 2020).