MobileStereoNet: Efficient Stereo Vision
- MobileStereoNet is a lightweight stereo vision framework that leverages efficient MobileNet-V1 and V2 blocks for feature extraction.
- It introduces both 2D and 3D variants with a novel interlaced learnable cost volume to maintain near state-of-the-art disparity accuracy.
- The framework achieves 2–4× reduction in parameters and 1.6–8× reduction in FLOPs, enabling real-time deployment on resource-constrained devices.
MobileStereoNet is a lightweight stereo vision framework designed to achieve near state-of-the-art accuracy in disparity estimation while dramatically reducing computational complexity and model size. MobileStereoNet introduces both 2D and 3D network variants, built upon efficient MobileNet-V1 and -V2 depthwise separable convolutional blocks, and incorporates a novel learnable cost volume to enable deployment on resource-constrained hardware with minimal performance degradation (Shamsafar et al., 2021).
1. Architectural Overview
Both MobileStereoNet variants implement the standard three-stage stereo matching pipeline: (a) a Siamese feature extraction backbone, (b) cost volume construction, and (c) encoder-decoder regularization with disparity regression. The system processes rectified stereo pairs of size $H \times W \times 3$. After three initial convolutions, a shared backbone generates feature maps of size $320 \times H/4 \times W/4$. The subsequent steps diverge depending on the variant:
- 2D-MobileStereoNet reduces the backbone features via four convolutions to 32 channels, builds a novel interlaced learnable cost volume (dimension $D/4 \times H/4 \times W/4$, with $D/4 = 48$ for a maximum disparity of $D = 192$), and follows it with two 2D convolutions and three stacked 2D hourglass modules; the disparity map is upsampled to the original resolution.
- 3D-MobileStereoNet directly constructs a group-wise correlation cost volume ("Gwc40", dimension $40 \times D/4 \times H/4 \times W/4$) from the 320-channel backbone features, applies two 3D convolutions, and processes the resulting volume through three stacked 3D hourglasses composed of 3D convolutions ($3 \times 3 \times 3$ kernels).
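In both cases, hourglass modules are symmetric encoder-decoders with skip connections. The only architectural distinction between the hourglasses is that those of 3D-MobileStereoNet convolve over three dimensions (disparity, height, width), whereas those of the 2D variant convolve over the two spatial dimensions.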
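The tensor shapes in this pipeline can be traced with a few lines of arithmetic. The sketch below is only illustrative: it assumes a 1/4-resolution backbone with 320 channels, 32 reduced channels in the 2D variant, 40 correlation groups in the 3D variant, and a maximum disparity of 192. The function name `msnet_shapes` and the example input size are hypothetical.

```python
def msnet_shapes(height, width, max_disp=192, feat_ch=320, reduced_ch=32, groups=40):
    """Return illustrative channels-first tensor shapes for both variants.

    All default values are assumptions made for this sketch.
    """
    h4, w4, d4 = height // 4, width // 4, max_disp // 4
    return {
        "backbone_features": (feat_ch, h4, w4),
        "2d_reduced_features": (reduced_ch, h4, w4),
        "2d_cost_volume": (d4, h4, w4),          # one cost per disparity and pixel
        "3d_cost_volume": (groups, d4, h4, w4),  # one cost per group, disparity, pixel
        "channels_per_group": feat_ch // groups,
    }


# Hypothetical input size, chosen only so the divisions are exact.
shapes = msnet_shapes(256, 512)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

The 2D cost volume has no channel axis beyond disparity, which is what lets the later regularization stay in 2D; the 3D variant keeps an extra group axis and therefore needs 3D convolutions.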
2. MobileNet Block Adaptations
MobileStereoNet extends lightweight MobileNet modules to both 2D and 3D settings:
- MobileNet-V1 Block (depth-wise + point-wise convolution):
  - 2D: depth-wise $3 \times 3$ per-channel convolution, then $1 \times 1$ point-wise convolution over all channels.
  - 3D: depth-wise $3 \times 3 \times 3$ per-channel convolution, then $1 \times 1 \times 1$ point-wise convolution.
- MobileNet-V2 Block (inverted residual):
  - 2D: $1 \times 1$ expansion, $3 \times 3$ depth-wise, and $1 \times 1$ projection; the residual skip is applied when the stride is 1 and the input and output channel counts match.
  - 3D: analogous steps, with $1 \times 1 \times 1$ and $3 \times 3 \times 3$ kernels.
Compared with standard convolutions of the same kernel size and channel widths (and, for V2, a given expansion factor), these blocks reduce the operation count considerably in 2D, for V1 and V2 alike, and by a larger factor in 3D, where the kernel volume is larger. Empirical results indicate that the V1 block is preferable for the backbone (maximal reduction in operations), while the V2 block offers a better accuracy-operations tradeoff for the pre-hourglass and hourglass modules.
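To make the source of these savings concrete, the sketch below counts multiply-accumulate operations per output position for a standard convolution, a V1 block, and a V2 block, in 2D and 3D. The kernel size, channel widths, and expansion factor are hypothetical values picked for illustration and are not the configuration used in the paper; bias terms, normalization, and activations are ignored.

```python
def conv_macs(k, c_in, c_out, dims=2):
    """MACs per output position for a standard convolution with a k^dims kernel."""
    return (k ** dims) * c_in * c_out


def v1_macs(k, c_in, c_out, dims=2):
    """Depth-wise k^dims convolution followed by a 1x1 point-wise convolution."""
    return (k ** dims) * c_in + c_in * c_out


def v2_macs(k, c_in, c_out, expansion, dims=2):
    """Inverted residual: 1x1 expansion, depth-wise k^dims, 1x1 projection."""
    hidden = expansion * c_in
    return c_in * hidden + (k ** dims) * hidden + hidden * c_out


# Hypothetical configuration (assumption, not taken from the paper).
K, C_IN, C_OUT, T = 3, 32, 32, 2

ratios = {}
for dims in (2, 3):
    std = conv_macs(K, C_IN, C_OUT, dims)
    ratios[dims] = (
        std / v1_macs(K, C_IN, C_OUT, dims),
        std / v2_macs(K, C_IN, C_OUT, T, dims),
    )
    print(f"{dims}D: V1 reduction {ratios[dims][0]:.1f}x, "
          f"V2 reduction {ratios[dims][1]:.1f}x")
```

Because the standard cost scales with the full kernel volume times both channel widths, while the depth-wise term scales with the kernel volume times a single width, the ratio grows when moving from a 2D to a 3D kernel.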
3. Learnable Cost Volume Construction
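Conventional cost volumes employ either dot-product correlation, $C(d, x, y) = \langle f_L(x, y),\, f_R(x - d, y) \rangle$, or concatenation, $C(d, x, y) = [\,f_L(x, y),\, f_R(x - d, y)\,]$, where $f_L$ and $f_R$ are the left and right feature maps and $d$ is the candidate disparity.
MobileStereoNet introduces a parameterized interlacing module for the 2D variant. For each disparity $d$, the left feature $f_L$ and the right feature $f_R$ (shifted by $d$) are interlaced along the channel dimension in fixed-size groups, with an equal number of channels taken from each view, transformed by 3D convolution layers applied with non-overlapping channel strides, and projected to a scalar cost per disparity location:
$$C(d, x, y) = g\big(f_L \bowtie f_R^{d}\big)(x, y), \qquad f_R^{d}(x, y) = f_R(x - d, y),$$
where $\bowtie$ denotes channel interlacing and $g$ denotes the 3D convolutional sub-module on the interlaced channels.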
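The channel bookkeeping behind interlacing can be illustrated without any learning. The sketch below interleaves two channel lists and then splits the result into non-overlapping windows, which is what a kernel whose stride equals its size would see along the channel axis. The helper names and the group-size parameter `n` are hypothetical, and the learnable 3D convolutions themselves are deliberately omitted.

```python
def interlace_channels(left, right, n=1):
    """Interleave two equal-length channel lists, taking n channels from each in turn."""
    if len(left) != len(right) or len(left) % n != 0:
        raise ValueError("channel counts must match and be divisible by n")
    out = []
    for i in range(0, len(left), n):
        out.extend(left[i:i + n])
        out.extend(right[i:i + n])
    return out


def group_windows(channels, size):
    """Split channels into non-overlapping windows (kernel size equal to stride)."""
    return [channels[i:i + size] for i in range(0, len(channels), size)]


# Channel labels stand in for feature maps; the right view is assumed
# to be already shifted by the candidate disparity d.
left = [f"L{i}" for i in range(4)]
right = [f"R{i}" for i in range(4)]

mixed = interlace_channels(left, right, n=2)
windows = group_windows(mixed, 4)
print(mixed)
print(windows)
```

Each window contains the same number of left and right channels, so a learned kernel applied to it can compare the two views directly instead of relying on a fixed dot product.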
This design yields a cost volume that remains 3D (disparity, height, width: $D/4 \times H/4 \times W/4$), enabling subsequent regularization modules to operate in pure 2D. Comparative ablation studies on SceneFlow (see table below) confirm that interlaced group-wise cost volumes (specifically, interlaced₄) yield the best accuracy, improving on both standard concatenation and correlation.
| Method | EPE (px) | D1 (%) | px-3 (%) |
|---|---|---|---|
| concat | 1.86 | 7.46 | 8.48 |
| corr | 1.71 | 6.80 | 7.84 |
| interlaced₁ | 1.70 | 6.20 | 7.06 |
| interlaced₂ | 1.61 | 6.39 | 7.31 |
| interlaced₄ | 1.55 | 6.15 | 7.06 |
| interlaced₈ | 1.64 | 6.41 | 7.35 |
The interlaced₄ configuration closes most of the gap between pure 2D and 3D regularization at modest computational cost.
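The relative gains implied by the ablation table can be checked directly. The short sketch below copies the EPE column from the table and computes the percentage improvement of the best interlaced variant over the two fixed baselines; the dictionary keys and function name are illustrative.

```python
# EPE values (px) copied from the ablation table above.
EPE = {
    "concat": 1.86,
    "corr": 1.71,
    "interlaced_1": 1.70,
    "interlaced_2": 1.61,
    "interlaced_4": 1.55,
    "interlaced_8": 1.64,
}


def relative_gain(baseline, candidate):
    """Percentage reduction in EPE when moving from baseline to candidate."""
    return 100.0 * (EPE[baseline] - EPE[candidate]) / EPE[baseline]


best = min(EPE, key=EPE.get)
gain_vs_corr = relative_gain("corr", best)
gain_vs_concat = relative_gain("concat", best)
print(f"best: {best}")
print(f"EPE gain over corr:   {gain_vs_corr:.1f}%")
print(f"EPE gain over concat: {gain_vs_concat:.1f}%")
```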
4. Complexity and Efficiency Analysis
A key motivation behind MobileStereoNet is to minimize parameter count and floating-point operations, making the network feasible for real-time and on-device deployment. The following table summarizes model complexity, with all methods measured at the same input size:
| Method | Params (M) | FLOPs (G) |
|---|---|---|
| 2D baseline (std convs) | 4.07 | 74.4 |
| 2D-MSNet | 2.32 | 32.2 |
| 3D baseline (GwcNet-g) | 6.43 | 246.3 |
| 3D-MSNet | 1.77 | 153.1 |
| PSMNet (2D+3D) | 5.22 | 256.7 |
| GA-Net-deep | 6.58 | 670.3 |
These results demonstrate a 2–4× reduction in model parameters and 1.6–8× reduction in FLOPs compared to leading SOTA approaches without a significant sacrifice in accuracy.
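The quoted ranges follow from simple ratios of the table entries. The sketch below copies the parameter and operation counts from the table and compares each MobileStereoNet variant with one baseline; the choice of pairings is an assumption made for illustration.

```python
# (parameters in millions, operations in billions), copied from the table above.
MODELS = {
    "2D-MSNet": (2.32, 32.2),
    "3D-MSNet": (1.77, 153.1),
    "GwcNet-g": (6.43, 246.3),
    "PSMNet": (5.22, 256.7),
}


def reduction(baseline, compact):
    """Return (parameter ratio, operation ratio) of baseline over compact."""
    base_params, base_ops = MODELS[baseline]
    comp_params, comp_ops = MODELS[compact]
    return base_params / comp_params, base_ops / comp_ops


for baseline, compact in [("GwcNet-g", "3D-MSNet"), ("PSMNet", "2D-MSNet")]:
    p_ratio, o_ratio = reduction(baseline, compact)
    print(f"{compact} vs {baseline}: "
          f"{p_ratio:.2f}x fewer params, {o_ratio:.2f}x fewer ops")
```

Both pairings fall within the 2–4× parameter range, and their operation ratios sit near the two ends of the 1.6–8× range.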
5. Training and Benchmarking Methodology
Training occurs in two phases:
- Pre-training: Conducted on SceneFlow (35,454 train samples, 4,370 test samples, resolution $960 \times 540$). The loss is smooth-$L_1$ between predicted and ground-truth disparity at multiple upsampled scales. The optimizer is Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$), run for 20 epochs at a learning rate of 0.001 (halved at epochs 10, 12, 14, 16), with batch size 8 (2D) or 4 (3D).
- Fine-tuning: On KITTI-15 (160 train/40 val), for 400 epochs (learning rate lowered at epoch 200). Inputs are randomly cropped; no additional augmentation is employed.
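The pre-training loss and step schedule can be sketched in plain Python. This is a minimal illustration under stated assumptions: the smooth-L1 threshold `beta=1.0` is the common default rather than a value given in the text, the epoch indexing convention (halving applies from the milestone epoch onward) is assumed, and `base_lr` is a caller-supplied placeholder.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth-L1 loss over paired disparity values.

    Quadratic for errors below beta, linear above it (beta=1.0 is an assumption).
    """
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)


def learning_rate(epoch, base_lr, milestones=(10, 12, 14, 16), factor=0.5):
    """Scale base_lr by factor once for every milestone epoch already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** passed


# Hypothetical base learning rate of 1.0 so the printed values are the multipliers.
for epoch in (0, 10, 12, 14, 16, 19):
    print(epoch, learning_rate(epoch, base_lr=1.0))
```

With four halvings, the final epochs run at one sixteenth of the initial rate, which is the usual motivation for clustering the milestones late in a short schedule.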
Benchmark results (SceneFlow and KITTI2015):
| Method | EPE (px) | D1 (%) | px-3 (%) | Params (M) | MACs (G) |
|---|---|---|---|---|---|
| PSMNet | 0.88 | 2.00 | 2.10 | 5.22 | 256.7 |
| GA-Net-deep | 0.63 | 1.61 | 1.67 | 6.58 | 670.3 |
| GwcNet-g | 0.62 | 1.49 | 1.53 | 6.43 | 246.3 |
| 2D-MSNet | 0.79 | 2.53 | 2.67 | 2.32 | 32.2 |
| 3D-MSNet | 0.66 | 1.59 | 1.69 | 1.77 | 153.1 |
MobileStereoNet achieves high accuracy (EPE on par with PSMNet and GwcNet) with the lowest parameter count and memory footprint (2D-MSNet: 10.0 MB; 3D-MSNet: 8.0 MB). Reported inference time is approximately 25 ms (2D) and 45 ms (3D) on a 1080Ti GPU.
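For readers unfamiliar with the metrics in the table, the sketch below computes them on a handful of made-up disparity values. It assumes the usual definitions: EPE is the mean absolute disparity error, px-3 is the share of pixels with error above 3 px, and D1 follows the KITTI 2015 convention of counting a pixel as an outlier when its error exceeds both 3 px and 5% of the true disparity. The sample values are hypothetical and masking of invalid pixels is omitted.

```python
def epe(pred, gt):
    """End-point error: mean absolute disparity error in pixels."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def px_k(pred, gt, k=3.0):
    """Percentage of pixels whose absolute error exceeds k pixels."""
    bad = sum(1 for p, g in zip(pred, gt) if abs(p - g) > k)
    return 100.0 * bad / len(gt)


def d1(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of pixels with error above abs_thresh AND above rel_thresh * gt."""
    bad = sum(
        1 for p, g in zip(pred, gt)
        if abs(p - g) > abs_thresh and abs(p - g) > rel_thresh * g
    )
    return 100.0 * bad / len(gt)


# Made-up disparities for four pixels (illustration only).
pred = [10.0, 20.0, 35.0, 96.0]
gt = [10.5, 24.0, 30.0, 100.0]

print(f"EPE:  {epe(pred, gt):.3f} px")
print(f"px-3: {px_k(pred, gt):.1f}%")
print(f"D1:   {d1(pred, gt):.1f}%")
```

The last pixel shows why D1 can be lower than px-3: a 4 px error on a true disparity of 100 px exceeds the absolute threshold but not the 5% relative one.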
6. Ablation Studies and Implementation Details
Detailed ablation studies revealed:
- A MobileNet-V1 backbone achieves a roughly 7× reduction in operations with negligible EPE degradation.
- Hourglass modules implemented as MobileNet-V2 blocks yield a 2–3× reduction in operations.
- A moderate expansion factor in the V2 blocks gives the best accuracy-efficiency tradeoff; larger values incur diminishing returns.
- Replacing the initial convolutions with V2 blocks reduces operations by about 30% with minimal accuracy cost.
- Pre-hourglass convolutions replaced by V2 yield an additional 10% operational reduction.
- The interlaced cost volume (interlaced₄) is critical to narrowing the gap between 2D- and 3D-regularized networks, with about 10% better EPE than fixed correlation.
The full model is released in PyTorch and is optimized for practical deployment on moderate GPU hardware. A plausible implication is that these design choices collectively push MobileStereoNet closer to the practical real-time stereo estimation frontier for edge devices.
7. Context and Position Among Stereo Matching Methods
MobileStereoNet incorporates the design principles of efficiency from MobileNet blocks and extends them to high-dimensional stereo cost volumes, outperforming standard architectures in parameter and memory efficiency. Whereas previous architectures such as PSMNet and GwcNet-g reach state-of-the-art accuracy at high computational cost, MobileStereoNet's role is to make such accuracy feasible for real-time, resource-constrained deployment. This positions it as a compelling choice for applications such as mobile robotics, embedded systems, and real-time automotive perception where both low latency and high fidelity are required (Shamsafar et al., 2021).