MobileStereoNet: Efficient Stereo Vision

Updated 1 April 2026
  • MobileStereoNet is a lightweight stereo vision framework that leverages efficient MobileNet-V1 and V2 blocks for feature extraction.
  • It introduces both 2D and 3D variants with a novel interlaced learnable cost volume to maintain near state-of-the-art disparity accuracy.
  • The framework achieves 2–4× reduction in parameters and 1.6–8× reduction in FLOPs, enabling real-time deployment on resource-constrained devices.

MobileStereoNet is a lightweight stereo vision framework designed to achieve near state-of-the-art accuracy in disparity estimation while substantially reducing computational complexity and model size. It comes in 2D and 3D network variants, both built on efficient MobileNet-V1 and -V2 depthwise separable convolutional blocks, and incorporates a novel learnable cost volume, enabling deployment on resource-constrained hardware with minimal performance degradation (Shamsafar et al., 2021).

1. Architectural Overview

Both MobileStereoNet variants implement the standard three-stage stereo matching pipeline: (a) a Siamese feature extraction backbone, (b) cost volume construction, and (c) encoder–decoder regularization with disparity regression. The system processes rectified stereo pairs of size $H \times W$. After three initial $3 \times 3$ convolutions (strides $2 \rightarrow 2 \rightarrow 1$), a shared backbone generates feature maps of size $320 \times (H/4) \times (W/4)$. The subsequent steps diverge depending on the variant:

  • 2D-MobileStereoNet reduces the backbone features via four $1 \times 1$ convolutions to $32 \times (H/4) \times (W/4)$, builds a novel interlaced learnable cost volume of dimension $\hat{D} \times (H/4) \times (W/4)$ (with $\hat{D} = d_{max}/4 = 48$ for a maximum disparity of $d_{max} = 192$), and follows it with two 2D convolutions and three stacked 2D hourglass modules; the disparity map is then upsampled to the original resolution.
  • 3D-MobileStereoNet directly constructs a group-wise correlation cost volume ("Gwc40", dimension $40 \times \hat{D} \times (H/4) \times (W/4)$) from the 320-channel backbone features, applies two 3D convolutions, and processes the resulting volume through three stacked 3D hourglasses composed of 3D convolutions ($3 \times 3 \times 3$ kernels).

In both cases, hourglass modules are symmetric encoder-decoders with skip connections. Within these modules, the only architectural distinction is that 3D-MobileStereoNet uses 3D convolutions whereas the 2D variant uses 2D convolutions.
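The tensor sizes quoted above can be tabulated with a short Python sketch. The function name `msnet_shapes` and the 256×512 example input are illustrative assumptions; only the channel counts, the 1/4 downsampling, and the maximum disparity of 192 come from the text.

```python
def msnet_shapes(height, width, max_disp=192, feat_channels=320,
                 reduced_channels=32, gwc_groups=40):
    """Return the main tensor shapes (channels first, no batch axis)."""
    if height % 4 or width % 4 or max_disp % 4:
        raise ValueError("height, width and max_disp must be divisible by 4")
    h4, w4, d4 = height // 4, width // 4, max_disp // 4
    return {
        "backbone_features": (feat_channels, h4, w4),       # shared by both variants
        "reduced_features_2d": (reduced_channels, h4, w4),  # after the 1x1 convolutions
        "cost_volume_2d": (d4, h4, w4),                     # interlaced learnable volume
        "cost_volume_3d": (gwc_groups, d4, h4, w4),         # Gwc40 volume
        "disparity_map": (height, width),                   # after upsampling
    }


# Hypothetical input size, chosen only for illustration.
shapes = msnet_shapes(256, 512)
for name, shape in shapes.items():
    print(f"{name:22s} {shape}")
```

The 2D volume has one axis fewer than the 3D one, which is what allows the 2D variant to regularize with ordinary 2D convolutions.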

2. MobileNet Block Adaptations

MobileStereoNet extends lightweight MobileNet modules to both 2D and 3D settings:

  • MobileNet-V1 Block (depth-wise + point-wise convolution):
    • 2D: depth-wise $3 \times 3$ per-channel convolution, then $1 \times 1$ point-wise convolution over all channels.
    • 3D: depth-wise $3 \times 3 \times 3$ per-channel convolution, then $1 \times 1 \times 1$ point-wise convolution.
  • MobileNet-V2 Block (inverted residual):
    • 2D: $1 \times 1$ expansion, $3 \times 3$ depth-wise, and $1 \times 1$ projection; residual skip applied when appropriate.
    • 3D: analogous steps, with $1 \times 1 \times 1$ and $3 \times 3 \times 3$ kernels.

Compared with standard convolutions, these blocks reduce the number of operations in both 2D and 3D; the size of the saving depends on the kernel size, the input and output channel counts, and (for V2) the expansion factor. Empirical results indicate that the V1 block is preferable for the backbone (maximal reduction in operations), while the V2 block offers a better accuracy-operations tradeoff for the pre-hourglass and hourglass modules.
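To make the comparison concrete, the following sketch counts multiply-accumulate operations per output position for a standard convolution, a V1 block, and a V2 block. The formulas are the usual textbook counts; the example setting (kernel 3, 32 input and output channels, expansion 2) is an assumption chosen for illustration, not a configuration taken from the paper.

```python
def standard_macs(k, c_in, c_out, dims=2):
    """MACs per output position for a dense k^dims convolution."""
    return (k ** dims) * c_in * c_out


def v1_macs(k, c_in, c_out, dims=2):
    """Depth-wise k^dims convolution followed by a point-wise convolution."""
    return (k ** dims) * c_in + c_in * c_out


def v2_macs(k, c_in, c_out, t, dims=2):
    """Inverted residual: point-wise expansion by t, depth-wise, point-wise projection."""
    hidden = t * c_in
    return c_in * hidden + (k ** dims) * hidden + hidden * c_out


# Hypothetical example setting.
k, c_in, c_out, t = 3, 32, 32, 2
for dims in (2, 3):
    std = standard_macs(k, c_in, c_out, dims)
    v1 = v1_macs(k, c_in, c_out, dims)
    v2 = v2_macs(k, c_in, c_out, t, dims)
    print(f"{dims}D: standard={std}, V1={v1} ({std / v1:.1f}x), V2={v2} ({std / v2:.1f}x)")
```

Because the dense cost grows with the kernel volume times both channel counts, the relative saving is larger in 3D than in 2D for the same channel configuration.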

3. Learnable Cost Volume Construction

Conventional cost volumes employ either dot-product correlation, $C_{corr}(d, x, y) = \langle f_L(x, y), f_R(x - d, y) \rangle$, or concatenation, $C_{concat}(d, x, y) = [\, f_L(x, y) \,\|\, f_R(x - d, y) \,]$, where $f_L$ and $f_R$ are the left and right feature maps and $d$ is the candidate disparity.

MobileStereoNet introduces a parameterized interlacing module for the 2D variant. For each disparity $d$, the left feature $f_L(x, y)$ and the right feature $f_R(x - d, y)$ (shifted by $d$) are interlaced along the channel dimension in fixed-size groups, with each view contributing the same number of channels per group. The interlaced tensor is transformed by 3D convolution layers applied with non-overlapping channel strides and projected to a scalar cost per disparity and location:

$$C(d, x, y) = \Phi\big(\mathrm{interlace}(f_L(x, y),\, f_R(x - d, y))\big)$$

where $\Phi$ denotes the 3D convolutional sub-module applied to the interlaced channels.
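A minimal NumPy sketch of the two conventional volumes and of the channel-interlacing step described above. The function names, the `group` parameter, zero padding at the image border, and channel averaging in the correlation are illustrative assumptions; the learned 3D convolutional sub-module that turns the interlaced channels into a scalar cost is not reproduced here.

```python
import numpy as np


def shift_right_features(right, d):
    """Shift (C, H, W) right-view features by d pixels along width, zero-padded."""
    shifted = np.zeros_like(right)
    if d == 0:
        shifted[:] = right
    else:
        shifted[:, :, d:] = right[:, :, :-d]
    return shifted


def correlation_volume(left, right, num_disp):
    """Dot product over channels (averaged) for each disparity -> (D, H, W)."""
    return np.stack([(left * shift_right_features(right, d)).mean(axis=0)
                     for d in range(num_disp)])


def concat_volume(left, right, num_disp):
    """Left and shifted right channels stacked for each disparity -> (2C, D, H, W)."""
    return np.stack([np.concatenate([left, shift_right_features(right, d)], axis=0)
                     for d in range(num_disp)], axis=1)


def interlace_channels(left, right, group=1):
    """Alternate blocks of `group` channels from left and right -> (2C, H, W)."""
    c, h, w = left.shape
    if c % group:
        raise ValueError("channel count must be divisible by group")
    l = left.reshape(c // group, group, h, w)
    r = right.reshape(c // group, group, h, w)
    return np.stack([l, r], axis=1).reshape(2 * c, h, w)


# Tiny random features, for shape checking only.
rng = np.random.default_rng(0)
f_left = rng.standard_normal((4, 2, 6))
f_right = rng.standard_normal((4, 2, 6))
print(correlation_volume(f_left, f_right, 3).shape)
print(concat_volume(f_left, f_right, 3).shape)
print(interlace_channels(f_left, shift_right_features(f_right, 1), group=2).shape)
```

Correlation collapses the channel axis and concatenation keeps it, which is why the latter needs 3D regularization; a learned reduction of the interlaced channels aims to keep a scalar cost per disparity without fixing the similarity function in advance.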

This design yields a cost volume that remains 3D ($\hat{D} \times (H/4) \times (W/4)$), enabling subsequent regularization modules to operate in pure 2D. Comparative ablation studies on SceneFlow (see table below) indicate that interlaced cost volumes, specifically the interlaced₄ configuration, yield the best accuracy improvement over standard concatenation or correlation.

Method        EPE (px)  D1 (%)  px-3 (%)
concat        1.86      7.46    8.48
corr          1.71      6.80    7.84
interlaced₁   1.70      6.20    7.06
interlaced₂   1.61      6.39    7.31
interlaced₄   1.55      6.15    7.06
interlaced₈   1.64      6.41    7.35

Interlacing (interlaced₄) closes most of the gap between pure 2D and 3D regularization at modest computational cost.
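The relative gains implied by the ablation table can be checked with a few lines of Python. The EPE values are copied from the table; the dictionary keys and the helper `relative_gain` are naming choices made for this sketch.

```python
# EPE (px) values copied from the SceneFlow ablation table.
epe = {
    "concat": 1.86,
    "corr": 1.71,
    "interlaced_1": 1.70,
    "interlaced_2": 1.61,
    "interlaced_4": 1.55,
    "interlaced_8": 1.64,
}


def relative_gain(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


best = min(epe, key=epe.get)
gain_vs_corr = relative_gain(epe["corr"], epe[best])
gain_vs_concat = relative_gain(epe["concat"], epe[best])
print(best, round(gain_vs_corr, 1), round(gain_vs_concat, 1))
```

The best interlaced setting improves EPE by a little under 10% relative to correlation and by roughly 17% relative to concatenation.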

4. Complexity and Efficiency Analysis

A key motivation behind MobileStereoNet is to minimize parameter count and floating-point operations, making the network feasible for real-time and on-device deployment. The following table summarizes model complexity, with all methods measured at the same input size:

Method                    Params (M)  FLOPs (G)
2D baseline (std convs)   4.07        74.4
2D-MSNet                  2.32        32.2
3D baseline (GwcNet-g)    6.43        246.3
3D-MSNet                  1.77        153.1
PSMNet (2D+3D)            5.22        256.7
GA-Net-deep               6.58        670

These results demonstrate a 2–4× reduction in model parameters and 1.6–8× reduction in FLOPs compared to leading SOTA approaches without a significant sacrifice in accuracy.
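As a sanity check on the quoted ranges, the ratios can be recomputed from the complexity table. The numbers are copied from the table; the pairing of each compact model with a reference model is an assumption made for illustration.

```python
# (parameters in millions, operations in G) copied from the complexity table.
models = {
    "2D baseline": (4.07, 74.4),
    "2D-MSNet": (2.32, 32.2),
    "GwcNet-g": (6.43, 246.3),
    "3D-MSNet": (1.77, 153.1),
    "PSMNet": (5.22, 256.7),
    "GA-Net-deep": (6.58, 670.0),
}


def reduction(reference, compact):
    """Return (parameter ratio, operation ratio) of reference over compact."""
    ref_params, ref_ops = models[reference]
    params, ops = models[compact]
    return ref_params / params, ref_ops / ops


for reference, compact in [("2D baseline", "2D-MSNet"),
                           ("GwcNet-g", "3D-MSNet"),
                           ("PSMNet", "2D-MSNet"),
                           ("PSMNet", "3D-MSNet")]:
    p_ratio, o_ratio = reduction(reference, compact)
    print(f"{compact} vs {reference}: params {p_ratio:.2f}x, ops {o_ratio:.2f}x")
```

The lower end of the operation range (about 1.6×) corresponds to 3D-MSNet against GwcNet-g, and the upper end (about 8×) to 2D-MSNet against PSMNet.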

5. Training and Benchmarking Methodology

Training occurs in two phases:

  • Pre-training: Conducted on SceneFlow (35,454 training samples, 4,370 test samples, resolution $960 \times 540$). The loss is smooth-$L_1$ between predicted and ground-truth disparity at multiple upsampled scales. The optimizer is Adam, run for 20 epochs with the learning rate halved at epochs 10, 12, 14, and 16; the batch size is 8 (2D) or 4 (3D).
  • Fine-tuning: On KITTI-15 (160 train/40 val), for 400 epochs, with the learning rate adjusted at epoch 200. Inputs are standard random crops; no additional augmentation is employed.
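The pre-training schedule and loss can be sketched in plain Python. The base learning rate is left as an argument, the convention that a milestone takes effect from that epoch onward is an assumption, and `smooth_l1` follows the common definition with threshold `beta = 1`, which may differ in detail from the authors' code.

```python
def learning_rate(epoch, base_lr, milestones=(10, 12, 14, 16), factor=0.5):
    """Halve the rate once for every milestone that has been reached."""
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** reached


def smooth_l1(pred, target, beta=1.0):
    """Quadratic below `beta`, linear above it (per-pixel value)."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


# Multipliers relative to a base rate of 1.0 over the 20 pre-training epochs.
schedule = [learning_rate(epoch, 1.0) for epoch in range(20)]
print(schedule)
print(smooth_l1(1.5, 1.0), smooth_l1(4.0, 1.0))
```

With four halvings, the final epochs run at one sixteenth of the base rate; the loss treats small disparity errors quadratically and large ones linearly, which limits the influence of outliers.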

Benchmark results (SceneFlow and KITTI2015):

Method        EPE (px)  D1 (%)  px-3 (%)  Params (M)  MACs (G)
PSMNet        0.88      2.00    2.10      5.22        256.7
GA-Net-deep   0.63      1.61    1.67      6.58        670.3
GwcNet-g      0.62      1.49    1.53      6.43        246.3
2D-MSNet      0.79      2.53    2.67      2.32        32.2
3D-MSNet      0.66      1.59    1.69      1.77        153.1

MobileStereoNet achieves high accuracy (EPE on par with PSMNet and GwcNet) with the lowest parameter count and memory footprint (2D-MSNet: 10.0 MB; 3D-MSNet: 8.0 MB). Inference time is approximately 25 ms (2D) and 45 ms (3D) on a 1080Ti GPU.
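For readers unfamiliar with the column headers, the sketch below computes the three accuracy metrics under their commonly used definitions: EPE as the mean absolute disparity error, px-3 as the share of pixels with error above 3 px, and D1 as the share with error above both 3 px and 5% of the true disparity. These definitions, the validity mask, and the function name are assumptions of this example; the benchmark code behind the table may differ in detail.

```python
import numpy as np


def stereo_metrics(pred, gt, max_disp=192):
    """Return (EPE in px, px-3 in %, D1 in %) over pixels with valid ground truth."""
    valid = (gt > 0) & (gt < max_disp)  # assumed convention: 0 marks missing ground truth
    err = np.abs(pred[valid] - gt[valid])
    epe = float(err.mean())
    px3 = 100.0 * float((err > 3.0).mean())
    d1 = 100.0 * float(((err > 3.0) & (err > 0.05 * gt[valid])).mean())
    return epe, px3, d1


# Toy example: the third pixel has no ground truth and is ignored.
gt = np.array([[10.0, 100.0, 0.0, 50.0]])
pred = np.array([[10.5, 104.0, 7.0, 50.0]])
epe, px3, d1 = stereo_metrics(pred, gt)
print(epe, px3, d1)
```

In the toy example the 4 px error on a true disparity of 100 counts toward px-3 but not toward D1, since it is below 5% of the true value.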

6. Ablation Studies and Implementation Details

Detailed ablation studies revealed:

  • A MobileNet-V1 backbone achieves an approximately 7× reduction in operations with negligible EPE degradation.
  • Hourglass modules implemented with MobileNet-V2 blocks yield a 2–3× reduction in operations.
  • A moderate expansion factor in the V2 blocks gives the best accuracy-efficiency tradeoff; larger values incur diminishing returns.
  • Replacing the initial $3 \times 3$ convolutions with V2 blocks reduces operations by approximately 30% with minimal accuracy cost.
  • Replacing the pre-hourglass convolutions with V2 blocks yields an additional 10% reduction in operations.
  • The interlaced cost volume (interlaced₄) is critical to narrowing the gap between 2D- and 3D-regularized networks, with approximately 10% better EPE than fixed correlation.

The full model is released in PyTorch and is optimized for practical deployment on moderate GPU hardware. A plausible implication is that these design choices collectively push MobileStereoNet closer to the practical real-time stereo estimation frontier for edge devices.

7. Context and Position Among Stereo Matching Methods

MobileStereoNet incorporates the design principles of efficiency from MobileNet blocks and extends them to high-dimensional stereo cost volumes, outperforming standard architectures in parameter and memory efficiency. Whereas previous architectures such as PSMNet and GwcNet-g reach state-of-the-art accuracy at high computational cost, MobileStereoNet's role is to make such accuracy feasible for real-time, resource-constrained deployment. This positions it as a compelling choice for applications such as mobile robotics, embedded systems, and real-time automotive perception where both low latency and high fidelity are required (Shamsafar et al., 2021).
