MobileStereoNet: Efficient Stereo Vision

Updated 1 April 2026

MobileStereoNet is a lightweight stereo vision framework that leverages efficient MobileNet-V1 and V2 blocks for feature extraction.
It introduces both 2D and 3D variants with a novel interlaced learnable cost volume to maintain near state-of-the-art disparity accuracy.
The framework achieves 2–4× reduction in parameters and 1.6–8× reduction in FLOPs, enabling real-time deployment on resource-constrained devices.

MobileStereoNet is a lightweight stereo vision framework designed to achieve near state-of-the-art accuracy in disparity estimation while dramatically reducing computational complexity and model size. MobileStereoNet introduces both 2D and 3D network variants, built upon efficient MobileNet-V1 and -V2 depthwise separable convolutional blocks, and incorporates a novel learnable cost volume to enable deployment on resource-constrained hardware with minimal performance degradation (Shamsafar et al., 2021).

1. Architectural Overview

Both MobileStereoNet variants implement the standard three-stage stereo matching pipeline: (a) Siamese feature extraction backbone, (b) cost volume construction, and (c) encoder–decoder regularization/disparity regression. The system processes rectified stereo pairs of size $H \times W$. After three initial $3 \times 3$ convolutions (strides $2 \rightarrow 2 \rightarrow 1$), a shared backbone generates feature maps of size $320 \times (H/4) \times (W/4)$. The subsequent steps diverge depending on the variant:

2D-MobileStereoNet reduces the backbone features via four $1 \times 1$ convolutions to $32 \times (H/4) \times (W/4)$, builds a novel interlaced learnable cost volume (dimension $\hat{D} \times (H/4) \times (W/4)$, with $\hat{D} = d_{max}/4 = 48$ for a maximum disparity of $d_{max} = 192$), and follows it with two 2D convolutions and three stacked 2D hourglass modules; the disparity map is then upsampled to the original resolution.
3D-MobileStereoNet directly constructs a group-wise correlation cost volume ("Gwc40", dimension $40 \times \hat{D} \times (H/4) \times (W/4)$) from the $320$-channel backbone features, applies two 3D convolutions, and processes the resulting volume through three stacked 3D hourglasses composed of 3D convolutions ($3 \times 3 \times 3$ kernels).

In both cases, hourglass modules are symmetric encoder-decoders with skip connections. The only architectural distinction between them is that the convolutions in 3D-MobileStereoNet are 3D, whereas those in the 2D variant are 2D.
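The tensor sizes quoted in this overview can be checked with a small shape calculator. This is a sketch only: the function name `msnet_shapes`, its parameter names, and the example input size of 256 × 512 are assumptions made for illustration, not values taken from the paper or the released code.

```python
def msnet_shapes(height, width, d_max=192, backbone_channels=320,
                 reduced_channels=32, gwc_groups=40):
    """Return the tensor shapes named in the overview for an H x W input."""
    if height % 4 or width % 4 or d_max % 4:
        raise ValueError("height, width and d_max must be divisible by 4")
    h4, w4, d_hat = height // 4, width // 4, d_max // 4
    return {
        "backbone_features": (backbone_channels, h4, w4),
        "reduced_features_2d": (reduced_channels, h4, w4),
        "cost_volume_2d": (d_hat, h4, w4),              # D_hat x H/4 x W/4
        "cost_volume_3d": (gwc_groups, d_hat, h4, w4),  # 40 x D_hat x H/4 x W/4
    }


def num_elements(shape):
    """Number of scalar entries in a tensor of the given shape."""
    total = 1
    for size in shape:
        total *= size
    return total


# Hypothetical input size, chosen only to make the numbers concrete.
shapes = msnet_shapes(256, 512)
for name, shape in shapes.items():
    print(name, shape, num_elements(shape))
```

Under these assumptions the 3D variant's cost volume holds 40 times as many entries as the 2D variant's, which is one reason the 2D regularization path is cheaper.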

2. MobileNet Block Adaptations

MobileStereoNet extends lightweight MobileNet modules to both 2D and 3D settings:

MobileNet-V1 Block (depth-wise + point-wise convolution):
- 2D: depth-wise $3 \times 3$ per-channel convolution, then $1 \times 1$ point-wise convolution over all channels.
- 3D: depth-wise $3 \times 3 \times 3$ per-channel convolution, then $1 \times 1 \times 1$ point-wise convolution.
MobileNet-V2 Block (inverted residual):
- 2D: $1 \times 1$ expansion, $3 \times 3$ depth-wise, and $1 \times 1$ projection; residual skip applied when appropriate.
- 3D: analogous steps, with $1 \times 1 \times 1$ and $3 \times 3 \times 3$ kernels.

Compared to standard convolutions with the same kernel size, channel counts, and (for V2) expansion factor, these blocks require substantially fewer operations in both 2D and 3D. Empirical results indicated that the V1 block is preferable for the backbone (maximal reduction in operations), while the V2 block offers a better accuracy-operations tradeoff for the pre-hourglass and hourglass modules.
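The source of these savings can be illustrated by counting multiply-accumulate operations (MACs) per output position for each block type. The sketch below ignores biases, normalization, and activations, and the example configuration (kernel size 3, 32 input and output channels, expansion factor 2) is a hypothetical choice for illustration, so the printed ratios should not be read as the paper's reported figures.

```python
def standard_macs(k, c_in, c_out, dims):
    """MACs per output position for a dense convolution with a k^dims kernel."""
    return (k ** dims) * c_in * c_out


def v1_macs(k, c_in, c_out, dims):
    """MobileNet-V1 style block: depth-wise k^dims conv, then point-wise conv."""
    depthwise = (k ** dims) * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


def v2_macs(k, c_in, c_out, t, dims):
    """MobileNet-V2 style block: point-wise expansion by t, depth-wise, projection."""
    hidden = t * c_in
    expansion = c_in * hidden
    depthwise = (k ** dims) * hidden
    projection = hidden * c_out
    return expansion + depthwise + projection


# Hypothetical configuration, not taken from the paper.
k, c_in, c_out, t = 3, 32, 32, 2
for dims in (2, 3):
    std = standard_macs(k, c_in, c_out, dims)
    print(f"{dims}D: V1 reduction {std / v1_macs(k, c_in, c_out, dims):.2f}x, "
          f"V2 reduction {std / v2_macs(k, c_in, c_out, t, dims):.2f}x")
```

The dense kernel term grows as $k^2$ or $k^3$ times the channel product, while the separable blocks pay the kernel cost only once per channel, so the relative saving is larger in 3D than in 2D.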

3. Learnable Cost Volume Construction

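Conventional cost volumes employ either dot-product correlation, $C_{corr}(d, x, y) = \langle f_L(x, y), f_R(x - d, y) \rangle$, or concatenation, $C_{concat}(d, x, y) = [f_L(x, y), f_R(x - d, y)]$, of the left and right features $f_L$ and $f_R$ at disparity $d$.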
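The two conventional constructions can be sketched in NumPy as follows. This is a minimal illustration assuming feature maps of shape (C, H, W) and a right image whose matching pixel lies at column x − d; the function names are hypothetical, and positions with no right-image counterpart are simply filled with zeros.

```python
import numpy as np


def shift_right_features(f_right, d):
    """Shift right features so column x holds f_R(x - d); pad the left edge with zeros."""
    shifted = np.zeros_like(f_right)
    if d == 0:
        shifted[:] = f_right
    else:
        shifted[:, :, d:] = f_right[:, :, :-d]
    return shifted


def correlation_volume(f_left, f_right, num_disp):
    """Dot-product correlation: one scalar per (d, y, x), shape (D, H, W)."""
    _, h, w = f_left.shape
    volume = np.zeros((num_disp, h, w), dtype=f_left.dtype)
    for d in range(num_disp):
        volume[d] = (f_left * shift_right_features(f_right, d)).sum(axis=0)
    return volume


def concat_volume(f_left, f_right, num_disp):
    """Concatenation: keeps all channels, shape (2C, D, H, W)."""
    c, h, w = f_left.shape
    volume = np.zeros((2 * c, num_disp, h, w), dtype=f_left.dtype)
    for d in range(num_disp):
        volume[:c, d] = f_left
        volume[c:, d] = shift_right_features(f_right, d)
    return volume
```

Correlation collapses the channel dimension and so allows 2D regularization, while concatenation preserves channel information at the cost of a 4D volume that needs 3D convolutions.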

MobileStereoNet introduces a parameterized interlacing module for the 2D variant. For each disparity $d$, the left feature $f_L$ and the right feature $f_R$ (shifted by $d$) are interlaced along their channel dimension in groups (taking $n$ channels from each in turn), transformed by 3D convolution layers applied with non-overlapping channel strides, and projected to a scalar cost per disparity location:

$C(d, x, y) = g\big(\mathrm{interlace}_n(f_L(x, y), f_R(x - d, y))\big)$

where $g$ denotes the 3D convolutional sub-module on the interlaced channels.
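The channel arrangement produced by interlacing can be sketched in NumPy. Only the interlacing step is shown; the learnable 3D convolutions that map the interlaced channels to a scalar cost are omitted. The function name `interlace_channels` and the exact grouping (n left channels followed by n right channels, repeated) are assumptions for illustration and may differ from the released implementation.

```python
import numpy as np


def interlace_channels(f_left, f_right_shifted, n):
    """Alternate n channels from the left features with n from the right.

    Both inputs have shape (C, H, W); the result has shape (2C, H, W) with
    channel order L[0:n], R[0:n], L[n:2n], R[n:2n], ...
    """
    if f_left.shape != f_right_shifted.shape:
        raise ValueError("left and right features must have the same shape")
    c, h, w = f_left.shape
    if c % n:
        raise ValueError("channel count must be divisible by n")
    left_groups = f_left.reshape(c // n, n, h, w)
    right_groups = f_right_shifted.reshape(c // n, n, h, w)
    # Pair each left group with the right group at the same index.
    stacked = np.stack([left_groups, right_groups], axis=1)  # (C/n, 2, n, H, W)
    return stacked.reshape(2 * c, h, w)


# Tiny example: 4 channels, 1 x 1 spatial size, channel values used as labels.
left = np.arange(4, dtype=float).reshape(4, 1, 1)          # 0, 1, 2, 3
right = (10 + np.arange(4, dtype=float)).reshape(4, 1, 1)  # 10, 11, 12, 13
print(interlace_channels(left, right, 2).ravel())
```

With n = 1 the left and right channels alternate one by one, and with n equal to the channel count the result reduces to plain concatenation, so n controls how finely the two views are mixed before the learnable layers see them.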

This design yields a cost volume that remains 3D ($\hat{D} \times (H/4) \times (W/4)$), enabling subsequent regularization modules to operate in pure 2D. Comparative ablation studies on SceneFlow (see table below) confirm that interlaced groupwise cost volumes (specifically, interlaced₄) yield the best accuracy improvement over standard concatenation or correlation.

Method	EPE (px)	D1 (%)	px-3 (%)
concat	1.86	7.46	8.48
corr	1.71	6.80	7.84
interlaced₁	1.70	6.20	7.06
interlaced₂	1.61	6.39	7.31
interlaced₄	1.55	6.15	7.06
interlaced₈	1.64	6.41	7.35

Interlacing with the interlaced₄ setting closes most of the gap between pure 2D and 3D regularization at modest computational cost.

4. Complexity and Efficiency Analysis

A key motivation behind MobileStereoNet is to minimize parameter count and floating-point operations, making the network feasible for real-time and on-device deployment. The following table summarizes model complexity at a fixed input resolution:

Method	Params (M)	FLOPs (G)
2D baseline (std convs)	4.07	74.4
2D-MSNet	2.32	32.2
3D baseline (GwcNet-g)	6.43	246.3
3D-MSNet	1.77	153.1
PSMNet (2D+3D)	5.22	256.7
GA-Net-deep	6.58	670

These results demonstrate a 2–4× reduction in model parameters and 1.6–8× reduction in FLOPs compared to leading SOTA approaches without a significant sacrifice in accuracy.
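The quoted ranges can be reproduced from the table with simple ratios. The sketch below copies the table values verbatim; the dictionary layout and the helper name `reduction` are illustrative choices.

```python
# (parameters in millions, FLOPs in billions), copied from the table above.
models = {
    "2D baseline (std convs)": (4.07, 74.4),
    "2D-MSNet": (2.32, 32.2),
    "3D baseline (GwcNet-g)": (6.43, 246.3),
    "3D-MSNet": (1.77, 153.1),
    "PSMNet (2D+3D)": (5.22, 256.7),
    "GA-Net-deep": (6.58, 670.0),
}


def reduction(baseline, compact):
    """Return (parameter ratio, FLOPs ratio) of baseline over compact model."""
    base_params, base_flops = models[baseline]
    params, flops = models[compact]
    return base_params / params, base_flops / flops


for baseline, compact in [
    ("2D baseline (std convs)", "2D-MSNet"),
    ("3D baseline (GwcNet-g)", "3D-MSNet"),
    ("PSMNet (2D+3D)", "2D-MSNet"),
    ("PSMNet (2D+3D)", "3D-MSNet"),
]:
    p, f = reduction(baseline, compact)
    print(f"{compact} vs {baseline}: {p:.2f}x fewer params, {f:.2f}x fewer FLOPs")
```

For example, GwcNet-g against 3D-MSNet gives roughly 3.6× fewer parameters and 1.6× fewer FLOPs, while PSMNet against 2D-MSNet gives roughly 8× fewer FLOPs, which brackets the FLOPs range stated above.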

5. Training and Benchmarking Methodology

Training occurs in two phases:

Pre-training: Conducted on SceneFlow (35,454 train samples, 4,370 test samples, resolution $960 \times 540$). The loss is smooth-$L_1$ between predicted and ground-truth disparity at multiple upsampled scales. The optimizer is Adam, run for 20 epochs with the learning rate halved at epochs 10, 12, 14, and 16, and a batch size of 8 (2D) or 4 (3D).
Fine-tuning: On KITTI-15 (160 train/40 val), for 400 epochs, with the learning rate adjusted at epoch 200. Inputs are standard random crops; no additional augmentation is employed.
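The loss and the pre-training schedule can be sketched in plain Python. The smooth-L1 form below is the common definition with a transition point of 1; whether "halved at epoch 10" refers to a zero-based or one-based epoch index is an assumption here, and the schedule is expressed as a multiplier on an unspecified base learning rate.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 penalty for one disparity value: quadratic near zero, linear beyond beta."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def lr_factor(epoch, milestones=(10, 12, 14, 16)):
    """Multiplier on the base learning rate: halved once per milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return 0.5 ** passed


# Multiplier for each of the 20 pre-training epochs (epoch index assumed zero-based).
schedule = [lr_factor(epoch) for epoch in range(20)]
print(schedule)
```

After the fourth milestone the learning rate sits at one sixteenth of its starting value for the remaining epochs.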

Benchmark results (SceneFlow and KITTI2015):

Method	EPE (px)	D1 (%)	px-3 (%)	Params (M)	MACs (G)
PSMNet	0.88	2.00	2.10	5.22	256.7
GA-Net-deep	0.63	1.61	1.67	6.58	670.3
GwcNet-g	0.62	1.49	1.53	6.43	246.3
2D-MSNet	0.79	2.53	2.67	2.32	32.2
3D-MSNet	0.66	1.59	1.69	1.77	153.1

MobileStereoNet achieves high accuracy (EPE on par with PSMNet and GwcNet) with the lowest parameter count and memory footprint (2D-MSNet: 10.0 MB; 3D-MSNet: 8.0 MB). Inference time is approximately 25 ms (2D) and 45 ms (3D) on a 1080Ti GPU.
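For readers unfamiliar with the column headings, the three error measures can be sketched as follows. The definitions used here are the conventional ones (end-point error as mean absolute disparity error, px-3 as the share of pixels off by more than 3 px, D1 as the share off by more than 3 px and more than 5% of the true disparity); it is an assumption that the table follows exactly these conventions, and the sample values are made up.

```python
def epe(pred, gt):
    """End-point error: mean absolute disparity error in pixels."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def px3(pred, gt, thresh=3.0):
    """Percentage of pixels whose absolute error exceeds thresh pixels."""
    bad = sum(1 for p, g in zip(pred, gt) if abs(p - g) > thresh)
    return 100.0 * bad / len(gt)


def d1(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of pixels with error above abs_thresh and above rel_thresh * gt."""
    bad = sum(
        1 for p, g in zip(pred, gt)
        if abs(p - g) > abs_thresh and abs(p - g) > rel_thresh * g
    )
    return 100.0 * bad / len(gt)


# Made-up disparities for four valid pixels.
pred = [10.0, 20.0, 104.0, 50.0]
gt = [10.0, 22.0, 100.0, 45.0]
print(epe(pred, gt), px3(pred, gt), d1(pred, gt))
```

The third pixel shows why D1 is never larger than px-3: an error of 4 px counts as bad under px-3 but is within 5% of a true disparity of 100, so D1 does not count it.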

6. Ablation Studies and Implementation Details

Detailed ablation studies revealed:

MobileNet-V1 backbone achieves a roughly 7× reduction in operations with negligible EPE degradation.
Hourglass modules implemented as MobileNet-V2 blocks yield a 2–3× reduction in operations.
A narrow range of expansion factors in the V2 blocks is optimal for the accuracy-efficiency tradeoff; larger values incur diminishing returns.
Replacing the initial $3 \times 3$ convolutions with V2 blocks reduces operations by roughly 30% with minimal accuracy cost.
Pre-hourglass convolutions replaced by V2 blocks yield an additional 10% operational reduction.
The interlaced cost volume is critical to narrowing the gap between 2D- and 3D-regularized networks, with roughly 10% better EPE than fixed correlation.

The full model is released in PyTorch and is optimized for practical deployment on moderate GPU hardware. A plausible implication is that these design choices collectively push MobileStereoNet closer to the practical real-time stereo estimation frontier for edge devices.

7. Context and Position Among Stereo Matching Methods

MobileStereoNet incorporates the design principles of efficiency from MobileNet blocks and extends them to high-dimensional stereo cost volumes, outperforming standard architectures in parameter and memory efficiency. Whereas previous architectures such as PSMNet and GwcNet-g reach state-of-the-art accuracy at high computational cost, MobileStereoNet's role is to make such accuracy feasible for real-time, resource-constrained deployment. This positions it as a compelling choice for applications such as mobile robotics, embedded systems, and real-time automotive perception where both low latency and high fidelity are required (Shamsafar et al., 2021).


References

Shamsafar et al. (2021). MobileStereoNet: Towards Lightweight Deep Networks for Stereo Matching.
