3D-MobileStereoNet: Lightweight Stereo Matching

Updated 1 April 2026

The paper introduces lightweight stereo matching by replacing expensive 3D convolutions with depth-wise separable and inverted residual 3D blocks, significantly reducing parameters and computations.
It employs a group-wise correlation cost volume and a multi-stage hourglass encoder-decoder to progressively refine disparity estimates with sub-pixel accuracy.
Empirical evaluations on SceneFlow and KITTI demonstrate competitive accuracy while using far fewer parameters and MACs than previous state-of-the-art models.

3D-MobileStereoNet is a lightweight deep stereo matching network explicitly designed for high efficiency on resource-limited devices without a significant sacrifice in disparity estimation accuracy. Building upon MobileNet-V1 and MobileNet-V2 concepts, the 3D-MobileStereoNet architecture replaces costly 3D convolutions with depth-wise separable and inverted residual 3D blocks, yielding a substantial reduction in parameter count and computational complexity (about 3.6× fewer parameters and 1.6× fewer MACs than GwcNet-g) while operating on a group-wise correlation cost volume. The model is structured as a feature-sharing encoder, a group-wise cost volume with 4D structure, and a refined 3D encoder-decoder pipeline with iterative hourglass modules, culminating in sub-pixel disparity estimation via soft-argmin regression (Shamsafar et al., 2021).

1. 3D Depth-wise Separable MobileNet Blocks

3D-MobileStereoNet extends standard MobileNet blocks from 2D to 3D, transforming spatial kernel operations from $k \times k$ to $k \times k \times k$ . Both MobileNet-V1 (depth-wise separable convolution) and MobileNet-V2 (inverted residual with expansion) are adapted as follows:

3D MobileNet-V1 block (v1_3D): Consists of a depth-wise convolution (3×3×3 kernel, stride $s \in \{1,2\}$ , padding=1, channel groups=Cin), followed by a pointwise (1×1×1, Cin→Cout, stride=1, padding=0) convolution, both with batch normalization and ReLU. Residual connections are optional and used only when Cin=Cout and $s=1$ .
3D MobileNet-V2 block (v2_3D, inverted residual): Includes an expansion layer (1×1×1, Cin→Cin·t, stride=1, BN+ReLU6), a depth-wise 3×3×3 convolution (Cin·t channels, stride $s \in \{1,2\}$ , BN+ReLU6), then a projection (1×1×1, Cin·t→Cout, BN). Residual connections are used if Cin=Cout and $s=1$ . Expansion factors are typically $t=3$ in pre-hourglass blocks and $t=2$ inside each hourglass.
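
The stride and residual rules shared by both blocks can be illustrated with a small sketch. It assumes the standard floor formula for convolution output size; the helper names and the example volume shape are hypothetical and only serve the illustration.

```python
def conv3d_out_size(n, kernel=3, stride=1, padding=1):
    """Output length along one axis of a convolution (standard floor formula)."""
    return (n + 2 * padding - kernel) // stride + 1


def uses_residual(c_in, c_out, stride):
    """Both blocks add the skip connection only when the shape is preserved."""
    return c_in == c_out and stride == 1


volume = (48, 64, 128)  # hypothetical (D, H, W) of a feature volume
same = tuple(conv3d_out_size(n, stride=1) for n in volume)
halved = tuple(conv3d_out_size(n, stride=2) for n in volume)
print(same, halved)  # (48, 64, 128) (24, 32, 64)
print(uses_residual(32, 32, 1), uses_residual(32, 64, 1), uses_residual(32, 32, 2))
```

With a 3×3×3 kernel and padding 1, stride 1 keeps every axis unchanged and stride 2 halves it, which is why the skip connection is only well defined in the stride-1, equal-channel case.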

A comparative cost analysis (k=3, Cin=32, Cout=64, t=2) is provided:

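Operator	#Params	#MACs (for a D×H×W output)
Std. conv3d	$3 \times 3 \times 3 \times \text{Cin} \times \text{Cout}$	same $\times D \times H \times W$
v1_3D block	$3 \times 3 \times 3 \times \text{Cin} + \text{Cin} \times \text{Cout}$	same $\times D \times H \times W$
v2_3D block	$\text{Cin} \times t\,\text{Cin} + 3 \times 3 \times 3 \times t\,\text{Cin} + t\,\text{Cin} \times \text{Cout}$	–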

The complexity reduction for v2_3D is approximately 7× fewer MACs compared to standard conv3d (Shamsafar et al., 2021).
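
The ≈7× figure can be checked with a few lines of arithmetic. The sketch below counts convolution weights only, assuming biases and batch-norm parameters are ignored, and derives the per-block formulas from the block descriptions above; the function names are illustrative.

```python
def conv3d_params(k, c_in, c_out):
    """Weights of a standard k*k*k 3D convolution (bias and BN ignored)."""
    return k ** 3 * c_in * c_out


def v1_3d_params(k, c_in, c_out):
    """Depth-wise k*k*k conv (one filter per channel) + 1x1x1 pointwise conv."""
    return k ** 3 * c_in + c_in * c_out


def v2_3d_params(k, c_in, c_out, t):
    """1x1x1 expansion + depth-wise k*k*k conv on t*c_in channels + 1x1x1 projection."""
    hidden = t * c_in
    return c_in * hidden + k ** 3 * hidden + hidden * c_out


k, c_in, c_out, t = 3, 32, 64, 2
std = conv3d_params(k, c_in, c_out)
v1 = v1_3d_params(k, c_in, c_out)
v2 = v2_3d_params(k, c_in, c_out, t)
print(std, v1, v2)                                # 55296 2912 7872
print(round(std / v1, 1), round(std / v2, 1))     # 19.0 7.0
```

At stride 1 the MAC count is the weight count multiplied by the number of output voxels, so the same ratios apply to computation.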

2. Network Architecture and Layerwise Pipeline

The architecture employs a deep encoder-decoder with repeated hourglass modules, designed for end-to-end learning of disparity from rectified stereo pairs (left $I_L$, right $I_R$). For images of size $H \times W$ and maximum disparity $D_{\max}$:

Shared 2D feature extraction: ResNet-like stack using an initial conv2d (3×3, stride=2, output 32 channels), followed by one MobileNet-V2 (2D, t=3, stride=1) block, a MobileNet-V1 (2D, stride=2) block, and several 2D v1 blocks, yielding left/right features $F_L, F_R \in \mathbb{R}^{C \times H/4 \times W/4}$.
Group-wise correlation cost volume: Both feature maps are split into $G$ groups (8 channels per group, i.e. $C = 8G$). For each disparity $d$ and group $g$:

$\mathcal{C}(g, d, y, x) = \frac{G}{C} \left\langle F_L^{g}(y, x),\; F_R^{g}(y, x - d) \right\rangle$

yielding a cost volume $\mathcal{C} \in \mathbb{R}^{G \times D_{\max}/4 \times H/4 \times W/4}$.

Pre-hourglass: Two v2_3D blocks transform the cost volume channels to 32, giving a $32 \times D_{\max}/4 \times H/4 \times W/4$ volume.
Stacked hourglass modules (3× repeat): Each module applies a sequence of v2_3D (t=2) blocks in a multi-level encoder-decoder structure with three downsample/upsample stages and skip connections, refining cost estimates at each spatial and disparity scale. Auxiliary supervision can be applied to each hourglass output.
Disparity regression: After the hourglass stack, a final 3D conv (3×3×3, 32→1) produces per-voxel scores $S(d, y, x)$. The predicted disparity map $\hat{d}$ is computed using soft-argmin:

$\hat{d}(y, x) = \sum_{d} d \cdot \operatorname{softmax}_{d}\big(S(d, y, x)\big)$

The output is then upsampled by a factor of 4 to achieve full input resolution.
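
The disparity regression step can be sketched in NumPy. This is a minimal illustration that assumes the final layer outputs scores where a higher value means a more likely disparity; if the network produced costs instead, the input would be negated before the softmax. The function name and toy data are hypothetical.

```python
import numpy as np


def soft_argmin(scores):
    """Expected disparity under a softmax over the disparity axis.

    scores: array of shape (D, H, W); higher score = more likely disparity.
    Returns an (H, W) array of sub-pixel disparities.
    """
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(shifted)
    prob /= prob.sum(axis=0, keepdims=True)
    disparities = np.arange(scores.shape[0], dtype=scores.dtype).reshape(-1, 1, 1)
    return (prob * disparities).sum(axis=0)


scores = np.full((8, 1, 2), -20.0)
scores[3, 0, 0] = 20.0        # pixel 0: one sharp peak at d = 3
scores[[4, 5], 0, 1] = 20.0   # pixel 1: two equal peaks at d = 4 and d = 5
disp = soft_argmin(scores)
print(disp)
```

The second pixel lands halfway between its two peaks, which shows how the weighted sum produces sub-pixel estimates without a winner-take-all step.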

3. Cost Volume Construction and Representation

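A key aspect is the construction of a 4D group-wise correlation cost volume, which provides a dense encoding of matching likelihoods across disparity, spatial location, and group. Splitting the $C$ feature channels into $G$ groups balances fine detail preservation with channel efficiency: each group holds $C/G = 8$ channels, and the volume has shape $G \times D_{\max}/4 \times H/4 \times W/4$ for an $H \times W$ input with maximum disparity $D_{\max}$.

Group-wise correlation is specified as:

$\mathcal{C}(g, d, y, x) = \frac{G}{C} \left\langle F_L^{g}(y, x),\; F_R^{g}(y, x - d) \right\rangle$

where $F_L^{g}$ and $F_R^{g}$ are the $g$-th channel groups of the left and right feature maps and $\langle \cdot, \cdot \rangle$ is the inner product over the channels of a group. Group division promotes local correspondence sensitivity with fewer overall channels than full concatenation. Disparities are estimated at $1/4$ resolution and upsampled.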
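
A minimal NumPy sketch of the group-wise correlation follows. It assumes the inner product is averaged over the channels of each group and that positions with no valid match (columns $x < d$) are left at zero; the function name, the toy feature sizes, and the choice of two groups are illustrative only.

```python
import numpy as np


def groupwise_correlation_volume(feat_left, feat_right, max_disp, num_groups):
    """Build a (G, D, H, W) cost volume from (C, H, W) left/right features."""
    channels, height, width = feat_left.shape
    assert channels % num_groups == 0, "channels must divide evenly into groups"
    per_group = channels // num_groups
    left = feat_left.reshape(num_groups, per_group, height, width)
    right = feat_right.reshape(num_groups, per_group, height, width)
    volume = np.zeros((num_groups, max_disp, height, width), dtype=feat_left.dtype)
    for d in range(max_disp):
        # Left pixel x is compared with right pixel x - d; columns x < d stay zero.
        if d == 0:
            volume[:, d] = (left * right).mean(axis=1)
        else:
            volume[:, d, :, d:] = (left[..., d:] * right[..., :-d]).mean(axis=1)
    return volume


rng = np.random.default_rng(0)
right = rng.standard_normal((16, 4, 12))   # 16 channels -> 2 groups of 8
left = np.roll(right, shift=2, axis=2)     # left view = right view shifted by 2 px
vol = groupwise_correlation_volume(left, right, max_disp=4, num_groups=2)
print(vol.shape)  # (2, 4, 4, 12)
```

In the toy example the true disparity is 2, so the slice `vol[:, 2]` holds the mean squared feature value of the matching right-view pixel, while the other slices hold correlations between unrelated pixels.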

4. Complexity, Parameter Count, and Efficiency

3D-MobileStereoNet is engineered for low memory and computation overhead, appropriate for resource-constrained hardware:

Model	Params (M)	FLOPs (G MACs)	Reduction vs. GwcNet-g
GwcNet-g (1×HG)	6.43	246.3	–
3D-MobileStereoNet	1.77	153.1	×3.6 fewer params, ×1.6 fewer MACs
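
The reduction factors in the last column follow directly from the first two; a quick check, using only the numbers from the table:

```python
models = {
    "GwcNet-g (1xHG)": {"params_m": 6.43, "gmacs": 246.3},
    "3D-MobileStereoNet": {"params_m": 1.77, "gmacs": 153.1},
}
base, light = models["GwcNet-g (1xHG)"], models["3D-MobileStereoNet"]
param_ratio = base["params_m"] / light["params_m"]
mac_ratio = base["gmacs"] / light["gmacs"]
print(f"{param_ratio:.1f}x fewer parameters, {mac_ratio:.1f}x fewer MACs")
# 3.6x fewer parameters, 1.6x fewer MACs
```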

At the evaluated input size and maximum disparity, 3D-MobileStereoNet requires ≈8 MB of GPU memory. The computational formulas used include:

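conv3d: $k \times k \times k \times \text{Cin} \times \text{Cout} \times D \times H \times W$ MACs,
v2_3D: $(\text{Cin} \times t\,\text{Cin} + k \times k \times k \times t\,\text{Cin} + t\,\text{Cin} \times \text{Cout}) \times D \times H \times W$ MACs, for a $D \times H \times W$ output volume at stride 1.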

5. Training Protocol and Empirical Performance

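Training employs the smooth-L1 loss on predicted vs. ground-truth disparity:

$L(\hat{d}, d^{*}) = \frac{1}{N} \sum_{i=1}^{N} \operatorname{smooth}_{L_1}\big(\hat{d}_i - d^{*}_i\big), \quad \operatorname{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$

with the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$). On SceneFlow: 20 epochs, with the learning rate decayed at epochs 10, 12, 14, and 16, batch=4. KITTI’15 fine-tuning: 400 epochs, learning rate drops at epoch 200, batch=4.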
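
The loss can be sketched in plain Python. The version below uses the standard smooth-L1 definition with a threshold of 1 and averages over valid pixels; the `valid` mask is an assumption added for datasets with sparse ground truth, and the function names are illustrative.

```python
def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def disparity_loss(pred, target, valid=None):
    """Mean smooth-L1 error over valid pixels (flat sequences of floats)."""
    if valid is None:
        valid = [True] * len(target)  # assumption: all pixels carry ground truth
    terms = [smooth_l1(p - t) for p, t, v in zip(pred, target, valid) if v]
    return sum(terms) / len(terms)


loss = disparity_loss([10.5, 20.0, 33.0], [10.0, 20.0, 30.0])
print(loss)  # 0.875
```

The three errors (0.5, 0, and 3 px) contribute 0.125, 0, and 2.5, so small errors are penalized quadratically while large ones grow only linearly.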

Evaluation protocols:

SceneFlow (“Final pass”): 35,454 training / 4,370 testing samples at $960 \times 540$, metrics: EPE, px-3, D1.
KITTI 2015: 200 training / 200 testing samples at approximately $1240 \times 376$; 160/40 split for validation.
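
The metrics named above can be sketched in NumPy. The sketch assumes EPE is the mean absolute disparity error, px-3 is the percentage of pixels with error above 3 px, and D1 follows the KITTI 2015 convention of counting a pixel as an outlier when its error exceeds both 3 px and 5% of the ground-truth disparity. The function names and toy values are hypothetical.

```python
import numpy as np


def epe(pred, gt, valid):
    """End-point error: mean absolute disparity error over valid pixels."""
    return np.abs(pred - gt)[valid].mean()


def px3(pred, gt, valid):
    """Percentage of valid pixels with error above 3 px."""
    return (np.abs(pred - gt)[valid] > 3.0).mean() * 100


def d1(pred, gt, valid):
    """Percentage of valid pixels with error > 3 px and > 5% of ground truth."""
    err = np.abs(pred - gt)[valid]
    outlier = (err > 3.0) & (err > 0.05 * gt[valid])
    return outlier.mean() * 100


gt = np.array([10.0, 50.0, 100.0, 200.0])
pred = np.array([10.5, 54.0, 104.0, 200.0])
valid = gt > 0  # assumption: non-positive ground truth marks missing labels
print(epe(pred, gt, valid), px3(pred, gt, valid), d1(pred, gt, valid))
# 2.125 50.0 25.0
```

The third pixel has a 4 px error on a 100 px disparity, so it counts toward px-3 but not toward D1, which is why the two percentages differ.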

Quantitative results:

SceneFlow EPE (px): PSMNet (0.88/256 G MACs/5.2M params), GwcNet-g (0.79/246 G/6.4M), 3D-MobileStereoNet (0.80/153 G/1.77M).
KITTI’15 D1(all) (%): PSMNet (2.32), GwcNet-g (2.11), 3D-MobileStereoNet (2.10)—the best among models ≤2M parameters.

6. Core Innovations and Methodological Contributions

Notable technical contributions include:

Introduction of 3D depth-wise separable convolutions, adapted from MobileNet, reducing 3D layer cost by ~7×.
Use of MobileNet-V2 inverted residual blocks with expansion factors $t=3$ before and $t=2$ inside the hourglass improves expressive power per parameter.
Group-wise correlation cost volume with 8 channels per group preserves fine detail in a 4D structure with a fraction of the channels of concatenation-based methods.
Multi-stage hourglass pipeline (3 repeats) enables progressive disparity refinement with auxiliary supervision, benefitting training convergence and accuracy.
Soft-argmin on the cost volume delivers end-to-end sub-pixel disparity estimation without the need for post-processing or explicit winner-take-all selection (Shamsafar et al., 2021).

These design elements jointly enable 3D-MobileStereoNet to reach accuracy competitive with the state of the art within a highly efficient parameter and compute budget, facilitating deployment on moderate GPUs and edge devices.

References

Shamsafar et al. (2021). MobileStereoNet: Towards Lightweight Deep Networks for Stereo Matching.
