
3D-MobileStereoNet: Lightweight Stereo Matching

Updated 1 April 2026
  • The paper introduces lightweight stereo matching by replacing expensive 3D convolutions with depth-wise separable and inverted residual 3D blocks, significantly reducing parameters and computations.
  • It employs a group-wise correlation cost volume and a multi-stage hourglass encoder-decoder to progressively refine disparity estimates with sub-pixel accuracy.
  • Empirical evaluations on SceneFlow and KITTI demonstrate competitive accuracy while using far fewer parameters and MACs than previous state-of-the-art models.

3D-MobileStereoNet is a lightweight deep stereo matching network designed for high efficiency on resource-limited devices without a significant sacrifice in disparity estimation accuracy. Building upon MobileNet-V1 and MobileNet-V2 concepts, the architecture replaces costly 3D convolutions with depth-wise separable and inverted residual 3D blocks, giving a several-fold reduction in parameter count and computational complexity (about 3.6× fewer parameters and 1.6× fewer MACs than GwcNet-g) while operating on a group-wise correlation cost volume. The model is structured as a feature-sharing encoder, a 4D group-wise cost volume, and a 3D encoder-decoder pipeline with stacked hourglass modules, culminating in sub-pixel disparity estimation via soft-argmin regression (Shamsafar et al., 2021).

1. 3D Depth-wise Separable MobileNet Blocks

3D-MobileStereoNet extends standard MobileNet blocks from 2D to 3D, transforming spatial kernel operations from $k \times k$ to $k \times k \times k$. Both MobileNet-V1 (depth-wise separable convolution) and MobileNet-V2 (inverted residual with expansion) are adapted as follows:

  • 3D MobileNet-V1 block (v1_3D): Consists of a depth-wise convolution (3×3×3 kernel, stride $s \in \{1,2\}$, padding=1, channel groups=Cin), followed by a pointwise (1×1×1, Cin→Cout, stride=1, padding=0) convolution, both with batch normalization and ReLU. Residual connections are optional and used only when Cin=Cout and $s=1$.
  • 3D MobileNet-V2 block (v2_3D, inverted residual): Includes an expansion layer (1×1×1, Cin→Cin·t, stride=1, BN+ReLU6), a depth-wise 3×3×3 convolution (Cin·t channels, stride $s \in \{1,2\}$, BN+ReLU6), then a projection (1×1×1, Cin·t→Cout, BN). Residual connections are used if Cin=Cout and $s=1$. Expansion factors are typically $t=3$ in pre-hourglass blocks and $t=2$ inside each hourglass.

A comparative cost analysis (k=3, Cin=32, Cout=64, t=2) is provided:

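| Operator | #Params | #MACs (stride 1, $D \times H \times W$ output volume) |
|---|---|---|
| Std. conv3d | $3 \times 3 \times 3 \times \text{Cin} \times \text{Cout}$ | same $\times D \times H \times W$ |
| v1_3D block | $3 \times 3 \times 3 \times \text{Cin} + \text{Cin} \times \text{Cout}$ | same $\times D \times H \times W$ |
| v2_3D block | $\text{Cin} \times t\,\text{Cin} + 3 \times 3 \times 3 \times t\,\text{Cin} + t\,\text{Cin} \times \text{Cout}$ | same $\times D \times H \times W$ |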

For this configuration, the v2_3D block requires approximately 7× fewer MACs than a standard conv3d (Shamsafar et al., 2021).
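The roughly 7× figure can be checked from the block definitions alone. The sketch below counts convolution weights only (bias terms and batch-norm parameters are ignored, which is an assumption), and the function names are illustrative rather than taken from the paper's code.

```python
def conv3d_params(c_in, c_out, k=3):
    """Weights of a dense k x k x k 3D convolution (bias and BN ignored)."""
    return k ** 3 * c_in * c_out


def v1_3d_params(c_in, c_out, k=3):
    """Depth-wise k x k x k conv (one filter per channel) + 1x1x1 point-wise conv."""
    return k ** 3 * c_in + c_in * c_out


def v2_3d_params(c_in, c_out, t, k=3):
    """1x1x1 expansion -> depth-wise k x k x k conv -> 1x1x1 projection."""
    hidden = t * c_in
    return c_in * hidden + k ** 3 * hidden + hidden * c_out


def macs(params, d, h, w):
    """At stride 1, each weight is applied once per output voxel."""
    return params * d * h * w


# Configuration from the comparison: k=3, Cin=32, Cout=64, t=2.
std = conv3d_params(32, 64)
v1 = v1_3d_params(32, 64)
v2 = v2_3d_params(32, 64, t=2)
print(std, v1, v2, round(std / v2, 2))  # 55296 2912 7872 7.02
```

Because the MAC count at stride 1 is the weight count times the output volume, the same ratio holds for MACs regardless of $D$, $H$, and $W$.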

2. Network Architecture and Layerwise Pipeline

The architecture employs a deep encoder-decoder with repeated hourglass modules, designed for end-to-end learning of disparity from rectified stereo pairs (left $I_L$, right $I_R$). For images of size $H \times W$:

  1. Shared 2D feature extraction: ResNet-like stack using an initial conv2d (3×3, stride=2, output 32 channels), followed by one MobileNet-V2 (2D, t=3, stride=1) block, a MobileNet-V1 (2D, stride=2) block, and several 2D v1 blocks, yielding left/right features $F_L$, $F_R$ at $H/4 \times W/4$ resolution.
  2. Group-wise correlation cost volume: Both feature maps are split into $G$ groups (8 channels per group). For each disparity $d$ and group $g$:

$$V(g, d, y, x) = \frac{1}{8} \sum_{c=1}^{8} F_L^{g,c}(y, x)\, F_R^{g,c}(y, x - d)$$

where $F^{g,c}$ denotes channel $c$ of group $g$, yielding a cost volume $V$ of shape $G \times D_{\max}/4 \times H/4 \times W/4$, with $D_{\max}$ the maximum disparity.

  3. Pre-hourglass: Two v2_3D blocks transform the cost volume channels to 32, keeping the disparity and spatial dimensions of the cost volume unchanged.
  4. Stacked hourglass modules (3× repeat): Each module applies a sequence of v2_3D (t=2) blocks in a multi-level encoder-decoder structure with three downsample/upsample stages and skip connections, refining cost estimates at each spatial and disparity scale. Auxiliary supervision can be applied to each hourglass output.
  5. Disparity regression: After the hourglass stack, a final 3D conv (3×3×3, 32→1) produces per-voxel scores $S(d, y, x)$. The predicted disparity map $\hat{d}(y, x)$ is computed using soft-argmin:

$$\hat{d}(y, x) = \sum_{d} d \cdot \frac{\exp\big(S(d, y, x)\big)}{\sum_{d'} \exp\big(S(d', y, x)\big)}$$

The output is then upsampled by a factor of 4 to achieve full input resolution.
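To make the regression step concrete, here is a minimal single-pixel sketch of soft-argmin in plain Python. It assumes that a higher score means a more likely match (if the network outputs costs instead, the scores would be negated first); the function name is illustrative.

```python
import math


def soft_argmin(scores):
    """Expected disparity under a softmax over per-disparity scores.

    scores[d] is the score for candidate disparity d at one pixel.
    Assumption: higher score = more likely match.
    """
    m = max(scores)                               # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(d * w / total for d, w in enumerate(weights))


# Two equally likely neighbouring disparities give a sub-pixel estimate
# halfway between them, which a hard argmax could not express.
estimate = soft_argmin([-50.0, 0.0, 0.0, -50.0])
print(round(estimate, 6))  # 1.5
```

Since the result is a weighted average rather than an index, the operation is differentiable and can be trained end to end.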

3. Cost Volume Construction and Representation

A key aspect is the construction of a 4D group-wise correlation cost volume, which provides a dense encoding of matching likelihoods across disparity, spatial location, and group. Splitting the features into $G$ groups of 8 channels each balances fine detail preservation with channel efficiency: the cost volume carries only $G$ channels per disparity level.

Group-wise correlation is specified as:

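$$V(g, d, y, x) = \frac{1}{8} \sum_{c=1}^{8} F_L^{g,c}(y, x)\, F_R^{g,c}(y, x - d)$$

where $F_L^{g,c}$ and $F_R^{g,c}$ denote channel $c$ of group $g$ in the left and right feature maps and $d$ is the candidate disparity. Group division promotes local correspondence sensitivity with fewer overall channels than full concatenation. Disparities are estimated at $1/4$ resolution and upsampled by a factor of 4.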
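The group-wise correlation can be sketched with explicit loops. This is a plain-Python illustration written for clarity rather than speed; the nested-list layout `[C][H][W]`, the function name, and the zero fill for pixels with no right-image counterpart are assumptions, not details taken from the paper.

```python
def groupwise_correlation(feat_left, feat_right, num_groups, max_disp):
    """Build cost[g][d][y][x] from feature maps given as nested lists [C][H][W].

    Pixels with x < d have no counterpart in the right image and stay at 0.
    """
    channels = len(feat_left)
    height, width = len(feat_left[0]), len(feat_left[0][0])
    assert channels % num_groups == 0, "channels must divide evenly into groups"
    per_group = channels // num_groups

    cost = [[[[0.0] * width for _ in range(height)]
             for _ in range(max_disp)]
            for _ in range(num_groups)]
    for g in range(num_groups):
        group = range(g * per_group, (g + 1) * per_group)
        for d in range(max_disp):
            for y in range(height):
                for x in range(d, width):
                    dot = sum(feat_left[c][y][x] * feat_right[c][y][x - d]
                              for c in group)
                    cost[g][d][y][x] = dot / per_group  # mean over the group
    return cost


# Toy input: 4 channels, 1 row, 4 columns. Each left pixel x is one-hot in
# channel x, and the right view is the left view shifted by one pixel, so
# the true disparity is 1.
left = [[[1.0 if c == x else 0.0 for x in range(4)]] for c in range(4)]
right = [[[1.0 if c == x + 1 else 0.0 for x in range(4)]] for c in range(4)]
volume = groupwise_correlation(left, right, num_groups=2, max_disp=2)
print(volume[0][1][0][1], volume[0][0][0][1])  # 0.5 0.0
```

In the toy case the response at pixel `x=1` appears at disparity 1 in the group that holds the active channel, and is zero at disparity 0.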

4. Complexity, Parameter Count, and Efficiency

3D-MobileStereoNet is engineered for low memory and computation overhead, appropriate for resource-constrained hardware:

| Model | Params (M) | MACs (G) | Reduction vs. GwcNet-g |
|---|---|---|---|
| GwcNet-g (1×HG) | 6.43 | 246.3 | - |
| 3D-MobileStereoNet | 1.77 | 153.1 | 3.6× fewer params, 1.6× fewer MACs |

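At the reported input size and disparity range, 3D-MobileStereoNet requires ≈8 MB of GPU memory. For a stride-1 layer with a $D \times H \times W$ output volume, the computational formulas used include:

  • conv3d: $k^3 \cdot \text{Cin} \cdot \text{Cout} \cdot D H W$ MACs,
  • v2_3D: $(t\,\text{Cin}^2 + k^3\, t\,\text{Cin} + t\,\text{Cin} \cdot \text{Cout}) \cdot D H W$ MACs.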

5. Training Protocol and Empirical Performance

Training employs the smooth-L1 loss on predicted vs. ground-truth disparity:

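$$L(d, \hat{d}) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth}_{L_1}(d_i - \hat{d}_i), \qquad \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$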

with the Adam optimizer. On SceneFlow, training runs for 20 epochs with batch size 4, and the learning rate is decayed at epochs 10, 12, 14, and 16. KITTI’15 fine-tuning runs for 400 epochs with batch size 4, and the learning rate drops at epoch 200.
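As an illustration of the loss, the sketch below implements the standard smooth-L1 definition with a threshold of 1 over flat lists of disparities. The `valid` mask for pixels without ground truth is an assumption added for sparse datasets such as KITTI, and the function names are hypothetical.

```python
def smooth_l1(x):
    """Quadratic for |x| < 1, linear otherwise (threshold fixed at 1)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def disparity_loss(pred, target, valid=None):
    """Mean smooth-L1 over pixels; `valid` masks pixels without ground truth."""
    if valid is None:
        valid = [True] * len(pred)
    terms = [smooth_l1(p - t) for p, t, v in zip(pred, target, valid) if v]
    return sum(terms) / len(terms)


# Errors of 0, 0.5 and 3 px contribute 0, 0.125 and 2.5 respectively.
loss = disparity_loss([10.0, 20.5, 33.0], [10.0, 20.0, 30.0])
print(loss)  # 0.875
```

The linear branch keeps large disparity errors from dominating the gradient, while the quadratic branch gives smooth behaviour near zero.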

Evaluation protocols:

  • SceneFlow (“Final pass”): 35,454 training / 4,370 testing samples; metrics: EPE, px-3, D1.
  • KITTI 2015: 200 training / 200 testing samples; 160/40 split of the training set for validation.
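The metrics named above can be sketched as follows. EPE is the mean absolute disparity error; D1 follows the usual KITTI 2015 outlier rule (error above 3 px and above 5% of the true disparity); px-3 is taken here to mean the percentage of pixels with error above 3 px, which is an assumption about the page's shorthand.

```python
def epe(pred, gt):
    """End-point error: mean absolute disparity error in pixels."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def px3(pred, gt):
    """Percentage of pixels whose error exceeds 3 px (assumed meaning of px-3)."""
    bad = sum(abs(p - g) > 3.0 for p, g in zip(pred, gt))
    return 100.0 * bad / len(gt)


def d1(pred, gt):
    """Percentage of outliers: error > 3 px and > 5% of the true disparity."""
    bad = sum(abs(p - g) > 3.0 and abs(p - g) > 0.05 * g
              for p, g in zip(pred, gt))
    return 100.0 * bad / len(gt)


pred = [10.0, 24.0, 104.0, 50.5]
gt = [10.0, 20.0, 100.0, 50.0]
print(epe(pred, gt), px3(pred, gt), d1(pred, gt))  # 2.125 50.0 25.0
```

The third pixel shows the difference between the two outlier metrics: a 4 px error counts under px-3 but not under D1, because it is below 5% of the true disparity of 100.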

Quantitative results:

  • SceneFlow EPE (px): PSMNet (0.88/256 G MACs/5.2M params), GwcNet-g (0.79/246 G/6.4M), 3D-MobileStereoNet (0.80/153 G/1.77M).
  • KITTI’15 D1(all) (%): PSMNet (2.32), GwcNet-g (2.11), 3D-MobileStereoNet (2.10), the best among models with ≤2M parameters.

6. Core Innovations and Methodological Contributions

Notable technical contributions include:

  • Introduction of 3D depth-wise separable convolutions, adapted from MobileNet, reducing 3D layer cost by ~7×.
  • Use of MobileNet-V2 inverted residual blocks with expansion factor $t=2$ inside the hourglass improves expressive power per parameter.
  • Group-wise correlation cost volume (8 channels per group) preserves fine detail in a 4D structure with a fraction of the channels of concatenation-based methods.
  • Multi-stage hourglass pipeline (3 repeats) enables progressive disparity refinement with auxiliary supervision, benefitting training convergence and accuracy.
  • Soft-argmin on the cost volume delivers end-to-end sub-pixel disparity estimation without the need for post-processing or explicit winner-take-all selection (Shamsafar et al., 2021).

These design elements jointly enable 3D-MobileStereoNet to reach accuracy competitive with the state of the art within a small parameter and compute budget, facilitating deployment on moderate GPUs and edge devices.
