3D-MobileStereoNet: Lightweight Stereo Matching
- The paper introduces lightweight stereo matching by replacing expensive 3D convolutions with depth-wise separable and inverted residual 3D blocks, significantly reducing parameters and computations.
- It employs a group-wise correlation cost volume and a multi-stage hourglass encoder-decoder to progressively refine disparity estimates with sub-pixel accuracy.
- Empirical evaluations on SceneFlow and KITTI demonstrate competitive accuracy while using far fewer parameters and MACs than previous state-of-the-art models.
3D-MobileStereoNet is a lightweight deep stereo matching network designed for efficiency on resource-limited devices without a significant sacrifice in disparity estimation accuracy. Building on MobileNet-V1 and MobileNet-V2, the architecture replaces costly 3D convolutions with depth-wise separable and inverted residual 3D blocks, which substantially reduces parameter count and computational cost while operating on a group-wise correlation cost volume. The model consists of a shared feature encoder, a 4D group-wise cost volume, and a 3D encoder-decoder with stacked hourglass modules, culminating in sub-pixel disparity estimation via soft-argmin regression (Shamsafar et al., 2021).
1. 3D Depth-wise Separable MobileNet Blocks
3D-MobileStereoNet extends standard MobileNet blocks from 2D to 3D, transforming spatial kernel operations from $k \times k$ to $k \times k \times k$. Both MobileNet-V1 (depth-wise separable convolution) and MobileNet-V2 (inverted residual with expansion) are adapted as follows:
- 3D MobileNet-V1 block (v1_3D): Consists of a depth-wise convolution (3×3×3 kernel, stride $s$, padding=1, channel groups=Cin), followed by a pointwise (1×1×1, Cin→Cout, stride=1, padding=0) convolution, both with batch normalization and ReLU. Residual connections are optional and used only when Cin=Cout and $s = 1$.
- 3D MobileNet-V2 block (v2_3D, inverted residual): Includes an expansion layer (1×1×1, Cin→Cin·t, stride=1, BN+ReLU6), a depth-wise 3×3×3 convolution (Cin·t channels, stride $s$, BN+ReLU6), then a projection (1×1×1, Cin·t→Cout, BN). Residual connections are used if Cin=Cout and $s = 1$. The expansion factor $t$ is chosen separately for the pre-hourglass blocks and for the blocks inside each hourglass; the hourglass blocks use $t = 2$.
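A comparative cost analysis (k=3, Cin=32, Cout=64, t=2), counting convolution weights only (no biases or batch-norm parameters) at stride 1:
| Operator | #Params | #MACs (per output voxel) |
|---|---|---|
| Std. conv3d | $k^3 C_{in} C_{out} = 55{,}296$ | same |
| v1_3D block | $k^3 C_{in} + C_{in} C_{out} = 2{,}912$ | same |
| v2_3D block | $t C_{in}^2 + k^3 t C_{in} + t C_{in} C_{out} = 7{,}872$ | same |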
In this configuration the v2_3D block requires approximately 7× fewer MACs than a standard conv3d (Shamsafar et al., 2021).
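For the configuration above (k=3, Cin=32, Cout=64, t=2), the roughly 7× figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes stride 1 and counts convolution weights only (no biases, no batch-norm parameters), so each weight corresponds to one MAC per output voxel.

```python
def conv3d_cost(c_in, c_out, k=3):
    """Weights (= MACs per output voxel) of a dense k*k*k 3D convolution."""
    return k**3 * c_in * c_out


def v1_3d_cost(c_in, c_out, k=3):
    """Depth-wise k*k*k convolution followed by a 1x1x1 point-wise convolution."""
    depthwise = k**3 * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


def v2_3d_cost(c_in, c_out, t, k=3):
    """Inverted residual: 1x1x1 expansion, depth-wise k*k*k, 1x1x1 projection."""
    hidden = t * c_in
    expand = c_in * hidden
    depthwise = k**3 * hidden
    project = hidden * c_out
    return expand + depthwise + project


std = conv3d_cost(32, 64)
v1 = v1_3d_cost(32, 64)
v2 = v2_3d_cost(32, 64, t=2)
print(std, v1, v2)            # 55296 2912 7872
print(round(std / v2, 2))     # 7.02
```

Under these assumptions the expansion and projection layers dominate the v2_3D cost, which is why the inverted residual block is more expensive than the plain depth-wise separable block while still being far cheaper than a dense 3D convolution.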
2. Network Architecture and Layerwise Pipeline
The architecture employs a deep encoder-decoder with repeated hourglass modules, designed for end-to-end learning of disparity from rectified stereo pairs (left $I_L$, right $I_R$). For images of size $H \times W$:
- Shared 2D feature extraction: ResNet-like stack using an initial conv2d (3×3, stride=2, output 32 channels), followed by one MobileNet-V2 (2D, t=3, stride=1) block, a MobileNet-V1 (2D, stride=2) block, and several 2D v1 blocks, yielding left/right feature maps $f_l$, $f_r$ at 1/4 of the input resolution.
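- Group-wise correlation cost volume: Both feature maps are split along the channel axis into $N_g$ groups (8 channels per group). For each disparity $d$ and group $g$:
$$C_{gwc}(d, x, y, g) = \frac{1}{N_c / N_g} \left\langle f_l^g(x, y),\, f_r^g(x - d, y) \right\rangle$$
yielding a cost volume of shape $N_g \times D_{\max}/4 \times H/4 \times W/4$, where $N_c$ is the number of feature channels, $f^g$ denotes the $g$-th channel group, and $D_{\max}$ is the maximum disparity.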
- Pre-hourglass: Two v2_3D blocks transform the cost volume channels to 32, giving a 32-channel volume at 1/4 of the disparity and spatial resolution.
- Stacked hourglass modules (3× repeat): Each module applies a sequence of v2_3D (t=2) blocks in a multi-level encoder-decoder structure with three downsample/upsample stages and skip connections, refining cost estimates at each spatial and disparity scale. Auxiliary supervision can be applied to each hourglass output.
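- Disparity regression: After the hourglass stack, a final 3D conv (3×3×3, 32→1) produces per-voxel scores $S(d, x, y)$. The predicted disparity map $\hat{d}(x, y)$ is computed using soft-argmin:
$$\hat{d}(x, y) = \sum_{d=0}^{D_{\max}-1} d \cdot \frac{\exp\left(S(d, x, y)\right)}{\sum_{d'=0}^{D_{\max}-1} \exp\left(S(d', x, y)\right)}$$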
The output is then upsampled by a factor of 4 to achieve full input resolution.
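The soft-argmin regression step can be illustrated for a single pixel. This is a minimal sketch, not the authors' implementation: the function name is hypothetical, and it assumes that a higher score means a better match, so the softmax is applied to the scores directly (if the network output were treated as a matching cost, the values would be negated first).

```python
import math


def soft_argmin(scores):
    """Expected disparity under a softmax over per-disparity scores.

    scores[d] is the score of disparity candidate d at one pixel.
    """
    peak = max(scores)                            # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(d * w / total for d, w in enumerate(weights))


# Two equally strong neighbouring candidates give a sub-pixel estimate between them.
estimate = soft_argmin([-10.0, -10.0, 5.0, 5.0, -10.0])
print(round(estimate, 3))   # 2.5
```

Because the result is a probability-weighted average of the candidate disparities, it is differentiable and can land between integer disparity levels, which is what gives the regression its sub-pixel resolution.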
3. Cost Volume Construction and Representation
A key aspect is the construction of a 4D group-wise correlation cost volume, which provides a dense encoding of matching likelihoods across disparity, spatial location, and group. Splitting the feature channels into groups of 8 balances fine detail preservation with channel efficiency: the volume carries one channel per group at every disparity level and spatial location, rather than the full concatenated feature dimension.
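Group-wise correlation is specified as:
$$C_{gwc}(d, x, y, g) = \frac{1}{N_c / N_g} \left\langle f_l^g(x, y),\, f_r^g(x - d, y) \right\rangle$$
where $f_l^g$ and $f_r^g$ are the $g$-th channel groups of the left and right features, $N_c$ is the number of feature channels, $N_g$ is the number of groups, and $\langle \cdot, \cdot \rangle$ is the inner product. Group division promotes local correspondence sensitivity with fewer overall channels than full concatenation. Disparities are estimated at 1/4 resolution and upsampled.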
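A group-wise correlation volume of this kind can be sketched in NumPy as follows. The function name, the `(C, H, W)` array layout, and the zero-filling of columns that have no valid match are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def groupwise_correlation_volume(feat_l, feat_r, max_disp, num_groups):
    """Build a (G, D, H, W) volume from (C, H, W) left/right feature maps."""
    c, h, w = feat_l.shape
    assert feat_r.shape == feat_l.shape
    assert c % num_groups == 0 and 0 < max_disp <= w
    ch = c // num_groups                          # channels per group
    fl = feat_l.reshape(num_groups, ch, h, w)
    fr = feat_r.reshape(num_groups, ch, h, w)
    volume = np.zeros((num_groups, max_disp, h, w), dtype=feat_l.dtype)
    for d in range(max_disp):
        # Left pixel x is compared with right pixel x - d; columns x < d stay zero.
        volume[:, d, :, d:] = (fl[..., d:] * fr[..., : w - d]).mean(axis=1)
    return volume


rng = np.random.default_rng(0)
feat_l = rng.standard_normal((32, 5, 16))
feat_r = np.zeros_like(feat_l)
feat_r[..., :-2] = feat_l[..., 2:]                # right view: left content shifted by 2 px
volume = groupwise_correlation_volume(feat_l, feat_r, max_disp=6, num_groups=4)
print(volume.shape)                               # (4, 6, 5, 16)
```

With 32 channels and 4 groups, each group averages the products of 8 channels, matching the 8-channels-per-group setting described above. In the toy input, the slice at disparity 2 reduces to the mean squared left feature, which is where a correct match would produce its strongest response.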
4. Complexity, Parameter Count, and Efficiency
3D-MobileStereoNet is engineered for low memory and computation overhead, appropriate for resource-constrained hardware:
| Model | Params (M) | MACs (G) | Reduction vs. GwcNet-g |
|---|---|---|---|
| GwcNet-g (1×HG) | 6.43 | 246.3 | – |
| 3D-MobileStereoNet | 1.77 | 153.1 | 3.6× fewer params, 1.6× fewer MACs |
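For the input size and maximum disparity used in this comparison, 3D-MobileStereoNet requires ≈8 MB on the GPU. The computational formulas used, for a stride-1 layer with an output volume of $D' \times H' \times W'$ voxels, include:
- conv3d: $k^3 C_{in} C_{out} \cdot D' H' W'$ MACs,
- v2_3D: $\left(t C_{in}^2 + k^3 t C_{in} + t C_{in} C_{out}\right) \cdot D' H' W'$ MACs.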
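The reduction factors in the comparison table follow directly from the reported parameter and MAC counts. A short check, using only the figures quoted above (the dictionary keys are hypothetical names):

```python
gwcnet_g = {"params_m": 6.43, "gmacs": 246.3}
mobile_3d = {"params_m": 1.77, "gmacs": 153.1}

param_ratio = gwcnet_g["params_m"] / mobile_3d["params_m"]
mac_ratio = gwcnet_g["gmacs"] / mobile_3d["gmacs"]

print(round(param_ratio, 1))   # 3.6
print(round(mac_ratio, 1))     # 1.6
```

The parameter saving is larger than the MAC saving, which is consistent with only part of the pipeline being replaced by the cheaper 3D blocks.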
5. Training Protocol and Empirical Performance
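Training employs the smooth-L1 loss on predicted vs. ground-truth disparity:
$$L(d, \hat{d}) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth}_{L_1}\left(d_i - \hat{d}_i\right), \qquad \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$
with the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$). On SceneFlow: 20 epochs, with the learning rate decayed at epochs 10, 12, 14, and 16, batch size 4. KITTI'15 fine-tuning: 400 epochs, with a learning rate drop at epoch 200, batch size 4.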
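The smooth-L1 loss can be written out in plain Python for a handful of pixels. This is a sketch of the standard definition with a transition point at 1; the function names are hypothetical, and any masking of invalid pixels or weighting of auxiliary hourglass outputs is left out because the text does not specify it.

```python
def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def disparity_loss(pred, target):
    """Mean smooth-L1 error over paired predicted and ground-truth disparities."""
    assert len(pred) == len(target) and len(pred) > 0
    return sum(smooth_l1(p - t) for p, t in zip(pred, target)) / len(pred)


loss = disparity_loss([10.5, 20.0], [10.0, 23.0])
print(loss)   # 1.3125
```

The quadratic region keeps gradients small for nearly correct pixels, while the linear region limits the influence of large outliers such as occluded or mismatched regions.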
Evaluation protocols:
- SceneFlow (“Final pass”): 35,454 training / 4,370 testing samples at 960×540; metrics: EPE, px-3, D1.
- KITTI 2015: 200 training / 200 testing samples at roughly 1242×375; 160/40 split for validation.
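The EPE and D1 metrics named above can be computed as in the sketch below. The text does not define them, so this assumes the usual conventions: EPE is the mean absolute disparity error, and D1 follows the KITTI 2015 outlier rule (error above 3 px and above 5% of the ground-truth disparity). The function names are hypothetical.

```python
def end_point_error(pred, gt):
    """Mean absolute disparity error in pixels."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def d1_outlier_rate(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of pixels whose error exceeds both the absolute and relative threshold."""
    bad = sum(
        1
        for p, g in zip(pred, gt)
        if abs(p - g) > abs_thresh and abs(p - g) > rel_thresh * abs(g)
    )
    return 100.0 * bad / len(gt)


pred = [10.0, 20.0, 100.0, 50.0]
gt = [10.5, 24.0, 103.5, 50.0]
print(end_point_error(pred, gt))   # 2.0
print(d1_outlier_rate(pred, gt))   # 25.0
```

In the example, the third pixel is off by 3.5 px but is not counted as an outlier, because 3.5 px is less than 5% of its ground-truth disparity of 103.5 px.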
Quantitative results:
- SceneFlow EPE (px): PSMNet (0.88/256 G MACs/5.2M params), GwcNet-g (0.79/246 G/6.4M), 3D-MobileStereoNet (0.80/153 G/1.77M).
- KITTI’15 D1(all) (%): PSMNet (2.32), GwcNet-g (2.11), 3D-MobileStereoNet (2.10), the best among models with ≤2M parameters.
6. Core Innovations and Methodological Contributions
Notable technical contributions include:
- Introduction of 3D depth-wise separable convolutions, adapted from MobileNet, which reduce the per-layer MAC count by ~7× relative to a standard conv3d in the v2_3D configuration.
- Use of MobileNet-V2 inverted residual blocks with expansion factor $t = 2$ inside the hourglass, which improves expressive power per parameter.
- Group-wise correlation cost volume with 8 channels per group, which preserves fine detail in a 4D structure with a fraction of the channels of concatenation-based methods.
- Multi-stage hourglass pipeline (3 repeats), which enables progressive disparity refinement with auxiliary supervision, benefiting training convergence and accuracy.
- Soft-argmin on the cost volume delivers end-to-end sub-pixel disparity estimation without the need for post-processing or explicit winner-take-all selection (Shamsafar et al., 2021).
These design elements jointly enable 3D-MobileStereoNet to match the accuracy of GwcNet-g on SceneFlow and KITTI 2015 within a much smaller parameter and compute budget, facilitating deployment on moderate GPUs and edge devices.