Inverse Depth Alignment Module
- The Inverse Depth Alignment Module is a two-stage process that fuses monocular inverse-depth estimates with sparse metric anchors from visual-inertial odometry (VIO).
- It employs a closed-form, least-squares global affine alignment followed by a neural network-based per-pixel scale refinement.
- The module significantly improves depth accuracy and cross-dataset generalization, achieving substantially lower iRMSE even with few anchor points.
The Inverse Depth Alignment Module is a two-stage process for producing dense, metrically accurate depth estimates by combining monocular inverse-depth inference with sparse metric anchors from visual-inertial odometry (VIO). This module first applies a global affine alignment in the inverse-depth domain based on a least-squares solution, followed by a dense per-pixel scale refinement using a lightweight encoder-decoder network. Initially introduced in the context of monocular visual-inertial depth estimation, this architecture enables accurate dense depth mapping even with limited sparse metric supervision, offering significant improvements over previous depth completion and fusion methods (Wofk et al., 2023).
1. Mathematical Formulation and Workflow
The core pipeline consists of two successive alignment stages:
1.1. Global Scale-and-Shift Alignment
Let $d(p)$ denote the affine-invariant, unitless inverse depth predicted at pixel $p$ by a pretrained monocular model, and let $z_i$ denote the matching sparse metric inverse-depth values provided by VIO at anchor pixels $p_i$, $i = 1, \dots, N$.
The global alignment stage solves

$$(s^{*}, t^{*}) = \arg\min_{s,\,t} \sum_{i=1}^{N} \big( s\, d(p_i) + t - z_i \big)^{2},$$

with closed-form solution

$$s^{*} = \frac{\sum_{i} (d_i - \bar{d})(z_i - \bar{z})}{\sum_{i} (d_i - \bar{d})^{2}}, \qquad t^{*} = \bar{z} - s^{*}\,\bar{d},$$

where $\mathbf{d} = (d_1, \dots, d_N)$ and $\mathbf{z} = (z_1, \dots, z_N)$ are the vectors of predicted and sparse inverse depths at anchor locations, and $\bar{d}$, $\bar{z}$ are their means. The globally aligned map is

$$d_{\mathrm{GA}}(p) = s^{*}\, d(p) + t^{*}.$$
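The closed-form solution above is ordinary least squares over the anchor set; a minimal NumPy sketch (the function name is hypothetical, not from the paper's code):

```python
import numpy as np

def global_affine_alignment(d_pred, z_sparse):
    """Closed-form least-squares scale/shift in inverse-depth space.

    d_pred:   (N,) monocular inverse depths sampled at anchor pixels
    z_sparse: (N,) metric inverse depths from VIO at the same pixels
    Returns (s, t) minimizing sum((s * d + t - z)^2).
    """
    d_bar, z_bar = d_pred.mean(), z_sparse.mean()
    d_c = d_pred - d_bar
    s = np.dot(d_c, z_sparse - z_bar) / np.dot(d_c, d_c)
    t = z_bar - s * d_bar
    return s, t

# Sanity check: anchors generated by a known affine map are recovered exactly.
rng = np.random.default_rng(0)
d = rng.uniform(0.1, 1.0, size=100)
z = 2.5 * d + 0.3
s, t = global_affine_alignment(d, z)
```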
1.2. Learning-Based Dense Alignment (ScaleMapLearner)
Beyond the global correction, residual local scale discrepancies are corrected by the ScaleMapLearner (SML) neural network. The input comprises two channels:
- Channel 1: the globally aligned, normalized inverse-depth map $\hat{d}_{\mathrm{GA}}$
- Channel 2: a "scale-scaffold" map $r(p)$, constructed by computing at each sparse pixel $p_i$ an anchor scale $r_i = z_i / d_{\mathrm{GA}}(p_i)$, interpolating these values over their convex hull, and assigning $r(p) = 1$ elsewhere.
The network outputs a residual map $\Delta(p)$, from which the final per-pixel scale is computed as

$$\rho(p) = r(p) + \Delta(p),$$

and the fully aligned dense inverse-depth estimate is

$$\hat{z}(p) = \rho(p)\, d_{\mathrm{GA}}(p).$$
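The scaffold construction and residual application can be sketched as follows, assuming SciPy's `griddata` for the convex-hull interpolation (function names and the interpolation backend are illustrative choices, not the paper's implementation):

```python
import numpy as np
from scipy.interpolate import griddata

def build_scale_scaffold(anchor_xy, anchor_scales, shape):
    """Interpolate per-anchor scales r_i = z_i / d_GA(p_i) over their convex
    hull; pixels outside the hull default to a neutral scale of 1."""
    h, w = shape
    grid_y, grid_x = np.mgrid[0:h, 0:w]
    return griddata(anchor_xy, anchor_scales, (grid_x, grid_y),
                    method='linear', fill_value=1.0)

def apply_residual(scaffold, residual, d_ga):
    """Final per-pixel scale rho = r + Delta, applied to the globally
    aligned inverse-depth map."""
    rho = scaffold + residual  # residual Delta(p) is regressed by the network
    return rho * d_ga
```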
2. Network Architecture and Input Encoding
The ScaleMapLearner utilizes an EfficientNet-Lite3 encoder pretrained on ImageNet, producing four hierarchical resolution blocks. The decoder comprises four Feature-Fusion modules, each combining upsampled decoder features with the corresponding encoder skip features via two ResidualConvUnits and $3 \times 3$ convolutions; residual regression is performed by a final convolutional head. Input is strictly limited to the scale scaffold $r$ and the normalized inverse-depth map $\hat{d}_{\mathrm{GA}}$; additional cues such as RGB, gradients, or confidence maps do not improve zero-shot generalization.
Ablations demonstrate that optimal performance and robustness are achieved by regressing only scale (not shift) and by omitting alternative input channels; the "scaffold + $\hat{d}_{\mathrm{GA}}$" combination yields the best cross-dataset transfer.
3. Training Objectives
All losses are computed in inverse-depth space. Let $z^{\mathrm{gt}}$ denote the ground-truth inverse depth and $M$ the count of valid pixels. The objective consists of two terms:
- Depth Loss:
$$\mathcal{L}_{\mathrm{depth}} = \frac{1}{M} \sum_{p} \left| \hat{z}(p) - z^{\mathrm{gt}}(p) \right|$$
- Multiscale Gradient-Matching Loss ("MegaDepth"-style):
$$\mathcal{L}_{\mathrm{grad}} = \frac{1}{M} \sum_{k=1}^{K} \sum_{p} \left( \left| \nabla_{x} R^{k}(p) \right| + \left| \nabla_{y} R^{k}(p) \right| \right),$$
where $R^{k}$ is the residual $R = \hat{z} - z^{\mathrm{gt}}$ downsampled by a factor of $2^{k-1}$, summed over $K$ scales (typically four in MegaDepth-style training).
- Total Loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{depth}} + \lambda\, \mathcal{L}_{\mathrm{grad}},$$
with the weight $\lambda$ balancing the two terms.
This combination encourages photometrically accurate and edge-preserving per-pixel scale refinements.
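A simplified NumPy sketch of the two loss terms, with validity masking and per-scale normalization reduced to the essentials (function names hypothetical):

```python
import numpy as np

def depth_loss(z_hat, z_gt, valid):
    """L1 loss in inverse-depth space over valid pixels only."""
    return np.abs(z_hat - z_gt)[valid].mean()

def gradient_matching_loss(z_hat, z_gt, num_scales=4):
    """MegaDepth-style multiscale gradient matching on the residual.

    Penalizes horizontal and vertical gradients of the residual at
    several downsampled scales, encouraging edge-preserving refinement.
    """
    r = z_hat - z_gt
    total, count = 0.0, 0
    for _ in range(num_scales):
        gx = np.abs(np.diff(r, axis=1))  # horizontal residual gradients
        gy = np.abs(np.diff(r, axis=0))  # vertical residual gradients
        total += gx.sum() + gy.sum()
        count += r.size
        r = r[::2, ::2]                  # downsample residual by 2 per scale
    return total / count
```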
4. Incorporation of Sparse Metric Depth
At inference, the VIO frontend (e.g., VINS-Mono) tracks on the order of 100 feature points per frame, providing metric depths that are converted to sparse inverse depths $z_i$. These are projected into monocular image coordinates and downsampled to match the dense model’s resolution. The resulting sparse set fulfills two roles: (a) providing anchors for global affine alignment, and (b) seeding the scale-scaffold map for dense refinement. The “scaffold + $\hat{d}_{\mathrm{GA}}$” configuration enables the network to leverage sparse metric support as a strong geometric prior.
This approach allows rapid adaptation to varying density and distribution of VIO points, delivering consistent performance with as few as 50 anchor points and scaling to higher densities without retraining.
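Converting VIO landmarks into anchors might look like the following pinhole-projection sketch (function name and filtering details are illustrative, not taken from any specific VIO frontend):

```python
import numpy as np

def project_landmarks(points_cam, K, image_hw):
    """Project 3D landmarks (camera frame) to pixels; return sparse
    inverse-depth anchors (u, v, 1/z) for points in front of the camera
    and inside the image bounds. K is the 3x3 pinhole intrinsic matrix."""
    h, w = image_hw
    anchors = []
    for X, Y, Z in points_cam:
        if Z <= 0:
            continue  # skip points behind the camera
        u = K[0, 0] * X / Z + K[0, 2]
        v = K[1, 1] * Y / Z + K[1, 2]
        if 0 <= u < w and 0 <= v < h:
            anchors.append((u, v, 1.0 / Z))  # store metric inverse depth
    return np.array(anchors)
```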
5. Empirical Performance and Comparative Evaluation
Substantial improvements over both global alignment alone and state-of-the-art alternatives are demonstrated across diverse domains and anchor densities. The following table summarizes key quantitative results on TartanAir, VOID, and related datasets (anchor counts noted per row; GA = global alignment, SML = ScaleMapLearner):
| Method/Dataset | iRMSE | iRMSE Reduction (vs. corresponding baseline) |
|---|---|---|
| GA only (DPT-Hybrid, TartanAir) | 35.49 | — |
| GA+SML (DPT-Hybrid, TartanAir) | 29.48 | 17% |
| GA only (VOID, 150 pts) | 106.37 | — |
| GA+SML (VOID, 150 pts) | 74.67 | 30% |
| GA+SML (TartanAir-ZS, VOID) | 74.28 | 30% |
| GA+SML (TA pretrain + VOID ft, VOID) | 66.23 | 38% |
| KBNet (VOID, 150 pts) | 128.29 | — |
| GA+SML (DPT-BEIT-Large, VOID, 150 pts) | 57.13 | 55% |
| KBNet (VOID, 500 pts) | 85.59 | — |
| GA+SML (DPT-BEIT, VOID, 500 pts) | 49.85 | 42% |
Performance gains are especially pronounced at low anchor densities, with over 50% lower iRMSE than state-of-the-art sparse-to-dense depth completion (e.g., KBNet) on VOID. Robust cross-domain generalization is observed: zero-shot SML trained on synthetic TartanAir matches direct training on VOID, and similarly strong results are obtained for the NYUv2→VOID and VOID→NYUv2 transfer settings. Ablation studies confirm that relying only on the scale scaffold and $\hat{d}_{\mathrm{GA}}$ is optimal for this architecture (Wofk et al., 2023).
6. Architectural Modularity and Deployment Considerations
The pipeline admits independent replacement of its constituent stages: monocular predictor, VIO frontend, global alignment, and dense alignment. The Inverse Depth Alignment Module is compatible with any monocular architecture outputting dense affine-invariant inverse-depth (e.g., MiDaS, DPT-Hybrid, DPT-BEIT, SwinV2-Large, LeViT). As monocular backbones improve, the module further elevates metric accuracy. The VIO can be any method yielding 100–1,000 sparse metric depths (e.g., VINS-Mono, XIVO, ORB-SLAM3-VIO).
The computational profile is suitable for embedded deployment: on Jetson AGX Orin, the entire pipeline (MiDaS-small + SML, with TensorRT) executes at ~54 fps (256×256 inputs), and even heavier models such as DPT-Hybrid achieve ~10 fps.
7. Context, Significance, and Generalization
The Inverse Depth Alignment Module unifies metric supervision from VIO with monocular inference by leveraging closed-form affine alignment and data-driven, local scale refinement in inverse-depth space. This design achieves sharp reductions in inverse RMSE for both synthetic and real datasets, and facilitates seamless zero-shot and few-shot transfer between geometric domains. Its fully decoupled, modular nature permits integration into future monocular and VIO pipelines without the need to retrain upstream components (Wofk et al., 2023). A plausible implication is broad applicability in SLAM, robotics, and AR scenarios where metric depth is desirable but direct dense metric supervision is scarce.