Inverse Depth Alignment Module

Updated 22 February 2026
  • Inverse Depth Alignment Module is a two-stage process that fuses monocular inverse-depth estimates with sparse metric anchors from VIO.
  • It employs a global affine alignment using a least-squares approach followed by a neural network-based per-pixel scale refinement.
  • The module significantly improves depth accuracy and cross-dataset generalization, achieving substantially lower iRMSE even with very sparse anchors.

The Inverse Depth Alignment Module is a two-stage process for producing dense, metrically accurate depth estimates by combining monocular inverse-depth inference with sparse metric anchors from visual-inertial odometry (VIO). This module first applies a global affine alignment in the inverse-depth domain based on a least-squares solution, followed by a dense per-pixel scale refinement using a lightweight encoder-decoder network. Initially introduced in the context of monocular visual-inertial depth estimation, this architecture enables accurate dense depth mapping even with limited sparse metric supervision, offering significant improvements over previous depth completion and fusion methods (Wofk et al., 2023).

1. Mathematical Formulation and Workflow

The core pipeline consists of two successive alignment stages:

1.1. Global Scale-and-Shift Alignment

Let $z_p(x) = 1/d_p(x)$ denote the affine-invariant, unitless inverse depth predicted at pixel $x$ by a pretrained monocular model, and $\{x_i,\ z_s(x_i) = 1/d_s(x_i)\}_{i=1\dots n}$ the matching sparse metric inverse-depth points provided by VIO.

The global alignment stage solves

$$\min_{s_g, t_g}\ \sum_{i=1}^n \big(s_g z_p(x_i) + t_g - z_s(x_i)\big)^2$$

with closed-form solution

$$s_g = \frac{(Z_p - \mu_p\mathbf{1})^\top (Z_s - \mu_s\mathbf{1})}{\|Z_p - \mu_p\mathbf{1}\|^2},\qquad t_g = \mu_s - s_g\,\mu_p$$

where $Z_p$ and $Z_s$ are vectors of predicted and sparse inverse depths at the $n$ anchor locations, and $\mu_p$, $\mu_s$ are their means. The globally aligned map is

$$\tilde z(x) = s_g\,z_p(x) + t_g$$
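The closed-form fit above can be sketched in a few lines of NumPy. This is a minimal illustration (the function name `global_align` is ours, not from the paper), verified on synthetic data with a known affine transform:

```python
import numpy as np

def global_align(z_pred, z_sparse):
    """Closed-form least-squares scale s_g and shift t_g mapping
    predicted inverse depths onto sparse metric inverse depths."""
    mu_p, mu_s = z_pred.mean(), z_sparse.mean()
    zp_c = z_pred - mu_p              # centered predictions
    zs_c = z_sparse - mu_s            # centered sparse anchors
    s_g = (zp_c @ zs_c) / (zp_c @ zp_c)
    t_g = mu_s - s_g * mu_p
    return s_g, t_g

# Synthetic check: recover a known scale and shift.
rng = np.random.default_rng(0)
zp = rng.uniform(0.1, 1.0, size=100)   # unitless inverse depth at anchors
zs = 2.5 * zp + 0.3                    # "metric" anchors from the same affine map
s_g, t_g = global_align(zp, zs)
# s_g ≈ 2.5, t_g ≈ 0.3
```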

1.2. Learning-Based Dense Alignment (ScaleMapLearner)

Beyond the global correction, remaining local scale discrepancies are corrected by the ScaleMapLearner neural network. The input comprises two channels:

  • Channel 1: the globally aligned, normalized inverse-depth map $\tilde z(x)$
  • Channel 2: a "scale-scaffold" map $S_0$, constructed by computing at each sparse pixel $x_i$ an anchor scale $\alpha_i = z_s(x_i)/\tilde z(x_i)$, interpolating these $\alpha_i$ over their convex hull, and assigning 1 elsewhere.

The network outputs a residual map $r(x)$, from which the final per-pixel scale is computed as

$$S(x) = \operatorname{ReLU}\big(1 + r(x)\big)$$

and the fully aligned dense inverse-depth estimate is

$$\hat z(x) = S(x)\,\tilde z(x)$$
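A minimal NumPy sketch of the scaffold seeding and residual-to-scale mapping, assuming toy inputs. Note one simplification: the convex-hull interpolation of the $\alpha_i$ is omitted, so non-anchor pixels simply keep the value 1:

```python
import numpy as np

def build_scale_scaffold(shape, anchors, z_tilde):
    """Seed the scaffold S0 with alpha_i = z_s(x_i) / z_tilde(x_i) at each
    sparse pixel and 1 elsewhere. (The paper additionally interpolates the
    alpha_i over their convex hull; that step is omitted here for brevity.)"""
    S0 = np.ones(shape)
    for (row, col), z_s in anchors:
        S0[row, col] = z_s / z_tilde[row, col]
    return S0

def apply_dense_alignment(z_tilde, residual):
    """S(x) = ReLU(1 + r(x)), then z_hat(x) = S(x) * z_tilde(x)."""
    S = np.maximum(1.0 + residual, 0.0)   # ReLU clamps negative scales to 0
    return S * z_tilde

z_tilde = np.full((4, 4), 0.5)                        # toy aligned inverse depth
S0 = build_scale_scaffold((4, 4), [((1, 1), 0.6)], z_tilde)
z_hat = apply_dense_alignment(z_tilde, residual=np.zeros((4, 4)))
# S0[1, 1] == 1.2; a zero residual leaves z_tilde unchanged
```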

2. Network Architecture and Input Encoding

The ScaleMapLearner utilizes an EfficientNet-Lite3 encoder pretrained on ImageNet, producing four hierarchical resolution blocks. The decoder comprises four Feature-Fusion modules, each combining upsampled decoder features with corresponding encoder skip features via two ResidualConvUnits and $1\times1$ convolutions. Residual regression is performed via a $3\times3$ convolutional head. Input is strictly limited to the scale-scaffold $S_0$ and normalized $\tilde z$; other cues such as RGB, gradients, or confidences do not improve zero-shot generalization.

Ablations demonstrate that optimal performance and robustness are achieved by regressing only scale (not shift) and by omitting alternative input channels; the combination ("scaffold + $\tilde z$") yields the best cross-dataset transfer.

3. Training Objectives

All losses are computed in inverse-depth space. Let $z^*(x)$ denote the ground-truth inverse depth and $M$ the count of valid pixels. The objective consists of two terms:

  • $L_1$ Depth Loss:

$$L_\text{depth} = \frac{1}{M}\sum_x \left|z^*(x) - \hat z(x)\right|$$

  • Multiscale Gradient-Matching Loss ("MegaDepth"-style):

$$L_\text{grad} = \frac{1}{K}\sum_{k=1}^K \frac{1}{M}\sum_x \big(|\partial_x R^k(x)| + |\partial_y R^k(x)|\big),\qquad R(x) = z^*(x) - \hat z(x)$$

where $R^k(x)$ is the residual downsampled by a factor of $2^{k-1}$, with $K = 3$.

  • Total Loss:

$$L = L_\text{depth} + 0.5\,L_\text{grad}$$

This combination encourages metrically accurate, edge-preserving per-pixel scale refinements.
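The two loss terms can be sketched as follows. This is an illustrative NumPy version, not the paper's implementation: downsampling is done by simple striding, and the valid-pixel count is recomputed per scale (an implementation choice on our part):

```python
import numpy as np

def depth_l1(z_gt, z_hat, mask):
    """L1 depth loss over the M valid pixels, in inverse-depth space."""
    return np.abs(z_gt - z_hat)[mask].sum() / mask.sum()

def grad_matching(z_gt, z_hat, mask, K=3):
    """Multiscale gradient-matching loss on the residual R = z* - z_hat."""
    R = np.where(mask, z_gt - z_hat, 0.0)
    total = 0.0
    for k in range(K):
        Rk = R[:: 2 ** k, :: 2 ** k]           # 2**(k-1)-downsampled residual
        Mk = mask[:: 2 ** k, :: 2 ** k].sum()  # valid count at this scale
        gx = np.abs(np.diff(Rk, axis=1)).sum() # sum of |d/dx R^k|
        gy = np.abs(np.diff(Rk, axis=0)).sum() # sum of |d/dy R^k|
        total += (gx + gy) / Mk
    return total / K

def total_loss(z_gt, z_hat, mask):
    return depth_l1(z_gt, z_hat, mask) + 0.5 * grad_matching(z_gt, z_hat, mask)

mask = np.ones((8, 8), dtype=bool)
z_gt = np.random.default_rng(1).uniform(0.1, 1.0, (8, 8))
loss = total_loss(z_gt, z_gt, mask)   # perfect prediction -> zero loss
```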

4. Incorporation of Sparse Metric Depth

At inference, the VIO frontend (e.g., VINS-Mono) tracks on the order of 100 feature points per frame, providing metric depths $d_s(x_i)$. These are projected into monocular image coordinates and downsampled to match the dense model's resolution. The resulting sparse set $\{z_s(x_i)\}$ fulfills two roles: (a) providing anchors for global affine alignment, and (b) seeding the scale-scaffold map for dense refinement. The "scaffold + $\tilde z$" configuration enables the network to leverage sparse metric support as a strong geometric prior.

This approach allows rapid adaptation to varying density and distribution of VIO points, delivering consistent performance with as few as 50 anchor points and scaling to higher densities without retraining.
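The projection-and-downsampling step might be sketched as below. The function name, the nearest-cell policy, and the toy resolutions are our illustrative assumptions, not details from the paper:

```python
def project_to_grid(points, full_res, model_res):
    """Map VIO feature locations from full-resolution image coordinates to
    the dense model's grid, storing the metric inverse depth per cell.
    `points` is a list of ((row, col), metric_depth) tuples."""
    fh, fw = full_res
    mh, mw = model_res
    sparse = {}
    for (r, c), d in points:
        rr = int(r * mh / fh)          # scale the row index to the model grid
        cc = int(c * mw / fw)          # scale the column index to the model grid
        sparse[(rr, cc)] = 1.0 / d     # store inverse depth z_s = 1/d_s
    return sparse

pts = [((120, 320), 4.0), ((360, 200), 2.0)]
grid = project_to_grid(pts, full_res=(480, 640), model_res=(256, 256))
# grid == {(64, 128): 0.25, (192, 80): 0.5}
```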

5. Empirical Performance and Comparative Evaluation

Substantial improvements over both global alignment (GA) alone and state-of-the-art alternatives are demonstrated across diverse domains and anchor densities. The following table summarizes key quantitative results on TartanAir, VOID, and other settings ($n = 150$ anchor points unless noted):

| Method (Dataset) | iRMSE | Relative iRMSE Reduction |
| --- | --- | --- |
| GA only (DPT-Hybrid, TartanAir) | 35.49 | |
| GA+SML (DPT-Hybrid, TartanAir) | 29.48 | 17% |
| GA only (VOID, 150 pts) | 106.37 | |
| GA+SML (VOID, 150 pts) | 74.67 | 30% |
| GA+SML (TartanAir zero-shot, VOID) | 74.28 | 30% |
| GA+SML (TartanAir pretrain + VOID fine-tune, VOID) | 66.23 | 38% |
| KBNet (VOID, 150 pts) | 128.29 | |
| GA+SML (DPT-BEiT-Large, VOID, 150 pts) | 57.13 | 55% |
| KBNet (VOID, 500 pts) | 85.59 | |
| GA+SML (DPT-BEiT, VOID, 500 pts) | 49.85 | 42% |
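As a sanity check, the relative-reduction percentages follow directly from the paired iRMSE values reported above:

```python
# Recompute the relative iRMSE reductions from the raw table values.
pairs = {
    "GA -> GA+SML (DPT-Hybrid, TartanAir)": (35.49, 29.48),
    "KBNet -> GA+SML (VOID, 150 pts)": (128.29, 57.13),
    "KBNet -> GA+SML (VOID, 500 pts)": (85.59, 49.85),
}
reductions = {name: round(100 * (1 - after / before))
              for name, (before, after) in pairs.items()}
# reductions: 17, 55, and 42 percent, matching the reported column
```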

Performance gains are especially pronounced at low anchor densities, with over 50% lower iRMSE compared to state-of-the-art sparse-to-dense depth completion (e.g., KBNet) on VOID. Robust cross-domain generalization is observed: zero-shot SML trained on synthetic TartanAir matches direct training on VOID, and similarly strong results are obtained for the NYUv2 → VOID and VOID → NYUv2 transfer settings. Ablation studies confirm that relying only on the scale scaffold and $\tilde z$ is optimal for this architecture (Wofk et al., 2023).

6. Architectural Modularity and Deployment Considerations

The pipeline admits independent replacement of its constituent stages: monocular predictor, VIO frontend, global alignment, and dense alignment. The Inverse Depth Alignment Module is compatible with any monocular architecture outputting dense affine-invariant inverse-depth (e.g., MiDaS, DPT-Hybrid, DPT-BEIT, SwinV2-Large, LeViT). As monocular backbones improve, the module further elevates metric accuracy. The VIO can be any method yielding 100–1,000 sparse metric depths (e.g., VINS-Mono, XIVO, ORB-SLAM3-VIO).

The computational profile is suitable for embedded deployment: on Jetson AGX Orin, the entire pipeline (MiDaS-small + SML, with TensorRT) executes at ~54 fps (256×256 inputs), and even heavier models such as DPT-Hybrid achieve ~10 fps.

7. Context, Significance, and Generalization

The Inverse Depth Alignment Module unifies metric supervision from VIO with monocular inference by leveraging closed-form affine alignment and data-driven, local scale refinement in inverse-depth space. This design achieves sharp reductions in inverse RMSE for both synthetic and real datasets, and facilitates seamless zero-shot and few-shot transfer between geometric domains. Its fully decoupled, modular nature permits integration into future monocular and VIO pipelines without the need to retrain upstream components (Wofk et al., 2023). A plausible implication is broad applicability in SLAM, robotics, and AR scenarios where metric depth is desirable but direct dense metric supervision is scarce.
