
Multi-Scale Feature Bundle-Adjustment Layer

Updated 27 December 2025
  • The paper introduces a differentiable module that integrates classic bundle adjustment with deep feature pyramids to enforce multi-view geometric constraints.
  • It utilizes a fixed-iteration, differentiable Levenberg–Marquardt solver to jointly optimize poses and depth weights via feature-metric errors across three scales.
  • The approach leverages multi-scale coarse-to-fine refinement and learned basis depth maps to achieve robust, end-to-end dense structure-from-motion.

A Multi-Scale Feature Bundle-Adjustment (BA) Layer is a differentiable module that incorporates classic bundle adjustment strategies within a deep neural network, enforcing multi-view geometric constraints through the minimization of feature-metric error across multiple feature pyramid levels. This approach is integral to the BA-Net architecture (Tang et al., 2018), allowing dense structure-from-motion (SfM) via end-to-end optimization of features, poses, and depth fields within a unified graph by leveraging a fixed-iteration, differentiable Levenberg–Marquardt (LM) solver.

1. Architectural Overview

The core design consists of a backbone convolutional neural network (CNN), specifically DRN-54, feeding two distinct heads:

  • The Feature Pyramid Constructor produces multi-scale feature maps $\{F_i^1, F_i^2, F_i^3\}$ per input image $I_i$.
  • The Basis Depth Map Generator is an encoder–decoder that generates $K$ basis depth maps $B \in \mathbb{R}^{HW \times K}$. These bases represent distinct depth patterns learned to span the plausible depth manifold for the scene.

Depth in the scene is parameterized as a linear combination $w \in \mathbb{R}^K$ of these generated bases. The BA-Layer consumes the feature pyramids, basis depth maps, and current parameters $\Theta = (\{T_i\}, w)$, where $T_i \in \mathrm{SE}(3)$ denotes each camera's pose, and applies a fixed number of differentiable LM updates. Losses on estimated poses and depths are back-propagated through all parts of the pipeline, enabling end-to-end learning.
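A minimal PyTorch sketch of this two-head layout (the single-convolution stand-ins for the real heads, the channel counts, and the choice of pyramid level feeding the basis head are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class BANetHeads(nn.Module):
    """Illustrative two-head layout: feature pyramid + basis depth maps."""
    def __init__(self, backbone_channels=512, feat_channels=128, K=128):
        super().__init__()
        # One 1x1 conv per pyramid level turns backbone maps into features.
        self.feat_heads = nn.ModuleList(
            [nn.Conv2d(backbone_channels, feat_channels, 1) for _ in range(3)]
        )
        # Stand-in for the encoder-decoder producing K basis depth maps.
        self.basis_head = nn.Conv2d(backbone_channels, K, 3, padding=1)

    def forward(self, backbone_pyramid):
        # backbone_pyramid: list of 3 tensors [B, C, H_l, W_l], coarse to fine.
        feats = [h(x) for h, x in zip(self.feat_heads, backbone_pyramid)]
        basis = self.basis_head(backbone_pyramid[-1])   # [B, K, H, W]
        return feats, basis
```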

2. Feature-Metric Bundle Adjustment Formulation

The innovation lies in using feature-metric error rather than raw photometric error. For each pixel $j$ in the reference view ($i = 1$), with depth $d_j$, its correspondence in another view $i$ is predicted by warping with the current pose $T_i$: $\pi(T_i, d_j \bar{x}_j)$. The residual for pixel $j$ in view $i$ at pyramid level $\ell$ is

$$e_{i,j}^f(\Theta) = F_i^\ell\bigl(\pi(T_i, d_j \bar{x}_j)\bigr) - F_1^\ell(x_j) \;\in\; \mathbb{R}^C,$$

where $C$ is the feature dimension. The overall energy minimized is

$$E(\Theta) = \sum_{\ell=1}^{L} \sum_{i=2}^{N_i} \sum_{j=1}^{N_j} \bigl\| e_{i,j}^f(\Theta) \bigr\|^2,$$

with $L = 3$ pyramid levels. This cost captures multi-view consistency at several scales, directly leveraging the expressive power of learned feature descriptors.
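The warping and sampling behind this residual can be sketched as follows, assuming a simple pinhole camera and one `[C, H, W]` feature map per view; the intrinsics handling and bilinear sampling via `grid_sample` are implementation choices, not prescribed by the text:

```python
import torch
import torch.nn.functional as F

def feature_metric_residual(F_ref, F_src, depth, R, t, K_mat):
    """e_j = F_src(pi(T, d_j * xbar_j)) - F_ref(x_j) for every pixel j.

    F_ref, F_src: [C, H, W] feature maps at one pyramid level.
    depth: [H, W]; R: [3, 3]; t: [3]; K_mat: [3, 3] camera intrinsics.
    """
    C, H, W = F_ref.shape
    # Homogeneous pixel grid xbar_j.
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing='ij')
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)
    # Back-project, transform by (R, t), project with intrinsics.
    rays = pix @ torch.linalg.inv(K_mat).T          # [HW, 3] viewing rays
    cam = (depth.reshape(-1, 1) * rays) @ R.T + t   # T applied to d_j * xbar_j
    proj = cam @ K_mat.T
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6) # pi(...): perspective divide
    # Normalize to [-1, 1] and sample F_src bilinearly at the warped points.
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1)
    sampled = F.grid_sample(F_src[None], grid.reshape(1, H, W, 2),
                            align_corners=True)[0]  # [C, H, W]
    return sampled - F_ref                          # feature-metric residuals
```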

3. Depth Parameterization Using Learned Basis Maps

To compactly represent dense depth fields with lower-dimensional parameters, a set of $K$ basis depth maps $B$ is generated by a learned encoder–decoder head. The final dense depth is reconstructed as

$$d = \mathrm{ReLU}(Bw), \qquad d_j = \mathrm{ReLU}\bigl([B]_{j,:}\, w\bigr),$$

where $w$ is optimized during bundle adjustment, while $B$ remains fixed within an iteration (but is learnable across training). This representation reduces overfitting and ensures physically plausible, non-negative depths via the ReLU nonlinearity.
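A minimal sketch of this reconstruction, following the $B \in \mathbb{R}^{HW \times K}$ convention above (the image size and $K$ in the usage lines are hypothetical):

```python
import torch

def depth_from_weights(B, w):
    """d = ReLU(B w): dense depth from K learned basis depth maps.

    B: [HW, K] basis maps (fixed during one BA iteration),
    w: [K] weights optimized by the BA-Layer.
    Returns d: [HW] non-negative dense depth.
    """
    return torch.relu(B @ w)

# Usage: a K-dimensional w stands in for a full HW-dimensional depth field.
B = torch.rand(240 * 320, 128)            # hypothetical 240x320 image, K = 128
w = torch.randn(128, requires_grad=True)  # optimized by the BA-Layer
d = depth_from_weights(B, w)              # gradients flow to w (and to B in training)
```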

4. Differentiable Levenberg–Marquardt Optimization

The BA-Layer adapts classical LM optimization to the learning framework. The stacked vector of all feature-metric residuals, $e(\Theta)$, is used to assemble the Jacobian $J(\Theta) = \partial e / \partial \Theta$, partitioned into pose and depth-weight blocks. The LM step is

$$\Delta\Theta = -(H + \lambda D)^{-1} J^\top e(\Theta),$$

with $H = J^\top J$ and $D = \mathrm{diag}(H)$. Uniquely, the damping factor $\lambda$ is predicted by an MLP fed a 128-D summary of the globally average-pooled absolute residuals, ensuring differentiability throughout.
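A sketch of one such damped step, assuming the feature dimension is 128 so that channel-wise average pooling of the absolute residuals yields the 128-D summary; the MLP layer sizes and the Softplus used to keep $\lambda$ positive are assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical damping predictor: 128-D residual summary -> lambda > 0.
lambda_mlp = nn.Sequential(nn.Linear(128, 256), nn.ReLU(),
                           nn.Linear(256, 1), nn.Softplus())

def lm_step(residuals, J):
    """One differentiable LM update: Delta = -(H + lambda*D)^{-1} J^T e.

    residuals: [M, 128] per-pixel feature residuals (C = 128 assumed),
    J: [M*128, P] Jacobian w.r.t. the P parameters in Theta.
    """
    e = residuals.reshape(-1)                  # stacked residual vector e(Theta)
    pooled = residuals.abs().mean(dim=0)       # 128-D global average of |e|
    lam = lambda_mlp(pooled)                   # learned, soft damping factor
    H = J.T @ J                                # Gauss-Newton approximation
    D = torch.diag(torch.diagonal(H))          # diagonal damping matrix
    return -torch.linalg.solve(H + lam * D, J.T @ e)
```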

Pose updates are applied via the SE(3) exponential map; depth weights are updated by vector addition:

$$T_i \leftarrow \exp\bigl(\widehat{\Delta\xi_i}\bigr)\, T_i, \qquad w \leftarrow w + \Delta w.$$

This procedure is repeated for a fixed number (five) of LM steps per scale.
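For illustration, a hand-rolled se(3) exponential map implementing the pose update (a production version would use a library routine and keep the matrix assembly autograd-friendly; this sketch only shows the math):

```python
import torch

def hat(v):
    """Skew-symmetric matrix of v, so that hat(v) @ x = v x x (cross product)."""
    z = torch.zeros((), dtype=v.dtype)
    return torch.stack([torch.stack([z, -v[2], v[1]]),
                        torch.stack([v[2], z, -v[0]]),
                        torch.stack([-v[1], v[0], z])])

def se3_exp(xi):
    """Exponential of a twist xi = (rho, phi) in R^6 as a 4x4 SE(3) matrix."""
    rho, phi = xi[:3], xi[3:]
    theta, K, I = phi.norm(), hat(phi), torch.eye(3, dtype=xi.dtype)
    if theta < 1e-8:                                 # small-angle limit
        R, V = I + K, I
    else:
        a = torch.sin(theta) / theta
        b = (1 - torch.cos(theta)) / theta**2
        c = (theta - torch.sin(theta)) / theta**3
        R = I + a * K + b * (K @ K)                  # Rodrigues' formula
        V = I + b * K + c * (K @ K)                  # left Jacobian of SO(3)
    T = torch.eye(4, dtype=xi.dtype)
    T[:3, :3], T[:3, 3] = R, V @ rho
    return T

# Updates from the text: T_i <- se3_exp(delta_xi) @ T_i;  w <- w + delta_w
```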

5. Multi-Scale Coarse-to-Fine Strategy

The module operates in a coarse-to-fine cascade using three feature pyramid levels:

  • At the coarsest level ($\ell = 3$), initial parameters $\Theta_0$ are refined via 5 LM steps.
  • The result initializes the next (finer) level ($\ell = 2$), again with 5 LM updates.
  • The procedure is repeated at $\ell = 1$.

Across scales, the mechanism benefits from increased convergence basins and robust global context at coarse resolutions, while fine levels hone spatial detail.

| Scale Level ($\ell$) | Feature Resolution | LM Steps per Level |
|----------------------|--------------------|--------------------|
| 3 (coarsest)         | Lowest             | 5                  |
| 2                    | Intermediate       | 5                  |
| 1 (finest)           | Highest            | 5                  |

After each level, the depth estimate is upsampled to the next (finer) resolution, while the pose estimates are carried over unchanged.
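The overall control flow can be sketched as below; `solve_level` is a hypothetical stand-in for the Section 4 machinery, and the toy usage replaces it with a trivial decay so the skeleton runs on its own:

```python
def coarse_to_fine_ba(solve_level, theta0, levels=(3, 2, 1), lm_steps=5):
    """Run a fixed number of LM steps per pyramid level, coarse to fine.

    solve_level(theta, level) performs one LM update at the given level and
    returns the updated parameters (poses + depth weights w).
    """
    theta = theta0
    for level in levels:               # l = 3 (coarsest) ... l = 1 (finest)
        for _ in range(lm_steps):      # 5 damped updates per level
            theta = solve_level(theta, level)
        # theta initializes the next, finer level; the depth estimate is
        # upsampled there while the poses carry over unchanged.
    return theta

# Toy usage: each "LM step" just decays a scalar error toward zero.
theta = coarse_to_fine_ba(lambda th, l: th * 0.5, theta0=1.0)
print(theta)   # 1.0 * 0.5**15 after 3 levels x 5 steps
```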

6. End-to-End Differentiability and Gradient Flow

By eliminating all non-differentiable branches (fixed iteration counts replace convergence tests, and the learned soft $\lambda$ replaces discrete damping schedules), the module is fully unrolled into a computation graph across $3 \times 5 = 15$ iterations. Gradients from losses on camera poses and depth fields propagate through every computational step, including through the MLP that predicts $\lambda$, the Jacobian and residuals, and directly into the feature pyramid and basis depth generators. This enables the network to learn features, bases, and damping strategies that facilitate rapid and stable convergence during optimization.
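A toy demonstration of the principle: unrolling a fixed number of differentiable updates lets an outer loss reach a parameter that shapes the inner objective (the quadratic cost below is a stand-in for the feature-metric energy):

```python
import torch

# Learnable "feature" parameter shaping the inner objective a * x^2.
a = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(5.0)                 # inner optimization variable

for _ in range(15):                   # 3 levels x 5 steps, fully unrolled
    grad = 2 * a * x                  # d/dx of the inner cost a * x^2
    x = x - 0.1 * grad                # differentiable update, no detach

loss = (x - 1.0) ** 2                 # outer supervision on the solution
loss.backward()                       # gradient reaches a through all 15 steps
print(a.grad)                         # nonzero: end-to-end differentiability
```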

7. Supervision, Losses, and Regularization

Training leverages pose and depth supervision:

  • Pose loss combines rotation quaternion and translation vector errors:

$$\mathcal{L}_\mathrm{rot} = \bigl\| q(T_i) - q^\star \bigr\|_2, \qquad \mathcal{L}_\mathrm{trans} = \bigl\| t(T_i) - t^\star \bigr\|_2.$$

  • Depth loss uses per-pixel BerHu loss:

$$\mathcal{L}_\mathrm{depth} = \sum_j \mathrm{BerHu}\bigl(d_j, d_j^\star\bigr).$$

The total loss is a weighted sum,

$$\mathcal{L} = \alpha\,\mathcal{L}_\mathrm{rot} + \beta\,\mathcal{L}_\mathrm{trans} + \gamma\,\mathcal{L}_\mathrm{depth},$$

with coefficients balancing each term. No explicit regularizer is required for the depth weights $w$ or poses; the ReLU on $d = Bw$ suffices for non-negativity.
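A sketch of these terms, using the adaptive BerHu threshold $c = 0.2 \max_j |d_j - d_j^\star|$ (a common convention, assumed here rather than stated in the text) and placeholder loss weights:

```python
import torch

def berhu(pred, target, c_ratio=0.2):
    """Reverse Huber (BerHu): L1 below the threshold c, scaled L2 above it."""
    err = (pred - target).abs()
    c = (c_ratio * err.max()).clamp(min=1e-8)   # adaptive threshold (assumed)
    l2 = (err ** 2 + c ** 2) / (2 * c)
    return torch.where(err <= c, err, l2).sum()

def total_loss(q_pred, q_gt, t_pred, t_gt, d_pred, d_gt,
               alpha=1.0, beta=1.0, gamma=1.0):  # weights are placeholders
    l_rot = (q_pred - q_gt).norm()               # rotation quaternion error
    l_trans = (t_pred - t_gt).norm()             # translation vector error
    l_depth = berhu(d_pred, d_gt)                # per-pixel BerHu, summed
    return alpha * l_rot + beta * l_trans + gamma * l_depth
```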


In summary, the Multi-Scale Feature Bundle-Adjustment Layer constitutes a differentiable, iterative feature-metric solver integrated with deep feature and basis learning, optimized within a coarse-to-fine, end-to-end trainable framework for dense structure-from-motion (Tang et al., 2018).

References

  • Tang, C., & Tan, P. (2018). BA-Net: Dense Bundle Adjustment Networks.
