Multi-Scale Feature Bundle-Adjustment Layer
- The paper introduces a differentiable module that integrates classic bundle adjustment with deep feature pyramids to enforce multi-view geometric constraints.
- It utilizes a fixed-iteration, differentiable Levenberg–Marquardt solver to jointly optimize poses and depth weights via feature-metric errors across three scales.
- The approach leverages multi-scale coarse-to-fine refinement and learned basis depth maps to achieve robust, end-to-end dense structure-from-motion.
A Multi-Scale Feature Bundle-Adjustment (BA) Layer is a differentiable module that incorporates classic bundle adjustment strategies within a deep neural network, enforcing multi-view geometric constraints through the minimization of feature-metric error across multiple feature pyramid levels. This approach is integral to the BA-Net architecture (Tang et al., 2018), allowing dense structure-from-motion (SfM) via end-to-end optimization of features, poses, and depth fields within a unified graph by leveraging a fixed-iteration, differentiable Levenberg–Marquardt (LM) solver.
1. Architectural Overview
The core design consists of a backbone convolutional neural network (CNN), specifically DRN-54, feeding two distinct heads:
- The Feature Pyramid Constructor produces multi-scale feature maps $F^{l}_{i}$ ($l = 1, 2, 3$) per input image $I_i$.
- The Basis Depth Map Generator is an encoder–decoder that generates basis depth maps $B_1, \dots, B_K$ ($K = 128$ in BA-Net). These bases represent distinct depth patterns learned to span the plausible depth manifold for the scene.
Depth in the scene is parameterized as a linear combination of these generated bases. The BA-Layer consumes the feature pyramids, basis depth maps, and current parameters $\mathcal{X} = (T_1, \dots, T_N, w)$ — where $T_i$ denotes each camera's pose and $w$ the depth-weight vector — and applies a fixed number of differentiable LM updates. Losses on estimated poses and depths are back-propagated through all parts of the pipeline, enabling end-to-end learning.
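The dataflow can be summarized in a minimal PyTorch-style sketch; the class and the injected module names are illustrative placeholders, not the authors' code:

```python
import torch
import torch.nn as nn

class BANetSketch(nn.Module):
    """Illustrative dataflow of the two-head design (not the authors' code)."""

    def __init__(self, backbone, feature_head, depth_head, ba_layer):
        super().__init__()
        self.backbone = backbone          # e.g. a DRN-54 trunk
        self.feature_head = feature_head  # -> 3-level feature pyramid per image
        self.depth_head = depth_head      # encoder-decoder -> K basis depth maps
        self.ba_layer = ba_layer          # differentiable multi-scale LM solver

    def forward(self, images, init_poses, init_weights):
        feats = self.backbone(images)              # shared backbone features
        pyramids = self.feature_head(feats)        # multi-scale feature maps
        bases = self.depth_head(feats[:1])         # bases for the reference view
        poses, weights = self.ba_layer(pyramids, bases, init_poses, init_weights)
        # dense depth = ReLU of the weighted sum of basis maps (see Section 3)
        depth = torch.relu((weights[..., None, None] * bases).sum(dim=1))
        return poses, depth
```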
2. Feature-Metric Bundle Adjustment Formulation
The innovation lies in using feature-metric error rather than raw photometric error. For each pixel $q$ in the reference view ($i = 1$), with depth $d_q$, its correspondence in another view $i$ is predicted by warping with the current pose $T_i$: $q' = \pi\!\left(T_i \cdot d_q\,\pi^{-1}(q)\right)$, where $\pi$ denotes perspective projection. The residual for pixel $q$ in view $i$ at pyramid level $l$ is

$$e^{l}_{i,q} = F^{l}_{i}\!\left(\pi\!\left(T_i \cdot d_q\,\pi^{-1}(q)\right)\right) - F^{l}_{1}(q) \in \mathbb{R}^{C},$$

where $C$ is the feature dimension. The overall energy minimized is

$$E^{l}(\mathcal{X}) = \sum_{i=2}^{N} \sum_{q} \left\| e^{l}_{i,q} \right\|_2^2,$$

with $l = 1, 2, 3$ pyramid levels. This cost captures multi-view consistency at several scales, directly leveraging the expressive power of learned feature descriptors.
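A hedged sketch of this residual at a single pyramid level, using bilinear sampling (`grid_sample`) to realize the warp; the function name and tensor layout are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def feature_metric_residual(F_ref, F_src, depth, pose_R, pose_t, K):
    """Residual e_{i,q} = F_src(warp(q)) - F_ref(q) at one pyramid level.

    F_ref, F_src: (1, C, H, W) feature maps; depth: (1, H, W);
    pose_R: (3, 3), pose_t: (3,), K: (3, 3) intrinsics. Illustrative only.
    """
    _, C, H, W = F_ref.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)
    rays = torch.linalg.inv(K) @ pix               # back-project pixels to rays
    pts = rays * depth.reshape(1, -1)              # 3-D points d_q * K^-1 q
    pts = pose_R @ pts + pose_t[:, None]           # rigid transform T_i
    proj = K @ pts
    uv = proj[:2] / proj[2:].clamp(min=1e-6)       # perspective divide
    # normalize pixel coordinates to [-1, 1] for grid_sample
    grid = torch.stack([2 * uv[0] / (W - 1) - 1,
                        2 * uv[1] / (H - 1) - 1], dim=-1).reshape(1, H, W, 2)
    warped = F.grid_sample(F_src, grid, align_corners=True)
    return (warped - F_ref).reshape(C, -1)         # (C, H*W) residual vectors
```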
3. Depth Parameterization Using Learned Basis Maps
To compactly represent dense depth fields with lower-dimensional parameters, a set of basis depth maps $B_1, \dots, B_K$ is generated by a learned encoder–decoder head. The final dense depth is reconstructed as

$$D(w) = \mathrm{ReLU}\!\left(\sum_{k=1}^{K} w_k\, B_k\right),$$

where $w \in \mathbb{R}^{K}$ is optimized during bundle adjustment, while the bases $B_k$ remain fixed within an iteration (but are learnable across training). This representation reduces overfitting and ensures physically plausible, non-negative depths via the ReLU nonlinearity.
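In code the reconstruction is a single weighted sum followed by a ReLU; a minimal sketch:

```python
import torch

def depth_from_bases(w, B):
    """Dense depth D = ReLU(sum_k w_k * B_k).

    w: (K,) weights optimized by the BA-Layer; B: (K, H, W) learned basis
    depth maps, fixed during the LM iterations. The ReLU clamps the
    reconstructed depths to be non-negative.
    """
    return torch.relu(torch.einsum("k,khw->hw", w, B))
```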
4. Differentiable Levenberg–Marquardt Optimization
The BA-Layer adapts classical LM optimization to the learning framework. The vector of all feature-metric residuals $E(\mathcal{X})$ is used to assemble the Jacobian $J = \partial E / \partial \mathcal{X}$, partitioned into pose and depth-weight blocks. The LM step is

$$\Delta\mathcal{X} = -\left(J^{\top} J + \lambda\,\mathrm{diag}\!\left(J^{\top} J\right)\right)^{-1} J^{\top} E(\mathcal{X}),$$

with $\lambda > 0$ the damping factor. Uniquely, $\lambda$ is obtained from an MLP fed a 128-D summary of the globally average-pooled absolute residuals, ensuring differentiability throughout.
Pose updates are applied via the SE(3) exponential map, $T_i \leftarrow \exp\!\left(\Delta\xi_i^{\wedge}\right) T_i$; depth weights are updated by vector addition, $w \leftarrow w + \Delta w$. This procedure is repeated for a fixed number (five) of LM steps per scale.
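A hedged sketch of one such update, mirroring the formula above; `lambda_mlp` stands in for the damping network and the tensor shapes are assumptions for illustration:

```python
import torch

def lm_step(residuals, jacobian, lambda_mlp):
    """One differentiable LM update (illustrative, not the authors' code).

    residuals: (M, C) stacked feature-metric residuals;
    jacobian:  (M*C, P) Jacobian w.r.t. poses and depth weights;
    lambda_mlp: small network mapping the C-dim (128-D) channel-averaged
    absolute residual to a positive scalar damping factor.
    """
    r = residuals.reshape(-1)                        # flatten to (M*C,)
    JtJ = jacobian.T @ jacobian                      # Gauss-Newton approximation
    Jtr = jacobian.T @ r
    summary = residuals.abs().mean(dim=0)            # global average pooling, (C,)
    lam = lambda_mlp(summary)                        # learned, differentiable damping
    damping = lam * torch.diag(torch.diagonal(JtJ))
    delta = torch.linalg.solve(JtJ + damping, -Jtr)  # solve the normal equations
    return delta  # caller splits into SE(3) increments and depth-weight updates
```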
5. Multi-Scale Coarse-to-Fine Strategy
The module operates in a coarse-to-fine cascade using three feature pyramid levels:
- At the coarsest level ($l = 3$), initial parameters are refined via 5 LM steps.
- The result initializes the next (finer) level ($l = 2$), again with 5 LM updates.
- The procedure is repeated at the finest level ($l = 1$).
Across scales, the mechanism benefits from increased convergence basins and robust global context at coarse resolutions, while fine levels hone spatial detail.
| Scale Level ($l$) | Feature Resolution | LM Steps per Level |
|---|---|---|
| 3 (coarsest) | Lowest | 5 |
| 2 | Intermediate | 5 |
| 1 (finest) | Highest | 5 |
After each level, the depth estimate is upsampled to the next feature resolution to initialize the finer level, while pose estimates carry over unchanged.
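A minimal sketch of this cascade, assuming a `ba_step` callable that applies one differentiable LM update (as in the sketch in Section 4):

```python
def multiscale_ba(pyramids, bases, poses, weights, ba_step, num_steps=5):
    """Coarse-to-fine cascade: 5 LM steps at l=3, then l=2, then l=1.

    pyramids: dict mapping level l -> per-image feature maps; ba_step applies
    one differentiable LM update at that level. Illustrative only.
    """
    for level in (3, 2, 1):                      # coarsest to finest
        for _ in range(num_steps):
            poses, weights = ba_step(pyramids[level], bases, poses, weights)
        # the depth implied by the basis weights feeds the next, higher
        # resolution; poses carry over unchanged
    return poses, weights
```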
6. End-to-End Differentiability and Gradient Flow
By eliminating all non-differentiable branches (i.e., using fixed iteration counts and a learned, soft damping factor $\lambda$), the module is fully unrolled into a computation graph across iterations. Gradients from losses on camera poses and depth fields propagate through every computational step, including through the MLP that predicts $\lambda$, the Jacobian and residuals, and directly into the feature pyramid and basis depth generators. This enables the network to learn features, bases, and damping strategies that facilitate rapid and stable convergence during optimization.
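A toy illustration of the unrolling idea: because the iteration count is fixed, the loop below builds a static graph and the gradient reaches the learnable parameter through every step (a stand-in for the full LM solver, not BA-Net code):

```python
import torch

w = torch.tensor([0.5], requires_grad=True)  # stands in for a learnable parameter
x = torch.tensor([2.0])                      # stands in for the optimized state
for _ in range(5):                           # fixed iteration count -> unrolled graph
    r = x * w - 1.0                          # toy residual
    x = x - 0.1 * r                          # differentiable update step
loss = (x - 0.4) ** 2
loss.backward()                              # gradient flows through all 5 steps
print(w.grad)                                # non-None: end-to-end differentiable
```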
7. Supervision, Losses, and Regularization
Training leverages pose and depth supervision:
- Pose loss combines rotation quaternion and translation vector errors: $\mathcal{L}_{\text{pose}} = \left\| \mathbf{q} - \mathbf{q}^{*} \right\|_2 + \left\| \mathbf{t} - \mathbf{t}^{*} \right\|_2$.
- Depth loss uses a per-pixel BerHu (reverse Huber) loss, $\mathcal{L}_{\text{depth}} = \sum_{q} \mathcal{B}\!\left(d_q - d_q^{*}\right)$, where $\mathcal{B}(x) = |x|$ for $|x| \le c$ and $\mathcal{B}(x) = \frac{x^2 + c^2}{2c}$ otherwise (see the sketch after this list).
The total loss is a weighted sum $\mathcal{L} = \alpha\,\mathcal{L}_{\text{pose}} + \beta\,\mathcal{L}_{\text{depth}}$, with coefficients $\alpha$ and $\beta$ balancing each term. No explicit regularizer is required for the depth weights or poses; the ReLU on the reconstructed depth suffices for non-negativity.
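A hedged sketch of both losses; the BerHu threshold $c = 0.2 \cdot \max|d - d^{*}|$ is a common convention, not necessarily the paper's exact setting, and all names are illustrative:

```python
import torch

def berhu(diff, c):
    """Reverse Huber penalty: |x| below threshold c, quadratic above it."""
    absd = diff.abs()
    return torch.where(absd <= c, absd, (diff ** 2 + c ** 2) / (2 * c))

def total_loss(q, q_gt, t, t_gt, depth, depth_gt, alpha=1.0, beta=1.0):
    """Weighted sum of pose (quaternion + translation) and BerHu depth losses."""
    pose_loss = (q - q_gt).norm() + (t - t_gt).norm()
    diff = depth - depth_gt
    c = 0.2 * diff.abs().max()        # assumed threshold convention
    depth_loss = berhu(diff, c).mean()
    return alpha * pose_loss + beta * depth_loss
```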
In summary, the Multi-Scale Feature Bundle-Adjustment Layer constitutes a differentiable, iterative feature-metric solver integrated with deep feature and basis learning, optimized within a coarse-to-fine, end-to-end trainable framework for dense structure-from-motion (Tang et al., 2018).