Multi-Scale Feature Bundle-Adjustment Layer
- The paper introduces a differentiable module that integrates classic bundle adjustment with deep feature pyramids to enforce multi-view geometric constraints.
- It utilizes a fixed-iteration, differentiable Levenberg–Marquardt solver to jointly optimize poses and depth weights via feature-metric errors across three scales.
- The approach leverages multi-scale coarse-to-fine refinement and learned basis depth maps to achieve robust, end-to-end dense structure-from-motion.
A Multi-Scale Feature Bundle-Adjustment (BA) Layer is a differentiable module that incorporates classic bundle adjustment strategies within a deep neural network, enforcing multi-view geometric constraints through the minimization of feature-metric error across multiple feature pyramid levels. This approach is integral to the BA-Net architecture (Tang et al., 2018), allowing dense structure-from-motion (SfM) via end-to-end optimization of features, poses, and depth fields within a unified graph by leveraging a fixed-iteration, differentiable Levenberg–Marquardt (LM) solver.
1. Architectural Overview
The core design consists of a backbone convolutional neural network (CNN), specifically DRN-54, feeding two distinct heads:
- The Feature Pyramid Constructor produces multi-scale feature maps $F^{l}_{i}$ ($l = 1, 2, 3$) per input image $I_i$.
- The Basis Depth Map Generator is an encoder–decoder that generates basis depth maps $B_1, \dots, B_K$ ($K = 128$ in BA-Net). These bases represent distinct depth patterns learned to span the plausible depth manifold for the scene.
Depth in the scene is parameterized as a linear combination of these generated bases. The BA-Layer consumes the feature pyramids, basis depth maps, and current parameters $\mathcal{X} = (T_1, \dots, T_N, w)$ — where $T_i$ denotes each camera's pose and $w$ the depth-weight vector — and applies a fixed number of differentiable LM updates. Losses on estimated poses and depths are back-propagated through all parts of the pipeline, enabling end-to-end learning.
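The dataflow can be summarized in a minimal PyTorch-style sketch; the class and the injected module names are illustrative placeholders, not the authors' code:

```python
import torch
import torch.nn as nn

class BANetSketch(nn.Module):
    """Illustrative dataflow of the two-head design (not the authors' code)."""

    def __init__(self, backbone, feature_head, depth_head, ba_layer):
        super().__init__()
        self.backbone = backbone          # e.g. a DRN-54 trunk
        self.feature_head = feature_head  # -> 3-level feature pyramid per image
        self.depth_head = depth_head      # encoder-decoder -> K basis depth maps
        self.ba_layer = ba_layer          # differentiable multi-scale LM solver

    def forward(self, images, init_poses, init_weights):
        feats = self.backbone(images)              # shared backbone features
        pyramids = self.feature_head(feats)        # multi-scale feature maps
        bases = self.depth_head(feats[:1])         # bases for the reference view
        poses, weights = self.ba_layer(pyramids, bases, init_poses, init_weights)
        # dense depth = ReLU of the weighted sum of basis maps (see Section 3)
        depth = torch.relu((weights[..., None, None] * bases).sum(dim=1))
        return poses, depth
```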
2. Feature-Metric Bundle Adjustment Formulation
The innovation lies in using feature-metric error rather than raw photometric error. For each pixel $q$ in the reference view ($i = 1$), with depth $d_q$, its correspondence in another view $i$ is predicted by warping with the current pose $T_i$: $q' = \pi\!\left(T_i \cdot d_q\,\pi^{-1}(q)\right)$, where $\pi$ denotes perspective projection. The residual for pixel $q$ in view $i$ at pyramid level $l$ is

$$e^{l}_{i,q} = F^{l}_{i}\!\left(\pi\!\left(T_i \cdot d_q\,\pi^{-1}(q)\right)\right) - F^{l}_{1}(q) \in \mathbb{R}^{C},$$

where $C$ is the feature dimension. The overall energy minimized is

$$E^{l}(\mathcal{X}) = \sum_{i=2}^{N} \sum_{q} \left\| e^{l}_{i,q} \right\|_2^2,$$

with $l = 1, 2, 3$ pyramid levels. This cost captures multi-view consistency at several scales, directly leveraging the expressive power of learned feature descriptors.
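A hedged sketch of this residual at a single pyramid level, using bilinear sampling (`grid_sample`) to realize the warp; the function name and tensor layout are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def feature_metric_residual(F_ref, F_src, depth, pose_R, pose_t, K):
    """Residual e_{i,q} = F_src(warp(q)) - F_ref(q) at one pyramid level.

    F_ref, F_src: (1, C, H, W) feature maps; depth: (1, H, W);
    pose_R: (3, 3), pose_t: (3,), K: (3, 3) intrinsics. Illustrative only.
    """
    _, C, H, W = F_ref.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)
    rays = torch.linalg.inv(K) @ pix               # back-project pixels to rays
    pts = rays * depth.reshape(1, -1)              # 3-D points d_q * K^-1 q
    pts = pose_R @ pts + pose_t[:, None]           # rigid transform T_i
    proj = K @ pts
    uv = proj[:2] / proj[2:].clamp(min=1e-6)       # perspective divide
    # normalize pixel coordinates to [-1, 1] for grid_sample
    grid = torch.stack([2 * uv[0] / (W - 1) - 1,
                        2 * uv[1] / (H - 1) - 1], dim=-1).reshape(1, H, W, 2)
    warped = F.grid_sample(F_src, grid, align_corners=True)
    return (warped - F_ref).reshape(C, -1)         # (C, H*W) residual vectors
```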
3. Depth Parameterization Using Learned Basis Maps
To compactly represent dense depth fields with lower-dimensional parameters, a set of basis depth maps $B_1, \dots, B_K$ is generated by a learned encoder–decoder head. The final dense depth is reconstructed as

$$D(w) = \mathrm{ReLU}\!\left(\sum_{k=1}^{K} w_k\, B_k\right),$$

where $w \in \mathbb{R}^{K}$ is optimized during bundle adjustment, while the bases $B_k$ remain fixed within an iteration (but are learnable across training). This representation reduces overfitting and ensures physically plausible, non-negative depths via the ReLU nonlinearity.
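In code the reconstruction is a single weighted sum followed by a ReLU; a minimal sketch:

```python
import torch

def depth_from_bases(w, B):
    """Dense depth D = ReLU(sum_k w_k * B_k).

    w: (K,) weights optimized by the BA-Layer; B: (K, H, W) learned basis
    depth maps, fixed during the LM iterations. The ReLU clamps the
    reconstructed depths to be non-negative.
    """
    return torch.relu(torch.einsum("k,khw->hw", w, B))
```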
4. Differentiable Levenberg–Marquardt Optimization
The BA-Layer adapts classical LM optimization to the learning framework. The vector of all feature-metric residuals $E(\mathcal{X})$ is used to assemble the Jacobian $J = \partial E / \partial \mathcal{X}$, partitioned into pose and depth-weight blocks. The LM step is

$$\Delta\mathcal{X} = -\left(J^{\top} J + \lambda\,\mathrm{diag}\!\left(J^{\top} J\right)\right)^{-1} J^{\top} E(\mathcal{X}),$$

with $\lambda > 0$ the damping factor. Uniquely, $\lambda$ is obtained from an MLP fed a 128-D summary of the globally average-pooled absolute residuals, ensuring differentiability throughout.
Pose updates are applied via the SE(3) exponential map, $T_i \leftarrow \exp\!\left(\Delta\xi_i^{\wedge}\right) T_i$; depth weights are updated by vector addition, $w \leftarrow w + \Delta w$. This procedure is repeated for a fixed number (five) of LM steps per scale.
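A hedged sketch of one such update, mirroring the formula above; `lambda_mlp` stands in for the damping network and the tensor shapes are assumptions for illustration:

```python
import torch

def lm_step(residuals, jacobian, lambda_mlp):
    """One differentiable LM update (illustrative, not the authors' code).

    residuals: (M, C) stacked feature-metric residuals;
    jacobian:  (M*C, P) Jacobian w.r.t. poses and depth weights;
    lambda_mlp: small network mapping the C-dim (128-D) channel-averaged
    absolute residual to a positive scalar damping factor.
    """
    r = residuals.reshape(-1)                        # flatten to (M*C,)
    JtJ = jacobian.T @ jacobian                      # Gauss-Newton approximation
    Jtr = jacobian.T @ r
    summary = residuals.abs().mean(dim=0)            # global average pooling, (C,)
    lam = lambda_mlp(summary)                        # learned, differentiable damping
    damping = lam * torch.diag(torch.diagonal(JtJ))
    delta = torch.linalg.solve(JtJ + damping, -Jtr)  # solve the normal equations
    return delta  # caller splits into SE(3) increments and depth-weight updates
```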
5. Multi-Scale Coarse-to-Fine Strategy
The module operates in a coarse-to-fine cascade using three feature pyramid levels:
- At the coarsest level ($l = 3$), initial parameters are refined via 5 LM steps.
- The result initializes the next (finer) level ($l = 2$), again with 5 LM updates.
- The procedure is repeated at the finest level ($l = 1$).
Across scales, the mechanism benefits from increased convergence basins and robust global context at coarse resolutions, while fine levels hone spatial detail.
| Scale Level ($l$) | Feature Resolution | LM Steps per Level |
|---|---|---|
| 3 (coarsest) | Lowest | 5 |
| 2 | Intermediate | 5 |
| 1 (finest) | Highest | 5 |
After each level, the depth estimate is upsampled to the next feature resolution to initialize the finer level, while pose estimates carry over unchanged.
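A minimal sketch of this cascade, assuming a `ba_step` callable that applies one differentiable LM update (as in the sketch in Section 4):

```python
def multiscale_ba(pyramids, bases, poses, weights, ba_step, num_steps=5):
    """Coarse-to-fine cascade: 5 LM steps at l=3, then l=2, then l=1.

    pyramids: dict mapping level l -> per-image feature maps; ba_step applies
    one differentiable LM update at that level. Illustrative only.
    """
    for level in (3, 2, 1):                      # coarsest to finest
        for _ in range(num_steps):
            poses, weights = ba_step(pyramids[level], bases, poses, weights)
        # the depth implied by the basis weights feeds the next, higher
        # resolution; poses carry over unchanged
    return poses, weights
```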
6. End-to-End Differentiability and Gradient Flow
By eliminating all non-differentiable branches (i.e., using fixed iteration counts and a learned, soft damping factor $\lambda$), the module is fully unrolled into a computation graph across iterations. Gradients from losses on camera poses and depth fields propagate through every computational step, including through the MLP that predicts $\lambda$, the Jacobian and residuals, and directly into the feature pyramid and basis depth generators. This enables the network to learn features, bases, and damping strategies that facilitate rapid and stable convergence during optimization.
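A toy illustration of the unrolling idea: because the iteration count is fixed, the loop below builds a static graph and the gradient reaches the learnable parameter through every step (a stand-in for the full LM solver, not BA-Net code):

```python
import torch

w = torch.tensor([0.5], requires_grad=True)  # stands in for a learnable parameter
x = torch.tensor([2.0])                      # stands in for the optimized state
for _ in range(5):                           # fixed iteration count -> unrolled graph
    r = x * w - 1.0                          # toy residual
    x = x - 0.1 * r                          # differentiable update step
loss = (x - 0.4) ** 2
loss.backward()                              # gradient flows through all 5 steps
print(w.grad)                                # non-None: end-to-end differentiable
```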
7. Supervision, Losses, and Regularization
Training leverages pose and depth supervision:
- Pose loss combines rotation quaternion and translation vector errors: $\mathcal{L}_{\text{pose}} = \left\| \mathbf{q} - \mathbf{q}^{*} \right\|_2 + \left\| \mathbf{t} - \mathbf{t}^{*} \right\|_2$.
- Depth loss uses a per-pixel BerHu (reverse Huber) loss, $\mathcal{L}_{\text{depth}} = \sum_{q} \mathcal{B}\!\left(d_q - d_q^{*}\right)$, where $\mathcal{B}(x) = |x|$ for $|x| \le c$ and $\mathcal{B}(x) = \frac{x^2 + c^2}{2c}$ otherwise (see the sketch after this list).
The total loss is a weighted sum $\mathcal{L} = \alpha\,\mathcal{L}_{\text{pose}} + \beta\,\mathcal{L}_{\text{depth}}$, with coefficients $\alpha$ and $\beta$ balancing each term. No explicit regularizer is required for the depth weights or poses; the ReLU on the reconstructed depth suffices for non-negativity.
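A hedged sketch of both losses; the BerHu threshold $c = 0.2 \cdot \max|d - d^{*}|$ is a common convention, not necessarily the paper's exact setting, and all names are illustrative:

```python
import torch

def berhu(diff, c):
    """Reverse Huber penalty: |x| below threshold c, quadratic above it."""
    absd = diff.abs()
    return torch.where(absd <= c, absd, (diff ** 2 + c ** 2) / (2 * c))

def total_loss(q, q_gt, t, t_gt, depth, depth_gt, alpha=1.0, beta=1.0):
    """Weighted sum of pose (quaternion + translation) and BerHu depth losses."""
    pose_loss = (q - q_gt).norm() + (t - t_gt).norm()
    diff = depth - depth_gt
    c = 0.2 * diff.abs().max()        # assumed threshold convention
    depth_loss = berhu(diff, c).mean()
    return alpha * pose_loss + beta * depth_loss
```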
In summary, the Multi-Scale Feature Bundle-Adjustment Layer constitutes a differentiable, iterative feature-metric solver integrated with deep feature and basis learning, optimized within a coarse-to-fine, end-to-end trainable framework for dense structure-from-motion (Tang et al., 2018).