Motion-Aware Bundle Adjustment
- Motion-Aware Bundle Adjustment (MA-BA) refers to a family of techniques that exploit motion structure when optimizing camera poses; one variant eliminates explicit 3D points entirely for improved scalability.
- This article covers two incarnations: a 'pointless' global BA built on per-triplet Hessians, and a dynamic BA (BA-Track) that decouples camera and object motion using learned 3D tracking.
- Empirical evaluations demonstrate that MA-BA achieves accuracy comparable to classical BA with far fewer parameters, and remains well suited to real-world dynamic scenes.
Motion-Aware Bundle Adjustment (MA-BA) encompasses a class of methods that generalize classical bundle adjustment (BA) to better exploit motion structure or to accommodate dynamic scenes. This article surveys two distinct incarnations. The first targets computational efficiency and scalability by representing geometric structure implicitly via per-triplet Hessians in a global optimization over camera poses only, eliminating explicit 3D points and retaining accuracy through principled weighting. The second adapts BA for highly dynamic scenes by explicitly decoupling camera-induced and object-induced motions using learned 3D point trackers and motion decomposition, enabling reliable pose and dense map estimation even when the scene undergoes complex motion. Together, these approaches exemplify recent advances in making BA both more scalable and adaptable to non-static environments.
1. Theoretical Formulation of MA-BA
Two principal forms of Motion-Aware Bundle Adjustment are represented in the literature:
a. Pointless Global BA with Relative Motions Hessians
This formulation eliminates explicit 3D scene points, replacing traditional reprojection error terms with a global objective that aggregates local information from subproblems (typically triplets of images). The core objective over $T$ triplets, each involving three cameras, is:

$$E\big(\{R_j, C_j\}, \{s_t\}\big) = \sum_{t=1}^{T} \big(x_t - \pi_t(\{R_j, C_j\}; s_t)\big)^{\top} H_t \big(x_t - \pi_t(\{R_j, C_j\}; s_t)\big)$$

where:
- $x_t$: local extrinsic parameters of triplet $t$ (rotations and centers per triplet)
- $H_t$: Schur-reduced camera Hessian of triplet $t$
- $\pi_t(\cdot; s_t)$: similarity transform $s_t$ mapping global poses to triplet-local extrinsics
- $H_t = U_t \Lambda_t U_t^{\top}$: diagonalization for efficient least-squares, turning each term into $\|\Lambda_t^{1/2} U_t^{\top}(x_t - \pi_t)\|^2$
- $\{R_j, C_j\}$: global camera rotations and centers (with possible similarity parameters)
In effect, the contribution of scene structure is compressed into $H_t$, and the number of unknowns and residuals in the global step is dramatically reduced.
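A minimal NumPy sketch of this reduction to a standard least-squares residual (all names are illustrative, not from the authors' code): diagonalize the per-triplet Hessian and whiten the discrepancy between the local extrinsics and their prediction from the global poses.

```python
import numpy as np

def triplet_residual(x_local, x_pred, H_t):
    """Whitened least-squares residual for one triplet.

    x_local : (d,) extrinsics estimated by the local triplet BA.
    x_pred  : (d,) the same extrinsics predicted from the global poses
              through the triplet's similarity transform.
    H_t     : (d, d) Schur-reduced camera Hessian of the triplet.

    Using H_t = U diag(lam) U^T, the quadratic form
    (x - x')^T H_t (x - x') equals || diag(sqrt(lam)) U^T (x - x') ||^2,
    i.e. an ordinary residual block for Gauss-Newton solvers.
    """
    lam, U = np.linalg.eigh(H_t)      # H_t is symmetric positive semidefinite
    lam = np.clip(lam, 0.0, None)     # guard against tiny negative eigenvalues
    return np.sqrt(lam) * (U.T @ (x_local - x_pred))
```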
b. MA-BA for Dynamic Scene Reconstruction (Motion Decoupling)
For dynamic scenes (BA-Track), the framework separates motion components:
- Total 3D point trajectory: $P_i^t$ for point $i$ across frames $t$
- Predicted dynamic motion: $\Delta P_i^t$, with dynamic confidence $m_i^t$
- Static/camera-induced motion: $\hat{P}_i^t = P_i^t - \Delta P_i^t$
The BA cost, for $T$ frames, $N$ points per frame, and sliding window $\mathcal{W}$, is:

$$E_{\mathrm{BA}} = \sum_{t \in \mathcal{W}} \sum_{i=1}^{N} w_i^t \, \rho\big(\|\pi(T_t, \hat{P}_i^t) - p_i^t\|\big)$$

Here, $w_i^t$ down-weights dynamic/outlier tracks, $T_t$ is the camera pose of frame $t$, $\pi$ the projection, $p_i^t$ the observed track location, and $\rho$ denotes the Huber loss.
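The following sketch evaluates such a cost in NumPy. The camera projection is abstracted as a callable, and the weight combining visibility with the complement of the dynamic confidence is our reading of the formulation rather than the paper's exact definition:

```python
import numpy as np

def huber(s, delta=1.0):
    """Huber loss on a residual norm: quadratic near zero, linear in the tail."""
    return np.where(s <= delta, 0.5 * s ** 2, delta * (s - 0.5 * delta))

def ba_cost(poses, static_pts, obs, weights, project):
    """Weighted dynamic-scene BA cost over a sliding window.

    poses      : dict frame -> (R, t), world-to-camera pose.
    static_pts : dict (frame, i) -> (3,), camera-induced points P_hat_i^t.
    obs        : dict (frame, i) -> (2,), observed track locations p_i^t.
    weights    : dict (frame, i) -> scalar, e.g. visibility * (1 - dynamic
                 confidence), down-weighting moving or unreliable tracks.
    project    : callable (R, t, P) -> (2,), the camera projection.
    """
    total = 0.0
    for (t, i), p_obs in obs.items():
        R, trans = poses[t]
        r = np.linalg.norm(project(R, trans, static_pts[(t, i)]) - p_obs)
        total += weights[(t, i)] * huber(r)
    return float(total)
```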
2. Algorithmic Workflow
a. Pointless Global BA (Relative Motions Hessians)
- Feature Extraction & Matching: extract SIFT features (e.g., via MicMac) and build the triplet graph.
- Local Triplet BA:
- For each triplet, perform a single-iteration local BA and Schur-reduce the normal equations to obtain the camera-only Hessian $H_t$ (see the sketch after this workflow).
- Filter triplets with insufficient inliers; attenuate over-connected triplets via a per-triplet weight $w_t$.
- Global Optimization:
- Assemble global objective as above.
- Unknowns: only global camera poses and similarity parameters.
- Solve with sparse Gauss–Newton or Levenberg–Marquardt (Ceres 2.1), with rotations parameterized in the Lie algebra and re-orthonormalized at each step.
- Optional: Prune graph (skeletonization) for dense video sequences.
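The Schur reduction in the local triplet step can be sketched as below. In practice the point block $H_{pp}$ is block-diagonal (one $3\times3$ block per point) and is inverted blockwise; a dense solve is shown here only to make the algebra explicit:

```python
import numpy as np

def schur_reduce(H, n_cam):
    """Camera-only Hessian via the Schur complement of the point block.

    H     : full symmetric Hessian of the local triplet BA, with parameters
            ordered [camera block | point block].
    n_cam : number of camera parameters (e.g. 3 poses x 6 DOF = 18).

    Returns H_t = H_cc - H_cp H_pp^{-1} H_pc, the reduced Hessian carried
    into the global, point-free optimization.
    """
    H_cc = H[:n_cam, :n_cam]
    H_cp = H[:n_cam, n_cam:]
    H_pp = H[n_cam:, n_cam:]
    return H_cc - H_cp @ np.linalg.solve(H_pp, H_cp.T)
```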
b. BA in Dynamic Scenes (BA-Track)
- Sampling: For each RGB frame, sample a set of query points and obtain an initial monocular depth map $D_t$.
- Sliding Window Tracking: In each window:
- Extract CNN and depth features.
- Use a learned transformer to track total and dynamic trajectories, assigning a visibility score $v_i^t$ and dynamic mask $m_i^t$.
- Compute the camera-induced component $\hat{P}_i^t = P_i^t - \Delta P_i^t$ for each point.
- Static Track Assembly: Aggregate static tracks and weights across windows.
- Bundle Adjustment: Solve the weighted objective above over camera poses and (some) point depths.
- Depth Refinement:
- Introduce per-frame scale maps $\alpha_t$ and per-point scales $\beta_i$.
- Minimize sparse-to-dense consistency and rigidity losses (sketched after this workflow):
- $\mathcal{L}_{\mathrm{consist}}$: aligns sparse optimized depths to the scaled dense prior.
- $\mathcal{L}_{\mathrm{rigid}}$: enforces geometric consistency between pairs of static points.
- Optimize with Adam.
- Output: Refined camera trajectory and temporally consistent, scale-adjusted dense depth maps.
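A hedged PyTorch sketch of the depth-refinement stage follows; the tensor shapes, the exact loss forms, and the $0.1$ rigidity weight are illustrative assumptions, with Adam as the optimizer per the workflow above:

```python
import torch

def refine_depth(dense_depth, sparse_uv, sparse_z, static_pairs, steps=500):
    """Fit per-frame scales (alpha_t) and per-point scales (beta_i).

    dense_depth  : (T, H, W) monocular depth priors.
    sparse_uv    : (T, N, 2) long tensor of pixel coordinates of the tracks.
    sparse_z     : (T, N) BA-optimized depths of the same tracks.
    static_pairs : list of (i, j) static-point index pairs for rigidity.
    """
    T = dense_depth.shape[0]
    alpha = torch.ones(T, 1, requires_grad=True)               # per-frame scale
    beta = torch.ones(sparse_z.shape[1], requires_grad=True)   # per-point scale
    opt = torch.optim.Adam([alpha, beta], lr=1e-2)

    for _ in range(steps):
        opt.zero_grad()
        u, v = sparse_uv[..., 0], sparse_uv[..., 1]
        # Dense prior sampled at the sparse track locations, then rescaled.
        prior = dense_depth[torch.arange(T)[:, None], v, u]
        refined = alpha * prior
        # L_consist: align BA-optimized sparse depths with the scaled prior.
        l_consist = ((beta * sparse_z - refined) ** 2).mean()
        # L_rigid: keep relative depths of static point pairs stable over time
        # (the 0.1 weight below is an assumed value, not from the paper).
        l_rigid = sum(((refined[:, i] - refined[:, j])
                       - (refined[0, i] - refined[0, j])).pow(2).mean()
                      for i, j in static_pairs)
        (l_consist + 0.1 * l_rigid).backward()
        opt.step()
    return alpha.detach(), beta.detach()
```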
3. Computational Complexity and Scalability
| Methodology | Unknowns (params) | Global Complexity | Run Time vs. Full BA |
|---|---|---|---|
| Pointless Global BA (Rupnik et al., 2023) | 0.1–0.5M | Grows with the number of triplet residuals; worst case on dense triplet graphs | Substantially reduced |
| Full Incremental BA (MicMac) | 2–5M | Dominated by large point blocks | Baseline |
| 5-Point Structureless BA | 0.8M | Similar residual count, but not Hessian-weighted | Higher than pointless BA |
| BA-Track (dynamic) (Chen et al., 20 Apr 2025) | Windowed / per-frame | Sliding window, per-frame and sparse | Comparable (per window) |
The first approach (pointless BA) achieves an orders-of-magnitude reduction in unknowns by eliminating explicit points and robustly combining their stochastic information via the Hessians. For very dense viewgraphs (e.g., long videos), skeletonization or graph pruning may be necessary to avoid an excessive number of triplet residuals.
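A back-of-the-envelope count makes the reduction concrete. In the sketch below, the 7-parameter similarity per triplet and the point/triplet counts are assumptions chosen only to reproduce the reported orders of magnitude, not values taken from the paper:

```python
def unknowns_full_ba(n_cams, n_points):
    # Classical BA: one 6-DOF pose per camera plus one 3D position per point.
    return 6 * n_cams + 3 * n_points

def unknowns_pointless_ba(n_cams, n_triplets):
    # Pointless BA: 6-DOF poses plus one 7-parameter similarity
    # (rotation + translation + scale) per triplet; no 3D points at all.
    return 6 * n_cams + 7 * n_triplets

# Hypothetical counts for a 2000-image aerial block:
print(unknowns_full_ba(2000, 1_800_000))    # 5,412,000  (~5.5M scale)
print(unknowns_pointless_ba(2000, 17_500))  # 134,500    (~0.135M scale)
```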
4. Empirical Evaluation
a. Photogrammetric and Multi-View Datasets (Rupnik et al., 2023)
- Aerial photogrammetry (UltraCAM Eagle, 2000 images): reprojection error $0.28$ (ours) vs $0.27$ (MicMac BA), using only $0.135$M vs $5.5$M unknowns.
- ETH3D mono-planar: error $0.56$ (ours), matching MicMac ($0.56$) and 5-Pt ($0.56$), vs openMVG ($0.57$), with only $0.5$M params (full BA: $2.3$M).
- Tanks&Temples "Temple": $3.72$ (ours) vs $3.66$ (MicMac), using $49$k params vs $224$k.
- Long focal loop (93 frames): loop closure error $4.10$ (ours), $3.44$ (MicMac), $48.1$ (5-Pt), only $9$k unknowns vs $2.3$M.
Convergence is achieved in $10$–$15$ GN iterations, comparable to standard BA and faster than IRLS-based averaging. Even a substantial fraction of misestimated triplet rotations degrades accuracy less than it does under standard motion averaging.
b. Dynamic Scene Benchmarks (Chen et al., 20 Apr 2025)
- Camera pose ATE (m): MPI Sintel $0.034$ (BA-Track) vs $0.089$ (best prior); Shibuya $0.028$ vs $0.031$.
- Depth metrics: on Sintel, Abs Rel improves to $0.408$ from the ZoeDepth prior's $0.467$, with corresponding gains in $\delta<1.25$; comparable improvements over the depth prior are reported on Shibuya.
- Ablations: Motion decoupling reduces ATE from $0.137$ (no decoupling) to $0.034$ (with full masking); the depth-refinement losses improve both sparse and dense depth accuracy.
5. Advantages and Limitations
Advantages
- Pointless Global BA:
- Eliminates explicit points, yielding a robust, memory-efficient global step with fast convergence.
- Incorporates full local stochastic structure via Hessian weighting, matching classical BA accuracy.
- Robust IRLS weighting (Huber loss) for outlier handling.
- Easily extensible to ground-control point constraints.
- Dynamic Scene MA-BA:
- Explicitly decouples dynamic and static motion to prevent contamination of camera pose and structure estimates.
- Incorporates learned 3D tracking, visibility, and motion segmentation.
- Achieves state-of-the-art results on camera pose and dense map accuracy, especially under difficult dynamic conditions.
Limitations
- Pointless BA assumes known intrinsics; self-calibration would require local BA to include intrinsics, complicating computation.
- For dense triplet graphs, skeletonization may be necessary to retain scalability.
- Dynamic MA-BA (BA-Track) depends on quality of learned trackers and requires per-frame depth priors; the masking thresholding process retains some sensitivity to tracker performance.
- Both approaches rely on robust preprocessing pipelines for feature extraction, matching, and graph construction; upstream errors can impact outcome.
6. Extensions and Context Within Structure-from-Motion
Motion-aware forms of bundle adjustment are positioned as extensions of both classical BA (joint optimization of camera poses and 3D points) and global motion averaging (which historically ignored covariance information and structure). By encoding local structure into relatively small residual blocks (Hessians), pointless BA (Rupnik et al., 2023) bridges the detection-level stochasticity of full BA with the scalability and simplicity of global motion averaging. The dynamic-scene BA (Chen et al., 20 Apr 2025) generalizes the method to multi-body, nonrigid environments by introducing learning-based motion segmentation and explicit scene flow decomposition.
A plausible implication is increased applicability of BA to large-scale, real-world multi-view and video-based reconstruction, especially in scenes with frequent or persistent dynamic activity, or when computation and memory budgets limit the feasibility of full point-based global BA. These developments reinforce the trend of integrating learned priors, geometric constraints, and robust optimization in the broader structure-from-motion and SLAM communities.