Motion-Aware Bundle Adjustment
- Motion-Aware Bundle Adjustment (MA-BA) refers to a family of techniques that exploit motion structure when optimizing camera poses; one variant eliminates explicit 3D points entirely for improved scalability.
- This article covers two incarnations: a 'pointless' global BA built on per-triplet Hessians, and a dynamic BA (BA-Track) that decouples camera and object motion using learned 3D tracking.
- Empirical evaluations demonstrate that MA-BA achieves accuracy comparable to classical BA with far fewer parameters, and remains well suited to real-world dynamic scenes.
Motion-Aware Bundle Adjustment (MA-BA) encompasses a class of methods that generalize classical bundle adjustment (BA) to better exploit motion structure or to accommodate dynamic scenes. This article surveys two distinct incarnations. The first targets computational efficiency and scalability by representing geometric structure implicitly via per-triplet Hessians in a global optimization over camera poses only, eliminating explicit 3D points and retaining accuracy through principled weighting. The second adapts BA for highly dynamic scenes by explicitly decoupling camera-induced and object-induced motions using learned 3D point trackers and motion decomposition, enabling reliable pose and dense map estimation even when the scene undergoes complex motion. Together, these approaches exemplify recent advances in making BA both more scalable and adaptable to non-static environments.
1. Theoretical Formulation of MA-BA
Two principal forms of Motion-Aware Bundle Adjustment are represented in the literature:
a. Pointless Global BA with Relative Motions Hessians
This formulation eliminates explicit 3D scene points, replacing traditional reprojection error terms with a global objective that aggregates local information from subproblems (typically triplets of images). The core objective over $T$ triplets, each involving three cameras, is:

$$E\big(\{R_j, C_j\}, \{s_t\}\big) = \sum_{t=1}^{T} \big(x_t - \pi_t(\{R_j, C_j\}; s_t)\big)^{\top} H_t \big(x_t - \pi_t(\{R_j, C_j\}; s_t)\big)$$

where:
- $x_t$: local extrinsic parameters of triplet $t$ (rotations and centers per triplet)
- $H_t$: Schur-reduced camera Hessian of triplet $t$
- $\pi_t(\cdot; s_t)$: similarity transform $s_t$ mapping global poses to triplet-local extrinsics
- $H_t = U_t \Lambda_t U_t^{\top}$: diagonalization for efficient least-squares, turning each term into $\|\Lambda_t^{1/2} U_t^{\top}(x_t - \pi_t)\|^2$
- $\{R_j, C_j\}$: global camera rotations and centers (with possible similarity parameters)
In effect, the contribution of scene structure is compressed into $H_t$, and the number of unknowns and residuals in the global step is dramatically reduced.
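A minimal NumPy sketch of this reduction to a standard least-squares residual (all names are illustrative, not from the authors' code): diagonalize the per-triplet Hessian and whiten the discrepancy between the local extrinsics and their prediction from the global poses.

```python
import numpy as np

def triplet_residual(x_local, x_pred, H_t):
    """Whitened least-squares residual for one triplet.

    x_local : (d,) extrinsics estimated by the local triplet BA.
    x_pred  : (d,) the same extrinsics predicted from the global poses
              through the triplet's similarity transform.
    H_t     : (d, d) Schur-reduced camera Hessian of the triplet.

    Using H_t = U diag(lam) U^T, the quadratic form
    (x - x')^T H_t (x - x') equals || diag(sqrt(lam)) U^T (x - x') ||^2,
    i.e. an ordinary residual block for Gauss-Newton solvers.
    """
    lam, U = np.linalg.eigh(H_t)      # H_t is symmetric positive semidefinite
    lam = np.clip(lam, 0.0, None)     # guard against tiny negative eigenvalues
    return np.sqrt(lam) * (U.T @ (x_local - x_pred))
```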
b. MA-BA for Dynamic Scene Reconstruction (Motion Decoupling)
For dynamic scenes (BA-Track), the framework separates motion components:
- Total 3D point trajectory: $P_i^t$ for point $i$ across frames $t$
- Predicted dynamic motion: $\Delta P_i^t$, with dynamic confidence $m_i^t$
- Static/camera-induced motion: $\hat{P}_i^t = P_i^t - \Delta P_i^t$
The BA cost, for $T$ frames, $N$ points per frame, and sliding window $\mathcal{W}$, is:

$$E_{\mathrm{BA}} = \sum_{t \in \mathcal{W}} \sum_{i=1}^{N} w_i^t \, \rho\big(\|\pi(T_t, \hat{P}_i^t) - p_i^t\|\big)$$

Here, $w_i^t$ down-weights dynamic/outlier tracks, $T_t$ is the camera pose of frame $t$, $\pi$ the projection, $p_i^t$ the observed track location, and $\rho$ denotes the Huber loss.
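The following sketch evaluates such a cost in NumPy. The camera projection is abstracted as a callable, and the weight combining visibility with the complement of the dynamic confidence is our reading of the formulation rather than the paper's exact definition:

```python
import numpy as np

def huber(s, delta=1.0):
    """Huber loss on a residual norm: quadratic near zero, linear in the tail."""
    return np.where(s <= delta, 0.5 * s ** 2, delta * (s - 0.5 * delta))

def ba_cost(poses, static_pts, obs, weights, project):
    """Weighted dynamic-scene BA cost over a sliding window.

    poses      : dict frame -> (R, t), world-to-camera pose.
    static_pts : dict (frame, i) -> (3,), camera-induced points P_hat_i^t.
    obs        : dict (frame, i) -> (2,), observed track locations p_i^t.
    weights    : dict (frame, i) -> scalar, e.g. visibility * (1 - dynamic
                 confidence), down-weighting moving or unreliable tracks.
    project    : callable (R, t, P) -> (2,), the camera projection.
    """
    total = 0.0
    for (t, i), p_obs in obs.items():
        R, trans = poses[t]
        r = np.linalg.norm(project(R, trans, static_pts[(t, i)]) - p_obs)
        total += weights[(t, i)] * huber(r)
    return float(total)
```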
2. Algorithmic Workflow
a. Pointless Global BA (Relative Motions Hessians)
- Feature Extraction & Matching: extract SIFT features (e.g., via MicMac) and build the triplet graph.
- Local Triplet BA:
- For each triplet, perform a single-iteration local BA and Schur-reduce the normal equations to obtain the camera-only Hessian $H_t$ (see the sketch after this workflow).
- Filter triplets with insufficient inliers; attenuate over-connected triplets via a per-triplet weight $w_t$.
- Global Optimization:
- Assemble global objective as above.
- Unknowns: only global camera poses and similarity parameters.
- Solve with sparse Gauss–Newton or Levenberg–Marquardt (Ceres 2.1), with rotations parameterized in the Lie algebra and re-orthonormalized at each step.
- Optional: Prune graph (skeletonization) for dense video sequences.
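The Schur reduction in the local triplet step can be sketched as below. In practice the point block $H_{pp}$ is block-diagonal (one $3\times3$ block per point) and is inverted blockwise; a dense solve is shown here only to make the algebra explicit:

```python
import numpy as np

def schur_reduce(H, n_cam):
    """Camera-only Hessian via the Schur complement of the point block.

    H     : full symmetric Hessian of the local triplet BA, with parameters
            ordered [camera block | point block].
    n_cam : number of camera parameters (e.g. 3 poses x 6 DOF = 18).

    Returns H_t = H_cc - H_cp H_pp^{-1} H_pc, the reduced Hessian carried
    into the global, point-free optimization.
    """
    H_cc = H[:n_cam, :n_cam]
    H_cp = H[:n_cam, n_cam:]
    H_pp = H[n_cam:, n_cam:]
    return H_cc - H_cp @ np.linalg.solve(H_pp, H_cp.T)
```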
b. BA in Dynamic Scenes (BA-Track)
- Sampling: For each RGB frame, sample a set of query points and obtain an initial monocular depth map $D_t$.
- Sliding Window Tracking: In each window:
- Extract CNN and depth features.
- Use a learned transformer to track total and dynamic trajectories, assigning a visibility score $v_i^t$ and dynamic mask $m_i^t$.
- Compute the camera-induced component $\hat{P}_i^t = P_i^t - \Delta P_i^t$ for each point.
- Static Track Assembly: Aggregate static tracks and weights across windows.
- Bundle Adjustment: Solve the weighted objective above over camera poses and (some) point depths.
- Depth Refinement:
- Introduce per-frame scale maps $\alpha_t$ and per-point scales $\beta_i$.
- Minimize sparse-to-dense consistency and rigidity losses (sketched after this workflow):
- $\mathcal{L}_{\mathrm{consist}}$: aligns sparse optimized depths to the scaled dense prior.
- $\mathcal{L}_{\mathrm{rigid}}$: enforces geometric consistency between pairs of static points.
- Optimize with Adam.
- Output: Refined camera trajectory and temporally consistent, scale-adjusted dense depth maps.
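A hedged PyTorch sketch of the depth-refinement stage follows; the tensor shapes, the exact loss forms, and the $0.1$ rigidity weight are illustrative assumptions, with Adam as the optimizer per the workflow above:

```python
import torch

def refine_depth(dense_depth, sparse_uv, sparse_z, static_pairs, steps=500):
    """Fit per-frame scales (alpha_t) and per-point scales (beta_i).

    dense_depth  : (T, H, W) monocular depth priors.
    sparse_uv    : (T, N, 2) long tensor of pixel coordinates of the tracks.
    sparse_z     : (T, N) BA-optimized depths of the same tracks.
    static_pairs : list of (i, j) static-point index pairs for rigidity.
    """
    T = dense_depth.shape[0]
    alpha = torch.ones(T, 1, requires_grad=True)               # per-frame scale
    beta = torch.ones(sparse_z.shape[1], requires_grad=True)   # per-point scale
    opt = torch.optim.Adam([alpha, beta], lr=1e-2)

    for _ in range(steps):
        opt.zero_grad()
        u, v = sparse_uv[..., 0], sparse_uv[..., 1]
        # Dense prior sampled at the sparse track locations, then rescaled.
        prior = dense_depth[torch.arange(T)[:, None], v, u]
        refined = alpha * prior
        # L_consist: align BA-optimized sparse depths with the scaled prior.
        l_consist = ((beta * sparse_z - refined) ** 2).mean()
        # L_rigid: keep relative depths of static point pairs stable over time
        # (the 0.1 weight below is an assumed value, not from the paper).
        l_rigid = sum(((refined[:, i] - refined[:, j])
                       - (refined[0, i] - refined[0, j])).pow(2).mean()
                      for i, j in static_pairs)
        (l_consist + 0.1 * l_rigid).backward()
        opt.step()
    return alpha.detach(), beta.detach()
```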
3. Computational Complexity and Scalability
| Methodology | Unknowns (params) | Global Complexity | Run Time vs. Full BA |
|---|---|---|---|
| Pointless Global BA (Rupnik et al., 2023) | 0.1–0.5M | Grows with the number of triplet residuals; worst case on dense triplet graphs | Substantially reduced |
| Full Incremental BA (MicMac) | 2–5M | Dominated by large point blocks | Baseline |
| 5-Point Structureless BA | 0.8M | Similar residual count, but not Hessian-weighted | Higher than pointless BA |
| BA-Track (dynamic) (Chen et al., 20 Apr 2025) | Windowed / per-frame | Sliding window, per-frame and sparse | Comparable (per window) |
The first approach (pointless BA) achieves an orders-of-magnitude reduction in unknowns by eliminating explicit points and robustly combining their stochastic information via the Hessians. For very dense viewgraphs (e.g., long videos), skeletonization or graph pruning may be necessary to avoid an excessive number of triplet residuals.
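A back-of-the-envelope count makes the reduction concrete. In the sketch below, the 7-parameter similarity per triplet and the point/triplet counts are assumptions chosen only to reproduce the reported orders of magnitude, not values taken from the paper:

```python
def unknowns_full_ba(n_cams, n_points):
    # Classical BA: one 6-DOF pose per camera plus one 3D position per point.
    return 6 * n_cams + 3 * n_points

def unknowns_pointless_ba(n_cams, n_triplets):
    # Pointless BA: 6-DOF poses plus one 7-parameter similarity
    # (rotation + translation + scale) per triplet; no 3D points at all.
    return 6 * n_cams + 7 * n_triplets

# Hypothetical counts for a 2000-image aerial block:
print(unknowns_full_ba(2000, 1_800_000))    # 5,412,000  (~5.5M scale)
print(unknowns_pointless_ba(2000, 17_500))  # 134,500    (~0.135M scale)
```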
4. Empirical Evaluation
a. Photogrammetric and Multi-View Datasets (Rupnik et al., 2023)
- Aerial photogrammetry (UltraCAM Eagle, 2000 images): reprojection error $0.28$ (ours) vs $0.27$ (MicMac BA), using only $0.135$M vs $5.5$M unknowns.
- ETH3D mono-planar: error $0.56$ (ours), matching MicMac ($0.56$) and 5-Pt ($0.56$), vs openMVG ($0.57$), with only $0.5$M params (full BA: $2.3$M).
- Tanks&Temples "Temple": $3.72$ (ours) vs $3.66$ (MicMac), using $49$k params vs $224$k.
- Long focal loop (93 frames): loop closure error $4.10$ (ours), $3.44$ (MicMac), $48.1$ (5-Pt), only $9$k unknowns vs $2.3$M.
Convergence is achieved in $10$–$15$ GN iterations, comparable to standard BA and faster than IRLS-based averaging. Even a substantial fraction of misestimated triplet rotations degrades accuracy less than it does under standard motion averaging.
b. Dynamic Scene Benchmarks (Chen et al., 20 Apr 2025)
- Camera pose ATE (m): MPI Sintel $0.034$ (BA-Track) vs $0.089$ (best prior); Shibuya $0.028$ vs $0.031$.
- Depth metrics: on Sintel, Abs Rel improves to $0.408$ from the ZoeDepth prior's $0.467$, with corresponding gains in $\delta<1.25$; comparable improvements over the depth prior are reported on Shibuya.
- Ablations: Motion decoupling reduces ATE from $0.137$ (no decoupling) to $0.034$ (with full masking); the depth-refinement losses improve both sparse and dense depth accuracy.
5. Advantages and Limitations
Advantages
- Pointless Global BA:
- Eliminates explicit points, yielding a robust, memory-efficient global step with fast convergence.
- Incorporates full local stochastic structure via Hessian weighting, matching classical BA accuracy.
- Robust IRLS weighting (Huber loss) for outlier handling.
- Easily extensible to ground-control point constraints.
- Dynamic Scene MA-BA:
- Explicitly decouples dynamic and static motion to prevent contamination of camera pose and structure estimates.
- Incorporates learned 3D tracking, visibility, and motion segmentation.
- Achieves state-of-the-art results on camera pose and dense map accuracy, especially under difficult dynamic conditions.
Limitations
- Pointless BA assumes known intrinsics; self-calibration would require local BA to include intrinsics, complicating computation.
- For dense triplet graphs, skeletonization may be necessary to retain scalability.
- Dynamic MA-BA (BA-Track) depends on quality of learned trackers and requires per-frame depth priors; the masking thresholding process retains some sensitivity to tracker performance.
- Both approaches rely on robust preprocessing pipelines for feature extraction, matching, and graph construction; upstream errors can impact outcome.
6. Extensions and Context Within Structure-from-Motion
Motion-aware forms of bundle adjustment are positioned as extensions of both classical BA (joint optimization of camera poses and 3D points) and global motion averaging (which historically ignored covariance information and structure). By encoding local structure into relatively small residual blocks (Hessians), pointless BA (Rupnik et al., 2023) bridges the detection-level stochasticity of full BA with the scalability and simplicity of global motion averaging. The dynamic-scene BA (Chen et al., 20 Apr 2025) generalizes the method to multi-body, nonrigid environments by introducing learning-based motion segmentation and explicit scene flow decomposition.
A plausible implication is increased applicability of BA to large-scale, real-world multi-view and video-based reconstruction, especially in scenes with frequent or persistent dynamic activity, or when computation and memory budgets limit the feasibility of full point-based global BA. These developments reinforce the trend of integrating learned priors, geometric constraints, and robust optimization in the broader structure-from-motion and SLAM communities.