Motion-Aware Bundle Adjustment

Updated 10 November 2025
  • Motion-Aware Bundle Adjustment refers to techniques that exploit motion structure when optimizing camera poses, for instance by eliminating explicit 3D points to improve scalability.
  • It covers a 'pointless' global BA built on per-triplet Hessians as well as a dynamic BA (BA-Track) that decouples camera and object motion using learned 3D tracking.
  • Empirical evaluations show that MA-BA matches the accuracy of classical BA with far fewer parameters and remains reliable in real-world dynamic scenes.

Motion-Aware Bundle Adjustment (MA-BA) encompasses a class of methods that generalize classical bundle adjustment (BA) to better exploit motion structure or to accommodate dynamic scenes. This article surveys two distinct incarnations. The first targets computational efficiency and scalability by representing geometric structure implicitly via per-triplet Hessians in a global optimization over camera poses only, eliminating explicit 3D points and retaining accuracy through principled weighting. The second adapts BA for highly dynamic scenes by explicitly decoupling camera-induced and object-induced motions using learned 3D point trackers and motion decomposition, enabling reliable pose and dense map estimation even when the scene undergoes complex motion. Together, these approaches exemplify recent advances in making BA both more scalable and adaptable to non-static environments.

1. Theoretical Formulation of MA-BA

Two principal forms of Motion-Aware Bundle Adjustment are represented in the literature:

a. Pointless Global BA with Relative Motions Hessians

This formulation eliminates explicit 3D scene points, replacing traditional reprojection error terms with a global objective that aggregates local information from subproblems (typically triplets of images). The core objective for $S$ triplets, each involving $N = 3$ cameras, is:

$$E^{g}(X, \{d_s\}) = \sum_{s=1}^{S} \delta x_s^\top h_s\, \delta x_s = \sum_{s=1}^{S} \left\| D_s V_s \left[ d_s(X) - x_{0s} \right] \right\|^2$$

where:

  • $x_{0s} \in \mathbb{R}^{6N}$: local extrinsic parameters (rotations and centers per triplet)
  • $h_s \in \mathbb{R}^{6N \times 6N}$: Schur-reduced camera Hessian of triplet $s$
  • $d_s(\cdot)$: similarity transform mapping global poses $X$ to triplet-local extrinsics
  • $\delta x_s = d_s(X) - x_{0s}$: local pose residual of triplet $s$
  • $h_s = V_s^\top D_s^2 V_s$: diagonalization used for efficient least-squares
  • $X$: global camera rotations and centers (with possible similarity parameters)

In effect, the contribution of scene structure is compressed into hsh_s, and the number of unknowns and residuals in the global step is dramatically reduced.
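
To make the shape of this objective concrete, below is a minimal numpy sketch of how $E^g$ can be evaluated from precomputed per-triplet factors. The container layout and the callable `map_to_triplet` (standing in for $d_s(\cdot)$) are illustrative assumptions, not an interface from the cited work.

```python
import numpy as np

def pointless_ba_cost(X, triplets, map_to_triplet):
    """Evaluate E^g(X) = sum_s || D_s V_s [d_s(X) - x_{0s}] ||^2  (sketch).

    X              : global camera parameters (rotations and centers, flattened)
    triplets       : list of dicts with precomputed local factors per triplet:
                       'V'  -- (6N, 6N) eigenvector matrix V_s
                       'D'  -- (6N,)    diagonal entries of D_s
                       'x0' -- (6N,)    local extrinsics x_{0s}
    map_to_triplet : callable (X, s) -> (6N,) triplet-local extrinsics d_s(X)
    """
    cost = 0.0
    for s, t in enumerate(triplets):
        delta = map_to_triplet(X, s) - t["x0"]   # d_s(X) - x_{0s}
        r = t["D"] * (t["V"] @ delta)            # residual block D_s V_s delta
        cost += float(r @ r)                     # equals delta^T h_s delta
    return cost
```

In practice the residual blocks $r_s$ are handed to a sparse Gauss–Newton or Levenberg–Marquardt solver rather than summed into a scalar, but the quadratic form $\delta x_s^\top h_s\, \delta x_s$ is exactly what each block contributes.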

b. MA-BA for Dynamic Scene Reconstruction (Motion Decoupling)

For dynamic scenes (BA-Track), the framework separates motion components:

  • Total 3D point trajectory: $X_{\text{total}}(s)$ across frames $s = 1, \dots, S$
  • Predicted dynamic motion: $X_{\text{dyn}}(s)$, with dynamic confidence $m \in [0, 1]$
  • Static/camera-induced motion:

$$X_{\text{static}}(s) = X_{\text{total}}(s) - m \cdot X_{\text{dyn}}(s)$$

The BA cost, for $L$ frames, $N$ points per frame, and sliding-window size $S$, is:

$$\min_{\{T_t\},\, \{y_n^i\}} \sum_{i=1}^{L} \sum_{|i-j| \le S} \sum_{n=1}^{N} W_n^i(j)\, \rho\!\left( \left\| P_j(x_n^i, y_n^i) - X_n^i(j) \right\|_2 \right) + \alpha \sum_{i,n} \left\| y_n^i - D_i[x_n^i] \right\|_2^2$$

Here, $W_n^i(j)$ down-weights dynamic and outlier tracks, and $\rho$ denotes the Huber loss.
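
As a rough illustration, the following numpy sketch evaluates this cost with explicit loops. The array layout, the `project` callable (standing in for $P_j$), and the default parameters are assumptions made for readability, not the implementation of BA-Track.

```python
import numpy as np

def huber(a, delta=1.0):
    """Huber loss rho(a) for a >= 0: quadratic near zero, linear in the tail."""
    return 0.5 * a**2 if a <= delta else delta * (a - 0.5 * delta)

def dynamic_ba_cost(poses, depths, queries, tracks, weights, prior_depths,
                    project, window=8, alpha=0.1):
    """Weighted, robustified BA cost for dynamic scenes (sketch).

    poses        : list of L camera poses T_j (optimized)
    depths       : depths[i][n]       -- optimized depth y_n^i of query point n, frame i
    queries      : queries[i][n]      -- query location x_n^i
    tracks       : tracks[i][j][n]    -- tracked target X_n^i(j) in frame j
    weights      : weights[i][j][n]   -- W_n^i(j), down-weights dynamic/outlier tracks
    prior_depths : prior_depths[i][n] -- monocular prior D_i[x_n^i]
    project      : callable (T_j, x, y) -> prediction P_j(x, y)
    """
    L = len(poses)
    cost = 0.0
    for i in range(L):                                   # source frame
        for j in range(max(0, i - window), min(L, i + window + 1)):
            for n in range(len(queries[i])):             # query points
                r = project(poses[j], queries[i][n], depths[i][n]) - tracks[i][j][n]
                cost += weights[i][j][n] * huber(np.linalg.norm(r))
    for i in range(L):                                   # depth-prior regularizer
        d = np.asarray(depths[i]) - np.asarray(prior_depths[i])
        cost += alpha * float(d @ d)
    return cost
```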

2. Algorithmic Workflow

a. Pointless Global BA (Relative Motions Hessians)

  1. Feature Extraction & Matching: Extract SIFT features (e.g., via MicMac) and build the triplet graph.
  2. Local Triplet BA:
    • For each triplet, perform a single-iteration local BA and Schur-reduce it to obtain $h_s$ (see the sketch after this list).
    • Filter triplets with insufficient inliers; attenuate over-connected triplets via the weight $\gamma = MQ/(M+Q)$.
  3. Global Optimization:
    • Assemble global objective as above.
    • Unknowns: only global camera poses and similarity parameters.
    • Solve with sparse Gauss–Newton or Levenberg–Marquardt (Ceres 2.1), rotation parameterized via Lie algebra and re-orthonormalized at each step.
    • Optional: Prune graph (skeletonization) for dense video sequences.
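
The sketch referenced in step 2 above: Schur-reducing one triplet's local normal equations to the camera-only Hessian $h_s$ and factoring it as $h_s = V_s^\top D_s^2 V_s$. The dense-array layout and the placement of the weight $\gamma$ are assumptions for illustration; the quantities $M$ and $Q$ are taken from the text without further interpretation.

```python
import numpy as np

def reduce_triplet(H_cc, H_cp, H_pp, M, Q):
    """Compress one triplet's local BA into (V_s, D_s)  (sketch).

    H_cc : (6N, 6N) camera-camera block of the triplet's normal equations
    H_cp : (6N, 3P) camera-point block
    H_pp : (3P, 3P) point-point block (block-diagonal, cheap to invert)
    M, Q : quantities entering the attenuation weight gamma = MQ / (M + Q)
    """
    # Schur complement: marginalize the triplet's 3D points
    h_s = H_cc - H_cp @ np.linalg.solve(H_pp, H_cp.T)

    # attenuate over-connected triplets (applied to the quadratic form here)
    h_s *= (M * Q) / (M + Q)

    # symmetric factorization h_s = V_s^T D_s^2 V_s  (h_s is PSD)
    w, U = np.linalg.eigh(h_s)               # h_s = U diag(w) U^T
    D = np.sqrt(np.clip(w, 0.0, None))       # diagonal entries of D_s
    V = U.T                                  # so V^T diag(D^2) V == h_s
    return V, D
```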

b. BA in Dynamic Scenes (BA-Track)

  1. Sampling: For each RGB frame, sample $N$ query points and obtain an initial monocular depth $D_i$.
  2. Sliding Window Tracking: In each window:
    • Extract CNN and depth features.
    • Use a learned transformer $\mathcal{T}$ to track total and dynamic trajectories, and assign a visibility $v$ and dynamic mask $m$.
    • Compute $X_{\text{static}}$ for each point.
  3. Static Track Assembly: Aggregate static tracks and weights $W = v \cdot (1 - m)$ across windows (see the sketch after this list).
  4. Bundle Adjustment: Solve the weighted objective above over camera poses and (some) point depths.
  5. Depth Refinement:
    • Introduce per-frame scale maps $\theta_i$ and per-point scales $\sigma_n^i$.
    • Minimize sparse-to-dense consistency and rigidity losses:
      • $L_{\text{depth}}$: aligns sparse optimized depths to the scaled dense prior.
      • $L_{\text{rigid}}$: enforces geometric consistency between pairs of static points.
    • Optimize with Adam.
  6. Output: Refined camera trajectory and temporally consistent, scale-adjusted dense depth maps.
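
A minimal sketch of the motion decoupling and weight assembly in steps 2–3 (referenced above), assuming per-window arrays for the tracker outputs; the array shapes are an assumption, but the two formulas are exactly those given in Section 1b.

```python
import numpy as np

def assemble_static_tracks(X_total, X_dyn, visibility, dyn_mask):
    """Remove object-induced motion and build BA weights for one window.

    X_total    : (S, N, 3) total 3D trajectories of N query points over S frames
    X_dyn      : (S, N, 3) predicted dynamic (object-induced) motion
    visibility : (S, N)    visibility scores v in [0, 1]
    dyn_mask   : (S, N)    dynamic confidence m in [0, 1]
    """
    # X_static(s) = X_total(s) - m * X_dyn(s): motion the camera alone explains
    X_static = X_total - dyn_mask[..., None] * X_dyn

    # W = v * (1 - m): down-weight occluded and dynamic points in the BA
    W = visibility * (1.0 - dyn_mask)
    return X_static, W
```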

3. Computational Complexity and Scalability

| Methodology | Unknowns (params) | Global Complexity | Run-time Reduction |
| --- | --- | --- | --- |
| Pointless Global BA (Rupnik et al., 2023) | $\sim 0.1$–$0.5$M | $O(n_{\text{cameras}}^3)$ worst case | Factor $10\times$–$40\times$ vs. full BA |
| Full Incremental BA (MicMac) | $\sim 2$–$5$M | $O(n_{\text{cameras}}^3)$ with large point blocks | Baseline |
| 5-Point Structureless BA | $\sim 0.8$M | Similar, but not Hessian-weighted | Higher run-time than pointless BA |
| BA-Track (dynamic) (Chen et al., 20 Apr 2025) | Windowed / per-frame | Sliding window, per-frame and sparse | Comparable (per window) |

The first (pointless BA) approach achieves an orders-of-magnitude reduction in unknowns by eliminating explicit points and robustly combining the local stochastic information via the Hessians. For very dense viewgraphs (e.g., long videos), skeletonization or graph pruning may be necessary to avoid an excessive number of triplet residuals.
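
As a back-of-the-envelope check on the table, the count below assumes the usual 6 pose parameters per camera, 3 per tie point, and a 7-DoF similarity per triplet for the $\{d_s\}$; the exact breakdown used in the source may differ.

```python
def unknown_counts(n_cams, n_points, n_triplets):
    """Rough unknown counts for full BA vs. the pointless formulation (assumed breakdown)."""
    full_ba = 6 * n_cams + 3 * n_points          # camera poses + explicit 3D points
    pointless = 6 * n_cams + 7 * n_triplets      # camera poses + per-triplet similarities
    return full_ba, pointless
```

With a few thousand cameras and millions of tie points the first count lands in the millions, while the second stays near $0.1$M, broadly consistent with the $5.5$M vs. $0.135$M figures in the aerial example below.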

4. Empirical Evaluation

  • Aerial photogrammetry (UltraCAM Eagle, 2000 images): reprojection error $0.28$ (ours) vs. $0.27$ (MicMac BA), using only $0.135$M vs. $5.5$M unknowns.
  • ETH3D mono-planar: error $0.56$ (ours), $0.56$ (MicMac), $0.57$ (openMVG), $0.56$ (5-Pt), with only $0.5$M parameters (full BA: $2.3$M).
  • Tanks&Temples "Temple": $3.72$ (ours) vs. $3.66$ (MicMac), using $49$k parameters vs. $224$k.
  • Long focal loop (93 frames): loop-closure error $4.10$ (ours), $3.44$ (MicMac), $48.1$ (5-Pt), with only $9$k unknowns vs. $2.3$M.

Convergence is achieved in $10$–$15$ Gauss–Newton iterations, comparable to standard BA and faster than IRLS-based averaging. Even with up to $20\%$ of triplet rotations misestimated, accuracy degrades less than with standard motion averaging.

For BA-Track on dynamic scenes:

  • Camera pose ATE (m): MPI Sintel $0.034$ (BA-Track) vs. $0.089$ (best prior); Shibuya $0.028$ vs. $0.031$.
  • Depth metrics: Sintel Abs Rel $0.408$ / $\delta < 1.25$: $54.1\%$ (ZoeDepth prior: $0.467$ / $47.3\%$); Shibuya $0.299$ / $55.1\%$ (prior: $0.571$ / $43.8\%$).
  • Ablations: motion decoupling reduces ATE from $0.137$ (no decoupling) to $0.034$ (with full masking), and the depth-refinement losses further improve depth accuracy.

5. Advantages and Limitations

Advantages

  • Pointless Global BA:
    • Eliminates explicit points: robust, memory-efficient, and fast global convergence.
    • Incorporates full local stochastic structure via Hessian weighting, matching classical BA accuracy.
    • Robust IRLS weighting (Huber loss) for outlier handling.
    • Easily extensible to ground-control point constraints.
  • Dynamic Scene MA-BA:
    • Explicitly decouples dynamic and static motion to prevent contamination of camera pose and structure estimates.
    • Incorporates learned 3D tracking, visibility, and motion segmentation.
    • Achieves state-of-the-art results on camera pose and dense map accuracy, especially under difficult dynamic conditions.

Limitations

  • Pointless BA assumes known intrinsics; self-calibration would require the local BA to include intrinsics, complicating the computation of $h_s$.
  • For dense triplet graphs, skeletonization may be necessary to retain scalability.
  • Dynamic MA-BA (BA-Track) depends on quality of learned trackers and requires per-frame depth priors; the masking thresholding process retains some sensitivity to tracker performance.
  • Both approaches rely on robust preprocessing pipelines for feature extraction, matching, and graph construction; upstream errors can impact outcome.

6. Extensions and Context Within Structure-from-Motion

Motion-aware forms of bundle adjustment are positioned as extensions of both classical BA (joint optimization of camera poses and 3D points) and global motion averaging (which historically ignored covariance information and structure). By encoding local structure into relatively small residual blocks (Hessians), pointless BA (Rupnik et al., 2023) bridges the detection-level stochasticity of full BA with the scalability and simplicity of global motion averaging. The dynamic-scene BA (Chen et al., 20 Apr 2025) generalizes the method to multi-body, nonrigid environments by introducing learning-based motion segmentation and explicit scene flow decomposition.

A plausible implication is increased applicability of BA to large-scale, real-world multi-view and video-based reconstruction, especially in scenes with frequent or persistent dynamic activity, or when computation and memory budgets limit the feasibility of full point-based global BA. These developments reinforce the trend of integrating learned priors, geometric constraints, and robust optimization in the broader structure-from-motion and SLAM communities.
