Visual-Inertial Bundle Adjustment
- Visual-Inertial Bundle Adjustment is a nonlinear optimization framework that combines visual reprojection and inertial preintegration to jointly estimate sensor trajectories and environment geometry.
- It employs robust residuals, marginalization strategies, and factor graphs to achieve real-time, accurate state estimation even in challenging and dynamic environments.
- Modern implementations integrate dense, structureless, and deep learning techniques to enhance mapping fidelity, computational efficiency, and system adaptability.
Visual-Inertial Bundle Adjustment (VIBA) is the class of nonlinear optimization frameworks that jointly estimate sensor trajectory and environment geometry by fusing visual and inertial measurements, typically over a sliding window of keyframes. Visual residuals enforce geometric consistency through feature reprojection or direct photometric error, while inertial residuals enforce dynamic consistency via IMU preintegration. Modern VIBA encompasses a rich set of factor graph, marginalization, and parameterization strategies, supporting dense or sparse mapping, hybrid optimization, multi-sensor integration, and real-time deployment.
1. Problem Formulation and State Representation
At its core, VIBA defines a minimum-energy or maximum a posteriori estimate over the stacked state of the system by minimizing the sum of visual, inertial, and optionally other sensor residuals. The state vector typically includes, for each keyframe $i$ (a minimal container sketch follows this list):
- Pose in a reference frame: $T_i \in SE(3)$; sometimes represented as a separate rotation $R_i \in SO(3)$ and position $p_i \in \mathbb{R}^3$, with velocities $v_i \in \mathbb{R}^3$.
- IMU biases: $b_i = (b_i^g, b_i^a)$ for the gyroscope and accelerometer.
- 3D map points, e.g., $X_j \in \mathbb{R}^3$, or dense depth maps in direct methods.
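To make this state layout concrete, here is a minimal sketch of a per-keyframe state container; the name `KeyframeState` and its fields are illustrative, not taken from any of the cited systems:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class KeyframeState:
    """Per-keyframe unknowns of a typical VIBA state vector (illustrative)."""
    R: np.ndarray = field(default_factory=lambda: np.eye(3))    # rotation R_i in SO(3)
    p: np.ndarray = field(default_factory=lambda: np.zeros(3))  # position p_i
    v: np.ndarray = field(default_factory=lambda: np.zeros(3))  # velocity v_i
    bg: np.ndarray = field(default_factory=lambda: np.zeros(3)) # gyroscope bias b_i^g
    ba: np.ndarray = field(default_factory=lambda: np.zeros(3)) # accelerometer bias b_i^a
```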
The joint cost function is built from:
- Visual terms, e.g., $r^{\text{vis}}_{ij} = z_{ij} - \pi\big(R_i^{\top}(X_j - p_i)\big)$, penalizing the reprojection of map point $X_j$ into keyframe $i$ against its observation $z_{ij}$.
- Inertial (IMU preintegration) terms, e.g., $r^{\text{IMU}}_{i,i+1} = \big[r_{\Delta R}^{\top},\, r_{\Delta v}^{\top},\, r_{\Delta p}^{\top}\big]^{\top}$, in the standard preintegration form
$$r_{\Delta R} = \operatorname{Log}\big(\Delta\tilde{R}_{i,i+1}^{\top} R_i^{\top} R_{i+1}\big),\quad r_{\Delta v} = R_i^{\top}(v_{i+1} - v_i - g\,\Delta t) - \Delta\tilde{v}_{i,i+1},\quad r_{\Delta p} = R_i^{\top}\big(p_{i+1} - p_i - v_i\,\Delta t - \tfrac{1}{2}g\,\Delta t^2\big) - \Delta\tilde{p}_{i,i+1},$$
where the IMU residuals connect consecutive poses and biases using the preintegrated measurements $(\Delta\tilde{R}, \Delta\tilde{v}, \Delta\tilde{p})$ and first-order bias correction (Usenko et al., 2019, Zhang et al., 1 Apr 2024, Demmel et al., 2021).
- Optional priors/multimodal terms: e.g., LiDAR map residuals (Ding et al., 2018), wheel odometry (Zhou et al., 20 Mar 2024), roll–pitch factors (Usenko et al., 2019), etc.
The complete VIBA energy, in the case of multi-modal integration, takes the form
$$E = \sum_{(i,j)} \rho\big(\|r^{\text{vis}}_{ij}\|^2_{\Sigma_{\text{vis}}}\big) + \sum_{i} \|r^{\text{IMU}}_{i,i+1}\|^2_{\Sigma_{\text{IMU}}} + \sum_{k} \|r^{\text{prior}}_{k}\|^2_{\Sigma_{k}},$$
with $\rho$ a robust kernel and $\Sigma$ the respective measurement covariances.
State parameterizations exploit appropriate Lie group structures (e.g., $SE(3)$, $SO(3)$), quaternions, or time-continuous bases (e.g., Chebyshev expansions (Zhang et al., 1 Apr 2024)).
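Because poses live on $SE(3)$, optimizers update them with a retraction (boxplus) rather than naive addition. A minimal sketch of such an update via the $SO(3)$ exponential map; the rotation-then-translation increment ordering is an assumption for illustration:

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix such that skew(w) @ v == np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Exponential map so(3) -> SO(3) (Rodrigues' formula)."""
    theta = np.linalg.norm(w)
    if theta < 1e-9:
        return np.eye(3) + skew(w)  # first-order approximation near identity
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def retract_pose(R, p, delta):
    """Boxplus: apply a 6-DoF tangent increment delta = [d_rot, d_trans]."""
    return R @ so3_exp(delta[:3]), p + R @ delta[3:]
```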
2. Visual and Inertial Residuals
Visual Residuals
Visual residuals can be:
- Reprojection error: the difference between observed and projected image locations. For sparse feature-based BA, $r^{\text{vis}}_{ij} = z_{ij} - \pi\big(R_i^{\top}(X_j - p_i)\big)$, where $\pi(\cdot)$ is the camera projection (Quan et al., 2017, Ding et al., 2018); see the sketch after this list.
- Photometric error: The intensity difference at matched pixels, often used in direct methods with or without rolling-shutter modeling, and accounting for affine brightness (Schubert et al., 2019, Stumberg et al., 2022).
- Dense optical flow errors and their learned confidence weights in deep-learning-augmented systems (Zhou et al., 20 Mar 2024).
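As referenced above, a minimal numerical sketch of a pinhole reprojection residual with a Huber robust weight; the intrinsic matrix `K` and the world-to-camera convention $R_i^{\top}(X - p_i)$ are assumptions for illustration:

```python
import numpy as np

def project(K, R, p, X):
    """Project world point X into a camera with intrinsics K at pose (R, p)."""
    Xc = R.T @ (X - p)      # world -> camera frame
    uv1 = K @ (Xc / Xc[2])  # normalize by depth, then apply intrinsics
    return uv1[:2]

def reprojection_residual(z, K, R, p, X):
    """Visual residual r = z - pi(X), with z the observed pixel location."""
    return z - project(K, R, p, X)

def huber_weight(r, delta=1.0):
    """IRLS weight for the Huber robust kernel applied to residual r."""
    n = np.linalg.norm(r)
    return 1.0 if n <= delta else delta / n
```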
Inertial Residuals
Inertial factors connect adjacent (or selected) keyframes using IMU preintegration as in Forster et al. (the residual form given in Section 1), with bias-correction Jacobians propagated to keep the linearization robust to bias changes (Quan et al., 2017, Usenko et al., 2019, Demmel et al., 2021).
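A simplified sketch of the preintegration accumulation itself, omitting the bias-correction Jacobians for brevity; this follows the standard Forster-style recursion but is not code from any cited system:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def preintegrate(gyro, accel, dts, bg, ba):
    """Accumulate preintegrated deltas (dR, dv, dp) between two keyframes.

    gyro, accel: (N, 3) raw IMU samples; dts: (N,) sample intervals;
    bg, ba: current bias estimates, held fixed over the interval.
    """
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a, dt in zip(gyro, accel, dts):
        a_corr = a - ba
        dp = dp + dv * dt + 0.5 * (dR @ a_corr) * dt**2  # uses pre-update dv
        dv = dv + (dR @ a_corr) * dt
        dR = dR @ Rotation.from_rotvec((w - bg) * dt).as_matrix()
    return dR, dv, dp
```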
Cross-modal/Map Priors
Some VIBA frameworks include priors such as LiDAR-based constraints, e.g., point-to-plane residuals with robust information matrices, anchoring local VIO to a global map (Ding et al., 2018).
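A point-to-plane residual of the kind used for such LiDAR map priors can be sketched as below; the signature is hypothetical:

```python
import numpy as np

def point_to_plane_residual(R, p, x, q, n):
    """Signed distance of LiDAR point x, transformed by body pose (R, p),
    to the map plane through anchor point q with unit normal n."""
    return float(n @ (R @ x + p - q))
```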
3. Optimization Strategy, Marginalization, and Real-Time Operation
Sliding Window and Marginalization
VIBA typically employs a sliding window of keyframes; as new data is added, old frames are marginalized to bound computational cost. Marginalization is implemented via:
- Schur complement: Old states or landmarks are eliminated, resulting in a prior on the active variables (Quan et al., 2017, Usenko et al., 2019, Stumberg et al., 2022).
- Square-root marginalization: Residuals are stored in square-root form via (block-)QR, enhancing numerical stability especially in single precision (Demmel et al., 2021).
Efficient sparse matrix linear algebra (block-sparse Cholesky, Schur elimination) is used to exploit problem structure.
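A dense toy sketch of Schur-complement marginalization on the normal equations $H\,\delta x = b$; real systems exploit block sparsity and, in the square-root variant, work on QR-factored residuals rather than forming $H$ explicitly:

```python
import numpy as np

def marginalize(H, b, keep, drop):
    """Eliminate the 'drop' variables from H dx = b via the Schur complement,
    returning a (dense) prior on the 'keep' variables."""
    Hkk = H[np.ix_(keep, keep)]
    Hkd = H[np.ix_(keep, drop)]
    Hdd = H[np.ix_(drop, drop)]
    Hdd_inv = np.linalg.inv(Hdd + 1e-12 * np.eye(len(drop)))  # regularized inverse
    H_prior = Hkk - Hkd @ Hdd_inv @ Hkd.T
    b_prior = b[keep] - Hkd @ Hdd_inv @ b[drop]
    return H_prior, b_prior
```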
Non-Rigid, Rigid, and Hybrid BA
- Non-rigid BA: Jointly optimizes all unknowns (poses, map points, biases). Provides maximal accuracy but can be less robust to poor initialization.
- Rigid BA: Holds points fixed, optimizing only poses/velocities/biases—useful for coarse registrations or fast convergence.
- Hybrid/alternating: Alternates rigid and non-rigid steps for practical trade-off (Ding et al., 2018).
Structureless and Direct Formulations
- Structureless VIBA eliminates 3D landmarks analytically by formulating visual constraints between pairs of frames as epipolar (pairwise) residuals, favoring faster solve times and smaller state vectors at some accuracy cost (Song et al., 23 Feb 2025); see the sketch after this list.
- Direct methods operate on intensity residuals and model the full photometric error, supporting dense geometry and rolling shutter (Schubert et al., 2019, Stumberg et al., 2022).
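A sketch of the pairwise epipolar residual underlying structureless formulations, operating on unit bearing vectors; the frame conventions used here are an assumption:

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix of w."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def epipolar_residual(R1, p1, R2, p2, f1, f2):
    """Pairwise (structureless) residual: the epipolar constraint
    f2^T E f1 = 0 between unit bearings f1 (frame 1) and f2 (frame 2)."""
    R = R2.T @ R1           # rotation taking frame-1 coordinates into frame 2
    t = R2.T @ (p1 - p2)    # frame-1 origin expressed in frame 2
    E = skew(t) @ R         # essential matrix
    return float(f2 @ (E @ f1))
```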
Adaptive, Learned, and Multi-modal Extensions
- Deep networks can augment the BA system by predicting correspondences or IMU biases, with all optimization steps fully differentiable for self-supervised continual adaptation (Pan et al., 27 May 2024).
- Multimodal priors (LiDAR, GNSS, wheel odometry) can be incorporated in a pose-graph or factor-graph setting for global consistency (Zhou et al., 20 Mar 2024, Ding et al., 2018).
4. Factor-Graph Structure, Observability, and Loop Closure
Factor Graph Model
VIBA can be formalized as a factor graph with nodes (poses, velocities, biases, points) and factors (reprojection, IMU, priors). Sparse connectivity ensures scalability.
- Visual-inertial coupling renders additional degrees of freedom observable, especially pitch and roll. Factor design, such as explicit roll–pitch constraints, eliminates underconstrained DOFs (Usenko et al., 2019).
- Loop closure is managed by additional constraints (e.g., pose-graph relative transformations recovered by local/global BA), followed by loop-closure BA for drift correction (Quan et al., 2017, Zhang et al., 2023).
Delayed Marginalization and Information Recovery
Delayed marginalization allows updating marginalization priors with new linearization points, providing more accurate scale/bias estimation and efficient tightly coupled IMU initialization through pose-graph bundle adjustment (Stumberg et al., 2022).
Observability
Without inertial cues or priors, vision-only BA maintains a 4-DOF gauge freedom (3D translation + yaw). Inertial constraints (especially roll–pitch) render additional DOFs observable, yielding globally consistent estimation (Usenko et al., 2019, Quan et al., 2017).
5. Implementation and Experimental Practice
Algorithmic Pipeline
A typical real-time VIBA pipeline (a generic solver-loop sketch follows the list):
- Ingest image and IMU data.
- Preintegrate IMU measurements over each interval.
- Add keyframe when parallax/disparity exceeds threshold.
- Form visual/inertial (and optionally cross-modal) residuals.
- Linearize and assemble the normal equations (Gauss–Newton or Levenberg–Marquardt).
- Marginalize oldest states via Schur or QR.
- Solve system using block-sparse solvers, update state.
- For learned systems, back-propagate BA errors to update network weights (Pan et al., 27 May 2024).
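The linearize/solve/update steps above amount to a (damped) Gauss–Newton loop. A generic toy sketch, with `residual_fn`/`jacobian_fn` standing in for the stacked visual, inertial, and prior terms:

```python
import numpy as np

def gauss_newton(residual_fn, jacobian_fn, x0, iters=10, lm_lambda=0.0):
    """Damped Gauss-Newton: with lm_lambda > 0 this is Levenberg-Marquardt
    with a fixed damping factor (adaptive damping omitted for brevity)."""
    x = x0.copy()
    for _ in range(iters):
        r = residual_fn(x)
        J = jacobian_fn(x)
        H = J.T @ J + lm_lambda * np.eye(len(x))  # normal equations + damping
        dx = np.linalg.solve(H, -J.T @ r)
        x = x + dx
        if np.linalg.norm(dx) < 1e-8:  # converged
            break
    return x

# Toy usage: linear least squares with residuals r(x) = A x - b.
A = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 1.5])
x = gauss_newton(lambda x: A @ x - b, lambda x: A, np.zeros(2))
```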
Typical window sizes are 8–12 keyframes (Ding et al., 2018, Song et al., 23 Feb 2025), with solve times below 50 ms per frame, enabling real-time deployment on commodity CPUs/GPUs.
Specialized Implementations
- Chebyshev polynomial optimization enables time-continuous VIBA, modeling poses as truncated polynomial expansions for direct, low-latency processing (Zhang et al., 1 Apr 2024); see the sketch after this list.
- Multi-fisheye and deep dense systems leverage GPU-based dense residual/Jacobian evaluation with sliding-window optimization and semi-pose-graph BA for large-scale global consistency (Zhang et al., 2023, Zhou et al., 20 Mar 2024).
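For the Chebyshev case referenced above, the basic operation is evaluating a truncated expansion (and its analytic derivative) at arbitrary query times. A minimal sketch with hypothetical coefficients, using NumPy's Chebyshev utilities:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# One position coordinate over a trajectory segment as a truncated Chebyshev
# expansion; coefficients here are hypothetical, standing in for fitted values.
coeffs = np.array([0.5, 0.2, -0.1, 0.05])
t = np.linspace(-1.0, 1.0, 5)              # normalized segment time
pos = C.chebval(t, coeffs)                 # p(t) at query times
vel = C.chebval(t, C.chebder(coeffs))      # dp/dt via the analytic derivative
```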
Handling Dynamic Scenes and Non-Idealities
Dynamic-landmark suppression via inertial-motion prior and probabilistic gating improves robustness in non-static environments, separating static from dynamic feature tracks (Sun et al., 30 Mar 2025). Rolling-shutter models are critical for non-global-shutter cameras (Schubert et al., 2019).
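The probabilistic gating step can be sketched as a Mahalanobis chi-square test on the residual predicted by the inertial motion prior; the confidence level and degrees of freedom below are illustrative, not the values from the cited work:

```python
import numpy as np
from scipy.stats import chi2

def is_static(residual, cov, dof=2, confidence=0.95):
    """Accept a feature track as static if its residual against the
    inertial-motion prediction passes a Mahalanobis chi-square test;
    otherwise flag it as dynamic and downweight or drop it."""
    m2 = float(residual @ np.linalg.solve(cov, residual))  # squared Mahalanobis
    return m2 < chi2.ppf(confidence, dof)
```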
6. Applications, Empirical Performance, and Directions
VIBA enables high-precision, robust, and scale-consistent visual-inertial odometry and SLAM across robotics, automotive, AR/VR, and mobile domains.
Reported performance on representative datasets:
- EuRoC MAV: translation ATE reduced to 0.05 m, rotation error 0.11°, sub-40 ms solve times (Song et al., 23 Feb 2025).
- Robustness to dynamic scenes with 40% ATE reduction vs. VINS-Fusion; negligible computational overhead (Sun et al., 30 Mar 2025).
- Dense mapping, real-time, and global driftless operation through GNSS/Map constraints and deep dense modules (Zhou et al., 20 Mar 2024).
- Continual online learning yields up to 50% ATE improvement vs. non-adaptive baselines (Pan et al., 27 May 2024).
Limitations remain in degeneracy handling (e.g., low-parallax motion), sensitivity to calibration and parameter tuning, and real-time demands for dense/global optimization at scale (Usenko et al., 2019, Zhang et al., 1 Apr 2024).
7. Future Developments and Research Frontiers
Open research directions include:
- Tight coupling of additional modalities (barometer, wheel odometry, multiple cameras) and in-situ online extrinsic calibration (Usenko et al., 2019, Zhou et al., 20 Mar 2024).
- Further reducing computational complexity via structureless methods, efficient marginalization, or continuous-time formulations (Song et al., 23 Feb 2025, Zhang et al., 1 Apr 2024).
- Improved handling of dynamic objects and nonrigid environments (Sun et al., 30 Mar 2025).
- Fully differentiable, end-to-end visual-inertial learning systems with closed feedback from BA (Pan et al., 27 May 2024).
- Fast reliable initialization under scale/bias uncertainties via delayed-marginalization and pose-graph BA (Stumberg et al., 2022).
VIBA continues to be foundational to high-performance, robust, real-time visual-inertial state estimation and mapping, evolving rapidly through integrated algorithmic, machine learning, and sensor fusion advances.