Sparse Bundle Adjustment Layer

Updated 11 April 2026

Sparse Bundle Adjustment Layer is a differentiable, GPU-accelerated implementation of bundle adjustment within PyTorch that leverages sparse computation.
It exploits the inherent sparsity in the Jacobian/Hessian matrices from factor graph modeling, improving scalability in applications like SLAM and photogrammetry.
Integration with PyPose enables efficient, end-to-end optimization in modern deep learning pipelines, achieving significant speedups over traditional methods.

A Sparse Bundle Adjustment Layer is a fully differentiable and GPU-accelerated implementation of bundle adjustment (BA) designed for integration within modern deep learning pipelines, specifically leveraging PyTorch’s eager-mode computation. It addresses the need for flexible, efficient, and natively differentiable BA in large-scale perception applications such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry, where deep neural networks are becoming pervasive. The layer leverages problem structure—specifically, the sparsity in the Jacobian/Hessian matrices induced by the underlying factor graph of BA—while providing a user-facing interface tightly coupled with PyPose and PyTorch for both research and production environments (Zhan et al., 2024).

1. Mathematical Foundations

Bundle adjustment jointly optimizes camera poses and 3D landmark positions by minimizing the sum of squared reprojection errors. Let $C$ denote the number of cameras and $P$ the number of 3D points. The $i$th camera pose is $\zeta_i \in \mathrm{SE}(3)$, parameterized either with quaternion + translation (7D) or via the Lie algebra. The $j$th 3D landmark is $p_j \in \mathbb{R}^3$. Each observation provides a 2D image location $\mathbf u_{ij} \in \mathbb{R}^2$ of landmark $p_j$ in camera $i$, and $K_i$ is the corresponding intrinsic matrix. The standard pinhole projection function is denoted $\pi(K_i, \zeta_i, p_j)$. The cost function is: $$E(\{\zeta_i\}, \{p_j\}) = \sum_{(i,j) \in \mathcal{V}} \left\| \mathbf u_{ij} - \pi(K_i, \zeta_i, p_j) \right\|_2^2 + R(\{\zeta_i\}, \{p_j\}),$$ where $\mathcal{V}$ is the set of visible camera-point pairs and $R$ denotes optional priors or regularizers.
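As a concrete reading of this cost, the following is a minimal PyTorch sketch, not the paper's implementation. It assumes poses are given as a rotation matrix `R` and translation `t` instead of SE(3) objects, omits the optional regularizer, and uses illustrative names (`project`, `reprojection_cost`, `cam_idx`, `pt_idx`) that do not come from the source.

```python
import torch

def project(K, R, t, p):
    """Pinhole projection, one row per observation: K (N,3,3), R (N,3,3), t (N,3), p (N,3)."""
    p_cam = torch.einsum("nij,nj->ni", R, p) + t   # world frame -> camera frame
    uvw = torch.einsum("nij,nj->ni", K, p_cam)     # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]                # perspective divide

def reprojection_cost(K, R, t, points, cam_idx, pt_idx, uv):
    """Sum of squared reprojection errors over the visible (camera, point) pairs."""
    pred = project(K[cam_idx], R[cam_idx], t[cam_idx], points[pt_idx])
    return ((uv - pred) ** 2).sum()

# Toy scene (made-up numbers): 2 cameras, 3 points, every point seen by every camera.
dt = torch.float64
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]], dtype=dt).repeat(2, 1, 1)
R = torch.eye(3, dtype=dt).repeat(2, 1, 1)
t = torch.tensor([[0., 0., 0.], [0.1, 0., 0.]], dtype=dt)
points = torch.tensor([[0., 0., 5.], [1., 0., 4.], [0., 1., 6.]], dtype=dt)
cam_idx = torch.tensor([0, 0, 0, 1, 1, 1])   # camera index of each observation
pt_idx = torch.tensor([0, 1, 2, 0, 1, 2])    # point index of each observation

uv = project(K[cam_idx], R[cam_idx], t[cam_idx], points[pt_idx])   # noise-free observations
cost_clean = reprojection_cost(K, R, t, points, cam_idx, pt_idx, uv)
cost_noisy = reprojection_cost(K, R, t, points, cam_idx, pt_idx, uv + 1.0)
```

Gathering with `cam_idx` and `pt_idx` means only visible pairs are ever evaluated, which is the same index structure that later determines the sparsity pattern of the Jacobian.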

The optimization is a non-linear least-squares problem, typically solved using the Levenberg–Marquardt (LM) algorithm. The residuals stack into the vector $r(\theta)$ with parameter vector $\theta = (\zeta_1, \dots, \zeta_C, p_1, \dots, p_P)$. LM iteratively solves: $$(J^\top J + \lambda I)\,\delta = -J^\top r,$$ where $J = \partial r / \partial \theta$, $\lambda$ is the damping parameter, and updates are applied in the tangent space for SE(3) components.
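To make the damped update concrete, here is a small dense LM loop in PyTorch. It is a sketch under simplifying assumptions: a two-parameter exponential curve fit stands in for BA, the parameters live in Euclidean space (so no tangent-space update is needed), and the damping schedule (halve on accept, multiply by 10 on reject) is a common heuristic, not one taken from the source.

```python
import torch

def residual(theta, x, y):
    """Residual of the toy model y = a * exp(b * x)."""
    a, b = theta
    return a * torch.exp(b * x) - y

def lm_step(theta, x, y, lam):
    """One damped step: solve (J^T J + lam * I) delta = -J^T r, return theta + delta."""
    r = residual(theta, x, y)
    J = torch.autograd.functional.jacobian(lambda th: residual(th, x, y), theta)
    H = J.T @ J                                   # Gauss-Newton Hessian approximation
    g = J.T @ r
    eye = torch.eye(H.shape[0], dtype=H.dtype)
    delta = torch.linalg.solve(H + lam * eye, -g)
    return theta + delta

def cost(theta, x, y):
    return residual(theta, x, y).pow(2).sum()

x = torch.linspace(0, 1, 20, dtype=torch.float64)
theta_true = torch.tensor([2.0, -1.5], dtype=torch.float64)
y = residual(theta_true, x, torch.zeros_like(x))   # noise-free targets

theta = torch.tensor([1.0, 0.0], dtype=torch.float64)
lam = 1e-2
for _ in range(50):
    candidate = lm_step(theta, x, y, lam)
    if cost(candidate, x, y) < cost(theta, x, y):
        theta, lam = candidate, lam * 0.5          # accept the step, relax damping
    else:
        lam = lam * 10.0                           # reject the step, increase damping
```

In a BA setting the same linear system is solved, but $J$ is block-sparse and the pose part of $\delta$ is applied through the SE(3) exponential map instead of plain addition.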

2. Sparse Factor Graph Modeling and Linearization

Each reprojection residual $r_{ij}$ depends only on a specific camera ($\zeta_i$) and a specific point ($p_j$), leading to extreme sparsity in the Jacobian $J$. This structure is formalized as a bipartite factor graph:

Camera nodes: pose variables $\zeta_i$.
Point nodes: 3D locations $p_j$.
Factors: reprojection errors $r_{ij}$.

The Jacobian $J$ contains only $2 \times d_c$ pose sub-blocks (with $d_c = 6$ for a tangent-space pose parameterization, or $7$ for quaternion plus translation) and $2 \times 3$ point sub-blocks for each observed (visible) $(i, j)$ pair. Storing and processing $J$ in PyTorch’s native sparse_BSR (block sparse row) format, using block sizes and indices corresponding to the observation structure, enables efficient memory and compute scaling. The Gauss–Newton or LM step uses the approximate Hessian $H \approx J^\top J$, maintaining computational and storage complexity of $O(N_{\mathrm{obs}})$, where $N_{\mathrm{obs}} \ll C \cdot P$ is the number of observations.
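The block layout can be illustrated with `torch.sparse_bsr_tensor`. This is a sketch under stated assumptions: pose blocks are taken to be $2 \times 6$, the block values are random placeholders, and because a BSR tensor needs one uniform block shape, the pose and point parts are stored as two separate matrices. Whether the actual implementation splits the Jacobian this way is not stated in the source.

```python
import torch

num_cams, num_pts = 2, 3
cam_idx = torch.tensor([0, 0, 1, 1])   # camera observed by each residual
pt_idx = torch.tensor([0, 1, 1, 2])    # point observed by each residual
n_obs = cam_idx.numel()

# Placeholder derivative blocks; in practice these come from differentiating the residual.
Jc_blocks = torch.randn(n_obs, 2, 6, dtype=torch.float64)   # d r_ij / d pose_i
Jp_blocks = torch.randn(n_obs, 2, 3, dtype=torch.float64)   # d r_ij / d point_j

# Each residual occupies one block row and touches exactly one camera and one point,
# so every block row holds a single block: crow_indices is simply 0, 1, ..., n_obs.
crow = torch.arange(n_obs + 1)
Jc = torch.sparse_bsr_tensor(crow, cam_idx, Jc_blocks, size=(2 * n_obs, 6 * num_cams))
Jp = torch.sparse_bsr_tensor(crow, pt_idx, Jp_blocks, size=(2 * n_obs, 3 * num_pts))

# Dense copies are only for inspecting this toy example; a real solver stays sparse.
J = torch.cat([Jc.to_dense(), Jp.to_dense()], dim=1)   # columns: [poses | points]
H = J.T @ J                                             # approximate Hessian
```

Storage grows with the number of observations (one pose block and one point block per residual), not with the product of cameras and points. The camera-camera part of `H` is block-diagonal because no residual involves two cameras.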

3. GPU Acceleration, Differentiability, and Eager-Mode Implementation

Sparse Bundle Adjustment Layer employs full GPU acceleration and native differentiability within PyTorch eager mode:

Jacobian and residual computation: The forward pass enumerates all visible $(i, j)$ residuals, replicating camera and point variables to form batched residual computations. The function $\pi(K_i, \zeta_i, p_j)$ is autograd-differentiable. Block-wise derivatives $\partial r_{ij} / \partial \zeta_i$ and $\partial r_{ij} / \partial p_j$ are efficiently computed using torch.func.jacrev and torch.func.vmap, and then assembled into a sparse_BSR matrix.
Sparse linear algebra: Key steps are delegated to cuSPARSE and custom CUDA or Triton kernels:
- SpGEMM $J^\top J$: Performed via PyTorch's sparse_CSR or custom block-sparse logic.
- SpMV $J^\top r$: Native PyTorch sparse dispatch.
- Diagonal manipulation for LM damping: Custom Triton kernels.
- Linear solvers: Direct Cholesky for small/medium systems; PCG with block preconditioning for larger-scale problems.
Eager-mode compatibility: Sparse operators are fully registered with the PyTorch dispatcher, allowing for standard operator overloading (@, solver(A, b)) and seamless gradient propagation through a fixed number of LM iterations.
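The block-wise derivative computation can be sketched with `torch.func.jacrev` and `torch.func.vmap`. The residual below is an assumption made for illustration: the pose is a 6-vector (axis-angle rotation plus translation), the camera has a single focal length `FOCAL` with the principal point at the origin, and the function names are hypothetical. The source does not specify the residual it differentiates.

```python
import torch
from torch.func import jacrev, vmap

FOCAL = 500.0  # assumed focal length for this toy camera

def rotate(w, p):
    """Rotate point p by axis-angle vector w using Rodrigues' formula."""
    theta = torch.sqrt((w * w).sum() + 1e-12)      # small epsilon avoids division by zero
    k = w / theta
    k_cross_p = torch.stack([k[1] * p[2] - k[2] * p[1],
                             k[2] * p[0] - k[0] * p[2],
                             k[0] * p[1] - k[1] * p[0]])
    return (p * torch.cos(theta) + k_cross_p * torch.sin(theta)
            + k * (k * p).sum() * (1.0 - torch.cos(theta)))

def residual(pose, point, uv):
    """Reprojection residual of ONE observation; pose = [axis-angle (3), translation (3)]."""
    p_cam = rotate(pose[:3], point) + pose[3:]
    return FOCAL * p_cam[:2] / p_cam[2] - uv

dt = torch.float64
poses = torch.tensor([[0.01, -0.02, 0.03, 0.0, 0.0, 0.0],
                      [-0.02, 0.01, 0.00, 0.1, 0.0, 0.0]], dtype=dt)
points = torch.tensor([[0., 0., 5.], [1., 0., 4.], [0., 1., 6.]], dtype=dt)
cam_idx = torch.tensor([0, 0, 1, 1])
pt_idx = torch.tensor([0, 1, 1, 2])
uv = torch.zeros(4, 2, dtype=dt)

# jacrev differentiates one residual w.r.t. its pose and point;
# vmap batches that over all observations after gathering by index.
per_obs_jac = vmap(jacrev(residual, argnums=(0, 1)))
Jc_blocks, Jp_blocks = per_obs_jac(poses[cam_idx], points[pt_idx], uv)
```

`Jc_blocks` has shape `(N, 2, 6)` and `Jp_blocks` has shape `(N, 2, 3)`, which is exactly the `values` layout a block-sparse-row tensor expects, with `cam_idx` and `pt_idx` serving as the block column indices.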

4. PyPose/PyTorch Integration and User API

The Sparse Bundle Adjustment Layer is implemented as a differentiable PyTorch/PyPose module with minimal API overhead. The typical workflow involves:

Defining a custom residual module as a PyTorch nn.Module subclass, parameterizing camera pose and point tensors, and implementing the residual computation.
Instantiating the model, observation tensors, and the optimizer (e.g., LM), along with trust-region strategies and optional schedulers.
Running the optimizer in an iterative loop, where each BA step entails both forward (residual) and backward (LM update) passes, with gradients flowing through the entire stack.

Example minimal code (from Zhan et al., 2024):

[Code listing not preserved in this copy.]

Configuration is exposed via Python for all key hyperparameters, including trust-region schedule, LM iteration count, linear solver selection, and tolerance. The API is intentionally similar to dense LM in PyPose, minimizing code changes when upgrading to sparse, high-performance BA.

5. Empirical Performance and Comparative Analysis

On the BAL and 1DSfM datasets, the eager-mode sparse GPU BA, run in double precision on an NVIDIA RTX 4090, is reported to outperform established solvers by the following factors:

Comparator	Speedup of eager-mode GPU BA over comparator
GTSAM	18.5×
g$^2$o	22×
Ceres	23×
DeepLM	56% faster on BAL, 28% faster on 1DSfM

Memory usage is modestly higher than C++-based frameworks due to Python’s GC and PyTorch sparse overhead. For problem sizes under ~1k parameters, Python overhead reduces absolute speedup. For large problems, sparsity and full-GPU execution yield superior scaling; PCG methods may require tuning for very ill-conditioned scenes, while direct Cholesky provides strong robustness for medium-scale systems.

A concise summary of trade-offs:

Runtime: Eager-mode GPU implementation achieves $18.5\times$ to $23\times$ speedup versus C++ libraries (GTSAM, g$^2$o, Ceres), and substantial gains over DeepLM.
Memory footprint: Some increase compared to C++ counterparts.
Numerical stability: PCG preconditioners may need tuning; Cholesky is robust but memory-intensive on larger systems.
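The two solver families can be compared on a small damped system. The sketch below is a textbook preconditioned conjugate gradient with a block-Jacobi preconditioner, checked against a direct Cholesky solve. It uses a dense random matrix for readability, and the block size, tolerance, and function names are assumptions for illustration, not the source's solver.

```python
import torch

def pcg(A, b, M_inv, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient for a symmetric positive-definite A."""
    x = torch.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.clone()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if r.norm() <= tol * b.norm():             # relative residual stopping rule
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

def block_jacobi(A, bs):
    """Return a function applying the inverse of A's bs x bs diagonal blocks."""
    n = A.shape[0] // bs
    idx = torch.arange(n)
    blocks = A.reshape(n, bs, n, bs)[idx, :, idx, :]   # (n, bs, bs) diagonal blocks
    inv = torch.linalg.inv(blocks)
    return lambda r: (inv @ r.reshape(n, bs, 1)).reshape(-1)

torch.manual_seed(0)
J = torch.randn(40, 12, dtype=torch.float64)
lam = 1e-2
A = J.T @ J + lam * torch.eye(12, dtype=torch.float64)   # damped normal matrix
b = torch.randn(12, dtype=torch.float64)

# Direct solve: factor once, then back-substitute.
L = torch.linalg.cholesky(A)
x_chol = torch.cholesky_solve(b[:, None], L).squeeze(1)

# Iterative solve: only needs matrix-vector products and the preconditioner.
x_pcg = pcg(A, b, block_jacobi(A, bs=3))
```

Cholesky needs the factor in memory, which is what becomes costly on large systems, while PCG only needs products with `A`. The price is that its iteration count depends on conditioning and on how well the preconditioner matches the problem.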

6. Practical Guidelines and Integration Strategies

To maximize performance and stability, several best practices are indicated:

Data normalization: Center and scale image coordinates for improved conditioning.
Initialization quality: Use robust initial pose/structure estimates from upstream (e.g., COLMAP, feature-based PnP).
LM damping: Choose an initial damping value $\lambda$ and adapt it during optimization with trust-region logic.
Solver selection: Use direct Cholesky for problem sizes below 10k unknowns; otherwise, deploy PCG with an explicit convergence tolerance.
Deep learning pipeline integration:
- Wrap BA as an nn.Module.
- Insert mid-pipeline, e.g., between feature matching and pose regression stages.
- Use autodiff on BA loss to train upstream network weights.
- Limit LM steps during network training to bound memory.
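The integration pattern in the list above can be sketched as a small `nn.Module` that unrolls a fixed number of damped Gauss-Newton steps and keeps the autograd graph. Everything here is a hypothetical illustration: the class name `UnrolledLM`, the structure-only toy problem (one 3D point seen by two cameras with known translations), and the `offset` parameter standing in for upstream network weights are not from the source, and the dense solve replaces the sparse machinery.

```python
import torch
from torch import nn
from torch.autograd.functional import jacobian

class UnrolledLM(nn.Module):
    """Fixed number of damped Gauss-Newton steps, differentiable w.r.t. the observations."""
    def __init__(self, residual_fn, num_steps=5, lam=1e-4):
        super().__init__()
        self.residual_fn, self.num_steps, self.lam = residual_fn, num_steps, lam

    def forward(self, theta, obs):
        eye = torch.eye(theta.numel(), dtype=theta.dtype)
        for _ in range(self.num_steps):            # bounded step count bounds graph memory
            fn = lambda th: self.residual_fn(th, obs)
            r = fn(theta)
            J = jacobian(fn, theta, create_graph=True)   # keep graph for backprop
            delta = torch.linalg.solve(J.T @ J + self.lam * eye, -(J.T @ r))
            theta = theta + delta
        return theta

dt = torch.float64
cam_t = torch.tensor([[0., 0., 0.], [-1., 0., 0.]], dtype=dt)   # known camera translations

def residual_fn(point, obs):
    """Normalized-image reprojection residuals of one point in two cameras."""
    p_cam = point + cam_t                          # (2, 3): point in each camera frame
    return (p_cam[:, :2] / p_cam[:, 2:3] - obs).reshape(-1)

gt = torch.tensor([0.5, 0.2, 4.0], dtype=dt)
clean_obs = residual_fn(gt, torch.zeros(2, 2, dtype=dt)).reshape(2, 2)

offset = nn.Parameter(torch.zeros(2, 2, dtype=dt))   # stand-in for upstream weights
obs = clean_obs + 0.01 + offset                       # biased observations to be corrected

layer = UnrolledLM(residual_fn, num_steps=5, lam=1e-4)
init = torch.tensor([0., 0., 3.], dtype=dt)
refined = layer(init, obs)
loss = (refined - gt).pow(2).sum()                    # BA loss on the refined estimate
loss.backward()                                       # gradient reaches the upstream parameter
```

After `backward()`, `offset.grad` holds the gradient of the BA loss with respect to the upstream parameter, so an optimizer step on `offset` would reduce the observation bias. Keeping `num_steps` small limits how many Jacobians and linear solves autograd has to retain.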

These principles enable embedding a fully GPU-accelerated, differentiable, sparse, second-order bundle adjustment module into any PyTorch workflow, facilitating seamless integration with learned feature matching, depth estimation, or higher-level vision modules.

References

Zhan et al. (2024). Bundle Adjustment in the Eager Mode.
