
SD-6DoF-ICLK: Sparse & Deep SE(3) Alignment

Updated 17 December 2025
  • The paper introduces SD-6DoF-ICLK, a method that integrates analytic ICLK with deep feature weighting to deliver robust 6DoF image alignment.
  • It leverages sparse depth and CNN feature pyramids to optimize SE(3) poses efficiently, significantly improving accuracy over classical approaches.
  • Empirical results show subpixel accuracy and real-time performance, making it highly suitable for SLAM and robotic vision in challenging conditions.

Sparse and Deep Inverse Compositional Lucas-Kanade on SE(3) (SD-6DoF-ICLK) is a modern variant of the inverse compositional Lucas-Kanade (ICLK) algorithm designed to solve robust, accurate image alignment and pose estimation problems in 3D using sparse depth information and deep learning. By integrating the analytic efficiencies of ICLK on SE(3) with learned feature embeddings and robust deep weighting, SD-6DoF-ICLK achieves fast, real-time performance and strong robustness under challenging visual conditions, even when only sparse depth is available in one view (Hinzmann et al., 2021).

1. Overview and Optimization Objective

SD-6DoF-ICLK solves for the relative 6-DoF transformation (an element of the Lie group SE(3)) between two RGB images, where the reference image is annotated with a sparse set of 2D features and associated inverse depths (e.g., derived from visual-inertial odometry or SLAM front-ends), and the target image is depthless. The goal is to optimize the SE(3) twist vector $\xi \in \mathbb{R}^6$ such that a robustified photometric loss is minimized:

$$E(\xi) = \sum_{i=1}^n \rho\big( I_0(x_i) - I_1(W(x_i; \xi)) \big)$$

where $W(x; \xi)$ uses the inverse depth $z_i$ of feature $x_i$ to back-project $x_i$ to 3D in $I_0$'s camera frame, transforms it with $T = \exp(\xi) \in \mathrm{SE}(3)$, and reprojects into $I_1$:

$$W(x; \xi) = \pi\big( \exp(\xi)\, T_{C_0 \rightarrow C_1}\, \pi^{-1}(x, 1/z) \big)$$

Here, $\pi^{-1}$ and $\pi$ denote camera back-projection and projection given the intrinsics. The robust loss $\rho$ is typically an M-estimator predicted by a learned network.
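A minimal NumPy sketch of the warp $W(x; \xi)$, assuming identity extrinsics $T_{C_0 \rightarrow C_1}$ and a pinhole intrinsics matrix `K`; `se3_exp` is a standard closed-form exponential map written for illustration, not code from the paper:

```python
import numpy as np

def se3_exp(xi):
    """Map a twist xi = (v, w) in R^6 to a 4x4 SE(3) matrix (Rodrigues form).
    Hypothetical helper for illustration; twist ordered translation-first."""
    v, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    W = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])
    if theta < 1e-8:
        R, V = np.eye(3) + W, np.eye(3) + 0.5 * W
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (1.0 - A) / theta**2
        R = np.eye(3) + A * W + B * (W @ W)
        V = np.eye(3) + B * W + C * (W @ W)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T

def warp(x, inv_depth, xi, K):
    """W(x; xi): back-project pixel x with its inverse depth, transform by
    exp(xi), and reproject (extrinsics assumed identity in this sketch)."""
    z = 1.0 / inv_depth
    X = z * np.linalg.inv(K) @ np.array([x[0], x[1], 1.0])  # pi^{-1}(x, 1/z)
    Xc = (se3_exp(xi) @ np.append(X, 1.0))[:3]              # exp(xi) X
    u = K @ (Xc / Xc[2])                                    # pi(.)
    return u[:2]
```

With $\xi = 0$ the warp is the identity on the image, which is the configuration at which the inverse compositional Jacobians are evaluated.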

2. Inverse-Compositional SE(3) Update Strategy

SD-6DoF-ICLK leverages the inverse compositional formulation to maximize computational efficiency. At each iteration:

  1. Residual Computation: Compute $r_i(\xi) = I_0(x_i) - I_1(W(x_i; \xi))$.
  2. Jacobian Assembly: At $\xi = 0$, build $J_i = \frac{\partial I_1}{\partial W} \cdot \frac{\partial W}{\partial \xi}$ for each residual; stack to form $J \in \mathbb{R}^{N \times 6}$.
  3. Normal Equations: With per-point weights $w_i$ predicted by a convolutional M-estimator network, solve:

$$\Delta\xi = \left( J^\top W J + \lambda D \right)^{-1} J^\top W r$$

$W$ is diagonal with entries $w_1, \ldots, w_n$, $D$ approximates the Hessian of $\rho$, and $\lambda$ is the Levenberg–Marquardt damping factor.

  4. Inverse Composition: The pose is updated via

$$T_{\text{new}} = T_{\text{cur}} \circ \exp(\Delta\xi)^{-1}$$

Crucially, all gradients, Jacobians, and Hessians are pre-computed at the template frame ($\xi = 0$), enabling significant computational reuse and efficiency.
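The normal-equation solve in step 3 can be sketched in NumPy as below; note that the diagonal scaling used for $D$ here is a common Levenberg–Marquardt choice standing in for the Hessian approximation of $\rho$:

```python
import numpy as np

def iclk_delta(J, r, w, lam=1e-3):
    """Solve the weighted, damped normal equations for the SE(3) increment.
    J: (N, 6) template-frame Jacobian, r: (N,) residuals, w: (N,) per-point
    weights from the (learned) M-estimator. Returns the twist delta-xi; the
    pose is then updated by T_new = T_cur o exp(delta-xi)^{-1}."""
    JtW = J.T * w                      # J^T W with W = diag(w)
    H = JtW @ J                        # 6x6 Gauss-Newton approximation
    D = np.diag(np.diag(H))            # LM scaling (one common choice for D)
    return np.linalg.solve(H + lam * D, JtW @ r)
```

Because the Jacobian is assembled once at $\xi = 0$, only the residual vector $r$ and the weights change between iterations, which is where the inverse compositional formulation saves work.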

3. Integration of Sparse Depth

Sparse depths $z_i$ attached to the reference-frame features $x_i$ are fundamental. Each feature's depth is used to lift its 2D image coordinates to a 3D point in the reference camera coordinate system, forming $X_i = \pi^{-1}(x_i, 1/z_i)$. Depths enter the optimization through:

  • The 3D warp: $W(x; \xi) = \pi(\exp(\xi) X)$
  • The Jacobian: $\partial W / \partial \xi$ depends on $X_i$, with deeper points yielding proportionally smaller image motion for a given translation.

No explicit depth regularization is required; all geometric linkage occurs through warping and Jacobian construction.
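The depth dependence of the Jacobian can be made concrete with the standard pinhole point Jacobian evaluated at $\xi = 0$ (a textbook derivation rather than the paper's code; twist ordered translation-first, with the small-motion convention $\delta X = v + \omega \times X$):

```python
import numpy as np

def warp_jacobian(X, f):
    """2x6 Jacobian of the projected pixel w.r.t. the twist (v, w), evaluated
    at identity, for a point X = (X, Y, Z) in the reference camera frame and
    focal length f. The translation columns scale with 1/Z, so deeper points
    move less in the image for the same translation."""
    x, y, Z = X[0] / X[2], X[1] / X[2], X[2]
    Jt = (f / Z) * np.array([[1.0, 0.0, -x],
                             [0.0, 1.0, -y]])          # w.r.t. translation v
    Jr = f * np.array([[-x * y, 1.0 + x * x, -y],
                       [-(1.0 + y * y), x * y, x]])    # w.r.t. rotation w
    return np.hstack([Jt, Jr])
```

Doubling $Z$ while keeping the normalized coordinates fixed halves the translation block and leaves the rotation block unchanged, which is exactly the depth linkage described above.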

4. CNN Feature Pyramid and Learned Robust M-Estimator

To overcome illumination change, occlusion, and outlier susceptibility, SD-6DoF-ICLK replaces raw pixel intensities with deep feature pyramids and applies a per-pixel learned weight (robust M-estimator):

  • Four-level convolutional neural network (CNN) pyramids $f_\ell(\cdot)$ convert $I_0, I_1$ into dense feature maps $F_0^{(\ell)}, F_1^{(\ell)}$ (with typically $C = 64$ or $128$ channels).
  • $F_1^{(\ell)}$ is warped under $W$ into the reference frame to compute feature residuals $r^{(\ell)} = F_0^{(\ell)} - F_1^{(\ell)} \circ W$ at each pyramid level.
  • A small, fully convolutional network $h(\cdot)$ (e.g., two conv-ReLU layers + a $1 \times 1$ conv + sigmoid) predicts a per-feature weight $w^{(\ell)} \in (0, 1)$, acting as a learned M-estimator for robust weighting.

All normal equations and optimization are applied at every scale in the feature pyramid (typically four scales), supporting coarse-to-fine optimization and better convergence.
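For intuition about what the learned M-estimator $h(\cdot)$ does, a hand-crafted Huber-style weight can stand in for it: residuals below a threshold receive full weight, larger ones are down-weighted. This is purely illustrative; in SD-6DoF-ICLK the weights come from a CNN rather than a fixed formula:

```python
import numpy as np

def huber_weight(r, delta=0.3):
    """Hand-crafted robust weights in (0, 1]: stand-in for the learned
    convolutional M-estimator h(.). Inliers (|r| <= delta) get weight 1;
    outliers are down-weighted as delta / |r|. `delta` is illustrative."""
    a = np.abs(r)
    return np.where(a <= delta, 1.0, delta / a)
```

Plugged into the normal equations as $W = \mathrm{diag}(w_i)$, such weights suppress the influence of occluded or specular pixels, which is the role the learned network plays across all pyramid levels.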

5. Optional Per-Feature Alignment and Bundle Adjustment

After multi-scale SD-6DoF-ICLK, further refinement is possible via:

  • 2D Patch Alignment: For each sparse match, a local 2D inverse compositional Lucas–Kanade aligner on the image patch centered at $x_i$ refines its coordinate by $\delta x_i$ to subpixel accuracy.
  • Bundle Adjustment (BA): Holding refined feature positions fixed, jointly optimize $\xi$ and optionally the depths $\{z_i\}$ by minimizing

$$E_{BA}(T, \{z_i\}) = \sum_{i} \rho\big(\|u_i - \pi(T\, \pi^{-1}(x_i, 1/z_i))\|_2^2\big)$$

with robust Cauchy loss and Levenberg–Marquardt optimization (e.g., in GTSAM). This final step achieves subpixel and sub-centimeter alignment accuracy (Hinzmann et al., 2021).
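A minimal NumPy sketch of the BA objective $E_{BA}$ with a Cauchy loss, holding depths fixed (helper names are hypothetical; a real implementation would use GTSAM's robust noise models and its Levenberg–Marquardt optimizer rather than this direct evaluation):

```python
import numpy as np

def cauchy(s2, c=1.0):
    """Cauchy robust loss rho applied to a squared reprojection error s2."""
    return 0.5 * c * c * np.log1p(s2 / (c * c))

def ba_cost(T, K, pts2d_ref, inv_depths, obs, c=1.0):
    """E_BA: robustified sum of squared reprojection errors for pose T (4x4),
    intrinsics K, reference pixels with inverse depths, and the refined
    observations `obs` in the target image (depths held fixed in this sketch)."""
    E = 0.0
    Kinv = np.linalg.inv(K)
    for x, iz, u in zip(pts2d_ref, inv_depths, obs):
        X = (1.0 / iz) * Kinv @ np.array([x[0], x[1], 1.0])   # pi^{-1}(x, 1/z)
        Xc = (T @ np.append(X, 1.0))[:3]                      # T X
        uh = K @ (Xc / Xc[2])                                 # pi(T X)
        E += cauchy(np.sum((u - uh[:2]) ** 2), c)
    return E
```

The cost is zero exactly when every refined observation coincides with the reprojection of its reference point, which is the fixed point the LM iterations drive toward.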

6. Empirical Performance and Characteristics

Experiments on synthetic satellite imagery at $752 \times 480$ resolution demonstrate:

| | Mean pixel error | Mean translation error | Mean rotation error |
|---|---|---|---|
| Initial (random guess) | ≈ 34 px | 4.93 m | 0.075 rad |
| SD-6DoF-ICLK alone | 1.29 px | 3.26 m | 0.020 rad |
| After per-feature alignment | 0.41 px | 3.26 m | 0.020 rad |
| After full BA | 0.12 px | 0.089 m | 0.000 rad |

The classical sparse 6DoF-ICLK (without the deep M-estimator) remains stuck at high error levels. Runtime on an RTX 2080 Ti is ≈ 145 ms per image pair, supporting real-time operation (Hinzmann et al., 2021).

7. Advantages, Limitations, and Context

Advantages:

  • Learned features and M-estimator enable robustness against outliers, large illumination changes, and specularities, extending the basin of convergence compared to purely analytic ICLK.
  • Only sparse depths in the reference frame are required, matching visual-inertial odometry/SLAM data assumptions.
  • GPU-optimized, batched implementation supports end-to-end training and deployment at practical speeds.
  • Optional per-feature alignment and bundle adjustment yield accuracy approaching or exceeding that of classical, heavier photometric bundle adjustment.

Limitations:

  • Requires reasonable initialization (within tens of pixels) to avoid local minima, although the deep weighting greatly extends the working range.
  • Sensitive to reference depth quality; degraded or missing depth for many features impairs performance.
  • CNN feature computation increases memory and compute load, though this is offset by the efficiency of the inverse compositional update and GPU acceleration.

Within the landscape of learned and hybrid direct image alignment methods, SD-6DoF-ICLK exemplifies the integration of deep learning-based robustness with the analytic and computational efficiencies of inverse compositional Lucas–Kanade on SE(3), targeting sparse but geometrically accurate 3D vision pipelines for robotic and SLAM tasks (Hinzmann et al., 2021).
