
SD-6DoF-ICLK: Sparse & Deep SE(3) Alignment

Updated 17 December 2025
  • The paper introduces SD-6DoF-ICLK, a method that integrates analytic ICLK with deep feature weighting to deliver robust 6DoF image alignment.
  • It leverages sparse depth and CNN feature pyramids to optimize SE(3) poses efficiently, significantly improving accuracy over classical approaches.
  • Empirical results show subpixel accuracy and real-time performance, making it highly suitable for SLAM and robotic vision in challenging conditions.

Sparse and Deep Inverse Compositional Lucas-Kanade on SE(3) (SD-6DoF-ICLK) is a modern variant of the inverse compositional Lucas-Kanade (ICLK) algorithm designed to solve robust, accurate image alignment and pose estimation problems in 3D using sparse depth information and deep learning. By integrating the analytic efficiencies of ICLK on SE(3) with learned feature embeddings and robust deep weighting, SD-6DoF-ICLK achieves fast, real-time performance and strong robustness under challenging visual conditions, even when only sparse depth is available in one view (Hinzmann et al., 2021).

1. Overview and Optimization Objective

SD-6DoF-ICLK solves for the relative 6-DoF transformation (an element of the Lie group SE(3)) between two RGB images, where the reference image is annotated with a sparse set of 2D features and associated inverse depths (e.g., derived from visual-inertial odometry or SLAM front-ends), and the target image is depthless. The goal is to optimize the SE(3) twist vector $\xi \in \mathbb{R}^6$ such that a robustified photometric loss is minimized:

$$E(\xi) = \sum_{i=1}^n \rho\big( I_0(x_i) - I_1(W(x_i; \xi)) \big)$$

where $W(x; \xi)$ uses the inverse depth $z_i$ of feature $x_i$ to back-project $x_i$ to 3D in $I_0$'s camera frame, transforms it with $T = \exp(\xi) \in \mathrm{SE}(3)$, and reprojects into $I_1$:

$$W(x; \xi) = \pi\big( \exp(\xi)\, T_{C_0 \rightarrow C_1}\, \pi^{-1}(x, 1/z) \big)$$

Here, $\pi^{-1}$ and $\pi$ denote camera back-projection and projection given the intrinsics. The robust loss $\rho$ is typically an M-estimator predicted by a learned network.
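A minimal NumPy sketch of the warp $W(x; \xi)$, assuming identity extrinsics $T_{C_0 \rightarrow C_1}$ and a pinhole intrinsics matrix `K`; `se3_exp` is a standard closed-form exponential map written for illustration, not code from the paper:

```python
import numpy as np

def se3_exp(xi):
    """Map a twist xi = (v, w) in R^6 to a 4x4 SE(3) matrix (Rodrigues form).
    Hypothetical helper for illustration; twist ordered translation-first."""
    v, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    W = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])
    if theta < 1e-8:
        R, V = np.eye(3) + W, np.eye(3) + 0.5 * W
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (1.0 - A) / theta**2
        R = np.eye(3) + A * W + B * (W @ W)
        V = np.eye(3) + B * W + C * (W @ W)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T

def warp(x, inv_depth, xi, K):
    """W(x; xi): back-project pixel x with its inverse depth, transform by
    exp(xi), and reproject (extrinsics assumed identity in this sketch)."""
    z = 1.0 / inv_depth
    X = z * np.linalg.inv(K) @ np.array([x[0], x[1], 1.0])  # pi^{-1}(x, 1/z)
    Xc = (se3_exp(xi) @ np.append(X, 1.0))[:3]              # exp(xi) X
    u = K @ (Xc / Xc[2])                                    # pi(.)
    return u[:2]
```

With $\xi = 0$ the warp is the identity on the image, which is the configuration at which the inverse compositional Jacobians are evaluated.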

2. Inverse-Compositional SE(3) Update Strategy

SD-6DoF-ICLK leverages the inverse compositional formulation to maximize computational efficiency. At each iteration:

  1. Residual Computation: Compute $r_i(\xi) = I_0(x_i) - I_1(W(x_i; \xi))$.
  2. Jacobian Assembly: At $\xi = 0$, build $J_i = \frac{\partial I_1}{\partial W} \cdot \frac{\partial W}{\partial \xi}$ for each residual; stack to form $J \in \mathbb{R}^{N \times 6}$.
  3. Normal Equations: With per-point weights $w_i$ predicted by a convolutional M-estimator network, solve:

$$\Delta\xi = \left( J^\top W J + \lambda D \right)^{-1} J^\top W r$$

$W$ is diagonal with entries $w_1, \ldots, w_n$, $D$ approximates the Hessian of $\rho$, and $\lambda$ is the Levenberg–Marquardt damping factor.

  4. Inverse Composition: The pose is updated via

$$T_{\text{new}} = T_{\text{cur}} \circ \exp(\Delta\xi)^{-1}$$

Crucially, all gradients, Jacobians, and Hessians are pre-computed at the template frame ($\xi = 0$), enabling significant computational reuse and efficiency.
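The normal-equation solve in step 3 can be sketched in NumPy as below; note that the diagonal scaling used for $D$ here is a common Levenberg–Marquardt choice standing in for the Hessian approximation of $\rho$:

```python
import numpy as np

def iclk_delta(J, r, w, lam=1e-3):
    """Solve the weighted, damped normal equations for the SE(3) increment.
    J: (N, 6) template-frame Jacobian, r: (N,) residuals, w: (N,) per-point
    weights from the (learned) M-estimator. Returns the twist delta-xi; the
    pose is then updated by T_new = T_cur o exp(delta-xi)^{-1}."""
    JtW = J.T * w                      # J^T W with W = diag(w)
    H = JtW @ J                        # 6x6 Gauss-Newton approximation
    D = np.diag(np.diag(H))            # LM scaling (one common choice for D)
    return np.linalg.solve(H + lam * D, JtW @ r)
```

Because the Jacobian is assembled once at $\xi = 0$, only the residual vector $r$ and the weights change between iterations, which is where the inverse compositional formulation saves work.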

3. Integration of Sparse Depth

Sparse depths $z_i$ attached to the reference-frame features $x_i$ are fundamental. Each feature's depth is used to lift its 2D image coordinates to a 3D point in the reference camera coordinate system, forming $X_i = \pi^{-1}(x_i, 1/z_i)$. Depths enter the optimization through:

  • The 3D warp: $W(x; \xi) = \pi(\exp(\xi) X)$
  • The Jacobian: $\partial W / \partial \xi$ depends on $X_i$, with deeper points yielding proportionally smaller image motion for a given translation.

No explicit depth regularization is required; all geometric linkage occurs through warping and Jacobian construction.
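The depth dependence of the Jacobian can be made concrete with the standard pinhole point Jacobian evaluated at $\xi = 0$ (a textbook derivation rather than the paper's code; twist ordered translation-first, with the small-motion convention $\delta X = v + \omega \times X$):

```python
import numpy as np

def warp_jacobian(X, f):
    """2x6 Jacobian of the projected pixel w.r.t. the twist (v, w), evaluated
    at identity, for a point X = (X, Y, Z) in the reference camera frame and
    focal length f. The translation columns scale with 1/Z, so deeper points
    move less in the image for the same translation."""
    x, y, Z = X[0] / X[2], X[1] / X[2], X[2]
    Jt = (f / Z) * np.array([[1.0, 0.0, -x],
                             [0.0, 1.0, -y]])          # w.r.t. translation v
    Jr = f * np.array([[-x * y, 1.0 + x * x, -y],
                       [-(1.0 + y * y), x * y, x]])    # w.r.t. rotation w
    return np.hstack([Jt, Jr])
```

Doubling $Z$ while keeping the normalized coordinates fixed halves the translation block and leaves the rotation block unchanged, which is exactly the depth linkage described above.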

4. CNN Feature Pyramid and Learned Robust M-Estimator

To overcome illumination change, occlusion, and outlier susceptibility, SD-6DoF-ICLK replaces raw pixel intensities with deep feature pyramids and applies a per-pixel learned weight (robust M-estimator):

  • Four-level convolutional neural network (CNN) pyramids $f_\ell(\cdot)$ convert $I_0, I_1$ into dense feature maps $F_0^{(\ell)}, F_1^{(\ell)}$ (with typically $C = 64$ or $128$ channels).
  • $F_1^{(\ell)}$ is warped under $W$ into the reference frame to compute feature residuals $r^{(\ell)} = F_0^{(\ell)} - F_1^{(\ell)} \circ W$ at each pyramid level.
  • A small, fully convolutional network $h(\cdot)$ (e.g., two conv-ReLU layers + a $1 \times 1$ conv + sigmoid) predicts a per-feature weight $w^{(\ell)} \in (0, 1)$, acting as a learned M-estimator for robust weighting.

All normal equations and optimization are applied at every scale in the feature pyramid (typically four scales), supporting coarse-to-fine optimization and better convergence.
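For intuition about what the learned M-estimator $h(\cdot)$ does, a hand-crafted Huber-style weight can stand in for it: residuals below a threshold receive full weight, larger ones are down-weighted. This is purely illustrative; in SD-6DoF-ICLK the weights come from a CNN rather than a fixed formula:

```python
import numpy as np

def huber_weight(r, delta=0.3):
    """Hand-crafted robust weights in (0, 1]: stand-in for the learned
    convolutional M-estimator h(.). Inliers (|r| <= delta) get weight 1;
    outliers are down-weighted as delta / |r|. `delta` is illustrative."""
    a = np.abs(r)
    return np.where(a <= delta, 1.0, delta / a)
```

Plugged into the normal equations as $W = \mathrm{diag}(w_i)$, such weights suppress the influence of occluded or specular pixels, which is the role the learned network plays across all pyramid levels.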

5. Optional Per-Feature Alignment and Bundle Adjustment

After multi-scale SD-6DoF-ICLK, further refinement is possible via:

  • 2D Patch Alignment: For each sparse match, a local 2D inverse compositional Lucas–Kanade aligner on the image patch centered at $x_i$ refines its coordinate by $\delta x_i$ to subpixel accuracy.
  • Bundle Adjustment (BA): Holding refined feature positions fixed, jointly optimize $\xi$ and optionally the depths $\{z_i\}$ by minimizing

$$E_{BA}(T, \{z_i\}) = \sum_{i} \rho\big(\|u_i - \pi(T\, \pi^{-1}(x_i, 1/z_i))\|_2^2\big)$$

with robust Cauchy loss and Levenberg–Marquardt optimization (e.g., in GTSAM). This final step achieves subpixel and sub-centimeter alignment accuracy (Hinzmann et al., 2021).
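A minimal NumPy sketch of the BA objective $E_{BA}$ with a Cauchy loss, holding depths fixed (helper names are hypothetical; a real implementation would use GTSAM's robust noise models and its Levenberg–Marquardt optimizer rather than this direct evaluation):

```python
import numpy as np

def cauchy(s2, c=1.0):
    """Cauchy robust loss rho applied to a squared reprojection error s2."""
    return 0.5 * c * c * np.log1p(s2 / (c * c))

def ba_cost(T, K, pts2d_ref, inv_depths, obs, c=1.0):
    """E_BA: robustified sum of squared reprojection errors for pose T (4x4),
    intrinsics K, reference pixels with inverse depths, and the refined
    observations `obs` in the target image (depths held fixed in this sketch)."""
    E = 0.0
    Kinv = np.linalg.inv(K)
    for x, iz, u in zip(pts2d_ref, inv_depths, obs):
        X = (1.0 / iz) * Kinv @ np.array([x[0], x[1], 1.0])   # pi^{-1}(x, 1/z)
        Xc = (T @ np.append(X, 1.0))[:3]                      # T X
        uh = K @ (Xc / Xc[2])                                 # pi(T X)
        E += cauchy(np.sum((u - uh[:2]) ** 2), c)
    return E
```

The cost is zero exactly when every refined observation coincides with the reprojection of its reference point, which is the fixed point the LM iterations drive toward.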

6. Empirical Performance and Characteristics

Experiments on synthetic satellite imagery at $752 \times 480$ resolution demonstrate:

| | Mean pixel error | Mean translation error | Mean rotation error |
|---|---|---|---|
| Initial (random guess) | ≈ 34 px | 4.93 m | 0.075 rad |
| SD-6DoF-ICLK alone | 1.29 px | 3.26 m | 0.020 rad |
| After per-feature alignment | 0.41 px | 3.26 m | 0.020 rad |
| After full BA | 0.12 px | 0.089 m | 0.000 rad |

The classical sparse 6DoF-ICLK (without the deep M-estimator) remains stuck at high error levels. Runtime on an RTX 2080 Ti is ≈ 145 ms per image pair, supporting real-time operation (Hinzmann et al., 2021).

7. Advantages, Limitations, and Context

Advantages:

  • Learned features and M-estimator enable robustness against outliers, large illumination changes, and specularities, extending the basin of convergence compared to purely analytic ICLK.
  • Only sparse depths in the reference frame are required, matching visual-inertial odometry/SLAM data assumptions.
  • GPU-optimized, batched implementation supports end-to-end training and deployment at practical speeds.
  • Optional per-feature alignment and bundle adjustment yield accuracy approaching or exceeding that of classical, heavier photometric bundle adjustment.

Limitations:

  • Requires reasonable initialization (within tens of pixels) to avoid local minima, although the deep weighting greatly extends the working range.
  • Sensitive to reference depth quality; degraded or missing depth for many features impairs performance.
  • CNN feature computation increases memory and compute load, though this is offset by the efficiency of the inverse compositional update and GPU acceleration.

Within the landscape of learned and hybrid direct image alignment methods, SD-6DoF-ICLK exemplifies the integration of deep learning-based robustness with the analytic and computational efficiencies of inverse compositional Lucas–Kanade on SE(3), targeting sparse but geometrically accurate 3D vision pipelines for robotic and SLAM tasks (Hinzmann et al., 2021).
