DefVINS: VIO in Deformable Scenes

Updated 5 January 2026

DefVINS is a visual-inertial odometry framework that fuses rigid, IMU-anchored estimation with embedded deformation graphs to track non-rigid environments.
It employs a tightly-coupled sliding-window optimizer with conditioning-driven activation to manage rigid and non-rigid state components effectively.
Empirical studies demonstrate up to 80% reduction in trajectory errors under high-deformation conditions compared to traditional VIO systems.

DefVINS is a visual-inertial odometry framework designed to address state estimation in deformable scenes, where classical rigidity assumptions are violated and traditional VIO systems exhibit drift or overfit to non-rigid motion. DefVINS integrates a rigid, IMU-anchored state estimator with a non-rigid deformation module based on an embedded deformation graph. Its architecture, initialization pipeline, mathematical and optimization framework, observability analysis, and conditioning-driven activation strategy collectively enhance robustness and accuracy in environments exhibiting non-rigidity, as validated by quantitative ablation studies and benchmark datasets (Cerezo et al., 2 Jan 2026).

1. System Architecture and Initialization

DefVINS consists of two subsystems operating in a tightly-coupled sliding-window optimization:

Rigid, IMU-Anchored Estimator: Processes high-rate IMU data (accelerometer, gyroscope) and keyframe image poses; optimizes camera poses $\{R_t, p_t\}$, velocities $v_t$, IMU biases $(b^g, b^a)$, and gravity direction $\hat{g}$.
Non-Rigid Deformation Module (Embedded Deformation Graph): Builds a sparse deformation graph from long-term feature tracks, estimating per-node 3D positions $\{x_i^t\}$ for each keyframe.

System initialization invokes a standard rigid VIO pipeline akin to VINS-Mono (Qin et al., 2017; Wu, 2019). Initial estimates of scale, gravity direction, gyroscope bias, accelerometer bias, and velocity are held fixed during the early keyframes, while the non-rigid degrees of freedom remain locked. Non-rigid parameters are activated only once the marginal Hessian of the rigid subsystem is sufficiently well conditioned (condition number below a threshold $\lambda_0$), which prevents the problem from becoming ill-posed early on.

2. Mathematical Formulation

The state vector in DefVINS for a window of $N$ keyframes is:

$\xi = \Big[ R_{t_k}, v_{t_k}, p_{t_k} \Big]_{k=0}^{N-1} \oplus \Big[ b^g, b^a, \hat{g} \Big] \oplus \xi_\mathrm{NR}$

where $\xi_\mathrm{NR}$ comprises the deformation graph node positions.

Rigid Subsystem

IMU kinematics follow:

$\dot{R}(t) = R(t)(\omega(t) - b^g(t) - n^g(t))^\wedge, \quad \dot{v}(t) = g + R(t)(a(t) - b^a(t) - n^a(t)), \quad \dot{p}(t) = v(t)$

Preintegration yields $\Delta\widetilde{R}_{ij}$, $\Delta\widetilde{v}_{ij}$, $\Delta\widetilde{p}_{ij}$ with corresponding residuals:

$r_{\Delta R} = \mathrm{Log}\left(\Delta\widetilde{R}_{ij}^\top R_i^\top R_j\right)$

$r_{\Delta v} = R_i^\top (v_j - v_i - g\Delta T_{ij}) - \Delta\widetilde{v}_{ij}$

$r_{\Delta p} = R_i^\top \left(p_j - p_i - v_i\Delta T_{ij} - \frac{1}{2}g\Delta T_{ij}^2\right) - \Delta\widetilde{p}_{ij}$
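The three preintegration residuals can be evaluated directly from their definitions. The following is a minimal NumPy sketch, not the authors' implementation: the helper names (`hat`, `so3_exp`, `so3_log`, `preintegration_residuals`), the gravity vector, and all numeric values are assumptions chosen for illustration, and bias-correction Jacobians are omitted.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix (the ^ operator)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Rodrigues formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-10:
        return np.eye(3) + hat(w)  # first-order approximation near identity
    K = hat(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def so3_log(R):
    """Rotation matrix -> rotation vector (valid away from theta = pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    vee = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-10:
        return 0.5 * vee
    return theta / (2.0 * np.sin(theta)) * vee

def preintegration_residuals(Ri, vi, p_i, Rj, vj, p_j, dR, dv, dp, g, dT):
    """Evaluate r_dR, r_dv, r_dp between keyframes i and j."""
    r_R = so3_log(dR.T @ Ri.T @ Rj)
    r_v = Ri.T @ (vj - vi - g * dT) - dv
    r_p = Ri.T @ (p_j - p_i - vi * dT - 0.5 * g * dT**2) - dp
    return r_R, r_v, r_p

# Illustrative values (assumed): build state j so that it matches the
# preintegrated measurement exactly; all residuals should then vanish.
g = np.array([0.0, 0.0, -9.81])
dT = 0.1
Ri = so3_exp(np.array([0.1, -0.2, 0.3]))
vi = np.array([0.5, 0.0, -0.1])
p_i = np.array([1.0, 2.0, 3.0])
dR = so3_exp(np.array([0.02, 0.01, -0.03]))
dv = np.array([0.05, -0.02, 0.98])
dp = np.array([0.004, 0.001, 0.05])
Rj = Ri @ dR
vj = vi + g * dT + Ri @ dv
p_j = p_i + vi * dT + 0.5 * g * dT**2 + Ri @ dp

r_R, r_v, r_p = preintegration_residuals(Ri, vi, p_i, Rj, vj, p_j, dR, dv, dp, g, dT)
print(np.linalg.norm(r_R), np.linalg.norm(r_v), np.linalg.norm(r_p))
```

Perturbing `Rj`, `vj`, or `p_j` away from the consistent values yields non-zero residuals, which is what the optimizer drives toward zero.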

Non-Rigid Deformation Graph

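Node positions are updated per keyframe as $x_i^t$. The deformation of a scene point $X$ is modeled as:

$W(X; \xi_\mathrm{NR}) = \sum_{i \in \mathcal{N}(X)} w_i(X) \big[X + \delta x_i^t\big]$

where $\mathcal{N}(X)$ is the set of the $K$ nearest nodes to $X$, $\delta x_i^t$ is the displacement of node $i$ at keyframe $t$, and the weights $w_i(X)$ are Gaussian and normalized over $\mathcal{N}(X)$.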
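The warp can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the published implementation: the node displacement is taken as $\delta x_i^t = x_i^t - x_i^0$, neighbors are selected by distance to the rest-pose nodes, and the function name `warp_point` and the values of `K` and `sigma` are hypothetical.

```python
import numpy as np

def warp_point(X, nodes_rest, nodes_now, K=4, sigma=0.1):
    """Warp scene point X using the K nearest deformation-graph nodes.

    nodes_rest: (M, 3) rest-pose node positions x_i^0 (assumed reference)
    nodes_now:  (M, 3) current node positions x_i^t
    """
    d2 = np.sum((nodes_rest - X) ** 2, axis=1)
    idx = np.argsort(d2)[:K]                      # K nearest nodes
    # Gaussian weights; subtracting the minimum avoids underflow and
    # cancels out after normalization.
    w = np.exp(-(d2[idx] - d2[idx].min()) / (2.0 * sigma**2))
    w /= w.sum()
    delta = nodes_now[idx] - nodes_rest[idx]      # assumed delta x_i^t
    return np.sum(w[:, None] * (X + delta), axis=0)

rng = np.random.default_rng(0)
nodes_rest = rng.uniform(-1.0, 1.0, size=(50, 3))
X = np.array([0.2, 0.1, -0.3])
shift = np.array([0.1, -0.2, 0.05])
# If every node moves by the same vector, the point moves by that vector too.
print(warp_point(X, nodes_rest, nodes_rest + shift) - X)
```

Because the weights sum to one, a uniform translation of all nodes translates the point by the same amount, and zero node displacement leaves it unchanged.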

Regularization terms include:

Elastic: Penalizes deviation of each edge length from its rest length:

$L_{ij}^\mathrm{elas} = k \frac{(\|x_i^t - x_j^t\| - \|x_i^0 - x_j^0\|)^2}{\|x_i^0 - x_j^0\|}$

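Viscous: Penalizes abrupt node motion by coupling the motion terms $s_i^t$, $s_j^t$ of neighboring nodes, weighted by their rest-pose proximity: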

$L_{ij}^\mathrm{visc} = b_{ij} \|s_i^t - s_j^t\|^2, \quad b_{ij} = \exp\left(-\frac{\|x_i^0 - x_j^0\|^2}{2\sigma^2}\right)$

Photometric: Semi-direct brightness constancy for nodes:

$L_i^\mathrm{photo} = \left(I^t(u_i^t) - \alpha_i I^{t-1}(u_i^{t-1}) - \beta_i\right)^2$
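The three regularizers above translate almost literally into code. The sketch below is illustrative only: the function names are hypothetical, and because the text does not define $s_i^t$, the viscous term simply takes the two per-node motion vectors as inputs.

```python
import numpy as np

def elastic_term(xi_t, xj_t, xi_0, xj_0, k=1.0):
    """k * (current length - rest length)^2 / rest length for one edge."""
    rest = np.linalg.norm(xi_0 - xj_0)
    cur = np.linalg.norm(xi_t - xj_t)
    return k * (cur - rest) ** 2 / rest

def viscous_term(si_t, sj_t, xi_0, xj_0, sigma=1.0):
    """b_ij * ||s_i - s_j||^2 with a Gaussian weight on rest-pose distance.

    si_t, sj_t are the per-node motion terms (not defined in the text;
    passed in as-is here).
    """
    b_ij = np.exp(-np.sum((xi_0 - xj_0) ** 2) / (2.0 * sigma**2))
    return b_ij * np.sum((si_t - sj_t) ** 2)

def photometric_term(I_t, I_prev, alpha=1.0, beta=0.0):
    """Squared brightness-constancy error with affine gain/offset."""
    return (I_t - alpha * I_prev - beta) ** 2

# Illustrative values (assumed): an edge of rest length 1.0 stretched to 1.2.
x0_i, x0_j = np.zeros(3), np.array([1.0, 0.0, 0.0])
xt_i, xt_j = np.zeros(3), np.array([1.2, 0.0, 0.0])
print(elastic_term(xt_i, xt_j, x0_i, x0_j, k=2.0))
print(viscous_term(np.array([0.1, 0.0, 0.0]), np.zeros(3), x0_i, x0_j, sigma=1.0))
print(photometric_term(0.9, 0.8, alpha=1.1, beta=0.02))
```

Dividing the elastic term by the rest length makes the penalty relative to edge size, so short and long edges are penalized comparably for the same fractional stretch.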

Joint Cost Function

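The total loss for the window is:

$\mathcal{L}(\xi) = \sum_{k=1}^{N-1} \Big\{ \|r_{\Delta R}^k\|_{\Sigma_R}^2 + \|r_{\Delta v}^k\|_{\Sigma_v}^2 + \|r_{\Delta p}^k\|_{\Sigma_p}^2 + \|r_g^k\|_{\Sigma_g}^2 + \sum_{m \in \mathcal{F}_k} \|r^v_{m, k}\|_{\Sigma_v'}^2 + \lambda_\mathrm{NR} L_\mathrm{NR}^k \Big\} + \mathcal{L}_\mathrm{prior}$

where $r_g^k$ is the gravity residual, $r^v_{m,k}$ is the visual residual of feature $m$ in the feature set $\mathcal{F}_k$ of keyframe $k$ (with covariance $\Sigma_v'$, distinct from the velocity covariance $\Sigma_v$), $\lambda_\mathrm{NR}$ weights the non-rigid term, $L_\mathrm{NR}^k$ sums the elastic, viscous, and photometric regularization over graph edges and nodes, and $\mathcal{L}_\mathrm{prior}$ is the prior term.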
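Each weighted norm in the cost is a squared Mahalanobis norm, $\|r\|_\Sigma^2 = r^\top \Sigma^{-1} r$. A minimal sketch of how the window cost could be assembled follows; the function names, the data layout of `keyframe_terms`, and the numeric values are assumptions made for illustration, and robust kernels are left out.

```python
import numpy as np

def sq_mahalanobis(r, Sigma):
    """||r||^2_Sigma = r^T Sigma^{-1} r, via a linear solve (no explicit inverse)."""
    return float(r @ np.linalg.solve(Sigma, r))

def window_cost(keyframe_terms, lambda_nr, prior=0.0):
    """Sum covariance-weighted residuals and weighted non-rigid costs.

    keyframe_terms: one entry per keyframe, each a pair
        (list of (residual, covariance) blocks, non-rigid cost L_NR^k)
    """
    total = prior
    for residual_blocks, l_nr in keyframe_terms:
        total += sum(sq_mahalanobis(r, S) for r, S in residual_blocks)
        total += lambda_nr * l_nr
    return total

# Illustrative values (assumed): two keyframes with one residual block each.
r = np.array([1.0, 2.0])
S = np.diag([1.0, 4.0])
terms = [([(r, S)], 0.5), ([(r, S)], 1.0)]
print(window_cost(terms, lambda_nr=0.1, prior=0.25))
```

Using `np.linalg.solve` instead of forming $\Sigma^{-1}$ is a common numerical choice; a production solver would typically whiten residuals with a square-root information matrix instead.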

3. Observability Analysis

DefVINS incorporates a detailed observability assessment:

Joint System Rank: Stacking the Jacobians of all residuals produces the observability matrix $\mathcal{O}$. For persistently exciting IMU motion and sufficient deformation graph coverage, the local system is observable up to a four-dimensional gauge freedom (global position and yaw about the gravity direction). Non-rigid modes are fully constrained except for this gauge (Cerezo et al., 2 Jan 2026).
Role of IMU Anchoring: IMU measurements determine metric scale and gravity direction, preventing deformation nodes from compensating for rigid drift, thus improving identifiability of both rigid and non-rigid state components.

Empirically, the joint system's conditioning (measured by the ratio of the smallest to the largest singular value) improves rapidly once both the IMU and non-rigid modules are active.
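The four unobservable directions can be checked numerically on the velocity and position preintegration residuals from Section 2: a global translation combined with a yaw rotation about gravity leaves them unchanged, whereas a roll rotation does not, because it misaligns the states with gravity. The sketch below is a sanity check under assumed values (gravity along the world z-axis, random states), not part of the published analysis; all function names are hypothetical.

```python
import numpy as np

def rot_z(a):
    """Rotation about the z-axis (yaw, assumed aligned with gravity)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(a):
    """Rotation about the x-axis (roll)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def vp_residuals(Ri, vi, p_i, vj, p_j, dv, dp, g, dT):
    """Stacked velocity and position preintegration residuals."""
    r_v = Ri.T @ (vj - vi - g * dT) - dv
    r_p = Ri.T @ (p_j - p_i - vi * dT - 0.5 * g * dT**2) - dp
    return np.concatenate([r_v, r_p])

def transform(G, t, Ri, vi, p_i, vj, p_j):
    """Apply a global rotation G and translation t to the states."""
    return G @ Ri, G @ vi, G @ p_i + t, G @ vj, G @ p_j + t

rng = np.random.default_rng(0)
g = np.array([0.0, 0.0, -9.81])
dT = 0.1
Ri = rot_z(0.3) @ rot_x(-0.2)
vi, p_i, vj, p_j, dv, dp = rng.normal(size=(6, 3))
t = np.array([5.0, -2.0, 1.0])

base = vp_residuals(Ri, vi, p_i, vj, p_j, dv, dp, g, dT)
yawed = vp_residuals(*transform(rot_z(0.7), t, Ri, vi, p_i, vj, p_j), dv, dp, g, dT)
rolled = vp_residuals(*transform(rot_x(0.7), t, Ri, vi, p_i, vj, p_j), dv, dp, g, dT)
print(np.allclose(base, yawed), np.allclose(base, rolled))  # prints: True False
```

The rotation residual depends only on $R_i^\top R_j$, which is unchanged by any global rotation, so it is the gravity-dependent velocity and position terms that make roll and pitch observable.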

4. Conditioning-Driven Progressive Activation

DefVINS employs a conditioning-based strategy for non-rigid node activation:

Activation Metric: The condition number $\kappa(H_\mathrm{NR})$ of the non-rigid Hessian block is computed at each keyframe. New deformation nodes are unlocked and incorporated into the optimization only when $\kappa(H_\mathrm{NR}) < \kappa_\mathrm{thresh}$ (typically $10^8$), ensuring well-posed estimation.
Optimization Loop: The sliding-window optimizer uses Ceres Solver (Levenberg–Marquardt, automatic differentiation) with robust Huber kernels, marginalizes the oldest keyframes, and maintains a sparse deformation graph (≈200 nodes). Running single-threaded, a window update takes about 20 ms at a keyframe rate of about 10 Hz.
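The activation test amounts to a condition-number check on a Hessian block. The following NumPy sketch shows one way to write such a gate; the function names and the example matrices are hypothetical, and the actual system may compute the metric differently (for example on a marginalized or scaled block).

```python
import numpy as np

def condition_number(H):
    """Ratio of largest to smallest singular value; inf if H is singular."""
    s = np.linalg.svd(H, compute_uv=False)  # sorted in descending order
    return np.inf if s[-1] == 0.0 else float(s[0] / s[-1])

def should_activate(H_nr, kappa_thresh=1e8):
    """Unlock non-rigid nodes only when the block is well conditioned."""
    return condition_number(H_nr) < kappa_thresh

# Illustrative Hessian blocks (assumed values).
H_good = np.diag([1.0, 0.5, 0.1])      # kappa = 10
H_bad = np.diag([1.0, 1e-10])          # kappa = 1e10
H_singular = np.diag([1.0, 0.0])       # rank deficient

for name, H in [("good", H_good), ("bad", H_bad), ("singular", H_singular)]:
    print(name, condition_number(H), should_activate(H))
```

Treating an exactly singular block as infinitely ill-conditioned keeps the gate closed when some deformation modes are entirely unconstrained.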

5. Experimental Evaluation

DefVINS was benchmarked on both synthetic and real datasets:

Synthetic (Drunkard’s Dataset): 19 RGB-D sequences, 4 deformation levels (L0–L3). Full DefVINS outperformed visual-only, rigid VIO, and NR-SLAM baselines by 30–50% as deformation increased, with errors at L3 (extreme deformation) dropping from 53.1 mm (ORB-SLAM3) to 19.6 mm (DefVINS) (Cerezo et al., 2 Jan 2026).
Real RGB-D-IMU: 7 real sequences with varying deformability. DefVINS sustained >80% trajectory coverage under high deformation, while ORB-SLAM3 failed (<20% tracking) in these settings. ATE reduction was approximately 80% for high-deformation cloth sequences.
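For reference, absolute trajectory error (ATE) is commonly reported as the root-mean-square of per-pose position errors, and a relative reduction follows from two such figures. The sketch below assumes the estimated and ground-truth trajectories are already time-associated and aligned, and that the quoted L3 errors are comparable trajectory-error values; the function names are hypothetical.

```python
import numpy as np

def ate_rmse(est, gt):
    """RMSE of position errors between aligned (N, 3) trajectories."""
    err = np.linalg.norm(est - gt, axis=1)
    return float(np.sqrt(np.mean(err**2)))

def relative_reduction(baseline, ours):
    """Fractional error reduction relative to a baseline."""
    return (baseline - ours) / baseline

# Toy trajectory (assumed): a constant offset of (3, 4, 0) gives an RMSE of 5.
gt = np.zeros((10, 3))
est = gt + np.array([3.0, 4.0, 0.0])
print(ate_rmse(est, gt))

# L3 figures quoted above: 53.1 mm (ORB-SLAM3) vs. 19.6 mm (DefVINS).
print(round(relative_reduction(53.1, 19.6), 3))
```

With the quoted L3 values, the reduction relative to ORB-SLAM3 works out to roughly 63%.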

Ablation studies confirmed the necessity of joint rigid/non-rigid estimation: visual-only versions exhibited severe drift; rigid-only IMU stabilization failed to track deformations; full DefVINS achieved consistent global and local accuracy.

6. Context, Limitations, and Prospective Extensions

Scope and Limitations: The framework assumes sparse graph structure and well-behaved singular value spectra; extreme under-excitation or poor node distribution may still degrade performance. Deformation node activation depends critically on the conditioning metric and IMU excitation.
Potential Extensions: Integrating higher-order geometric constraints, denser graph connectivity, adaptive regularization parameters, or multi-modal sensor fusion (e.g., depth, tactile arrays) could further improve performance in severely non-rigid or textured scenes.
Comparative Impact: In the reported experiments, DefVINS outperforms the compared baselines in drift suppression and deformation tracking at metric scale. This robustness is attributed to the interplay of IMU anchoring and conditioning-aware non-rigid estimation (Cerezo et al., 2 Jan 2026).

In summary, DefVINS combines a tightly-coupled sliding-window optimizer, an observability analysis, and conditioning-based control of non-rigid degrees of freedom to obtain low-drift, metric-scale trajectory estimates in deformable environments (Cerezo et al., 2 Jan 2026).