DefVINS: VIO in Deformable Scenes
- DefVINS is a visual-inertial odometry framework that fuses rigid, IMU-anchored estimation with embedded deformation graphs to track non-rigid environments.
- It employs a tightly-coupled sliding-window optimizer with conditioning-driven activation to manage rigid and non-rigid state components effectively.
- Empirical studies demonstrate up to 80% reduction in trajectory errors under high-deformation conditions compared to traditional VIO systems.
DefVINS is a visual-inertial odometry framework designed to address state estimation in deformable scenes, where classical rigidity assumptions are violated and traditional VIO systems exhibit drift or overfit to non-rigid motion. DefVINS integrates a rigid, IMU-anchored state estimator with a non-rigid deformation module based on an embedded deformation graph. Its architecture, initialization pipeline, mathematical and optimization framework, observability analysis, and conditioning-driven activation strategy collectively enhance robustness and accuracy in environments exhibiting non-rigidity, as validated by quantitative ablation studies and benchmark datasets (Cerezo et al., 2 Jan 2026).
1. System Architecture and Initialization
DefVINS consists of two subsystems operating in a tightly-coupled sliding-window optimization:
- Rigid, IMU-Anchored Estimator: Processes high-rate IMU data (accelerometer, gyroscope) and keyframe image poses; optimizes camera poses, velocities, IMU biases, and gravity direction.
- Non-Rigid Deformation Module (Embedded Deformation Graph): Builds a sparse deformation graph from long-term feature tracks, estimating per-node 3D positions for each keyframe.
System initialization invokes a standard rigid VIO pipeline akin to VINS-Mono (Qin et al., 2017; Wu, 2019). Initial estimates for scale, gravity direction, gyro bias, accelerometer bias, and velocity are fixed during early keyframes while non-rigid degrees of freedom are "locked". Non-rigid parameters are activated only after the marginal Hessian of the rigid subsystem is sufficiently well conditioned (its condition number falls below a preset threshold), which prevents early ill-posedness.
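The conditioning check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the index layout of the Hessian, and the threshold `kappa_max` are assumptions, and the marginal Hessian is obtained with a plain Schur complement.

```python
import numpy as np

def marginal_hessian(H, keep):
    """Schur complement of H onto the `keep` indices (all other states are marginalized)."""
    keep = np.asarray(keep)
    drop = np.setdiff1d(np.arange(H.shape[0]), keep)
    H_kk = H[np.ix_(keep, keep)]
    if drop.size == 0:
        return H_kk
    H_kd = H[np.ix_(keep, drop)]
    H_dd = H[np.ix_(drop, drop)]
    # H_kk - H_kd * H_dd^{-1} * H_dk, using a solve instead of an explicit inverse
    return H_kk - H_kd @ np.linalg.solve(H_dd, H_kd.T)

def rigid_block_ready(H, rigid_idx, kappa_max):
    """True when the marginal rigid Hessian has a 2-norm condition number below kappa_max.

    kappa_max is a hypothetical tuning parameter; the source does not state its value here.
    """
    return bool(np.linalg.cond(marginal_hessian(H, rigid_idx)) < kappa_max)
```

In such a scheme the non-rigid states would stay locked until `rigid_block_ready` first returns `True`.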
2. Mathematical Formulation
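The state vector in DefVINS for a sliding window of keyframes stacks the rigid states (camera poses, velocities, IMU biases, and gravity direction) together with the non-rigid states, which comprise the deformation graph node positions.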
Rigid Subsystem
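IMU kinematics follow the usual strapdown model, in which bias-corrected accelerometer and gyroscope measurements drive position, velocity, and orientation. Preintegrating the IMU measurements between consecutive keyframes yields relative position, velocity, and rotation increments, each with a corresponding residual in the optimization.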
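As an illustration of preintegration, the sketch below accumulates rotation, velocity, and position increments from raw IMU samples. It is a simplified version of the standard technique under stated assumptions: fixed sample interval `dt`, constant biases over the interval, simple forward integration, and no noise or bias-Jacobian propagation. The function names are illustrative and are not taken from the source.

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues formula: rotation vector (3,) -> rotation matrix (3, 3)."""
    theta = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])  # skew-symmetric matrix of phi
    if theta < 1e-9:
        return np.eye(3) + K  # first-order approximation for tiny angles
    return (np.eye(3)
            + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * (K @ K))

def preintegrate(gyro, accel, dt, bg=np.zeros(3), ba=np.zeros(3)):
    """Accumulate increments (dR, dv, dp) expressed in the frame of the first sample.

    Gravity is deliberately not included: it enters later, in the residual.
    """
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(gyro, accel):
        a_corr = dR @ (a - ba)                     # bias-corrected specific force
        dp = dp + dv * dt + 0.5 * a_corr * dt**2   # position increment
        dv = dv + a_corr * dt                      # velocity increment
        dR = dR @ so3_exp((w - bg) * dt)           # rotation increment
    return dR, dv, dp
```

A residual would then compare these increments with the change in pose and velocity predicted by the two keyframe states, after accounting for gravity over the interval.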
Non-Rigid Deformation Graph
Node positions are updated at each keyframe. The deformation of a scene point is modeled as a weighted combination of the motion of its nearest graph nodes, where the weights are Gaussian and normalized over those nearest nodes.
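A minimal sketch of such a warp is given below. It assumes a translation-only graph (each node carries only a 3D position, as in the description above), Gaussian weights in the rest-pose distance, and hypothetical parameters `k` and `sigma` whose values the source does not specify here.

```python
import numpy as np

def warp_point(x, nodes_rest, nodes_now, k=4, sigma=0.1):
    """Warp scene point x (3,) using the displacements of its k nearest graph nodes.

    nodes_rest, nodes_now: (N, 3) node positions at rest and at the current keyframe.
    Returns the warped point and the normalized weights.
    """
    d2 = np.sum((nodes_rest - x) ** 2, axis=1)       # squared distance to every node
    nn = np.argsort(d2)[:k]                          # indices of the k nearest nodes
    # Subtracting the minimum avoids underflow and cancels out after normalization.
    w = np.exp(-(d2[nn] - d2[nn].min()) / (2.0 * sigma**2))
    w = w / w.sum()                                  # normalize over the nearest nodes
    return x + w @ (nodes_now[nn] - nodes_rest[nn]), w
```

Because the weights sum to one, a uniform translation of all nodes moves every scene point by exactly that translation, which is a useful sanity check for any implementation of this kind.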
Regularization terms include:
- Elastic: Maintains the rest lengths of graph edges.
- Viscous: Penalizes abrupt node motion between keyframes.
- Photometric: Enforces semi-direct brightness constancy at the nodes.
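The elastic and viscous terms can be illustrated with the following sketch. The exact residual definitions and weights used by DefVINS are not reproduced here; the forms below (edge-length deviation and node displacement between consecutive keyframes) are assumptions consistent with the descriptions above, and the photometric term is omitted because it requires image data.

```python
import numpy as np

def elastic_residuals(nodes_now, nodes_rest, edges):
    """Per-edge deviation of the current edge length from its rest length.

    edges: (E, 2) integer array of node index pairs.
    """
    i, j = edges[:, 0], edges[:, 1]
    len_now = np.linalg.norm(nodes_now[i] - nodes_now[j], axis=1)
    len_rest = np.linalg.norm(nodes_rest[i] - nodes_rest[j], axis=1)
    return len_now - len_rest

def viscous_residuals(nodes_now, nodes_prev):
    """Stacked node displacements between two consecutive keyframes."""
    return (nodes_now - nodes_prev).ravel()
```

The elastic residual vanishes under any rigid motion of the graph, whereas the viscous residual penalizes all node motion between keyframes, so the two terms play complementary roles.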
Joint Cost Function
The total loss for the window combines the residuals of the rigid subsystem with a deformation term that sums the elastic, viscous, and photometric regularization over graph edges and nodes.
3. Observability Analysis
DefVINS incorporates a detailed observability assessment:
- Joint System Rank: Stacking the Jacobians of all residuals produces the observability matrix. For persistently exciting IMU motion and sufficient deformation graph coverage, the local system is observable up to a four-dimensional gauge freedom (global position and yaw). Non-rigid modes are fully constrained except for this gauge (Cerezo et al., 2 Jan 2026).
- Role of IMU Anchoring: IMU measurements determine metric scale and gravity direction, preventing deformation nodes from compensating for rigid drift, thus improving identifiability of both rigid and non-rigid state components.
Empirically, the joint system's conditioning (measured by the ratio between its smallest and largest singular values) improves rapidly once both the IMU and non-rigid modules are active.
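The rank argument can be checked numerically on any stacked Jacobian. The sketch below counts the unobservable directions as the number of columns minus the numerical rank; the function name and the relative tolerance are illustrative choices, and the toy problem (relative measurements between 1D points) only stands in for the full system.

```python
import numpy as np

def gauge_dimension(J, rel_tol=1e-9):
    """Number of unobservable directions: columns of J minus its numerical rank."""
    s = np.linalg.svd(J, compute_uv=False)       # singular values, descending
    rank = int(np.sum(s > rel_tol * s[0]))
    return J.shape[1] - rank

# Toy example: three 1D points observed only through pairwise differences.
J_relative = np.array([[-1.0, 1.0, 0.0],
                       [0.0, -1.0, 1.0],
                       [-1.0, 0.0, 1.0]])
# Adding one absolute measurement of the first point fixes the global shift.
J_anchored = np.vstack([J_relative, [1.0, 0.0, 0.0]])
```

Here `gauge_dimension(J_relative)` is 1 (a global translation cannot be observed) and `gauge_dimension(J_anchored)` is 0. For a system with the four-dimensional gauge described above, the same computation on its stacked Jacobian would be expected to return 4.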
4. Conditioning-Driven Progressive Activation
DefVINS employs a conditioning-based strategy for non-rigid node activation:
- Activation Metric: The condition number of the non-rigid Hessian block is computed at each keyframe. New deformation nodes are unlocked and incorporated into the optimization only when this condition number falls below a preset threshold, ensuring well-posed estimation.
- Optimization Loop: The sliding-window optimizer uses Ceres Solver (Levenberg–Marquardt with automatic differentiation) and robust Huber kernels, marginalizes the oldest keyframes, and maintains a sparse deformation graph (≈200 nodes). A single-threaded implementation takes about 20 ms per window update at a keyframe rate of about 10 Hz.
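One way to realize progressive activation is a greedy pass over candidate nodes, sketched below. This is an assumption-laden illustration rather than the published procedure: the source does not say in what order nodes are tested or whether they are tested one at a time, and `kappa_max` is a hypothetical threshold. The non-rigid Hessian block is assumed to be ordered node by node with three degrees of freedom each.

```python
import numpy as np

def select_active_nodes(H_nr, kappa_max, dof_per_node=3):
    """Greedily unlock nodes while the active sub-block stays well conditioned.

    H_nr: (3N, 3N) non-rigid Hessian block, ordered node by node.
    Returns the list of node indices that end up active.
    """
    n_nodes = H_nr.shape[0] // dof_per_node
    active = []
    for n in range(n_nodes):
        trial = active + [n]
        idx = np.concatenate([np.arange(dof_per_node * m, dof_per_node * (m + 1))
                              for m in trial])
        block = H_nr[np.ix_(idx, idx)]
        # Keep the node only if the enlarged block is still below the threshold.
        if np.linalg.cond(block) < kappa_max:
            active = trial
    return active
```

Nodes that fail the test simply remain locked and can be retried at a later keyframe, once more observations have accumulated in the Hessian.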
5. Experimental Evaluation
DefVINS was benchmarked on both synthetic and real datasets:
- Synthetic (Drunkard’s Dataset): 19 RGB-D sequences, 4 deformation levels (L0–L3). Full DefVINS outperformed visual-only, rigid VIO, and NR-SLAM baselines by 30–50% as deformation increased, with errors at L3 (extreme deformation) dropping from 53.1 mm (ORB-SLAM3) to 19.6 mm (DefVINS) (Cerezo et al., 2 Jan 2026).
- Real RGB-D-IMU: 7 real sequences with varying deformability. DefVINS sustained >80% trajectory coverage under high deformation, while ORB-SLAM3 failed (<20% tracking) in these settings. ATE reduction was approximately 80% for high-deformation cloth sequences.
Ablation studies confirmed the necessity of joint rigid/non-rigid estimation: visual-only versions exhibited severe drift; rigid-only IMU stabilization failed to track deformations; full DefVINS achieved consistent global and local accuracy.
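For reading the figures above, the sketch below shows how a trajectory error and a relative error reduction are typically computed. It assumes the estimated and ground-truth trajectories are already time-associated and aligned, and it uses the common RMSE form of absolute trajectory error; whether the source uses exactly this metric for every reported number is an assumption.

```python
import numpy as np

def ate_rmse(est, gt):
    """RMSE of position error between aligned trajectories, each of shape (N, 3)."""
    return float(np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1))))

def relative_reduction(baseline_err, new_err):
    """Fraction by which new_err improves on baseline_err."""
    return (baseline_err - new_err) / baseline_err

# L3 errors quoted above for the synthetic sequences, in millimetres.
reduction_l3 = relative_reduction(53.1, 19.6)
```

With the quoted L3 values, `reduction_l3` evaluates to about 0.63, that is, roughly a 63% lower error than ORB-SLAM3 at the highest deformation level.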
6. Context, Limitations, and Prospective Extensions
- Scope and Limitations: The framework assumes sparse graph structure and well-behaved singular value spectra; extreme under-excitation or poor node distribution may still degrade performance. Deformation node activation depends critically on the conditioning metric and IMU excitation.
- Potential Extensions: Integrating higher-order geometric constraints, denser graph connectivity, adaptive regularization parameters, or multi-modal sensor fusion (e.g., depth, tactile arrays) could further improve performance in severe non-rigid or textured scenes.
- Comparative Impact: In the reported experiments, DefVINS outperforms the compared baselines in drift suppression and deformation tracking at metric scale. This robustness is attributed to the interplay of IMU anchoring and conditioning-aware non-rigid estimation (Cerezo et al., 2 Jan 2026).
DefVINS thus provides an optimization-based architecture for visual-inertial odometry in deformable environments, supported by an observability analysis, that uses conditioning-based control of non-rigid degrees of freedom to obtain metric-scale trajectory estimates with reduced drift (Cerezo et al., 2 Jan 2026).