DefVINS: Robust Visual-Inertial Odometry
- DefVINS is a robust visual-inertial odometry framework that integrates an IMU-anchored rigid backbone with a deformation graph to handle non-rigid scene changes.
- It decouples rigid motion from non-rigid deformations by activating additional degrees of freedom based on observability thresholds, reducing estimation drift.
- Empirical evaluations demonstrate that DefVINS outperforms traditional VIO methods in dynamic scenarios, maintaining real-time processing at 20–30 Hz.
DefVINS is a visual-inertial odometry (VIO) framework for robust metric ego-motion estimation in environments that violate the rigidity assumption underlying classical VIO pipelines. It models non-rigid scene deformation explicitly, anchored to a conventional IMU-coupled rigid backbone, to mitigate the catastrophic drift and overfitting that arise when visual motion is contaminated by non-rigid environmental changes. The system uses an embedded deformation graph for non-rigid modeling and an observability-driven degree-of-freedom (DoF) activation strategy to keep the estimation well-posed. Empirical validation shows pronounced robustness improvements in dynamic and deformable scenarios relative to rigid VIO baselines such as ORB-SLAM3 and VINS-Mono (Cerezo et al., 2 Jan 2026).
1. Motivation and Problem Formulation
Traditional VIO pipelines such as OKVIS, VINS-Mono, and ORB-SLAM3 assume that all image features correspond to static, rigid 3D points. Under this constraint, parallax is interpreted as pure platform motion, allowing complementary IMU measurements to render translational scale and gravity observable. When the scene is locally or globally deformable (e.g., cloth, human body, flexible cables), the rigid parallax assumption is violated: observed visual motion is an entangled sum of observer motion and non-rigid deformation. Standard VIO pipelines then either misattribute non-rigid flow to the platform, which causes global trajectory drift, or localize correctly only in intervals where rigid parallax dominates (Cerezo et al., 2 Jan 2026).
DefVINS explicitly decouples IMU-anchored rigid motion from non-rigid deformation, representing the latter as a deformation graph. This approach prevents overfitting the global pose to non-rigid visual signals and preserves metric consistency by anchoring the estimation to the inertial reference frame.
2. System Pipeline and Architecture
The DefVINS pipeline is structured in two major phases:
- Initialization: Standard rigid VIO batch initialization estimates the relative poses, velocities, IMU biases, and gravity direction over a small window using closed-form VIO methods. The rigid trajectory produced serves as the reference for subsequent non-rigid estimation.
- Sliding-Window Optimization: At each optimization cycle, the state vector is partitioned into (i) a rigid substate (poses, biases, gravity direction) and (ii) a non-rigid substate (positions of the active deformation-graph nodes).
Deformation graph node activation is progressive, regulated by the conditioning of the rigid subsystem Jacobian. Non-rigid DoFs are introduced only when the rigid observability (quantified by the smallest non-trivial singular value of the linearized rigid block) exceeds a threshold, guaranteeing that non-rigid estimation is not attempted under poor excitation or insufficient motion (Cerezo et al., 2 Jan 2026).
The processing modules include:
- Feature detection and feature-to-node association
- IMU preintegration to compute relative rigid motion increments
- Ceres-based back-end nonlinear least-squares, encompassing visual, inertial, and deformation residuals, with marginalization for sliding-window size control
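The feature-to-node association step can be illustrated with a minimal NumPy sketch. It assumes a translation-only deformation model in which each feature keeps a fixed offset to its nearest node; the function names `associate_features_to_nodes` and `deform_features` are hypothetical and are not taken from the DefVINS implementation.

```python
import numpy as np

def associate_features_to_nodes(features, nodes):
    """Attach each 3D feature to its nearest deformation node.

    Returns the node index per feature and the fixed feature-to-node offset.
    """
    # Pairwise squared distances, shape (num_features, num_nodes).
    d2 = ((features[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=2)
    node_idx = d2.argmin(axis=1)
    offsets = features - nodes[node_idx]
    return node_idx, offsets

def deform_features(node_idx, offsets, nodes_now):
    """Assumed translation-only model: a feature moves with its anchor node."""
    return nodes_now[node_idx] + offsets

# Two nodes and three features in the initial frame (toy values).
nodes0 = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
features0 = np.array([[0.1, 0.0, 1.0], [0.9, 0.1, 1.0], [0.4, 0.0, 1.0]])
node_idx, offsets = associate_features_to_nodes(features0, nodes0)

# Move only the second node; features anchored to it follow.
nodes1 = nodes0.copy()
nodes1[1] += np.array([0.0, 0.0, 0.2])
features1 = deform_features(node_idx, offsets, nodes1)
```

In this toy case only the feature attached to the displaced node changes position, which is the behavior a rigid feature-to-node attachment implies.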
3. Mathematical Formulation
The formulation considers the set of active deformation nodes and two consecutive keyframes, indexed $t-1$ and $t$. The state vector comprises the rigid substate and the positions of the active nodes.
- Deformation graph: Nodes are anchored in the initial frame, and each feature is rigidly associated with its nearest node. The deformed position of a feature at time $t$ is therefore determined by the current position of its anchor node.
- Residual definitions:
  - Visual reprojection residual for a feature observed in a frame: the difference between the observed image measurement and the projection of the feature's deformed position into that frame.
  - Inertial (IMU preintegration) residuals [cf. IMU integration in VINS-Mono]:
\begin{align*}
r_{\Delta R} &= \mathrm{Log}\bigl(\Delta\tilde R_{t-1,t}^{\top} R_{t-1}^{\top} R_t\bigr) \\
r_{\Delta v} &= R_{t-1}^{\top}\bigl(v_t - v_{t-1} - g\,\Delta t\bigr) - \Delta\tilde v_{t-1,t} \\
r_{\Delta p} &= R_{t-1}^{\top}\bigl(p_t - p_{t-1} - v_{t-1}\Delta t - \tfrac{1}{2} g\,\Delta t^2\bigr) - \Delta\tilde p_{t-1,t}
\end{align*}
  - Gravity residual on the estimated gravity direction.
  - Deformation regularization: elastic, viscous, and photometric terms.
- Total energy (objective): the sum of the visual, inertial, gravity, and deformation regularization terms over the sliding window.
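The inertial residuals can be evaluated numerically as a sanity check. The sketch below is a plain NumPy transcription of the three expressions, with helper names (`so3_exp`, `so3_log`, `inertial_residuals`) chosen for illustration; it omits bias terms and covariance weighting, and the logarithm map is not valid for rotation angles close to π.

```python
import numpy as np

def so3_exp(phi):
    """Rotation vector -> rotation matrix (Rodrigues formula)."""
    angle = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])
    if angle < 1e-8:
        return np.eye(3) + K  # first-order approximation
    return (np.eye(3) + np.sin(angle) / angle * K
            + (1.0 - np.cos(angle)) / angle**2 * (K @ K))

def so3_log(R):
    """Rotation matrix -> rotation vector (not valid near angle = pi)."""
    cos_angle = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.arccos(cos_angle)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if angle < 1e-8:
        return 0.5 * w
    return angle / (2.0 * np.sin(angle)) * w

def inertial_residuals(R_prev, v_prev, p_prev, R_cur, v_cur, p_cur,
                       dR, dv, dp, g, dt):
    """Rotation, velocity, and position residuals between two keyframes."""
    r_R = so3_log(dR.T @ R_prev.T @ R_cur)
    r_v = R_prev.T @ (v_cur - v_prev - g * dt) - dv
    r_p = R_prev.T @ (p_cur - p_prev - v_prev * dt - 0.5 * g * dt**2) - dp
    return r_R, r_v, r_p

# Toy preintegrated increments and a previous state (illustrative values).
g, dt = np.array([0.0, 0.0, -9.81]), 0.1
dR, dv, dp = so3_exp(np.array([0.0, 0.0, 0.05])), np.array([0.2, 0.0, 0.98]), np.array([0.01, 0.0, 0.049])
R_prev, v_prev, p_prev = so3_exp(np.array([0.1, -0.2, 0.3])), np.array([1.0, 0.0, 0.0]), np.zeros(3)

# A current state that is exactly consistent with the increments.
R_cur = R_prev @ dR
v_cur = v_prev + g * dt + R_prev @ dv
p_cur = p_prev + v_prev * dt + 0.5 * g * dt**2 + R_prev @ dp

r_R, r_v, r_p = inertial_residuals(R_prev, v_prev, p_prev, R_cur, v_cur, p_cur, dR, dv, dp, g, dt)
# Shifting the current position by 1 cm shows up only in the position residual.
_, _, r_p_shifted = inertial_residuals(R_prev, v_prev, p_prev, R_cur, v_cur,
                                       p_cur + np.array([0.01, 0.0, 0.0]), dR, dv, dp, g, dt)
```

For a state that agrees with the preintegrated increments all three residuals vanish up to floating-point error, and a position offset produces a position residual of the same magnitude.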
4. Observability and Conditioning-Based Deformation Activation
In rigid scenes, VIO is observable up to a global SE(3) transformation; IMU integration coupled with sufficient platform excitation renders scale, roll, pitch, and gravity observable, with global pose and yaw as gauge freedoms. The introduction of non-rigid DoFs induces new unobservable modes: purely visual methods cannot distinguish global platform drift from coherent, low-energy deformations (Cerezo et al., 2 Jan 2026).
DefVINS relies on "IMU anchoring": inertial residuals restrict the space of plausible global motions, lifting many otherwise ambiguous deformation modes (cf. the augmented observability matrix, whose singular spectrum encodes system conditioning). Non-rigid node activation is dynamically gated: nodes are switched on only when the minimum singular value $\sigma_{\min}$ of the rigid sub-Jacobian exceeds a preset threshold, so that non-rigid estimation is attempted only under adequate motion excitation.
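The gating rule can be sketched as a singular-value test on the rigid Jacobian block. The function `rigid_gate`, the default of four gauge directions, and the toy Jacobian below are assumptions made for illustration; the threshold value used by DefVINS is not reproduced here.

```python
import numpy as np

def rigid_gate(J_rigid, threshold, gauge_dim=4):
    """Return (sigma_min, activate) for the rigid Jacobian block.

    gauge_dim is the number of unobservable gauge directions whose
    (near-)zero singular values are ignored; 4 is an assumed default.
    """
    s = np.linalg.svd(J_rigid, compute_uv=False)  # sorted in descending order
    nontrivial = s[: len(s) - gauge_dim]
    sigma_min = float(nontrivial.min())
    return sigma_min, sigma_min > threshold

# Toy 8x8 rigid block: four informative directions, four gauge directions.
J_toy = np.diag([5.0, 3.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0])
sigma_min, activate = rigid_gate(J_toy, threshold=0.1)
_, activate_strict = rigid_gate(J_toy, threshold=1.0)
print(sigma_min, activate, activate_strict)  # 0.5 True False
```

Discarding the gauge directions before taking the minimum matters: without it the test would always see a zero singular value and never activate any node.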
5. Optimization Strategy and Implementation
DefVINS employs a Gauss–Newton optimizer (Ceres-based), with Schur-complement marginalization of old keyframes to preserve sliding-window tractability. Initial rigid VIO serves as a good linearization point for non-rigid DoFs. To maintain numerical stability:
- Only well-initialized non-rigid DoFs are introduced at each window.
- Robust cost kernels are used on high-leverage visual and photometric residuals.
- Non-convexity is addressed via conditioning-aware progressive DoF activation.
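As an illustration of how a robust kernel limits the influence of large residuals, the sketch below computes iteratively-reweighted least-squares weights for the Huber kernel. The source does not name the kernel it uses, so the choice of Huber and the scale `delta` are assumptions.

```python
import numpy as np

def huber_weights(residuals, delta):
    """IRLS weights of the Huber kernel: 1 inside delta, delta/|r| outside."""
    a = np.abs(residuals)
    # np.maximum guards the division for residuals that are exactly zero.
    return np.where(a <= delta, 1.0, delta / np.maximum(a, 1e-12))

residuals = np.array([0.5, 2.0, -4.0])
weights = huber_weights(residuals, delta=1.0)
print(weights)  # [1.   0.5  0.25]
```

Residuals below the scale keep full weight, while larger ones are down-weighted in proportion to their magnitude, so a few badly matched features cannot dominate the update.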
A fixed window size and continuous marginalization bound computational complexity. On standard desktop CPUs, DefVINS achieves real-time processing at 20–30 Hz (Cerezo et al., 2 Jan 2026).
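Schur-complement marginalization can be sketched on the normal equations $H x = b$ of one Gauss-Newton step. This is a generic dense NumPy illustration with a hypothetical `marginalize` helper, not the Ceres-based implementation; a real back-end would exploit sparsity and handle rank-deficient blocks.

```python
import numpy as np

def marginalize(H, b, keep_idx, marg_idx):
    """Eliminate the variables in marg_idx from H x = b via the Schur complement."""
    H_kk = H[np.ix_(keep_idx, keep_idx)]
    H_km = H[np.ix_(keep_idx, marg_idx)]
    H_mm = H[np.ix_(marg_idx, marg_idx)]
    b_k, b_m = b[keep_idx], b[marg_idx]
    # H is symmetric, so the lower-left block is H_km.T.
    H_prior = H_kk - H_km @ np.linalg.solve(H_mm, H_km.T)
    b_prior = b_k - H_km @ np.linalg.solve(H_mm, b_m)
    return H_prior, b_prior

# Random symmetric positive-definite system with 6 variables.
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 6))
H = A.T @ A + np.eye(6)
b = rng.normal(size=6)

marg_idx = [0, 1]          # e.g. the oldest keyframe's variables
keep_idx = [2, 3, 4, 5]    # variables that stay in the window
H_prior, b_prior = marginalize(H, b, keep_idx, marg_idx)

x_full = np.linalg.solve(H, b)
x_keep = np.linalg.solve(H_prior, b_prior)
```

The reduced system yields the same solution for the retained variables as the full system, which is why old keyframes can be dropped without discarding the information they carried.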
6. Experimental Validation and Results
DefVINS was validated on both synthetic and real benchmarks:
| Scene Type | Baseline | ATE-RMSE (mm) | RPE improvement | Coverage |
|---|---|---|---|---|
| Low def. (L0) Synthetic | ORB-SLAM3 / NR-SLAM | ≤ 10 | VI-R: ~20% vs VN | ~All track |
| High def. (L1–L3) Synthetic | ORB-SLAM3 | Full: -30–45% | Full: -20–40% | ↑ tracking |
| Real RGB-D (Industrial sequences) | ORB-SLAM3 | -75–80% (HD) | Full: +35–75ppt | 85–95% vs 20–50% (rigid) |
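For reference, ATE-RMSE is commonly computed as the root-mean-square position error after a rigid alignment of the estimated trajectory to ground truth. The sketch below uses a rotation-plus-translation alignment without scale, which is an assumption appropriate for metric VIO; the exact evaluation protocol behind the table is not specified here, and `ate_rmse` is a hypothetical helper.

```python
import numpy as np

def ate_rmse(est, gt):
    """RMSE of position error after rigid (rotation + translation) alignment.

    est, gt: arrays of shape (N, 3) with time-associated positions.
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, _, Vt = np.linalg.svd(G.T @ E)
    # Force a proper rotation (determinant +1), excluding reflections.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    t = mu_g - R @ mu_e
    aligned = est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))

rng = np.random.default_rng(0)
gt = rng.normal(size=(50, 3))

# An estimate that differs from ground truth only by a rigid transform.
c, s = np.cos(0.3), np.sin(0.3)
R0 = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
est = gt @ R0.T + np.array([1.0, -2.0, 0.5])
err_rigid = ate_rmse(est, gt)

# The same estimate with 1 cm per-axis noise added.
err_noisy = ate_rmse(est + rng.normal(scale=0.01, size=est.shape), gt)
```

A purely rigid offset between the two trajectories gives an error of essentially zero, so the metric measures trajectory shape error and not the arbitrary choice of reference frame.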
Ablation studies confirm that:
- Visual-only non-rigid baselines (V-NR) exhibit drift under turns.
- Rigid-only visual-inertial baselines (VI-R) maintain short-term accuracy but diverge under heavy deformation.
- Full DefVINS maintains both global consistency and local robustness.
7. Extensions and Open Challenges
DefVINS' modular framework is compatible with several proposed extensions:
- Adaptive deformation graph refinement (dynamic node addition/removal, graph topology changes)
- Integration of learned deformation priors or neural-warp fields for stronger non-rigid regularization
- Multi-object or articulated deformation handling (necessary for scenes with multiple independently deforming entities)
- Tighter coupling with dense 3D and semantic representations
Principal open challenges are posed by environments with rapidly varying or topologically nontrivial deformation, and by the computational scalability of large or highly connected deformation graphs under real-time constraints. Future work may focus on mechanisms for efficient graph scaling and incorporation of richer deformation models.
DefVINS thus generalizes classical VIO by embedding an observability-aware, IMU-anchored deformable modeling backend, yielding robust pose estimation even when traditional rigidity assumptions do not hold (Cerezo et al., 2 Jan 2026).