Visual-Inertial-Leg Odometry

Updated 4 July 2026

Visual-inertial-leg odometry is a state estimation approach that combines vision, inertial sensing, and leg kinematics to determine a legged robot’s motion.
It employs methods such as EKF and factor graphs to fuse high-rate IMU data with visual and contact information, enhancing control and reducing drift.
This fusion method improves dynamic locomotion by providing low-latency, calibrated estimates crucial for managing intermittent contacts and slippage.

Visual-inertial-leg odometry, often abbreviated VILO or VIL-O, denotes state-estimation systems that fuse vision, inertial sensing, and legged kinematics or contact information to estimate the motion state of a legged robot. In this class of estimators, cameras provide geometric constraints between poses, IMUs provide high-rate motion information, and legs provide ground-relative constraints through forward kinematics, stance assumptions, contact events, and related proprioceptive signals. The resulting systems are used to estimate base pose, velocity, orientation, IMU biases, and, in several formulations, additional quantities such as contact states, foothold positions, height bias, landmarks, or kinematic parameters; representative realizations span EKF-based multi-rate fusion, contact-centric unscented filtering, fixed-lag factor graphs, adaptive leg-factor weighting, online kinematic calibration, learned inertial displacement factors, and multi-IMU legged state estimation (Dhédin et al., 2022, Wisth et al., 2021, Yang et al., 2022, Yang et al., 15 Jul 2025).

1. Scope, state variables, and estimation targets

The central estimation target in visual-inertial-leg odometry is the floating-base state of a legged robot under intermittent contact. In dynamic locomotion, the base is not directly measured, contacts appear and disappear, feet may slip, and motion can include large vertical excursions and flight phases. For control, especially nonlinear model predictive control, the estimator is required to deliver base pose, orientation, linear velocity, angular velocity, and reliable height above ground at high rate and with small latency (Dhédin et al., 2022).

A compact EKF formulation for dynamic locomotion uses

$\mathbf{s}_\text{EKF} = \bigl( {}^W\mathbf{p}_{WB},\ {}^W\mathbf{q}_{WB},\ {}^B\mathbf{v}_{WB},\ \mathbf{b}_i^a,\ \mathbf{b}_i^\omega,\ b_{\delta z} \bigr),$

where the state includes base position, base orientation quaternion, base linear velocity in the base frame, base IMU biases, and a scalar height bias that compensates VIO drift relative to the ground. In contrast, smoothing-based systems such as VILENS define a keyframe state

$x_i \triangleq \left[R_i,p_i,v_i,\mathbf{b}^g_i,\mathbf{b}^a_i, \mathbf{b}^{\omega}_i, \mathbf{b}^{v}_i \right],$

augmenting the standard visual-inertial state with leg-odometry bias terms, while Cerberus further augments the state with kinematic parameters $\rho$ for online calibration (Wisth et al., 2021, Yang et al., 2022).
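As a minimal illustration of the compact EKF state listed above, the following sketch stores its six blocks in a flat container. The field names, the quaternion ordering (w, x, y, z), and the zero defaults are assumptions made for illustration; they are not taken from the cited implementation.

```python
from dataclasses import dataclass, field
from typing import List


def _zeros3() -> List[float]:
    return [0.0, 0.0, 0.0]


@dataclass
class EkfState:
    """Hypothetical container mirroring the blocks of s_EKF."""
    p_wb: List[float] = field(default_factory=_zeros3)        # base position, world frame
    q_wb: List[float] = field(
        default_factory=lambda: [1.0, 0.0, 0.0, 0.0])         # base orientation quaternion (w, x, y, z assumed)
    v_wb_base: List[float] = field(default_factory=_zeros3)   # base linear velocity, base frame
    bias_acc: List[float] = field(default_factory=_zeros3)    # accelerometer bias
    bias_gyro: List[float] = field(default_factory=_zeros3)   # gyroscope bias
    bias_height: float = 0.0                                  # scalar height bias b_dz

    def as_vector(self) -> List[float]:
        """Concatenate all blocks into one flat list (3 + 4 + 3 + 3 + 3 + 1 entries)."""
        return (self.p_wb + self.q_wb + self.v_wb_base
                + self.bias_acc + self.bias_gyro + [self.bias_height])


state = EkfState()
print(len(state.as_vector()))
```

With a quaternion parameterization the flat vector has 17 entries; an error-state filter would typically use a 3-dimensional attitude error instead, which this sketch does not model.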

Several systems push the state further toward a contact-rich or proprioceptive formulation. COCLO includes base pose, base velocity, IMU biases, the world positions of all feet, and a per-foot contact status $c^i \in [0,1]$ interpreted as probability of contact, thereby making contact status itself part of the latent state (Yang et al., 2019). Multi-IMU proprioceptive odometry extends the state to include, for each foot, world position, world velocity, foot orientation, and foot-IMU biases, yielding an 80-dimensional quadruped state driven by one body IMU and four foot IMUs (Yang et al., 15 Jul 2025).

These state choices encode different assumptions about what is most uncertain. EKF formulations oriented toward control emphasize low latency and explicit height correction. Factor-graph formulations emphasize cross-modal consistency across a sliding window. Contact-centric and multi-IMU formulations emphasize the fact that foot motion and contact quality are not nuisance effects but primary variables in legged locomotion.

2. Leg odometry, contact modeling, and ground-relative constraints

The leg component of visual-inertial-leg odometry begins with forward kinematics. If foot $k$ is in stance and assumed not slipping, a standard velocity observation for the base is

${}^B\mathbf{v}_{WB} = -{}^B\mathbf{v}_{BK} - {}^B\boldsymbol{\omega}_{WB} \times {}^B\mathbf{p}_{BK},$

so joint encoders, joint velocities, and base angular velocity yield a direct stochastic measurement of base velocity in the base frame (Dhédin et al., 2022). This is the basic bridge between legged kinematics and inertial navigation.
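The velocity observation above can be sketched in a few lines. The function and argument names are illustrative assumptions; in practice the foot position and velocity relative to the base would come from forward kinematics and the leg Jacobian, and the angular velocity from a bias-corrected gyroscope.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def base_velocity_from_stance_foot(v_bk, omega_wb, p_bk):
    """Base linear velocity in the base frame from one non-slipping stance foot.

    v_bk     : foot velocity relative to the base, base frame
    omega_wb : base angular velocity, base frame
    p_bk     : foot position relative to the base, base frame
    Implements v_WB = -v_BK - omega_WB x p_BK.
    """
    w_x_p = cross(omega_wb, p_bk)
    return tuple(-v - c for v, c in zip(v_bk, w_x_p))


# Foot swept backward at 0.5 m/s relative to the base, no rotation:
# the base moves forward at 0.5 m/s.
print(base_velocity_from_stance_foot((-0.5, 0.0, 0.0), (0.0, 0.0, 0.0), (0.2, 0.1, -0.3)))
```

When several feet are in stance, each yields its own observation; how they are combined (averaging, stacking as separate measurements, or weighting by contact confidence) is a design choice that differs across the cited systems.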

A major difficulty is that the no-slip stance assumption is often violated. On soft terrain the foot continues moving after contact detection, and on slippery terrain the foot can have non-zero velocity relative to the ground even while nominally in stance. The learned inertial odometry study on ANYmal explicitly reports upward drift in world $z$ on soft ground and large error spikes during slip, because the estimator interprets terrain compression or sliding as base motion (Buchanan et al., 2021). VILENS addresses the same phenomenon by introducing a linear velocity bias term $\mathbf{b}^v$ and replacing the idealized measurement model with

$\tilde{v} = v + \mathbf{b}^v + \eta,$

so that systematic leg-odometry drift due to slippage, terrain deformation, and modeling error is estimated online rather than forced into the pose trajectory (Wisth et al., 2021).

Contact modeling is therefore central. In dynamic locomotion EKFs, contact forces are estimated from joint torques and Jacobians, then thresholded through a Schmitt trigger; a foot is used for leg odometry only after it has been in contact for at least $N_\text{contact}$ consecutive filter steps (Dhédin et al., 2022). COCLO instead makes contact status a state variable and formulates both prediction and measurement models according to whether a leg is in stance or swing, reducing reliance on accelerometer propagation under high impacts (Yang et al., 2019). Multi-IMU odometry replaces the zero-foot-velocity assumption with a pivoting contact model, in which the foot-center velocity during stance is tied to the foot's rotation about the contact point rather than set to zero, and uses a Mahalanobis test on the pivoting residual to reject slipping or swinging feet (Yang et al., 15 Jul 2025).
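The Schmitt-trigger gating with an $N_\text{contact}$ dwell requirement can be sketched as below. The class name, the two force thresholds, and the example values are assumptions for illustration only; the cited work does not prescribe them here.

```python
class ContactDetector:
    """Hysteresis (Schmitt-trigger) contact detector with a dwell-time gate."""

    def __init__(self, f_on, f_off, n_contact):
        if f_on <= f_off:
            raise ValueError("f_on must exceed f_off to provide hysteresis")
        self.f_on = f_on              # force above which contact is declared (hypothetical value)
        self.f_off = f_off            # force below which contact is released (hypothetical value)
        self.n_contact = n_contact    # consecutive in-contact steps required before use
        self.in_contact = False
        self.steps_in_contact = 0

    def update(self, force_norm):
        """Feed one estimated contact-force magnitude.

        Returns True only when the foot may be used for leg odometry.
        """
        if self.in_contact:
            if force_norm < self.f_off:
                self.in_contact = False
        elif force_norm > self.f_on:
            self.in_contact = True
        self.steps_in_contact = self.steps_in_contact + 1 if self.in_contact else 0
        return self.steps_in_contact >= self.n_contact


detector = ContactDetector(f_on=10.0, f_off=5.0, n_contact=3)
usable = [detector.update(f) for f in [0.0, 12.0, 8.0, 7.0, 4.0, 12.0]]
print(usable)
```

The two thresholds prevent chattering when the force estimate hovers near a single cutoff, and the dwell counter keeps the impact transient at touchdown out of the velocity measurement.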

A recurrent misconception is that leg odometry alone resolves all practically relevant motion modes. The cited literature does not support that view. Leg odometry plus IMU gives good short-term velocity and roll/pitch, but absolute position and yaw remain unobservable and base height is poorly observed under jumps or uneven terrain; conversely, pure VIO yields low-drift pose in a local map but height can drift relative to the ground and latency is often too high for feedback control (Dhédin et al., 2022). Visual-inertial-leg odometry exists precisely because these error modes are complementary.

3. Fusion architectures and estimator classes

The literature contains both loosely coupled and tightly coupled architectures. In a loosely coupled architecture, a vision-inertial estimator runs independently and its output is used as a pseudo-sensor inside a second estimator together with leg odometry and IMU measurements. The dynamic-locomotion framework for Solo12 follows this pattern: a stereo-inertial VIO runs on a vision computer, its optimized state is predicted forward at IMU rate to reduce latency, and an EKF on the robot fuses predicted VIO pose and velocity with a second high-rate base IMU and leg odometry, producing state estimates suitable for 1 kHz control (Dhédin et al., 2022).

Tightly coupled architectures instead place raw or preintegrated measurements from all modalities into one optimization problem. VILENS is a fixed-lag smoothing system in which IMU preintegration factors, preintegrated leg-velocity factors, visual reprojection factors, lidar plane and line factors, and lidar ICP odometry factors all share the same sliding-window state (Wisth et al., 2021). Cerberus keeps the same factor-graph philosophy, but augments the state with online kinematic calibration variables and contact-aware leg factors (Yang et al., 2022). WALK-VIO extends VINS-Fusion by adding leg kinematic factors directly to the back-end optimization and modulating their information matrices online according to walking motion (Lim et al., 2021).

System	Estimator structure	Distinctive leg-related idea
"Visual-Inertial and Leg Odometry Fusion for Dynamic Locomotion" (Dhédin et al., 2022)	VIO prediction + EKF	Height bias $b_{\delta z}$ and low-latency fusion for control
"WALK-VIO" (Lim et al., 2021)	Sliding-window VIO	Walking-motion-adaptive leg constraint factor
"VILENS" (Wisth et al., 2021)	Fixed-lag factor graph	Preintegrated velocity factor with linear velocity bias
"Cerberus" (Yang et al., 2022)	Factor graph	Online kinematic parameter calibration and contact outlier rejection
"Multi-IMU Sensor Fusion for Legged Robots" (Yang et al., 15 Jul 2025)	EKF + factor graph	Foot IMUs and pivoting-contact proprioceptive odometry

The distinction between filtering and smoothing is not only computational. Recursive filters such as EKF, IEKF, SR-UKF, and contact-centric UKF provide constant-time updates and are naturally aligned with high-frequency control. Sliding-window factor graphs provide more globally consistent uncertainty handling, allow calibration variables and non-consecutive factors, and make bias observability arguments more explicit. Several systems combine both levels: a high-rate proprioceptive or EKF front-end feeds a lower-rate factor-graph back-end, or a high-rate VIO prediction front-end reduces latency before fusion (Dhédin et al., 2022, Yang et al., 15 Jul 2025).

4. Observability, scale, latency, and complementary sensing

Observability structure is one of the defining technical issues in visual-inertial-leg odometry. For leg odometry plus IMU, absolute position and yaw remain unobservable, while height becomes weakly constrained during flight phases, slip, or terrain irregularity. For pure stereo VIO, scale is observable, roll and pitch are gravity aligned and essentially drift-free, but output rate and latency are often inadequate for control and map-referenced height can drift relative to the terrain (Dhédin et al., 2022). This complementarity motivates nearly every fusion design in the field.

The most explicit treatment of base height appears in the dynamic-locomotion EKF, which introduces a dedicated height measurement that is applied only when all legs have been in contact for a minimum number of consecutive steps. Here the kinematics-derived ground height is used to update the scalar VIO height bias $b_{\delta z}$. In both trotting and jumping, this reduces mean vertical RPE relative to the leg-only EKF (Dhédin et al., 2022). The significance is that leg contact is not merely a short-term velocity cue; it is also a ground-referenced vertical datum.
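A scalar Kalman update gives a feel for how a kinematics-derived height can correct a VIO height bias. The measurement convention used here, namely that the VIO height minus the kinematic height observes the bias directly, is an assumption made for this sketch, as are the function name and the noise values; the cited paper's exact measurement equation is not reproduced.

```python
def update_height_bias(bias, variance, z_vio, h_kin, meas_variance):
    """One scalar Kalman update of the VIO height bias.

    Assumed measurement model (illustrative): y = z_vio - h_kin = bias + noise.
    bias, variance : prior mean and variance of the height bias
    z_vio          : base height reported by VIO in its map frame
    h_kin          : base height above ground from stance-leg kinematics
    meas_variance  : variance of the measurement noise
    """
    y = z_vio - h_kin                        # observed bias
    gain = variance / (variance + meas_variance)
    new_bias = bias + gain * (y - bias)      # move the estimate toward the observation
    new_variance = (1.0 - gain) * variance   # uncertainty shrinks after the update
    return new_bias, new_variance


# Apply only while every leg has been in stable contact (gating not shown here).
b, p = update_height_bias(bias=0.0, variance=1.0, z_vio=0.35, h_kin=0.25, meas_variance=1.0)
print(b, p)
```

With equal prior and measurement variances the gain is one half, so the estimate moves halfway toward the observed offset and the variance is halved.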

A specialized but influential observability argument comes from downward planar visual-inertial odometry. RaD-VIO models the ground as a plane and estimates frame-to-frame motion from the homography induced by that plane, which determines the inter-frame translation only up to the plane distance, while a rangefinder directly constrains that distance. Its measurement model shows how unscaled visual motion becomes metric velocity when fused with height. The paper explicitly notes the analogy to legged robots: the rangefinder distance can be replaced by body height inferred from stance-foot kinematics or ground-relative leg constraints (Fu et al., 2018). This suggests that planar ground vision and leg-derived height are mathematically closer than their sensor modalities would indicate.
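The scale argument can be made concrete with a small sketch. It assumes that a homography decomposition has already produced the translation divided by the plane distance, and that the base is roughly level so that stance-foot heights in the base frame approximate the distance to the ground plane. Both helper functions and their names are hypothetical and are not RaD-VIO's implementation.

```python
def height_from_stance_feet(stance_foot_z_in_base):
    """Approximate base height above a planar ground from stance feet.

    Assumes a level base, so height is the mean of -z over the stance feet.
    """
    if not stance_foot_z_in_base:
        raise ValueError("at least one stance foot is required")
    return sum(-z for z in stance_foot_z_in_base) / len(stance_foot_z_in_base)


def metric_velocity(t_over_d, plane_distance, dt):
    """Turn a scale-free translation (t / d) into a metric velocity.

    t_over_d       : inter-frame translation divided by plane distance
    plane_distance : metric distance to the plane (rangefinder or leg-derived)
    dt             : time between the two frames in seconds
    """
    if dt <= 0.0:
        raise ValueError("dt must be positive")
    return tuple(plane_distance * c / dt for c in t_over_d)


height = height_from_stance_feet([-0.30, -0.32])
print(metric_velocity((0.02, 0.0, -0.01), height, dt=0.05))
```

Swapping the rangefinder reading for `height_from_stance_feet` is the substitution the analogy to legged robots refers to; on non-planar terrain or with a tilted base the level-base assumption breaks down.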

Latency is the other half of the observability problem. VIO outputs are often accurate but delayed; leg odometry and base IMU are prompt but drift-prone. The low-latency VIO prediction strategy integrates the last optimized VIO state forward at the VIO-IMU rate and streams the predicted pose and velocity to the robot-side EKF, reducing the stale-measurement problem that otherwise destabilizes aggressive gaits (Dhédin et al., 2022). A related theoretical direction appears in geometric observer design for VIO, where the state is embedded in a Lie group and the translational error is made linear time-varying while attitude is handled by an almost globally stable geometric observer. The same source explicitly proposes extending this structure to VILO by augmenting the state with foothold positions and contact geometry. This suggests a route toward contact-aware observers with stronger convergence guarantees than local linearization can provide (Boughellaba et al., 12 Jan 2026).
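The forward-prediction idea can be sketched as a plain kinematic integration from the last optimized VIO state through the IMU samples received since then. To stay short, the sketch assumes the accelerations have already been rotated into the world frame and gravity-compensated, and it omits orientation propagation; the function name and sample values are illustrative.

```python
def predict_vio_state(position, velocity, accels_world, dt):
    """Propagate position and velocity through buffered IMU samples.

    position, velocity : last optimized VIO state (world frame), as 3-tuples
    accels_world       : gravity-compensated world-frame accelerations (assumed given)
    dt                 : IMU sample period in seconds
    """
    p, v = tuple(position), tuple(velocity)
    for a in accels_world:
        # Constant-acceleration update over one IMU period.
        p = tuple(pi + vi * dt + 0.5 * ai * dt * dt for pi, vi, ai in zip(p, v, a))
        v = tuple(vi + ai * dt for vi, ai in zip(v, a))
    return p, v


# Five samples of 2 m/s^2 forward acceleration at 10 Hz, starting from rest.
p_pred, v_pred = predict_vio_state((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), [(2.0, 0.0, 0.0)] * 5, 0.1)
print(p_pred, v_pred)
```

Each time a new optimized VIO state arrives, the buffer is trimmed to the samples newer than that state and the prediction restarts from it, so integration error does not accumulate beyond one optimization latency.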

5. Adaptive weighting, learned priors, and online model adaptation

A notable trend in visual-inertial-leg odometry is the replacement of fixed leg-factor covariances by adaptive or learned models. WALK-VIO is the clearest hand-crafted example. It computes the average image-space feature motion between frames, forms a covariance over recent feature motion, diagonalizes it, and derives an adaptive factor from the result. The leg residual is then scaled by this factor. Large walking-induced feature motion therefore strengthens leg constraints, while smooth motion weakens them (Lim et al., 2021). The broader implication is that estimator trust allocation can be made motion dependent without explicitly estimating gait class.
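The following sketch shows the general shape of such a motion-dependent weight: mean feature displacement per frame, a 2x2 covariance over a recent window, its eigenvalues, and a scale applied to the leg residual. The mapping from eigenvalue to weight (`1 + gain * sqrt(lambda_max)`) is a hypothetical stand-in chosen only because it is monotone; it is not the formula used by WALK-VIO.

```python
import math


def mean_feature_motion(prev_pts, curr_pts):
    """Average image-space displacement (du, dv) of tracked features."""
    n = len(prev_pts)
    du = sum(c[0] - p[0] for p, c in zip(prev_pts, curr_pts)) / n
    dv = sum(c[1] - p[1] for p, c in zip(prev_pts, curr_pts)) / n
    return (du, dv)


def covariance_2d(samples):
    """Population covariance (cuu, cuv, cvv) of a list of (du, dv) samples."""
    n = len(samples)
    mu = sum(s[0] for s in samples) / n
    mv = sum(s[1] for s in samples) / n
    cuu = sum((s[0] - mu) ** 2 for s in samples) / n
    cvv = sum((s[1] - mv) ** 2 for s in samples) / n
    cuv = sum((s[0] - mu) * (s[1] - mv) for s in samples) / n
    return cuu, cuv, cvv


def eigenvalues_sym_2x2(cuu, cuv, cvv):
    """Closed-form eigenvalues of a symmetric 2x2 matrix, largest first."""
    half_trace = 0.5 * (cuu + cvv)
    radius = math.sqrt((0.5 * (cuu - cvv)) ** 2 + cuv ** 2)
    return half_trace + radius, half_trace - radius


def leg_residual_weight(recent_motion, gain=1.0):
    """Hypothetical monotone weight: grows with the dominant motion variance."""
    lam_max, _ = eigenvalues_sym_2x2(*covariance_2d(recent_motion))
    return 1.0 + gain * math.sqrt(max(lam_max, 0.0))


smooth = [(1.0, 0.0)] * 5                               # steady image motion
shaky = [(0.0, 3.0), (0.0, -3.0), (0.0, 3.0), (0.0, -3.0)]  # oscillating image motion
print(leg_residual_weight(smooth), leg_residual_weight(shaky))
```

In a factor-graph back-end the weight would multiply the whitened leg residual, which is equivalent to scaling the factor's information matrix by the weight squared.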

A second direction replaces hand-designed contact priors with learned inertial priors. The ANYmal learned inertial odometry study predicts a 3D displacement and covariance from a 1 s gravity-aligned IMU window using a 1D ResNet-18, then injects that prediction either as an EKF pseudo-measurement or as a non-consecutive factor in a fixed-lag graph. In the visual-inertial-kinematic factor graph, adding this learned displacement factor reduces 10 m relative pose error in the underground mine by 22%, and in challenging proprioceptive scenarios it yields a 37% reduction of relative pose error relative to a traditional kinematic-inertial estimator (Buchanan et al., 2021). The method does not replace visual or leg factors; it supplements them precisely when both are likely to degrade.

The most explicit online-learning architecture appears in the tightly coupled LiDAR-IMU-leg odometry system with foot tactile information. Its neural leg kinematics model takes proprioceptive and foot tactile signals over a short temporal window and outputs a body twist and continuous contact states. The associated neural adaptive leg odometry factor has an adaptive term that is estimated online from recent residuals (Okawara et al., 11 Jun 2025). On the campus sequence, the paper compares the ATE of the full method against variants without online learning and without tactile input; on the sandy beach sequence, online learning reduces ATE. Although the exteroceptive sensor in that work is LiDAR rather than vision, the same paper explicitly states that the LiDAR factor can be replaced by visual reprojection factors in a visual-inertial-leg system. This suggests that online learned leg priors are not modality specific; they are back-end agnostic.

6. Experimental performance, applications, and limitations

The practical success of visual-inertial-leg odometry is tied to dynamic locomotion. In the Solo12 system, the fused estimator is integrated with a nonlinear model predictive controller and supports trotting and jumping at varying horizontal speeds. Using predicted VIO inside the EKF, the estimator is evaluated against the leg-only EKF on mean horizontal RPE in trotting and jumping, and mean yaw RPE drops in both gaits (Dhédin et al., 2022). These results clarify why VILO is not only a mapping problem but a control-enabling subsystem.

Optimization-based legged VIO systems report substantial gains over vision-only baselines. WALK-VIO reports lower translation RMSE than VINS-Fusion overall and than VINS-Fusion with fixed-weight leg constraints, and reports a separate improvement over VINS-Fusion on the NMPC datasets, where walking motion is larger (Lim et al., 2021). VILENS, evaluated over 2 h and 1.8 km on ANYmal robots traversing loose rocks, slopes, mud, dark underground caverns, and feature-deprived areas, reports an average improvement of 62% in translational and 51% in rotational errors compared to a state-of-the-art loosely coupled approach (Wisth et al., 2021).

Long-range outdoor operation highlights the value of calibration and contact robustness. Cerberus reports that calibrating kinematic parameters within the state estimator can reduce estimation drift to lower than 1% during long distance high speed locomotion. The paper compares VINS, VILO without calibration, and Cerberus on a track sequence and a campus sequence: all three produce a drift figure on the track, whereas on the campus sequence VINS fails and only the two VILO variants yield drift figures (Yang et al., 2022). Multi-IMU VILO extends this line by correcting a major error source in proprioceptive odometry with foot IMUs and then injecting the resulting velocity estimate into a factor graph with cameras. In a long garage experiment, it maintains low drift while alternative methods diverge in the indoor section (Yang et al., 15 Jul 2025).

A persistent theme across the literature is that failure modes are modality specific. COCLO shows that a contact-centric leg odometry can outperform VINS-Fusion on flat ground, ramps, and stairs under unstable motion, but it also notes that contact modeling and slippage remain open problems and that tighter coupling with visual estimation is still desirable (Yang et al., 2019). Learned inertial and learned leg models reduce drift on slip, soft ground, deformable terrain, and visually degraded environments, but their training is robot specific and generalization to other morphologies or terrain distributions is not guaranteed (Buchanan et al., 2021, Okawara et al., 11 Jun 2025). Downward homography-based modules provide strong scale and height observability on approximately planar terrain, but their effectiveness depends on local planarity and camera placement (Fu et al., 2018).

The resulting picture is not that visual-inertial-leg odometry has converged to a single canonical form. Rather, the field has established a set of stable design principles: use vision to anchor pose and yaw when features are available; use legs to provide ground-relative velocity and height; model contact violations explicitly, either through bias states, adaptive factor weighting, or learned motion priors; reduce latency by prediction or by separating control-rate and optimization-rate estimators; and, whenever possible, formulate these constraints in a shared probabilistic back-end so that bias, slip, contact quality, and calibration become observable through cross-modal consistency.