MambaIO: Neural Inertial Odometry Framework

Updated 26 November 2025
  • MambaIO is a neural inertial odometry framework that decouples IMU signals into low- and high-frequency bands via a Laplacian pyramid to enhance trajectory estimation.
  • The architecture employs a dual-branch design: a Mamba state space model for long-range motion and a multi-path convolutional network for local motion details.
  • Evaluation on six public datasets shows MambaIO reduces trajectory errors by 8–15% compared to prior methods, demonstrating significant performance improvements.

MambaIO is a neural inertial odometry (IO) framework designed to recover pedestrian trajectories using only raw 3-axis accelerometer and gyroscope measurements from commodity inertial measurement units (IMUs), processed in the global (gravity-aligned) coordinate frame. It introduces a frequency-decoupled modeling strategy, where inertial signals are split into low- and high-frequency bands via a Laplacian pyramid. The low-frequency component is processed using a Mamba architecture (a linear-time state space model, SSM), which excels at extracting long-range contextual motion cues, while local motion details in the high-frequency band are modeled via multi-path convolutional (MPC) structures. MambaIO demonstrates state-of-the-art trajectory accuracy on six public pedestrian IO datasets, yielding substantial reductions in global and local trajectory errors relative to previous methods (Zhang, 19 Nov 2025).

1. Coordinate Frame Analysis and Motivation

Pedestrian IO seeks to estimate incremental pose $(\Delta \mathbf{p},\, \Delta\mathbf{q})$ from windowed sequences of IMU measurements $(\mathbf{a}_{1:L},\, \boldsymbol{\omega}_{1:L})$. Traditional strapdown integration accumulates drift due to bias and noise via double integration. Learning-based approaches train neural mappings end-to-end over short windows to regress pose increments, enhancing both robustness and accuracy.

A fundamental design choice is the representation of IMU data in either the body frame (IMU-attached axes) or the global frame (gravity-aligned axes). For arbitrarily held phones during pedestrian movement, the global frame leads to temporally smoother, semantically coherent signals. Kinematic analysis—using PCA and t-SNE visualizations—shows that global-frame representations yield more compact, discriminative latent features for human IO tasks, supporting the adoption of global coordinates for MambaIO (Zhang, 19 Nov 2025).
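For intuition, the following minimal NumPy sketch shows how body-frame IMU samples could be rotated into the gravity-aligned global frame, assuming per-sample orientation quaternions are available from a device attitude estimator. The helper names (`quat_to_rotmat`, `body_to_global`) are illustrative and not taken from the paper.

```python
# Sketch (not from the paper): rotating body-frame IMU samples into the
# gravity-aligned global frame, assuming per-sample orientation quaternions
# (w, x, y, z) from the device's attitude estimator.
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def body_to_global(acc_body, gyro_body, quats):
    """Rotate (L, 3) body-frame accelerometer/gyroscope samples into the global frame."""
    acc_g = np.empty_like(acc_body)
    gyro_g = np.empty_like(gyro_body)
    for t, q in enumerate(quats):
        R = quat_to_rotmat(q)          # body -> global rotation at time t
        acc_g[t] = R @ acc_body[t]
        gyro_g[t] = R @ gyro_body[t]
    return acc_g, gyro_g
```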

2. Frequency-Decoupled Signal Decomposition via Laplacian Pyramid

To isolate global motion trends from rapid, local fluctuations, MambaIO introduces a differentiable Laplacian pyramid decomposition:

  • For input $X \in \mathbb{R}^{6 \times L}$ (windowed IMU signals), apply a depthwise average convolution ($k=5$, $s=2$) to downsample and extract the low-frequency component:

$$X_{ld} = \mathrm{DWConv}_{\mathrm{avg},\,k=5,\,s=2}(X) \in \mathbb{R}^{6 \times \lfloor L/2 \rfloor}$$

  • Upsample $X_{ld}$ to length $L$ using nearest-neighbor interpolation:

$$X_{\mathrm{low}} = \mathrm{NearestUpSample}(X_{ld}) \in \mathbb{R}^{6 \times L}$$

  • Compute the high-frequency residual:

$$X_{\mathrm{high}} = X - X_{\mathrm{low}} \in \mathbb{R}^{6 \times L}$$

This decomposition yields $X_{\mathrm{low}}$ (slow, global motion trends) and $X_{\mathrm{high}}$ (rapid, localized dynamics), which are modeled in specialized branches.
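A minimal PyTorch sketch of this split follows. The module name `LaplacianSplit` and the use of fixed uniform filter weights are assumptions for illustration; the authors' averaging filter may be parameterized differently.

```python
# Minimal PyTorch sketch of the Laplacian-pyramid split described above.
# Module and variable names are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplacianSplit(nn.Module):
    def __init__(self, channels: int = 6, kernel_size: int = 5, stride: int = 2):
        super().__init__()
        # Depthwise "average" convolution: fixed uniform weights, one filter per channel.
        self.low_pass = nn.Conv1d(channels, channels, kernel_size, stride=stride,
                                  padding=kernel_size // 2, groups=channels, bias=False)
        nn.init.constant_(self.low_pass.weight, 1.0 / kernel_size)
        self.low_pass.weight.requires_grad_(False)

    def forward(self, x: torch.Tensor):
        # x: (B, 6, L) windowed IMU signals in the global frame
        x_ld = self.low_pass(x)                                # (B, 6, ~L/2)
        x_low = F.interpolate(x_ld, size=x.shape[-1],
                              mode="nearest")                  # back to length L
        x_high = x - x_low                                     # high-frequency residual
        return x_low, x_high

# Example: x_low, x_high = LaplacianSplit()(torch.randn(8, 6, 200))
```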

3. Dual-Branch MambaIO Network Architecture

The MambaIO architecture processes decomposed IMU signals through separate branches before fusion and prediction:

A. Multi-Path Convolution (MPC) High-Frequency Branch

  • Three parallel depthwise convolutions with kernel sizes $k \in \{1, 3, 7\}$ (stride 1) extract multi-scale local features from $X_{\mathrm{high}}$.
  • Outputs are concatenated ($\mathbb{R}^{18 \times L}$), passed through an SE block (channel reweighting), then compressed via a $1 \times 1$ convolution:

$$X_{\mathrm{MPC}} = \mathrm{Conv}_{1 \times 1}\bigl(\mathrm{SE}(\mathrm{Concat}(X_0, X_1, X_2))\bigr)$$
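The following PyTorch sketch illustrates the MPC branch structure described above; channel widths, the SE reduction ratio, and the class names `MPCBranch`/`SEBlock` are assumptions for illustration rather than the paper's implementation.

```python
# Illustrative sketch of the multi-path convolution (MPC) branch: three parallel
# depthwise convolutions (k = 1, 3, 7), channel concatenation, squeeze-and-excitation
# reweighting, and a 1x1 compression.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, L)
        w = self.fc(x.mean(dim=-1))              # squeeze over time, excite per channel
        return x * w.unsqueeze(-1)

class MPCBranch(nn.Module):
    def __init__(self, channels: int = 6, out_channels: int = 6):
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in (1, 3, 7)])
        self.se = SEBlock(3 * channels)
        self.compress = nn.Conv1d(3 * channels, out_channels, kernel_size=1)

    def forward(self, x_high):                   # (B, 6, L)
        feats = torch.cat([p(x_high) for p in self.paths], dim=1)   # (B, 18, L)
        return self.compress(self.se(feats))                        # (B, out, L)
```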

B. Mamba State Space Model (SSM) Low-Frequency Branch

  • The Mamba block models $X_{\mathrm{low}}$ via learned input gating and convolution, producing two parallel streams:

$$X_1 = \mathrm{SSD}\bigl(\sigma(\mathrm{Conv}(\mathrm{Linear}(X_{\mathrm{low}})))\bigr), \qquad X_2 = \sigma(\mathrm{Conv}(\mathrm{Linear}(X_{\mathrm{low}})))$$

  • Streams are concatenated and linearly fused:

$$X_{\mathrm{Mamba}} = \mathrm{Linear}(\mathrm{Concat}(X_1, X_2))$$

  • A self-attention layer is appended post-Mamba to emphasize critical time frames.
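A structural sketch of this branch is given below. The selective state-space (SSD) operator is stood in by a placeholder recurrent module (a real implementation would use a Mamba/SSD kernel), and the hidden width, head count, and class name `MambaBranch` are assumptions, not the paper's code.

```python
# Structural sketch of the low-frequency branch following the equations above:
# two gated streams, a long-range scan (placeholder for SSD), linear fusion,
# and a post-fusion self-attention layer.
import torch
import torch.nn as nn

class MambaBranch(nn.Module):
    def __init__(self, channels: int = 6, hidden: int = 64, out_channels: int = 6):
        super().__init__()
        self.proj1 = nn.Linear(channels, hidden)
        self.proj2 = nn.Linear(channels, hidden)
        self.conv1 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.act = nn.SiLU()
        self.ssd = nn.GRU(hidden, hidden, batch_first=True)   # placeholder for the SSD scan
        self.fuse = nn.Linear(2 * hidden, out_channels)
        self.attn = nn.MultiheadAttention(out_channels, num_heads=1, batch_first=True)

    def forward(self, x_low):                      # (B, 6, L)
        z = x_low.transpose(1, 2)                  # (B, L, 6)
        s1 = self.act(self.conv1(self.proj1(z).transpose(1, 2))).transpose(1, 2)
        s2 = self.act(self.conv2(self.proj2(z).transpose(1, 2))).transpose(1, 2)
        x1, _ = self.ssd(s1)                       # long-range stream (SSD stand-in)
        x2 = s2                                    # gated local stream
        fused = self.fuse(torch.cat([x1, x2], dim=-1))          # (B, L, out)
        attended, _ = self.attn(fused, fused, fused)            # emphasize key frames
        return attended.transpose(1, 2)                         # (B, out, L)
```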

C. Branch Fusion and Pose Prediction

The MPC and Mamba streams are concatenated ($\mathbb{R}^{12 \times L}$) and passed through a $1 \times 1$ convolution to produce $X_{\mathrm{out}}$, which is regressed to pose increments $(\Delta\mathbf{p},\, \Delta\mathbf{q})$.

Data Flow Schematic

$$(\mathbf{a},\boldsymbol{\omega}) \xrightarrow{\text{global frame}} X \xrightarrow{\text{Laplacian pyramid}} (X_{\mathrm{low}}, X_{\mathrm{high}}), \qquad X_{\mathrm{low}} \xrightarrow{\text{Mamba}} X_{\mathrm{Mamba}} \;\;\text{and}\;\; X_{\mathrm{high}} \xrightarrow{\text{MPC}} X_{\mathrm{MPC}} \xrightarrow{\text{fusion}} X_{\mathrm{out}} \xrightarrow{\text{head}} (\Delta\mathbf{p},\Delta\mathbf{q})$$
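A minimal sketch of the fusion and regression head follows, assuming each branch emits a 6-channel feature map of length $L$; the temporal pooling and the output parameterization (3-D translation plus unit quaternion) are illustrative choices, not confirmed details of the paper.

```python
# Sketch of the fusion and regression head: branch outputs (B, 6, L) are concatenated,
# compressed by a 1x1 convolution, pooled over time, and regressed to (dp, dq).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    def __init__(self, branch_channels: int = 6, fused_channels: int = 64):
        super().__init__()
        self.fuse = nn.Conv1d(2 * branch_channels, fused_channels, kernel_size=1)
        self.head_p = nn.Linear(fused_channels, 3)   # translation increment dp
        self.head_q = nn.Linear(fused_channels, 4)   # orientation increment dq

    def forward(self, x_mamba, x_mpc):               # each (B, 6, L)
        x_out = self.fuse(torch.cat([x_mamba, x_mpc], dim=1))   # (B, fused, L)
        feat = x_out.mean(dim=-1)                                # temporal pooling
        dp = self.head_p(feat)
        dq = F.normalize(self.head_q(feat), dim=-1)              # unit quaternion
        return dp, dq
```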

4. Underlying State Estimation Equations and Loss Functions

MambaIO’s regression task implicitly approximates the discrete IO equations under the global frame. The system state at time $t$ is $\mathbf{s}_t = [\mathbf{p}_t,\, \mathbf{v}_t,\, \mathbf{q}_t]$, with

  • $\mathbf{p}_t \in \mathbb{R}^3$: position,
  • $\mathbf{v}_t \in \mathbb{R}^3$: velocity,
  • $\mathbf{q}_t$: unit orientation quaternion (representing a rotation in $SO(3)$).

Given $(\mathbf{a}_t,\, \boldsymbol{\omega}_t)$ in the global frame, idealized strapdown integration follows:

$$\begin{aligned}
\mathbf{p}_t &= \mathbf{p}_{t-1} + \mathbf{v}_{t-1}\,\Delta t + \tfrac{1}{2}(\mathbf{a}_t - \mathbf{g})\,\Delta t^2, \\
\mathbf{v}_t &= \mathbf{v}_{t-1} + (\mathbf{a}_t - \mathbf{g})\,\Delta t, \\
\mathbf{q}_t &= \mathbf{q}_{t-1} \otimes \exp\!\bigl(\tfrac{1}{2}\,\boldsymbol{\omega}_t\,\Delta t\bigr),
\end{aligned}$$

where $\mathbf{g} = (0, 0, 9.81)\ \mathrm{m/s^2}$ and $\otimes$ denotes the quaternion product.
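For concreteness, a NumPy sketch of one idealized strapdown step under these equations follows; the quaternion convention `(w, x, y, z)` and the helper names are assumptions for illustration.

```python
# Minimal sketch of the idealized global-frame strapdown update: stepping
# position, velocity, and orientation from one IMU sample to the next.
import numpy as np

G = np.array([0.0, 0.0, 9.81])          # gravity in the global frame (m/s^2)

def quat_mul(q, r):
    """Hamilton product of quaternions (w, x, y, z)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2])

def strapdown_step(p, v, q, a_global, omega, dt):
    """One integration step; a_global and omega are global-frame IMU samples."""
    lin_acc = a_global - G                                  # remove gravity
    p_next = p + v * dt + 0.5 * lin_acc * dt**2
    v_next = v + lin_acc * dt
    theta = omega * dt                                      # rotation vector
    angle = np.linalg.norm(theta)
    if angle > 1e-12:
        axis = theta / angle
        dq = np.concatenate([[np.cos(angle / 2)], np.sin(angle / 2) * axis])
    else:
        dq = np.array([1.0, 0.0, 0.0, 0.0])
    q_next = quat_mul(q, dq)
    return p_next, v_next, q_next / np.linalg.norm(q_next)
```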

The training loss is a weighted sum of position and quaternion errors:

$$\mathcal{L} = \sum_{t=1}^{L} \Bigl[ \lambda_p\,\| \hat{\mathbf{p}}_t - \mathbf{p}_t \|^2 + \lambda_q\,\| \hat{\mathbf{q}}_t \ominus \mathbf{q}_t \|^2 \Bigr]$$

where $\ominus$ is an axis-angle quaternion error measure and $\lambda_p, \lambda_q$ are weighting hyperparameters.
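A hedged PyTorch sketch of this loss is shown below, approximating the quaternion error by the geodesic angle between predicted and ground-truth orientations; the exact error measure and reduction used by the authors may differ.

```python
# Sketch of the weighted position/orientation loss: squared position error plus
# squared geodesic orientation error, weighted by lambda_p and lambda_q.
import torch

def pose_loss(p_hat, p_gt, q_hat, q_gt, lambda_p=1.0, lambda_q=1.0):
    # p_*: (N, 3) positions; q_*: (N, 4) unit quaternions (w, x, y, z)
    pos_err = ((p_hat - p_gt) ** 2).sum(dim=-1)
    # Geodesic angle between predicted and ground-truth orientations
    # (abs handles the quaternion double cover).
    dot = (q_hat * q_gt).sum(dim=-1).abs().clamp(max=1.0)
    ang_err = 2.0 * torch.acos(dot)
    return (lambda_p * pos_err + lambda_q * ang_err ** 2).mean()
```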

5. Training Procedure and Empirical Validation

MambaIO is evaluated on six public pedestrian IO benchmarks: RIDI, RoNIN, RNIN, OxIOD, TLIO, IMUNet, encompassing diverse use cases and conditions. The training protocol is as follows:

  • Network: Four hierarchical stages with channel widths [64, 128, 256, 512], followed by the dual MambaIO branches.
  • Optimizer: Adam (initial learning rate $1 \times 10^{-4}$, cosine/step decay to $1 \times 10^{-6}$); a configuration sketch follows this list.
  • Windows: $L = 200$; batch size ≈ 64 per GPU (5× NVIDIA RTX 3090, PyTorch 2.5.1).
  • Early stopping at epoch 40.
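A minimal PyTorch sketch of this optimizer/scheduler setup is given below, assuming cosine annealing from 1e-4 to 1e-6 over 40 epochs; `model` is a placeholder and the authors' exact schedule may differ.

```python
# Optimizer and learning-rate schedule consistent with the settings listed above
# (Adam, 1e-4 decaying to 1e-6, 40 epochs); a sketch, not the paper's training script.
import torch

def build_optimization(model, epochs: int = 40):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=1e-6)
    return optimizer, scheduler
```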

Evaluation metrics:

  • Absolute Trajectory Error (ATE): Global RMS position error.
  • Relative Trajectory Error (RTE): Sliding-window RMS error (window ≈ 1 s), quantifying local drift; a computation sketch follows this list.
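The following NumPy sketch illustrates how ATE and RTE could be computed from aligned predicted and ground-truth trajectories; the window length and any trajectory alignment used in the paper's evaluation are assumptions here.

```python
# Illustrative computation of ATE and RTE from aligned trajectories sampled at a
# fixed rate; the paper's exact evaluation protocol may differ.
import numpy as np

def ate(p_hat: np.ndarray, p_gt: np.ndarray) -> float:
    """Absolute Trajectory Error: RMS of per-sample position error, shapes (N, 3)."""
    return float(np.sqrt(np.mean(np.sum((p_hat - p_gt) ** 2, axis=-1))))

def rte(p_hat: np.ndarray, p_gt: np.ndarray, window: int) -> float:
    """Relative Trajectory Error: RMS error of displacements over a sliding window."""
    d_hat = p_hat[window:] - p_hat[:-window]      # predicted displacement per window
    d_gt = p_gt[window:] - p_gt[:-window]         # ground-truth displacement per window
    return float(np.sqrt(np.mean(np.sum((d_hat - d_gt) ** 2, axis=-1))))
```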

MambaIO achieves consistent improvements over prior methods (RoNIN-ResNet, TLIO) in both ATE and RTE, as summarized:

| Dataset | Baseline (ATE / RTE) | MambaIO (ATE / RTE) | Improvement |
|---------|----------------------|---------------------|-------------|
| RoNIN   | 0.58 m / 0.32 m/s    | 0.51 m / 0.28 m/s   | 12%         |
| TLIO    | 0.48 m / 0.26 m/s    | 0.42 m / 0.23 m/s   | 13%         |

Across all datasets, MambaIO yields 8–15% reduction in ATE and 9–14% in RTE (Zhang, 19 Nov 2025). Qualitative trajectory plots show close adherence to ground truth, particularly in complex navigational scenarios.

6. Ablation Study and Component Contributions

Two ablated variants were analyzed:

  • Conv-only (MPC branch, no Mamba): average ATE ≈ 0.55 m
  • Mamba-only (no MPC branch): average ATE ≈ 0.57 m

Both outperform RoNIN-ResNet but underperform full MambaIO (ATE ≈ 0.51 m). This confirms that dual-frequency decomposition and branch-specific modeling yield complementary accuracy gains, with MPC capturing local detail and Mamba SSM capturing global context.

7. Conclusions and Prospective Extensions

MambaIO systematically revisits the global-coordinate paradigm for pedestrian inertial odometry, providing theoretical and empirical evidence of its superiority for human motion tracking, due to temporally coherent, gravity-aligned IMU streams. By leveraging Laplacian pyramid-based frequency decoupling and dedicated SSM/convolutional branches, MambaIO jointly models both coarse movement trajectories and fine-grained dynamics. Results on six benchmarks set new SOTA for pedestrian IO (8–15% ATE, 9–14% RTE improvement). Prospective directions include optimizing the SSM branch’s runtime and extending the Laplacian pyramid decomposition to multiple levels to enhance multi-scale motion modeling (Zhang, 19 Nov 2025).
