
DWSFormer: Transformer-Based Inertial Odometry

Updated 7 February 2026
  • DWSFormer is a transformer-inspired inertial odometry framework that leverages implicit nonlinear high-dimensional feature mapping and dual-wing collaborative attention to tackle drift in complex motion.
  • It integrates multi-scale gated convolutional units to combine local motion cues with global context, enabling robust trajectory reconstruction under highly nonlinear conditions.
  • Empirical evaluations on multiple benchmarks show significant reductions in trajectory error, setting new performance standards compared to state-of-the-art methods.

DWSFormer is a transformer-inspired inertial odometry (IO) framework designed to address the challenges of reconstructing accurate trajectories from consumer-grade inertial sensors under complex, highly nonlinear motion. The architecture introduces a sequence of innovations—implicit nonlinear high-dimensional feature mapping (Star Operation), a collaborative channel-temporal attention mechanism (Dual-Wing), and multi-scale gated convolutional units—enabling robust and efficient modeling of motion dynamics in scenarios where classical and existing deep learning methods typically suffer from significant drift. DWSFormer consistently outperforms state-of-the-art baselines across a suite of challenging benchmarks, yielding substantial reductions in trajectory error and setting new performance standards for lightweight, practical pose estimation (Zhang et al., 22 Jul 2025).

1. Motivation and Problem Setting

Inertial odometry (IO) seeks to recover a moving carrier’s trajectory exclusively from the angular velocity and linear acceleration measurements produced by an Inertial Measurement Unit (IMU). Classical mechanistic solutions—double integration of accelerometer data under Newtonian physics—are acutely susceptible to unbounded error accumulation (drift), stemming from intrinsic sensor noise and bias. Techniques such as Zero-Velocity Updates (ZUPT) and step counting partially mitigate this drift by imposing priors on locomotion (e.g., foot stance intervals); however, these approaches degrade under complex, non-pedestrian, or nonlinear trajectories (sharp turns, loops, irregular movement).
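The drift mechanism described above can be made concrete with a minimal NumPy sketch: a small constant accelerometer bias, double-integrated, produces position error that grows quadratically in time. All numbers here are illustrative, not from the paper.

```python
import numpy as np

dt = 0.01                       # hypothetical 100 Hz IMU
t = np.arange(0, 60, dt)        # one minute of data
true_accel = np.zeros_like(t)   # the carrier is actually stationary
bias = 0.02                     # assumed 0.02 m/s^2 constant sensor bias

measured = true_accel + bias
vel = np.cumsum(measured) * dt  # first integration  -> velocity error
pos = np.cumsum(vel) * dt       # second integration -> position error

# Position error grows as 0.5 * bias * t^2, roughly 36 m after 60 s.
print(f"position error after 60 s: {pos[-1]:.1f} m")
```

Even this tiny bias, far below what a user would notice in raw readings, makes the naive mechanistic solution unusable after a minute, which is what motivates ZUPT-style priors and learned regressors.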

Modern approaches leverage deep neural networks (CNNs, LSTMs) to directly regress velocity or displacement from IMU streams. While such models demonstrate improved accuracy over rigid physics-based filters for near-linear motion, they exhibit systematic failure modes for highly nonlinear trajectories. Two principal limitations are identified (Zhang et al., 22 Jul 2025):

  • Restricted nonlinear representation: Standard deep networks primarily employ pointwise activations (ReLU, GELU, ELU) to introduce nonlinearity, insufficiently capturing higher-order couplings (e.g., simultaneous rotations and accelerations).
  • Imbalanced local-global reasoning: CNNs and LSTMs excel at capturing either short-term local trends or long-term dependencies, but neither integrates both with sufficient fidelity for IO drift management over extended, nonlinear paths.

As a result, deep IO networks frequently neglect subtle, correlational motion cues during complex maneuvers, leading to rapid drift and degraded localization accuracy in real-world deployment scenarios.

2. DWSFormer Architecture Overview

DWSFormer (Dual-Wing Star Transform) constitutes a lightweight framework tailored for advanced inertial odometry under challenging motion regimes. Its architecture is organized as follows (Zhang et al., 22 Jul 2025):

  • Input: Windows of raw IMU signals $\mathbf{X}_0 \in \mathbb{R}^{6\times L}$ (6 channels: three accelerometer and three gyroscope axes).
  • Initial Lifting: A 1D convolution (kernel size 3) lifts $\mathbf{X}_0$ to $\mathbf{X}_1$, which is then processed through four hierarchical stages.
  • Stage Composition: Each stage applies downsampling (stride-3 convolution and BatchNorm) and stacks $N_i$ Dual-Wing Star Transform Blocks (DWSTB). Each DWSTB is composed of:

    1. Star Operation: Nonlinear feature mapping to $\mathbb{R}^{M\times L}$.
    2. Dual-Wing Star Block (DWSB): Joint channel-temporal attention for global dependency modeling.
    3. Multi-Scale Gated Convolutional Unit (MSGCU): Fusing local dynamic patterns with global channel context.

Residual connections enclose both DWSB and MSGCU:

$$\mathbf{Y}_i = \mathbf{X}_i + \mathrm{DWSB}(\mathrm{star}(\mathbf{X}_i)), \qquad \mathbf{X}_{i+1} = \mathbf{Y}_i + \mathrm{MSGCU}(\mathbf{Y}_i).$$

A global pooling and shallow regression head then estimate the average per-window velocity $\hat{\mathbf{v}} \in \mathbb{R}^2$; sequential integration of these velocities reconstructs the estimated trajectory.
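The residual composition and the final velocity-to-trajectory step can be sketched as follows. The `star`, `dwsb`, and `msgcu` functions are stand-in placeholders here (the real blocks are described in Sections 3 to 5); only the wiring mirrors the text.

```python
import numpy as np

def star(X):  return X           # placeholder for the Star Operation
def dwsb(Z):  return 0.1 * Z     # placeholder attention output
def msgcu(Y): return 0.1 * Y     # placeholder gated-conv output

def dwstb(X):
    # Residual connections enclose both sub-blocks, as in the text:
    Y = X + dwsb(star(X))        # Y_i     = X_i + DWSB(star(X_i))
    return Y + msgcu(Y)          # X_{i+1} = Y_i + MSGCU(Y_i)

# Per-window planar velocities v_hat in R^2 are integrated sequentially
# to reconstruct the trajectory (window_dt is an assumed window length):
def integrate(v_hat, window_dt):
    return np.cumsum(v_hat * window_dt, axis=0)   # (T, 2) positions

v_hat = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
traj = integrate(v_hat, window_dt=1.0)
print(traj[-1])   # final position after three one-second windows
```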

3. Star Operation: Implicit Nonlinear Feature Mapping

The Star Operation fundamentally enhances network representational power by projecting IMU features into an implicit high-dimensional nonlinear space, facilitating the capture of complex motion couplings (Zhang et al., 22 Jul 2025). For a step-wise feature $\mathbf{x} \in \mathbb{R}^C$, an augmented vector $\mathbf{x}' = [\mathbf{x}; 1] \in \mathbb{R}^{C+1}$ is processed through two independent pointwise convolutions:

$$\mathbf{u} = W_2' \mathbf{x}', \qquad \mathbf{v} = W_3' \mathbf{x}', \qquad W_j' \in \mathbb{R}^{M\times(C+1)}.$$

The Star map is the elementwise product

$$\mathrm{star}(\mathbf{x}) = \mathbf{u} \odot \mathbf{v} \in \mathbb{R}^M,$$

which implicitly yields $(C+1)(C+2)/2$ distinct quadratic feature terms per output coordinate, analogous to a second-order polynomial kernel. Sequential extension yields $\mathrm{star}(\mathbf{X}_i) \in \mathbb{R}^{M\times L}$, substantially increasing the capacity for modeling higher-order motion interactions with a compact parameterization.
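A minimal NumPy sketch of the Star map on a single feature vector, following the formulation above; the dimensions `C` and `M` are illustrative. The final check confirms that each output coordinate is a full quadratic form in the augmented input, i.e. the second-order kernel claimed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
C, M = 6, 16                      # input channels, lifted width (illustrative)

x  = rng.normal(size=C)
xp = np.concatenate([x, [1.0]])   # x' = [x; 1] in R^{C+1}

W2 = rng.normal(size=(M, C + 1))  # the two independent pointwise convolutions
W3 = rng.normal(size=(M, C + 1))

star = (W2 @ xp) * (W3 @ xp)      # u ⊙ v in R^M

# Coordinate m equals sum_{i,j} W2[m,i] W3[m,j] x'_i x'_j:
# a quadratic form over all pairs of augmented input dimensions.
m = 0
quad = sum(W2[m, i] * W3[m, j] * xp[i] * xp[j]
           for i in range(C + 1) for j in range(C + 1))
assert np.isclose(star[m], quad)
```

Because the quadratic terms arise from a single elementwise product of two linear maps, the high-order feature space comes essentially for free in parameters and FLOPs.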

4. Dual-Wing Collaborative Attention

After nonlinear projection, DWSFormer employs a lightweight dual-path (channel and temporal) attention mechanism to efficiently encode long-range dependencies and global context, while maintaining low computational complexity (Zhang et al., 22 Jul 2025):

  • Channel-wise attention: Mean and standard deviation statistics pooled along the temporal axis yield per-channel global descriptors, which are linearly fused and processed through a 1D convolution and a sigmoid to produce channel weights. These are broadcast over the temporal axis.

  • Temporal-wise attention: The process is mirrored along the temporal dimension, generating time-step weights broadcast over channels.

The dual outputs are combined as

$$\mathbf{Z}_{\mathrm{DWS}} = W_c \odot \mathbf{Z} + W_t \odot \mathbf{Z},$$

where $W_c, W_t \in \mathbb{R}^{M\times L}$ are broadcast weight matrices, efficiently injecting global channel and temporal context. This dual-wing attention scales linearly in both $M$ and $L$, avoiding the quadratic cost of standard self-attention layers.
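A NumPy sketch of the dual-wing combination under the description above. The linear fusion and 1D convolution of the pooled statistics are simplified here to a plain sum followed by a sigmoid, so the weights are illustrative rather than the paper's exact parameterization; the point is the pool-then-broadcast structure and its linear cost.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dual_wing(Z):
    M, L = Z.shape
    # Channel wing: pool over time -> per-channel weights, broadcast over L.
    ch_stats = Z.mean(axis=1) + Z.std(axis=1)         # fused descriptors, (M,)
    Wc = np.broadcast_to(sigmoid(ch_stats)[:, None], (M, L))
    # Temporal wing: pool over channels -> per-step weights, broadcast over M.
    t_stats = Z.mean(axis=0) + Z.std(axis=0)          # (L,)
    Wt = np.broadcast_to(sigmoid(t_stats)[None, :], (M, L))
    # Z_DWS = Wc ⊙ Z + Wt ⊙ Z -- cost is O(M·L), not O(L^2).
    return Wc * Z + Wt * Z

Z = np.random.default_rng(1).normal(size=(16, 200))
out = dual_wing(Z)
assert out.shape == Z.shape
```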

5. Multi-Scale Gated Convolutional Units (MSGCU)

In lieu of standard Transformer feedforward layers, DWSFormer implements multi-scale gated convolutional units to integrate fine-grained and global information:

  • Value branch: Local motion cues are aggregated using a depthwise 1D convolution (e.g., kernel size 3), generating a feature map sensitive to short-term dynamics.
  • Gating branch: A global temporal average forms a latent descriptor, mapped by a two-layer MLP (with a nonlinearity and a sigmoid or softmax) to produce channel gating weights in $(0,1)^M$.

Channel-wise gating modulates the convolutional output:

$$\mathrm{MSGCU}(\mathbf{Y}) = \mathbf{g} \odot \mathbf{V},$$

with possible extension to multiple branch convolutions of varying kernel sizes to accommodate multi-scale temporal dependencies.
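The two branches and the gating product can be sketched as below: a depthwise 1D convolution for the value branch and a globally pooled two-layer MLP for the gate. The MLP widths and random weights are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def msgcu(Y, k=3, rng=np.random.default_rng(2)):
    M, L = Y.shape
    # Value branch: depthwise 1D convolution (one kernel per channel),
    # zero-padded so the temporal length L is preserved.
    kernels = rng.normal(size=(M, k))
    pad = k // 2
    Yp = np.pad(Y, ((0, 0), (pad, pad)))
    V = np.stack([np.convolve(Yp[m], kernels[m], mode="valid")
                  for m in range(M)])                  # (M, L)
    # Gating branch: global temporal average -> two-layer MLP -> (0,1)^M.
    d = Y.mean(axis=1)                                 # latent descriptor, (M,)
    W1, W2 = rng.normal(size=(M, M)), rng.normal(size=(M, M))
    g = sigmoid(W2 @ np.tanh(W1 @ d))                  # channel gates
    return g[:, None] * V                              # MSGCU(Y) = g ⊙ V

Y = np.random.default_rng(3).normal(size=(8, 50))
out = msgcu(Y)
assert out.shape == Y.shape
```

Swapping the single `k=3` branch for several parallel branches with different kernel sizes gives the multi-scale variant mentioned above.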

6. Empirical Results and Computational Properties

DWSFormer was rigorously evaluated on six inertial-only trajectory benchmarks: RoNIN, RIDI, RNIN-VIO, TLIO, OxIOD, and IMUNet (Zhang et al., 22 Jul 2025). Metrics include Absolute Trajectory Error (ATE) and Relative Trajectory Error (RTE). The performance gains over prior art are summarized below:

| Dataset  | SOTA ATE (m) | DWSFormer ATE (m) | Reduction |
|----------|--------------|-------------------|-----------|
| RoNIN    | 4.165        | 3.948             | 5.2%      |
| RIDI     | 2.080        | 2.033             | 2.3%      |
| RNIN-VIO | 1.844        | 1.455             | 21.1%     |
| IMUNet   | 6.428        | 4.964             | 22.8%     |
| TLIO     | 1.333        | 1.108             | 16.9%     |
| OxIOD    | 4.336        | 1.484             | 65.8%     |
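The reduction column follows directly from the two ATE columns; a quick check:

```python
# Reported ATE values (m) from the comparison above.
sota = {"RoNIN": 4.165, "RIDI": 2.080, "RNIN-VIO": 1.844,
        "IMUNet": 6.428, "TLIO": 1.333, "OxIOD": 4.336}
ours = {"RoNIN": 3.948, "RIDI": 2.033, "RNIN-VIO": 1.455,
        "IMUNet": 4.964, "TLIO": 1.108, "OxIOD": 1.484}

# Percent reduction relative to the prior state of the art.
red = {n: round(100 * (sota[n] - ours[n]) / sota[n], 1) for n in sota}
for name, r in red.items():
    print(f"{name:8s} {r:5.1f}%")
```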

Ablation studies demonstrate that even the DWSTB-only configuration (2.25M params, 21.6M FLOPs) outperforms larger models such as RoNIN-ResNet (4.64M params, 38.3M FLOPs), while adding MSGCU further reduces error at modest computational overhead (final: 2.76M params, 25.1M FLOPs). DWSFormer also exhibits tight error distributions: on TLIO, 80% of windows attain ATE < 0.08 m, significantly surpassing baselines. Qualitative trajectory visualizations reveal markedly reduced drift, particularly in turn-and-loop scenarios.

7. Prospects and Extensions

Future work is proposed along several axes (Zhang et al., 22 Jul 2025):

  • Integration of heading-estimation modules incorporating magnetometer or equivariant representations to further curtail orientation drift, particularly under loop closures.
  • Extension to full 6-DOF pose estimation by fusing additional sensory modalities (e.g., visual, barometric).
  • Development for deployment on extreme low-resource embedded platforms.

These directions highlight the adaptability of the DWSFormer backbone for broader high-precision localization tasks in unconstrained environments.
