Neural Network Odometry

Updated 28 November 2025

Neural network-based odometry is a data-driven approach that estimates relative motion by learning discriminative features from raw sensor inputs.
It employs specialized architectures, including convolutional, recurrent, and attention modules, to extract and fuse cues from modalities like GPR, LiDAR, cameras, and radar.
Benchmark results show that these methods can reduce drift and outperform classical techniques, offering robust performance in noisy or ambiguous settings.

A neural network-based odometry method is a data-driven localization approach that estimates the relative motion between consecutive measurements acquired by a mobile agent (robot, vehicle, capsule, etc.) by leveraging representations learned by deep neural networks. Unlike model-based odometry, which relies on explicit geometric or kinematic formulations, these methods employ convolutional, recurrent, or hybrid neural architectures to infer pose changes directly from raw or minimally processed sensor data—such as images, point clouds, radar returns, or ground-penetrating radar (GPR) B-scans—by learning discriminative features and temporal relationships from large volumes of data. Recent advances demonstrate that neural approaches can match or exceed the accuracy and robustness of classical methods, particularly in scenarios with weak, ambiguous, or noisy environmental structure.

1. Sensor Modalities and Input Preprocessing

Neural odometry methods have been proposed for a wide range of exteroceptive and interoceptive sensing modalities:

Ground Penetrating Radar (GPR) B-scans: Inputs are single-channel 2D grayscale images $\mathbf{B}_{t-1},\,\mathbf{B}_t$ of size $H\times L$ acquired at consecutive time steps, with each column corresponding to a time-sampled radar return. Preprocessing includes Butterworth bandpass filtering, spreading and exponential compensation (SEC) gain, high-pass (dewow) background suppression, wavelet denoising, and interpolation/stitching to fixed dimensions, followed by normalization to $[0,1]$ to standardize pixel intensities. This preprocessing is needed because real-world GPR data exhibit characteristic noise, weak contrast, and a wide dynamic range (Wang et al., 21 Nov 2025).
LiDAR Point Clouds: Representations include (a) 2D projections to depth or intensity images; (b) sparse voxel grids; (c) lower-dimensional point feature descriptors computed via auxiliary networks (e.g., semantic segmentation backbones); and (d) raw point clouds paired for registration. Additional processing may include PCA compression of local features, normal estimation, or explicit geometric keypoint detection (Honda et al., 2022, Zheng et al., 2020, Xu et al., 2020).
Cameras (Monocular/Stereo/RGB-D): Inputs are pairs of RGB images, potentially augmented with predicted or measured depth maps. Preprocessing often involves resizing, stacking, mean subtraction, and photometric normalization (Zhao et al., 2019, Xue et al., 2018).
Radar: Frequency-Modulated Continuous-Wave (FMCW) radar produces polar or Cartesian intensity images, sometimes split into multiple channels reflecting different modulation patterns or Doppler shifts; preprocessing includes resampling, normalization, and per-channel Doppler-aware transformations (Rennie et al., 2023).
Inertial and Wheel Encoders: Interoceptive odometry pipelines use sequences of raw IMU (accelerometer/gyroscope) and wheel-encoder measurements, often as sliding windows across time. Preprocessing includes sensor bias compensation, transformation to vehicle coordinates, and sliding window organization (Jiang et al., 14 Jul 2024, Tang et al., 2021).

These varied sensing streams demand different neural architectures and network-level fusion strategies to optimally extract odometric cues.
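Two of the preprocessing steps listed above, normalization of a B-scan to $[0,1]$ and sliding-window organization of inertial samples, can be illustrated with a minimal sketch. The function names and the choice of min-max scaling are assumptions made for illustration; the cited works may normalize differently.

```python
def normalize_bscan(bscan):
    """Min-max scale a 2D B-scan (list of rows) to the range [0, 1]."""
    flat = [v for row in bscan for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant image carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in bscan]
    scale = hi - lo
    return [[(v - lo) / scale for v in row] for row in bscan]


def sliding_windows(samples, size, stride=1):
    """Split a time-ordered sequence into overlapping fixed-length windows."""
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, stride)]
```

Filtering, gain compensation, and denoising would precede the normalization step in a full pipeline and are not shown here.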

2. Network Architectural Paradigms

State-of-the-art neural odometry networks fall into several architectural categories determined by sensor, task granularity, and computational tradeoffs:

Multi-Branch Feature Networks: Architectures such as GPR-OdomNet utilize a ResNet-50 backbone modified for 1-channel GPR B-scans to extract multi-scale feature hierarchies. Feature maps are split into (i) a difference branch (focusing on spatial feature differences across time) and (ii) a similarity branch (extracting global, cross-time semantic similarity). Each branch uses further convolution, batch normalization, attention, and global pooling to yield compact descriptors (Wang et al., 21 Nov 2025).
Tiny Learned Regulators in Geometric Pipelines: In Generalized LOAM, small MLPs are embedded into the generalized iterative closest point (GICP) pipeline to learn optimal data association metrics and per-point covariance regularizations. The MLPs are trained to map PCA-compressed local geometric features to suitable matching embeddings and Mahalanobis weighting eigenvalues, thus supplanting hand-crafted heuristics in classical point-cloud odometry (Honda et al., 2022).
Sequential/Temporal Models: Recurrent Convolutional Neural Network (RCNN) frameworks chain convolutional feature extractors with LSTM modules to model temporal dependencies across consecutive scan pairs, as in visual or laser-based odometry. This temporal modeling significantly decreases cumulative drift compared to frame-wise estimation (Valente et al., 2019, Turan et al., 2017, Xue et al., 2018).
Attention and Matching Modules: Methods such as LodoNet use 2D spherical projections of LiDAR data, with SIFT keypoint detection and a PointNet-like selection module to identify, score, and select matched keypoint pairs. Separate regressors estimate rotation (via quaternions) and translation (Zheng et al., 2020).
Direct Regression and Fusion: Some models, such as DeepPCO and deep sensor fusion networks, deploy dual parallel CNNs to regress translation and orientation components directly from stacked depth images or sensor-specific representations, with fusion realized at the feature or regression head level (Wang et al., 2019, Valente et al., 2019).
Uncertainty and Correction Modules: Neural odometry back-ends can output both motion estimates and associated uncertainty (covariance) matrices, which are then passed to filtering or smoothing frameworks (e.g., an EKF or a smoother) or used for robust downstream integration (Jiang et al., 14 Jul 2024, Wang et al., 21 Nov 2025).
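To show how a predicted uncertainty can enter a filter, the sketch below performs one scalar Kalman measurement update that fuses a prior distance estimate with a network-predicted distance and its variance. It is a deliberately simplified one-dimensional case under assumed names (`fuse_distance` and its arguments are hypothetical), not the filter design of any cited paper.

```python
def fuse_distance(prior_mean, prior_var, meas_mean, meas_var):
    """One scalar Kalman measurement update.

    prior_*: current filter belief about the traveled distance.
    meas_*:  network output and its predicted variance (assumed inputs).
    """
    gain = prior_var / (prior_var + meas_var)  # Kalman gain in [0, 1]
    mean = prior_mean + gain * (meas_mean - prior_mean)
    var = (1.0 - gain) * prior_var             # posterior variance shrinks
    return mean, var
```

A large predicted variance drives the gain toward zero, so an uncertain network output barely moves the fused estimate.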

3. Representative Mathematical Formulations

Neural odometry networks typically map stacks of sensor data to relative pose increments via trainable function approximators:

Difference and Similarity Feature Aggregation

$\Delta \mathbf{F}_{t-1,t} = |\mathbf{F}_{t,d} - \mathbf{F}_{t-1,d}|\,,$

$D_{t-1,t} = \text{GAP}[\Delta\mathbf{F}^{\text{Conv}} \odot CA(\Delta\mathbf{F}^{\text{Conv}}) \odot SA(\Delta\mathbf{F}^{\text{Conv}})]\,,$

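where $\Delta\mathbf{F}^{\text{Conv}}$ is the difference map $\Delta \mathbf{F}_{t-1,t}$ after the branch's additional convolution, $CA(\cdot)$ and $SA(\cdot)$ denote channel and spatial attention, $\odot$ is element-wise multiplication, and GAP denotes global average pooling. The subscript $d$ marks features routed to the difference branch.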
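The aggregation can be traced on small nested lists with the following sketch. It omits the intermediate convolution and replaces the learned attention modules with parameter-free stand-ins (a sigmoid of the channel mean and a sigmoid of the cross-channel mean); both simplifications are assumptions for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def difference_descriptor(feat_prev, feat_curr):
    """Compute GAP[dF * CA(dF) * SA(dF)] for C x H x W nested lists."""
    # Element-wise absolute difference between the two feature maps.
    diff = [[[abs(c - p) for p, c in zip(row_p, row_c)]
             for row_p, row_c in zip(ch_p, ch_c)]
            for ch_p, ch_c in zip(feat_prev, feat_curr)]
    n_ch, height, width = len(diff), len(diff[0]), len(diff[0][0])

    # Channel attention (assumed form): sigmoid of each channel's mean.
    ca = [sigmoid(sum(map(sum, ch)) / (height * width)) for ch in diff]
    # Spatial attention (assumed form): sigmoid of the cross-channel mean.
    sa = [[sigmoid(sum(diff[k][i][j] for k in range(n_ch)) / n_ch)
           for j in range(width)] for i in range(height)]

    # Gate the difference map, then global-average-pool each channel.
    return [sum(diff[k][i][j] * ca[k] * sa[i][j]
                for i in range(height) for j in range(width))
            / (height * width)
            for k in range(n_ch)]
```

The result is one scalar per channel, which is the compact descriptor form that a regression head can consume.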

Cosine Similarity and Regression

$CS(\mathbf{F}_{t-1,s}, \mathbf{F}_{t,s}) = \frac{\mathbf{F}_{t-1,s}{\cdot}\mathbf{F}_{t,s}}{\|\mathbf{F}_{t-1,s}\|_2\ \|\mathbf{F}_{t,s}\|_2}\,,$

$\Delta\hat{d}_{t-1,t} = f_\theta(D_{t-1,t}, S_{t-1,t})\,,$

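where $D_{t-1,t}$ is the difference-branch descriptor, $S_{t-1,t}$ is the similarity-branch descriptor derived from $CS(\cdot,\cdot)$ over the features with subscript $s$, and $f_\theta$ is a multi-layer regressor producing the estimated traveled distance (Wang et al., 21 Nov 2025).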
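A minimal sketch of the cosine similarity term, applied to flattened feature vectors, is given here. The small `eps` guard against zero-norm inputs is an added assumption and does not appear in the formula.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two flattened feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps avoids division by zero when either vector is all zeros.
    return dot / max(norm_a * norm_b, eps)
```

Because the measure depends only on direction, rescaling either feature vector leaves the similarity unchanged.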

Tiny MLP Regulators in GICP

$j(i) = \arg\min_j \left\{ \|p^a_i - p^b_j\|^2 + \|\bar{f}^a_i - \bar{f}^b_j\|^2 \right\}\,,$

where $p^a_i$ and $p^b_j$ are points in the two clouds being registered and $\bar{f}$ are MLP-processed PCA features; adding the feature term improves on purely Euclidean nearest-neighbor association (Honda et al., 2022).
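The association rule can be written as a brute-force search, sketched below under the assumption that the per-point features have already been produced by the MLP. The names `associate` and `squared_distance` are hypothetical, and a practical system would use a spatial index rather than an exhaustive scan.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def associate(points_a, feats_a, points_b, feats_b):
    """For each source point i, return the target index j(i) that minimizes
    squared position distance plus squared feature distance."""
    matches = []
    for p_a, f_a in zip(points_a, feats_a):
        costs = [squared_distance(p_a, p_b) + squared_distance(f_a, f_b)
                 for p_b, f_b in zip(points_b, feats_b)]
        matches.append(min(range(len(costs)), key=costs.__getitem__))
    return matches
```

For example, a source point at the origin with feature `(1.0, 0.0)` prefers a target 0.5 m away with the same feature over a target 0.1 m away with feature `(0.0, 1.0)`, because the feature mismatch outweighs the extra distance.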

Temporal Odometry Output via RCNN

$\Delta p_t = W_{out} h^{(2)}_t + b_{out}\,,$

where $h^{(2)}_t$ is the hidden state of the second LSTM after feature extraction, outputting translation and rotation increments (Valente et al., 2019).
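The readout itself is a single affine map, as the sketch below shows. The weights, bias, and hidden state are placeholder values supplied by the caller; in a trained network they would come from the learned parameters and the LSTM forward pass.

```python
def pose_readout(hidden, weights, bias):
    """Affine map W_out h + b_out from an LSTM hidden state to a pose increment."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]
```

Each row of `weights` yields one component of the increment, so a planar model with two translations and one heading angle would use three rows.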

4. Objectives, Loss Functions, and Training Regimes

Regression and RMSE Losses: The dominant approach for supervised odometry is regression to pose increments, penalized with mean squared error (MSE) or root mean squared error (RMSE):

$\mathcal{L}(\theta) = \sqrt{ \frac{1}{N} \sum_{i=1}^N (O^{gt}_i - \Delta\hat{d}_i)^2 }$

or, for full 6-DoF,

$L = \|\Delta \hat{\mathbf{x}} - \Delta \mathbf{x}\|_2 + \beta \|\Delta \hat{\boldsymbol{\theta}} - \Delta \boldsymbol{\theta}\|_2$

with $\beta$ controlling the scaling between translation and rotation (Wang et al., 21 Nov 2025, Zhao et al., 2019, Turan et al., 2017).
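Both losses are short enough to state directly in code. The sketch below evaluates them on plain Python lists; the function names are illustrative, and an actual training loop would compute the same quantities with a differentiable tensor library.

```python
import math


def rmse_loss(targets, predictions):
    """Root mean squared error over N scalar distance increments."""
    n = len(targets)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n)


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def pose_loss(trans_pred, trans_gt, rot_pred, rot_gt, beta):
    """Translation error plus beta-weighted rotation error (both L2 norms)."""
    trans_err = l2_norm([p - g for p, g in zip(trans_pred, trans_gt)])
    rot_err = l2_norm([p - g for p, g in zip(rot_pred, rot_gt)])
    return trans_err + beta * rot_err
```

Since rotation errors in radians are usually numerically much smaller than translation errors in meters, $\beta$ is typically chosen greater than one so that neither term dominates.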

Self-supervised Losses: For cases with unlabelled data, e.g., SelfVoxeLO, training employs geometric self-consistency objectives such as nearest-neighbor residuals, spherical reprojection losses, and ICP-based correction terms, often with learnable uncertainty weights (Xu et al., 2020).
Classification/Ordinal Regression Losses: Some networks recast the regression task as ordinal classification over discretized bins, using binary cross-entropy or categorical losses for improved stability and convergence (Valente et al., 2019).
Training Details: Typical choices include the Adam or Adamax optimizers, batch sizes from 8 to 128, and learning rates between $10^{-4}$ and $10^{-5}$, decayed by a constant factor every 30–50 epochs. Datasets are split into training and testing trajectories, with cross-validation on benchmark sequences such as KITTI, CMU-GPR, Apollo-SouthBay, and custom collections (Wang et al., 21 Nov 2025, Honda et al., 2022, Xu et al., 2020).
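A step-wise learning-rate schedule of the kind described can be sketched as follows. The decay factor `gamma=0.5` and the 30-epoch step are assumed defaults chosen for illustration; the source gives only the 30–50 epoch interval, not the factor.

```python
def step_decay_lr(initial_lr, epoch, step_size=30, gamma=0.5):
    """Learning rate at a given epoch, multiplied by gamma every step_size epochs."""
    # gamma and step_size are illustrative assumptions, not reported values.
    return initial_lr * gamma ** (epoch // step_size)
```

Starting from $10^{-4}$ with these assumed settings, the rate stays constant for epochs 0 through 29, halves at epoch 30, and halves again at epoch 60.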

5. Quantitative Performance and Benchmark Comparisons

Neural odometry methods report metrics including:

Root-Mean-Square Error (RMSE) over relative pose increments or cumulative absolute trajectory error (ATE):

$\text{ATE RMSE} = \sqrt{ \frac{1}{2T} \sum_{t=1}^T [(x_t-\hat{x}_t)^2 + (y_t-\hat{y}_t)^2] }$

Percent improvement over state-of-the-art: For example, GPR-OdomNet achieves an overall weighted ATE RMSE of 0.449 m, a 10.2% reduction compared to the previous best method (DEC, 0.500 m). In per-dataset analysis, it consistently attains lower RMSE than earlier similarity-only or difference-only neural feature architectures (aggregated reduction of about 69.1% relative to feature-concatenation baselines). Full-network ablation confirms that combining the difference and similarity branches yields statistically significant gains (Wang et al., 21 Nov 2025).
Generalization and Robustness: Networks pretrained or fine-tuned on diverse environments (e.g., indoor/outdoor, grass/paved, varying atmospheric or ground conditions) demonstrate strong generalization capabilities, as neural submodules adapt to challenging, degenerate, or non-stationary contexts (Okawara et al., 12 Jul 2024, Honda et al., 2022, Wang et al., 21 Nov 2025).
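The ATE RMSE formula and the reported percent reduction can be checked with a short sketch. The code follows the $1/(2T)$ normalization exactly as written in the formula above; the function names are illustrative.

```python
import math


def ate_rmse(gt_xy, est_xy):
    """ATE RMSE over T planar positions, using the 1/(2T) normalization."""
    total = sum((x - xh) ** 2 + (y - yh) ** 2
                for (x, y), (xh, yh) in zip(gt_xy, est_xy))
    return math.sqrt(total / (2 * len(gt_xy)))


def percent_reduction(baseline, new):
    """Relative error reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline
```

With the values quoted above, `percent_reduction(0.500, 0.449)` evaluates to about 10.2, matching the stated reduction relative to DEC.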

6. Ablation Studies and Design Principles

Empirical ablation studies highlight several design insights:

Difference vs. Similarity Cues: In GPR-based odometry, spatial difference cues are generally more informative for B-scan change detection than global similarity, but optimal results are obtained when both cues are fused (Wang et al., 21 Nov 2025).
Tiny Neural Regulators: Minimal MLP inserts suffice to replace hand-tuned geometric heuristics in GICP-style pipelines, benefiting both data association robustness and covariance regularization (Honda et al., 2022).
Temporal Sequence Modeling: The addition of recurrent units (LSTM/GRU) to convolutional feature extractors dramatically reduces cumulative drift compared to CNN-only models, especially in sequential settings such as visual and laser odometry (Valente et al., 2019, Turan et al., 2017).
Attention and Selection Modules: PointNet-style attention mechanisms and keypoint selection modules efficiently downweight outlier/dynamic object matches in LiDAR odometry, improving SIFT-based 2D-3D matching performance (Zheng et al., 2020).

7. Contextual Significance and Limitations

Neural network-based odometry methods are rapidly advancing the ability to localize robustly in degraded, ambiguous, or sensor-limited environments. Their key advantages include adaptivity to sensor noise and environment, avoidance of brittle hand-engineered features, and the capability for end-to-end learning. However, tradeoffs persist regarding interpretability, drift control over extremely long horizons, and sensitivity to training domain mismatch. Integration with explicit filtering, loop-closure, or simultaneous localization and mapping (SLAM) backends is a leading area of research, with promising early demonstrations of tightly coupled neural-geometric pipelines achieving near state-of-the-art accuracy with sub-meter precision, even under adverse conditions (Wang et al., 21 Nov 2025, Honda et al., 2022, Xu et al., 2020, Valente et al., 2019).