Traj-Mamba: Efficient Trajectory Encoder
- Traj-Mamba Encoder is a neural architecture that leverages state-space models to jointly encode fine-grained GPS and road signals with linear time complexity.
- Its travel purpose-aware pre-training aligns trajectory embeddings with road and POI semantics using contrastive InfoNCE loss, enhancing interpretability.
- The encoder employs knowledge distillation and learnable mask generators to compress trajectories while preserving critical information for downstream tasks.
A Traj-Mamba Encoder is a specialized neural architecture rooted in state-space models (SSMs) and designed for efficient, semantic-rich representation learning of spatio-temporal trajectories—particularly for vehicle GPS data but extensible to other modalities such as robotic motion and autonomous driving. Its distinguishing features are its ability to jointly model both fine-grained motion patterns and high-level travel purposes through parallel GPS and road branches, a travel purpose-aware pre-training pipeline that leverages textual descriptions of roads and points-of-interest (POIs), and an efficient knowledge distillation scheme employing learnable mask generators for trajectory compression. The result is a trajectory encoder that operates with linear time complexity, supports robust and compressed trajectory embeddings, and provides state-of-the-art performance on several benchmark tasks.
1. Joint Modeling of GPS and Road Perspectives
At the architectural core of the Traj-Mamba Encoder is the simultaneous modeling of two complementary trajectory "views": the GPS view (continuous coordinates and time information) and the road view (discrete road segment IDs and categorical-temporal signals). For each trajectory point $p_i$:
- The GPS perspective encodes the raw GPS coordinates, relative time delta $\Delta t_i$, and timestamp (in minutes).
- The road perspective encodes a discrete road segment identifier, as well as temporal context such as day-of-week, hour, and minute.
- High-order movement features are extracted, including speed $v_i$, acceleration $a_i$, and movement angle $\theta_i$ (a small sketch of this extraction follows the list).
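To make these high-order features concrete, the sketch below derives speed, acceleration, and movement angle from raw (longitude, latitude, timestamp) triples. It is a minimal illustration assuming haversine distances and a simple heading angle; the paper's exact definitions and units may differ.

```python
import math

def movement_features(points):
    """Derive (speed, acceleration, angle) per point from a trajectory of
    (longitude, latitude, unix_timestamp) tuples. Minimal sketch: the
    paper's exact feature definitions may differ."""
    def haversine(p, q):
        # Great-circle distance in meters between two (lng, lat) pairs.
        R = 6371000.0
        lng1, lat1, lng2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
        return 2 * R * math.asin(math.sqrt(a))

    speeds, angles = [0.0], [0.0]
    for p, q in zip(points, points[1:]):
        dt = max(q[2] - p[2], 1e-6)                  # guard against zero time delta
        speeds.append(haversine(p, q) / dt)          # speed in m/s
        angles.append(math.atan2(q[1] - p[1], q[0] - p[0]))  # heading, radians
    accels = [0.0] + [
        (s2 - s1) / max(q[2] - p[2], 1e-6)           # finite-difference acceleration
        for s1, s2, p, q in zip(speeds, speeds[1:], points, points[1:])
    ]
    return list(zip(speeds, accels, angles))
```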
Each Traj-Mamba block within the encoder consists of two parallel branches:
- The GPS-SSM branch processes GPS features using a linear projection and 1D causal convolution (with SiLU activation), followed by a selection mechanism whereby the SSM parameters are learned as functions of the high-order movement feature vector $m_i = (v_i, a_i, \theta_i)$:

$$(\Delta_i, B_i, C_i) = \big(f_{\Delta}(m_i),\ f_{B}(m_i),\ f_{C}(m_i)\big)$$

  where $f_{\Delta}$, $f_{B}$, and $f_{C}$ are learned linear projections.
- The road-SSM branch mirrors this structure with its own feature embedding and SSM, and its parameters can be optionally conditioned on outputs from the GPS branch, enabling context coupling.
The fusion at each block:

$$o_i = \big(y^{\text{gps}}_i \odot g_i\big) \,\Vert\, \big(y^{\text{road}}_i \odot g_i\big)$$

Here, $y^{\text{gps}}_i$ and $y^{\text{road}}_i$ are the outputs from the GPS-SSM and road-SSM branches, $g_i$ is a gating signal from the road branch, $\Vert$ denotes concatenation, and $\odot$ denotes the Hadamard (element-wise) product. Outputs are concatenated and pooled over all blocks to form the final embedding.
The design supports linear computational complexity in sequence length and is robust to variable input lengths.
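To ground the block design, here is a compact PyTorch sketch of one Traj-Mamba block under stated assumptions: a simplified diagonal selective SSM whose $\Delta$, $B$, $C$ are predicted from a conditioning sequence, depthwise causal convolutions, and a sigmoid gate computed from the road branch. Layer sizes, the conditioning wiring, and the gating rule are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Simplified diagonal selective SSM: Delta, B, C are predicted from a
    conditioning sequence (Mamba-style selection mechanism)."""
    def __init__(self, d_model, d_state=16, d_cond=None):
        super().__init__()
        d_cond = d_cond or d_model
        self.A_log = nn.Parameter(torch.zeros(d_model, d_state))  # A = -exp(A_log) < 0
        self.to_delta = nn.Linear(d_cond, d_model)
        self.to_B = nn.Linear(d_cond, d_state)
        self.to_C = nn.Linear(d_cond, d_state)

    def forward(self, x, cond):
        # x: (batch, length, d_model), cond: (batch, length, d_cond)
        b, L, d = x.shape
        A = -torch.exp(self.A_log)
        delta = F.softplus(self.to_delta(cond))                # positive step size
        B, C = self.to_B(cond), self.to_C(cond)                # (b, L, n)
        h, ys = x.new_zeros(b, d, A.shape[1]), []
        for t in range(L):                                     # sequential scan, O(L)
            dA = torch.exp(delta[:, t, :, None] * A)           # discretized A
            dB = delta[:, t, :, None] * B[:, t, None, :]       # discretized B
            h = dA * h + dB * x[:, t, :, None]
            ys.append((h * C[:, t, None, :]).sum(-1))
        return torch.stack(ys, dim=1)                          # (b, L, d)

class TrajMambaBlock(nn.Module):
    """Parallel GPS and road branches with gated fusion (sketch)."""
    def __init__(self, d_model, d_move=3):
        super().__init__()
        self.gps_in, self.road_in = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.gps_conv = nn.Conv1d(d_model, d_model, 3, padding=2, groups=d_model)
        self.road_conv = nn.Conv1d(d_model, d_model, 3, padding=2, groups=d_model)
        self.move_proj = nn.Linear(d_move, d_model)            # embeds (speed, accel, angle)
        self.gps_ssm, self.road_ssm = SelectiveSSM(d_model), SelectiveSSM(d_model)
        self.gate = nn.Linear(d_model, d_model)

    def _causal(self, conv, x):
        # Depthwise causal conv with SiLU; truncate the tail for causality.
        return F.silu(conv(x.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2))

    def forward(self, gps, road, move):
        g = self._causal(self.gps_conv, self.gps_in(gps))
        r = self._causal(self.road_conv, self.road_in(road))
        y_gps = self.gps_ssm(g, self.move_proj(move))          # selection from movement features
        y_road = self.road_ssm(r, y_gps)                       # road branch conditioned on GPS branch
        gate = torch.sigmoid(self.gate(r))                     # gating signal from road branch
        return torch.cat([y_gps * gate, y_road * gate], dim=-1)
```

A stack of such blocks, with per-block outputs concatenated and pooled over time (Section 4), would produce the final trajectory embedding.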
2. Travel Purpose-aware Pre-training
Recognizing that vehicle trajectories encode more than movement—they implicitly reflect driver intent and trip semantics—the encoder undertakes a dedicated pre-training phase:
- Road and POI Views: For each trajectory point, a "road view" is created by embedding the textual road segment description (name, type, etc.) using a frozen LLM, and aggregating with neighboring road segments, including source and destination. Similarly, a "POI view" is constructed for the closest point-of-interest, with its textual description and neighborhood context aggregated.
- Contrastive Alignment: The encoder output embedding $z$ is aligned with its road and POI views using an InfoNCE loss:

$$\mathcal{L}_{\text{pre}} = \mathcal{L}_{\text{road}} + \mathcal{L}_{\text{poi}}, \qquad \mathcal{L}_{\text{view}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\!\big(z_i \cdot v_i / \tau\big)}{\sum_{j=1}^{B} \exp\!\big(z_i \cdot v_j / \tau\big)}$$

Each individual loss maximizes the similarity (dot product) between a trajectory embedding $z_i$ and its own semantic view $v_i$ (road or POI) while minimizing similarity to the views of other samples in the batch; $\tau$ is a temperature parameter.
This pre-training stage imbues the embedding space with travel-intent information, which downstream applications then retain at no additional inference cost.
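A minimal PyTorch rendering of this alignment term follows. The $\ell_2$ normalization and the temperature value are standard contrastive-learning choices assumed here, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(traj_emb, view_emb, temperature=0.07):
    """InfoNCE between trajectory embeddings and one semantic view
    (road or POI), both of shape (batch, dim). Positives lie on the
    diagonal; all other in-batch pairs serve as negatives."""
    traj = F.normalize(traj_emb, dim=-1)          # assumed l2-normalization
    view = F.normalize(view_emb, dim=-1)
    logits = traj @ view.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(traj.size(0), device=traj.device)
    return F.cross_entropy(logits, targets)

# Total pre-training objective: align with both semantic views.
# loss = info_nce(z, road_view) + info_nce(z, poi_view)
```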
3. Knowledge Distillation and Trajectory Compression
To address the prevalence of redundant points in real-world vehicle trajectories (e.g., clusters of points when vehicles are stopped or moving smoothly), a two-stage compression strategy is employed:
- Rule-based Filtering: Obvious redundancies (such as points at zero speed or on the same road segment during steady speeds) are first eliminated by deterministic heuristics.
- Learnable Mask Generation: A stochastic gating function generates a soft mask for each point $i$ (see the code sketch at the end of this section):

$$m_i = \min\!\big(1,\ \max(0,\ \alpha_i + \epsilon_i)\big), \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

where $\alpha_i$ is a learnable score and $\epsilon_i$ is training-time noise. To encourage sparsity and temporal smoothness, $\alpha_i$ is computed via:

$$\alpha_i = w^{\top}\, \mathrm{SSM}_{\text{light}}(x)_i$$

with $\mathrm{SSM}_{\text{light}}$ a lightweight SSM and $w$ a learnable parameter vector.
- Compression via Distillation: A teacher Traj-Mamba encoder (pre-trained with the travel purpose-aware loss) is fixed. A student encoder, given the masked (compressed) trajectory, is trained with a composite loss:

$$\mathcal{L}_{\text{KD}} = \mathcal{L}_{\text{MEC}} + \beta\, \mathcal{L}_{\text{sparse}}$$

Here, $\mathcal{L}_{\text{MEC}}$ is a maximum entropy coding loss that preserves information, with an explicit formula involving a Taylor expansion of the trace of inner products between student and teacher embeddings, and $\mathcal{L}_{\text{sparse}}$ (involving the erf function) promotes sparsity and controls the compression rate.
This approach allows the student to inherit knowledge from the teacher while using only a compressed subset of trajectory points, yielding efficient yet informative embeddings.
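The sketch below renders the stochastic gate and its erf-based sparsity term in PyTorch. Assumptions: a GRU stands in for the paper's lightweight SSM, `sigma` is a hypothetical noise scale, and the clamped-Gaussian gate is one standard stochastic-gate construction consistent with the erf expression mentioned above.

```python
import math
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Stochastic gate over trajectory points. A clean score alpha_i comes
    from a lightweight sequence model; Gaussian noise is added during
    training; the expected fraction of open gates (a Gaussian CDF, i.e.
    an erf expression) acts as the sparsity penalty."""
    def __init__(self, d_model, sigma=0.5):
        super().__init__()
        self.seq = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for the light SSM
        self.w = nn.Linear(d_model, 1)                         # the parameter vector w
        self.sigma = sigma                                     # assumed noise scale

    def forward(self, x):
        # x: (batch, length, d_model) per-point features
        score = self.w(self.seq(x)[0]).squeeze(-1)             # alpha_i, (batch, length)
        noise = self.sigma * torch.randn_like(score) if self.training else 0.0
        mask = (score + noise).clamp(0.0, 1.0)                 # soft mask in [0, 1]
        # P(alpha_i + eps > 0) under Gaussian noise -> expected compression rate.
        sparsity = 0.5 * (1.0 + torch.erf(score / (self.sigma * math.sqrt(2.0))))
        return mask, sparsity.mean()

# Composite distillation objective (MEC term not shown):
# mask, sparse = gen(point_feats)
# loss = mec_loss(student(masked_traj), teacher(full_traj)) + beta * sparse
```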
4. Mathematical Underpinnings
The foundational mathematical structure is the selective SSM parameterization at the block level:
- SSM evolution in each branch:

$$h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t$$

where $x_t$ is the input at time $t$, and $\bar{A}_t = \exp(\Delta_t A)$, $\bar{B}_t \approx \Delta_t B_t$, $C_t$, and the step size $\Delta_t$ are input-conditioned parameter matrices obtained through the selection mechanism.
- The mask generator's stochastic gating and sparsity control, as well as the contrastive InfoNCE loss aligning trajectory and semantic views, are formally specified by the loss terms given in Sections 2 and 3.
- Embeddings are pooled across time after final block fusion to produce $z$, a fixed-size, semantically rich vector regardless of the original trajectory length (a pooling sketch follows this list).
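As a small illustration of this pooling step, the sketch below concatenates per-block fused outputs along the feature axis and mean-pools over time; mean pooling is an assumed choice, since the exact pooling operator is not spelled out here.

```python
import torch

def pool_embedding(block_outputs):
    """Pool a list of per-block fused outputs, each (batch, length, d),
    into a fixed-size trajectory embedding z: concatenate blocks on the
    feature axis, then mean-pool over time (assumed pooling choice)."""
    stacked = torch.cat(block_outputs, dim=-1)   # (batch, length, n_blocks * d)
    return stacked.mean(dim=1)                   # (batch, n_blocks * d), length-invariant

# z = pool_embedding([blk1_out, blk2_out])  # same size for any trajectory length
```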
5. Empirical Results and Efficiency
Extensive experiments (using real-world Chengdu and Xi'an datasets) demonstrate:
- Superiority over baselines in destination prediction, arrival time estimation, and similar trajectory search.
- Lower mean absolute error (MAE) and root mean squared error (RMSE) for GPS and road segment prediction.
- Higher accuracy at top-1 and top-5 ranks in trajectory retrieval tasks.
- Improved arrival time predictions measured by metrics such as MAPE.
- Efficiency: Traj-Mamba remains lightweight in parameter count and training time owing to its linear SSM foundation and trajectory compression, significantly outperforming Transformer-based models in training speed and inference cost.
6. Relationship to State-Space and Masked Autoencoder Models
The use of SSMs in the Traj-Mamba Encoder is rooted in their ability to capture both local and long-range dependencies with linear time complexity, providing a contrast to quadratic-scaling Transformer-based architectures (Huang et al., 13 Mar 2025). The masked autoencoding framework (Chen et al., 2023) contributed critical methodological advances—such as specialized trajectory masking strategies for robust social and temporal information recovery, and continual pre-training to avoid catastrophic forgetting—that directly influenced the design of selective and robust representation learning in Traj-Mamba. Contrastive pre-training for travel purpose alignment, and compression via mask generators and knowledge distillation, all collectively advance the state of the art in trajectory learning.
7. Applications and Broader Implications
The Traj-Mamba Encoder is applicable in a diverse set of real-world and research scenarios:
- Intelligent transportation systems: Real-time vehicle trajectory prediction, anomaly detection, and driver behavior analytics.
- Urban computing: Improved city-scale arrival time estimation, demand analysis, and traffic planning.
- Location-based services: Route recommendation, ride-hailing, fleet routing, and travel purpose inference.
- Robotics and imitation learning: Owing to the SSM-based foundation, related work shows direct adoption for real-time motion generation (Tsuji, 4 Sep 2024) and policy learning from demonstrations.
The paradigm of combining efficient SSM-based sequence modeling, semantic pre-training, and learnable trajectory compression sets a template for future encoders requiring both high accuracy and scalable deployment in dynamic, resource-constrained environments.
In summary, the Traj-Mamba Encoder—through selective state-space modeling, joint GPS/road representation, travel purpose-aware pre-training, and efficient knowledge distillation—provides a principled, empirically validated solution for extracting semantic-rich, compressed trajectory embeddings suitable for a wide variety of downstream spatio-temporal analysis, forecasting, and planning tasks in transportation and beyond (Liu et al., 20 Oct 2025, Lin et al., 9 Aug 2024, Huang et al., 13 Mar 2025, Chen et al., 2023, Tsuji, 4 Sep 2024).