Masked Trajectory Model (MTM) Overview

Updated 1 July 2026

Masked Trajectory Model (MTM) is a self-supervised approach that reconstructs missing trajectory segments using diverse masking strategies.
It leverages bidirectional transformers, masked autoencoders, and diffusion networks to enable prediction, infilling, and multi-modal representation learning across varied domains.
Flexible masking patterns such as random, block, and agent-based masking empower MTMs to handle tasks like next-step prediction, imputation, and multi-agent interactions effectively.

A Masked Trajectory Model (MTM) is a self-supervised or generative model that reconstructs missing segments of a trajectory sequence (states, actions, sensor data, or semantic observations) from the remaining observed portions. MTMs generalize the masked language modeling paradigm to spatial, spatiotemporal, and multimodal trajectory domains, enabling flexible prediction, infilling, simulation, and representation learning. Modern MTM architectures use bidirectional transformers, masked autoencoders, or diffusion networks, and support multiple tasks through adjustable masking schemes. MTMs have demonstrated strong empirical performance in domains including mobility analytics, decision-making, multi-agent sports, navigation, and scientific trajectory analysis (Garg et al., 28 Sep 2025, Long et al., 23 Jan 2025, Wu et al., 2023).

1. Core Modeling Principles and Mathematical Foundations

An MTM operates by mapping an input trajectory, represented as a token or feature sequence $x = (x_1,\ldots,x_L)$, to an output $f_\theta(x)$ that fills in masked (unobserved) positions $\mathcal{M} \subset \{1,\dots,L\}$. The objective is to model the conditional distribution $P_\theta(x_\mathcal{M} \mid x_{\setminus \mathcal{M}})$, enabling the model to reconstruct or generate the missing elements from partial context.

Typical objective functions for masked modeling include:

Cross-entropy for discrete token infilling:

$\mathcal{L}_{\text{MTM}} = -\frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}} \log P_\theta(x_i\mid x_{\setminus\mathcal{M}})$

as in trajectory token models (Najjar, 2023).
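As a minimal sketch of this masked cross-entropy, assume the model has already produced one probability vector per position. The names `masked_cross_entropy`, `probs`, `targets`, and `masked` are illustrative and not taken from any cited implementation.

```python
import math

def masked_cross_entropy(probs, targets, masked):
    """Average negative log-likelihood over the masked positions only.

    probs   -- per-position probability vectors (hypothetical model output)
    targets -- ground-truth token ids, one per position
    masked  -- set of masked position indices (the set M in the formula)
    """
    if not masked:
        raise ValueError("the mask set must be non-empty")
    total = 0.0
    for i in masked:
        total -= math.log(probs[i][targets[i]])
    return total / len(masked)

# Toy example: 3 positions, a 2-token vocabulary, positions 0 and 2 masked.
probs = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]
targets = [0, 0, 1]
loss = masked_cross_entropy(probs, targets, {0, 2})
```

Position 1 is observed, so its prediction does not contribute to the loss even though the model outputs a distribution there.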

Mean squared error (MSE) for continuous-valued elements:

$\mathcal{L}_{\text{mask}} = \sum_{i\in\mathcal{M}} \|x_i - \hat{x}_i\|^2$

as in both mobility dynamics (Wu et al., 2023) and scientific 3D trajectory applications (Young et al., 4 Feb 2025).
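A small sketch of the reconstruction loss as written above, assuming each trajectory element is a short list of floats. The function name and the toy data are hypothetical.

```python
def masked_squared_error(x, x_hat, masked):
    """Sum of squared L2 errors over masked positions, as in the formula above."""
    total = 0.0
    for i in masked:
        total += sum((a - b) ** 2 for a, b in zip(x[i], x_hat[i]))
    return total

# Three 2-D points; positions 1 and 2 are masked, position 0 is observed.
x = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
x_hat = [[9.0, 9.0], [1.0, 2.0], [2.5, 2.0]]
masked = {1, 2}

loss = masked_squared_error(x, x_hat, masked)   # sum over masked positions
mean_loss = loss / len(masked)                  # per-position mean
```

The formula sums over $\mathcal{M}$; dividing by $|\mathcal{M}|$, as in `mean_loss`, gives a mean that is comparable across different mask sizes. The large error at the observed position 0 is ignored in both cases.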

Denoising score matching for masked diffusion models:

$\mathcal{L}(\theta) = \mathbb{E}_{x_0, \epsilon, t, m} \| \epsilon - \epsilon_\theta(\sqrt{\alpha_t}x_0 + \sqrt{1-\alpha_t}\epsilon, t, c_{\text{mask}}) \|^2$

where $c_{\text{mask}}$ encodes observed/masked pattern and context, as in masked conditional diffusion frameworks (Long et al., 23 Jan 2025).
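The loss above can be estimated one sample at a time. The sketch below assumes a scalar-valued trajectory, a fixed noise level `alpha_t`, and a placeholder `zero_model` standing in for the network $\epsilon_\theta$; the encoding chosen for `c_mask` (mask flag plus observed value) is also an assumption, not the scheme of the cited frameworks.

```python
import math
import random

def noisy_sample(x0, alpha_t, eps):
    """Forward noising: sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps."""
    a, b = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def denoising_loss(eps_model, x0, alpha_t, t, mask, rng):
    """Single-sample estimate of the squared noise-prediction error."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = noisy_sample(x0, alpha_t, eps)
    # Assumed conditioning: (is_masked, observed value or None) per position.
    c_mask = [(m, None if m else x) for m, x in zip(mask, x0)]
    eps_hat = eps_model(x_t, t, c_mask)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))

def zero_model(x_t, t, c_mask):
    """Placeholder for the denoising network: always predicts zero noise."""
    return [0.0 for _ in x_t]

rng = random.Random(0)
x0 = [0.1, 0.4, 0.9, 1.3]
mask = [False, True, True, False]
loss = denoising_loss(zero_model, x0, alpha_t=0.5, t=10, mask=mask, rng=rng)
```

In training, the expectation in the formula would be approximated by averaging this quantity over sampled trajectories, noise levels, and mask patterns.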

Key to MTM is the stochastic application of diverse masking patterns (random, block, terminal, circadian, agentwise), which supports a variety of downstream inference queries—such as next-step prediction, history infilling, and imputation under arbitrary missingness (Long et al., 23 Jan 2025, Xu et al., 2024).

2. Architectural Mechanisms and Trajectory Tokenization

MTMs typically comprise the following architectural elements:

Input tokenization: Depending on the domain and task, raw trajectories (vectors in $\mathbb{R}^n$, categorical states, semantic views, or spatial points) are discretized or encoded via:
- State-action decompositions for mobility/intelligence (Garg et al., 28 Sep 2025, Wu et al., 2023)
- Spatial tokenizers: location to hexagons, sub-hash or vocabulary reduction (Najjar, 2023)
- Patch-wise grouping for 3D point clouds (Young et al., 4 Feb 2025)
- Per-agent and per-timestep embeddings in multi-agent or sports models (Xu et al., 2024)
Embedding modules: Learnable or MLP-based layers lift raw tokens or features into a shared latent space, augmented with:
- Modality embeddings
- Position or time-step encodings (sinusoidal, learned, or both)
- Agent- or class-specific tokens (Park et al., 2024)
Mask token injection: Masked positions are replaced by a learned [MASK] vector or special token.
Sequence processing backbone: Most commonly a stack of bidirectional transformer layers (self-attention, feed-forward), or in some cases masked autoencoders (MAE), state-space models (e.g., bidirectional temporal Mamba), or conditional diffusion networks (Garg et al., 28 Sep 2025, Long et al., 23 Jan 2025, Chen et al., 2023).
Prediction/filling heads: Each output position routes to a per-modality head (e.g., softmax for categorical, regression for continuous, or decoder for point coordinates), as required by the data modality.
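The embedding and mask-injection steps listed above can be sketched as follows. This is an illustration under assumptions: the [MASK] vector is a fixed list here (in practice it would be a learned parameter), the position encoding is the standard sinusoidal one, and the function names are hypothetical.

```python
import math

def sinusoidal_encoding(position, dim):
    """Standard sinusoidal position encoding for one position (dim must be even)."""
    enc = []
    for k in range(0, dim, 2):
        freq = 1.0 / (10000 ** (k / dim))
        enc.append(math.sin(position * freq))
        enc.append(math.cos(position * freq))
    return enc

def prepare_inputs(embeddings, masked, mask_vector):
    """Swap masked positions for the [MASK] vector, then add position encodings."""
    dim = len(mask_vector)
    out = []
    for pos, emb in enumerate(embeddings):
        base = mask_vector if pos in masked else emb
        pe = sinusoidal_encoding(pos, dim)
        out.append([b + p for b, p in zip(base, pe)])
    return out

embeddings = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
mask_vector = [0.5, 0.5, 0.5, 0.5]   # stands in for a learned [MASK] embedding
inputs = prepare_inputs(embeddings, masked={1}, mask_vector=mask_vector)
```

Because the position encoding is added after the swap, a masked token still tells the backbone where in the sequence the missing element sits.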

MTMs commonly support cross-modal or cross-agent attention, which lets them model regularities such as “patterns of normalcy” in mobility, agent–agent interactions in driving or sports, or semantic progression in navigation trajectories (Garg et al., 28 Sep 2025, Li et al., 2023).

3. Masking Strategies and Task Versatility

A hallmark of the MTM paradigm is its reliance on flexible masking strategies, which unify multiple prediction, imputation, and recovery tasks:

Random masking: Each element is masked independently with a fixed probability, which discourages trivially copying neighboring values and encourages global reasoning (Garg et al., 28 Sep 2025, Chen et al., 2023).
Block/contiguous masking: Contiguous spans or sub-trajectories masked, used for history or future infilling (Wu et al., 2023).
Role-based masking: Terminal (future), initial (history), or random-holes for imputation (Xu et al., 2024).
Agent-wise masking: Masking per-agent in team settings or multi-agent systems (Park et al., 2024, Xu et al., 2024).
Circadian/time-based masking: Domain-informed masking on specific temporal segments (e.g., nighttime) (Long et al., 23 Jan 2025).
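The masking strategies listed above can each be expressed as a boolean pattern (True means masked). The generators below are a minimal sketch with hypothetical names and signatures; the cited models use their own ratios, span lengths, and sampling schemes.

```python
import random

def random_mask(length, ratio, rng):
    """Mask each step independently with probability `ratio`."""
    return [rng.random() < ratio for _ in range(length)]

def block_mask(length, start, span):
    """Mask the contiguous steps start, ..., start + span - 1."""
    return [start <= i < start + span for i in range(length)]

def terminal_mask(length, horizon):
    """Mask the last `horizon` steps (future prediction)."""
    return block_mask(length, length - horizon, horizon)

def agent_mask(num_agents, length, hidden_agents):
    """Mask every step of the selected agents; one row per agent."""
    return [[a in hidden_agents] * length for a in range(num_agents)]

def time_window_mask(hours, start_hour, end_hour):
    """Mask steps whose hour of day lies in [start_hour, end_hour), wrapping at midnight."""
    def inside(h):
        if start_hour <= end_hour:
            return start_hour <= h < end_hour
        return h >= start_hour or h < end_hour
    return [inside(h) for h in hours]

rng = random.Random(0)
sampled = random_mask(10, 0.3, rng)
night = time_window_mask([21, 23, 1, 7], start_hour=22, end_hour=6)
```

During pretraining, one of these generators would typically be drawn at random per sample so that the model sees several missingness patterns.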

A single pretrained MTM can serve different roles at inference simply by changing the mask pattern:

Task/Role	Mask pattern	Functionality
Forward dynamics	mask next states, observe history	State transition prediction
Inverse dynamics	mask actions, observe consecutive states	Action inference
Behavior cloning	mask only actions given prior states/actions	Policy emulation
Goal/reward prediction	mask rewards, returns	Value estimation
Imputation/in-filling	mask arbitrary subset	Arbitrary missing data recovery

This masking flexibility underpins task conversion and multitask learning without retraining (Wu et al., 2023, Wen et al., 2024).
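To make the table concrete, the sketch below builds per-modality masks for a trajectory of (state, action, return) tokens. The modality names, the task labels, and the decision to hide returns for every task except return prediction are assumptions made for illustration, not the exact scheme of any cited paper.

```python
def task_masks(task, length, t):
    """Boolean masks per modality; True means hidden from the model.

    `t` is the index of the current timestep in a trajectory of `length` steps.
    """
    observed_all = [False] * length
    hidden_all = [True] * length
    after_t = [i > t for i in range(length)]
    from_t = [i >= t for i in range(length)]

    if task == "forward_dynamics":
        # See states and actions up to t; read out the state at t + 1.
        return {"state": after_t, "action": after_t, "return": hidden_all}
    if task == "inverse_dynamics":
        # See consecutive states; infer the actions between them.
        return {"state": observed_all, "action": hidden_all, "return": hidden_all}
    if task == "behavior_cloning":
        # See states up to t and actions before t; read out the action at t.
        return {"state": after_t, "action": from_t, "return": hidden_all}
    if task == "return_prediction":
        # See states and actions; predict the returns.
        return {"state": observed_all, "action": observed_all, "return": hidden_all}
    raise ValueError(f"unknown task: {task}")

bc = task_masks("behavior_cloning", length=4, t=1)
fd = task_masks("forward_dynamics", length=4, t=1)
```

The model weights stay fixed across these calls; only the mask, and which masked output is read, changes with the task.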

4. Implementation and Optimization Details

Training details, hyperparameter selection, and model capacity directly affect MTM performance:

Network depth and width: Empirical findings indicate diminishing returns past moderate-depth transformers (e.g., 4–12 layers, 128–768 dims), with 4–6 heads per layer typical (Garg et al., 28 Sep 2025, Najjar, 2023).
Batching and masking ratio: Batch sizes of 16–64, with 15–60% masking per batch, are common; extreme mask ratios (>60–80%) degrade performance, while moderate ratios (~15–50%) improve generalization (Chen et al., 2023).
Optimization: Adam(W) optimizer with learning rates in [1e-4, 1e-3] and moderate weight decay; training for tens of thousands up to hundreds of thousands of steps depending on model and data scale (Garg et al., 28 Sep 2025, Najjar, 2023, Young et al., 4 Feb 2025).
Augmentation and regularization: Dropout and stochastic depth in transformers, auxiliary reconstruction losses, and, in some designs, continual pre-training protocols to address catastrophic forgetting under multi-strategy masking (Chen et al., 2023).
Tokenization scalability: Techniques like sub-hash vocabularies, patch/voxel grouping, or agent-wise aggregation are necessary for large/varying vocabularies or extreme data sparsity (Najjar, 2023, Young et al., 4 Feb 2025).

Dataset-dependent choices—e.g., group radius for particle events, clustering for spatial hexagons, or agent-specificity in multi-agent settings—are made to ensure efficient and robust sequence processing.

5. Downstream Tasks, Evaluation, and Empirical Performance

MTMs, pretrained on large-scale or diverse trajectories, are evaluated on a variety of tasks, either directly by changing the mask pattern or after light fine-tuning, without changes to the architecture:

Mobility and human movement: POI infilling, next/goal/terminal prediction, biased and unbiased class discovery (Garg et al., 28 Sep 2025). For example, results on the Geolife dataset show modest recall bias in a 198-class setting.
Multi-agent and sports scenarios: Prediction, imputation, spatial-temporal recovery unified as masked completion; evaluation via minADE/minFDE/path statistics, and out-of-bounds behaviors (Xu et al., 2024).
Offline reinforcement learning: Acting as a forward/inverse dynamics model, reward predictor, or RCBC policy for D4RL and RoboMimic domains, matching or exceeding state-of-the-art returns (Wu et al., 2023, Wen et al., 2024).
Scientific spatiotemporal data: Chamfer loss-based coordinate infilling, semantic track/shower discrimination from self-supervised representations (Young et al., 4 Feb 2025).
Autonomous driving prediction: Multi-modal, multi-agent trajectory forecasting across environments, explicit collision rate reduction, and improved generalization to new domains (Chen et al., 2023, Park et al., 2024).

Typical evaluation metrics include accuracy, F1, displacement errors (average displacement error, ADE, and final displacement error, FDE), recall, class-consistency bias, and synthetic statistics matching (Jensen–Shannon divergence, JSD, and path/step statistics). MTMs are reported to outperform specialized networks or prior self-supervised models, especially in scarce-data or transfer scenarios (Long et al., 23 Jan 2025, Najjar, 2023, Xu et al., 2024).
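The displacement metrics can be computed as in the sketch below, which assumes trajectories are lists of 2-D points of equal length. The function names are illustrative.

```python
import math

def ade(pred, truth):
    """Average displacement error: mean Euclidean distance over all steps."""
    return sum(math.dist(p, q) for p, q in zip(pred, truth)) / len(truth)

def fde(pred, truth):
    """Final displacement error: Euclidean distance at the last step."""
    return math.dist(pred[-1], truth[-1])

def min_ade_fde(candidates, truth):
    """Best ADE and best FDE over several candidate predictions."""
    return (min(ade(c, truth) for c in candidates),
            min(fde(c, truth) for c in candidates))

truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]   # drifts away over time
pred_b = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)]   # constant offset of 0.5

best_ade, best_fde = min_ade_fde([pred_a, pred_b], truth)
```

Note that in this sketch the two minima are taken independently, so minADE and minFDE may come from different candidates; some benchmarks instead report both for the same selected candidate.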

6. Extensions, Limitations, and Open Issues

Several research threads extend or interrogate MTM methodology:

Temporal modeling: The lack of explicit time embeddings in some MTMs limits representation of interval-scale dynamics; ongoing work targets continuous/learned time augmentations (Najjar, 2023).
Ethics and privacy: Large-scale trajectory modeling risks re-identification, motivating aggregation, anonymization, and ethical scrutiny (Najjar, 2023).
Adaptive and learned masking: While hand-crafted masking strategies often suffice in practice, automated or learnable masking policies may further improve robustness (Chen et al., 2023, Park et al., 2024).
Contextual and multi-modal conditioning: Classifier-free guidance, user-history embedding, and per-actor memory enable contextual generation and adaptation, including personalized or controllable rollouts (Long et al., 23 Jan 2025, Park et al., 2024).
Transfer and continual adaptation: MTMs with test-time adaptation or token memory achieve rapid online adjustment to domain shift or seen/unseen agents (Park et al., 2024).
Scaling and resource efficiency: Tokenization bottlenecks and model size limit application to global-scale or ultra-high-resolution domains; further advances in patching, grouping, or compression are anticipated (Young et al., 4 Feb 2025).
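
For the classifier-free guidance mentioned under contextual conditioning, one common parameterisation combines a conditional and an unconditional noise prediction with a guidance weight $w$. The symbols $c$ (conditioning context), $\varnothing$ (dropped context), and $w$ follow general usage and are assumptions here, not notation taken from the cited papers:

$\hat{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + w\,\big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big)$

With $w = 0$ this reduces to the unconditional prediction and with $w = 1$ to the conditional one; $w > 1$ pushes samples further toward the context.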

Empirical limitations include failure to distinguish fine-grained or sub-token patterns (e.g., overlapping trajectories), sensitivity to scene or agent duration, and overreliance on fixed masking schedules in some instances (Young et al., 4 Feb 2025, Park et al., 2024).

7. Representative MTM Instances Across Modalities

The following table summarizes representative MTM architectures and their operational domains, as defined in the cited literature:

Model / Paper	Domain	Sequence Type	Core Masking
GPS-MTM (Garg et al., 28 Sep 2025)	Mobility analytics	POI+action sequence	Random per-modality, 15%
GenMove (MTM) (Long et al., 23 Jan 2025)	Mobility, multi-task	Location/graph embeddings	5 mask types, classifier-free
Masked Trajectory Models (Wu et al., 2023)	RL, control	State–action trajectories	Random autoregressive
Traj-MAE (Chen et al., 2023)	Autonomous driving	Multi-agent trajectories	Social, temporal, hybrid
T4P (Park et al., 2024)	Motion forecasting	Agent lanes+scene polylines	40%–30% multi-radius, actor mem.
UniTraj (Xu et al., 2024)	Sports/Multi-agent	$N\times T\times D$ tensors	5 spatiotemporal mask types
PoLAr-MAE (Young et al., 4 Feb 2025)	HEP, 3D tracks	Point clouds, volumetric	60% patch-level mask