
Masked Trajectory Model (MTM) Overview

Updated 1 July 2026
  • Masked Trajectory Model (MTM) is a self-supervised approach that reconstructs missing trajectory segments using diverse masking strategies.
  • It leverages bidirectional transformers, masked autoencoders, and diffusion networks to enable prediction, infilling, and multi-modal representation learning across varied domains.
  • Flexible masking patterns such as random, block, and agent-based masking empower MTMs to handle tasks like next-step prediction, imputation, and multi-agent interactions effectively.

A Masked Trajectory Model (MTM) is a self-supervised or generative model that reconstructs missing segments of a trajectory sequence (states, actions, sensor data, or semantic observations) from the remaining observed portions. MTMs generalize the masked language modeling paradigm to spatial, spatiotemporal, and multimodal trajectory domains, enabling flexible prediction, infilling, simulation, and representation learning. Modern MTM architectures use bidirectional transformers, masked autoencoders, or diffusion networks, and cover multiple tasks through adjustable masking schemes. MTMs have demonstrated strong empirical performance in domains including mobility analytics, decision-making, multi-agent sports, navigation, and scientific trajectory analysis (Garg et al., 28 Sep 2025, Long et al., 23 Jan 2025, Wu et al., 2023).

1. Core Modeling Principles and Mathematical Foundations

An MTM maps an input trajectory, represented as a token or feature sequence $x = (x_1,\ldots,x_L)$, to an output $f_\theta(x)$ that fills in the masked (unobserved) positions $\mathcal{M} \subset \{1,\dots,L\}$. The objective is to model the conditional distribution $P_\theta(x_\mathcal{M} \mid x_{\setminus \mathcal{M}})$, enabling the model to reconstruct or generate the missing elements from partial context.

Typical objective functions for masked modeling include:

  • Cross-entropy for discrete token infilling:

$$\mathcal{L}_{\text{MTM}} = -\frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}} \log P_\theta(x_i\mid x_{\setminus\mathcal{M}})$$

as in trajectory token models (Najjar, 2023).

  • Mean squared error (MSE) for continuous-valued elements:

$$\mathcal{L}_{\text{mask}} = \sum_{i\in\mathcal{M}} \|x_i - \hat{x}_i\|^2$$

as in both mobility dynamics (Wu et al., 2023) and scientific 3D trajectory applications (Young et al., 4 Feb 2025).

  • Noise-prediction loss for diffusion-based models:

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0, \epsilon, t, m} \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon,\; t,\; c_{\text{mask}}\right) \right\|^2$$

where $c_{\text{mask}}$ encodes the observed/masked pattern and context, as in masked conditional diffusion frameworks (Long et al., 23 Jan 2025).
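To make the objectives above concrete, the following is a minimal pure-Python sketch, not any cited paper's implementation. The function names (`masked_cross_entropy`, `masked_squared_error`, `noised_sample`, `denoising_loss`) and the list-based data layout are assumptions chosen for illustration. The noise-prediction network is left out entirely: `denoising_loss` simply compares a given prediction against the sampled noise.

```python
import math
import random

def masked_cross_entropy(probs, targets, mask):
    """Mean negative log-likelihood over masked positions only.

    probs[i] maps each token to its predicted probability at position i,
    targets[i] is the true token, and mask is the set of masked indices.
    """
    if not mask:
        raise ValueError("mask must contain at least one position")
    return -sum(math.log(probs[i][targets[i]]) for i in mask) / len(mask)

def masked_squared_error(x, x_hat, mask):
    """Sum of squared L2 errors over masked positions (vectors as lists)."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(x[i], x_hat[i]))
        for i in mask
    )

def noised_sample(x0, alpha_t, rng):
    """Build sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps with Gaussian eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [
        math.sqrt(alpha_t) * v + math.sqrt(1.0 - alpha_t) * e
        for v, e in zip(x0, eps)
    ]
    return x_t, eps

def denoising_loss(eps, eps_pred):
    """Squared error between the sampled noise and a predicted noise vector."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred))
```

Note that the cross-entropy is averaged over the masked set while the squared error is summed, matching the two formulas as written.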

Key to MTM is the stochastic application of diverse masking patterns (random, block, terminal, circadian, agentwise), which supports a variety of downstream inference queries—such as next-step prediction, history infilling, and imputation under arbitrary missingness (Long et al., 23 Jan 2025, Xu et al., 2024).
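As an illustration of how such mask patterns might be generated, here is a small sketch using only the Python standard library. The helper names and parameters (`ratio`, `block_len`, `horizon`) are hypothetical, and the circadian pattern is omitted because its definition is not given here.

```python
import random

def random_mask(length, ratio, rng):
    """Mask a uniformly random subset covering roughly `ratio` of positions."""
    k = max(1, round(ratio * length))
    return set(rng.sample(range(length), k))

def block_mask(length, block_len, rng):
    """Mask one contiguous block of `block_len` positions."""
    start = rng.randrange(0, length - block_len + 1)
    return set(range(start, start + block_len))

def terminal_mask(length, horizon):
    """Mask the last `horizon` positions (next-step or future prediction)."""
    return set(range(length - horizon, length))

def agent_mask(length, agent):
    """Mask every time step of one agent in a multi-agent trajectory.

    Positions are (agent_index, time_step) pairs.
    """
    return {(agent, t) for t in range(length)}
```

During training, one of these generators could be drawn at random per sample; at inference, choosing `terminal_mask` corresponds to next-step prediction and `random_mask` or `block_mask` to imputation.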

2. Architectural Mechanisms and Trajectory Tokenization

MTMs typically comprise the following architectural elements:

  • Input tokenization: Depending on the domain and task, raw trajectories ($\mathbb{R}^n$ vectors, categorical states, semantic views, or spatial points) are discretized or encoded into tokens.
  • Embedding modules: Learnable or MLP-based layers lift raw tokens or features into a shared latent space, augmented with:
    • Modality embeddings
    • Position or time-step encodings (sinusoidal, learned, or both)
    • Agent- or class-specific tokens (Park et al., 2024)
  • Mask token injection: Masked positions are replaced by a learned [MASK] vector or special token.
  • Sequence processing backbone: Most commonly a stack of bidirectional transformer layers (self-attention, feed-forward), or in some cases masked autoencoders (MAE), state-space models (e.g., bidirectional temporal Mamba), or conditional diffusion networks (Garg et al., 28 Sep 2025, Long et al., 23 Jan 2025, Chen et al., 2023).
  • Prediction/filling heads: Each output position routes to a per-modality head (e.g., softmax for categorical, regression for continuous, or decoder for point coordinates), as required by the data modality.

MTMs typically support cross-modal or cross-agent attention in order to model regularities such as “patterns of normalcy” in mobility, agent–agent interactions in driving or sports, or semantic progression in navigation trajectories (Garg et al., 28 Sep 2025, Li et al., 2023).
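The embedding and mask-token steps listed above can be sketched as follows. This is a simplified illustration under stated assumptions: the embedding table and the [MASK] vector are fixed lists here, whereas in a real model they would be learned parameters, and the function names are hypothetical. The position encoding follows the standard sinusoidal formulation.

```python
import math

def sinusoidal_encoding(position, dim):
    """Standard sinusoidal position encoding for one position."""
    enc = []
    for j in range(dim):
        angle = position / (10000 ** (2 * (j // 2) / dim))
        enc.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return enc

def embed_with_mask(tokens, mask, table, mask_vector):
    """Look up token embeddings, inject the [MASK] vector, add positions.

    tokens: list of token ids; mask: set of masked indices;
    table: dict from token id to embedding (stand-in for a learned table);
    mask_vector: stand-in for the learned [MASK] embedding.
    """
    dim = len(mask_vector)
    out = []
    for i, tok in enumerate(tokens):
        base = mask_vector if i in mask else table[tok]
        pos = sinusoidal_encoding(i, dim)
        out.append([b + p for b, p in zip(base, pos)])
    return out
```

The resulting sequence is what a bidirectional backbone would consume. Because masked positions still carry their position encoding, the model knows where in the trajectory it is being asked to fill in.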

3. Masking Strategies and Task Versatility

A hallmark of the MTM paradigm is its reliance on flexible masking strategies, which unify multiple prediction, imputation, and recovery tasks.

By changing only the mask pattern at inference, a single pretrained MTM can serve in the following roles:

| Task/Role | Mask pattern | Functionality |
|---|---|---|
| Forward dynamics | Mask next states, observe history | State transition prediction |
| Inverse dynamics | Mask actions, observe consecutive states | Action inference |
| Behavior cloning | Mask only actions, given prior states/actions | Policy emulation |
| Goal/reward prediction | Mask rewards, returns | Value estimation |
| Imputation/in-filling | Mask arbitrary subset | Arbitrary missing data recovery |
This masking flexibility underpins task conversion and multitask learning without retraining (Wu et al., 2023, Wen et al., 2024).
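The sketch below shows one plausible reading of the first three task roles as masks over a state–action trajectory. The function names, the per-modality dictionary layout, and the exact choice of which future tokens are hidden are assumptions made for illustration; the cited works may define these patterns differently.

```python
def forward_dynamics_mask(horizon, t):
    """Predict s_{t+1}: observe states and actions up to t, hide the rest."""
    return {
        "state": set(range(t + 1, horizon)),
        "action": set(range(t + 1, horizon)),
    }

def inverse_dynamics_mask(horizon, t):
    """Infer a_t: observe states up to t+1, hide actions from t onward."""
    return {
        "state": set(range(t + 2, horizon)),
        "action": set(range(t, horizon)),
    }

def behavior_cloning_mask(horizon, t):
    """Infer a_t: observe states up to t and actions before t."""
    return {
        "state": set(range(t + 1, horizon)),
        "action": set(range(t, horizon)),
    }

def apply_mask(states, actions, mask, placeholder=None):
    """Replace masked entries with a placeholder, leaving the rest intact."""
    masked_states = [
        placeholder if t in mask["state"] else s for t, s in enumerate(states)
    ]
    masked_actions = [
        placeholder if t in mask["action"] else a for t, a in enumerate(actions)
    ]
    return masked_states, masked_actions
```

For example, with four steps and `t = 1`, the inverse-dynamics mask keeps states 0 through 2 visible while hiding actions 1 through 3, so the model must infer the action at step 1 from the consecutive states around it.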

4. Implementation and Optimization Details

Training details, hyperparameter selection, and model capacity directly affect MTM performance:

  • Network depth and width: Empirical findings indicate diminishing returns past moderate-depth transformers (e.g., 4–12 layers, 128–768 dims), with 4–6 heads per layer typical (Garg et al., 28 Sep 2025, Najjar, 2023).
  • Batching and masking ratio: Batch sizes of 16–64, with 15–60% masking per batch, are common; extreme mask ratios (>60–80%) degrade performance, while moderate ratios (~15–50%) improve generalization (Chen et al., 2023).
  • Optimization: Adam(W) optimizer with learning rates in [1e-4, 1e-3] and moderate weight decay; training for tens of thousands up to hundreds of thousands of steps depending on model and data scale (Garg et al., 28 Sep 2025, Najjar, 2023, Young et al., 4 Feb 2025).
  • Augmentation and regularization: Dropout and stochastic depth in transformers, auxiliary reconstruction losses, and, in some designs, continual pre-training protocols to address catastrophic forgetting under multi-strategy masking (Chen et al., 2023).
  • Tokenization scalability: Techniques like sub-hash vocabularies, patch/voxel grouping, or agent-wise aggregation are necessary for large/varying vocabularies or extreme data sparsity (Najjar, 2023, Young et al., 4 Feb 2025).

Dataset-dependent choices—e.g., group radius for particle events, clustering for spatial hexagons, or agent-specificity in multi-agent settings—are made to ensure efficient and robust sequence processing.
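The ranges reported above can be collected into a small configuration object with a sanity check. This is a hypothetical sketch: the class name `MTMConfig`, the field names, and the default values are assumptions, not settings taken from any cited paper. The weight-decay default in particular is a placeholder, since the text only says "moderate". The divisibility check reflects the usual multi-head attention requirement that the hidden size split evenly across heads.

```python
from dataclasses import dataclass

@dataclass
class MTMConfig:
    # Defaults are illustrative picks inside the ranges quoted in the text.
    num_layers: int = 6
    hidden_dim: int = 256
    num_heads: int = 4
    batch_size: int = 32
    mask_ratio: float = 0.3
    learning_rate: float = 3e-4
    weight_decay: float = 0.01  # placeholder; no specific value is reported

    def out_of_range(self):
        """Return the names of settings outside the commonly reported ranges."""
        checks = {
            "num_layers": 4 <= self.num_layers <= 12,
            "hidden_dim": 128 <= self.hidden_dim <= 768,
            "num_heads": 4 <= self.num_heads <= 6,
            "batch_size": 16 <= self.batch_size <= 64,
            "mask_ratio": 0.15 <= self.mask_ratio <= 0.60,
            "learning_rate": 1e-4 <= self.learning_rate <= 1e-3,
            "hidden_dim_divisible_by_heads": self.hidden_dim % self.num_heads == 0,
        }
        return [name for name, ok in checks.items() if not ok]
```

A value flagged by `out_of_range` is not necessarily wrong; the check only marks departures from the typical settings summarized in this section.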

5. Downstream Tasks, Evaluation, and Empirical Performance

MTMs pretrained on large-scale or diverse trajectories are evaluated on, or fine-tuned for, a variety of downstream tasks, in many cases without architectural changes:

  • Mobility and human movement: POI infilling, next/goal/terminal prediction, biased and unbiased class discovery (Garg et al., 28 Sep 2025). Example: Geolife dataset performance with modest recall bias in 198-class settings.
  • Multi-agent and sports scenarios: Prediction, imputation, spatial-temporal recovery unified as masked completion; evaluation via minADE/minFDE/path statistics, and out-of-bounds behaviors (Xu et al., 2024).
  • Offline reinforcement learning: Acting as a forward/inverse dynamics model, reward predictor, or RCBC policy for D4RL and RoboMimic domains, matching or exceeding state-of-the-art returns (Wu et al., 2023, Wen et al., 2024).
  • Scientific spatiotemporal data: Chamfer loss-based coordinate infilling, semantic track/shower discrimination from self-supervised representations (Young et al., 4 Feb 2025).
  • Autonomous driving prediction: Multi-modal, multi-agent trajectory forecasting across environments, explicit collision rate reduction, and improved generalization to new domains (Chen et al., 2023, Park et al., 2024).

Typical evaluation metrics include accuracy, F1, mean displacement errors (ADE, FDE), recall, class-consistency bias, and synthetic statistics matching (JSD, path/step statistics). MTMs consistently outperform specialized networks or prior self-supervised models, especially in scarce data or transfer scenarios (Long et al., 23 Jan 2025, Najjar, 2023, Xu et al., 2024).
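For reference, the displacement metrics mentioned above can be computed as in the following sketch. It uses the standard definitions of ADE and FDE and their best-of-K variants; the function names and the representation of trajectories as lists of coordinate tuples are illustrative choices.

```python
import math

def ade(pred, truth):
    """Average displacement error: mean Euclidean distance over all steps."""
    return sum(math.dist(p, q) for p, q in zip(pred, truth)) / len(truth)

def fde(pred, truth):
    """Final displacement error: Euclidean distance at the last step."""
    return math.dist(pred[-1], truth[-1])

def min_ade(candidates, truth):
    """Best ADE among K candidate trajectories (multi-modal prediction)."""
    return min(ade(c, truth) for c in candidates)

def min_fde(candidates, truth):
    """Best FDE among K candidate trajectories."""
    return min(fde(c, truth) for c in candidates)
```

The `min` variants reward a model for producing at least one candidate close to the ground truth, which is why they are common in multi-modal forecasting benchmarks.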

6. Extensions, Limitations, and Open Issues

Several research threads extend or interrogate MTM methodology:

  • Temporal modeling: The lack of explicit time embeddings in some MTMs limits representation of interval-scale dynamics; ongoing work targets continuous/learned time augmentations (Najjar, 2023).
  • Ethics and privacy: Large-scale trajectory modeling risks re-identification, motivating aggregation, anonymization, and ethical scrutiny (Najjar, 2023).
  • Adaptive and learned masking: While hand-crafted masking strategies suffice, automated or learnable masking policies may further improve robustness (Chen et al., 2023, Park et al., 2024).
  • Contextual and multi-modal conditioning: Classifier-free guidance, user-history embedding, and per-actor memory enable contextual generation and adaptation, including personalized or controllable rollouts (Long et al., 23 Jan 2025, Park et al., 2024).
  • Transfer and continual adaptation: MTMs with test-time adaptation or token memory achieve rapid online adjustment to domain shift or seen/unseen agents (Park et al., 2024).
  • Scaling and resource efficiency: Tokenization bottlenecks and model size limit application to global-scale or ultra-high-resolution domains; further advances in patching, grouping, or compression are anticipated (Young et al., 4 Feb 2025).

Empirical limitations include failure to distinguish fine-grained or sub-token patterns (e.g., overlapping trajectories), sensitivity to scene or agent duration, and overreliance on fixed masking schedules in some instances (Young et al., 4 Feb 2025, Park et al., 2024).

7. Representative MTM Instances Across Modalities

The following table summarizes representative MTM architectures and their operational domains, as defined in the cited literature:

| Model / Paper | Domain | Sequence Type | Core Masking |
|---|---|---|---|
| GPS-MTM (Garg et al., 28 Sep 2025) | Mobility analytics | POI+action sequence | Random per-modality, 15% |
| GenMove (MTM) (Long et al., 23 Jan 2025) | Mobility, multi-task | Location/graph embeddings | 5 mask types, classifier-free |
| Masked Trajectory Models (Wu et al., 2023) | RL, control | State–action trajectories | Random autoregressive |
| Traj-MAE (Chen et al., 2023) | Autonomous driving | Multi-agent trajectories | Social, temporal, hybrid |
| T4P (Park et al., 2024) | Motion forecasting | Agent lanes+scene polylines | 40%–30% multi-radius, actor mem. |
| UniTraj (Xu et al., 2024) | Sports/Multi-agent | $N\times T\times D$ tensors | 5 spatiotemporal mask types |
| PoLAr-MAE (Young et al., 4 Feb 2025) | HEP, 3D tracks | Point clouds, volumetric | 60% patch-level mask |

These instances highlight the cross-domain generality, architectural flexibility, and empirical efficacy of the masked trajectory model framework.
