Trajectory Tokens: Methods & Applications
- Trajectory Tokens are discrete or latent symbols representing atomic units or trajectory segments that capture dynamic patterns in behavioral, spatio-temporal, or multimodal data.
- They enable efficient modeling by reducing data redundancy and aligning tokens with semantic coherence, benefiting applications in video analysis, game state modeling, and geospatial reasoning.
- Their construction leverages methods such as JSON-Bag tokenization, learnable clustering, and adaptive token selection to optimize performance and scalability in complex sequential tasks.
A trajectory token is a discrete or latent symbol representing atomic units, segments, or paths within a sequential behavioral, spatio-temporal, or multi-modal process. The core property of trajectory tokens is their alignment with temporally or semantically coherent “trajectories”, whether in agent behaviors, video sub-objects, spatio-temporal motion, location traces, or reasoning steps. Unlike traditional tokens (e.g., linguistic words or uniform video patches), trajectory tokens are activated, grouped, or learned according to dynamic patterns of motion, semantic structure, or information-theoretic relevance within a trajectory. They are foundational in scalable modeling approaches for games, video understanding, geospatial reasoning, multi-agent simulation, and LLM policy optimization, each with its own definitions and computational implementations.
1. Canonical Definitions and Taxonomies
Trajectory tokens are variously concretized depending on domain and modality, but all share the following technical aspects:
- In game modeling and state logs: A trajectory token is an atomic JSON path-value tuple, such as “.currentAge.2” or “.playerResources[1].Wood.2”, derived via exhaustive tokenization of the hierarchical game state sequence, treating every non-container entry as a token and joining key paths. The full vocabulary is the union of all such potential paths over the dataset, enabling trajectory bag-of-tokens representations (Nguyen et al., 1 Aug 2025).
- In video and vision: A trajectory token often represents an entity- or motion-coherent cluster of spatio-temporal features spanning the video. For instance, it may correspond to a panoptic sub-object’s pixel mask evolving over time (TrajViT (Zheng et al., 29 May 2025)), a group of image patches dynamically clustered and aggregated with learnable attention queries (TrajTok (Zheng et al., 26 Feb 2026)), or a semantic-aware tracked point with fused motion and appearance descriptors (Trokens (Kumar et al., 5 Aug 2025), TATs (Kumar et al., 2024)).
- In reinforcement learning and LLMs: Each token generated along a model’s output trajectory can be viewed as a trajectory token. Token identities, positions, and contextual reward signals are critical for computing gradients, assigning structure-aware or conflict-aware policy updates, or for enabling efficient training paradigms such as partial-token policy gradients (NAT (Sang et al., 20 Feb 2026), GTPO (Simoni et al., 5 Aug 2025)).
- In geospatial and sequential data: Location and time tokens derived via hierarchical lattice encoding (e.g., H3 hexagons, JIS X 0410 grid codes) or discretized time-buckets, fused into trajectory-level sequences for deep learning models, serve as the fundamental symbols by which mobility or activity trajectories are represented (Najjar, 2023, Horikomi et al., 2023, Mbuya et al., 2024).
- In multimodal reasoning: Trajectory tokens may be latent (continuous vector) embeddings, optionally interleaved with discrete text, that represent the model’s “mental imagery” at each reasoning step, allowing step-wise multimodal computation (Yang et al., 20 Jun 2025).
Trajectories may be atomic (as in path segments), grouped (object-wise), or synthetic (conditional rollouts in policy optimization).
2. Algorithmic Construction and Representation
Game Trajectory Tokenization
The JSON-Bag approach for games (Nguyen et al., 1 Aug 2025) operates as follows:
- Each full play-through is serialized as a JSON array of states.
- Every non-container (leaf) property is converted into a token named by its full path from the root.
- Across all game states in a trajectory, these tokens are collected and counted.
- The set of unique tokens across all games forms the vocabulary; per-trajectory frequency vectors are L1-normalized to enable probabilistic and distance-based comparisons.
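This procedure can be sketched in a few lines of Python, assuming each play-through arrives as a list of parsed JSON states (Python dicts/lists); the helper names are illustrative, not taken from the paper's code:

```python
from collections import Counter

def flatten_state(node, path=""):
    """Recursively emit 'path.value' tokens for every non-container (leaf) entry."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten_state(child, f"{path}.{key}")
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from flatten_state(child, f"{path}[{i}]")
    else:  # leaf: join the key path with the value, e.g. ".currentAge.2"
        yield f"{path}.{node}"

def bag_of_tokens(trajectory):
    """Count tokens over all states in a play-through and L1-normalize."""
    counts = Counter(tok for state in trajectory for tok in flatten_state(state))
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

# toy trajectory of two game states
traj = [{"currentAge": 2, "playerResources": [{"Wood": 2}]},
        {"currentAge": 3, "playerResources": [{"Wood": 1}]}]
print(bag_of_tokens(traj))
```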
Video and Visual Tokenization
In prevailing video models:
- Patch-based: Uniform grid patchification over space-time (ViT3D, ViViT); scales with O(T·H·W) tokens.
- Trajectory-based (TrajViT, TrajTok): Pixels are grouped into spatio-temporal clusters (trajectories) with dynamic, learnable assignment; each trajectory is pooled (e.g., via Perceiver cross-attention) into a trajectory token that represents a persistent object or motion entity, decoupling token count from video length and enabling focus on semantic scene complexity (Zheng et al., 29 May 2025, Zheng et al., 26 Feb 2026); a minimal pooling sketch follows this list.
- Point tracking and semantic fusion (Trokens, TATs): Sampled points (often via semantic-aware methods) are tracked through time, with trajectory tokens enriched by appearance descriptors and explicit motion features such as histograms of oriented displacement or pairwise offsets (Kumar et al., 5 Aug 2025, Kumar et al., 2024).
- Motion disentanglement (TokenMotion): Spatio-temporal tokens are derived for both human poses and camera trajectories, then fused via dynamic masks and per-block cross-attention to control human-centric video generation (Li et al., 11 Apr 2025).
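To make the pooling step concrete, here is a minimal PyTorch sketch of Perceiver-style cross-attention pooling: a learnable query attends over the patch features assigned to one trajectory and emits a single trajectory token. The grouping itself (which patches belong to which trajectory) is assumed given, and the module is an illustrative assumption, not the TrajViT/TrajTok implementation:

```python
import torch
import torch.nn as nn

class TrajectoryPooler(nn.Module):
    """Pool a variable-size set of patch features belonging to one trajectory
    into a single trajectory token via a learnable cross-attention query."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learnable query
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (num_patches_in_trajectory, dim)
        kv = patch_feats.unsqueeze(0)              # (1, N, dim)
        token, _ = self.attn(self.query, kv, kv)   # (1, 1, dim)
        return token.squeeze(0).squeeze(0)         # (dim,)

pooler = TrajectoryPooler(dim=64)
# e.g. 17 patches tracked as one spatio-temporal trajectory
traj_token = pooler(torch.randn(17, 64))
print(traj_token.shape)  # torch.Size([64])
```

Because each trajectory is pooled independently, the number of output tokens equals the number of trajectories, not the number of space-time patches.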
Language, RL, and Geospatial Modeling
- Autoregressive LMs and RL: Each position in an output sequence is a trajectory token. They are subject to token-aware policy-gradient rules, conflict-aware updates, and partial-token selection for training (Sang et al., 20 Feb 2026, Simoni et al., 5 Aug 2025, Shen et al., 15 Jan 2026).
- Geospatial check-ins and human mobility: Locations are quantized into hierarchical spatial tokens (e.g., H3, JIS X 0410), sequenced over time and further sub-tokenized by byte-pair or WordPiece algorithms to address vocabulary scaling and robustness (Najjar, 2023, Horikomi et al., 2023).
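As a concrete illustration of the geospatial case, the sketch below maps GPS samples to H3 cell tokens plus coarse hour-of-day tokens. It assumes the h3 Python bindings (v4 API); the time bucketing is an illustrative choice, and BPE/WordPiece sub-tokenization would follow as described above:

```python
import h3  # Uber's hexagonal hierarchical index (v4 Python API)

def tokenize_point(lat: float, lng: float, ts_hour: int, res: int = 8):
    """Map a (lat, lng, time) sample to a spatial token plus a coarse time token."""
    cell = h3.latlng_to_cell(lat, lng, res)  # 15-character H3 cell id string
    return cell, f"hour_{ts_hour % 24}"

# a short mobility trace: (lat, lng, hour-of-day)
trace = [(35.6812, 139.7671, 8), (35.6586, 139.7454, 9)]
tokens = [tok for pt in trace for tok in tokenize_point(*pt)]
print(tokens)  # interleaved location / time tokens, ready for sub-tokenization
```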
3. Core Analytical and Learning Methods
Bag-of-Tokens and Information-Theoretic Distances
- Tokens are counted per trajectory, forming normalized bag-of-tokens distributions.
- Jensen-Shannon distance (JSD), a symmetrized, true metric, is applied to these distributions for similarity and classification tasks.
- Class prototypes are computed as average frequency vectors for prototype-based nearest-neighbor search (Nguyen et al., 1 Aug 2025).
- JSD distances between class prototypes correlate strongly with policy distances between agents, indicating functional alignment between token statistics and behavioral diversity.
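A minimal sketch of prototype-based classification under JSD, using SciPy's jensenshannon (which returns the JS distance, i.e., the square root of the divergence); it assumes all frequency vectors are L1-normalized and aligned to a shared token vocabulary:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def classify_by_prototype(freq: np.ndarray, prototypes: dict) -> str:
    """Nearest-prototype classification under Jensen-Shannon distance.
    `freq` and each prototype are L1-normalized frequency vectors over a
    shared token vocabulary."""
    return min(prototypes, key=lambda c: jensenshannon(freq, prototypes[c], base=2))

# toy example: two agent classes over a 4-token vocabulary
protos = {
    "aggressive": np.array([0.5, 0.3, 0.1, 0.1]),
    "defensive":  np.array([0.1, 0.1, 0.4, 0.4]),
}
x = np.array([0.45, 0.35, 0.1, 0.1])  # unseen trajectory's bag-of-tokens
print(classify_by_prototype(x, protos))  # -> "aggressive"
```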
Trajectory-Based Aggregation and Adaptive Grouping
- In video, trajectory tokens are formed by assigning pixels or patches to trajectory segments using learnable attention, segmentation, or tracking outputs.
- The number of trajectory tokens is dynamically determined by semantic content (e.g., object count), not spatio-temporal grid dimensions (Zheng et al., 26 Feb 2026).
- Matryoshka-style adaptive sub-tokenization allows per-trajectory granularity control.
- Trajectory tokens are often refined by mask-constrained cross-attention for disentanglement and fine-grained representation.
Token Selection and Policy Optimization
- In RL and language-model fine-tuning, not all output tokens in a trajectory need to be updated; unbiased estimators with partial token selection (e.g., via Horvitz-Thompson reweighting) maintain the learning signal while reducing compute and memory (Sang et al., 20 Feb 2026); see the sketch after this list.
- Conflict tokens—tokens appearing in both positively and negatively rewarded completions at the same position—are treated specially, with masked/doubled gradients to stabilize sequence-level updates (Simoni et al., 5 Aug 2025).
- In distillation, token-level confidence trajectory (how the model’s confidence for each token evolves during training) determines masking strategies that overcome early optimization bottlenecks and improve transfer (Shen et al., 15 Jan 2026).
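The partial-token idea from the first bullet can be illustrated with a short PyTorch sketch: tokens are kept independently with probability keep_prob, and each kept term is reweighted by its inverse inclusion probability, so the expected gradient matches the full-token REINFORCE gradient. The sampling scheme and constants are illustrative assumptions, not the NAT recipe:

```python
import torch

def ht_policy_gradient_loss(logps: torch.Tensor, advantages: torch.Tensor,
                            keep_prob: float = 0.5) -> torch.Tensor:
    """Unbiased partial-token policy-gradient loss.
    logps:      (T,) log-probabilities of the sampled tokens along a trajectory
    advantages: (T,) per-token advantage / reward estimates
    Each token is kept independently with probability `keep_prob`; kept terms
    are reweighted by 1/keep_prob (Horvitz-Thompson estimator), so the
    expected loss equals the full-token REINFORCE loss."""
    mask = (torch.rand_like(logps) < keep_prob).float()
    weights = mask / keep_prob
    return -(weights * advantages.detach() * logps).sum()

logps = torch.randn(128, requires_grad=True)  # stand-in for model log-probs
adv = torch.randn(128)
loss = ht_policy_gradient_loss(logps, adv)
loss.backward()  # gradient flows only through the ~50% of tokens that were kept
```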
Motion and Relational Modeling
- Trajectory tokens are enriched by concatenating or adding appearance features, intra-trajectory motion histograms (HoD), and explicit inter-trajectory relational embeddings, enabling both local and global motion modeling (Kumar et al., 5 Aug 2025).
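As an illustration of the intra-trajectory motion features, here is a NumPy sketch of a histogram-of-oriented-displacements (HoD) descriptor for a single tracked point; the bin count and magnitude weighting are illustrative choices rather than the papers' exact settings:

```python
import numpy as np

def hod_descriptor(track: np.ndarray, n_bins: int = 8) -> np.ndarray:
    """Histogram of oriented displacements for one point trajectory.
    track: (T, 2) array of (x, y) positions over T frames.
    Frame-to-frame displacement angles are binned; each vote is weighted by
    displacement magnitude, then the histogram is L1-normalized."""
    disp = np.diff(track, axis=0)                  # (T-1, 2) displacements
    angles = np.arctan2(disp[:, 1], disp[:, 0])    # in [-pi, pi]
    mags = np.linalg.norm(disp, axis=1)
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi), weights=mags)
    return hist / max(hist.sum(), 1e-8)

track = np.cumsum(np.random.randn(16, 2), axis=0)  # a random-walk trajectory
print(hod_descriptor(track))  # 8-bin motion descriptor for this trajectory token
```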
Spatial/Rich Label Smoothing
- In behavioral and agent modeling, spatially-aware label smoothing replaces uniform label smoothing, allocating higher probabilities to tokens near the ground truth in trajectory (endpoint/yaw) space (Zhang et al., 23 Jun 2025).
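A sketch of the idea in NumPy: smoothing mass is spread over non-ground-truth tokens in proportion to a Gaussian kernel on their distance to the ground truth in endpoint/yaw space. The kernel choice and bandwidth are assumptions for illustration, not the exact formulation of (Zhang et al., 23 Jun 2025):

```python
import numpy as np

def spatial_label_smoothing(gt_idx: int, token_positions: np.ndarray,
                            eps: float = 0.1, bandwidth: float = 1.0) -> np.ndarray:
    """Target distribution over a discrete trajectory-token vocabulary.
    token_positions: (V, D) coordinates of each token in endpoint/yaw space.
    The ground-truth token gets 1 - eps; the remaining eps is spread over the
    other tokens proportionally to a Gaussian kernel on spatial distance."""
    d = np.linalg.norm(token_positions - token_positions[gt_idx], axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    w[gt_idx] = 0.0                      # smoothing mass goes to neighbors only
    target = eps * w / w.sum()
    target[gt_idx] = 1.0 - eps
    return target

# toy vocabulary of 5 endpoint tokens on a line
pos = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
print(spatial_label_smoothing(gt_idx=2, token_positions=pos))
```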
4. Empirical Benchmarks and Applications
Trajectory tokens underpin state-of-the-art performance in multiple tasks:
| Domain | Implementations/Benchmarks | Reported Effects |
|---|---|---|
| Game agent and seed classification | JSON-Bag across TAG games (Nguyen et al., 1 Aug 2025) | Outperforms hand-crafted features, strong sample-efficiency |
| Video retrieval/classification | TrajViT/TrajTok (Zheng et al., 29 May 2025, Zheng et al., 26 Feb 2026) | 6–10× token reduction, +6% retrieval R@5, 4× training speed |
| Few-shot action recognition | Trokens, TATs (Kumar et al., 5 Aug 2025, Kumar et al., 2024) | Trokens outperforms TATs via relational+motion tokens; robust to point count |
| RL efficiency | NAT (Sang et al., 20 Feb 2026) | Up to 50% token budget, no loss in Acc@16, 29% training speedup |
| Agent simulation (autonomous driving) | TrajTok (Zhang et al., 23 Jun 2025) | +0.0038 realism over top-k baselines, robust symmetry |
| Multimodal reasoning | Mirage (Yang et al., 20 Jun 2025) | +4–9% in spatial reasoning/planning accuracy using latent trajectory tokens |
| Mobility/trajectory ML | Spatial tokenization (Najjar, 2023, Horikomi et al., 2023) | 33% F1 improvement, sub-token scaling, robust feature learning |
Applied domains include masked video modeling (TATS (Rai et al., 13 May 2025)), frame interpolation (Liu et al., 2022), online anomaly detection (Mbuya et al., 2024), and advanced LLM decoding (d3LLM (Qian et al., 12 Jan 2026)).
5. Advantages, Limitations, and Comparative Insights
Advantages
- Token-Count Efficiency: Decoupling token count from video/frame count or trajectory length, thereby scaling with semantic content rather than input redundancy (Zheng et al., 29 May 2025, Zheng et al., 26 Feb 2026).
- Semantic and Structural Fidelity: Align tokens to actual entities, sub-objects, or motion primitives, improving interpretability and information representation (Kumar et al., 5 Aug 2025, Zheng et al., 29 May 2025).
- Sample Efficiency and Adaptivity: Prototype- or relation-based methods yield strong few-shot and cross-domain transfer gains (Nguyen et al., 1 Aug 2025, Kumar et al., 5 Aug 2025).
- Plug-and-Play Integration: Trajectory tokenizers serve as drop-in modules, from policy learning (Simoni et al., 5 Aug 2025, Sang et al., 20 Feb 2026) to behavior generation (Zhang et al., 23 Jun 2025) and foundation models (Najjar, 2023).
Limitations and Challenges
- Tokenization Complexity and Consistency: Many methods rely on external tracking, segmentation, or rule-based procedures, which may be slow or non-end-to-end (Zheng et al., 29 May 2025). Recent work addresses this via fully differentiable segmentation (Zheng et al., 26 Feb 2026).
- Sensitivity to Hyperparameters: Dynamic token allocation, mask thresholds, token vocabulary size, and mask ratio require careful tuning for optimal efficiency and accuracy.
- Domain Generalization: Spatial or semantic coverage may still be imperfect in new environments, requiring adaptation of hierarchical coding or clustering schemes (Najjar, 2023, Zhang et al., 23 Jun 2025).
- Interpretability Tradeoffs: Latent trajectory tokens (e.g., in Mirage) may not correspond to interpretable visual concepts, although they empirically align with spatial reasoning (Yang et al., 20 Jun 2025).
6. Theoretical and Implementation Frameworks
Several formal and pseudo-code frameworks are specified:
- Bag-of-tokens JSD Computation: Explicit L1 normalization, followed by JSD metric computation for distance-based classification (Nguyen et al., 1 Aug 2025).
- Learnable Perceiver-based Clustering: Unsupervised attention or Perceiver modules learn soft segmentations over spatio-temporal features to produce hard/soft trajectory masks (Zheng et al., 26 Feb 2026).
- Policy Gradient with Horvitz-Thompson Correction: Subsampling with inclusion probabilities per token, producing unbiased but higher-variance RL signals (Sang et al., 20 Feb 2026).
- Sample/Pseudo-label-based RL for Token Samplers: Joint PPO and reconstruction objectives for adaptive masking in masked video models, with alternating train/freeze schedules (Rai et al., 13 May 2025).
- Spatially-Aware Smoothing in Label Space: Probability mass for near-misses is distributed proportional to spatial proximity, enhancing model tolerance to minor prediction errors (Zhang et al., 23 Jun 2025).
7. Outlook and Open Directions
Research on trajectory tokens continues to develop in the following directions:
- End-to-end and Differentiable Tokenization: Joint training of tokenizers and downstream models to adapt trajectory decomposition to task-specific loss functions (Zheng et al., 26 Feb 2026).
- Semantics-Driven and Adaptive Tokenization: Dynamic allocation and sub-tokenization based on downstream supervision, semantic richness, or entity-level priors (Kumar et al., 5 Aug 2025, Zheng et al., 26 Feb 2026).
- Hybrid Discrete-Continuous Representations: Use of both discrete (e.g., spatial hashes, linguistic tokens) and continuous (e.g., latent visual embeddings) tokens within the same trajectory, supporting multimodal and joint reasoning (Yang et al., 20 Jun 2025).
- RL and Selective Update Paradigms: Token-selective objectives in RL (GTPO, NAT; (Simoni et al., 5 Aug 2025, Sang et al., 20 Feb 2026)) provide scalable training for long-sequence LLMs and sequential decision processes.
- Foundation Models for Spatio-Temporal Data: Universal tokenization frameworks for large-scale, real-world sequential data (mobility, event logs), enabling generalizable pre-training and fine-tuning (Najjar, 2023).
- Privacy and Interpretability: Ensuring trajectory tokenization preserves privacy and supports introspection, particularly in human behavioral modeling contexts.
Trajectory tokens thus constitute a pervasive, foundational primitive in current machine learning—enabling efficient, scalable, and semantically coherent modeling of complex sequential, spatio-temporal, and agent-based processes.