Trajectory-Based Grounded Tokenization
- Trajectory-based grounded tokenization is a method that discretizes continuous spatiotemporal data into tokens aligned with inherent trajectories, preserving semantic details and reducing redundancy.
- It employs multi-scale, hierarchical, and rule-based schemes to capture local and contextual dynamics, enhancing prediction and reasoning in diverse applications.
- Empirical results demonstrate significant vocabulary reduction and efficiency gains, evidenced by lower computational demands and improved model accuracy across domains.
Trajectory-based grounded tokenization is a paradigm for discretizing and representing continuous spatiotemporal or action sequences—such as mobility trajectories, agent motions, or video entity tracks—using tokens that directly encode trajectory semantics, multi-scale context, and task-relevant grounding. This approach contrasts with conventional patch-based, uniformly gridded, or purely data-driven discretizations by explicitly aligning token structure with the underlying entities or dynamics in the data. It enables highly compact yet expressive sequence modeling, enhances data efficiency for large-scale learning, and supports precise control, reasoning, and prediction across a range of domains including human mobility, autonomous driving, robotics, simulation, and video understanding.
1. Conceptual Foundations and Motivations
The motivation for trajectory-based grounded tokenization arises from limitations of naïve gridding or patchification in complex spatiotemporal domains. Uniform spatial grids or fixed step-size discretization can explode the vocabulary size, fail to encode semantic continuity, and introduce redundancy or artifacts, especially for fine-grained tasks or large geographies (Park et al., 2023, Najjar, 2023). Conversely, grounding tokens on trajectories—persistent entity movements, agent actions, or object tracks—naturally aligns the discrete representation with domain structure, suppressing redundancy and preserving fine-grained dynamics.
Grounded tokenization generally aims to achieve:
- Compression: Drastic reduction in sequence length and vocabulary size versus patch- or location-based discretization, without loss of semantic information (Park et al., 2023, Zheng et al., 29 May 2025, Liu et al., 4 Feb 2026).
- Expressiveness: Accurate encoding of local and contextual dynamics for downstream sequence modeling (prediction, classification, generation).
- Grounding: Alignment of tokens with physically, semantically, or causally meaningful units (e.g., grid cells, agent states, object identities, or trajectories) rather than arbitrary quantization artifacts.
- Efficiency: Support for inference/reasoning with state-of-the-art models at tractable compute and memory budgets.
2. Representative Schemes and Their Mechanisms
Several canonical approaches to trajectory-based grounded tokenization have emerged, with characteristic differences in methodology, domain, and token structure.
2.1 Hierarchical and Multi-Scale Spatial Tokenization
The "Geo-Tokenizer" (Park et al., 2023) partitions spatial data at multiple scales, mapping each location to a tuple of grid-cell indices, one per scale. Strict nesting (coarse cells aggregate fine cells) dramatically reduces vocabulary size: for example, discretizing at 100 km/1 km/100 m yields 6,740 tokens versus 80,000 for a flat 100 m grid. Sequences of multi-scale indices are contextually encoded by masked Transformers; downstream tasks (next-location prediction, land-use classification, transport-mode classification) benefit from both the expressiveness and the manageability of the token set.
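The nesting trick can be sketched in a few lines. This is a hypothetical multi-scale grid tokenizer in the spirit of the Geo-Tokenizer, not its actual implementation: the scale sizes mirror the 100 km/1 km/100 m example, and the coordinate frame (metres from a region origin) is an illustrative assumption.

```python
# Illustrative multi-scale tokenizer: each coordinate maps to one nested
# grid-cell index per scale. Scales and coordinate frame are assumptions.

SCALES_M = [100_000, 1_000, 100]   # strictly nested: each cell tiles its parent

def multiscale_token(x_m: float, y_m: float) -> tuple:
    """Map a point to one (row, col) cell index per scale.

    Fine-scale indices are taken relative to their parent cell, so each
    scale's vocabulary only needs to cover one parent cell; this is what
    collapses the flat-grid vocabulary.
    """
    token = []
    for i, size in enumerate(SCALES_M):
        row, col = int(y_m // size), int(x_m // size)
        if i > 0:
            ratio = SCALES_M[i - 1] // size   # child cells per parent-cell side
            row, col = row % ratio, col % ratio  # index relative to parent
        token.append((row, col))
    return tuple(token)

# A point 123,456 m east and 78,900 m north of the region origin:
print(multiscale_token(123_456.0, 78_900.0))
```

Because every fine index is relative to its parent, each fine scale's vocabulary is bounded by the squared nesting ratio rather than by the size of the whole region.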
2.2 Data-Driven and Rule-Based Discretization for Agent Motion
In simulation and behavior prediction, frameworks such as TrajTok (Zhang et al., 23 Jun 2025) and k-disks (Philion et al., 2023) convert short fixed-horizon trajectories or action transitions into tokens by clustering in endpoint or motion space. Rule-based gridding (with coverage/robustness heuristics) and data-driven averaging/expansion ensure semantic grounding and avoid grid artifacts. Tokens are typically short motion sequences or local increments. Distinct agent types (vehicles, bicyclists, pedestrians) each have separate vocabularies.
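The clustering idea can be illustrated with a toy endpoint tokenizer. This is not the exact TrajTok or k-disks algorithm; it reduces each short trajectory to its endpoint displacement and assigns it to the nearest of k cluster centres, whose index becomes the motion token. The data, k, and deterministic initialization are all illustrative choices.

```python
# Toy endpoint-clustering tokenizer (illustrative, not TrajTok/k-disks itself).

def kmeans(points, k, iters=20):
    # Deterministic init: evenly spaced points from the dataset.
    centres = points[:: max(1, len(points) // k)][:k]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centres[i][0]) ** 2 + (p[1] - centres[i][1]) ** 2)
            buckets[j].append(p)
        centres = [
            (sum(x for x, _ in b) / len(b), sum(y for _, y in b) / len(b)) if b else centres[i]
            for i, b in enumerate(buckets)
        ]
    return centres

def tokenize(endpoint, centres):
    # Token id = index of the nearest cluster centre.
    return min(range(len(centres)),
               key=lambda i: (endpoint[0] - centres[i][0]) ** 2 + (endpoint[1] - centres[i][1]) ** 2)

# Endpoints of short motion segments (metres, synthetic):
data = [(1.0, 0.1), (1.1, -0.1), (0.0, 0.0), (0.1, 0.0), (2.0, 1.0), (2.1, 0.9)]
centres = kmeans(data, k=3)
tokens = [tokenize(p, centres) for p in data]
print(tokens)  # pairs of similar motions share a token
```

In the real frameworks, the coverage/robustness heuristics mentioned above would additionally reshape or merge cells so rare endpoints are still representable.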
2.3 Object and Trajectory-Based Tokens in Video and Multimodal Models
For video, TrajViT (Zheng et al., 29 May 2025) and Trokens (Kumar et al., 5 Aug 2025) replace traditional patch tokens with tokens derived from panoptic object or part trajectories. Each token aggregates features along the object trajectory (appearance and motion) and can encode intra- and inter-trajectory dynamics, sampled with semantic or saliency priors. In TGT (Zhang et al., 16 Oct 2025), each trajectory is paired with local text grounding and directly controls spatially-precise video generation via cross-attention.
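The token-count benefit of trajectory tokens can be shown with a much simplified sketch, loosely in the spirit of TrajViT/Trokens: per-frame patch features belonging to the same object track are pooled into a single token, so the token count scales with the number of tracked entities rather than with frames times patches. The track assignments and mean-pooling here are synthetic stand-ins for real panoptic tracking and learned aggregation.

```python
# Simplified trajectory-token construction for video (synthetic tracks,
# mean-pooling instead of learned aggregation).
import numpy as np

rng = np.random.default_rng(0)
frames, patches, dim = 8, 196, 32
feats = rng.normal(size=(frames, patches, dim))      # per-frame patch features
track_id = rng.integers(0, 5, size=(frames, patches))  # 5 tracked entities

def trajectory_tokens(feats, track_id, n_tracks):
    """Pool all patch features of each track into one token embedding."""
    flat_f = feats.reshape(-1, feats.shape[-1])
    flat_t = track_id.ravel()
    return np.stack([flat_f[flat_t == t].mean(axis=0) for t in range(n_tracks)])

tokens = trajectory_tokens(feats, track_id, 5)
print(tokens.shape, frames * patches)  # 5 trajectory tokens vs. 1568 patch tokens
```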
2.4 Temporal/Segmental Grounding for Long-Term Sequences
In human mobility prediction, RHYTHM (He et al., 18 Jul 2025, He et al., 27 Sep 2025) segments long historical trajectories into semantically meaningful daily or weekly blocks. Each block is tokenized via intra-segment attention and semantic-prompt embeddings, reducing input length and enabling hierarchical modeling of periodicity.
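A minimal sketch of this segment-level reduction, with mean-pooling standing in for RHYTHM's intra-segment attention and prompt embeddings: a multi-week hourly trajectory is split into daily blocks and each block is summarised into one token embedding. The sequence length and feature dimension are illustrative.

```python
# Segment-level tokenization sketch: one token per day of hourly features.
import numpy as np

HOURS_PER_DAY = 24

def segment_tokens(hourly_embeds: np.ndarray) -> np.ndarray:
    """(T, d) hourly embeddings -> (T // 24, d): one pooled token per day."""
    t, d = hourly_embeds.shape
    days = t // HOURS_PER_DAY
    return hourly_embeds[: days * HOURS_PER_DAY].reshape(days, HOURS_PER_DAY, d).mean(axis=1)

seq = np.random.default_rng(0).normal(size=(384, 32))  # 16 days of hourly features
day_tokens = segment_tokens(seq)
print(day_tokens.shape)  # sequence length reduced 24x before the Transformer
```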
2.5 Action and Policy Tokenization in Robotics
OAT (Liu et al., 4 Feb 2026) introduces a learned tokenization that compresses continuous action chunks into ordered sequences of finite-quantized, register-based tokens. The design provides high compression, full decodability, and strict left-to-right causal order, making tokens directly compatible with autoregressive policy learning and prefix-based control.
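One way such finite quantization can work is finite scalar quantization (FSQ), sketched below as a generic mechanism rather than OAT's exact scheme: each latent dimension is rounded to one of a fixed number of levels, and the per-dimension bins are packed mixed-radix into a single integer token id, which is fully decodable back to a latent. Levels and dimensions are illustrative.

```python
# Generic finite-scalar-quantization sketch (not necessarily OAT's scheme).
import numpy as np

LEVELS = np.array([8, 8, 8])   # quantization levels per latent dimension

def fsq_encode(z: np.ndarray) -> int:
    """Quantize a latent in [-1, 1]^d to a single integer token id."""
    z = np.clip(z, -1.0, 1.0)
    idx = np.round((z + 1.0) / 2.0 * (LEVELS - 1)).astype(int)  # per-dim bin
    token = 0
    for i, l in zip(idx, LEVELS):       # mixed-radix packing
        token = token * int(l) + int(i)
    return token

def fsq_decode(token: int) -> np.ndarray:
    """Invert the packing: token id back to the centre of its latent bin."""
    idx = []
    for l in LEVELS[::-1]:
        idx.append(token % int(l))
        token //= int(l)
    idx = np.array(idx[::-1])
    return idx / (LEVELS - 1) * 2.0 - 1.0

z = np.array([0.3, -0.7, 0.95])
tok = fsq_encode(z)
z_hat = fsq_decode(tok)
print(tok)  # round-trip error is bounded by half the bin width
```

Full decodability (every token id maps back to a concrete latent) is the property that makes such tokens usable for control, not just for prediction.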
3. Model Integration and Training Methodologies
Grounded trajectory tokens are typically integrated as follows:
- Encoder Preprocessing: Raw input sequences (coordinates, images, or actions) are converted via hierarchical gridding, clustering, semantic sampling, or motion extraction to a compact sequence of tokens.
- Embedding and Fusion: Each token is mapped to an embedding via learned lookup tables, fusion of multi-scale features, or an adapter projecting into an LLM or multimodal embedding space.
- Hierarchical/Contextual Modeling: Specialized architectures—masked/causal Transformers, cross-attention with contextual inputs, or hierarchical attention stacks—consume token sequences. For example, the Hierarchical Auto-regressive Location Model (HALM) successively predicts scale-wise components via feedforward heads (Park et al., 2023).
- Loss Functions and Robustness: Training often involves cross-entropy or contrastive losses. Extensions include spatial label smoothing proportional to trajectory space error (Zhang et al., 23 Jun 2025), masked modeling (Najjar, 2023), or contrastive InfoNCE objectives (Zheng et al., 29 May 2025).
- End-to-End or Hybrid Pipelines: Grounded tokenizers may be frozen (pretrained) (Tian et al., 2024) or co-trained with downstream heads. Some frameworks fuse token embeddings with semantic prompts (offline LLM encodings) (He et al., 18 Jul 2025).
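The first three steps above can be condensed into a toy pipeline: grounded token ids are embedded via a lookup table and scored with a next-token cross-entropy loss. The vocabulary size, dimension, and the single linear layer are placeholders for the real architectures cited above.

```python
# Toy embedding + next-token cross-entropy pipeline (sizes are placeholders).
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 64, 16
embed = rng.normal(scale=0.1, size=(VOCAB, DIM))  # token embedding table
proj = rng.normal(scale=0.1, size=(DIM, VOCAB))   # stand-in for the model

def next_token_loss(token_ids: np.ndarray) -> float:
    """Cross-entropy of predicting token t+1 from the embedding of token t."""
    x = embed[token_ids[:-1]]                       # (T-1, DIM) inputs
    logits = x @ proj                               # (T-1, VOCAB)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    targets = token_ids[1:]
    return float(-logp[np.arange(len(targets)), targets].mean())

ids = rng.integers(0, VOCAB, size=20)   # a sequence of grounded token ids
print(round(next_token_loss(ids), 3))   # ~log(VOCAB) for an untrained model
```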
4. Empirical Performance, Trade-offs, and Ablations
Across domains, trajectory-based grounded tokenization consistently yields measurable gains:
| Model/Paper | Vocabulary / Compression | Downstream Gains | Efficiency Gains |
|---|---|---|---|
| Geo-Tokenizer (Park et al., 2023) | 6.7k tokens (vs. 80k) | Next-loc. Acc@5 +7%; land-use and mode classification | 7× fewer params, 2.5× fewer FLOPs |
| TrajTok (Zhang et al., 23 Jun 2025) | ≈8k tokens/vehicle | Realism score +0.0038, best coverage | >99.5% coverage, plug-and-play |
| TrajViT (Zheng et al., 29 May 2025) | 10× token reduction | Retrieval +6%, QA +5.2% | 18× fewer FLOPs, training 4× faster |
| RHYTHM (He et al., 18 Jul 2025) | Seq.-len. 384→55 | Acc@1 +2.4pp (overall), +5pp (weekends) | –24.6% train time, frozen LLM |
| OAT (Liu et al., 4 Feb 2026) | 28× compressed vs. float | Success +10–50% vs. prior tokenizers | Prefix-based "anytime" inference |
Ablations and analyses identify several empirical findings:
- Multi-scale or hierarchical sharing is critical for vocabulary reduction and expressivity (Park et al., 2023).
- Semantic and temporal grounding (via prompts or object category) enhances model interpretability and generalization (He et al., 18 Jul 2025, Zheng et al., 29 May 2025).
- Trajectory-relational and intra-segment descriptors (e.g., HoD in Trokens) enable few-shot action recognition (Kumar et al., 5 Aug 2025).
- Spatial label smoothing regularizes token prediction and improves generalization in highly discretized motion spaces (Zhang et al., 23 Jun 2025).
- Order-inducing dropout or attention enforces compatibility with next-token models and enables flexible reasoning (Liu et al., 4 Feb 2026).
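The spatial-label-smoothing finding above admits a compact sketch: instead of a one-hot target over grid cells, probability mass is spread to spatially neighbouring cells, decaying with distance from the true cell. The Gaussian kernel and its width are illustrative choices, not the setting used by TrajTok.

```python
# Hedged sketch of spatial label smoothing over a discretized motion grid.
import numpy as np

GRID = 7   # 7x7 grid of motion-endpoint cells (illustrative)

def smooth_target(true_cell: tuple, sigma: float = 1.0) -> np.ndarray:
    """Soft target: Gaussian in grid distance around the true cell."""
    rows, cols = np.mgrid[0:GRID, 0:GRID]
    d2 = (rows - true_cell[0]) ** 2 + (cols - true_cell[1]) ** 2
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return (w / w.sum()).ravel()   # flattened distribution over all cells

target = smooth_target((3, 3))
print(target.argmax())  # the peak stays at the true cell's flat index
```

Training against such targets penalises near-miss predictions less than distant ones, which is what regularises learning in a highly discretized motion space.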
5. Limitations and Open Challenges
While trajectory-based grounded tokenizers provide substantial benefits, several limitations remain:
- Many schemes require hand-tuned resolution or clustering heuristics (e.g., cell size, merge thresholds), with limited end-to-end differentiability (Najjar, 2023, Park et al., 2023).
- Rare or outlier behaviors can challenge fixed token vocabularies; hybrid or adaptive discretizations are underexplored (Zheng et al., 29 May 2025, Tian et al., 2024).
- Some pipelines rely on frozen upstream modules (e.g., scene tokenizers, LLMs): missed detections or misalignments cannot be remedied downstream (Tian et al., 2024).
- The integration of time beyond coarse clustering, or of rich interaction structures (multi-agent, multi-modal), remains an open extension (Najjar, 2023, Zheng et al., 29 May 2025, Zhang et al., 16 Oct 2025).
- Scalability to even longer sequences and more complex or multi-resolution behaviors may require further innovations in hierarchical or adaptive tokenization.
Future work aims to address these by exploring learned, end-to-end quantization (Najjar, 2023), dynamic and fine-grained multi-scale hierarchies (He et al., 27 Sep 2025), spatially grounded reasoning in multi-modal models (Tian et al., 2024), and the fusion of discrete and continuous control (Liu et al., 4 Feb 2026).
6. Extensions and Domain-Specific Specializations
Trajectory-based grounded tokenization has been effectively specialized for:
- Multi-agent simulation and prediction: Traffic agents, pedestrians, and cyclists modeled via k-disk, TrajTok, or contextually grounded tokens, optimized for interaction modeling (Philion et al., 2023, Zhang et al., 23 Jun 2025).
- Robotic policy generation: OAT tokens structure control as discrete, fully decodable, causally ordered sequences, supporting "anytime" planning and hierarchical decomposition (Liu et al., 4 Feb 2026).
- Video understanding and generation: Video transformer pipelines operate on sub-object trajectory tokens (TrajViT), semantic-aware action tracks (Trokens), or text-grounded visual trajectories (TGT) for controllable, interpretable motion prediction and video QA (Zheng et al., 29 May 2025, Kumar et al., 5 Aug 2025, Zhang et al., 16 Oct 2025).
- Large-scale trajectory intelligence: Foundational models for user mobility and activity use spatial-temporal clustering and hierarchical discrete tokenization (e.g., H3 + sub-hash + WordPiece) to enable transformer-based sequence learning over population-scale datasets (Najjar, 2023, He et al., 18 Jul 2025).
These domain specializations leverage grounding to objects, motion, semantics, or context, collectively demonstrating that trajectory-based tokenization is a critical enabler for scalable, fine-grained, and interpretable modeling across spatiotemporal machine learning.
Key references: (Park et al., 2023, Philion et al., 2023, Zhang et al., 23 Jun 2025, Zheng et al., 29 May 2025, He et al., 18 Jul 2025, He et al., 27 Sep 2025, Najjar, 2023, Liu et al., 4 Feb 2026, Kumar et al., 5 Aug 2025, Zhang et al., 16 Oct 2025, Tian et al., 2024).