
Trajectory-Based Grounded Tokenization

Updated 13 February 2026
  • Trajectory-based grounded tokenization is a method that discretizes continuous spatiotemporal data into tokens aligned with inherent trajectories, preserving semantic details and reducing redundancy.
  • It employs multi-scale, hierarchical, and rule-based schemes to capture local and contextual dynamics, enhancing prediction and reasoning in diverse applications.
  • Empirical results demonstrate significant vocabulary reduction and efficiency gains, evidenced by lower computational demands and improved model accuracy across domains.

Trajectory-based grounded tokenization is a paradigm for discretizing and representing continuous spatiotemporal or action sequences—such as mobility trajectories, agent motions, or video entity tracks—using tokens that directly encode trajectory semantics, multi-scale context, and task-relevant grounding. This approach contrasts with conventional patch-based, uniformly gridded, or purely data-driven discretizations by explicitly aligning token structure with the underlying entities or dynamics in the data. It enables highly compact yet expressive sequence modeling, enhances data efficiency for large-scale learning, and supports precise control, reasoning, and prediction across a range of domains including human mobility, autonomous driving, robotics, simulation, and video understanding.

1. Conceptual Foundations and Motivations

The motivation for trajectory-based grounded tokenization arises from limitations of naïve gridding or patchification in complex spatiotemporal domains. Uniform spatial grids or fixed step-size discretization can explode the vocabulary size, fail to encode semantic continuity, and introduce redundancy or artifacts, especially for fine-grained tasks or large geographies (Park et al., 2023, Najjar, 2023). Conversely, grounding tokens on trajectories—persistent entity movements, agent actions, or object tracks—naturally aligns the discrete representation with domain structure, suppressing redundancy and preserving fine-grained dynamics.

Grounded tokenization generally aims to achieve:

  • Compression: Drastic reduction in sequence and vocabulary size versus patch/location-based discretization, without loss of semantic information (Park et al., 2023, Zheng et al., 29 May 2025, Liu et al., 4 Feb 2026).
  • Expressiveness: Accurate encoding of local and contextual dynamics for downstream sequence modeling (prediction, classification, generation).
  • Grounding: Alignment of tokens with physically, semantically, or causally meaningful units (e.g., grid cells, agent states, object identities, or trajectories) rather than arbitrary quantization artifacts.
  • Efficiency: Support for inference/reasoning with state-of-the-art models at tractable compute and memory budgets.

2. Representative Schemes and Their Mechanisms

Several canonical approaches to trajectory-based grounded tokenization have emerged, with characteristic differences in methodology, domain, and token structure.

2.1 Hierarchical and Multi-Scale Spatial Tokenization

The "Geo-Tokenizer" (Park et al., 2023) partitions spatial data at multiple scales s=1,…,Ss=1,\dots,S, mapping each location to a tuple (lt1,…,ltS)(l_t^1,\dots,l_t^S) of grid-cell indices at each scale. Strict nesting (coarse cells aggregate fine cells) dramatically reduces vocabulary size: for example, discretizing at 100 km/1 km/100 m yields ∼\sim6,740 tokens versus ∼\sim80,000 for a flat 100 m grid. Sequences of multi-scale indices are contextually encoded by masked Transformers; downstream tasks (next-location, land use, transport mode) benefit both from the expressiveness and manageability of the token set.

2.2 Data-Driven and Rule-Based Discretization for Agent Motion

In simulation and behavior prediction, frameworks such as TrajTok (Zhang et al., 23 Jun 2025) and k-disks (Philion et al., 2023) convert short fixed-horizon trajectories or action transitions into tokens by clustering in endpoint or motion space. Rule-based gridding (with coverage/robustness heuristics) and data-driven averaging/expansion ensure semantic grounding and avoid grid artifacts. Tokens are typically short motion sequences or local increments. Distinct agent types (vehicles, bicyclists, pedestrians) each have separate vocabularies.
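As a sketch of the data-driven side, the snippet below builds a motion-token vocabulary by k-means clustering of short ego-frame trajectory segments, one vocabulary per agent class; the horizon, array shapes, and vocabulary size are assumptions for illustration, not the exact procedure of TrajTok or k-disks.

```python
# Hypothetical k-means construction of a motion-token vocabulary, loosely in
# the spirit of TrajTok / k-disks. Segment horizon and vocab size are assumed.
import numpy as np
from sklearn.cluster import KMeans

def build_vocab(segments: np.ndarray, vocab_size: int = 256) -> KMeans:
    """segments: (N, T, 2) short ego-frame trajectory chunks of one agent class."""
    flat = segments.reshape(len(segments), -1)       # cluster in motion space
    return KMeans(n_clusters=vocab_size, n_init=10).fit(flat)

def tokenize(vocab: KMeans, segment: np.ndarray) -> int:
    """Snap one (T, 2) segment to its nearest centroid's token id."""
    return int(vocab.predict(segment.reshape(1, -1))[0])

rng = np.random.default_rng(0)
vehicle_vocab = build_vocab(rng.normal(size=(5_000, 8, 2)))  # vehicles only;
# bicyclists and pedestrians would get their own vocabularies, as in the cited work
print(tokenize(vehicle_vocab, rng.normal(size=(8, 2))))
```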

2.3 Object and Trajectory-Based Tokens in Video and Multimodal Models

For video, TrajViT (Zheng et al., 29 May 2025) and Trokens (Kumar et al., 5 Aug 2025) replace traditional patch tokens with tokens derived from panoptic object or part trajectories. Each token aggregates features along the object trajectory (appearance and motion) and can encode intra- and inter-trajectory dynamics, sampled with semantic or saliency priors. In TGT (Zhang et al., 16 Oct 2025), each trajectory is paired with local text grounding and directly controls spatially-precise video generation via cross-attention.
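A minimal sketch of trajectory-token construction appears below, assuming per-frame track features and boxes come from an upstream tracker; the module name, mean pooling, and concat-then-project fusion are illustrative choices, not the TrajViT or Trokens architecture.

```python
# Illustrative pooling of one object track into a single trajectory token.
# Feature dimensions and the fusion scheme are assumptions for this sketch.
import torch
import torch.nn as nn

class TrajectoryToken(nn.Module):
    """Fuse appearance and motion along one track into one token embedding."""
    def __init__(self, feat_dim: int = 256, n_frames: int = 8, token_dim: int = 256):
        super().__init__()
        motion_dim = (n_frames - 1) * 4                 # frame-to-frame box deltas
        self.proj = nn.Linear(feat_dim + motion_dim, token_dim)

    def forward(self, frame_feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, D) appearance features sampled along the track
        # boxes: (T, 4) per-frame (cx, cy, w, h) of the same object
        appearance = frame_feats.mean(dim=0)            # pooled appearance, (D,)
        motion = (boxes[1:] - boxes[:-1]).flatten()     # motion descriptor, ((T-1)*4,)
        return self.proj(torch.cat([appearance, motion]))  # one token per trajectory

tok = TrajectoryToken()
token = tok(torch.randn(8, 256), torch.rand(8, 4))  # (256,): token count now scales
                                                    # with objects, not with patches
```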

2.4 Temporal/Segmental Grounding for Long-Term Sequences

In human mobility prediction, RHYTHM (He et al., 18 Jul 2025, He et al., 27 Sep 2025) segments long historical trajectories into semantically meaningful daily or weekly blocks. Each block is tokenized via intra-segment attention and semantic-prompt embeddings, reducing input length and enabling hierarchical modeling of periodicity.
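The sketch below illustrates the segment-level idea, assuming half-hourly location tokens pooled per day with a learned attention query; the block length and module design are illustrative, not RHYTHM's exact method.

```python
# Illustrative daily-block tokenization for long mobility histories, loosely
# following RHYTHM's segment-then-pool idea. All sizes are assumptions.
import torch
import torch.nn as nn

class DailySegmentTokenizer(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 128, steps_per_day: int = 48):
        super().__init__()
        self.steps = steps_per_day
        self.embed = nn.Embedding(vocab_size, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))   # learned pooling query

    def forward(self, loc_tokens: torch.Tensor) -> torch.Tensor:
        # loc_tokens: (B, L) location-token ids, L a multiple of steps_per_day
        B, L = loc_tokens.shape
        x = self.embed(loc_tokens).view(B * L // self.steps, self.steps, -1)
        q = self.query.expand(x.shape[0], -1, -1)
        pooled, _ = self.attn(q, x, x)                      # intra-segment attention
        return pooled.view(B, L // self.steps, -1)          # one token per day

tok = DailySegmentTokenizer(vocab_size=10_000)
print(tok(torch.randint(0, 10_000, (2, 7 * 48))).shape)    # (2, 7, 128): 336 -> 7
```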

2.5 Action and Policy Tokenization in Robotics

OAT (Liu et al., 4 Feb 2026) introduces a learned tokenization that compresses continuous action chunks into ordered sequences of finite-quantized, register-based tokens. The design provides high compression, full decodability, and strict left-to-right causal order, making tokens directly compatible with autoregressive policy learning and prefix-based control.
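To illustrate what "finite-quantized, fully decodable" can mean in the simplest possible form, the sketch below scalar-quantizes a latent action chunk and decodes it back with bounded error; OAT's learned encoder, register-based tokens, and causal ordering are substantially richer, and every detail here is an assumption.

```python
# Toy finite scalar quantization of an action chunk: ordered, discrete, and
# exactly decodable up to bin width. Not OAT's architecture; an illustration.
import numpy as np

LEVELS = 7  # quantization levels per latent dimension (assumed)

def encode(chunk: np.ndarray) -> np.ndarray:
    """chunk: latent action values in [-1, 1] -> integer tokens in [0, LEVELS)."""
    return np.clip(np.round((chunk + 1) / 2 * (LEVELS - 1)), 0, LEVELS - 1).astype(int)

def decode(tokens: np.ndarray) -> np.ndarray:
    """Inverse map; reconstruction error is bounded by half a bin width."""
    return tokens / (LEVELS - 1) * 2 - 1

chunk = np.tanh(np.random.default_rng(0).normal(size=16))   # 16-dim latent chunk
tokens = encode(chunk)        # left-to-right token sequence for an AR policy
print(tokens, float(np.abs(chunk - decode(tokens)).max()))  # error <= 1/(LEVELS-1)
```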

3. Model Integration and Training Methodologies

Grounded trajectory tokens are typically integrated as follows; a minimal end-to-end sketch appears after the list:

  • Encoder Preprocessing: Raw input sequences (coordinates, images, or actions) are converted via hierarchical gridding, clustering, semantic sampling, or motion extraction to a compact sequence of tokens.
  • Embedding and Fusion: Each token l_t is mapped to an embedding via learned lookup tables, fusion of multi-scale features, or an adapter projecting into an LLM or multimodal embedding space.
  • Hierarchical/Contextual Modeling: Specialized architectures—masked/causal Transformers, cross-attention with contextual inputs, or hierarchical attention stacks—consume token sequences. For example, the Hierarchical Auto-regressive Location Model (HALM) successively predicts scale-wise components via feedforward heads (Park et al., 2023).
  • Loss Functions and Robustness: Training often involves cross-entropy or contrastive losses. Extensions include spatial label smoothing proportional to trajectory space error (Zhang et al., 23 Jun 2025), masked modeling (Najjar, 2023), or contrastive InfoNCE objectives (Zheng et al., 29 May 2025).
  • End-to-End or Hybrid Pipelines: Grounded tokenizers may be frozen (pretrained) (Tian et al., 2024) or co-trained with downstream heads. Some frameworks fuse token embeddings with semantic prompts (offline LLM encodings) (He et al., 18 Jul 2025).
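
The sketch below ties these steps together for a single-scale next-location task: grounded tokens are embedded via a lookup table, processed by a causal Transformer, and trained with next-token cross-entropy. All sizes, the architecture, and the class name are illustrative assumptions.

```python
# Minimal end-to-end integration sketch: token embedding -> causal Transformer
# -> next-token cross-entropy. Sizes and architecture are assumptions.
import torch
import torch.nn as nn

class NextLocationModel(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)            # token lookup table
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)                # per-step prediction head

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        return self.head(self.encoder(self.embed(tokens), mask=mask))

VOCAB = 6_740                                                 # e.g. a Geo-Tokenizer vocab
model = NextLocationModel(VOCAB)
seq = torch.randint(0, VOCAB, (4, 32))                        # batch of token sequences
logits = model(seq[:, :-1])                                   # predict the next token
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
```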

4. Empirical Performance, Trade-offs, and Ablations

Across domains, trajectory-based grounded tokenization consistently yields:

| Model/Paper | Vocabulary/Compression | Downstream Gains | Efficiency Gains |
|---|---|---|---|
| Geo-Tokenizer (Park et al., 2023) | 6.7k tokens (vs. 80k flat grid) | Next-location +7% @5; land-use and transport-mode classification | 7× fewer params, 2.5× fewer FLOPs |
| TrajTok (Zhang et al., 23 Jun 2025) | ≈8k tokens per vehicle class | Realism score +0.0038; best coverage | >99.5% coverage; plug-and-play |
| TrajViT (Zheng et al., 29 May 2025) | 10× token reduction | +6% retrieval, +5.2% QA | 18× fewer FLOPs; 4× faster training |
| RHYTHM (He et al., 18 Jul 2025) | Sequence length 384 → 55 | Acc@1 +2.4 pp overall, +5 pp on weekends | 24.6% less training time; frozen LLM |
| OAT (Liu et al., 4 Feb 2026) | 28× compression vs. raw floats | +10–50% success vs. prior tokenizers | Prefix-based "anytime" inference |

Ablations and analyses across these works corroborate these gains and isolate the contribution of the tokenization scheme itself.

5. Limitations and Open Challenges

While trajectory-based grounded tokenizers provide substantial benefits, several limitations remain:

  • Many schemes require hand-tuned resolution or clustering heuristics (e.g., cell size, merge thresholds), with limited end-to-end differentiability (Najjar, 2023, Park et al., 2023).
  • Rare or outlier behaviors can challenge fixed token vocabularies; hybrid or adaptive discretizations are underexplored (Zheng et al., 29 May 2025, Tian et al., 2024).
  • Some pipelines rely on frozen upstream modules (e.g., scene tokenizers, LLMs): missed detections or misalignments cannot be remedied downstream (Tian et al., 2024).
  • The integration of time beyond coarse clustering, or of rich interaction structures (multi-agent, multi-modal), remains an open extension (Najjar, 2023, Zheng et al., 29 May 2025, Zhang et al., 16 Oct 2025).
  • Scalability to even longer sequences and more complex or multi-resolution behaviors may require further innovations in hierarchical or adaptive tokenization.

Future work aims to address these by exploring learned, end-to-end quantization (Najjar, 2023), dynamic and fine-grained multi-scale hierarchies (He et al., 27 Sep 2025), spatially grounded reasoning in multi-modal models (Tian et al., 2024), and the fusion of discrete and continuous control (Liu et al., 4 Feb 2026).

6. Extensions and Domain-Specific Specializations

Trajectory-based grounded tokenization has been effectively specialized for:

  • Human mobility modeling and long-horizon location prediction (Park et al., 2023, Najjar, 2023, He et al., 18 Jul 2025, He et al., 27 Sep 2025).
  • Agent motion simulation and autonomous driving (Philion et al., 2023, Zhang et al., 23 Jun 2025, Tian et al., 2024).
  • Video understanding and controllable video generation (Zheng et al., 29 May 2025, Kumar et al., 5 Aug 2025, Zhang et al., 16 Oct 2025).
  • Robotic action and policy learning (Liu et al., 4 Feb 2026).

These domain specializations leverage grounding to objects, motion, semantics, or context, collectively demonstrating that trajectory-based tokenization is a critical enabler for scalable, fine-grained, and interpretable modeling across spatiotemporal machine learning.


Key references: (Park et al., 2023, Philion et al., 2023, Zhang et al., 23 Jun 2025, Zheng et al., 29 May 2025, He et al., 18 Jul 2025, He et al., 27 Sep 2025, Najjar, 2023, Liu et al., 4 Feb 2026, Kumar et al., 5 Aug 2025, Zhang et al., 16 Oct 2025, Tian et al., 2024).
