HiT-JEPA: Hierarchical Urban Trajectory Embeddings
- HiT-JEPA is a self-supervised framework for urban trajectory representation that employs a three-level hierarchy to capture fine-grained movements and global patterns.
- It leverages joint embedding predictive objectives along with VICReg regularization, processing point-level, sub-trajectory, and global abstractions for robust similarity measurement.
- The multi-scale loss integration and collapse-resistant design (via VICReg regularization) yield improved retrieval and regression performance over conventional single-scale methods.
HiT-JEPA (Hierarchical Interactions of Trajectory Semantics via a Joint Embedding Predictive Architecture) is a self-supervised framework for learning hierarchical, multi-scale representations of urban trajectory data. Designed to address the challenges associated with capturing both fine-grained and high-level semantic information from sequential GPS data, HiT-JEPA employs a three-level architecture that explicitly models pointwise, segment-level, and global abstractions. The method leverages joint embedding predictive objectives across these multiple semantic levels to facilitate robust similarity computation and generalization across domains (Li et al., 17 Jun 2025).
1. Formalization and Problem Scope
The primary objective of HiT-JEPA is trajectory similarity computation. Given a set of trajectories, where each trajectory $T = \langle p_1, p_2, \dots, p_n \rangle$ is an ordered sequence of GPS points $p_i = (\mathrm{lon}_i, \mathrm{lat}_i)$, each point is mapped to a pre-trained region embedding via a hex-grid index. The framework learns a function $f_\theta$ that embeds the trajectory such that the similarity between any two trajectories in the learned space approximates a heuristic similarity $d(T_i, T_j)$, often instantiated as a Fréchet or edit-distance-based measure. The approach targets simultaneous fidelity to local transitions and long-term dependency modeling.
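As a concrete instance of such a heuristic target similarity, the sketch below computes the discrete Fréchet distance between two GPS sequences with a standard dynamic program; the Euclidean ground metric and variable names are illustrative choices, not taken from the paper.

```python
import numpy as np

def discrete_frechet(traj_a: np.ndarray, traj_b: np.ndarray) -> float:
    """Discrete Frechet distance between two trajectories of shape (n, 2)."""
    n, m = len(traj_a), len(traj_b)
    # Pairwise Euclidean ground distances between all point pairs.
    dist = np.linalg.norm(traj_a[:, None, :] - traj_b[None, :, :], axis=-1)
    dp = np.full((n, m), np.inf)
    dp[0, 0] = dist[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                dp[i - 1, j] if i > 0 else np.inf,
                dp[i, j - 1] if j > 0 else np.inf,
                dp[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            # The coupling cost is the worst point-pair distance along the best path.
            dp[i, j] = max(best_prev, dist[i, j])
    return float(dp[-1, -1])

# Example: two short trajectories in (lon, lat) coordinates.
t1 = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.1]])
t2 = np.array([[0.0, 0.1], [1.1, 0.0], [2.0, 0.0]])
print(discrete_frechet(t1, t2))
```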
2. Three-Level Semantic Hierarchy
HiT-JEPA introduces a three-layer sequential hierarchy, each capturing trajectory information at a different scale:
- Point-Level (Layer 1): Processes the point-embedding sequence $S^{(1)}$ of length $n^{(1)} = n$, modeling local micro-movements such as turns or stops.
- Sub-Trajectory (Layer 2): Generated via $S^{(2)} = \mathrm{MaxPool}(\mathrm{Conv1D}(S^{(1)}))$, with $n^{(2)} = \lfloor n^{(1)}/2 \rfloor$. Encodes mesoscopic patterns (e.g., short trajectory segments).
- Global Abstraction (Layer 3): Built by again applying convolution and max-pooling: $S^{(3)} = \mathrm{MaxPool}(\mathrm{Conv1D}(S^{(2)}))$, with $n^{(3)} = \lfloor n^{(2)}/2 \rfloor$. This yields coarse representations of the complete route.
This hierarchical construction enables joint modeling of both local transitions and holistic movement routines, facilitating a multi-resolution understanding of trajectories.
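The following minimal PyTorch sketch illustrates how the two coarser levels can be derived from the point-level sequence by repeated convolution and stride-2 max-pooling. The channel width, kernel size, weight sharing across levels, and dummy input are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HierarchyBuilder(nn.Module):
    """Derives sub-trajectory and global sequences from point-level embeddings."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # Conv + ReLU followed by stride-2 max-pooling halves the sequence length.
        self.coarsen = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
        )

    def forward(self, points: torch.Tensor):
        # points: (batch, n, dim) point-level embeddings (Layer 1).
        s1 = points
        x = s1.transpose(1, 2)    # (batch, dim, n) layout expected by Conv1d
        s2 = self.coarsen(x)      # Layer 2: sub-trajectory scale, length n // 2
        s3 = self.coarsen(s2)     # Layer 3: global scale, length n // 4
        return s1, s2.transpose(1, 2), s3.transpose(1, 2)

builder = HierarchyBuilder(dim=256)
s1, s2, s3 = builder(torch.randn(8, 128, 256))  # dummy batch of 128-point trajectories
print(s1.shape, s2.shape, s3.shape)             # (8,128,256) (8,64,256) (8,32,256)
```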
3. Layer-Wise Encoding and Predictive Objectives
At each level $l$, paired context and target Transformer encoders ($E_{\mathrm{ctx}}^{(l)}$, $E_{\mathrm{tgt}}^{(l)}$) are deployed. The target encoder is updated via an EMA of the context encoder's parameters. For each level, target representations of the masked slots are extracted via the target encoder, while the context representations are computed by the context encoder on the visible (non-masked) inputs.
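A minimal sketch of the EMA target-encoder update follows; the momentum value is an assumed placeholder.

```python
import torch

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum: float = 0.996):
    """Move each target-encoder parameter toward the context encoder's parameter."""
    for p_tgt, p_ctx in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_tgt.mul_(momentum).add_(p_ctx, alpha=1.0 - momentum)
```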
A one-layer Transformer decoder ("predictor") receives the context representations concatenated with mask tokens and positional embeddings to generate predictions for the masked slots. The main training signal per level is a Smooth L1 JEPA loss between predicted and target representations: $\mathcal{L}_{\mathrm{JEPA}}^{(l)} = \frac{1}{MBn^{(l)}d^{(l)}} \sum_{i=1}^M \sum_{b=1}^B \sum_{p=1}^{n^{(l)}} \sum_{q=1}^{d^{(l)}} \mathrm{SmoothL1}(\widehat{S}'_{b,p,q}^{(l)}(i), S_{b,p,q}^{(l)}(i))$ To prevent representational collapse, VICReg regularization terms (variance and covariance) are applied to both the target and context slot projections. The total per-level loss is thus $\mathcal{L}^{(l)} = \mathcal{L}_{\mathrm{JEPA}}^{(l)} + \lambda_{\mathrm{var}}\,\mathcal{L}_{\mathrm{var}}^{(l)} + \lambda_{\mathrm{cov}}\,\mathcal{L}_{\mathrm{cov}}^{(l)}$, where $\lambda_{\mathrm{var}}$ and $\lambda_{\mathrm{cov}}$ weight the variance and covariance penalties.
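The per-level objective can be sketched as below, combining the Smooth L1 prediction term with standard VICReg-style variance and covariance penalties. The coefficient names (`lambda_var`, `lambda_cov`) and their values are illustrative assumptions, and the regularizers follow the generic VICReg formulation rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def jepa_level_loss(pred, target, ctx, lambda_var=1.0, lambda_cov=0.04, eps=1e-4):
    """Smooth L1 prediction loss plus VICReg variance/covariance regularizers.

    pred, target: (batch, n_masked, dim) predicted and target slot representations.
    ctx:          (batch, n_visible, dim) context slot representations.
    """
    # JEPA term: Smooth L1 between predicted and (stop-gradient) target masked slots.
    loss_jepa = F.smooth_l1_loss(pred, target.detach())

    def vicreg_terms(z):
        z = z.reshape(-1, z.shape[-1])           # flatten slots into rows
        z = z - z.mean(dim=0)
        # Variance term: hinge pushing each dimension's std above 1.
        std = torch.sqrt(z.var(dim=0) + eps)
        var_loss = torch.mean(F.relu(1.0 - std))
        # Covariance term: penalize off-diagonal entries of the covariance matrix.
        cov = (z.T @ z) / (z.shape[0] - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        cov_loss = off_diag.pow(2).sum() / z.shape[-1]
        return var_loss, cov_loss

    var_t, cov_t = vicreg_terms(target)
    var_c, cov_c = vicreg_terms(ctx)
    return loss_jepa + lambda_var * (var_t + var_c) + lambda_cov * (cov_t + cov_c)
```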
4. Multi-Scale Loss Integration and Model Architecture
The final training objective is a weighted combination of the three semantic levels, $\mathcal{L} = \sum_{l=1}^{3} w^{(l)} \mathcal{L}^{(l)}$, with the level weights $w^{(l)}$ set empirically. The core network architecture consists of:
- Convolutional feature extractor: three 1D Conv+ReLU layers with stride-2 max-pooling, producing fixed-dimensional embeddings at each level.
- Transformer encoders: one layer, eight attention heads, MLP hidden size 1024, learnable positional encoding.
- Decoder ("predictor"): a one-layer Transformer decoder with eight heads and a model dimension matching the corresponding encoder at each level.
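One plausible realization of the predictor is sketched below: a one-layer Transformer decoder that receives mask tokens with positional embeddings alongside the visible context and returns predictions for the masked slots. The dimensions, maximum length, initialization, and the specific use of nn.TransformerDecoder are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Predicts masked slot representations from visible context slots."""
    def __init__(self, dim: int = 256, heads: int = 8, max_len: int = 512):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=1024, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.mask_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.pos_emb = nn.Parameter(torch.randn(1, max_len, dim) * 0.02)

    def forward(self, context: torch.Tensor, masked_positions: torch.Tensor):
        # context:          (batch, n_visible, dim) encoded visible slots
        # masked_positions: (batch, n_masked) integer positions of the masked slots
        batch, n_masked = masked_positions.shape
        mask_q = self.mask_token.expand(batch, n_masked, -1)
        # Positional embeddings for the positions being predicted.
        pos = self.pos_emb.expand(batch, -1, -1).gather(
            1, masked_positions.unsqueeze(-1).expand(-1, -1, mask_q.shape[-1])
        )
        # Concatenate visible context with positioned mask tokens, then decode.
        tgt = torch.cat([context, mask_q + pos], dim=1)
        out = self.decoder(tgt=tgt, memory=context)
        return out[:, -n_masked:]   # predictions for the masked slots only

pred = Predictor()
out = pred(torch.randn(4, 96, 256), torch.randint(0, 512, (4, 32)))
print(out.shape)   # (4, 32, 256)
```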
Masking involves four unique masks per trajectory, with mask ratios randomly drawn from a fixed range, a 50%-50% blend of contiguous and scattered masked spans, and a fixed context keep ratio. Training uses the Adam optimizer with the learning rate halved every five epochs, for 20 epochs with batch size 64.
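The masking scheme can be sketched as follows: several masks per trajectory, half contiguous spans and half scattered positions. The mask-ratio range and helper names are placeholders, since the exact values are not reproduced here.

```python
import numpy as np

def sample_masks(seq_len: int, num_masks: int = 4,
                 ratio_range=(0.15, 0.5), rng=None):
    """Return boolean masks of shape (num_masks, seq_len); True marks masked slots."""
    rng = rng or np.random.default_rng()
    masks = np.zeros((num_masks, seq_len), dtype=bool)
    for k in range(num_masks):
        ratio = rng.uniform(*ratio_range)
        n_mask = max(1, int(round(ratio * seq_len)))
        if k % 2 == 0:
            # Contiguous span mask starting at a random offset.
            start = rng.integers(0, seq_len - n_mask + 1)
            masks[k, start:start + n_mask] = True
        else:
            # Scattered mask over random individual positions.
            idx = rng.choice(seq_len, size=n_mask, replace=False)
            masks[k, idx] = True
    return masks

masks = sample_masks(seq_len=128)
print(masks.shape, masks.sum(axis=1))   # (4, 128) and per-mask masked counts
```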
5. Experimental Protocol and Comparative Results
HiT-JEPA is evaluated on major urban trajectory datasets—Porto (1.37M), T-Drive, GeoLife—as well as Foursquare check-in data (Tokyo, NYC) and AIS maritime tracks. Three baselines are considered: TrajCL (contrastive), CLEAR (multi-positive contrastive), and T-JEPA (single-scale JEPA).
Two primary evaluation modes are implemented:
- Self-similarity retrieval: For each query half-trajectory, the task is to identify its paired half within a large database, under zero-shot and in-domain conditions. Mean-rank is reported under varying database sizes, sampling rates, and perturbation levels (a minimal mean-rank computation is sketched after this list).
- Downstream fine-tuning: Encoder weights are frozen and a two-layer MLP is trained to regress onto classic trajectory similarity measures (EDR, LCSS, Hausdorff, Discrete Fréchet), with metrics HR@5, HR@20, R5@20.
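For reference, the self-similarity mean-rank metric can be computed as in the following sketch, where each query half-trajectory embedding is ranked against a database containing its paired half plus distractors; function and variable names are illustrative.

```python
import numpy as np

def mean_rank(query_emb: np.ndarray, db_emb: np.ndarray) -> float:
    """query_emb[i] and db_emb[i] are embeddings of the two halves of trajectory i.

    Ranks each query against the whole database by Euclidean distance and
    returns the mean rank (1 = perfect) of the matching half.
    """
    # Pairwise distances between all queries and all database entries.
    dists = np.linalg.norm(query_emb[:, None, :] - db_emb[None, :, :], axis=-1)
    ranks = []
    for i in range(len(query_emb)):
        order = np.argsort(dists[i])                      # ascending distance
        ranks.append(int(np.where(order == i)[0][0]) + 1) # position of true pair
    return float(np.mean(ranks))

# Toy example: 1000 trajectories, 128-dim embeddings, paired halves slightly perturbed.
rng = np.random.default_rng(0)
q = rng.normal(size=(1000, 128))
db = q + 0.05 * rng.normal(size=q.shape)
print(mean_rank(q, db))    # close to 1.0 for well-aligned embeddings
```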
HiT-JEPA consistently achieves competitive or superior retrieval performance. On in-domain datasets, it produces the lowest mean-rank in 5/6 settings (e.g., Porto: 1.027 vs. 1.029 for T-JEPA). Robustness to sparse sampling is observed on T-Drive (mean ranks ~1.040–1.041). For zero-shot transfer, HiT-JEPA yields 10–20% lower mean ranks than T-JEPA. Downstream, it improves over T-JEPA by +12.6% (Hausdorff), +19.9% (Fréchet) on T-Drive, and +6.4% on GeoLife (Li et al., 17 Jun 2025).
| Dataset | Retrieval Mean-rank (HiT-JEPA) | Retrieval Mean-rank (T-JEPA) | Downstream Gain (%) (HiT vs. T-JEPA) |
|---|---|---|---|
| Porto | 1.027 | 1.029 | -- |
| T-Drive | ~1.040–1.041 | -- | +12.6 (Hausdorff), +19.9 (Fréchet) |
| GeoLife | -- | -- | +6.4 (average) |
6. Implications and Design Insights
By explicitly structuring representations into hierarchical semantic levels, HiT-JEPA provides several empirical and methodological advantages:
- Simultaneous multi-scale modeling: The architecture captures both local (micro) and global (macro) semantic content.
- Top-down attention propagation: High-level context shapes representations in finer layers, focusing modeling capacity on semantically relevant micro-patterns identified at coarser scales.
- Robustness and generalization: Multi-objective alignment across levels produces embeddings that generalize across diverse datasets (urban, check-in, maritime), improving both retrieval and regression metrics over single-scale and contrastive learning baselines.
- Seamless multi-task compatibility: The unified embedding space supports both unsupervised retrieval and downstream supervised similarity regression.
7. Position within Trajectory Representation Learning
HiT-JEPA constitutes the first self-supervised system jointly learning point, sub-trajectory, and global semantics in an urban trajectory context through a hierarchical JEPA. Its hierarchical design, multi-scale prediction, and unified loss integration distinguish it from contrastive and prior single-scale JEPA models, enabling richer multi-scale representations that outperform prior approaches in both in-domain and cross-domain similarity tasks (Li et al., 17 Jun 2025).