
HiT-JEPA: Hierarchical Urban Trajectory Embeddings

Updated 2 January 2026
  • HiT-JEPA is a self-supervised framework for urban trajectory representation that employs a three-level hierarchy to capture fine-grained movements and global patterns.
  • It leverages joint embedding predictive objectives along with VICReg regularization, processing point-level, sub-trajectory, and global abstractions for robust similarity measurement.
  • The multi-scale loss integration and resilient architecture design yield improved retrieval and regression performance over conventional single-scale methods.

HiT-JEPA (Hierarchical Interactions of Trajectory Semantics via a Joint Embedding Predictive Architecture) is a self-supervised framework for learning hierarchical, multi-scale representations of urban trajectory data. Designed to address the challenges associated with capturing both fine-grained and high-level semantic information from sequential GPS data, HiT-JEPA employs a three-level architecture that explicitly models pointwise, segment-level, and global abstractions. The method leverages joint embedding predictive objectives across these multiple semantic levels to facilitate robust similarity computation and generalization across domains (Li et al., 17 Jun 2025).

1. Formalization and Problem Scope

The primary objective of HiT-JEPA is trajectory similarity computation. Given a set of trajectories, where each trajectory $T$ is an ordered sequence of GPS points ($T = (p_1, p_2, \ldots, p_n)$, $p_i \in \mathbb{R}^2$), each point is mapped to a pre-trained region embedding $h_{\delta(p_i)} \in \mathbb{R}^d$ via a hex-grid index $\delta(\cdot)$. The framework learns a function $f: T \mapsto z \in \mathbb{R}^D$ that embeds the trajectory such that the similarity between any two trajectories $T^a, T^b$ in the learned space approximates a heuristic similarity $S_{\text{gold}}(T^a, T^b)$, often instantiated as a Fréchet or edit-distance-based measure. The approach targets simultaneous fidelity to local transitions and long-term dependency modeling.
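
As a concrete illustration of this setup, the toy sketch below maps GPS points to pre-trained region embeddings through a hex-grid index; the table `region_embeddings`, the function names, and the hash-based discretization are purely hypothetical stand-ins, not the paper's actual indexing scheme.

```python
import numpy as np

d = 256  # region embedding dimension (matches the d used later in the text)

# Hypothetical pre-trained hex-grid region embedding table: cell id -> vector in R^d.
num_cells = 10_000
region_embeddings = np.random.randn(num_cells, d).astype(np.float32)

def hex_index(point):
    """Placeholder delta(.): map a GPS point (lat, lon) to a hex-grid cell id."""
    lat, lon = point
    return hash((round(lat, 3), round(lon, 3))) % num_cells

def embed_trajectory_points(trajectory):
    """T = (p_1, ..., p_n) -> stacked region embeddings h_{delta(p_i)}, shape (n, d)."""
    return np.stack([region_embeddings[hex_index(p)] for p in trajectory])

# Example: a short toy trajectory of (lat, lon) points.
T = [(41.157, -8.629), (41.158, -8.630), (41.160, -8.632)]
print(embed_trajectory_points(T).shape)  # (3, 256)
```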

2. Three-Level Semantic Hierarchy

HiT-JEPA introduces a three-layer sequential hierarchy, with each layer capturing trajectory information at a different scale:

  1. Point-Level (Layer 1): Processes inputs $T^{(1)} \in (\mathbb{R}^d)^{n_1}$, with $n_1 = n$, modeling local micro-movements such as turns or stops.
  2. Sub-Trajectory (Layer 2): Generated via $\operatorname{MaxPool1D}(\operatorname{Conv1D}(T^{(1)})) \in (\mathbb{R}^{2d})^{n_2}$, with $n_2 = \lfloor n_1 / 2 \rfloor$. Encodes mesoscopic patterns (e.g., short trajectory segments).
  3. Global Abstraction (Layer 3): Built by again applying convolution and max-pooling: $\operatorname{MaxPool1D}(\operatorname{Conv1D}(T^{(2)})) \in (\mathbb{R}^{4d})^{n_3}$, with $n_3 = \lfloor n_2 / 2 \rfloor$. This yields coarse representations of the complete route.

This hierarchical construction enables joint modeling of both local transitions and holistic movement routines, facilitating a multi-resolution understanding of trajectories.
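
A minimal PyTorch sketch of this three-level construction is given below, assuming kernel size 3 with padding; the text specifies only the channel widths ($d$, $2d$, $4d$) and the pooling stride of 2.

```python
import torch
import torch.nn as nn

class HierarchyBuilder(nn.Module):
    """Builds T^(1), T^(2), T^(3) from point-level region embeddings (sketch, not the official code)."""
    def __init__(self, d=256):
        super().__init__()
        # Layer 1 -> Layer 2: channels d -> 2d, halve the sequence length.
        self.to_l2 = nn.Sequential(
            nn.Conv1d(d, 2 * d, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
        )
        # Layer 2 -> Layer 3: channels 2d -> 4d, halve the sequence length again.
        self.to_l3 = nn.Sequential(
            nn.Conv1d(2 * d, 4 * d, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
        )

    def forward(self, t1):               # t1: (batch, n1, d) point-level input
        x = t1.transpose(1, 2)           # Conv1d expects (batch, channels, length)
        t2 = self.to_l2(x)               # (batch, 2d, n1 // 2) sub-trajectory level
        t3 = self.to_l3(t2)              # (batch, 4d, n2 // 2) global abstraction level
        return t1, t2.transpose(1, 2), t3.transpose(1, 2)

# Example: a batch of 8 trajectories with 64 points each.
levels = HierarchyBuilder()(torch.randn(8, 64, 256))
print([x.shape for x in levels])  # [torch.Size([8, 64, 256]), torch.Size([8, 32, 512]), torch.Size([8, 16, 1024])]
```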

3. Layer-Wise Encoding and Predictive Objectives

At each level $l \in \{1, 2, 3\}$, paired context and target Transformer encoders ($E_\theta^{(l)}$, $E_{\bar{\theta}}^{(l)}$) are deployed. The target encoder is updated via an exponential moving average (EMA) of the context encoder's parameters. For each level, masked target representations $S^{(l)} = E_{\bar{\theta}}^{(l)}(T^{(l)})$ are extracted, while the context representations are computed on the visible (non-masked) inputs: $S'^{(l)} = E_\theta^{(l)}(T'^{(l)})$.
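
A brief sketch of the target-encoder EMA update is shown below; the momentum value is an assumption, as the text does not report it.

```python
import copy
import torch

def make_target_encoder(context_encoder):
    """The target encoder starts as a frozen copy of the context encoder."""
    target = copy.deepcopy(context_encoder)
    for p in target.parameters():
        p.requires_grad_(False)
    return target

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum=0.996):  # momentum value is an assumption
    """theta_bar <- m * theta_bar + (1 - m) * theta, applied after each optimizer step."""
    for p_tgt, p_ctx in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_tgt.mul_(momentum).add_(p_ctx, alpha=1.0 - momentum)
```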

A 1-layer Transformer decoder (“predictor”) $D_\phi^{(l)}$ receives $S'^{(l)}$ concatenated with mask tokens and positional embeddings to generate predictions $\widehat{S}'^{(l)}(i)$ for the masked slots. The main training signal per level is a Smooth L1 JEPA loss between predicted and target representations:

$$\mathcal{L}_{\mathrm{JEPA}}^{(l)} = \frac{1}{M B n^{(l)} d^{(l)}} \sum_{i=1}^{M} \sum_{b=1}^{B} \sum_{p=1}^{n^{(l)}} \sum_{q=1}^{d^{(l)}} \mathrm{SmoothL1}\left(\widehat{S}'^{(l)}_{b,p,q}(i),\; S^{(l)}_{b,p,q}(i)\right)$$

where $M$ is the number of masks, $B$ the batch size, $n^{(l)}$ the sequence length, and $d^{(l)}$ the embedding dimension at level $l$. To prevent representational collapse, VICReg regularization terms (variance and covariance) are applied to both the target and context slot projections. The total per-level loss is thus:

$$\mathcal{L}^{(l)} = \mathcal{L}_{\mathrm{JEPA}}^{(l)} + \mathcal{L}_{\mathrm{VICReg}}^{(l)}$$
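
The per-level objective can be sketched as follows; the VICReg coefficients and numerical epsilon are assumptions, while the Smooth L1 prediction term and the variance/covariance structure follow the description above.

```python
import torch
import torch.nn.functional as F

def vicreg_reg(z, gamma=1.0, eps=1e-4):
    """Variance + covariance regularizer on a set of slot projections (coefficients are assumptions)."""
    z = z.reshape(-1, z.shape[-1])                 # flatten to (rows, dims)
    z = z - z.mean(dim=0)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_term = F.relu(gamma - std).mean()          # hinge: keep per-dimension std above gamma
    cov = (z.T @ z) / max(z.shape[0] - 1, 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_term = off_diag.pow(2).sum() / z.shape[-1]
    return var_term + cov_term

def per_level_loss(pred, target, ctx_proj, tgt_proj):
    """L^(l) = Smooth L1 JEPA term (mean over masks, batch, positions, dims) + VICReg terms."""
    jepa = F.smooth_l1_loss(pred, target)
    return jepa + vicreg_reg(ctx_proj) + vicreg_reg(tgt_proj)
```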

4. Multi-Scale Loss Integration and Model Architecture

The final training objective is a weighted combination of the three semantic levels: $\mathcal{L} = \lambda_1 \mathcal{L}^{(1)} + \lambda_2 \mathcal{L}^{(2)} + \lambda_3 \mathcal{L}^{(3)}$. Empirically, the weights $(\lambda_1, \lambda_2, \lambda_3) = (0.05, 0.15, 0.80)$ are found to be optimal. The core network architecture consists of the following components (a construction sketch follows the list):

  • Convolutional feature extractor: three 1D Conv+ReLU layers with output channels $\{d, 2d, 4d\}$, max-pooling stride 2, embedding dimension $d = 256$.
  • Transformer encoders: one layer, eight attention heads, model dimension $d^{(l)}$, MLP hidden size 1024, learnable positional encoding.
  • Decoder (“predictor”): one-layer Transformer decoder, eight heads, corresponding dimension per level.
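
The per-level encoder and predictor can be instantiated roughly as below; dropout, normalization placement, and the maximum sequence length are unstated details filled in as assumptions.

```python
import torch
import torch.nn as nn

def build_level_modules(d_level, max_len=512):
    """One-layer Transformer encoder (shared architecture for context and target encoders) and a
    one-layer Transformer decoder acting as the predictor; max_len and dropout defaults are assumed."""
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_level, nhead=8, dim_feedforward=1024, batch_first=True),
        num_layers=1,
    )
    predictor = nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model=d_level, nhead=8, dim_feedforward=1024, batch_first=True),
        num_layers=1,
    )
    pos_embedding = nn.Parameter(torch.zeros(1, max_len, d_level))  # learnable positional encoding
    return encoder, predictor, pos_embedding

# One module set per semantic level, with model dimensions d, 2d, 4d (d = 256).
level_modules = {level: build_level_modules(dim) for level, dim in zip((1, 2, 3), (256, 512, 1024))}
```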

Masking involves four unique masks per trajectory, with mask ratios randomly drawn from $\{10\%, 15\%, 20\%, 25\%, 30\%\}$, a 50%–50% blend of contiguous and scattered masked spans, and a context keep ratio of 85%–100%. Training uses the Adam optimizer with an initial learning rate of $1 \times 10^{-4}$, decayed by half every five epochs, over 20 epochs with batch size 64.
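
These masking and optimization settings can be collected into a configuration sketch like the following, where `model` is a placeholder for the trainable HiT-JEPA modules and the mask-ratio sampler is illustrative.

```python
import random
import torch

mask_cfg = dict(
    masks_per_trajectory=4,
    mask_ratios=(0.10, 0.15, 0.20, 0.25, 0.30),   # one ratio sampled per mask
    contiguous_fraction=0.5,                       # 50/50 mix of contiguous and scattered spans
    context_keep_ratio=(0.85, 1.00),
)

def sample_mask_ratio():
    return random.choice(mask_cfg["mask_ratios"])

# Per-level loss weights and optimizer settings reported in the text.
lambdas = (0.05, 0.15, 0.80)
model = torch.nn.Linear(256, 256)  # placeholder standing in for the trainable HiT-JEPA modules
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # halve LR every 5 epochs

for epoch in range(20):
    # ... for each batch of 64 trajectories: compute L^(1), L^(2), L^(3), then
    # loss = sum(w * l for w, l in zip(lambdas, level_losses)); loss.backward(); optimizer.step() ...
    scheduler.step()
```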

5. Experimental Protocol and Comparative Results

HiT-JEPA is evaluated on major urban trajectory datasets—Porto (1.37M), T-Drive, GeoLife—as well as Foursquare check-in data (Tokyo, NYC) and AIS maritime tracks. Three baselines are considered: TrajCL (contrastive), CLEAR (multi-positive contrastive), and T-JEPA (single-scale JEPA).

Two primary evaluation modes are implemented:

  • Self-similarity retrieval: For each query half-trajectory, the task is to identify its paired half within a large database, under zero-shot and in-domain conditions. Mean-rank is reported under varying database sizes, sampling rates ($\rho_s$), and perturbations ($\rho_d$); a minimal mean-rank sketch is given after this list.
  • Downstream fine-tuning: Encoder weights are frozen and a two-layer MLP is trained to regress onto classic trajectory similarity measures (EDR, LCSS, Hausdorff, Discrete Fréchet), with metrics HR@5, HR@20, R5@20.
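
The mean-rank computation used in the retrieval mode can be sketched as follows, assuming cosine similarity as the scoring function (the paper's exact protocol may differ).

```python
import numpy as np

def mean_rank(query_embs, db_embs, paired_idx):
    """For each query half-trajectory embedding, rank its paired half within the database
    by descending similarity and average the resulting ranks (lower is better)."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = q @ db.T                                     # (num_queries, db_size)
    order = np.argsort(-sims, axis=1)                   # indices sorted by descending similarity
    ranks = [1 + int(np.where(order[i] == paired_idx[i])[0][0]) for i in range(len(paired_idx))]
    return float(np.mean(ranks))

# Toy usage: 5 queries against a database of 100 embeddings, true pairs at indices 0..4.
rng = np.random.default_rng(0)
db = rng.standard_normal((100, 256))
queries = db[:5] + 0.01 * rng.standard_normal((5, 256))  # slightly perturbed copies of their pairs
print(mean_rank(queries, db, paired_idx=list(range(5))))  # close to 1.0
```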

HiT-JEPA consistently achieves competitive or superior retrieval performance. On in-domain datasets, it produces the lowest mean-rank in 5/6 settings (e.g., Porto: 1.027 vs. 1.029 for T-JEPA). Robustness to sparse sampling is observed on T-Drive (mean ranks ~1.040–1.041). For zero-shot transfer, HiT-JEPA yields 10–20% lower mean ranks than T-JEPA. Downstream, it improves over T-JEPA by +12.6% (Hausdorff), +19.9% (Fréchet) on T-Drive, and +6.4% on GeoLife (Li et al., 17 Jun 2025).

| Dataset | Retrieval Mean-rank (HiT-JEPA) | Retrieval Mean-rank (T-JEPA) | Downstream Gain (%) (HiT vs. T-JEPA) |
|---|---|---|---|
| Porto | 1.027 | 1.029 | -- |
| T-Drive | ~1.040–1.041 | -- | +12.6 (Hausdorff), +19.9 (Fréchet) |
| GeoLife | -- | -- | +6.4 (average) |

6. Implications and Design Insights

By explicitly structuring representations into hierarchical semantic levels, HiT-JEPA provides several empirical and methodological advantages:

  • Simultaneous multi-scale modeling: The architecture captures both local (micro) and global (macro) semantic content.
  • Top-down attention propagation: High-level context shapes representations in finer layers, focusing modeling capacity on semantically relevant micro-patterns identified at coarser scales.
  • Robustness and generalization: Multi-objective alignment across levels produces embeddings that generalize across diverse datasets (urban, check-in, maritime), improving both retrieval and regression metrics over single-scale and contrastive learning baselines.
  • Seamless multi-task compatibility: The unified embedding space supports both unsupervised retrieval and downstream supervised similarity regression.

7. Position within Trajectory Representation Learning

HiT-JEPA constitutes the first self-supervised system jointly learning point, sub-trajectory, and global semantics in an urban trajectory context through a hierarchical JEPA. Its hierarchical design, multi-scale prediction, and unified loss integration distinguish it from contrastive and prior single-scale JEPA models, enabling richer multi-scale representations that outperform prior approaches in both in-domain and cross-domain similarity tasks (Li et al., 17 Jun 2025).
