Trokens: Few-Shot Action Recognition Tokens

Updated 6 August 2025
  • Trokens are semantic-aware trajectory tokens that combine adaptive, appearance-driven point sampling with explicit intra- and inter-trajectory motion models.
  • They fuse dynamic motion features with deep appearance cues through alignment and element-wise addition, enabling robust transformer-based video action recognition.
  • They address challenges in capturing fine-grained motion cues, achieving state-of-the-art performance on several few-shot action recognition benchmarks.

Trokens are a semantic-aware, relational trajectory token representation designed for few-shot action recognition in video understanding. The approach addresses fundamental challenges in integrating motion and appearance cues, particularly regarding the selection of discriminative tracking points and the modeling of their motion dynamics. Trokens leverage adaptive, appearance-driven point sampling in combination with explicit intra- and inter-trajectory motion models, resulting in a token set tailored for input to transformer-based architectures. The integration of these motion tokens with deep appearance features yields state-of-the-art performance across multiple few-shot action recognition benchmarks.

1. Semantic-Aware Sampling of Trajectory Points

The cornerstone of the Trokens methodology is its semantic-aware point sampling strategy. Unlike prior paradigms relying on uniform grid sampling, Trokens dynamically select spatial locations for trajectory tracking based on object relevance and semantic scale:

  • Appearance features are extracted from videos using self-supervised DINO patch tokens. These features cluster naturally according to object identities.
  • The patch features are partitioned into $L$ semantic clusters. Adaptive point sampling is then performed within each cluster, allocating higher density to clusters corresponding to small yet semantically critical objects and lower density to those representing large or irrelevant background regions.
  • The sampled coordinates, $\mathcal{P}_s = \{(x_s^i, y_s^i)\}_{i=1}^M$, are subjected to point tracking (e.g., with the Co-tracker framework) to yield motion trajectories for each sampled point.

This approach increases the likelihood of capturing action-relevant motion trajectories while mitigating the risk of under-sampling small but important objects. It also addresses the limitations of uniform sampling that may neglect fine-grained, context-sensitive motion cues.
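To make the adaptive allocation concrete, here is a minimal Python sketch (NumPy and scikit-learn), assuming k-means clustering of DINO patch tokens and an equal per-cluster point budget; the helper name, cluster count, and budget rule are illustrative assumptions rather than the paper's exact procedure. Giving every cluster the same budget already concentrates points more densely inside small semantic clusters than uniform grid sampling would.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_aware_sampling(patch_tokens, patch_coords, num_clusters=8,
                            num_points=256, rng=None):
    """Cluster patch features and sample tracking points per cluster.

    patch_tokens: (N, C) DINO patch features for one frame.
    patch_coords: (N, 2) corresponding (x, y) patch-center coordinates.
    Returns an (approximately num_points, 2) array of sampled coordinates.
    """
    rng = rng or np.random.default_rng(0)
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(patch_tokens)

    # Equal budget per semantic cluster: a small object cluster receives as
    # many points as a large background cluster, i.e. a higher density.
    per_cluster = max(1, num_points // num_clusters)
    sampled = []
    for k in range(num_clusters):
        idx = np.flatnonzero(labels == k)
        if idx.size == 0:
            continue
        pick = rng.choice(idx, size=per_cluster, replace=idx.size < per_cluster)
        sampled.append(patch_coords[pick])
    return np.concatenate(sampled, axis=0)
```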

2. Motion Modeling: Intra- and Inter-Trajectory Dynamics

The Trokens framework constructs a dual-module motion modeling system, capturing both the individual dynamics and collective relationships among tracked trajectories.

2.1 Intra-Trajectory (Intra-Motion) Module

  • For a trajectory $\mathcal{P}^m = [(x_t^m, y_t^m)]_{t=1}^{T}$ over $T$ frames, Trokens compute framewise displacements:

$$\Delta x_t = x_t - x_{t-\delta},\quad \Delta y_t = y_t - y_{t-\delta}$$

for temporal offset $\delta$.

  • The displacement magnitude and orientation are:

$$\Delta d_t = \sqrt{\Delta x_t^2 + \Delta y_t^2},\quad \theta_t = \operatorname{arctan2}(\Delta y_t, \Delta x_t)$$

  • Orientations $\theta_t$ are histogrammed into $B$ bins (e.g., $B = 32$) over $360^\circ$, weighted by the displacement magnitudes $\Delta d_t$, yielding the Histogram of Oriented Displacements (HoD):

$$H_\text{HoD} = f_\text{HoD}(\mathcal{P}^m) \in \mathbb{R}^{T \times B}$$

  • $H_\text{HoD}$ is projected into feature space via a fully connected (FC) layer to produce $F_\text{intra\_motion} \in \mathbb{R}^{M \times T \times C}$ (see the sketch below).
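The following Python sketch illustrates the HoD computation for a single trajectory, assuming hard binning with one magnitude-weighted vote per frame; any soft assignment or temporal pooling the authors may use is not reproduced here.

```python
import numpy as np

def histogram_of_oriented_displacements(traj, delta=1, num_bins=32):
    """Per-frame Histogram of Oriented Displacements for one trajectory.

    traj: (T, 2) array of (x, y) positions over T frames.
    Returns a (T, num_bins) array; row t holds the orientation histogram
    weighted by the displacement magnitude at frame t.
    """
    T = traj.shape[0]
    hod = np.zeros((T, num_bins))
    for t in range(delta, T):
        dx, dy = traj[t] - traj[t - delta]
        d = np.hypot(dx, dy)                       # magnitude Delta d_t
        theta = np.arctan2(dy, dx) % (2 * np.pi)   # orientation in [0, 2*pi)
        b = int(theta / (2 * np.pi) * num_bins) % num_bins
        hod[t, b] = d                              # magnitude-weighted vote
    return hod
```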

2.2 Inter-Trajectory (Inter-Motion) Module

  • For each trajectory and timestep, pairwise displacements are computed with respect to all other trajectories:

$$d_t^m = [x_t^m - x_t^{m'},\; y_t^m - y_t^{m'}]_{m'=1}^M \in \mathbb{R}^{2M}$$

  • These descriptors encode relational, object-centric motion (e.g., tool versus manipulated object).
  • The descriptors are projected through an FC layer, yielding $F_\text{inter\_motion} \in \mathbb{R}^{M \times T \times C}$.

Together, these modules model motion comprehensively, ranging from local trajectory signatures to the object-object dependencies relevant for temporal action classification.
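The pairwise descriptor can be computed in a vectorized way; the sketch below (NumPy) assumes a per-neighbor (dx, dy) ordering inside the 2M-dimensional vector and omits the subsequent FC projection.

```python
import numpy as np

def inter_trajectory_descriptors(trajs):
    """Pairwise displacement descriptors between all trajectories, per frame.

    trajs: (M, T, 2) array of M trajectories over T frames.
    Returns (M, T, 2*M): for trajectory m at time t, the concatenation of
    (x_t^m - x_t^{m'}, y_t^m - y_t^{m'}) over all m' = 1..M.
    """
    M, T, _ = trajs.shape
    # diff[m, n, t, :] = trajs[m, t, :] - trajs[n, t, :]
    diff = trajs[:, None, :, :] - trajs[None, :, :, :]   # (M, M, T, 2)
    diff = diff.transpose(0, 2, 1, 3)                     # (M, T, M, 2)
    return diff.reshape(M, T, 2 * M)
```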

3. Integration of Motion and Appearance Features

Trokens integrate dynamic and appearance information through a series of alignment and fusion operations:

  • RGB (appearance) tokens are extracted per frame: $F^{\text{RGB}} \in \mathbb{R}^{H \times W \times T \times C}$.
  • Trajectory alignment is performed:

$$F^{\text{RGB}}_{\text{traj}} = \text{Align}(F^{\text{RGB}}, \mathcal{P}) \in \mathbb{R}^{M \times T \times C}$$

This operation reorders appearance features along each motion trajectory, ensuring temporal correspondences between motion and context.

  • Fusion of modalities is realized via element-wise addition:

$$F^\text{fuse} = F^{\text{RGB}}_{\text{traj}} + F_\text{intra\_motion} + F_\text{inter\_motion}$$

  • The fused tokens are input to a decoupled space–time transformer, which applies self-attention separately along the temporal and spatial axes. The final classification embedding is produced by cross-attention with a learnable CLS token.

This design enables transformers to process rich, trajectory-aligned spatiotemporal representations with explicit and discriminative motion encoding.
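A minimal PyTorch sketch of the alignment, element-wise fusion, and decoupled space–time attention is given below; nearest-patch indexing, the single attention block, and the omission of the CLS cross-attention, normalization, and MLP sublayers are simplifying assumptions.

```python
import torch
import torch.nn as nn

def align_rgb_to_trajectories(rgb_tokens, trajs):
    """Gather per-frame appearance tokens at each trajectory's location.

    rgb_tokens: (T, H, W, C) appearance tokens; trajs: (M, T, 2) trajectory
    points already mapped to patch-grid (x, y) indices.
    Returns (M, T, C) trajectory-aligned appearance features.
    """
    T, H, W, C = rgb_tokens.shape
    x = trajs[..., 0].long().clamp(0, W - 1)      # (M, T)
    y = trajs[..., 1].long().clamp(0, H - 1)
    t = torch.arange(T).expand_as(x)
    return rgb_tokens[t, y, x]                    # (M, T, C)

class DecoupledSpaceTimeBlock(nn.Module):
    """Self-attention applied separately along the temporal and token axes."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f):                          # f: (M, T, C) fused tokens
        f = f + self.time_attn(f, f, f)[0]         # attend over T per trajectory
        g = f.transpose(0, 1)                      # (T, M, C)
        g = g + self.space_attn(g, g, g)[0]        # attend over M per frame
        return g.transpose(0, 1)                   # back to (M, T, C)

# Fusion by element-wise addition of trajectory-aligned features,
# all of shape (M, T, C), before the transformer:
#   f_fuse = f_rgb_traj + f_intra_motion + f_inter_motion
#   out = DecoupledSpaceTimeBlock(C)(f_fuse)
```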

4. Quantitative Evaluation and Benchmarks

Trokens achieve state-of-the-art results on six few-shot action recognition benchmarks:

| Dataset | Improvement over SOTA | Notes |
|---|---|---|
| Something-Something-V2 (Full) | 2–5% (1–5-shot) | Outperforms TATs in all tested regimes |
| Kinetics | 1–2% (low-shot) | Smaller gains; dataset is appearance-biased |
| HMDB51 | Up to 10% (specific configs) | Substantial improvements observed |
| UCF101 | Significant gains | Consistent advantages in all settings |
| FineGym | Consistent advantages | Robust across task variants |

These improvements are attributed directly to the advanced sampling and explicit motion modeling, which enable the extraction of discriminative cues even in low-data regimes. For datasets with strong appearance signals (e.g., Kinetics), performance gains are modest, but for action datasets heavily reliant on motion context (e.g., HMDB51, SSV2), Trokens provide pronounced benefits.

5. Technical Summary and Formulas

Key mathematical operations and design details in Trokens are as follows:

  • Trajectory displacement: $\Delta x_t = x_t - x_{t-\delta}$, $\Delta y_t = y_t - y_{t-\delta}$
  • Magnitude and orientation: $\Delta d_t = \sqrt{\Delta x_t^2 + \Delta y_t^2}$, $\theta_t = \operatorname{arctan2}(\Delta y_t, \Delta x_t)$
  • Histogram of Oriented Displacements: $H_\text{HoD} = f_\text{HoD}(\mathcal{P}^m) \in \mathbb{R}^{T \times B}$
  • Intra-motion feature projection: $F_\text{intra\_motion} = \text{FC}(f_\text{HoD}(\mathcal{P})) \in \mathbb{R}^{M \times T \times C}$
  • Pairwise inter-motion descriptor: $d_t^m = [(x_t^m - x_t^{m'},\; y_t^m - y_t^{m'})]_{m'=1}^M \in \mathbb{R}^{2M}$
  • Inter-motion feature projection: $F_\text{inter\_motion} = \text{FC}(d) \in \mathbb{R}^{M \times T \times C}$
  • Trajectory alignment: $F^{\text{RGB}}_{\text{traj}} = \text{Align}(F^{\text{RGB}}, \mathcal{P})$
  • Feature fusion: $F^\text{fuse} = F^{\text{RGB}}_{\text{traj}} + F_\text{intra\_motion} + F_\text{inter\_motion}$

These components are sequentially integrated prior to transformer-based classification.
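For orientation, the sequential flow might look like the following outline, reusing the hypothetical helpers from the earlier sketches; the names, shapes, tracker call, and projection layers are assumptions, not the authors' implementation.

```python
# Hypothetical end-to-end flow (shapes in comments), reusing the sketch helpers above.
#
# points  = semantic_aware_sampling(patch_tokens, patch_coords)                # (M, 2)
# trajs   = run_point_tracker(video, points)                                   # (M, T, 2), e.g. Co-tracker
# hod     = np.stack([histogram_of_oriented_displacements(p) for p in trajs])  # (M, T, B)
# inter   = inter_trajectory_descriptors(trajs)                                # (M, T, 2M)
# f_intra = fc_intra(torch.from_numpy(hod).float())                            # (M, T, C)
# f_inter = fc_inter(torch.from_numpy(inter).float())                          # (M, T, C)
# f_rgb   = align_rgb_to_trajectories(rgb_tokens, torch.from_numpy(trajs))     # (M, T, C)
# f_fuse  = f_rgb + f_intra + f_inter                                          # element-wise fusion
# logits  = classifier(space_time_transformer(f_fuse))                         # few-shot classification head
```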

6. Significance, Limitations, and Future Directions

Trokens represent a rigorous advance in the synthesis of appearance and motion cues for few-shot video action recognition. The semantic-aware sampling and dual-stage motion modeling enable robust performance even with limited labeled data. Noted limitations include reduced efficacy in scenes subject to motion blur or significant camera movement, conditions that may degrade the quality of trajectory extraction. The authors indicate that future work should address these challenges, potentially by enhancing tracker robustness and expanding the methodology to broader video understanding contexts beyond few-shot settings.

In summary, Trokens establish an effective, modular pipeline for trajectory-centric, semantic-aware action recognition, substantiated by empirical gains on a suite of challenging benchmarks.