Trokens: Few-Shot Action Recognition Tokens

Updated 6 August 2025
  • Trokens are semantic-aware trajectory tokens that combine adaptive, appearance-driven point sampling with explicit intra- and inter-trajectory motion models.
  • They fuse dynamic motion features with deep appearance cues through alignment and element-wise addition, enabling robust transformer-based video action recognition.
  • They address challenges in capturing fine-grained motion cues, achieving state-of-the-art performance on several few-shot action recognition benchmarks.

Trokens are a semantic-aware, relational trajectory token representation designed for few-shot action recognition in video understanding. The approach addresses fundamental challenges in integrating motion and appearance cues, particularly regarding the selection of discriminative tracking points and the modeling of their motion dynamics. Trokens leverage adaptive, appearance-driven point sampling in combination with explicit intra- and inter-trajectory motion models, resulting in a token set tailored for input to transformer-based architectures. The integration of these motion tokens with deep appearance features yields state-of-the-art performance across multiple few-shot action recognition benchmarks.

1. Semantic-Aware Sampling of Trajectory Points

The cornerstone of the Trokens methodology is its semantic-aware point sampling strategy. Unlike prior paradigms relying on uniform grid sampling, Trokens dynamically select spatial locations for trajectory tracking based on object relevance and semantic scale:

  • Appearance features are extracted from videos using self-supervised DINO patch tokens. These features cluster naturally according to object identities.
  • The patch features are partitioned into $L$ semantic clusters. Adaptive point sampling is then performed within each cluster, allocating higher density to clusters corresponding to small yet semantically critical objects and lower density to those representing large or irrelevant background regions.
  • The sampled coordinates, $\mathcal{P}_s = \{(x_s^i, y_s^i)\}_{i=1}^M$, are subjected to point tracking (e.g., with the Co-tracker framework) to yield motion trajectories for each sampled point.

This approach increases the likelihood of capturing action-relevant motion trajectories while mitigating the risk of under-sampling small but important objects. It also addresses the limitations of uniform sampling that may neglect fine-grained, context-sensitive motion cues.
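To make the adaptive allocation concrete, here is a minimal Python sketch (NumPy and scikit-learn), assuming k-means clustering of DINO patch tokens and an equal per-cluster point budget; the helper name, cluster count, and budget rule are illustrative assumptions rather than the paper's exact procedure. Giving every cluster the same budget already concentrates points more densely inside small semantic clusters than uniform grid sampling would.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_aware_sampling(patch_tokens, patch_coords, num_clusters=8,
                            num_points=256, rng=None):
    """Cluster patch features and sample tracking points per cluster.

    patch_tokens: (N, C) DINO patch features for one frame.
    patch_coords: (N, 2) corresponding (x, y) patch-center coordinates.
    Returns an (approximately num_points, 2) array of sampled coordinates.
    """
    rng = rng or np.random.default_rng(0)
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(patch_tokens)

    # Equal budget per semantic cluster: a small object cluster receives as
    # many points as a large background cluster, i.e. a higher density.
    per_cluster = max(1, num_points // num_clusters)
    sampled = []
    for k in range(num_clusters):
        idx = np.flatnonzero(labels == k)
        if idx.size == 0:
            continue
        pick = rng.choice(idx, size=per_cluster, replace=idx.size < per_cluster)
        sampled.append(patch_coords[pick])
    return np.concatenate(sampled, axis=0)
```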

2. Motion Modeling: Intra- and Inter-Trajectory Dynamics

The Trokens framework constructs a dual-module motion modeling system, capturing both the individual dynamics and collective relationships among tracked trajectories.

2.1 Intra-Trajectory (Intra-Motion) Module

  • For a trajectory $\mathcal{P}^m = [(x_t^m, y_t^m)]_{t=1}^{T}$ over $T$ frames, Trokens compute framewise displacements:

$$\Delta x_t = x_t - x_{t-\delta},\quad \Delta y_t = y_t - y_{t-\delta}$$

for temporal offset $\delta$.

  • The displacement magnitude and orientation are:

$$\Delta d_t = \sqrt{\Delta x_t^2 + \Delta y_t^2},\quad \theta_t = \operatorname{arctan2}(\Delta y_t, \Delta x_t)$$

  • Orientations $\theta_t$ are histogrammed into $B$ bins (e.g., $B = 32$) over $360^\circ$, weighted by the displacement magnitudes $\Delta d_t$, yielding the Histogram of Oriented Displacements (HoD):

$$H_\text{HoD} = f_\text{HoD}(\mathcal{P}^m) \in \mathbb{R}^{T \times B}$$

  • $H_\text{HoD}$ is projected into feature space via a fully connected (FC) layer to produce $F_\text{intra\_motion} \in \mathbb{R}^{M \times T \times C}$ (see the sketch below).
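The following Python sketch illustrates the HoD computation for a single trajectory, assuming hard binning with one magnitude-weighted vote per frame; any soft assignment or temporal pooling the authors may use is not reproduced here.

```python
import numpy as np

def histogram_of_oriented_displacements(traj, delta=1, num_bins=32):
    """Per-frame Histogram of Oriented Displacements for one trajectory.

    traj: (T, 2) array of (x, y) positions over T frames.
    Returns a (T, num_bins) array; row t holds the orientation histogram
    weighted by the displacement magnitude at frame t.
    """
    T = traj.shape[0]
    hod = np.zeros((T, num_bins))
    for t in range(delta, T):
        dx, dy = traj[t] - traj[t - delta]
        d = np.hypot(dx, dy)                       # magnitude Delta d_t
        theta = np.arctan2(dy, dx) % (2 * np.pi)   # orientation in [0, 2*pi)
        b = int(theta / (2 * np.pi) * num_bins) % num_bins
        hod[t, b] = d                              # magnitude-weighted vote
    return hod
```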

2.2 Inter-Trajectory (Inter-Motion) Module

  • For each trajectory and timestep, pairwise displacements are computed with respect to all other trajectories:

$$d_t^m = [x_t^m - x_t^{m'},\; y_t^m - y_t^{m'}]_{m'=1}^M \in \mathbb{R}^{2M}$$

  • These descriptors encode relational, object-centric motion (e.g., tool versus manipulated object).
  • The descriptors are projected through an FC layer, yielding $F_\text{inter\_motion} \in \mathbb{R}^{M \times T \times C}$.

Together, these modules model motion comprehensively, ranging from local trajectory signatures to the object-object dependencies relevant for temporal action classification.
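The pairwise descriptor can be computed in a vectorized way; the sketch below (NumPy) assumes a per-neighbor (dx, dy) ordering inside the 2M-dimensional vector and omits the subsequent FC projection.

```python
import numpy as np

def inter_trajectory_descriptors(trajs):
    """Pairwise displacement descriptors between all trajectories, per frame.

    trajs: (M, T, 2) array of M trajectories over T frames.
    Returns (M, T, 2*M): for trajectory m at time t, the concatenation of
    (x_t^m - x_t^{m'}, y_t^m - y_t^{m'}) over all m' = 1..M.
    """
    M, T, _ = trajs.shape
    # diff[m, n, t, :] = trajs[m, t, :] - trajs[n, t, :]
    diff = trajs[:, None, :, :] - trajs[None, :, :, :]   # (M, M, T, 2)
    diff = diff.transpose(0, 2, 1, 3)                     # (M, T, M, 2)
    return diff.reshape(M, T, 2 * M)
```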

3. Integration of Motion and Appearance Features

Trokens integrate dynamic and appearance information through a series of alignment and fusion operations:

  • RGB (appearance) tokens are extracted per frame: $F^{\text{RGB}} \in \mathbb{R}^{H \times W \times T \times C}$.
  • Trajectory alignment is performed:

$$F^{\text{RGB}}_{\text{traj}} = \text{Align}(F^{\text{RGB}}, \mathcal{P}) \in \mathbb{R}^{M \times T \times C}$$

This operation reorders appearance features along each motion trajectory, ensuring temporal correspondences between motion and context.

  • Fusion of modalities is realized via element-wise addition:

$$F^\text{fuse} = F^{\text{RGB}}_{\text{traj}} + F_\text{intra\_motion} + F_\text{inter\_motion}$$

  • The fused tokens are input to a decoupled space–time transformer, which applies self-attention separately along the temporal and spatial axes. The final classification embedding is produced by cross-attention with a learnable CLS token.

This design enables transformers to process rich, trajectory-aligned spatiotemporal representations with explicit and discriminative motion encoding.
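A minimal PyTorch sketch of the alignment, element-wise fusion, and decoupled space–time attention is given below; nearest-patch indexing, the single attention block, and the omission of the CLS cross-attention, normalization, and MLP sublayers are simplifying assumptions.

```python
import torch
import torch.nn as nn

def align_rgb_to_trajectories(rgb_tokens, trajs):
    """Gather per-frame appearance tokens at each trajectory's location.

    rgb_tokens: (T, H, W, C) appearance tokens; trajs: (M, T, 2) trajectory
    points already mapped to patch-grid (x, y) indices.
    Returns (M, T, C) trajectory-aligned appearance features.
    """
    T, H, W, C = rgb_tokens.shape
    x = trajs[..., 0].long().clamp(0, W - 1)      # (M, T)
    y = trajs[..., 1].long().clamp(0, H - 1)
    t = torch.arange(T).expand_as(x)
    return rgb_tokens[t, y, x]                    # (M, T, C)

class DecoupledSpaceTimeBlock(nn.Module):
    """Self-attention applied separately along the temporal and token axes."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f):                          # f: (M, T, C) fused tokens
        f = f + self.time_attn(f, f, f)[0]         # attend over T per trajectory
        g = f.transpose(0, 1)                      # (T, M, C)
        g = g + self.space_attn(g, g, g)[0]        # attend over M per frame
        return g.transpose(0, 1)                   # back to (M, T, C)

# Fusion by element-wise addition of trajectory-aligned features,
# all of shape (M, T, C), before the transformer:
#   f_fuse = f_rgb_traj + f_intra_motion + f_inter_motion
#   out = DecoupledSpaceTimeBlock(C)(f_fuse)
```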

4. Quantitative Evaluation and Benchmarks

Trokens achieve state-of-the-art results on six few-shot action recognition benchmarks:

| Dataset | Improvement over SOTA | Notes |
|---|---|---|
| Something-Something-V2 (Full) | 2–5% (1–5-shot) | Outperforms TATs in all tested regimes |
| Kinetics | 1–2% (low-shot) | Smaller gains; dataset is appearance-biased |
| HMDB51 | Up to 10% (specific configs) | Substantial improvements observed |
| UCF101 | Significant gains | Consistent advantages in all settings |
| FineGym | Consistent advantages | Robust across task variants |

These improvements are attributed directly to the advanced sampling and explicit motion modeling, which enable the extraction of discriminative cues even in low-data regimes. For datasets with strong appearance signals (e.g., Kinetics), performance gains are modest, but for action datasets heavily reliant on motion context (e.g., HMDB51, SSV2), Trokens provide pronounced benefits.

5. Technical Summary and Formulas

Key mathematical operations and design details in Trokens are as follows:

  • Trajectory displacement: $\Delta x_t = x_t - x_{t-\delta}$, $\Delta y_t = y_t - y_{t-\delta}$
  • Magnitude and orientation: $\Delta d_t = \sqrt{\Delta x_t^2 + \Delta y_t^2}$, $\theta_t = \operatorname{arctan2}(\Delta y_t, \Delta x_t)$
  • Histogram of Oriented Displacements: $H_\text{HoD} = f_\text{HoD}(\mathcal{P}^m) \in \mathbb{R}^{T \times B}$
  • Intra-motion feature projection: $F_\text{intra\_motion} = \text{FC}(f_\text{HoD}(\mathcal{P})) \in \mathbb{R}^{M \times T \times C}$
  • Pairwise inter-motion descriptor: $d_t^m = [(x_t^m - x_t^{m'},\; y_t^m - y_t^{m'})]_{m'=1}^M \in \mathbb{R}^{2M}$
  • Inter-motion feature projection: $F_\text{inter\_motion} = \text{FC}(d) \in \mathbb{R}^{M \times T \times C}$
  • Trajectory alignment: $F^{\text{RGB}}_{\text{traj}} = \text{Align}(F^{\text{RGB}}, \mathcal{P})$
  • Feature fusion: $F^\text{fuse} = F^{\text{RGB}}_{\text{traj}} + F_\text{intra\_motion} + F_\text{inter\_motion}$

These components are sequentially integrated prior to transformer-based classification.
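For orientation, the sequential flow might look like the following outline, reusing the hypothetical helpers from the earlier sketches; the names, shapes, tracker call, and projection layers are assumptions, not the authors' implementation.

```python
# Hypothetical end-to-end flow (shapes in comments), reusing the sketch helpers above.
#
# points  = semantic_aware_sampling(patch_tokens, patch_coords)                # (M, 2)
# trajs   = run_point_tracker(video, points)                                   # (M, T, 2), e.g. Co-tracker
# hod     = np.stack([histogram_of_oriented_displacements(p) for p in trajs])  # (M, T, B)
# inter   = inter_trajectory_descriptors(trajs)                                # (M, T, 2M)
# f_intra = fc_intra(torch.from_numpy(hod).float())                            # (M, T, C)
# f_inter = fc_inter(torch.from_numpy(inter).float())                          # (M, T, C)
# f_rgb   = align_rgb_to_trajectories(rgb_tokens, torch.from_numpy(trajs))     # (M, T, C)
# f_fuse  = f_rgb + f_intra + f_inter                                          # element-wise fusion
# logits  = classifier(space_time_transformer(f_fuse))                         # few-shot classification head
```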

6. Significance, Limitations, and Future Directions

Trokens represent a rigorous advance in the synthesis of appearance and motion cues for few-shot video action recognition. The semantic-aware sampling and dual-stage motion modeling enable robust performance even with limited labeled data. Noted limitations include reduced efficacy in scenes subject to motion blur or significant camera movement, conditions that may degrade the quality of trajectory extraction. The authors indicate that future work should address these challenges, potentially by enhancing tracker robustness and expanding the methodology to broader video understanding contexts beyond few-shot settings.

In summary, Trokens establish an effective, modular pipeline for trajectory-centric, semantic-aware action recognition, substantiated by empirical gains on a suite of challenging benchmarks.