One-Shot 3D Action Recognition

Updated 28 November 2025
  • Research in this setting leverages deep metric learning and graph neural networks to embed 3D skeleton data for reliable one-shot recognition.
  • One-shot 3D action recognition is defined as identifying human actions from a single labeled support instance per class, a regime that challenges conventional supervised methods in dynamic scenarios.
  • Key methods include cross-modal transformers, multi-scale GCNs, and robust episodic training, achieving competitive accuracy on benchmarks like NTU RGB+D 120.

One-shot 3D action recognition refers to the identification of human actions in 3D (typically skeleton-based) time series data when only a single labeled instance per action class is available for support at test time. This regime is strictly harder than standard supervised action recognition and is motivated by robotic, surveillance, therapy, and healthcare scenarios where exhaustive action annotation is infeasible. The core task is to construct an embedding or distance metric such that novel actions—never seen in the training set, but presented once per class as labeled support during meta-testing—can be reliably recognized from unseen queries. Recent years have seen the emergence of deep metric learning, graph neural networks, multi-scale and part-aware prototypes, cross-modal transformers, and robust episodic training protocols for this setting.

1. Principles and Problem Statement

In the one-shot 3D action recognition framework, meta-learning episodes are drawn from a distribution over action classes (Berti et al., 2022). Each meta-episode consists of:

  • A support set $\mathcal{S} = \{(\mathbf{x}^s_c, c) \mid c \in \mathcal{C}_T\}$ of $N$ classes, with one labeled sample per class (N-way, 1-shot).
  • A query set $\mathcal{Q} = \{(\mathbf{x}^q_j, y_j)\}$.
  • The goal is to learn an embedding function $f_\theta(\cdot)$ such that metric-based assignment in the embedding space $\epsilon = f_\theta(\mathbf{x})$ allows the correct class to be determined by nearest neighbor (or softmax over distances): $\hat{y} = \arg\min_c d(\epsilon^q, P_c)$, with prototype $P_c = f_\theta(\mathbf{x}^s_c)$.

This paradigm can be instantiated for open-set recognition, meta-graph learning, cross-modal fusion, and robustness to occlusion. Support and query sequences are typically 3D skeletons over $T$ frames with $J$ joints: $\mathbf{x} \in \mathbb{R}^{T \times J \times 3}$.
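A minimal sketch of this nearest-prototype assignment rule in PyTorch, assuming an arbitrary sequence encoder `f_theta` (the `ToyEncoder` below is purely illustrative, not any published architecture):

```python
import torch
import torch.nn.functional as F

def one_shot_classify(f_theta, support, support_labels, queries):
    """Nearest-prototype assignment for N-way, 1-shot episodes.

    support:  (N, T, J, 3) one skeleton sequence per class
    queries:  (Q, T, J, 3) unlabeled query sequences
    Returns predicted class labels of shape (Q,).
    """
    with torch.no_grad():
        prototypes = f_theta(support)            # (N, D), one prototype per class
        emb_q = f_theta(queries)                 # (Q, D)
        dists = torch.cdist(emb_q, prototypes)   # (Q, N) pairwise distances
        pred = dists.argmin(dim=1)               # nearest prototype wins
    return support_labels[pred]

# Toy embedding: flatten and project; any sequence encoder could stand in here.
class ToyEncoder(torch.nn.Module):
    def __init__(self, T=30, J=25, dim=64):
        super().__init__()
        self.proj = torch.nn.Linear(T * J * 3, dim)

    def forward(self, x):                        # x: (B, T, J, 3)
        z = self.proj(x.flatten(1))
        return F.normalize(z, dim=1)             # L2-normalized embedding

encoder = ToyEncoder()
support = torch.randn(5, 30, 25, 3)              # 5-way, 1-shot
labels = torch.arange(5)
queries = torch.randn(8, 30, 25, 3)
print(one_shot_classify(encoder, support, labels, queries))
```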

2. Deep Metric Learning and Embedding-Based Methods

Signal-level representations are a dominant paradigm. Approaches such as SL-DML (Memmesheimer et al., 2020) and Skeleton-DML (Memmesheimer et al., 2020) encode the multivariate time series of skeleton data into a compact image-like form (e.g., a matrix with rows as joints/channels and columns as time or concatenated $x,y,z$ blocks). These image representations are processed by standard CNNs (often ResNet-18) with subsequent fully connected layers to produce $L_2$-normalized embeddings.

  • Learning is driven by metric losses (e.g., triplet or multi-similarity), encouraging smaller intra-class and larger inter-class distances.
  • Recognition is performed via nearest neighbor in the embedding space.
  • For multimodal scenarios (e.g., skeleton+IMU), flexible channel arrangements enable fusion (Memmesheimer et al., 2020).
  • Augmentation (e.g., rotation, translation) and mining of hard negatives are crucial for sample efficiency and robustness.

Quantitative results on NTU RGB+D 120 show these approaches (e.g., Skeleton-DML: 54.2–57.7% top-1 one-shot accuracy with augmentation) surpass classic one-shot baselines (Memmesheimer et al., 2020, Memmesheimer et al., 2020).
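A hedged sketch of this signal-level recipe, assuming a PyTorch/torchvision setup; the exact image layout and resizing differ across the cited papers:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

class SkeletonImageEmbedder(torch.nn.Module):
    """Encode a skeleton sequence as an image and embed it with a CNN.

    Rows index joints, columns index time, and the x/y/z coordinates play
    the role of RGB channels; a standard ResNet-18 then maps the "image"
    to an L2-normalized embedding. The layout details are illustrative.
    """
    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = torch.nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, x):                 # x: (B, T, J, 3) skeleton sequences
        img = x.permute(0, 3, 2, 1)       # (B, 3, J, T): coords as channels
        img = F.interpolate(img, size=(224, 224), mode="bilinear",
                            align_corners=False)
        z = self.backbone(img)
        return F.normalize(z, dim=1)      # unit norm for metric comparison

# A triplet loss drives smaller intra-class / larger inter-class distances.
emb = SkeletonImageEmbedder()
anchor, pos, neg = (torch.randn(4, 30, 25, 3) for _ in range(3))
loss = F.triplet_margin_loss(emb(anchor), emb(pos), emb(neg), margin=0.2)
loss.backward()
```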

3. Graph and Multi-Scale Spatio-Temporal Methods

Part-aware, multi-scale, and local-component GCNs address the loss of discriminative power in global pooling. ALCA-GCN (Zhu et al., 2022) defines fine-grained spatial partitions (head, torso, hands, legs) and temporal segments (start, middle, end), creating local embeddings aligned across support/query. Adaptive Dependency Learning (ADL) employs self-attention over local components to focus the metric on action-critical substructures, suppressing noisy or irrelevant features.

  • The matching metric is a sum of $L_2$ distances between all aligned local component embeddings (a minimal sketch follows this list).
  • Training uses episodic meta-learning with negative log-likelihood loss over support-class distances.
  • Performance is robust: 57.6% on NTU-120, outperforming Skeleton-DML and prior GCNs (Zhu et al., 2022).
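The sketch below illustrates the summed local-distance metric and the episodic NLL objective under assumed shapes ($P$ aligned components per sequence); it is not ALCA-GCN's actual implementation:

```python
import torch

def local_component_distance(query_parts, support_parts):
    """Sum of L2 distances between aligned local component embeddings.

    query_parts, support_parts: (P, D) tensors, one row per local component
    (e.g. {head, torso, hands, legs} x {start, middle, end} segments).
    The component encoders themselves are omitted.
    """
    return (query_parts - support_parts).norm(dim=1).sum()

def episode_nll(query_parts, class_parts):
    """Episodic objective: log-softmax over negated support-class distances."""
    d = torch.stack([local_component_distance(query_parts, s)
                     for s in class_parts])        # (N,) distance per class
    return d, torch.log_softmax(-d, dim=0)         # lower distance -> higher prob

P, D, N = 12, 64, 5                                # 4 parts x 3 segments, 5-way
query = torch.randn(P, D)
classes = torch.randn(N, P, D)
dists, logp = episode_nll(query, classes)
loss = -logp[2]                                    # NLL if the true class is 2
```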

Multi-scale approaches (e.g., Yang et al., 2023) represent skeleton data at multiple spatial (joints, parts, limbs) and temporal (original, downsampled) resolutions. Earth Mover's Distance (EMD) is used to match cross-scale semantic representations, yielding strong recognition accuracy (68.7% one-shot accuracy on NTU-120).
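A simplified EMD-style matcher: with uniform node weights and equal node counts the transport problem reduces to an optimal assignment, which the sketch solves with SciPy's Hungarian solver rather than the full EMD solver the paper may use:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd_match_cost(query_nodes, support_nodes):
    """EMD-style matching between multi-scale skeleton representations.

    query_nodes, support_nodes: (K, D) arrays, one row per spatial/temporal
    scale node (joints, parts, limbs; original and downsampled time).
    """
    # Pairwise cost: Euclidean distance between every node pair.
    cost = np.linalg.norm(query_nodes[:, None, :] - support_nodes[None, :, :],
                          axis=-1)                 # (K, K)
    rows, cols = linear_sum_assignment(cost)       # optimal node assignment
    return cost[rows, cols].mean()                 # average transport cost

q = np.random.randn(6, 32)                         # 6 cross-scale nodes
s = np.random.randn(6, 32)
print(emd_match_cost(q, s))                        # lower cost = better match
```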

Table: Key Model Families in One-Shot 3D Action Recognition

| Model Family | Core Technique | Representative Reference |
|---|---|---|
| Signal-level DML | CNN/ResNet embedding | (Memmesheimer et al., 2020; Memmesheimer et al., 2020) |
| Part/multi-scale GCN | Dual-level/local GCN | (Chen et al., 2022; Zhu et al., 2022) |
| EMD-based matching | Multi-scale EMD matching | (Yang et al., 2023) |
| Transformer-based, occlusion-robust | Multi-stream LeViT, MAFM | (Peng et al., 2022) |
| LLM-guided cross-modal | Text-guided skeleton encoder | (Yan et al., 15 Mar 2024) |
| Open-set matching | TRX + OS discriminator | (Berti et al., 2022) |

4. Open-Set and Robust Recognition

Open-set one-shot recognition—where the system may encounter actions not in the support set at test time—necessitates explicit mechanisms for "none-of-the-above." The TRX-OS model (Berti et al., 2022) extends TRX meta-learning with an open-set discriminator ($\mathrm{Disc}$) that takes the per-pair prototype differences and outputs a scalar "match quality" score.

  • Match acceptance depends on whether $\mathrm{Disc}(\Delta) > \tau$; otherwise, the input is "rejected" (a minimal decision rule is sketched below).
  • The loss combines standard one-shot cross-entropy with OS binary cross-entropy.
  • On NTU-120, TRX-OS achieves 0.67 FSOS-ACC (5-way), outperforming distance-based confidence rejection by ~4%.
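A minimal sketch of the accept/reject rule, with a stand-in MLP discriminator (the real $\mathrm{Disc}$ architecture is defined in the paper):

```python
import torch

def open_set_decide(disc, proto_diffs, tau=0.5):
    """Threshold-based accept-or-reject rule in the spirit of TRX-OS.

    proto_diffs: (N, D) per-class query/prototype difference features.
    disc: a learned module mapping a difference to a scalar match score.
    Returns the best-matching class index, or -1 ("none of the above")
    when no class clears the threshold tau.
    """
    scores = disc(proto_diffs).squeeze(-1)     # (N,) match-quality scores
    best = scores.argmax()
    return best.item() if scores[best] > tau else -1

# Stand-in discriminator: a small MLP with a sigmoid score in (0, 1).
disc = torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.ReLU(),
                           torch.nn.Linear(32, 1), torch.nn.Sigmoid())
diffs = torch.randn(5, 64)                     # 5-way episode
print(open_set_decide(disc, diffs, tau=0.5))
```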

Occlusion robustness is addressed in Trans4SOAR (Peng et al., 2022), which introduces realistic 3D object-based and random occlusion simulation protocols. The model employs a three-stream (joints, velocities, bones) transformer with patch-level Mixed Attention Fusion (MAFM) and a Prototype Memory Bank for latent-space consistency. Under both synthetic and realistic occlusions, Trans4SOAR preserves state-of-the-art accuracy (52–53% on NTU-120 under RE/RA occlusion), significantly outperforming CNN baselines.
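A toy augmentation sketch combining joint dropout (as a crude proxy for the occlusion protocols) with random rotation; the published pipelines simulate occlusion from 3D object geometry rather than simple joint masking:

```python
import torch

def augment_skeleton(x, occlude_joints=(), max_rot_deg=15.0):
    """Simple occlusion + rotation augmentation for skeleton sequences.

    x: (T, J, 3). Zeroes out a chosen joint subset and applies a random
    rotation about the vertical (y) axis. Illustrative only.
    """
    x = x.clone()
    x[:, list(occlude_joints), :] = 0.0              # occluded joints drop out
    theta = torch.deg2rad(torch.empty(()).uniform_(-max_rot_deg, max_rot_deg))
    c, s = torch.cos(theta).item(), torch.sin(theta).item()
    R = torch.tensor([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])                 # rotation about y axis
    return x @ R.T                                   # rotate all joints

seq = torch.randn(30, 25, 3)
aug = augment_skeleton(seq, occlude_joints=range(21, 25))  # hide hand joints
```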

5. Cross-Modal and Semantics-Guided Advances

Recent trends leverage external semantic information to overcome data scarcity and local information loss. CrossGLG (Yan et al., 15 Mar 2024) introduces LLM-generated global and local text descriptions of each action:

  • Global text summaries indicate which joints are key for an action; these guide a Joint Importance Determination (JID) module in the skeleton encoder (a toy weighting sketch follows this list).
  • Local per-joint text descriptions are embedded and fused with joint features via non-local multi-head attention, aggregating local-to-global representations.
  • Dual-branch training ensures consistency, but only the skeleton branch is used at inference to minimize cost.
  • Ablation studies show that global-to-local and local-to-global guidance yield additive gains (+8.3 points on NTU120, 20-way) with <3% parameter overhead. On NTU-120 with 100 base classes, CrossGLG atop InfoGCN yields 62.6% one-shot performance (Yan et al., 15 Mar 2024).
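A hypothetical sketch of text-guided joint weighting in this spirit; the module name, shapes, and attention form below are assumptions, not CrossGLG's exact design:

```python
import torch

class JointImportance(torch.nn.Module):
    """Hypothetical text-guided joint weighting (JID-style sketch).

    A global text embedding (e.g. from an LLM description of which joints
    matter for the action) attends over per-joint features; the resulting
    weights re-scale the joints before pooling.
    """
    def __init__(self, joint_dim, text_dim):
        super().__init__()
        self.to_query = torch.nn.Linear(text_dim, joint_dim)

    def forward(self, joint_feats, text_emb):
        # joint_feats: (B, J, D), text_emb: (B, text_dim)
        q = self.to_query(text_emb)                        # (B, D)
        attn = torch.einsum("bjd,bd->bj", joint_feats, q)  # joint relevance
        w = torch.softmax(attn / joint_feats.shape[-1] ** 0.5, dim=1)
        return (w.unsqueeze(-1) * joint_feats).sum(dim=1)  # weighted pooling

jid = JointImportance(joint_dim=64, text_dim=512)
pooled = jid(torch.randn(2, 25, 64), torch.randn(2, 512))  # (2, 64)
```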

6. Specialized Protocols, Applications, and Evaluation

Protocols:

  • Most works follow episodic N-way, 1-shot meta-learning on large-scale skeleton datasets (NTU RGB+D/120, PKU-MMD, Kinetics), with novel class splits (Yang et al., 2023); a minimal episode sampler is sketched after this list.
  • Standard metrics are top-1 accuracy, macro-averaged F1, and open-set accuracy variants (FSOS-ACC; Berti et al., 2022).
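A minimal episode sampler consistent with this protocol (plain Python; the dataset structure is an assumption):

```python
import random

def sample_episode(dataset_by_class, n_way=5, n_query=1):
    """Sample one N-way, 1-shot episode from a class-indexed dataset.

    dataset_by_class: dict mapping class id -> list of samples; novel-class
    splits (e.g. on NTU RGB+D 120) are applied before calling this.
    """
    classes = random.sample(sorted(dataset_by_class), n_way)
    support, queries = [], []
    for c in classes:
        picks = random.sample(dataset_by_class[c], 1 + n_query)
        support.append((picks[0], c))           # the single labeled shot
        queries += [(q, c) for q in picks[1:]]  # held-out queries
    return support, queries

data = {c: [f"seq_{c}_{i}" for i in range(10)] for c in range(20)}
S, Q = sample_episode(data, n_way=5, n_query=2)
```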

Applications:

  • Humanoid robotics, where fast adaptation to novel actions is critical (Berti et al., 2022).
  • Medical therapy (autism monitoring): TCN-based online one-shot recognition runs in real time for therapy action detection (Sabater et al., 2021).

Evaluation under realistic occlusion, dynamic backgrounds, unseen sensor modalities (e.g., inertial, mocap), and multi-modality (e.g., fusion with RGB data) is increasingly emphasized (Peng et al., 2022, Memmesheimer et al., 2020).

7. Challenges, Extensions, and Open Problems

Challenges:

  • Fine-grained action discrimination, especially for hand-centric or subtle gestures, remains bottlenecked by imprecise joint labeling (e.g., confusion rates rise due to missing wrist/finger tracking (Berti et al., 2022)).
  • Occlusion, sensor noise, view-point variation, and temporal misalignment degrade performance if not explicitly addressed (Peng et al., 2022).
  • Open-set recognition requires robust confidence calibration and threshold tuning for rejection (Berti et al., 2022).

Extensions and Strategies:

  • Part-aware prototypes and spatial-temporal partitioning improve local discriminability (Chen et al., 2022, Zhu et al., 2022).
  • Multi-modal and privacy-preserving representations (position/orientation image fusion; Xie et al., 2023) enhance robustness, especially for medical use.
  • Transformer-based backbones offer improved spatial-temporal and cross-stream context aggregation, with augmentation and prototype banks for regularization (Peng et al., 2022).
  • Cross-modal guidance with LLMs is a promising direction for transferability and plug-and-play adaptation (Yan et al., 15 Mar 2024).

A plausible implication is that progress in one-shot 3D action recognition will continue to hinge on modular architectures that unify robust base embedding learning, semantic and distributional calibration, explicit local/global partitioning, and meta-episodic training under increasingly realistic data conditions, including occlusion, open-set, and multi-modal constraints.
