Unified Multi-Entity Graph Network (UMEG-Net)
- The paper introduces UMEG-Net, a graph-based approach unifying players, sport-specific objects, and contextual cues for few-shot precise event spotting in sports videos.
- It utilizes spatio-temporal graph convolution and multi-scale parameter-free temporal shift to capture complex, short- and long-range interactions in video frames.
- A multimodal distillation framework transfers structured graph features to RGB encoders, enabling robust and data-efficient performance under limited labeled data.
Unified Multi-Entity Graph Network (UMEG-Net) is a graph-based approach for few-shot precise event spotting (PES) in sports video analytics. UMEG-Net explicitly represents players, sport-specific objects, and contextual environment cues as nodes in a unified graph, leveraging spatio-temporal graph convolution and parameter-free multi-scale temporal shift for robust, data-efficient event recognition. A multimodal knowledge distillation framework further enhances event spotting performance by transferring structured graph-derived representations to RGB-based encoders, enabling competitive accuracy under limited labeled data and effective handling of noisy keypoint input (Liu et al., 18 Nov 2025).
1. Unified Multi-Entity Graph Representation
UMEG-Net represents each frame of a sports video as an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \mathcal{V}_{\text{ply}} \cup \mathcal{V}_{\text{obj}} \cup \mathcal{V}_{\text{env}}$. The node set includes:
- $\mathcal{V}_{\text{ply}}$: players, each modeled with human-skeleton keypoints (e.g., from HRNet).
- $\mathcal{V}_{\text{obj}}$: sport-specific object keypoints (e.g., tennis ball, shuttlecock).
- $\mathcal{V}_{\text{env}}$: contextual landmarks such as court corners.
Graph connectivity is defined by four distinct edge subsets:
- $\mathcal{E}_{\text{skel}}$: connects joints within each skeleton following standard pose topology.
- $\mathcal{E}_{\text{obj}}$: connects selected player joints (e.g., wrists, ankles) to the sport-object node.
- $\mathcal{E}_{\text{env}}$: connects foot joints to environmental nodes (court corners).
- $\mathcal{E}_{\text{court}}$: links the four court corners into a rectangle.
All edges are undirected. This design enables unified modeling of human-object-environment interactions crucial for identifying subtle event cues, which skeleton-only or pixel-only methods typically omit (Liu et al., 18 Nov 2025).
| Node Type | Example Entities per Frame | Connectivity |
|---|---|---|
| Human skeletons | $N$ players × $K$ joints | $\mathcal{E}_{\text{skel}}$ (pose topology) |
| Sport object | Ball or shuttlecock keypoint | $\mathcal{E}_{\text{obj}}$ (selected joints to object) |
| Contextual landmarks | Four court corners | $\mathcal{E}_{\text{env}}$, $\mathcal{E}_{\text{court}}$ |
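The following sketch illustrates how such a per-frame multi-entity graph could be assembled into an adjacency matrix. The node counts, COCO-style joint indices, and helper names are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's exact configuration)
N_PLAYERS, K_JOINTS = 2, 17        # e.g. HRNet-style COCO skeletons
N_OBJ, N_COURT = 1, 4              # one ball/shuttle node, four court corners
N_NODES = N_PLAYERS * K_JOINTS + N_OBJ + N_COURT

# Hypothetical index helpers
def joint(p, j):                   # joint j of player p
    return p * K_JOINTS + j
OBJ = N_PLAYERS * K_JOINTS         # sport-object node
COURT = [OBJ + 1 + c for c in range(N_COURT)]  # court-corner nodes

A = np.zeros((N_NODES, N_NODES), dtype=np.float32)
def link(i, j):                    # undirected edge
    A[i, j] = A[j, i] = 1.0

# E_skel: standard pose topology within each skeleton (arm/leg subset shown)
SKELETON_PAIRS = [(5, 7), (7, 9), (6, 8), (8, 10),
                  (11, 13), (13, 15), (12, 14), (14, 16)]
for p in range(N_PLAYERS):
    for a, b in SKELETON_PAIRS:
        link(joint(p, a), joint(p, b))

# E_obj: selected joints (wrists, ankles) to the sport-object node
for p in range(N_PLAYERS):
    for j in (9, 10, 15, 16):
        link(joint(p, j), OBJ)

# E_env: foot joints to the court-corner nodes
for p in range(N_PLAYERS):
    for j in (15, 16):
        for c in COURT:
            link(joint(p, j), c)

# E_court: the four corners linked into a rectangle
for c in range(N_COURT):
    link(COURT[c], COURT[(c + 1) % N_COURT])
```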
2. Spatio-Temporal Feature Extraction
Each node $v \in \mathcal{V}$ is embedded into initial features $\mathbf{h}_v^{(0)}$. The graph sequence is processed by stacked UMEG blocks, each comprising:
a. Spatial Graph Convolution
At layer $l$, features are updated as

$$H^{(l+1)} = \sigma\!\left(\hat{A}\, H^{(l)} W^{(l)}\right),$$

where $\hat{A}$ is the (normalized) adjacency matrix, $W^{(l)}$ are trainable weights, and multiplication by $\hat{A}$ aggregates spatio-temporal neighborhood features.
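A minimal PyTorch sketch of a spatial graph convolution consistent with this update, assuming a fixed, symmetrically normalized adjacency with self-loops; the class name and tensor layout are assumptions.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """Generic spatial GCN update H' = sigma(A_hat @ H @ W) over a fixed graph."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        A = adjacency + torch.eye(adjacency.size(0))      # add self-loops
        d = A.sum(dim=1)
        A_hat = A / torch.sqrt(d[:, None] * d[None, :])   # symmetric normalization
        self.register_buffer("A_hat", A_hat)
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.ReLU()

    def forward(self, H):
        # H: (batch, time, nodes, channels) -> aggregate over graph neighbors
        H = torch.einsum("vu,btuc->btvc", self.A_hat, H)
        return self.act(self.W(H))
```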
b. Multi-Scale Parameter-Free Temporal Shift
Inspired by the Temporal Shift Module (TSM), each $H^{(l)}$ is channel-wise partitioned into a static component $H_{\text{stay}}$, a forward-shifted $H_{\text{fwd}}$, and a backward-shifted $H_{\text{bwd}}$. For each shift scale $s$, the shifted feature at frame $t$ is

$$\tilde{H}^{(s)}_t = \big[\, H_{\text{fwd},\, t-s} \,\Vert\, H_{\text{stay},\, t} \,\Vert\, H_{\text{bwd},\, t+s} \,\big].$$

Out-of-bounds indices are zero-padded. The spatial GCN is applied per scale, the branches are reduced and fused, and the result is added to the residual:

$$H^{(l+1)} = H^{(l)} + \mathrm{Fuse}\!\Big( \big\Vert_s\, \phi_s\big(\mathrm{GCN}\big(\tilde{H}^{(s)}\big)\big) \Big),$$

where $\Vert_s$ denotes concatenation of the outputs at different scales $s$, each reduced via a linear map $\phi_s$. This provides expanded short-, mid-, and long-range temporal receptive fields with zero additional parameters, facilitating effective aggregation of relevant cues across the temporal context and accommodating the rapid, subtle events typical of PES tasks.
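A minimal sketch of a parameter-free channel-wise temporal shift at a single scale, in the spirit of TSM; the channel split fraction and the example scale set in the usage comment are assumptions.

```python
import torch

def temporal_shift(H, scale, fold_div=4):
    """Parameter-free channel-wise temporal shift (TSM-style) at a given scale.

    H: (batch, time, nodes, channels). One channel fold is shifted forward in
    time by `scale`, one fold backward, and the rest stays static; indices
    falling outside the clip are zero-padded.
    """
    B, T, V, C = H.shape
    fold = C // fold_div
    out = torch.zeros_like(H)
    out[:, scale:, :, :fold] = H[:, :T - scale, :, :fold]                    # forward shift
    out[:, :T - scale, :, fold:2 * fold] = H[:, scale:, :, fold:2 * fold]    # backward shift
    out[:, :, :, 2 * fold:] = H[:, :, :, 2 * fold:]                          # static channels
    return out

# Multi-scale usage: apply the spatial GCN per shifted branch, reduce, fuse,
# and add the residual (the scale set below is illustrative).
# branches = [reduce_s(gcn(temporal_shift(H, s))) for s in (1, 2, 3)]
# H_next = H + fuse(torch.cat(branches, dim=-1))
```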
3. Multimodal Distillation Framework
To address the unreliability of keypoint detection and improve model robustness, UMEG-Net adopts a multimodal distillation paradigm in which the UMEG-Net encoder (teacher) provides graph-based representations $f_T(\mathcal{G})$, and a separate RGB-only student encoder $f_S$ (a VideoMAEv2 backbone followed by a bi-GRU and a linear projection) learns to mimic the teacher's outputs using an L2 feature-matching loss:

$$\mathcal{L}_{\text{distill}} = \big\| f_S(x_{\text{RGB}}) - f_T(\mathcal{G}) \big\|_2^2.$$
During distillation, the teacher's parameters are frozen, and the student is updated to align with the structured graph representations. Distillation leverages large unlabeled sets, expanding the applicability of UMEG-Net to scenarios where keypoints are unavailable or unreliable, and enabling high-quality RGB-based inference.
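A sketch of one distillation step under these assumptions (frozen graph teacher, RGB student, mean-squared L2 feature matching); the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, rgb_clip, graph_clip, optimizer):
    """One L2 feature-matching distillation step: the frozen graph teacher
    produces target features, and the RGB student is updated to match them."""
    teacher.eval()
    with torch.no_grad():                       # teacher parameters stay frozen
        target = teacher(graph_clip)            # (batch, time, feat_dim)

    pred = student(rgb_clip)                    # RGB-only student features
    loss = F.mse_loss(pred, target)             # L2 feature-matching loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```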
4. Training Procedures for Few-Shot Event Spotting
UMEG-Net is optimized in two sequential stages tailored to few-shot constraints:
Stage 1: Graph-Based Few-Shot Adaptation
- A small number of labeled clips ($K$-shot) is randomly sampled from the target domain, each clip containing multiple events.
- Dense, per-frame classification is performed using cross-entropy, with event frames weighted five-fold to address extreme class imbalance (only about 3% of frames are events); a minimal loss/optimizer sketch follows this list.
- AdamW optimizer; three-step linear warm-up followed by cosine annealing over 50 (or 30) epochs.
- Data augmentation: random crops, color jitter.
- Model selection is based on validation F1 and edit score.
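A minimal sketch of the weighted per-frame loss and the warm-up/cosine schedule described above; the class count, learning rate, and warm-up length are assumed values for illustration.

```python
import torch
import torch.nn as nn

# Dense per-frame classification with event frames up-weighted (class layout
# is assumed: index 0 = background, the rest are event classes).
NUM_CLASSES = 8
class_weights = torch.ones(NUM_CLASSES)
class_weights[1:] = 5.0                            # event classes weighted five-fold
criterion = nn.CrossEntropyLoss(weight=class_weights)

def stage1_loss(frame_logits, frame_labels):
    """frame_logits: (batch, time, classes); frame_labels: (batch, time)."""
    return criterion(frame_logits.flatten(0, 1), frame_labels.flatten())

# AdamW with linear warm-up followed by cosine annealing (hyperparameters assumed)
def build_optimizer(model, epochs=50, warmup_epochs=3, lr=1e-3):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1,
                                               total_iters=warmup_epochs)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                        T_max=epochs - warmup_epochs)
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer, [warmup, cosine], milestones=[warmup_epochs])
    return optimizer, scheduler
```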
Stage 2: Distillation and RGB Student Finetuning
- Unlabeled clips are used for distillation (50/30 epochs).
- The student backbone is frozen; only the temporal block, localizer, and classifier are finetuned on the labeled data for 10 epochs (see the sketch after this list).
- Unlike prototypical few-shot paradigms, UMEG-Net trains directly on the labeled clips without explicit support-query alternation.
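A minimal sketch of the frozen-backbone finetuning step; it assumes the student exposes `backbone`, `temporal_block`, `localizer`, and `classifier` submodules (hypothetical attribute names) and an illustrative learning rate.

```python
import torch

def build_finetune_optimizer(student, lr=1e-4):
    """Freeze the RGB student backbone and finetune only the head modules."""
    for p in student.backbone.parameters():
        p.requires_grad = False                      # backbone stays frozen

    head_params = (list(student.temporal_block.parameters())
                   + list(student.localizer.parameters())
                   + list(student.classifier.parameters()))
    return torch.optim.AdamW(head_params, lr=lr)
```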
A plausible implication is that UMEG-Net’s two-stage regimen relaxes the dependency on abundant keypoint annotations and allows efficient adaptation to new domains with minimal supervision.
5. Empirical Evaluation and Results
UMEG-Net has been evaluated on multiple datasets: F³Set-Tennis (11,584 rallies), ShuttleSet (3,685), FineGym-BB (1,112), Figure Skating (371), and SoccerNet BAS. In few-shot settings:
- Compared to RGB SOTA methods (E2E-Spot, T-DEED, F³ED), UMEG-Net achieves absolute F1 gains of +1.3–5.5% and edit-score gains of +1.3–16.4%.
- Against skeleton-only GCNs (MSG3D, AAGCN, CTRGCN, STGCN++, ProtoGCN, BlockGCN), UMEG-Net surpasses the top baseline (BlockGCN) by +2.5% F1 on F³Set-Tennis and by +2.1% F1 / +4.6% edit score on ShuttleSet, using only 2.2M parameters.
- The distilled RGB student provides a further +5.8% F1 and +6.7% edit-score improvement, demonstrating the complementarity of the two modalities.
- Under full supervision, UMEG-Net remains competitive with or outperforms E2E-Spot on 3 of 5 datasets.
Metrics include mean per-class F1 under strict temporal windows (tolerance $\delta$ = 1 frame or 1 second) and a Levenshtein-based edit score, which penalizes missing or misordered events.
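A simplified sketch of these two metrics, assuming events are given as (frame, class) tuples; the greedy matching strategy for the windowed F1 is an assumption, while the edit score follows the standard Levenshtein formulation.

```python
def windowed_f1(pred_events, gt_events, delta=1):
    """Greedy one-to-one matching of predicted and ground-truth events of the
    same class within a temporal tolerance of `delta` frames, then F1."""
    matched_gt = set()
    tp = 0
    for pf, pc in sorted(pred_events):
        for i, (gf, gc) in enumerate(gt_events):
            if i not in matched_gt and pc == gc and abs(pf - gf) <= delta:
                matched_gt.add(i)
                tp += 1
                break
    precision = tp / max(len(pred_events), 1)
    recall = tp / max(len(gt_events), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

def edit_score(pred_seq, gt_seq):
    """Levenshtein-based edit score over the ordered event-label sequences,
    penalizing missing, spurious, or misordered events."""
    m, n = len(pred_seq), len(gt_seq)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred_seq[i - 1] == gt_seq[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 1.0 - d[m][n] / max(m, n, 1)
```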
6. Ablation Studies and Analysis
Extensive ablations support UMEG-Net’s core design:
- Incremental addition of graph node types (skeleton → skeleton+object → skeleton+object+court) shows substantial cumulative improvement (e.g., F1 on F³Set rises from 6.6 to 9.4).
- Multi-scale temporal shift (combining multiple shift scales) distinctly outperforms restricted single-scale settings.
- L2-based multimodal distillation exceeds self-supervised contrastive objectives (SimCLR-style) by a large margin.
This evidence indicates that the unified multi-entity representation, parameter-free temporal augmentation, and cross-modal knowledge transfer are pivotal to UMEG-Net’s superior few-shot PES accuracy.
UMEG-Net thus constitutes a lightweight, scalable, and empirically validated solution for few-shot, fine-grained, frame-level event recognition in structured video domains, with documented advantages over both pixel-based and skeleton-only baselines under stringent evaluation protocols (Liu et al., 18 Nov 2025).