Unified Multi-Entity Graph Network (UMEG-Net)

Updated 20 November 2025
  • The paper introduces UMEG-Net, a graph-based approach unifying players, sport-specific objects, and contextual cues for few-shot precise event spotting in sports videos.
  • It utilizes spatio-temporal graph convolution and multi-scale parameter-free temporal shift to capture complex, short- and long-range interactions in video frames.
  • A multimodal distillation framework transfers structured graph features to RGB encoders, enabling robust and data-efficient performance under limited labeled data.

Unified Multi-Entity Graph Network (UMEG-Net) is a graph-based approach for few-shot precise event spotting (PES) in sports video analytics. UMEG-Net explicitly represents players, sport-specific objects, and contextual environment cues as nodes in a unified graph, leveraging spatio-temporal graph convolution and parameter-free multi-scale temporal shift for robust, data-efficient event recognition. A multimodal knowledge distillation framework further enhances event spotting performance by transferring structured graph-derived representations to RGB-based encoders, enabling competitive accuracy under limited labeled data and effective handling of noisy keypoint input (Liu et al., 18 Nov 2025).

1. Unified Multi-Entity Graph Representation

UMEG-Net represents each frame $t$ of a sports video as an undirected graph $G_t = (V_t, E_t)$, where $V_t = V_{p_t} \cup V_{b_t} \cup V_{c_t}$. The node set includes:

  • $V_{p_t}$: $N$ players, each modeled with $K$ human-skeleton keypoints $P_i^t = (j_{i,1}^t, \ldots, j_{i,K}^t)$ (e.g., from HRNet).
  • $V_{b_t}$: Sport-specific object keypoints (e.g., tennis ball, shuttlecock).
  • $V_{c_t}$: Contextual landmarks such as court corners.

Graph connectivity is defined by distinct edge subsets:

  • $E^{\mathrm{intra}}_t$: Connects joints within each skeleton by standard pose topology.
  • $E^{p\text{-}b}_t$: Connects selected player joints (e.g., wrist, ankle) to the sport-object node.
  • $E^{p\text{-}c}_t$: Connects foot joints to environmental nodes (court corners).
  • $E^{c\text{-}c}_t$: Links the four court corners into a rectangle.

All edges are undirected. This design enables unified modeling of human-object-environment interactions crucial for identifying subtle event cues, which skeleton-only or pixel-only methods typically omit (Liu et al., 18 Nov 2025).

| Node type | Example entities per frame | Connectivity |
|---|---|---|
| Human skeletons | $N$ players × $K$ joints | $E^{\mathrm{intra}}_t$ (pose topology) |
| Sport object | Ball or shuttlecock keypoint | $E^{p\text{-}b}_t$ |
| Contextual landmarks | Four court corners | $E^{p\text{-}c}_t$, $E^{c\text{-}c}_t$ |
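
As an illustration of this unified representation, the following is a minimal sketch (not the authors' released code) of how a per-frame adjacency matrix could be assembled. The node layout, the COCO-style joint indices for wrists and ankles, and the reduced skeleton topology are all assumptions made for the example.

```python
import numpy as np

def build_umeg_adjacency(num_players=2, num_joints=17):
    """Assemble a symmetric adjacency for one frame of the unified graph.

    Assumed node layout: player joints first, then one sport-object (ball)
    node, then four court-corner nodes.
    """
    n_player_nodes = num_players * num_joints
    ball = n_player_nodes                        # single sport-object node
    corners = [ball + 1 + i for i in range(4)]   # four contextual nodes
    V = n_player_nodes + 1 + 4
    A = np.zeros((V, V), dtype=np.float32)

    # E_intra: skeleton edges per player (a reduced COCO-style topology).
    skeleton_edges = [(5, 7), (7, 9), (6, 8), (8, 10),         # arms
                      (11, 13), (13, 15), (12, 14), (14, 16),  # legs
                      (5, 6), (11, 12), (5, 11), (6, 12)]      # torso
    wrists, ankles = [9, 10], [15, 16]           # assumed COCO joint indices

    for p in range(num_players):
        off = p * num_joints
        for i, j in skeleton_edges:
            A[off + i, off + j] = A[off + j, off + i] = 1.0
        for j in wrists + ankles:                # E_pb: hands/feet to the ball
            A[off + j, ball] = A[ball, off + j] = 1.0
        for j in ankles:                         # E_pc: feet to court corners
            for c in corners:
                A[off + j, c] = A[c, off + j] = 1.0

    for a, b in [(0, 1), (1, 2), (2, 3), (3, 0)]:  # E_cc: corner rectangle
        A[corners[a], corners[b]] = A[corners[b], corners[a]] = 1.0
    return A

A = build_umeg_adjacency()
print(A.shape)  # (39, 39): 2 players x 17 joints + 1 ball + 4 corners
```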

2. Spatio-Temporal Feature Extraction

Each $G_t$ is embedded into initial node features $H^{(0)}_t \in \mathbb{R}^{|V| \times d}$. The graph sequence is processed by $L$ stacked UMEG blocks, each comprising:

a. Spatial Graph Convolution

At layer $\ell$, features are updated as

$$H^{(\ell+1)} = \mathrm{ReLU}\left( A^{(\ell)} H^{(\ell)} W^{(\ell)} \right)$$

where $A^{(\ell)}$ is the adjacency, $W^{(\ell)}$ are trainable weights, and $H^{(\ell)} \in \mathbb{R}^{T \times |V| \times d}$ aggregates spatio-temporal neighborhood features.
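
A minimal PyTorch sketch of this per-layer update, assuming a fixed adjacency shared across frames and layers (per-layer adjacencies are simplified away here) and features of shape (T, |V|, d); the class and variable names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """One spatial GCN layer: H' = ReLU(A H W), applied independently per frame."""

    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)          # fixed |V| x |V| graph structure
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H):                             # H: (T, V, d_in)
        H = self.W(H)                                 # feature transform W^(l)
        H = torch.einsum("uv,tvd->tud", self.A, H)    # neighborhood aggregation A H
        return torch.relu(H)

# Usage with a placeholder adjacency (39 nodes as in the construction sketch above):
A = torch.eye(39)                                     # stand-in; use the real unified graph
layer = SpatialGraphConv(64, 64, A)
out = layer(torch.randn(96, 39, 64))                  # -> (96, 39, 64)
```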

b. Multi-Scale Parameter-Free Temporal Shift

Inspired by the Temporal Shift Module (TSM), each $H_t^{(\ell)}$ is channel-wise partitioned into a static component ($(1-2\alpha)d$ channels), a forward-shifted component ($\alpha d$), and a backward-shifted component ($\alpha d$), with $\alpha = 1/8$. For each shift scale $\Delta \in \{1, 2, 4\}$:

$$\widetilde{H}_t^{(\ell,\Delta)} = \left[\, H_{t,\mathrm{static}}^{(\ell)} \;\big|\; H_{t-\Delta,\mathrm{fwd}}^{(\ell)} \;\big|\; H_{t+\Delta,\mathrm{bwd}}^{(\ell)} \,\right]$$

Out-of-bounds indices are zero-padded. The spatial GCN is applied per scale, branches are reduced and fused, and the result is added to the residual:

$$H_t^{(\ell+1)} = F_2\!\left( \mathrm{ReLU}\left( U_t^{(\ell)} \right) \right) + H_t^{(\ell)}$$

where $U_t^{(\ell)}$ is the concatenation of the outputs at the different $\Delta$, each reduced via a linear map $F_1$. This expands the short-, mid-, and long-range temporal receptive field with zero additional parameters, letting the network aggregate relevant cues across temporal context and accommodate the rapid, subtle events typical of PES tasks.
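
The shift itself can be written in a few lines. The sketch below assumes per-clip tensors of shape (T, |V|, d) and shows only the parameter-free shift at one scale; the $F_1$/$F_2$ reduction-and-fusion steps are noted in comments, and all names are illustrative.

```python
import torch

def temporal_shift(H, delta, alpha=1/8):
    """Parameter-free temporal shift at scale `delta` for features H of shape (T, V, d).

    A (1 - 2*alpha) fraction of channels stays static, an alpha fraction is taken
    from frame t - delta (forward shift) and an alpha fraction from frame t + delta
    (backward shift); out-of-range frames are zero-padded.
    """
    T, V, d = H.shape
    n = int(alpha * d)
    static, fwd, bwd = H[..., : d - 2 * n], H[..., d - 2 * n : d - n], H[..., d - n :]

    fwd_shifted = torch.zeros_like(fwd)
    fwd_shifted[delta:] = fwd[:-delta]       # frame t receives channels from t - delta
    bwd_shifted = torch.zeros_like(bwd)
    bwd_shifted[:-delta] = bwd[delta:]       # frame t receives channels from t + delta

    return torch.cat([static, fwd_shifted, bwd_shifted], dim=-1)

H = torch.randn(96, 39, 64)
branches = [temporal_shift(H, delta) for delta in (1, 2, 4)]   # each (96, 39, 64)
# Each branch would then pass through the spatial GCN, be reduced (F_1), concatenated
# into U_t, fused (F_2), and added back to H as the residual, per the equations above.
```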

3. Multimodal Distillation Framework

To address the unreliability of keypoint detection and improve model robustness, UMEG-Net adopts a multimodal distillation paradigm in which the UMEG-Net encoder (teacher, $\varepsilon_{tch}$) provides graph-based representations $F_{tch} \in \mathbb{R}^{T \times d}$, and a separate RGB-only student encoder $\varepsilon_{stu}$ (VideoMAEv2 $\rightarrow$ bi-GRU $\rightarrow$ linear projection) learns to mimic the teacher's outputs using an L2 feature-matching loss:

$$L_{feat} = \frac{1}{T} \sum_{t=1}^{T} \left\| F_{tch}^{(t)} - F_{stu}^{(t)} \right\|_2^2$$

During distillation, the teacher parameters $\varepsilon_{tch}$ are frozen, and the student is updated to align with the structured representations. Because distillation can leverage large unlabeled sets, it extends UMEG-Net to scenarios where keypoints are unavailable or unreliable and enables high-quality RGB-based inference.
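
A minimal sketch of one distillation step under these definitions; the `teacher` and `student` modules are placeholders assumed to return per-frame features of shape (T, d), and the loss follows $L_{feat}$ above.

```python
import torch

def distillation_step(teacher, student, clip_keypoints, clip_rgb, optimizer):
    """One update of the RGB student against the frozen graph-based teacher."""
    teacher.eval()
    with torch.no_grad():                 # teacher parameters stay frozen
        f_tch = teacher(clip_keypoints)   # (T, d) graph-based features

    f_stu = student(clip_rgb)             # (T, d) RGB-based features
    # L_feat = (1/T) * sum_t || F_tch^(t) - F_stu^(t) ||_2^2
    loss = ((f_stu - f_tch) ** 2).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```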

4. Training Procedures for Few-Shot Event Spotting

UMEG-Net is optimized in two sequential stages tailored to few-shot constraints:

Stage 1: Graph-Based Few-Shot Adaptation

  • $k$ labeled clips ($k \in \{15, 25, 50, 100\}$) are randomly sampled from target domains, each containing multiple events.
  • Dense, per-frame classification is performed with cross-entropy, weighting event frames five-fold to address extreme class imbalance (only ~3% of frames are events).
  • AdamW optimizer with learning rate $1\times 10^{-3}$; three-step linear warm-up followed by cosine annealing over 50 (or 30) epochs (a loss/schedule sketch follows this list).
  • Data augmentation: random crops, color jitter.
  • Model selection based on validation $F_1$ and edit score.
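
A minimal sketch of the Stage 1 optimization setup described above, assuming event classes are distinguished from a background class; the number of classes, the model stand-in, the dummy data, and the reading of the "three-step" warm-up as three epochs are all illustrative assumptions.

```python
import math
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

num_classes, epochs, warmup_epochs = 8, 50, 3   # class 0 = background (assumed)

# Five-fold weight on event classes to counter the ~3% event-frame imbalance.
class_weights = torch.ones(num_classes)
class_weights[1:] = 5.0
criterion = nn.CrossEntropyLoss(weight=class_weights)

model = nn.Linear(64, num_classes)              # stand-in for the UMEG-Net classifier head
optimizer = AdamW(model.parameters(), lr=1e-3)

def warmup_cosine(epoch):
    """Linear warm-up for the first epochs, cosine annealing for the rest."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)

for epoch in range(epochs):
    frame_feats = torch.randn(96, 64)            # dummy per-frame features
    frame_labels = torch.randint(0, num_classes, (96,))
    loss = criterion(model(frame_feats), frame_labels)   # dense per-frame cross-entropy
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
```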

Stage 2: Distillation and RGB Student Finetuning

  • Unlabeled clips are used for distillation (learning rate $1\times 10^{-4}$, 50/30 epochs).
  • The student backbone is frozen; only the temporal block, localizer, and classifier are finetuned on labeled data for 10 epochs (learning rate $1\times 10^{-3}$), as sketched after this list.
  • Unlike prototypical few-shot paradigms, UMEG-Net trains directly on the $k$ labeled clips without explicit support-query alternation.
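
The freezing pattern can be expressed compactly; the module names and dimensions below are placeholders standing in for the VideoMAEv2 backbone, bi-GRU temporal block, localizer, and classifier.

```python
import torch.nn as nn
from torch.optim import AdamW

# Placeholder student: frozen RGB backbone + trainable temporal block and heads.
student = nn.ModuleDict({
    "backbone":   nn.Linear(768, 256),   # stand-in for the VideoMAEv2 encoder
    "temporal":   nn.GRU(256, 128, bidirectional=True, batch_first=True),
    "localizer":  nn.Linear(256, 2),
    "classifier": nn.Linear(256, 8),
})

for p in student["backbone"].parameters():   # backbone stays frozen in Stage 2
    p.requires_grad = False

finetune_params = [p for name, module in student.items() if name != "backbone"
                   for p in module.parameters()]
optimizer = AdamW(finetune_params, lr=1e-3)  # 10 epochs on the k labeled clips
```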

A plausible implication is that UMEG-Net’s two-stage regimen relaxes the dependency on abundant keypoint annotations and allows efficient adaptation to new domains with minimal supervision.

5. Empirical Evaluation and Results

UMEG-Net has been evaluated on multiple datasets: F³Set-Tennis (11,584 rallies), ShuttleSet (3,685), FineGym-BB (1,112), Figure Skating (371), and SoccerNet BAS. In few-shot settings ($k = 100$):

  • Compared to RGB SOTA methods (E2E-Spot, T-DEED, F³ED), UMEG-Net achieves absolute $F_1$ gains of +1.3–5.5% and edit-score gains of +1.3–16.4%.
  • Against skeleton-only GCNs (MSG3D, AAGCN, CTRGCN, STGCN++, ProtoGCN, BlockGCN), UMEG-Net surpasses the top baseline (BlockGCN) by +2.5% $F_1$ / +13.4% edit on F³Set-Tennis, and by +2.1% / +4.6% on ShuttleSet, using only 2.2M parameters.
  • The distilled RGB student (UMEG-Net$_{distill}$) provides a further +5.8% $F_1$ and +6.7% edit improvement, demonstrating the complementarity of the two modalities.
  • Under full supervision, UMEG-Net remains competitive with, or outperforms, E2E-Spot on 3/5 datasets.

Metrics include mean per-class $F_1$ under strict temporal windows ($\pm\delta$, with $\delta = 1$ frame or 1 second) and a Levenshtein-based edit score, which penalizes missing or misordered events.
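
As a concrete reading of the edit score, the sketch below computes a normalized Levenshtein distance over predicted and ground-truth event label sequences; the exact normalization and matching protocol used in the paper may differ.

```python
def edit_score(pred_events, gt_events):
    """Levenshtein-based edit score (%) between two event label sequences."""
    m, n = len(pred_events), len(gt_events)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred_events[i - 1] == gt_events[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return 100.0 * (1.0 - dist[m][n] / max(m, n, 1))

print(edit_score(["serve", "forehand", "volley"],
                 ["serve", "backhand", "volley"]))   # ~66.7: one substituted event
```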

6. Ablation Studies and Analysis

Extensive ablations support UMEG-Net’s core design:

  • Incremental addition of graph node types (skeleton $\rightarrow$ skeleton+object $\rightarrow$ skeleton+object+court) shows substantial cumulative improvement (e.g., $F_1$ on F³Set rises from 6.6 to 9.4).
  • Multi-scale temporal shift ($\Delta \in \{1, 2, 4\}$) distinctly outperforms restricted settings ($\{1\}$ or $\{1, 2\}$).
  • L2-based multimodal distillation exceeds self-supervised contrastive objectives (SimCLR-style) by a large margin.

This evidence indicates that the unified multi-entity representation, parameter-free temporal augmentation, and cross-modal knowledge transfer are pivotal to UMEG-Net’s superior few-shot PES accuracy.


UMEG-Net thus constitutes a lightweight, scalable, and empirically validated solution for few-shot, fine-grained, frame-level event recognition in structured video domains, with documented advantages over both pixel-based and skeleton-only baselines under stringent evaluation protocols (Liu et al., 18 Nov 2025).
