Tennis Action Recognition

Updated 22 April 2026

Tennis action recognition is the process of identifying, segmenting, and classifying tennis-specific actions using multimodal data such as video, IMUs, and motion capture.
Methods such as 3D ConvNets, CNN-LSTM models, and SVM pipelines extract spatial-temporal features and biomechanical cues, with reported accuracies of roughly 74% to 95% on their respective tasks.
The field supports practical applications in personalized coaching, live match analytics, and tactical evaluation while addressing challenges like occlusion and dataset limitations.

Tennis action recognition is the computational task of identifying, segmenting, and categorizing tennis-specific player actions and attributes (such as strokes, skill levels, and movement phases) using sensor data, video streams, or a fusion of multimodal inputs. This domain encompasses both fine-grained gesture recognition (e.g., differentiating serve types) and phase segmentation (e.g., isolating backswing or follow-through), and supports applications in automatic analytics, tactical evaluation, and equitable coaching. The field integrates methods from supervised learning, deep learning, spatiotemporal signal processing, and biomechanics, and faces unique challenges due to rapid ball/player motion, occlusion, and inter-player variability.

1. Benchmark Datasets and Domain-Specific Sensor Modalities

Tennis action recognition leverages several benchmark datasets and a range of sensor platforms:

Video Datasets: The THETIS dataset (Hovad et al., 2024, Dashore et al., 4 Oct 2025) is the primary open-source benchmark, reported with between 1,980 and 8,374 video clips across the cited works, labeled across 12 fine-grained stroke types (including various forehands, backhands, serves, volleys, and smashes). Clips are ~2–5 s, recorded at ~25–30 fps, and include RGB, depth, and 3D skeleton streams. Other datasets, such as ACASVA and TenniSet, focus respectively on short match segments with “hit”/“serve” annotations and full-match Olympic broadcasts with coarse event labels (Wu et al., 2022).
Wearable IMUs: Wrist-mounted inertial measurement units (IMUs), such as those available on Apple Watch or Xsens platforms, are used for direct motion capture. Dominant-arm IMU data corresponds closely with racket motion, enabling phase and skill classification (Gao et al., 2024), while passive-arm IMUs (placed on the non-dominant arm) facilitate minimally invasive analytics, achieving comparable shot recognition performance (classification accuracy: dominant 90.1%; passive 88.2%) (Park et al., 31 Jul 2025).
Marker-Based Motion Capture: Laboratory-grade marker systems (e.g., BTS SMART-e 900, 22 markers) afford sub-millimeter 3D accuracy, crucial for fine-grained biomechanical analysis and coach-driven labeling (Bačić, 2018).
Ball Trajectory Cameras: In table tennis (with direct portability to tennis), single- or multi-camera setups (e.g., GoPro Hero, high-fps) are used to track ball coordinates for action recognition based solely on ball flight path (Kulkarni et al., 2023).

Each data modality presents distinct tradeoffs in coverage, fidelity, and on-field deployability.

2. Deep Learning and Machine Learning Approaches

Recent advances draw from deep convolutional and recurrent models, hybrid attention mechanisms, and classical ML pipelines:

Video-Based Deep Models: Video architectures such as the two-pathway SlowFast ResNet-50 (Hovad et al., 2024), twin spatio-temporal CNNs that combine RGB and optical flow (Martin et al., 2022, Martin et al., 2020), and transformer-based models (VideoMAE) (He et al., 2024) are designed to capture both spatial context (body pose, racket orientation) and rapid temporal transitions (stroke onset/offset). Historical LSTM variants that weight frame contributions achieve performance improvements over vanilla LSTM baselines (THETIS: 0.74 vs. 0.56 accuracy) (Cai et al., 2018).
Sensor-Based ML Pipelines: RBF-kernel SVMs with PCA-reduced IMU features distinguish skill levels (77.1% accuracy) and swing phases (ROC-AUC up to 0.87) (Gao et al., 2024). RBF networks using polynomial regressions of sweet-spot marker trajectories deliver personalized swing assessments independent of ball contact (accuracy 84.5–94.6%) (Bačić, 2018). Multi-stage temporal convolutional networks (MS-TCN) paired with attention-enhanced frequency-domain CNNs achieve shot detection F₁ of 86.0% (passive IMU) (Park et al., 31 Jul 2025).
Biomechanical Feature Fusion: CNN-LSTM models process synchronized RGB video and 3D skeleton time series, extracting joint angles, limb velocities, kinetic chain metrics, trunk rotations, and coordinate impact events for both action classification (test accuracy 79.17%) and expert/novice differentiation (e.g., experts show shorter preparations and higher peak racket speeds) (Dashore et al., 4 Oct 2025).
Trajectory-Only Models: Temporal convolutional networks operating exclusively on detected 2D (or 3D) ball trajectories recognize six stroke types in table tennis with accuracy up to 87.2%, providing a minimally invasive, occlusion-robust solution (Kulkarni et al., 2023).
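To make the sensor-based pipeline more concrete, the sketch below shows how an RBF-kernel SVM scores a feature vector once it has been trained. It is a minimal illustration, not the implementation from the cited work: the support vectors, coefficients, bias, and `gamma` are hypothetical toy values, and the two-dimensional input merely stands in for PCA-reduced IMU features.

```python
import math

# Hypothetical toy model (not taken from any cited paper): two support
# vectors in a 2-D feature space standing in for PCA-reduced IMU features.
SUPPORT_VECTORS = [(1.0, 1.0), (-1.0, -1.0)]
DUAL_COEFS = [1.0, -1.0]  # alpha_i * y_i for each support vector
BIAS = 0.0
GAMMA = 0.5


def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svm_decision(x):
    """Kernel SVM decision value: sum_i coef_i * K(sv_i, x) + bias."""
    score = sum(
        coef * rbf_kernel(sv, x, GAMMA)
        for sv, coef in zip(SUPPORT_VECTORS, DUAL_COEFS)
    )
    return score + BIAS


def predict_skill(x):
    """Map the sign of the decision value to a (hypothetical) skill label."""
    return "intermediate" if svm_decision(x) >= 0 else "beginner"
```

A feature vector close to the first support vector, such as `(0.9, 1.2)`, yields a positive decision value and is labeled "intermediate", while one near the second support vector is labeled "beginner". In practice the support vectors and coefficients would come from training, and `gamma` would be tuned on held-out data.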

The table below summarizes select reported performances (all metrics trace to cited sources):

Method/Modality	Reported Accuracy / F₁	Notable Features
SlowFast R50 (video, THETIS)	73.96%	RGB-only, strong spatial/temporal disentangling
Twin 3D CNN + 3D attention (video)	87.3%	RGB+flow, attention blocks, windowed strokes
IMU RBF SVM (dominant wrist)	77.1%	Skill/phase SVM, PCA reduction
IMU CNN + attention (passive wrist)	88.2% (class.), 86.0% (det.)	Frequency-banded, single-device, comfort focus
Marker RBF personalized	84.5–94.6%	Custom coaching labels, error detection
CNN–LSTM w/ biomechanics	79.17%	Joint kinematics, biomechanics, LLM feedback
Ball trajectory TCN (table tennis)	87.2%	2D trajectory only, real-time, minimal data

3. Task Decomposition and Action Taxonomy

Tennis action recognition spans several core sub-tasks, each addressed by different model pipelines:

Stroke-Level Recognition: Fine-grained classification into canonical shot types (flat serve, kick serve, slice serve, forehand/backhand varieties, volleys, smashes). State-of-the-art deep models on curated datasets (THETIS) achieve ~74–87% accuracy, with expert labeling required for subtle distinctions (e.g., slice vs. topspin) (Hovad et al., 2024, Martin et al., 2020, Dashore et al., 4 Oct 2025).
Skill Level and Error Detection: IMU- and marker-based methods classify players as beginner or intermediate, and flag technique errors, with systems flexible to personalized coach criteria (Gao et al., 2024, Bačić, 2018).
Phase Segmentation: Individual strokes are segmented into biomechanically and temporally distinct phases (backswing, backloop, forward swing, follow-through, recovery) via changepoint detection (e.g., the PELT algorithm), followed by SVM- or RNN-based phase labeling. ROC-AUC for phase segmentation ranges from 0.74 to 0.87 across phases (Gao et al., 2024).
Action Detection in Continuous Video: Sliding window-based convnets, transformer-based perception backbones, and explicit segmentation heads (e.g., ViSTec's binary cross-entropy cosine-bump loss) identify shot onsets and ends within untrimmed video, achieving F1@10% of 79.3 on racket-sport segmentation, with comparable success in tennis (He et al., 2024).
Biomechanical Feedback Generation: Numeric features extracted per stroke (joint angles, velocities, timing) are compared to stroke-specific expert-defined ranges, with deviations expressed in relative error and then grounded into LLM-generated actionable language feedback (Dashore et al., 4 Oct 2025).
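The changepoint step of phase segmentation can be sketched with a small PELT implementation. This is an illustrative version only: it assumes a one-dimensional signal (for example a single IMU channel), a mean-shift (squared-error) segment cost, and a hand-picked `penalty`, none of which are specified in the cited work.

```python
def pelt_mean_shift(signal, penalty):
    """Return changepoint indices (start positions of new segments).

    Uses PELT with a squared-error cost, i.e. each segment is modeled by
    its mean. The cost and penalty are assumptions made for illustration.
    """
    n = len(signal)
    # Prefix sums allow any segment cost to be evaluated in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(signal):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(a, b):
        """Sum of squared deviations from the mean over signal[a:b]."""
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = [0.0] * (n + 1)   # best[t]: optimal penalized cost of signal[:t]
    best[0] = -penalty       # so that the first segment is not penalized
    last_cp = [0] * (n + 1)  # last_cp[t]: start of the final segment
    candidates = [0]
    for t in range(1, n + 1):
        best[t], last_cp[t] = min(
            (best[s] + cost(s, t) + penalty, s) for s in candidates
        )
        # Pruning: drop starts that can never be optimal for a later t.
        candidates = [s for s in candidates if best[s] + cost(s, t) <= best[t]]
        candidates.append(t)

    # Backtrack from the end of the signal to recover the changepoints.
    changepoints = []
    t = n
    while t > 0:
        t = last_cp[t]
        if t > 0:
            changepoints.append(t)
    return sorted(changepoints)
```

On a toy signal with three flat levels, such as ten samples at 0, ten at 5, and ten at 0, a penalty of 1.0 recovers the two boundaries at indices 10 and 20. The resulting segments would then be passed to a separate classifier for phase labeling; larger penalties yield fewer, longer segments.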

4. System Architectures and Training Regimens

Architectures are adapted to data modality and operational constraints:

3D ConvNets and Attention Blocks: Late fusion of spatial (RGB) and temporal (optical flow) streams, enhanced by 3D attention modules gating discriminative spatiotemporal regions, accelerates convergence and improves accuracy by 5–12% over non-attention baselines, at modest parameter cost (<1M parameters in total) (Martin et al., 2020).
Transformer and Graph-Based Pipelines: ViSTec's VideoMAE encoder ingests short temporal slices to capture local spatial/temporal features. A contextual knowledge graph representing legal technique transitions biases the classification decision, yielding consistent gains over action segmentation baselines in both F1 and edit score while remaining robust to occlusions and long rallies (He et al., 2024).
IMU and Marker-Based Models: Lightweight classifiers based on SVMs, RBF networks, or shallow CNNs deliver efficient real-time inference. Dimensionality reduction (PCA, polynomial fitting) is critical for generalization on small datasets and field deployment (Gao et al., 2024, Bačić, 2018).
Real-Time and Low-Latency Deployment: For on-device applications (e.g., Apple Watch, Samsung Galaxy Watch), models are quantized (e.g., TensorFlow Lite), and power management strategies—such as sliding-window processing and adaptive sampling—are employed to meet computational and energy constraints (Gao et al., 2024, Park et al., 31 Jul 2025).
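One simple way to picture how a graph of legal technique transitions can bias a classifier is to down-weight classes that cannot follow the previously recognized technique and then renormalize. The sketch below is an assumption made for illustration and is not ViSTec's actual formulation: the transition table `LEGAL_NEXT`, the class names, and the `illegal_weight` parameter are all hypothetical.

```python
# Hypothetical transition graph: which techniques may follow which.
LEGAL_NEXT = {
    "serve": {"forehand", "backhand"},
    "forehand": {"forehand", "backhand", "volley", "smash"},
    "backhand": {"forehand", "backhand", "volley", "smash"},
    "volley": {"forehand", "backhand", "volley", "smash"},
    "smash": {"forehand", "backhand", "volley"},
}


def apply_transition_prior(probs, previous, illegal_weight=0.05):
    """Down-weight classes that cannot legally follow `previous`.

    probs: dict mapping class label -> classifier probability.
    Returns a renormalized dict with the same labels.
    """
    allowed = LEGAL_NEXT[previous]
    weighted = {
        label: p * (1.0 if label in allowed else illegal_weight)
        for label, p in probs.items()
    }
    total = sum(weighted.values())
    return {label: w / total for label, w in weighted.items()}


def most_likely(probs):
    """Return the label with the highest probability."""
    return max(probs, key=probs.get)
```

For example, if the raw classifier output after a serve is `{"serve": 0.45, "forehand": 0.40, "backhand": 0.15}`, the prior suppresses the implausible second serve and the adjusted prediction becomes "forehand". A soft weight rather than a hard mask keeps the system from failing outright when the previous prediction was itself wrong.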

5. Practical Applications and System Integration

Recognized actions serve a spectrum of downstream applications:

Equitable and Personalized Coaching: Affordable, device-integrated IMU solutions democratize access to tennis coaching, with personalized skill or error feedback adaptable to individual or group instruction (Gao et al., 2024, Bačić, 2018).
Live Match Analytics and Tactical Analysis: Broadcast video models handle continuous untrimmed streams, enabling segmental tagging, rally breakdown, and tactical graph modeling for player evaluation and strategic insights (He et al., 2024, Wu et al., 2022).
Explainable Feedback: Merging biomechanical interpretation with LLM-based language feedback aids in forming actionable, comprehensible advice for technique improvement in both automated and assisted coaching contexts (Dashore et al., 4 Oct 2025).
Low-Burden Wearable Analytics: Passive-arm IMU approaches minimize player discomfort without degrading performance, increasing system acceptability for practical match or practice scenarios (Park et al., 31 Jul 2025).
Future Exergame Platforms: Trajectory-only and marker-agnostic models permit privacy-respecting, on-the-fly segmentation and recognition for gamified or tele-coaching environments (Kulkarni et al., 2023).
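The comparison of per-stroke features against expert-defined ranges, which underlies the explainable feedback described above, can be sketched as follows. The feature names and ranges in `EXPERT_RANGES` are hypothetical placeholders rather than values from the cited work, and the relative-error convention (deviation from the nearest bound, divided by that bound) is one plausible choice among several.

```python
# Hypothetical expert ranges (illustrative values only, all bounds positive).
EXPERT_RANGES = {
    "elbow_angle_deg": (140.0, 170.0),
    "peak_racket_speed_mps": (25.0, 35.0),
}


def relative_deviation(value, low, high):
    """Signed relative error versus the nearest bound; 0.0 inside the range.

    Assumes positive bounds, so dividing by the bound is well defined.
    """
    if value < low:
        return (value - low) / low
    if value > high:
        return (value - high) / high
    return 0.0


def describe_deviations(features, expert_ranges):
    """Turn measured features into short statements for a feedback prompt."""
    lines = []
    for name, value in features.items():
        low, high = expert_ranges[name]
        dev = relative_deviation(value, low, high)
        if dev == 0.0:
            lines.append(f"{name}: within expert range")
        else:
            side = "below" if dev < 0 else "above"
            lines.append(f"{name}: {abs(dev) * 100:.1f}% {side} expert range")
    return lines
```

Statements of this kind could then be placed into a language-model prompt so that the generated advice stays grounded in measured quantities rather than free-form description.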

6. Limitations, Challenges, and Research Directions

Current systems contend with:

Dataset Scale and Diversity: Most work is conducted on small, class-balanced, static-background datasets (e.g., THETIS, ACASVA), with limited real-world heterogeneity. Expansion to richer multimodal, multi-view, and large-scale corpora is a central open problem (Wu et al., 2022, Hovad et al., 2024).
Fine-Grained Class Imbalance and Ambiguity: Rare shots (e.g., kick serve, smash, drop shot) and visually similar categories present persistent confusion, with confusion matrix analyses showing ~40% recall for some serve types (Hovad et al., 2024, Dashore et al., 4 Oct 2025).
Occlusion and Motion Blur: Rapid ball/player movement, camera motion, and partial occlusion in broadcast video or player clusters impede accurate recognition, especially for distal joints and ball/racket contacts (He et al., 2024).
Sensor Fusion Complexity: Joint use of video, skeleton pose, IMUs, and possibly force/EMG inputs requires robust multimodal fusion and calibration, as well as on-device efficiency (Gao et al., 2024, Dashore et al., 4 Oct 2025).
Real-World Generalization: Transfer learning and domain adaptation from site-labeled datasets to open-world match scenarios remains a major challenge. Synthetic augmentation, self-supervised learning, and transformer-based sequence modeling are current areas of active investigation (Hovad et al., 2024, Wu et al., 2022).
Interpretability and Actionable Feedback: Bridging the gap between black-box predictions and biomechanical or coaching explanations is addressed through explicit feature extraction and LLM-based reporting, yet standard interpretability metrics are rarely deployed (Dashore et al., 4 Oct 2025).
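Per-class recall, the quantity behind the confusion-matrix observation about serve types, is computed by dividing each diagonal entry by its row total. The sketch below uses invented toy counts purely for illustration; they are not results from any cited study.

```python
def per_class_recall(confusion, labels):
    """Compute recall for each class.

    confusion[i][j] is the number of samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        # Recall is undefined for a class with no true samples.
        recalls[label] = confusion[i][i] / row_total if row_total else float("nan")
    return recalls


# Invented toy counts (rows: true class, columns: predicted class).
LABELS = ["flat_serve", "kick_serve", "slice_serve"]
TOY_CONFUSION = [
    [18, 1, 1],
    [7, 8, 5],
    [2, 3, 15],
]
```

With these toy counts, only 8 of the 20 true kick serves are predicted correctly, so that class has a recall of 0.4 even though overall accuracy looks considerably higher. Reporting per-class recall alongside accuracy makes this kind of weakness visible.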

Tennis action recognition shares methodological foundations with other racket sports but presents unique demands:

Compared to table tennis, tennis requires handling larger spatial contexts, higher player mobility, greater court-induced occlusion, and multi-second stroke durations (Wu et al., 2022, He et al., 2024).
Unlike figure skating or diving (where aerial pose and holistic quality are primary), tennis recognition depends more on discrete segmental actions, specific body part tracking (especially wrist and racket), and ball–racket interactions.
The ball remains a critical—but sometimes elusive—cue; trajectory-only methods from table tennis must be extended to 3D and enriched with pose or impact data for lawn tennis (Kulkarni et al., 2023).
Successful systems in tennis are those able to integrate multimodal spatiotemporal information, fuse or prioritize modalities given missing data, and scale to open-world deployment scenarios.
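Fusing modalities when some of them are missing can be approached in many ways. A minimal sketch, assuming each modality produces its own class-probability dictionary and that fixed reliability weights are available, is a weighted average over whichever modalities are present. The modality names, weights, and function name below are hypothetical.

```python
def fuse_available(modality_probs, weights):
    """Weighted average of class probabilities over available modalities.

    modality_probs: dict mapping modality name -> {label: prob}, or None
                    when that modality is missing for the current stroke.
    weights: dict mapping modality name -> non-negative reliability weight.
    """
    present = {m: p for m, p in modality_probs.items() if p is not None}
    if not present:
        raise ValueError("no modality available")
    total_weight = sum(weights[m] for m in present)
    labels = list(next(iter(present.values())).keys())
    return {
        label: sum(weights[m] * present[m][label] for m in present) / total_weight
        for label in labels
    }


# Hypothetical reliability weights for two modalities.
WEIGHTS = {"video": 0.6, "imu": 0.4}
```

When the video stream is occluded and reported as `None`, the fused output simply equals the IMU prediction; when both are present, each contributes in proportion to its weight. Learned or confidence-dependent weights would be a natural refinement of this fixed-weight scheme.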

Ongoing research is focused on multimodal synchronization, transfer to broadcast and in-the-wild data, and the development of explainable, context-sensitive, and equitable action recognition for both elite and recreational tennis.