Classification of Tennis Actions Using Deep Learning

Published 4 Feb 2024 in cs.CV and cs.LG | (2402.02545v1)

Abstract: Recent advances in deep learning make it possible to identify specific events in videos with greater precision. This has great relevance in sports like tennis, e.g., to automatically collect game statistics or to replay actions of specific interest for game strategy or player improvement. In this paper, we investigate the potential and the challenges of using deep learning to classify tennis actions. Three models of different sizes, all based on the deep learning architecture SlowFast, were trained and evaluated on the academic tennis dataset THETIS. The best models achieve a generalization accuracy of 74%, demonstrating good performance for tennis action classification. We provide an error analysis for the best model and pinpoint directions for improvement of tennis datasets in general. We discuss the limitations of the dataset, general limitations of current publicly available tennis datasets, and future steps needed to make progress.

Summary

The paper demonstrates that the SlowFast 4x16 model achieves 74% accuracy on tennis action recognition using the THETIS RGB dataset.
The paper leverages a dual-stream CNN architecture, using ResNet-50 variants to efficiently capture spatial features and temporal dynamics.
The paper highlights dataset limitations—such as missing ball and court context—and suggests multimodal approaches for enhanced real-world applicability.

Classification of Tennis Actions Using Deep Learning: An Authoritative Analysis

Background and Motivation

The application of deep learning to action recognition in sports video analytics is particularly challenging due to the dense, fine-grained events characteristic of games like tennis. Traditional hand-engineered features, while historically competitive, are highly domain-specific and lack generalizability across datasets with varying scene composition and player skillsets. The SlowFast architecture, previously demonstrated as performant across multiple benchmarks (e.g., Kinetics 400/600, AVA), is leveraged here to interrogate its capacity for domain-specific tennis action recognition using the THETIS RGB dataset – a cornerstone academic benchmark in this field.

Figure 1: Example of a basketball court background as featured in the THETIS dataset.

The THETIS RGB Dataset: Structure and Limitations

THETIS comprises 1,980 RGB videos from 55 players (predominantly amateur/intermediate), executing 12 distinct tennis actions, where each shot is performed thrice per player. Crucially, these recordings omit ball usage and court context, limiting discriminative cues to player pose and racquet movement. The environments alternate between a basketball court (dynamic background) and a changing room (mirror reflections and minimal movement).

THETIS’s action classes span fine-grained distinctions within serves, forehands, backhands, and volleys. Distribution across classes is balanced but video durations vary (2-5 seconds), potentially introducing variance in intra-class representation.
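The dataset figures quoted above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers in the text; the chance-level figure assumes perfectly balanced classes, which the text states only approximately.

```python
# Figures taken from the dataset description above.
num_players = 55
num_classes = 12
repetitions_per_player = 3

# Every player performs every action three times.
total_videos = num_players * num_classes * repetitions_per_player
videos_per_class = num_players * repetitions_per_player

# Accuracy of random guessing, assuming perfectly balanced classes.
chance_accuracy = 1 / num_classes

print(total_videos)               # 1980
print(videos_per_class)           # 165
print(f"{chance_accuracy:.1%}")   # 8.3%
```

The chance level is a useful reference point when reading the accuracy figures reported for the three models.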

Figure 2: Number of videos per class in the THETIS dataset arranged by length of the videos.

Model Architecture: SlowFast and its Deployment

SlowFast is a dual-stream CNN architecture in which the "slow" pathway samples frames sparsely to maximize spatial feature extraction, and the "fast" pathway samples frames densely to capture temporal dynamics. In this study, three ResNet-50-based SlowFast variants (2x32, 4x16, 8x8) are trained on THETIS. The 4x16 configuration, with a stride of 16 in the slow path and eightfold denser sampling in the fast path, achieves the best performance of the three.
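To make the two sampling rates concrete, the sketch below lists which frame indices each pathway would see in one clip. It assumes the variant names follow the frames-by-stride convention (for 4x16: 4 slow-pathway frames at a temporal stride of 16) and that the fast pathway is eight times denser, as stated above. The function name `slowfast_indices` and its parameters are illustrative and do not come from the paper or from any library.

```python
def slowfast_indices(num_slow_frames, slow_stride, alpha=8):
    """Return (slow, fast) frame indices for a single clip.

    num_slow_frames: number of frames the slow pathway sees
    slow_stride:     temporal stride of the slow pathway, in raw frames
    alpha:           how many times denser the fast pathway samples
    """
    if slow_stride % alpha != 0:
        raise ValueError("slow_stride must be divisible by alpha")
    fast_stride = slow_stride // alpha
    # Number of raw frames spanned by one clip.
    clip_length = num_slow_frames * slow_stride
    slow = list(range(0, clip_length, slow_stride))
    fast = list(range(0, clip_length, fast_stride))
    return slow, fast


# The 4x16 variant: 4 slow frames at stride 16, fast pathway 8x denser.
slow, fast = slowfast_indices(num_slow_frames=4, slow_stride=16)
print(slow)       # [0, 16, 32, 48]
print(len(fast))  # 32
print(fast[:4])   # [0, 2, 4, 6]
```

Under these assumptions a 4x16 clip spans 64 raw frames, of which the slow pathway sees 4 and the fast pathway sees 32.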

Figure 3: SlowFast architecture schematic illustrating spatial and temporal pathway integration.

Empirical Results and Model Comparison

The SlowFast 4x16 model achieves a generalization accuracy of 74% on the THETIS RGB test split. This represents an advance over prior benchmarks: earlier studies report F1 scores of 47% for domain-specific action recognition using baseline CNN-RNN architectures and 60% on depth-based modalities, and no published THETIS RGB result exceeds the 74% reported here. Note that accuracy and F1 are different metrics, so the comparison is indicative rather than exact.

Training metrics show a consistent decrease in error across epochs. The training and validation curves diverge beyond epoch 125, which indicates overfitting; this was mitigated by early stopping at epoch 196.
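Early stopping can be sketched as a simple rule on the validation loss: stop once it has not improved for a fixed number of epochs. The text does not state which criterion or patience the authors used, so the `patience` parameter and the loss values below are purely hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops at the first epoch where the validation loss has not
    improved for `patience` consecutive epochs. If that never happens,
    stop_epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses, not taken from the paper.
losses = [0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.7]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

In practice the weights from `best_epoch` are the ones kept for evaluation, since later epochs only fit the training set more closely.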

Figure 4: Learning curves for the SlowFast 4x16 model trained on THETIS RGB, revealing sustained generalization until late epochs.

Model comparison shows that the 2x32 variant fails to converge (test accuracy ~7%, roughly the 1/12 ≈ 8.3% chance level for 12 balanced classes), while the 8x8 configuration underperforms 4x16 by 2%. This suggests that fine-grained event classification is sensitive to the temporal stride hyperparameters.

Error Analysis: Confusion Matrix and Manual Categorization

Classification performance is heterogeneous across classes: accuracy varies from 38% to 100%. Serve variants (flat, slice, kick) and shots with subtle movement differences (slice vs volley, smash vs serve) exhibit high confusion rates, reflecting the dataset’s absence of ball trajectory cues and court position information.

Figure 5: Confusion matrix for SlowFast 4x16 on THETIS RGB, highlighting inter-class confusion and variable per-class accuracy.
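Per-class accuracy of the kind discussed here is read off the diagonal of a confusion matrix. The sketch below shows the computation on a small three-class example; the counts are hypothetical and are not the values from Figure 5.

```python
def per_class_accuracy(confusion):
    """Rows are true classes, columns are predicted classes."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def overall_accuracy(confusion):
    correct = sum(row[i] for i, row in enumerate(confusion))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical counts for three serve classes, not taken from the paper.
labels = ["flat serve", "slice serve", "kick serve"]
confusion = [
    [8, 1, 1],   # true flat serve
    [3, 5, 2],   # true slice serve
    [0, 0, 10],  # true kick serve
]

for label, acc in zip(labels, per_class_accuracy(confusion)):
    print(f"{label}: {acc:.0%}")
print(f"overall: {overall_accuracy(confusion):.1%}")
```

Off-diagonal entries in a row show where that class's errors go, which is how confusions such as slice versus volley are identified.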

Error categorization shows that 44.4% of misclassifications are due to serve confusions, 20.4% to slice/volley ambiguity, and 16.7% to smash/serve overlap. Beginner skill level is implicated in 9.3% of errors. These results are consistent with the hypothesis that missing environmental cues – ball, court position – critically impair discriminative power for both the model and human observers.
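The reported shares can be summed to see how much of the error mass they account for. This check assumes the four categories are mutually exclusive, which the text does not state explicitly.

```python
# Shares of misclassifications as reported above, in percent.
reported_shares = {
    "serve confusions": 44.4,
    "slice/volley ambiguity": 20.4,
    "smash/serve overlap": 16.7,
    "beginner skill level": 9.3,
}

# Assumes the categories do not overlap.
covered = round(sum(reported_shares.values()), 1)
remainder = round(100 - covered, 1)

print(covered)    # 90.8
print(remainder)  # 9.2
```

Under that assumption, roughly 9% of the errors fall into categories not itemized in this summary.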

Practical Implications and Theoretical Considerations

This study substantiates the efficacy of the SlowFast architecture for domain-specific tennis action recognition when provided with well-balanced, albeit limited, video data. The strong improvement over prior RGB-based benchmarks suggests architectural advances in spatio-temporal modeling are pivotal, yet incomplete in the absence of context-rich inputs.

The observed challenges emphasize the necessity of dataset completeness: action recognition in tennis, when stripped of environmental and object cues (ball trajectory, court location), reduces to ambiguous pose-based distinctions. For deployment in real-world analytics (automated match statistics, coaching feedback), future datasets should incorporate high-resolution video from actual matches, ball tracking, and spatial localization.

Speculation on Future Directions

Further research should focus on capturing multimodal data (RGB, depth, skeletal, object tracking) in professional match contexts with dense event annotation. Transfer learning from THETIS-trained SlowFast models could seed initial weights for larger, real-world datasets. Incorporation of self-supervised and contrastive learning paradigms on event-rich tennis datasets may unlock superior generalization on unseen player actions.

Synthetic data generation (simulated ball and player movement) may mitigate the domain gap between academic datasets and broadcast matches. Hierarchical architectures combining low-level pose imitation with high-level event planning, as explored in Vid2Player3D, could further extend the modeling capacity for analytic applications.

Conclusion

Deep learning-based video classification, exemplified by the SlowFast architecture, demonstrates robust performance on domain-specific tennis action recognition in the THETIS RGB dataset, achieving 74% accuracy and outperforming prior benchmarks. However, critical dataset limitations – absence of ball/court context and reliance on amateur performance – constrain practical application. Advancing tennis video analytics necessitates comprehensive, high-fidelity datasets and continual architectural innovation to capture and distinguish the subtle spatio-temporal cues integral to fine-grained sporting events.
