Shot2Tactic-Caption Dataset for Badminton Analysis
- The Shot2Tactic-Caption Dataset is a multi-scale badminton video captioning resource that provides both shot-level and tactic-level annotations for in-depth tactical analysis.
- It includes 5,494 shot captions and 544 tactic captions, enabling models to learn both local actions and broader strategic play through detailed temporal segmentation.
- The accompanying framework pairs shot-wise prompt guidance with a dual-branch architecture, enabling precise temporal alignment and semantic conditioning in caption generation.
The Shot2Tactic-Caption Dataset is the first annotated multi-scale badminton video captioning dataset designed for semantic and temporal analysis of tactical execution. Introduced as part of the Shot2Tactic-Caption framework (Ding et al., 16 Oct 2025), it provides both fine-grained and high-level natural language descriptions, enabling the training and quantitative evaluation of models that generate semantic captions at the shot and tactic levels. The dataset contains 5,494 shot captions and 544 tactic captions, serving as a critical resource for multi-scale temporal reasoning and tactical understanding in badminton video analysis.
1. Dataset Structure and Annotation Protocols
The Shot2Tactic-Caption Dataset comprises two principal caption categories:
- Shot-Level Captions: Short textual descriptions encapsulating individual shots (each approximately 0.7 seconds in duration). These captions focus on local actions such as net play, backhand flick, or smash.
- Tactic-Level Captions: Longer textual segments describing how a sequence of shots collectively constitutes a tactical unit. Each tactic caption captures the temporal semantics as well as the evolving nature of strategic execution.
Annotation protocols involve segmenting the raw video footage into discrete shots using T-DEED, followed by grouping contiguous shots into candidate tactic units (typically in windows of 5, 7, or 9 shots). Subsequently, human annotators assign shot and tactic captions. Tactic-level annotation further specifies tactic type (from 9 classes, e.g., “Continuous Smashing”) and tactic state (from 5 states: “Start”, “Continue”, “Interrupt”, “Resume”, “Finish”), reflecting a dynamic stratification inspired by control process modeling.
| Caption Type | Number of Instances | Temporal Scope |
|---|---|---|
| Shot Caption | 5,494 | ~0.7 seconds (single shot) |
| Tactic Caption | 544 | Multiple shots (5–9 window) |
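To make the annotation protocol concrete, the sketch below enumerates candidate tactic units over T-DEED-segmented shots with 5/7/9-shot sliding windows and shows one possible shape for the resulting annotation records. The dataclass layout and field names are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ShotAnnotation:
    """One T-DEED-segmented shot with its caption (illustrative schema)."""
    start_sec: float
    end_sec: float
    caption: str                                      # e.g. "Player A plays a tight net shot"

@dataclass
class TacticAnnotation:
    """A candidate tactic unit spanning several contiguous shots."""
    shot_indices: List[int]                           # indices into the rally's shot list
    tactic_type: str = ""                             # one of 9 classes, e.g. "Continuous Smashing"
    states: List[str] = field(default_factory=list)   # per-shot: Start/Continue/Interrupt/Resume/Finish
    caption: str = ""                                 # tactic-level description

def candidate_tactic_units(num_shots: int, window_sizes=(5, 7, 9)):
    """Group contiguous shots into candidate tactic units using sliding windows,
    mirroring the 5/7/9-shot grouping described in the annotation protocol."""
    for w in window_sizes:
        for start in range(num_shots - w + 1):
            yield TacticAnnotation(shot_indices=list(range(start, start + w)))

# Example: enumerate candidates for a 12-shot rally before human annotation.
candidates = list(candidate_tactic_units(num_shots=12))
```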
2. Integration Within the Shot2Tactic-Caption Framework
The dataset is tightly coupled with the Shot2Tactic-Caption framework. Its shot captions provide supervision for the network’s shot branch, while tactic captions—with associated tactic type and state labels—enable supervised training of the tactic branch and the Tactic Unit Detector.
The segmentation and annotation align with the dual-branch architecture: shot-level captions optimize recognition of temporally-local events, and tactic-level captions anchor temporal and semantic reasoning over variable-length shot sequences. This design supports modular ablations and facilitates direct performance comparison across both granularities.
3. Tactic Unit Detector and State Annotation
Each candidate tactic unit within the dataset is classified by a binary Tactic Unit Detector to determine if it is a valid tactical segment. Valid units are then annotated with:
- Tactic Type: One of 9 predefined classes (e.g., “Continuous Smashing”, “Flick Serve Attack”).
- Tactic State: Categorical labels denoting dynamic progression, including “Start”, “Continue”, “Interrupt”, “Resume”, and “Finish”. This models the non-linear, occasionally interrupted trajectories of tactical play.
The labels are used as supervised signals in auxiliary classifiers and as structured prompts for downstream captioning models, enabling explicit semantic conditioning and temporal alignment.
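A minimal sketch of how the detector's outputs could be organized is shown below, assuming features from an upstream sequence encoder; the pooling choice, layer sizes, and head names are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TacticUnitHeads(nn.Module):
    """Validity, tactic-type, and per-shot tactic-state heads over the features
    of one candidate unit (dimensions and pooling are illustrative)."""
    def __init__(self, d_model: int = 512, num_types: int = 9, num_states: int = 5):
        super().__init__()
        self.validity_head = nn.Linear(d_model, 1)          # is this a valid tactic unit?
        self.type_head = nn.Linear(d_model, num_types)      # e.g. "Continuous Smashing"
        self.state_head = nn.Linear(d_model, num_states)    # Start/Continue/Interrupt/Resume/Finish

    def forward(self, shot_feats: torch.Tensor):
        # shot_feats: (batch, num_shots, d_model) features of the candidate unit's shots
        pooled = shot_feats.mean(dim=1)                       # unit-level summary
        validity = torch.sigmoid(self.validity_head(pooled))  # (batch, 1) probability
        tactic_type = self.type_head(pooled)                  # (batch, 9) logits
        shot_states = self.state_head(shot_feats)             # (batch, num_shots, 5) logits
        return validity, tactic_type, shot_states

heads = TacticUnitHeads()
validity, t_type, states = heads(torch.randn(2, 7, 512))      # a 7-shot candidate unit
```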
4. Shot-wise Prompt-Guided Mechanism
A distinctive aspect of dataset usage is the shot-wise prompt-guided mechanism for tactic captioning. For each tactic unit, individual shot states and overall tactic type are encoded into structured prompts of the form:
Promptᵢ: <TacticType> -- <Stateᵢ>
These temporally ordered prompts, enriched with positional encoding, are injected into the Transformer decoder via cross-attention. Such conditioning, directly supported by the dataset structure, allows models to generate captions that mirror the temporal progression and interruptions within tactics, rather than producing static summaries. Ablation studies employing prompt structuring rooted in the dataset annotations demonstrate substantial improvements in coherence and metric performance.
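The snippet below illustrates this mechanism under simplifying assumptions: prompts are built as plain `<TacticType> -- <State_i>` strings, given stand-in embeddings with a toy positional signal, and supplied as the cross-attention memory of a standard Transformer decoder. In the actual framework the prompts condition the decoder alongside visual features; this is only a sketch of the conditioning path.

```python
import torch
import torch.nn as nn

def build_shotwise_prompts(tactic_type, states):
    """Build one '<TacticType> -- <State_i>' prompt per shot, in temporal order."""
    return [f"{tactic_type} -- {state}" for state in states]

prompts = build_shotwise_prompts(
    "Continuous Smashing",
    ["Start", "Continue", "Interrupt", "Resume", "Finish"],
)

# Stand-in embeddings for the text-encoded prompts, plus a simple positional signal
# so the decoder can exploit their temporal order.
d_model, num_prompts = 512, len(prompts)
prompt_emb = torch.randn(1, num_prompts, d_model)
positions = torch.arange(num_prompts, dtype=torch.float32).view(1, num_prompts, 1)
prompt_emb = prompt_emb + positions / num_prompts

# The ordered prompts serve as cross-attention memory for a Transformer decoder.
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
caption_states = torch.randn(1, 20, d_model)     # partially generated caption token states
out = decoder(tgt=caption_states, memory=prompt_emb)
```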
5. Benchmarking, Evaluation Metrics, and Ablation Studies
The Shot2Tactic-Caption Dataset facilitates rigorous benchmarking of multi-scale captioning architectures. Reported evaluation metrics include BLEU-4, METEOR, CIDEr, and Precision, enabling comprehensive assessment at both shot and tactic levels:
- Shot Captioning: The ResNet50-based spatio-temporal encoder achieves BLEU-4 of 48.99 and METEOR of 70.78.
- Tactic Captioning: BLEU-4 of 64.57, METEOR of 79.63, with CIDEr and Precision benefiting notably from prompt-driven structure.
Ablation studies leverage the dataset to compare encoder–decoder variants, prompt structuring methods (no prompt, flat prompt, shot-wise prompt), and Tactic Unit Detector configurations, with clear advances attributed to temporal prompt alignment and fine-grained annotations.
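For reference, a corpus-level BLEU-4 score of the kind reported above can be computed along the following lines; the reference and hypothesis captions here are invented for illustration and are unrelated to the reported results.

```python
# Corpus-level BLEU-4 on tokenized captions with nltk.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["player", "a", "plays", "a", "tight", "net", "shot"]]]   # reference list per sample
hypotheses = [["player", "a", "plays", "a", "net", "shot"]]              # tokenized model outputs

bleu4 = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),               # equal 1- to 4-gram weights = BLEU-4
    smoothing_function=SmoothingFunction().method1, # avoid zero scores on short captions
)
print(f"BLEU-4: {bleu4:.4f}")
```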
6. Loss Functions and Training Regime
Dataset annotations supervise model components using loss functions tailored to the respective tasks:
- Shot Captioning Loss: token-level cross-entropy over each shot caption, $\mathcal{L}_{\text{shot}} = -\sum_{t}\log p\!\left(y^{\text{shot}}_{t}\mid y^{\text{shot}}_{<t}, V\right)$, where $V$ denotes the visual features of the shot.
- Tactic Captioning Loss: the analogous cross-entropy over each tactic caption, $\mathcal{L}_{\text{tac}} = -\sum_{t}\log p\!\left(y^{\text{tac}}_{t}\mid y^{\text{tac}}_{<t}, V, P\right)$, where $P$ denotes the shot-wise prompts.
- Combined Loss: $\mathcal{L} = \lambda_{\text{shot}}\,\mathcal{L}_{\text{shot}} + \lambda_{\text{tac}}\,\mathcal{L}_{\text{tac}}$, with weighting coefficients $\lambda_{\text{shot}}$ and $\lambda_{\text{tac}}$ balancing the two branches.
The Tactic Unit Detector is trained with a composite loss integrating focal and margin losses for segment validity, and cross-entropy/focal losses for tactic type and state prediction. These loss signals enable end-to-end optimization of multi-scale captioning and tactical recognition directly from dataset annotations.
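A schematic implementation of these objectives might look as follows, assuming token-level cross-entropy for both caption branches and showing only the binary focal term of the detector's composite loss; the margin and type/state terms, and the actual weighting values, are omitted.

```python
import torch
import torch.nn.functional as F

def combined_caption_loss(shot_logits, shot_targets, tac_logits, tac_targets,
                          lambda_shot=1.0, lambda_tac=1.0, pad_id=0):
    """Weighted sum of shot- and tactic-caption cross-entropy losses.
    logits: (batch, seq_len, vocab); targets: (batch, seq_len) token ids."""
    shot_loss = F.cross_entropy(shot_logits.flatten(0, 1), shot_targets.flatten(),
                                ignore_index=pad_id)
    tac_loss = F.cross_entropy(tac_logits.flatten(0, 1), tac_targets.flatten(),
                               ignore_index=pad_id)
    return lambda_shot * shot_loss + lambda_tac * tac_loss

def binary_focal_loss(logits, targets, gamma=2.0):
    """Binary focal loss for tactic-unit validity, one component of the
    detector's composite objective (margin and type/state terms omitted)."""
    p = torch.sigmoid(logits)
    pt = torch.where(targets == 1, p, 1 - p)
    return (-(1 - pt) ** gamma * torch.log(pt.clamp(min=1e-8))).mean()
```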
7. Applications and Extension Prospects
By enabling multi-scale supervised captioning, the Shot2Tactic-Caption Dataset advances automated tactical commentary, tactical information retrieval, real-time visual question answering, and empirical performance analysis in racket sports. Its structure and annotation strategy provide a template for extension to other sports and domains that require semantic temporal video understanding and tactical narrative synthesis.
A plausible implication is that fine-grained, temporally aligned annotation frameworks such as this one are instrumental in bridging the gap between low-level event detection and high-level strategic reasoning, orienting future research in multi-modal sports analysis toward richer temporal context integration and narrative generation.