CoT-AFA: Explainable Action Assessment Dataset

Updated 24 December 2025
  • CoT-AFA is a large-scale, multimodal dataset that provides structured, explainable annotations with chain-of-thought feedback for human performance videos.
  • It employs a hierarchical annotation scheme categorizing workout modes, workout types, and granular action categories to enhance action classification.
  • Combined visual and procedural information supports advanced research in video understanding, action quality assessment, and explainable AI.

CoT-AFA (Chain-of-Thought Action Form Assessment) is a large-scale, multimodal benchmark for explainable action quality and standardization analysis in human performance videos. Designed to address limitations in existing datasets—namely, a lack of granular, causally structured feedback and explicit form assessment—CoT-AFA provides hierarchical activity labels, binary standardization judgments, attribute-level error reasoning, and multi-sentence chain-of-thought (CoT) explanations for over three thousand curated fitness and martial arts video clips. Each annotated sample integrates visual and procedural information, enabling advanced research in video understanding, action classification, and explainable AI for human action assessment (Qi et al., 17 Dec 2025).

1. Dataset Composition and Acquisition

CoT-AFA comprises 3,392 trimmed video clips, captured from both public YouTube sources and self-recorded demonstrations. The dataset covers a total of 364,812 frames, with an average clip duration of approximately 3.5 seconds (∼107 frames at 30 frames per second), summing to a total video span of roughly 11,872 seconds (over 3 hours). All videos are recorded at native resolutions of 720p or 1080p, with frame rates between 24 and 60 fps. Each clip is annotated for its camera viewpoint (Front, Side, Back) to facilitate multi-view analysis. Environments are diverse—home, gym, dojo, studio—with variable lighting and backgrounds, which enhances robustness for computer vision models (Qi et al., 17 Dec 2025).

The domain space comprises two top-level “Workout Modes” (Apparatus, e.g., dumbbell/barbell; and Manual, e.g., aerobics/yoga/martial arts), 28 intermediate “Workout Types” (e.g., “Aerobics” vs. “Yoga,” “Dumbbell” vs. “Barbell”), and 141 granular Action Categories (e.g., “Barbell Bent-Over Row,” “Taiji Push-Hand,” “Sun Salutation Yoga”).
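To make the hierarchy concrete, the sketch below encodes a small, non-exhaustive slice of the taxonomy as a nested Python mapping and resolves an action category to its (mode, type, category) path. The structure and the handful of entries shown are illustrative assumptions rather than the released label files.

```python
# Illustrative slice of the CoT-AFA label hierarchy (not the full 2/28/141 taxonomy).
# The entries are examples quoted in this article; grouping the Taiji actions under
# a "Martial Arts" workout type is an assumption made for illustration.
TAXONOMY = {
    "Apparatus": {
        "Dumbbell": ["Dumbbell Bicep Curl"],
        "Barbell": ["Barbell Bent-Over Row"],
    },
    "Manual": {
        "Yoga": ["Sun Salutation Yoga"],
        "Martial Arts": ["Taiji Push-Hand", "Taiji Form 24"],
    },
}


def hierarchy_path(action: str) -> tuple[str, str, str] | None:
    """Resolve an action category to its (Workout Mode, Workout Type, Action Category) path."""
    for mode, types in TAXONOMY.items():
        for wtype, actions in types.items():
            if action in actions:
                return (mode, wtype, action)
    return None


print(hierarchy_path("Barbell Bent-Over Row"))
# -> ('Apparatus', 'Barbell', 'Barbell Bent-Over Row')
```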

2. Hierarchical Annotation Schema

The CoT-AFA dataset employs a three-tiered lexicon hierarchy for action identification:

| Level | Classes | Example |
| --- | --- | --- |
| Workout Mode | 2 | Apparatus; Manual |
| Workout Type | 28 | Dumbbell, Barbell, Aerobics, Yoga, etc. |
| Action Category | 141 | Taiji Form 24, Dumbbell Bicep Curl |

Every clip is labeled at each level, supporting both coarse-to-fine semantic parsing and fine-grained recognition tasks (Qi et al., 17 Dec 2025).

Additionally, each video receives a binary “Standard”/“Non-Standard” label ($S \in \{\text{Standard}, \text{Non-Standard}\}$), totaling 2,242 standard and 1,150 non-standard samples. Non-standard instances are further labeled with an “Error Reason” (e.g., “Back rounded → risk of lumbar strain”), which grounds model outputs and analysis in biomechanical correctness.
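
Taken together, each clip can be viewed as a single record carrying the three hierarchy labels, the binary standardization judgment, and, for non-standard clips, an error reason plus the CoT feedback described in the next section. The dataclass below is a hypothetical reading of that schema; the field names are not taken from the released annotation files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoTAFASample:
    """Hypothetical per-clip annotation record for CoT-AFA (field names assumed)."""
    clip_id: str
    viewpoint: str               # "Front" | "Side" | "Back"
    workout_mode: str            # 2 classes: "Apparatus" | "Manual"
    workout_type: str            # 28 classes, e.g. "Barbell"
    action_category: str         # 141 classes, e.g. "Barbell Bent-Over Row"
    is_standard: bool            # binary form label S
    error_reason: Optional[str]  # present only for non-standard clips
    cot_explanation: str         # multi-sentence chain-of-thought feedback


sample = CoTAFASample(
    clip_id="clip_0001",
    viewpoint="Side",
    workout_mode="Apparatus",
    workout_type="Barbell",
    action_category="Barbell Bent-Over Row",
    is_standard=False,
    error_reason="Back rounded → risk of lumbar strain",
    cot_explanation=(
        "You raised the barbell too high, which places undue stress on your lower "
        "back. Instead, hinge from the hips and keep your arms at shoulder height."
    ),
)
```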

3. Chain-of-Thought Explanations

A distinguishing feature of CoT-AFA is the inclusion of Chain-of-Thought (CoT) multi-sentence feedback for every sample. These explanations systematically follow a triadic reasoning template:

  1. Identification of (correct or faulty) action step (e.g., “You raised the barbell too high…”)
  2. Analysis of biomechanical or safety consequence (“…which places undue stress on your lower back.”)
  3. Specific correction or actionable suggestion (“Instead, hinge from the hips and keep your arms at shoulder height.”)
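
As a rough illustration of how this structure can be exploited downstream, the snippet below splits an explanation into sentences and maps them onto the identification, consequence, and correction roles. This is a simplification for exposition, not the validation logic used by the authors.

```python
import re


def split_cot(explanation: str) -> dict[str, str]:
    """Naive mapping of a CoT explanation onto the triadic template.

    Assumes the first sentence identifies the (faulty) step, the middle
    sentence(s) analyse the consequence, and the last gives the correction.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", explanation) if s.strip()]
    if len(sentences) < 3:
        raise ValueError("Expected at least three sentences for the triad.")
    return {
        "identification": sentences[0],
        "consequence": " ".join(sentences[1:-1]),
        "correction": sentences[-1],
    }


cot = ("You raised the barbell too high. This places undue stress on your lower back. "
       "Instead, hinge from the hips and keep your arms at shoulder height.")
print(split_cot(cot))
```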

CoT explanations are created through a multi-stage workflow: an LLM (Gemini 2.0) generates “Standard Technical Steps” for each action type; the video, steps, and quality label are passed to a VLM (VideoChat) to produce an initial CoT draft; a second VLM (Qwen2.5-VL) performs automated logical consistency checking; and eight certified trainers carry out final expert review and editing (Qi et al., 17 Dec 2025).
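
The four-stage workflow might be organized roughly as sketched below; every function here is a placeholder (the real pipeline wraps Gemini 2.0, VideoChat, Qwen2.5-VL, and a human review GUI, whose prompts and interfaces are not reproduced here).

```python
# Hypothetical skeleton of the CoT annotation workflow; all stage bodies are stubs.

def generate_standard_steps(action_category: str) -> list[str]:
    """Stage 1: LLM (Gemini 2.0) generates 'Standard Technical Steps' for the action type."""
    ...


def draft_cot(video_path: str, steps: list[str], is_standard: bool) -> str:
    """Stage 2: VLM (VideoChat) drafts an initial CoT explanation from video, steps, and label."""
    ...


def check_consistency(draft: str, steps: list[str]) -> bool:
    """Stage 3: a second VLM (Qwen2.5-VL) checks logical consistency against the steps."""
    ...


def expert_review(draft: str) -> str:
    """Stage 4: a certified trainer edits and approves the final explanation."""
    ...


def annotate_clip(video_path: str, action_category: str, is_standard: bool) -> str:
    steps = generate_standard_steps(action_category)      # Stage 1
    draft = draft_cot(video_path, steps, is_standard)      # Stage 2
    if not check_consistency(draft, steps):                # Stage 3
        draft = draft_cot(video_path, steps, is_standard)  # regenerate on failure (assumed policy)
    return expert_review(draft)                            # Stage 4
```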

Key quantitative properties of the CoT explanations are summarized below (per-sample means unless otherwise noted):

| Metric | Value |
| --- | --- |
| Words per CoT sample | 102.19 |
| Sentences per sample | 5.25 |
| Vocabulary size | 3,143 |
| Causal reasoning steps | 0.91 |
| Actionable suggestions | 0.75 |

Logical consistency via VLM pipelines yields near-perfect adherence to templates, while at least 95% of expert-reviewed samples achieved unanimous agreement (formal $\kappa$ not reported).

4. Benchmark Tasks, Metrics, and Data Splits

CoT-AFA is structured for multi-task benchmarking:

  • Action Classification (141-way):
    • Top-1 Accuracy ($\mathrm{Acc}_1 = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat y_i^{(1)} = y_i)$)
    • Top-5 Accuracy (analogously, if the ground truth is within the model’s top 5)
  • Form Assessment (Binary):
    • Accuracy ($\mathrm{Acc}_Q = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat q_i = q_i)$)
  • CoT Explanation Generation:

    • Evaluated with BLEU-n, METEOR, CIDEr, and ROUGE-L; CIDEr defined as:

    $$\mathrm{CIDEr}(c, r) = \frac{1}{M}\sum_{j=1}^{M} \frac{g^{(c)} \cdot g^{(r_j)}}{\lVert g^{(c)} \rVert \, \lVert g^{(r_j)} \rVert}$$

where $g^{(\cdot)}$ are TF–IDF weighted n-gram vectors of the candidate $c$ and the references $r_j$.
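
For reference, the two accuracy metrics can be computed directly with NumPy as sketched below; the text-generation metrics (BLEU-n, METEOR, CIDEr, ROUGE-L) are normally taken from standard captioning-evaluation toolkits rather than re-implemented. Array names and shapes here are illustrative.

```python
import numpy as np


def topk_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """Top-k accuracy over N samples given an (N, C) matrix of class scores."""
    topk = np.argsort(-scores, axis=1)[:, :k]        # indices of the k highest scores
    hits = (topk == labels[:, None]).any(axis=1)     # is the ground truth among them?
    return float(hits.mean())


def binary_accuracy(pred: np.ndarray, target: np.ndarray) -> float:
    """Form-assessment accuracy Acc_Q for binary Standard/Non-Standard predictions."""
    return float((pred == target).mean())


# Toy example with N = 4 samples and C = 141 action categories.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 141))
labels = np.array([3, 7, 42, 100])
print(topk_accuracy(scores, labels, k=1), topk_accuracy(scores, labels, k=5))
print(binary_accuracy(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])))
```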

Recommended split is 70% training (2,374 clips), 15% validation (509), 15% testing (509). Models are trained with a joint multitask objective:

$$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{quality}} + \mathcal{L}_{\mathrm{CoT}}$$

with $\lambda$ tuned (best at 3.0) (Qi et al., 17 Dec 2025).
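
A minimal PyTorch sketch of this joint objective, assuming a 141-way classification head, a binary quality head, and token-level cross-entropy for CoT generation; tensor shapes and head names are assumptions, and $\lambda$ weights the classification term as in the formula above.

```python
import torch
import torch.nn.functional as F


def multitask_loss(cls_logits: torch.Tensor, cls_labels: torch.Tensor,          # (N, 141), (N,)
                   quality_logits: torch.Tensor, quality_labels: torch.Tensor,  # (N, 2), (N,)
                   cot_logits: torch.Tensor, cot_tokens: torch.Tensor,          # (N, T, V), (N, T)
                   lam: float = 3.0) -> torch.Tensor:
    """Joint objective L = lam * L_cls + L_quality + L_CoT (sketch)."""
    l_cls = F.cross_entropy(cls_logits, cls_labels)
    l_quality = F.cross_entropy(quality_logits, quality_labels)
    # Token-level cross-entropy over the generated explanation.
    l_cot = F.cross_entropy(cot_logits.reshape(-1, cot_logits.size(-1)),
                            cot_tokens.reshape(-1))
    return lam * l_cls + l_quality + l_cot
```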

5. Annotation Process and Quality Assurance

Annotation integrates LLMs and vision-language models (VLMs) in a stepwise manner: LLMs generate procedural step prompts, VLMs draft the reasoning, a second VLM performs automated validation, and certified human experts review the result. LLM prompt templates are used to extract standardized action steps per category. An in-house GUI supports efficient expert correction and finalization. This multi-layered pipeline aims to maximize annotation consistency, fluency, and domain accuracy (Qi et al., 17 Dec 2025).

Checks by VLMs yield near-100% logical consistency, and expert reviews report ≥95% unanimous approval, evidence of both formal and substantive reliability.

6. Applications, Resources, and Research Directions

CoT-AFA is specifically constructed for developing and benchmarking explainable action quality assessment frameworks that require joint video, attribute, and procedural text reasoning. Its diverse domains (fitness and martial arts), fine-grained action taxonomy, and explicit causal feedback are uniquely suited to advance the explainability and semantic richness of action assessment systems.

All data and code are available at https://github.com/MICLAB-BUPT/EFA. Annotation scripts and VLM/LLM prompt templates support reproduction and extension of the pipeline. As of publication, the dataset yielded new state-of-the-art results for explanation generation (+16.0% CIDEr), action classification (+2.7% accuracy), and quality assessment (+2.1% accuracy) when paired with novel multimodal architectures such as the Explainable Fitness Assessor, which fuses visual and semantic streams via dynamic gating (Qi et al., 17 Dec 2025).

This suggests that CoT-AFA provides not only a new evaluation resource but also a methodological standard for integrating structured causal explanations into human form analysis datasets.
