FineParser: A Fine-grained Spatio-temporal Action Parser for Human-centric Action Quality Assessment (2405.06887v1)
Abstract: Existing action quality assessment (AQA) methods mainly learn deep representations at the video level for scoring diverse actions. Lacking a fine-grained understanding of the actions in videos, they suffer from low credibility and interpretability and are thus insufficient for stringent applications such as Olympic diving events. We argue that a fine-grained understanding of actions requires the model to perceive and parse actions in both time and space, which is also the key to the credibility and interpretability of AQA techniques. Based on this insight, we propose a new fine-grained spatio-temporal action parser named FineParser. It learns human-centric foreground action representations by focusing on target action regions within each frame and exploiting their fine-grained alignments in time and space, which minimizes the impact of invalid backgrounds during assessment. In addition, we construct fine-grained annotations of human-centric foreground action masks for the FineDiving dataset, called FineDiving-HM. With refined annotations on diverse target action procedures, FineDiving-HM can promote the development of real-world AQA systems. Through extensive experiments, we demonstrate the effectiveness of FineParser, which outperforms state-of-the-art methods while supporting more tasks of fine-grained action understanding. Data and code are available at https://github.com/PKU-ICST-MIPL/FineParser_CVPR2024.
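The idea of focusing on foreground action regions so that background pixels contribute little can be illustrated with a minimal sketch. This is not the FineParser architecture: it only shows mask-weighted average pooling of a per-frame feature map, and the function names `masked_average` and `pool_video`, the grid layout, and the `eps` parameter are all hypothetical choices made for illustration.

```python
from typing import List

Grid = List[List[float]]  # an H x W grid of per-pixel values


def masked_average(features: Grid, mask: Grid, eps: float = 1e-8) -> float:
    """Average a feature map over foreground pixels only.

    Mask values lie in [0, 1]: 1 marks the human-centric foreground,
    0 marks background that should not influence the result.
    (Illustrative assumption, not the paper's formulation.)
    """
    if len(features) != len(mask) or any(
        len(f_row) != len(m_row) for f_row, m_row in zip(features, mask)
    ):
        raise ValueError("features and mask must have the same shape")

    weighted_sum = 0.0
    mask_sum = 0.0
    for f_row, m_row in zip(features, mask):
        for value, weight in zip(f_row, m_row):
            weighted_sum += value * weight
            mask_sum += weight
    # eps avoids division by zero when a frame has no foreground pixels
    return weighted_sum / (mask_sum + eps)


def pool_video(frames: List[Grid], masks: List[Grid]) -> List[float]:
    """Return one foreground-pooled value per frame."""
    if len(frames) != len(masks):
        raise ValueError("need exactly one mask per frame")
    return [masked_average(f, m) for f, m in zip(frames, masks)]
```

With a mask that keeps only the left column of a 2 x 2 frame, large background values in the right column do not affect the pooled result; a frame with an all-zero mask pools to 0.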