Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning (2405.20606v2)
Abstract: Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding skeleton sequences, and it falls into two primary training paradigms: supervised learning and self-supervised learning. However, the former relies on one-hot classification, which requires labor-intensive annotation of predefined action categories, while the latter involves skeleton transformations (e.g., cropping) in its pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C$^2$VL) based on Cross-modal Contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from Vision-Language knowledge prompts. Specifically, we establish a vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enrich the fine-grained details that the skeleton action space lacks. Moreover, we propose intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal representation learning process, which progressively control and guide how closely the vision-language knowledge prompts and the corresponding skeletons are pulled together. These soft instance discrimination and self-knowledge distillation strategies help learn better skeleton-based action representations from the noisy skeleton-vision-language pairs. During inference, our method requires only skeleton data as input for action recognition and no longer needs the vision-language prompts. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art results. Code is available at: https://github.com/cseeyangchen/C2VL.
- Tian He
- Junfeng Fu
- Ling Wang
- Jingcai Guo
- Hong Cheng
- Yang Chen
- Ting Hu
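
The abstract's core mechanism is soft instance discrimination: instead of one-hot contrastive targets, the skeleton-to-prompt logits are supervised by a softened target distribution built from intra-modal self-similarity and inter-modal cross-consistency. The sketch below is a minimal illustration of that idea, not the authors' implementation; the function name `soft_cross_modal_loss`, the mixing weight `alpha`, the temperatures, and the use of random tensors in place of encoder outputs are all assumptions for illustration.

```python
# Minimal sketch of cross-modal contrastive learning with softened targets,
# assuming PyTorch; hypothetical names and hyperparameters, not the paper's code.
import torch
import torch.nn.functional as F

def soft_cross_modal_loss(z_skel, z_vl, tau=0.1, tau_t=0.05, alpha=0.5):
    """Contrast skeleton embeddings against vision-language prompt embeddings.

    z_skel: (N, D) skeleton features; z_vl: (N, D) vision-language prompt features.
    The target distribution mixes intra-modal self-similarity (skeleton-to-skeleton)
    with inter-modal cross-consistency (prompt-to-prompt); alpha is an assumed weight.
    """
    z_skel = F.normalize(z_skel, dim=-1)
    z_vl = F.normalize(z_vl, dim=-1)

    # Cross-modal logits: each skeleton against all vision-language prompts.
    logits = z_skel @ z_vl.t() / tau                      # (N, N)

    with torch.no_grad():
        # Intra-modal self-similarity target (skeleton space).
        intra = F.softmax(z_skel @ z_skel.t() / tau_t, dim=-1)
        # Inter-modal cross-consistency target (vision-language space).
        inter = F.softmax(z_vl @ z_vl.t() / tau_t, dim=-1)
        # Softened target distribution replacing a one-hot identity matrix.
        target = alpha * intra + (1.0 - alpha) * inter

    # Soft instance discrimination: cross-entropy against the softened targets.
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Usage with random features standing in for encoder outputs:
loss = soft_cross_modal_loss(torch.randn(8, 256), torch.randn(8, 256))
```

In this reading, progressively annealing how sharp the softened targets are (e.g., via `tau_t` or `alpha`) would control how strongly skeletons are pulled toward their paired prompts, which matches the progressive distillation described in the abstract.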