Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval (2407.02104v1)
Abstract: Pose-estimation methods enable extracting human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search for database motions that are the most relevant to a specified natural-language textual description (text-to-motion) and vice-versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously - together with the introduction of a Cross-Consistent Contrastive Loss function (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data. We demonstrate the benefits of the proposed approaches on the widely-used KIT Motion-Language and HumanML3D datasets. We perform detailed experimentation on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods.
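To make the idea of a cross-modal contrastive objective with an added uni-modal regularizer more concrete, here is a minimal pure-Python sketch. It is not the paper's CCCL: the abstract does not give the formulation, so the symmetric InfoNCE term, the squared-difference uni-modal term, and every name and parameter here (`cross_modal_loss`, `unimodal_consistency`, `temperature`, `weight`) are assumptions chosen only for illustration.

```python
import math


def cosine_sim_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    return [[sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v)) for v in b]
            for u in a]


def info_nce(sim, temperature=0.1):
    """Mean cross-entropy where the positive for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)


def cross_modal_loss(text, motion, temperature=0.1):
    """Symmetric contrastive loss: text-to-motion plus motion-to-text."""
    sim = cosine_sim_matrix(text, motion)
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the reverse direction
    return 0.5 * (info_nce(sim, temperature) + info_nce(sim_t, temperature))


def unimodal_consistency(text, motion):
    """Hypothetical uni-modal term (an assumption, not the paper's definition):
    text-text similarities should agree with motion-motion similarities."""
    tt = cosine_sim_matrix(text, text)
    mm = cosine_sim_matrix(motion, motion)
    n = len(tt)
    return sum((tt[i][j] - mm[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)


def total_loss(text, motion, temperature=0.1, weight=1.0):
    """Cross-modal loss plus the weighted uni-modal regularizer."""
    return (cross_modal_loss(text, motion, temperature)
            + weight * unimodal_consistency(text, motion))


# Toy batch: row i of `text` is paired with row i of `motion`.
text = [[1.0, 0.0], [0.0, 1.0]]
motion_aligned = [[1.0, 0.0], [0.0, 1.0]]
motion_shuffled = [[0.0, 1.0], [1.0, 0.0]]

aligned = total_loss(text, motion_aligned)
shuffled = total_loss(text, motion_shuffled)
```

In this toy batch the correctly paired embeddings yield a much smaller loss than the shuffled ones, which is the behavior a contrastive retrieval objective is meant to reward. The uni-modal term only compares similarity structure within each modality, so it acts as a regularizer on the shared space rather than as a matching signal.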
Authors:
- Nicola Messina
- Jan Sedmidubsky
- Fabrizio Falchi
- Tomáš Rebok