Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval (2407.02104v1)

Published 2 Jul 2024 in cs.CV, cs.IR, and cs.MM

Abstract: Pose-estimation methods enable extracting human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search for database motions that are the most relevant to a specified natural-language textual description (text-to-motion) and vice-versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously - together with the introduction of a Cross-Consistent Contrastive Loss function (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data. We demonstrate the benefits of the proposed approaches on the widely-used KIT Motion-Language and HumanML3D datasets. We perform detailed experimentation on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods.
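The retrieval setting described in the abstract can be illustrated with a minimal sketch, assuming texts and motions have already been embedded into a common space by some encoders (not shown). The abstract does not give the formula for CCCL, so the code below does not implement CCCL or its uni-modal constraints; it only shows a generic symmetric contrastive (InfoNCE-style) loss and similarity-based ranking. All function names, the toy embeddings, and the temperature value are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(text_embs, motion_embs):
    """Cosine similarity between every text and every motion embedding."""
    t = [l2_normalize(v) for v in text_embs]
    m = [l2_normalize(v) for v in motion_embs]
    return [[sum(a * b for a, b in zip(ti, mj)) for mj in m] for ti in t]

def rank_items(sim_row):
    """Indices of database items sorted from most to least similar to the query."""
    return sorted(range(len(sim_row)), key=lambda j: -sim_row[j])

def symmetric_contrastive_loss(sim, temperature=0.1):
    """Generic symmetric InfoNCE-style loss; pair (i, i) is treated as the positive.

    This is an illustrative stand-in, NOT the paper's CCCL.
    """
    n = len(sim)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            mx = max(logits)  # subtract the max for numerical stability
            log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
            total += log_z - logits[i]
        return total / n

    cols = [list(c) for c in zip(*sim)]  # transpose: motion-to-text direction
    return 0.5 * (cross_entropy(sim) + cross_entropy(cols))

# Toy, hand-made embeddings (hypothetical): text i describes motion i.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
motion_embs = [[0.9, 0.1], [0.2, 0.8]]

sim = similarity_matrix(text_embs, motion_embs)
t2m_ranking = rank_items(sim[0])                         # text-to-motion for text 0
m2t_ranking = rank_items([row[1] for row in sim])        # motion-to-text for motion 1
aligned_loss = symmetric_contrastive_loss(sim)
swapped_loss = symmetric_contrastive_loss(similarity_matrix(text_embs, motion_embs[::-1]))

print(t2m_ranking, m2t_ranking, aligned_loss, swapped_loss)
```

In this sketch the same similarity matrix serves both directions: rows rank motions for a text query, and columns rank texts for a motion query. The loss is lower when matching text-motion pairs lie on the diagonal than when the pairing is swapped.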

Authors (4)
  1. Nicola Messina (23 papers)
  2. Jan Sedmidubsky (3 papers)
  3. Fabrizio Falchi (58 papers)
  4. Tomáš Rebok (3 papers)