Tri-Modal Motion Retrieval by Learning a Joint Embedding Space (2403.00691v1)
Abstract: Information retrieval is a crucial, ever-evolving research domain. The substantial demand for high-quality human motion data, particularly for online acquisition, has fueled a surge of research on human motion. Prior work has concentrated mainly on dual-modality learning, such as text-motion tasks, while tri-modality learning has rarely been explored. Intuitively, an additional modality enriches a model's application scenarios; more importantly, a well-chosen extra modality can act as an intermediary that strengthens the alignment between the other two disparate modalities. In this work, we introduce LAVIMO (LAnguage-VIdeo-MOtion alignment), a novel tri-modal learning framework that integrates human-centric videos as an additional modality, thereby effectively bridging the gap between text and motion. Our approach further employs a specially designed attention mechanism to foster alignment and synergy among the text, video, and motion modalities. Empirically, results on the HumanML3D and KIT-ML datasets show that LAVIMO achieves state-of-the-art performance on various motion-related cross-modal retrieval tasks, including text-to-motion, motion-to-text, video-to-motion, and motion-to-video.
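The abstract does not specify the training objective, but joint embedding spaces of this kind are commonly learned with pairwise contrastive (InfoNCE) losses between every pair of modalities. The sketch below is a hypothetical NumPy illustration of that idea, not the authors' implementation: the function names (`info_nce`, `trimodal_loss`) and the temperature value are assumptions.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Matching pairs share the same row index; all other rows in the
    batch serve as negatives (a hypothetical, simplified setup).
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)   # L2-normalize
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                     # (N, N) similarities
    idx = np.arange(len(a))                            # diagonal = positives

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)           # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()

    # Average over both retrieval directions (a->b and b->a).
    return 0.5 * (xent(logits) + xent(logits.T))

def trimodal_loss(text, video, motion):
    """Sum the three pairwise losses; video bridges text and motion."""
    return (info_nce(text, motion)
            + info_nce(text, video)
            + info_nce(video, motion))

# Toy usage with random "embeddings" standing in for encoder outputs.
rng = np.random.default_rng(0)
N, D = 8, 32
loss = trimodal_loss(rng.normal(size=(N, D)),
                     rng.normal(size=(N, D)),
                     rng.normal(size=(N, D)))
```

Minimizing a loss of this shape pulls each sample's text, video, and motion embeddings together while pushing apart non-matching batch items, which is what enables the cross-modal retrieval tasks listed above.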
Authors: Kangning Yin, Shihao Zou, Yuxuan Ge, Zheng Tian