TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis (2305.00976v2)

Published 2 May 2023 in cs.CV and cs.CL

Abstract: In this paper, we present TMR, a simple yet effective approach for text to 3D human motion retrieval. While previous work has only treated retrieval as a proxy evaluation metric, we tackle it as a standalone task. Our method extends the state-of-the-art text-to-motion synthesis model TEMOS, and incorporates a contrastive loss to better structure the cross-modal latent space. We show that maintaining the motion generation loss, along with the contrastive training, is crucial to obtain good performance. We introduce a benchmark for evaluation and provide an in-depth analysis by reporting results on several protocols. Our extensive experiments on the KIT-ML and HumanML3D datasets show that TMR outperforms the prior work by a significant margin, for example reducing the median rank from 54 to 19. Finally, we showcase the potential of our approach on moment retrieval. Our code and models are publicly available at https://mathis.petrovich.fr/tmr.

Authors (3)
  1. Mathis Petrovich (10 papers)
  2. Michael J. Black (163 papers)
  3. Gül Varol (39 papers)
Citations (56)

Summary

Insights on Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis

The paper "TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis" presents an approach to text-to-motion retrieval that treats retrieval as a task in its own right rather than as a proxy evaluation metric. The authors, Mathis Petrovich, Michael J. Black, and Gül Varol, build on the text-to-motion synthesis model TEMOS and add a contrastive loss to better structure the cross-modal latent space.

Contributions

  1. Retrieval as a Primary Task: Unlike prior work where retrieval served as a proxy for evaluation, this paper positions retrieval as a standalone challenge. The approach significantly reduces the median rank of correct motions from 54 to 19, showcasing substantial improvements over existing methodologies.
  2. Extension of TEMOS: By adding a contrastive loss to the TEMOS synthesis model, the method better structures the joint text-motion embedding space; the authors show that retaining the motion generation loss alongside the contrastive objective is crucial for good retrieval performance. This matters for applications in which retrieving real, captured motion is preferable or necessary compared to synthesized motion.
  3. Negative Sampling Refinement: The paper filters out in-batch negative pairs whose text descriptions are highly similar to the anchor's. Motion captions often overlap semantically, so treating such near-duplicates as negatives would penalize correct matches; filtering them improves retrieval performance.
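The combined objective and filtered negatives described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, the InfoNCE-style formulation, and the similarity threshold are assumptions, and the text-text similarity matrix is taken as given (in practice it could come from a pretrained sentence encoder).

```python
import numpy as np

def contrastive_loss_with_filtering(text_emb, motion_emb, text_sim,
                                    threshold=0.8, temperature=0.1):
    """InfoNCE-style loss over a batch of paired text/motion embeddings.

    In-batch pairs whose *text-text* similarity exceeds `threshold` are
    masked out as negatives, since near-duplicate captions are not true
    negatives for retrieval.
    """
    # Normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    logits = (t @ m.T) / temperature  # (B, B) text-to-motion similarities

    # Mask negatives whose caption is too similar to the anchor's caption
    mask = text_sim > threshold
    np.fill_diagonal(mask, False)     # always keep the positive pairs
    logits = np.where(mask, -np.inf, logits)

    # Symmetric cross-entropy against the diagonal (matched pairs)
    def ce(l):
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (ce(logits) + ce(logits.T))
```

In TMR this contrastive term would be weighted and summed with the TEMOS generation (reconstruction and KL) losses; the paper's key finding is that dropping the generation loss degrades retrieval.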

Evaluation and Results

The approach is validated on two datasets, KIT-ML and HumanML3D, where TMR outperforms previous models such as the retrieval model of Guo et al. by a significant margin. In particular, the contrastive framework reduces text-motion ambiguities that hampered earlier methods.

Implications and Future Directions

The implications of this research are multifaceted:

  • Improved Indexing and Searching: By establishing a robust cross-modal embedding space, TMR facilitates efficient searching techniques for large motion databases, important for various applications from animation to virtual reality.
  • Enhanced Semantic Understanding: The improved cross-modal alignment supports downstream tasks such as zero-shot 3D action recognition and moment retrieval within long motion sequences.
  • Complementary to Synthesis: As text-to-motion synthesis progresses, retrieval provides a necessary counterpoint, ensuring high-fidelity and realistic motion sequences are accessible without generative uncertainties.
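As a concrete picture of how such an embedding space supports indexing and search, here is a hypothetical nearest-neighbor lookup over precomputed motion embeddings. The function and variable names are illustrative, not from the paper, and a production system would typically use an approximate index (e.g. FAISS) for large databases.

```python
import numpy as np

def retrieve_motions(text_query_emb, motion_db_emb, k=3):
    """Return indices and cosine scores of the top-k motions
    for a single text query embedding, over an (N, D) database."""
    q = text_query_emb / np.linalg.norm(text_query_emb)
    db = motion_db_emb / np.linalg.norm(motion_db_emb, axis=1, keepdims=True)
    scores = db @ q                 # cosine similarity to each motion
    top = np.argsort(-scores)[:k]   # highest similarity first
    return top, scores[top]
```

Because both modalities live in one space, the same database can also be queried with a motion embedding to find similar motions or their captions.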

The paper opens several avenues for future inquiry, notably extending the framework to handle synthesis and retrieval jointly within one model, which could improve the efficiency and accuracy of multi-modal systems. Cross-modal retrieval could also benefit from integrating text generation capabilities, in line with trends in vision-language research.

In conclusion, this research advances text-to-motion retrieval through contrastive learning, setting a new benchmark for motion retrieval. Its careful negative sampling and the coupling of the contrastive objective with the synthesis loss point the way toward stronger retrieval systems across related AI and machine learning domains.
