What is the Best Automated Metric for Text to Motion Generation? (2309.10248v1)

Published 19 Sep 2023 in cs.CL, cs.GR, and cs.LG

Abstract: There is growing interest in generating skeleton-based human motions from natural language descriptions. While most efforts have focused on developing better neural architectures for this task, there has been no significant work on determining the proper evaluation metric. Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments. Since a single description is compatible with many motions, determining the right metric is critical for evaluating and designing effective generative models. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better. Our findings indicate that none of the metrics currently used for this task show even a moderate correlation with human judgments at the sample level. However, for assessing average model performance, commonly used metrics such as R-Precision and less-used coordinate errors show strong correlations. Additionally, several recently developed metrics are not recommended due to their low correlation compared to alternatives. We also introduce a novel metric based on a multimodal BERT-like model, MoBERT, which offers strongly human-correlated sample-level evaluations while maintaining near-perfect model-level correlation. Our results demonstrate that this new metric offers substantial benefits over all current alternatives.
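The abstract contrasts two granularities of agreement with human judgments: correlation computed per generated sample and correlation computed on per-model averages. As a minimal illustrative sketch (not the paper's code), the Python below shows the R-Precision protocol commonly used for text-to-motion evaluation, checking whether a generated motion's own description ranks in the top k within a pool of candidate texts in a shared embedding space, alongside sample-level versus model-level Pearson correlation. The encoders are stubbed with random embeddings, and the pool size, k, and all function names are assumptions for the demo.

```python
# Illustrative sketch only; not the authors' implementation. Random
# embeddings stand in for real motion/text encoders; pool_size=32 and k=3
# follow the protocol commonly used for this metric.
import numpy as np

def r_precision(motion_emb, text_emb, k=3, pool_size=32, seed=0):
    """Fraction of motions whose own description ranks in the top-k
    (by Euclidean distance) within a random pool of candidate texts."""
    rng = np.random.default_rng(seed)
    n = len(motion_emb)
    hits = 0
    for i in range(n):
        # Pool = the matching description plus (pool_size - 1) distractors.
        distractors = rng.choice(np.delete(np.arange(n), i),
                                 size=pool_size - 1, replace=False)
        pool = np.concatenate(([i], distractors))
        dists = np.linalg.norm(text_emb[pool] - motion_emb[i], axis=1)
        # The matching text sits at pool position 0; find its rank.
        rank = int(np.argsort(dists).tolist().index(0))
        hits += rank < k
    return hits / n

def sample_vs_model_pearson(metric, human, model_ids):
    """Pearson r at the sample level and after averaging per model --
    the two granularities the abstract contrasts."""
    sample_r = np.corrcoef(metric, human)[0, 1]
    models = np.unique(model_ids)
    metric_mean = np.array([metric[model_ids == m].mean() for m in models])
    human_mean = np.array([human[model_ids == m].mean() for m in models])
    return sample_r, np.corrcoef(metric_mean, human_mean)[0, 1]

# Toy data: 256 samples drawn from 4 hypothetical models.
rng = np.random.default_rng(1)
motions = rng.normal(size=(256, 64))
texts = motions + 0.2 * rng.normal(size=(256, 64))   # correlated pair
model_ids = np.repeat(np.arange(4), 64)
human = rng.normal(size=256)
metric = 0.3 * human + rng.normal(size=256)          # weakly correlated metric

print(f"R-Precision (top-3): {r_precision(motions, texts):.3f}")
s, m = sample_vs_model_pearson(metric, human, model_ids)
print(f"sample-level r = {s:.2f}, model-level r = {m:.2f}")
```

Averaging per model collapses sample-level noise, which is why a metric can rank models reliably while correlating poorly with human judgments on individual samples.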
