
Evaluating gesture generation in a large-scale open challenge: The GENEA Challenge 2022 (2303.08737v2)

Published 15 Mar 2023 in cs.HC, cs.LG, and cs.MM

Abstract: This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike in comparisons across different research papers, differences in results here stem only from differences between methods, enabling direct comparison between systems. The dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in a dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which has been a difficult problem in the field. The evaluation results show some synthetic gesture conditions being rated as significantly more human-like than 3D human motion capture. To the best of our knowledge, this has not been demonstrated before. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. We also find that conventional objective metrics do not correlate well with subjective human-likeness ratings in this large evaluation. The one exception is the Fréchet gesture distance (FGD), which achieves a Kendall's tau rank correlation of around $-0.5$. Based on the challenge results we formulate numerous recommendations for system building and evaluation.
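The FGD mentioned above is the motion analogue of the Fréchet inception distance: fit a Gaussian to feature embeddings of natural motion and of synthetic motion, then compute $d^2 = \lVert \mu_1 - \mu_2 \rVert^2 + \operatorname{Tr}\bigl(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\bigr)$. Below is a minimal sketch of that computation and of the Kendall's tau check against human-likeness ratings. The feature extractor (for FGD, typically a pretrained gesture autoencoder) is assumed to be given, and all variable names are illustrative rather than taken from the challenge code.

```python
# Minimal sketch: Fréchet distance between Gaussians fitted to two sets of
# gesture-feature embeddings, plus the rank-correlation check reported in
# the abstract. Assumes embeddings are precomputed; names are hypothetical.
import numpy as np
from scipy import linalg, stats

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input is an (n_samples, n_features) array of embeddings.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerics
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Hypothetical usage: score each system's motion against the motion-capture
# reference, then correlate the scores with median human-likeness ratings.
#   fgd_per_system = [frechet_distance(f, mocap_feats) for f in system_feats]
#   tau, p_value = stats.kendalltau(fgd_per_system, human_likeness_ratings)
# FGD is a distance, so a negative tau (around -0.5, per the abstract) means
# lower FGD loosely tracks higher subjective human-likeness.
```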
