Evaluating gesture generation in a large-scale open challenge: The GENEA Challenge 2022 (2303.08737v2)
Abstract: This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike comparisons across different research papers, the differences in results here are due solely to differences between methods, enabling direct comparison between systems. The dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in a dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which has been a difficult problem in the field. The evaluation results show some synthetic gesture conditions being rated as significantly more human-like than 3D human motion capture. To the best of our knowledge, this has not been demonstrated before. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. We also find that conventional objective metrics do not correlate well with subjective human-likeness ratings in this large evaluation. The one exception is the Fréchet gesture distance (FGD), which achieves a Kendall's tau rank correlation of around $-0.5$. Based on the challenge results we formulate numerous recommendations for system building and evaluation.
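The abstract states that FGD was the only objective metric correlating well with subjective ratings, with a Kendall's tau around $-0.5$. As a minimal sketch of what such a rank correlation measures, the snippet below computes Kendall's tau-a in pure Python over per-system scores. All numbers are hypothetical, invented for illustration; they are not results from the challenge.

```python
def kendall_tau(x, y):
    """Kendall's tau-a rank correlation (assumes no ties in x or y)."""
    n = len(x)
    concordant = discordant = 0
    # Compare every pair of systems: a pair is concordant if both
    # score lists rank it the same way, discordant otherwise.
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical per-system scores: lower FGD should mean better motion,
# so a useful metric yields a strongly NEGATIVE tau against ratings.
fgd = [12.4, 25.1, 8.3, 30.7, 18.9]              # objective metric (lower = better)
human_likeness = [61.0, 48.5, 67.2, 40.1, 55.3]  # subjective rating (higher = better)

print(f"Kendall's tau = {kendall_tau(fgd, human_likeness):.2f}")
```

In this toy example the two score lists are perfectly inversely ranked, so tau is exactly $-1$; in the challenge, the observed correlation of around $-0.5$ indicates a useful but far from perfect agreement between FGD and human judgements.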
- No gestures left behind: Learning relationships between spoken language and freeform gestures. In Findings of the Association for Computational Linguistics (EMNLP ’20 Findings). 1884–1895. https://doi.org/10.18653/v1/2020.findings-emnlp.170
- Low-resource adaptation for personalized co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR ’22). 20566–20576. https://doi.org/10.1109/CVPR52688.2022.01991
- Simon Alexanderson. 2020. The StyleGestures entry to the GENEA Challenge 2020. In Proceedings of the GENEA Workshop (GENEA ’20). https://doi.org/10.5281/zenodo.4088599
- Style-controllable speech-driven gesture synthesis using normalising flows. Comput. Graph. Forum 39, 2 (2020), 487–496. https://doi.org/10.1111/cgf.13946
- Listen, Denoise, Action! Audio-driven motion synthesis with diffusion models. ACM Trans. Graph. 42, 4 (2023), 1–20. https://doi.org/10.1145/3592458
- Rhythmic Gesticulator: Rhythm-aware co-speech gesture synthesis with hierarchical neural embeddings. ACM Trans. Graph. 41, 6, Article 209 (2022), 19 pages. https://doi.org/10.1145/3550454.3555435
- Okan Arikan and David A. Forsyth. 2002. Interactive motion generation from examples. ACM Trans. Graph. 21, 3 (2002), 483–490. https://doi.org/10.1145/566570.566606
- Molly Babel and Jamie Russell. 2015. Expectations and speech intelligibility. J. Acoust. Soc. Am. 137, 5 (2015), 2823–2833. https://doi.org/10.1121/1.4919317
- wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems (NeurIPS ’20). 12449–12460. https://proceedings.neurips.cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html
- George Alfred Barnard. 1945. A new test for 2×2 tables. Nature 156, 3954 (1945), 177. https://doi.org/10.1038/156783b0
- The relation of speech and gestures: Temporal synchrony follows semantic synchrony. In Proceedings of the Workshop on Gesture and Speech in Interaction (GeSpIn ’11). https://pub.uni-bielefeld.de/record/2392953
- Kirsten Bergmann and Stefan Kopp. 2009. GNetIc – Using Bayesian decision networks for iconic gesture generation. In Proceedings of the International Conference on Intelligent Virtual Agents (IVA ’09). 76–89. https://doi.org/10.1007/978-3-642-04380-2_12
- Individualized gesturing outperforms average gesturing – Evaluating gesture production in virtual humans. In Proceedings of the International Conference on Intelligent Virtual Agents (IVA ’10). 104–117. https://doi.org/10.1007/978-3-642-15892-6_11
- Speech2AffectiveGestures: Synthesizing co-speech gestures with generative adversarial affective expression learning. In Proceedings of the ACM International Conference on Multimedia (MM ’21). 2027–2036. https://doi.org/10.1145/3474085.3475223
- Alan W. Black and Keiichi Tokuda. 2005. The Blizzard Challenge – 2005: Evaluating corpus-based speech synthesis on common datasets. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech ’05). 77–80. https://doi.org/10.21437/Interspeech.2005-72
- Yochai Blau and Tomer Michaeli. 2018. The perception-distortion tradeoff. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR ’18). 6228–6237. https://doi.org/10.1109/CVPR.2018.00652
- Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5 (2017), 135–146. https://doi.org/10.1162/tacl_a_00051
- Hans Rutger Bosker and David Peeters. 2021. Beat gestures influence which speech sounds you hear. P. Roy. Soc. B 288 (2021), 20202419. https://doi.org/10.1098/rspb.2020.2419
- Affect-expressive hand gestures synthesis and animation. In Proceedings of the International Conference on Multimedia and Expo (ICME ’15). 1–6. https://doi.org/10.1109/ICME.2015.7177478
- Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS ’20). 1877–1901. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
- Michael Büttner and Simon Clavet. 2015. Motion matching – the road to next gen animation. In Proceedings of Nucl.ai. https://youtu.be/z_wpgHFSWss
- BEAT: The behavior expression animation toolkit. In Proceedings of the Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’01). 477–486. https://doi.org/10.1145/383259.383315
- The IVI Lab entry to the GENEA Challenge 2022 – A Tacotron2 based method for co-speech gesture generation with locality-constraint attention mechanism. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’22). 784–789. https://doi.org/10.1145/3536221.3558060
- Marcela Charfuelan and Ingmar Steiner. 2013. Expressive speech synthesis in MARY TTS using audiobook data and EmotionML. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech ’13). 1564–1568. https://doi.org/10.21437/Interspeech.2013-395
- ChoreoMaster: Choreography-oriented music-driven dance synthesis. ACM Trans. Graph. 40, 4, Article 145 (2021), 13 pages. https://doi.org/10.1145/3450626.3459932
- WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signa. 16, 6 (2022), 1505–1518. https://doi.org/10.1109/JSTSP.2022.3188113
- Predicting co-verbal gestures: A deep and temporal modeling approach. In Proceedings of the International Conference on Intelligent Virtual Agents (IVA ’15). 152–166. https://doi.org/10.1007/978-3-319-21996-7_17
- Steven B. Davis and Paul Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE T. Acoust. Speech 28, 4 (1980), 357–366. https://doi.org/10.1109/TASSP.1980.1163420
- BERT: Pre-training of deep bidirectional Transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL ’19). 4171–4186. https://doi.org/10.18653/v1/N19-1423
- European Broadcasting Union. 2020. Loudness normalisation and permitted maximum level of audio signals. EBU Recommendation EBU R 128v4. https://tech.ebu.ch/docs/r/r128.pdf
- Ylva Ferstl and Rachel McDonnell. 2018. Investigating the use of recurrent motion modelling for speech gesture generation. In Proceedings of the ACM International Conference on Intelligent Virtual Agents (IVA ’18). 93–98. https://doi.org/10.1145/3267851.3267898
- ExpressGesture: Expressive gesture generation from speech through database matching. Comput. Animat. Virt. W. 32, 3–4 (2021), e2016. https://doi.org/10.1002/cav.2016
- Exemplar-based stylized gesture generation from speech: An entry to the GENEA Challenge 2022. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’22). 778–783. https://doi.org/10.1145/3536221.3558068
- Using pupil dilation to measure cognitive load when listening to text-to-speech in quiet and in noise. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech ’19). 1551–1555. https://doi.org/10.21437/Interspeech.2019-1783
- F. Sebastian Grassia. 1998. Practical parameterization of rotations using the exponential map. J. Graph. Tools 3, 3 (1998), 29–48. https://doi.org/10.1080/10867651.1998.10487493
- Gerald J. Hahn and William Q. Meeker. 1991. Statistical Intervals: A Guide for Practitioners. Vol. 92. John Wiley & Sons. https://doi.org/10.1002/9780470316771
- Evaluating data-driven co-speech gestures of embodied conversational agents through real-time interaction. In Proceedings of the ACM International Conference on Intelligent Virtual Agents (IVA ’22). Article 8, 8 pages. https://doi.org/10.1145/3514197.3549697
- Zhiyuan He. 2022. Automatic quality assessment of speech-driven synthesized gestures. Int. J. Comput. Games Tech. 2022, Article 1828293 (2022), 11 pages. https://doi.org/10.1155/2022/1828293
- GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NIPS ’17). https://proceedings.neurips.cc/paper/2017/file/8a1d694707eb0fefe65871369074926d-Paper.pdf
- Processing language in face-to-face conversation: Questions with gestures get faster responses. Psychon. B. Rev. 25, 5 (2018), 1900–1908. https://doi.org/10.3758/s13423-017-1363-z
- Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 2 (1979), 65–70. https://www.jstor.org/stable/4615733
- The VoiceMOS Challenge 2022. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech ’22). 4536–4540. https://doi.org/10.21437/Interspeech.2022-970
- International Telecommunication Union, Telecommunication Standardisation Sector. 1996. Methods for subjective determination of transmission quality. Recommendation ITU-T P.800. https://www.itu.int/rec/T-REC-P.800-199608-I
- A speech-driven hand gesture generation method and evaluation in android robots. IEEE Robot. Autom. Lett. 3, 4 (2018), 3757–3764. https://doi.org/10.1109/LRA.2018.2856281
- Generating body motions using spoken language in dialogue. In Proceedings of the International Conference on Intelligent Virtual Agents (IVA ’18). 87–92. https://doi.org/10.1145/3267851.3267866
- Let’s face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings. In Proceedings of the ACM International Conference on Intelligent Virtual Agents (IVA ’20). Article 31, 8 pages. https://doi.org/10.1145/3383652.3423911
- Can we trust online crowdworkers? Comparing online and offline participants in a preference test of virtual agents. In Proceedings of the ACM International Conference on Intelligent Virtual Agents (IVA ’20). Article 30, 8 pages. https://doi.org/10.1145/3383652.3423860
- HEMVIP: Human evaluation of multiple videos in parallel. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’21). 707–711. https://doi.org/10.1145/3462244.3479957
- TransGesture: Autoregressive gesture generation with RNN-transducer. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’22). 753–757. https://doi.org/10.1145/3536221.3558061
- Maurice G. Kendall. 1970. Rank Correlation Methods (4 ed.). Charles Griffin & Co.
- Simon King. 2014. Measuring a decade of progress in text-to-speech. Loquens 1, 1, Article e006 (2014), 12 pages. https://doi.org/10.3989/loquens.2014.006
- ReCell: replicating recurrent cell for auto-regressive pose generation. In Companion publication of the ACM International Conference on Multimodal Interaction (ICMI ’22 Companion). 94–97. https://doi.org/10.1145/3536220.3558801
- Audio and text-driven approach for conversational gestures generation. In Proceedings of Computational Linguistics and Intellectual Technologies (DIALOGUE ’21). https://doi.org/10.28995/2075-7182-2021-20-425-432
- Motion graphs. ACM Trans. Graph. 21, 3 (2002), 473–482. https://doi.org/10.1145/566654.566605
- Analyzing input and output representations for speech-driven gesture generation. In Proceedings of the ACM International Conference on Intelligent Virtual Agents (IVA ’19). 97–104. https://doi.org/10.1145/3308532.3329472
- Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation. Int. J. Hum.–Comput. Int. (2021), 1300–1316. https://doi.org/10.1080/10447318.2021.1883883
- Gesticulator: A framework for semantically-aware speech-driven gesture generation. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’20). 242–250. https://doi.org/10.1145/3382507.3418815
- A large, crowdsourced evaluation of gesture generation systems on common data: The GENEA Challenge 2020. In Proceedings of the ACM Annual Conference on Intelligent User Interfaces (IUI ’21). 11–21. https://doi.org/10.1145/3397481.3450692
- Speech2Properties2Gestures: Gesture-property prediction as a tool for generating representational gestures from speech. In Proceedings of the ACM International Conference on Intelligent Virtual Agents (IVA ’21). 145–147. https://doi.org/10.1145/3472306.3478333
- Multimodal analysis of the predictability of hand-gesture properties. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS ’22). 770–779. https://doi.org/10.5555/3535850.3535937
- The GENEA Challenge 2023: A large-scale evaluation of gesture generation models in monadic and dyadic settings. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’23). 792–801. https://doi.org/10.1145/3577190.3616120
- Quoc Anh Le and Catherine Pelachaud. 2012. Evaluating an expressive gesture model for a humanoid robot: Experimental results. https://www.researchgate.net/publication/268257868_Evaluating_an_Expressive_Gesture_Model_for_a_Humanoid_Robot_Experimental_Results
- Talking With Hands 16.2M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV ’19). 763–772. https://doi.org/10.1109/ICCV.2019.00085
- Interactive control of avatars animated with human motion data. ACM Trans. Graph. 21, 3 (2002), 491–500. https://doi.org/10.1145/566654.566607
- Gesture controllers. ACM Trans. Graph. 29, 4, Article 124 (2010), 11 pages. https://doi.org/10.1145/1778765.1778861
- Neural speech synthesis with Transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI ’19, Vol. 33). 6706–6713. https://doi.org/10.1609/aaai.v33i01.33016706
- AI Choreographer: Music conditioned 3D dance generation with AIST++. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV ’21). 13401–13412. https://doi.org/10.1109/ICCV48922.2021.01315
- SEEG: Semantic energized co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR ’22). 10473–10482. https://doi.org/10.1109/CVPR52688.2022.01022
- BEAT: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. In Proceedings of the European Conference on Computer Vision (ECCV ’22). 612–630. https://doi.org/10.1007/978-3-031-20071-7_36
- Learning hierarchical cross-modal association for co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR ’22). 10462–10472. https://doi.org/10.1109/CVPR52688.2022.01021
- Speech-based gesture generation for robots and embodied agents: A scoping review. In Proceedings of the International Conference on Human-Agent Interaction (HAI ’21). 31–38. https://doi.org/10.1145/3472307.3484167
- Double-DCCCAE: Estimation of body gestures from speech waveform. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’21). 900–904. https://doi.org/10.1109/ICASSP39728.2021.9414660
- Shuhong Lu and Andrew Feng. 2022. The DeepMotion entry to the GENEA Challenge 2022. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’22). 790–796. https://doi.org/10.1145/3536221.3558059
- Recommended tests for association in 2×2 tables. Stat. Med. 28, 7 (2009), 1159–1175. https://doi.org/10.1002/sim.3531
- Objective evaluation metric for motion generative models: Validating Fréchet motion distance on foot skating and over-smoothing artifacts. In Proceedings of the ACM SIGGRAPH Conference on Motion, Interaction and Games (MIG ’23). Article 2, 11 pages. https://doi.org/10.1145/3623264.3624443
- Modern speech synthesis for phonetic sciences: a discussion and an evaluation. In Proceedings of the International Congress of Phonetic Sciences (ICPhS ’19). 487–491. https://doi.org/10.31234/osf.io/dxvhc
- Harry McGurk and John MacDonald. 1976. Hearing lips and seeing voices. Nature 264, 5588 (1976), 746–748. https://doi.org/10.1038/264746a0
- David McNeill. 1992. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press. https://doi.org/10.1177/002383099403700208
- Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis. In Proceedings of the ISCA Speech Synthesis Workshop (SSW ’23). https://openreview.net/forum?id=PCZ16_vl_ee
- NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99–106. https://doi.org/10.1145/3503250
- Gabriel Mittag and Sebastian Möller. 2020. Deep learning based assessment of synthetic speech naturalness. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech ’20). 1748–1752. https://doi.org/10.21437/Interspeech.2020-2382
- Comparison of approaches for instrumentally predicting the quality of text-to-speech systems. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech ’10). 1325–1328. https://doi.org/10.21437/Interspeech.2010-413
- Gretchen Montgomery and Yan Bing Zhang. 2018. Intergroup anxiety and willingness to accommodate: Exploring the effects of accent stereotyping and social attraction. J. Lang. Soc. Psychol. 37, 3 (2018), 330–349. https://doi.org/10.1177/0261927X17728361
- Pietro Morasso. 1981. Spatial control of arm movements. Exp. Brain Res. 42, 2 (1981), 223–227. https://doi.org/10.1007/BF00236911
- Mikhail S. Nikulin. 2001. Hellinger distance. In Encyclopedia of Mathematics. Springer. http://encyclopediaofmath.org/index.php?title=Hellinger_distance Accessed: 2021-01-31.
- A comprehensive review of data-driven co-speech gesture generation. Comput. Graph. Forum 42, 2 (2023), 569–596. https://doi.org/10.1111/cgf.14776
- CGVU: Semantics-guided 3D body gesture synthesis. In Proceedings of the GENEA Workshop (GENEA ’20). https://doi.org/10.5281/zenodo.4090878
- GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’14). 1532–1543. https://doi.org/10.3115/v1/D14-1162
- Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML ’21). 8748–8763. https://proceedings.mlr.press/v139/radford21a.html
- Multi-task self-supervised learning for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’20). 6989–6993. https://doi.org/10.1109/ICASSP40776.2020.9053569
- Passing a non-verbal Turing test: Evaluating gesture animations generated from speech. In Proceedings of the IEEE Conference on Virtual Reality and 3D User Interfaces (VR ’21). 573–581. https://doi.org/10.1109/VR50410.2021.00082
- A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech ’15). 1586–1590. https://doi.org/10.21437/Interspeech.2015-368
- Najmeh Sadoughi and Carlos Busso. 2019. Speech-driven animation with meaningful behaviors. Speech Commun. 110 (2019), 90–100. https://doi.org/10.1016/j.specom.2019.04.005
- Khaled Saleh. 2022. Hybrid seq2seq architecture for 3D co-speech gesture generation. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’22). 748–752. https://doi.org/10.1145/3536221.3558064
- To err is human(-like): Effects of robot gesture on perceived anthropomorphism and likability. Int. J. Soc. Robot. 5, 3 (2013), 313–323. https://doi.org/10.1007/s12369-013-0196-9
- Generation and evaluation of communicative robot gesture. Int. J. Soc. Robot. 4, 2 (2012), 201–217. https://doi.org/10.1007/s12369-011-0124-9
- A friendly gesture: Investigating the effect of multimodal robot behavior in human-robot interaction. In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN ’11). 247–252. https://doi.org/10.1109/ROMAN.2011.6005285
- SynFace—Speech-driven facial animation for virtual speech-reading support. EURASIP J. Audio Spee. 2009, Article 191940 (2009), 10 pages. https://doi.org/10.1155/2009/191940
- Synthetic speech detection using phase information. Speech Commun. 81 (2016), 30–41. https://doi.org/10.1016/j.specom.2016.04.001
- Carolyn Saund and Stacy Marsella. 2021. The importance of qualitative elements in subjective evaluation of semantic gestures. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG ’21). 1–8. https://doi.org/10.1109/FG52635.2021.9667023
- Pranab Kumar Sen. 1968. Estimates of the regression coefficient based on Kendall’s tau. J. Am. Stat. Assoc. 63, 324 (1968), 1379–1389. https://doi.org/10.1080/01621459.1968.10480934
- Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’18). 4779–4783. https://doi.org/10.1109/ICASSP.2018.8461368
- Generation of gestures during presentation for humanoid robots. In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN ’18). 961–968. https://doi.org/10.1109/ROMAN.2018.8525621
- Evaluating expressive speech synthesis from audiobooks in conversational phrases. In Proceedings of the International Conference on Language Resources and Evaluation (LREC ’12). 3335–3339. https://aclanthology.org/L12-1513/
- Deep gesture generation for social robots using type-specific libraries. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS ’22). 8286–8291. https://doi.org/10.1109/IROS47612.2022.9981734
- Speech gesture generation from acoustic and textual information using LSTMs. In Proceedings of the International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON ’21). 718–723. https://doi.org/10.1109/ECTI-CON51831.2021.9454931
- Henri Theil. 1992. A rank-invariant method of linear and polynomial regression analysis. In Henri Theil’s Contributions to Economics and Econometrics: Econometric Theory and Methodology, Baldev Raj and Johan Koerts (Eds.). Springer, 345–381. https://doi.org/10.1007/978-94-011-2546-8_20
- Bruce Thompson. 1984. Canonical Correlation Analysis: Uses and Interpretation. Vol. 47. Sage. https://uk.sagepub.com/en-gb/eur/book/canonical-correlation-analysis
- CLIC 2020: Overview and analysis of the competition results. https://youtu.be/iXzgFrRWNEg Accessed: 2024-03-27.
- Formation and control of optimal trajectory in human multijoint arm movement. Biol. Cybern. 61, 2 (1989), 89–101. https://doi.org/10.1007/BF00204593
- WaveNet: A generative model for raw audio. arXiv:1609.03499
- Gesture and speech in interaction: An overview. Speech Commun. 57 (2014), 209–232. https://doi.org/10.1016/j.specom.2013.09.008
- SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems (NeurIPS ’19). https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html
- Integrated speech and gesture synthesis. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’21). 177–185. https://doi.org/10.1145/3462244.3479914
- UEA Digital Humans entry to the GENEA Challenge 2022. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’22). 802–810. https://doi.org/10.1145/3577190.3616116
- Stephen J. Winters and David B. Pisoni. 2004. Perception and comprehension of synthetic speech. In Research on Spoken Language Processing Progress Report No. 26. Speech Research Laboratory, Department of Psychology, Indiana University, Bloomington, IN, 95–138. https://citeseerx.ist.psu.edu/pdf/8e10a4c4d279e9540cd5af5aae692fe9907409ff
- To rate or not to rate: Investigating evaluation methods for generated co-speech gestures. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’21). 494–502. https://doi.org/10.1145/3462244.3479889
- “Am I listening?”, Evaluating the quality of generated data-driven listening motion. In Companion publication of the ACM International Conference on Multimodal Interaction (ICMI ’23 Companion). 6–10. https://doi.org/10.1145/3610661.3617160
- Should beat gestures be learned or designed? A benchmarking user study. In Proceedings of the ICDL-EpiRob Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions (ICDL-EpiRob ’19 Workshop). 4 pages. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-255998
- A review of evaluation practices of gesture generation in embodied conversational agents. IEEE T. Hum.-Mach. Syst. 52, 3 (2022), 379–389. https://doi.org/10.1109/THMS.2022.3149173
- Jieyeon Woo. 2021. Development of an interactive human/agent loop using multimodal recurrent neural networks. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’21). 822–826. https://doi.org/10.1145/3462244.3481275
- The ReprGesture entry to the GENEA Challenge 2022. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’22). 758–763. https://doi.org/10.1145/3536221.3558066
- Gesture2Vec: Clustering gestures using representation learning methods for co-speech gesture generation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS ’22). 3100–3107. https://doi.org/10.1109/IROS47612.2022.9981117
- Audio-driven stylized gesture generation with flow-based model. In Proceedings of the European Conference on Computer Vision (ECCV ’22). 712–728. https://doi.org/10.1007/978-3-031-20065-6_41
- Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Trans. Graph. 39, 6, Article 222 (2020), 16 pages. https://doi.org/10.1145/3414685.3417838
- Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA ’19). 4303–4309. https://doi.org/10.1109/ICRA.2019.8793720
- SGToolkit: An interactive gesture authoring toolkit for embodied conversational agents. In Proceedings of the Annual ACM Symposium on User Interface Software and Technology (UIST ’21). 826–840. https://doi.org/10.1145/3472749.3474789
- The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’22). 736–747. https://doi.org/10.1145/3536221.3558058
- A hierarchical predictor of synthetic speech naturalness using neural networks. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech ’16). 342–346. https://doi.org/10.21437/Interspeech.2016-847
- DiffMotion: Speech-driven gesture synthesis using denoising diffusion model. In Proceedings of the International Conference on Multimedia Modeling (MMM ’23). 231–242. https://doi.org/10.1007/978-3-031-27077-2_18
- Mode-adaptive neural networks for quadruped motion control. ACM Trans. Graph. 37, 4, Article 145 (2018), 11 pages. https://doi.org/10.1145/3197517.3201366
- NTIRE 2020 challenge on perceptual extreme super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR ’20 Workshop). 492–493. https://doi.org/10.1109/CVPRW50498.2020.00254
- GestureMaster: Graph-based speech-driven gesture generation. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’22). 764–770. https://doi.org/10.1145/3536221.3558063
- On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR ’19). 5745–5753. https://doi.org/10.1109/CVPR.2019.00589