
Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment (2403.18811v1)

Published 27 Mar 2024 in cs.CV, cs.GR, cs.SD, and eess.AS

Abstract: We introduce a novel task within the field of 3D dance generation, termed dance accompaniment, which necessitates the generation of responsive movements from a dance partner, the "follower", synchronized with the lead dancer's movements and the underlying musical rhythm. Unlike existing solo or group dance generation tasks, a duet dance scenario entails a heightened degree of interaction between the two participants, requiring delicate coordination in both pose and position. To support this task, we first build a large-scale and diverse duet interactive dance dataset, DD100, by recording about 117 minutes of professional dancers' performances. To address the challenges inherent in this task, we propose a GPT-based model, Duolando, which autoregressively predicts the subsequent tokenized motion conditioned on the coordinated information of the music, the leader's and the follower's movements. To further enhance the GPT's capabilities of generating stable results on unseen conditions (music and leader motions), we devise an off-policy reinforcement learning strategy that allows the model to explore viable trajectories from out-of-distribution samplings, guided by human-defined rewards. Based on the collected dataset and proposed method, we establish a benchmark with several carefully designed metrics.


Summary

  • The paper introduces Duolando, a novel model that leverages GPT with off-policy reinforcement learning to generate synchronized dance follower movements.
  • It utilizes the DD100 dataset, featuring 117 minutes of MoCap-recorded performances across 10 dance genres to train and benchmark the system.
  • The method enhances stability by defining human-understandable rewards, effectively reducing artifacts such as foot skating in the generated motions.

An Overview of Duolando: Leveraging GPT with Off-Policy Reinforcement Learning for Dance Accompaniment

Introduction

The art of duet dancing, particularly in scenarios such as ballroom dancing, entails synchronized coordination between two partners, commonly referred to as the leader and the follower. Computational models that enable virtual agents to accompany a human dancer have considerable practical value in augmented and virtual reality applications. Addressing this, the paper introduces Duolando, a model for generating the responsive movements of a dance follower. At its core, Duolando is a Generative Pretrained Transformer (GPT) enhanced with off-policy reinforcement learning, which generates the follower's motion conditioned on the music and the leader's movements.

DD100: A Comprehensive Duet Dance Dataset

A significant contribution of this research is the DD100 dataset, the first of its kind, designed specifically for the duet dance accompaniment task. It comprises approximately 117 minutes of professional dance performances spanning 10 dance genres, recorded with high-end motion capture (MoCap) equipment to ensure accurate and rich motion data. The dataset serves as the training and benchmarking foundation for this paper and, more broadly, advances research on interactive dance generation.

The Duolando Framework

Duolando addresses the nuanced task of generating follower movements that are both rhythmically coherent with the music and spatially coordinated with the leader. The architecture operates in two stages: VQ-VAEs first tokenize the motion data into quantized sequences; a GPT-based network then autoregressively predicts subsequent motion tokens conditioned on the input music, the leader's movements, and the follower's preceding tokens. A notable feature of Duolando is an off-policy reinforcement learning strategy that improves the model's ability to generate plausible sequences under unseen conditions.
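
To make this two-stage design concrete, the sketch below outlines it in PyTorch. All dimensions, layer counts, and the summation-based fusion of the three conditioning streams are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MotionVQVAE(nn.Module):
    """Stage 1: compress continuous motion into discrete codebook tokens."""
    def __init__(self, motion_dim=72, latent_dim=256, codebook_size=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(motion_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, motion_dim))
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def tokenize(self, motion):                      # motion: (B, T, motion_dim)
        z = self.encoder(motion)                     # (B, T, latent_dim)
        # Nearest-neighbor lookup against the codebook yields token ids.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        return dists.argmin(dim=-1)                  # (B, T)

    def detokenize(self, tokens):                    # tokens: (B, T)
        return self.decoder(self.codebook(tokens))   # (B, T, motion_dim)

class FollowerGPT(nn.Module):
    """Stage 2: autoregressively predict the follower's next motion token,
    conditioned on music features, leader tokens, and past follower tokens."""
    def __init__(self, codebook_size=1024, music_dim=54, d_model=512, n_layers=6):
        super().__init__()
        self.follower_emb = nn.Embedding(codebook_size, d_model)
        self.leader_emb = nn.Embedding(codebook_size, d_model)
        self.music_proj = nn.Linear(music_dim, d_model)
        block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, music, leader_tokens, follower_tokens):
        # Fuse the three conditioning streams by summation at each timestep
        # (a simple illustrative choice, not necessarily the paper's).
        x = (self.music_proj(music)
             + self.leader_emb(leader_tokens)
             + self.follower_emb(follower_tokens))
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.blocks(x, mask=causal)
        return self.head(h)                          # next-token logits
```

At inference time, follower tokens would be sampled one step at a time from these logits and decoded back into continuous motion with the VQ-VAE decoder.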

Enhanced with Off-Policy Reinforcement Learning

A pivotal aspect of Duolando is its use of off-policy reinforcement learning to address the instability that surfaces under out-of-distribution inputs. Explicit, human-understandable rewards guide the model to explore feasible movement trajectories, so that the generated follower motions remain natural and coordinated with the leader without introducing artifacts such as foot skating.
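
As one concrete illustration of such a human-understandable reward, the sketch below penalizes foot skating (horizontal foot motion while the foot is in ground contact) and deviation from a plausible partner distance. The contact threshold, target distance, term weights, and choice of terms are assumptions for illustration, not the paper's exact reward definitions.

```python
import torch

def foot_skating_penalty(foot_pos, foot_height, contact_height=0.05):
    """Penalize horizontal foot motion while the foot is near the ground.

    foot_pos:    (T, 2) horizontal (x, z) foot positions per frame
    foot_height: (T,)   foot height above the floor per frame
    The 5 cm contact threshold is an illustrative assumption.
    """
    velocity = (foot_pos[1:] - foot_pos[:-1]).norm(dim=-1)    # (T-1,)
    in_contact = (foot_height[:-1] < contact_height).float()  # (T-1,)
    return -(velocity * in_contact).mean()

def follower_reward(follower_joints, leader_joints, foot_pos, foot_height):
    """Combine hand-crafted terms into one scalar reward.

    follower_joints, leader_joints: (T, J, 3) joint positions per frame,
    with joint 0 taken as the root. Weights and targets are assumptions.
    """
    # Keep the partners at a plausible duet distance (0.8 m target assumed).
    root_dist = (follower_joints[:, 0] - leader_joints[:, 0]).norm(dim=-1)
    distance_term = -(root_dist - 0.8).abs().mean()
    skate_term = foot_skating_penalty(foot_pos, foot_height)
    return 1.0 * skate_term + 0.5 * distance_term
```

Because the strategy is off-policy, trajectories sampled from the model, including out-of-distribution ones, can be scored with such rewards and stored for later updates, rather than requiring fresh on-policy rollouts for every gradient step.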

Establishing a New Benchmark

Building on the DD100 dataset and the Duolando model, the paper establishes a comprehensive benchmark with an array of carefully designed metrics. These metrics evaluate the intrinsic quality of the generated follower movements, the interaction dynamics between the dance partners, and their alignment with the underlying musical rhythm. The benchmark is expected to propel research on interactive dance generation forward and lay a foundation for future work.
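
To give a flavor of the rhythm-alignment side of such metrics, the sketch below implements a beat-alignment score in the style used by prior music-to-dance work: kinematic beats (local minima of joint speed) are matched to musical beats with a Gaussian kernel. This is a representative formulation, not necessarily the paper's exact metric.

```python
import numpy as np

def beat_align_score(music_beats, motion_joints, sigma=3):
    """Gaussian-kernel alignment between musical and kinematic beats.

    music_beats:   frame indices of musical beats, shape (M,)
    motion_joints: joint positions per frame, shape (T, J, 3)
    """
    # Mean joint speed per frame; kinematic beats are its local minima.
    speed = np.linalg.norm(np.diff(motion_joints, axis=0), axis=-1).mean(axis=-1)
    kin_beats = np.array([t for t in range(1, len(speed) - 1)
                          if speed[t] < speed[t - 1] and speed[t] < speed[t + 1]])
    if kin_beats.size == 0:
        return 0.0
    # Each music beat scores highly if some kinematic beat lands nearby.
    scores = [np.exp(-((kin_beats - b) ** 2).min() / (2 * sigma ** 2))
              for b in music_beats]
    return float(np.mean(scores))
```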

Concluding Remarks

The paper marks a significant step toward understanding and synthesizing the intricate human-human interaction inherent in duet dancing. By proposing the Duolando model and releasing the DD100 dataset, this work presents a novel approach to dance accompaniment and sets a new benchmark for evaluating dance interaction models. Its implications extend beyond the academic domain, with potentially transformative impact on interactive entertainment and training applications in virtual environments.
