Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment (2403.18811v1)
Abstract: We introduce a new task in 3D dance generation, dance accompaniment, which requires generating responsive movements for a dance partner, the "follower", synchronized with both the lead dancer's movements and the underlying musical rhythm. Unlike existing solo or group dance generation tasks, a duet scenario involves a much higher degree of interaction between the two participants, demanding delicate coordination in both pose and position. To support this task, we first build DD100, a large-scale and diverse duet interactive dance dataset, by recording about 117 minutes of professional dancers' performances. To address the challenges of this task, we propose Duolando, a GPT-based model that autoregressively predicts the next tokenized follower motion conditioned on coordinated information from the music, the leader's movements, and the follower's own past movements. To further improve the GPT's ability to generate stable results under unseen conditions (music and leader motions), we devise an off-policy reinforcement learning strategy that lets the model explore viable trajectories from out-of-distribution samples, guided by human-defined rewards. Based on the collected dataset and proposed method, we establish a benchmark with several carefully designed metrics.
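The two ideas the abstract combines can be illustrated with a minimal sketch: (1) quantizing continuous follower poses into discrete tokens via a learned codebook (VQ-style), and (2) improving an autoregressive next-token policy with a reward-weighted, importance-sampled off-policy update, where trajectories come from a fixed behavior policy rather than the current one. All names, dimensions, and the reward function below are hypothetical toy stand-ins, not the paper's actual architecture or rewards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "choreographic memory": 8 codes over 4-D pose features (hypothetical sizes).
CODEBOOK = rng.normal(size=(8, 4))

def tokenize(pose):
    """Map a continuous pose vector to its nearest codebook index (VQ-style)."""
    return int(np.argmin(np.linalg.norm(CODEBOOK - pose, axis=1)))

def policy_probs(prev_token, theta):
    """Softmax over next-token logits, conditioned only on the previous token
    (a tiny stand-in for the full GPT's autoregressive conditioning)."""
    z = theta[prev_token]
    e = np.exp(z - z.max())
    return e / e.sum()

def reward(token, leader_token):
    """Hypothetical human-defined reward: favor tokens matching the leader."""
    return 1.0 if token == leader_token else -0.1

def off_policy_update(theta, behavior_theta, leader_tokens, lr=0.5, steps=200):
    """Reward-weighted policy-gradient update on trajectories sampled from a
    *fixed* behavior policy, corrected by importance weights (off-policy)."""
    for _ in range(steps):
        prev = 0
        for lead in leader_tokens:
            b = policy_probs(prev, behavior_theta)   # behavior policy samples
            a = rng.choice(8, p=b)                   # out-of-distribution w.r.t. theta
            p = policy_probs(prev, theta)
            w = p[a] / b[a]                          # importance weight
            grad = -p
            grad[a] += 1.0                           # gradient of log p[a]
            theta[prev] += lr * w * reward(a, lead) * grad
            prev = a
    return theta

# A short leader token sequence and an update of the follower policy.
logits = np.zeros((8, 8))                            # logits[prev] -> next-token scores
leader_seq = [tokenize(CODEBOOK[i] + 0.01) for i in (2, 5, 5, 3)]
theta = off_policy_update(logits.copy(), logits, leader_seq)
```

The key off-policy ingredient is the importance weight `w = p[a] / b[a]`: actions are drawn from the behavior policy (here, the uniform initial logits), yet the update still estimates the gradient for the current policy, which is what allows learning from trajectories the current policy would not itself have sampled.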