MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence
Abstract: Motion-to-music and music-to-motion generation have been studied separately, each attracting substantial research interest within its own domain. The interaction between human motion and music reflects advanced human intelligence, and establishing a unified relationship between the two is particularly important. To date, however, no work has considered them jointly to explore the alignment between the two modalities. To bridge this gap, we propose a novel framework, termed MoMu-Diffusion, for long-term and synchronous motion-music generation. First, to mitigate the high computational cost incurred by long sequences, we propose a novel Bidirectional Contrastive Rhythmic Variational Auto-Encoder (BiCoR-VAE) that extracts modality-aligned latent representations for both motion and music inputs. Then, leveraging the aligned latent spaces, we introduce a multi-modal Transformer-based diffusion model and a cross-guidance sampling strategy to enable various generation tasks, including cross-modal, multi-modal, and variable-length generation. Extensive experiments demonstrate that MoMu-Diffusion surpasses recent state-of-the-art methods both qualitatively and quantitatively, and can synthesize realistic, diverse, long-term, and beat-matched music or motion sequences. The generated samples and code are available at https://momu-diffusion.github.io/
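The abstract says BiCoR-VAE learns modality-aligned latents with a bidirectional contrastive objective, but it does not give the loss. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE loss over paired motion and music latents. The function names, the cosine similarity, the temperature value, and the toy vectors are all assumptions made for this example; the paper's actual objective, including any rhythm-specific terms, may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy where candidates[i] is the positive for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def bidirectional_contrastive_loss(motion_latents, music_latents, temperature=0.1):
    """Average of the motion-to-music and music-to-motion InfoNCE terms."""
    motion_to_music = info_nce(motion_latents, music_latents, temperature)
    music_to_motion = info_nce(music_latents, motion_latents, temperature)
    return 0.5 * (motion_to_music + music_to_motion)

# Toy latents (hypothetical): row i of each list comes from the same clip.
motion = [[1.0, 0.0], [0.0, 1.0]]
music_aligned = [[0.9, 0.1], [0.1, 0.9]]
music_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = bidirectional_contrastive_loss(motion, music_aligned)
loss_swapped = bidirectional_contrastive_loss(motion, music_swapped)
```

In this toy setup the loss is lower when each motion latent sits closest to the music latent from the same clip, which is the behavior a contrastive alignment term is meant to reward.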