QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation (2403.11626v1)
Abstract: Music-conditioned dance generation is a novel and challenging visual generation task: given a piece of music and seed motions, the goal is to generate natural dance movements for the subsequent music. Transformer-based methods struggle with time-series prediction tasks involving human movement and music because they have difficulty capturing their nonlinear relationships and temporal structure, which can lead to joint deformation, character drift, floating, and dance movements that are inconsistent with the music. In this paper, we propose a Quaternion-Enhanced Attention Network (QEAN) for visual dance synthesis from a quaternion perspective, consisting of a Spin Position Embedding (SPE) module and a Quaternion Rotary Attention (QRA) module. First, SPE embeds position information into self-attention in a rotational manner, improving the learning of features from motion and audio sequences and the understanding of the connection between music and dance. Second, QRA represents and fuses 3D motion features and audio features as series of quaternions, enabling the model to better learn the temporal coordination of music and dance under the complex temporal-cycle conditions of dance generation. Finally, experiments on the AIST++ dataset show that our approach generates accurate, high-quality dance movements with better and more robust performance. Our source code and the dataset are available at https://github.com/MarasyZZ/QEAN and https://google.github.io/aistplusplus_dataset respectively.
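The two modules described in the abstract build on two standard primitives: rotary position embedding (rotating feature pairs by position-dependent angles, as in RoFormer) and quaternion algebra (the Hamilton product). The sketch below is a minimal NumPy illustration of these primitives only; the function names are hypothetical and this is not the paper's actual SPE/QRA implementation.

```python
import numpy as np

def rotary_embed(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles
    (RoFormer-style rotary embedding; a sketch, not the paper's SPE module).

    x: 1-D feature vector of even length; pos: integer token position.
    """
    d = x.shape[0]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per feature pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin             # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

def hamilton(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z) arrays."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])
```

Rotating query/key vectors this way preserves their norms and makes their inner product depend only on the relative position of the two tokens, which is the property a rotary embedding injects into self-attention; the Hamilton product is the basic operation for composing quaternion-valued features.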
Authors: Zhizhen Zhou, Yejing Huo, Guoheng Huang, An Zeng, Xuhang Chen, Lian Huang, Zinuo Li