
QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation (2403.11626v1)

Published 18 Mar 2024 in cs.GR, cs.AI, cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: Music-driven dance generation is a novel and challenging image generation task. Given a piece of music and seed motions, the goal is to generate natural dance movements for the subsequent music. Transformer-based methods face challenges in time series prediction tasks involving human movement and music because they struggle to capture nonlinear relationships and temporal structure. This can lead to issues such as joint deformation, role deviation, floating, and inconsistencies between the generated dance movements and the music. In this paper, we propose a Quaternion-Enhanced Attention Network (QEAN) for visual dance synthesis from a quaternion perspective. It consists of a Spin Position Embedding (SPE) module and a Quaternion Rotary Attention (QRA) module. First, SPE embeds position information into self-attention in a rotational manner, which leads to better learning of the features of movement and audio sequences and an improved understanding of the connection between music and dance. Second, QRA represents and fuses 3D motion features and audio features as a series of quaternions, enabling the model to better learn the temporal coordination of music and dance under the complex temporal cycle conditions of dance generation. Finally, we conducted experiments on the AIST++ dataset, and the results show that our approach achieves better and more robust performance in generating accurate, high-quality dance movements. Our source code and the dataset are available at https://github.com/MarasyZZ/QEAN and https://google.github.io/aistplusplus_dataset, respectively.
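The abstract says that SPE embeds position information into self-attention "in a rotational manner" but gives no formula. As a rough illustration of that general idea, the sketch below implements a standard rotary position embedding in plain Python. This is an assumption about what a rotational embedding can look like, not the authors' SPE module. The function names, the frequency base of 10000, and the example vectors are all hypothetical. It demonstrates one well-known property of such rotations: the attention score between a rotated query and a rotated key depends only on their relative offset.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Rotate each consecutive (even, odd) pair of `vec` by an angle
    proportional to `position`. Hypothetical helper, not the paper's SPE."""
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        # Standard 2D rotation of the pair (x, y) by theta.
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))


# Made-up query and key features, for illustration only.
query = [0.3, -1.2, 0.8, 0.5]
key = [1.0, 0.4, -0.7, 0.2]

# Two (query position, key position) pairs with the same offset of 3.
score_a = dot(rotate_pairs(query, 5), rotate_pairs(key, 2))
score_b = dot(rotate_pairs(query, 13), rotate_pairs(key, 10))

# The scores agree up to floating-point error because only the offset matters.
print(abs(score_a - score_b) < 1e-9)
```

Because each pair is only rotated, the vector norm is unchanged, so position is encoded without rescaling the underlying motion or audio features.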

Authors (7)
  1. Zhizhen Zhou (16 papers)
  2. Yejing Huo (3 papers)
  3. Guoheng Huang (12 papers)
  4. An Zeng (55 papers)
  5. Xuhang Chen (34 papers)
  6. Lian Huang (3 papers)
  7. Zinuo Li (13 papers)
Citations (5)
