Flexible Motion In-betweening with Diffusion Models (2405.11126v2)
Abstract: Motion in-betweening, a fundamental task in character animation, consists of generating motion sequences that plausibly interpolate user-provided keyframe constraints. It has long been recognized as a labor-intensive and challenging process. We investigate the potential of diffusion models in generating diverse human motions guided by keyframes. Unlike previous inbetweening methods, we propose a simple unified model capable of generating precise and diverse motions that conform to a flexible range of user-specified spatial constraints, as well as text conditioning. To this end, we propose Conditional Motion Diffusion In-betweening (CondMDI) which allows for arbitrary dense-or-sparse keyframe placement and partial keyframe constraints while generating high-quality motions that are diverse and coherent with the given keyframes. We evaluate the performance of CondMDI on the text-conditioned HumanML3D dataset and demonstrate the versatility and efficacy of diffusion models for keyframe in-betweening. We further explore the use of guidance and imputation-based approaches for inference-time keyframing and compare CondMDI against these methods.
- Adobe Systems Inc. 2021. Mixamo. https://www.mixamo.com Accessed: 2021-12-25.
- Text2action: Generative adversarial synthesis from language to action. In 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 5915–5920.
- Chaitanya Ahuja and Louis-Philippe Morency. 2019. Language2pose: Natural language grounded pose forecasting. In 2019 International Conference on 3D Vision (3DV). IEEE, 719–728.
- Structured prediction helps 3d human motion modelling. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7144–7153.
- Okan Arikan and David A Forsyth. 2002. Interactive motion generation from examples. ACM Transactions on Graphics (TOG) 21, 3 (2002), 483–490.
- Motion synthesis from annotations. In ACM SIGGRAPH 2003 Papers. 402–408.
- Motion-motif graphs. In Proceedings of the 2008 ACM SIGGRAPH/Eurographics symposium on computer animation. 117–126.
- Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents. In 2021 IEEE virtual reality and 3D user interfaces (VR). IEEE, 1–10.
- Michael Büttner and Simon Clavet. 2015. Motion matching-the road to next gen animation. Proc. of Nucl. ai 1, 2015 (2015), 2.
- Jinxiang Chai and Jessica K Hodgins. 2007. Constraint-based motion optimization using a statistical dynamic model. In ACM SIGGRAPH 2007 papers. 8–es.
- Executing your Commands via Motion Diffusion in Latent Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18000–18010.
- Mofusion: A framework for denoising-diffusion-based motion synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9760–9770.
- Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34 (2021), 8780–8794.
- Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516 (2014).
- Single-shot motion completion with transformer. arXiv preprint arXiv:2103.00776 (2021).
- Recurrent network models for human dynamics. In Proceedings of the IEEE international conference on computer vision. 4346–4354.
- Synthesis of compositional animations from textual descriptions. In Proceedings of the IEEE/CVF international conference on computer vision. 1396–1406.
- Learning human motion models for long-term predictions. In 2017 International Conference on 3D Vision (3DV). IEEE, 458–466.
- Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5152–5161.
- Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia. 2021–2029.
- Félix G Harvey and Christopher Pal. 2018. Recurrent transition networks for character locomotion. In SIGGRAPH Asia 2018 Technical Briefs. 1–4.
- Robust motion in-betweening. ACM Transactions on Graphics (TOG) 39, 4 (2020), 60–1.
- Flexible diffusion modeling of long videos. Advances in Neural Information Processing Systems 35 (2022), 27953–27965.
- Nemf: Neural motion fields for kinematic animation. Advances in Neural Information Processing Systems 35 (2022), 4244–4256.
- Moglow: Probabilistic and controllable motion synthesis using normalising flows. ACM Transactions on Graphics (TOG) 39, 6 (2020), 1–14.
- Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851.
- Video Diffusion Models. arXiv:2204.03458 [cs.CV]
- A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35, 4 (2016), 1–11.
- Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 technical briefs. 1–4.
- Example-based control of human motion. In Proceedings of the 2004 ACM SIGGRAPH/Eurographics symposium on Computer animation. 69–77.
- Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991 (2022).
- MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion. arXiv preprint arXiv:2310.14729 (2023).
- GMD: Controllable Human Motion Synthesis via Guided Diffusion Models. arXiv preprint arXiv:2305.12577 (2023).
- Guided Motion Diffusion for Controllable Human Motion Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2151–2162.
- FLAME: Free-form Language-based Motion Synthesis & Editing. arXiv preprint arXiv:2209.00349 (2022).
- Lucas Kovar and Michael Gleicher. 2004. Automated extraction and parameterization of motions in large data sets. ACM Transactions on Graphics (ToG) 23, 3 (2004), 559–568.
- Motion graphs. Vol. 21. 473–482.
- Interactive control of avatars animated with human motion data. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques. 491–500.
- Jehee Lee and Kang Hoon Lee. 2004. Precomputing avatar behavior from human motion data. In Proceedings of the 2004 ACM SIGGRAPH/Eurographics symposium on Computer animation. 79–87.
- Continuous character control with low-dimensional embeddings. ACM Transactions on Graphics (TOG) 31, 4 (2012), 1–10.
- Task-generic hierarchical human motion prior using vaes. In 2021 International Conference on 3D Vision (3DV). IEEE, 771–781.
- Auto-conditioned recurrent networks for extended complex human motion synthesis. arXiv preprint arXiv:1707.05363 (2017).
- InterGen: Diffusion-based Multi-human Motion Generation under Complex Interactions. arXiv preprint arXiv:2304.05684 (2023).
- Character controllers using motion vaes. ACM Transactions on Graphics (TOG) 39, 4 (2020), 40–1.
- Wan-Yen Lo and Matthias Zwicker. 2008. Real-time planning for parameterized human motion. In Proceedings of the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. 29–38.
- Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
- Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11461–11471.
- AMASS: Archive of Motion Capture as Surface Shapes. In International Conference on Computer Vision. 5442–5451.
- AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision. 5442–5451.
- James McCann and Nancy Pollard. 2007. Responsive characters from motion fragments. In ACM SIGGRAPH 2007 papers. 6–es.
- Jianyuan Min and Jinxiang Chai. 2012. Motion graphs++ a compact generative model for semantic motion analysis and synthesis. ACM Transactions on Graphics (TOG) 31, 6 (2012), 1–12.
- Tomohiko Mukai and Shigeru Kuriyama. 2005. Geostatistical motion interpolation. In ACM SIGGRAPH 2005 Papers. 1062–1070.
- Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021).
- Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In International conference on machine learning. PMLR, 8162–8171.
- Motion In-Betweening via Deep ΔΔ\Deltaroman_Δ-Interpolator. IEEE Transactions on Visualization and Computer Graphics (2023).
- Katherine Pullen and Christoph Bregler. 2002. Motion capture assisted animation: Texturing and synthesis. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques. 501–508.
- Motion in-betweening via two-stage transformers. ACM Transactions on Graphics (TOG) 41, 6 (2022), 1–16.
- Single Motion Diffusion. arXiv preprint arXiv:2302.05905 (2023).
- Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1, 2 (2022), 3.
- Verbs and adverbs: Multidimensional motion interpolation. IEEE Computer Graphics and Applications 18, 5 (1998), 32–40.
- Artist-directed inverse-kinematics using radial basis function interpolation. In Computer graphics forum, Vol. 20. Wiley Online Library, 239–250.
- Alla Safonova and Jessica K Hodgins. 2007. Construction and optimal search of interpolated motion graphs. In ACM SIGGRAPH 2007 papers. 106–es.
- Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 conference proceedings. 1–10.
- Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35 (2022), 36479–36494.
- Human motion diffusion as a generative prior. arXiv preprint arXiv:2303.01418 (2023).
- Posture-based and action-based graphs for boxing skill visualization. Computers & Graphics 69 (2017), 104–115.
- Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256–2265.
- Yang Song and Stefano Ermon. 2019. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems 32 (2019).
- Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1–9.
- Human Motion Diffusion Model. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=SJ1kSyO2jwu
- Attention is all you need. Advances in neural information processing systems 30 (2017).
- Velocity-to-velocity human motion forecasting. Pattern Recognition 124 (2022), 108424.
- OmniControl: Control Any Joint at Any Time for Human Motion Generation. arXiv preprint arXiv:2310.08580 (2023).
- Physdiff: Physics-guided human motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16010–16021.
- T2m-gpt: Generating human motion from textual descriptions with discrete representations. arXiv preprint arXiv:2301.06052 (2023).
- Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001 (2022).
- Xinyi Zhang and Michiel van de Panne. 2018. Data-driven autocompletion for keyframe animation. In Proceedings of the 11th ACM SIGGRAPH Conference on Motion, Interaction and Games. 1–11.
- Generative tweening: Long-term inbetweening of 3d human motions. arXiv preprint arXiv:2005.08891 (2020).