Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation (2307.00574v5)
Abstract: We introduce a method to generate temporally coherent human animation from a single image, a video, or random noise. This problem has previously been formulated as auto-regressive generation, i.e., conditioning on past frames to decode future frames. Such unidirectional generation, however, is highly prone to motion drift over time, yielding unrealistic human animation with significant artifacts such as appearance distortion. We claim that bidirectional temporal modeling enforces temporal coherence on a generative network by largely suppressing the motion ambiguity of human appearance. To support this claim, we design a novel human animation framework based on a denoising diffusion model: a neural network learns to generate the image of a person by denoising temporal Gaussian noise, with intermediate results cross-conditioned bidirectionally between consecutive frames. In our experiments, the method achieves realistic temporal coherence and outperforms existing unidirectional approaches.
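The core idea of the abstract — denoising all frames jointly while each frame's intermediate result is cross-conditioned on both its past and future neighbors — can be illustrated with a toy sketch. This is not the paper's implementation: `toy_denoise_step` is a hypothetical stand-in for the learned denoising network, and the averaging update is only an assumption chosen to make the bidirectional conditioning visible.

```python
import numpy as np

def toy_denoise_step(x_t, cond_prev, cond_next, t, num_steps):
    """Hypothetical single reverse-diffusion step for one frame.

    In place of a learned network, this toy update nudges the noisy
    frame toward the mean of its two temporal conditions, mimicking
    bidirectional cross-conditioning between consecutive frames.
    """
    alpha = 1.0 - t / num_steps          # crude proxy for the noise schedule
    target = 0.5 * (cond_prev + cond_next)
    return x_t + alpha * (target - x_t)

def bidirectional_denoise(frames_noise, num_steps=10):
    """Denoise all frames jointly: at every diffusion step, each frame
    is conditioned on its neighbors' current intermediate results,
    looking both backward and forward in time."""
    x = [f.copy() for f in frames_noise]
    n = len(x)
    for t in reversed(range(num_steps)):
        x_new = []
        for i in range(n):
            prev_c = x[i - 1] if i > 0 else x[i]      # boundary: self-condition
            next_c = x[i + 1] if i < n - 1 else x[i]
            x_new.append(toy_denoise_step(x[i], prev_c, next_c, t, num_steps))
        x = x_new                         # all frames advance together
    return x
```

Because every frame sees both neighbors at every step, differences between adjacent frames shrink as denoising proceeds, which is the mechanism the abstract credits for suppressing motion drift; a unidirectional variant would only ever pass `cond_prev`.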