LatentMan: Generating Consistent Animated Characters using Image Diffusion Models (2312.07133v2)

Published 12 Dec 2023 in cs.CV and cs.LG

Abstract: We propose a zero-shot approach for generating consistent videos of animated characters based on Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions. At the same time, their zero-shot alternatives fail to produce temporally consistent videos with continuous motion. We strive to bridge this gap, and we introduce LatentMan, which leverages existing text-based motion diffusion models to generate diverse continuous motions to guide the T2I model. To boost the temporal consistency, we introduce the Spatial Latent Alignment module that exploits cross-frame dense correspondences that we compute to align the latents of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies between frames. Our proposed approach outperforms existing zero-shot T2V approaches in generating videos of animated characters in terms of pixel-wise consistency and user preference. Project page https://abdo-eldesokey.github.io/latentman/.

Zero-Shot Synthesis of Animated Characters

Introduction to Zero-Shot Video Synthesis

Generating video content of animated characters from textual descriptions has substantial value across multiple industries, including entertainment and virtual reality. Classical Text-to-Video (T2V) approaches rely on extensive training on large video datasets, which is costly and computationally demanding. To overcome these hurdles, the authors introduce a zero-shot approach that creates animated character videos without dedicated training. The method builds on pre-trained Text-to-Image (T2I) diffusion models, commonly used for still image generation, to produce temporally consistent video sequences.

Methodology for Consistency

Temporal consistency is a central challenge for zero-shot T2V generation. The proposed method addresses it by leveraging existing text-based motion diffusion models: a sequence of guidance signals derived from the textual input directs the T2I model during frame generation. Key to maintaining consistency is the Spatial Latent Alignment module, which uses cross-frame dense correspondences to align the latent codes representing corresponding elements across video frames. The Pixel-Wise Guidance strategy refines this alignment, steering the diffusion process to minimize visual discrepancies between frames. The authors also introduce a new metric, Human Mean Squared Error, for measuring temporal consistency.
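To make the Spatial Latent Alignment idea more concrete, the sketch below shows one plausible way to warp a previous frame's latent into the current frame using a dense correspondence field and blend the result. Tensor shapes, the normalized sampling-grid convention, and the blend weight are illustrative assumptions; this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def warp_latent(src_latent, grid):
    """Warp a source frame's latent with a dense correspondence field.

    src_latent: (1, C, H, W) latent of the source (previous) frame
    grid:       (1, H, W, 2) sampling grid in normalized [-1, 1] coordinates,
                mapping each target-frame location to its source-frame match
    """
    return F.grid_sample(src_latent, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def spatial_latent_alignment(latents, grids, blend=0.5):
    """Blend each frame's latent with its predecessor warped into its frame.

    latents: list of (1, C, H, W) per-frame latents
    grids:   list of (1, H, W, 2) correspondence grids; grids[t] maps
             frame t back to frame t - 1
    blend:   illustrative mixing weight between warped and original latents
    """
    aligned = [latents[0]]
    for t in range(1, len(latents)):
        warped_prev = warp_latent(aligned[-1], grids[t])
        aligned.append(blend * warped_prev + (1.0 - blend) * latents[t])
    return aligned
```

The blend weight trades off temporal smoothness against per-frame detail; the paper's actual mixing rule may differ from this simple linear blend.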

Technical Framework

The texture and form of an animated character must remain coherent throughout a video sequence for the content to be perceived as lifelike and high-quality. To ensure this, the authors render human poses into depth maps, which serve as guidance for a pre-trained diffusion model. Dense correspondences between frames are computed to align the latents from which the video frames are generated. This alignment guards against the temporal inconsistencies observed in prior zero-shot approaches.
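For illustration, the sketch below wires up a depth-conditioned image pipeline using the off-the-shelf diffusers ControlNet depth model as a stand-in for the pre-trained depth-guided diffusion model. The model identifiers, prompt, and placeholder depth maps are assumptions; in the pipeline described above, the depth maps would instead be rendered from the motion sequence produced by the text-based motion model.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Placeholder depth maps; in the described pipeline these would be rendered
# from the posed character meshes produced by the motion diffusion model.
depth_maps = [Image.fromarray(np.zeros((512, 512), dtype=np.uint8)).convert("RGB")
              for _ in range(4)]

# Off-the-shelf depth ControlNet as a stand-in for the depth-guided T2I model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Generate one frame per depth map with a shared prompt and a fixed seed so
# that appearance stays as stable as the conditioning alone allows.
frames = []
for depth in depth_maps:
    gen = torch.Generator("cuda").manual_seed(0)  # same noise for every frame
    frames.append(pipe("an astronaut walking, full body", image=depth,
                       num_inference_steps=30, generator=gen).images[0])
```

Fixing the noise seed keeps appearance roughly stable, but without the latent alignment and pixel-wise guidance described above the frames would still flicker; the sketch only covers the depth-conditioning step.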

Results and Contributions

The approach shows a clear advantage over existing zero-shot methods, improving pixel-wise consistency and garnering stronger user preference in the authors' studies. It represents a significant step toward rendering diverse animated characters performing complex movements in dynamic environments. Its key contributions are the combined Spatial Latent Alignment and Pixel-Wise Guidance modules and the new Human Mean Squared Error metric, under which the approach demonstrates a 10% improvement in temporal consistency.
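The exact definition of Human Mean Squared Error is given in the paper; a plausible reading, sketched below purely for illustration, is a mean squared error between consecutive frames restricted to the human (foreground) region, using externally supplied masks. The array shapes and mask source are assumptions.

```python
import numpy as np

def human_mse(frames, masks):
    """Mean squared error between consecutive frames over the human region.

    frames: (T, H, W, 3) float array with values in [0, 1]
    masks:  (T, H, W) boolean foreground (human) masks

    This is one plausible reading of the Human MSE metric, not the paper's
    exact formulation.
    """
    errors = []
    for t in range(1, len(frames)):
        region = masks[t] & masks[t - 1]      # pixels covered in both frames
        if region.any():
            diff = frames[t][region] - frames[t - 1][region]
            errors.append(np.mean(diff ** 2))
    return float(np.mean(errors)) if errors else 0.0
```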

Limitations and Future Directions

The authors acknowledge limitations related to the reliance on depth conditioning and the difficulty of computing perfect cross-frame correspondences; these imperfections can result in texture inconsistencies in the generated videos. Even so, the Spatial Latent Alignment component alone yields notable improvements. Looking ahead, refining the cross-frame correspondences could enable more precise alignment, improving the realism and fidelity of the generated characters, and integrating background dynamics could further enhance the overall realism of the videos.

Authors (2)
  1. Abdelrahman Eldesokey (15 papers)
  2. Peter Wonka (130 papers)
Citations (3)