
ID-Animator: Zero-Shot Identity-Preserving Human Video Generation (2404.15275v3)

Published 23 Apr 2024 in cs.CV

Abstract: Generating high-fidelity human video with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to strike a balance between training efficiency and identity preservation, either requiring tedious case-by-case fine-tuning or usually missing identity details in the video generation process. In this study, we present ID-Animator, a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training. ID-Animator inherits existing diffusion-based video generation backbones with a face adapter to encode the ID-relevant embeddings from learnable facial latent queries. To facilitate the extraction of identity information in video generation, we introduce an ID-oriented dataset construction pipeline that incorporates unified human attributes and action captioning techniques from a constructed facial image pool. Based on this pipeline, a random reference training strategy is further devised to precisely capture the ID-relevant embeddings with an ID-preserving loss, thus improving the fidelity and generalization capacity of our model for ID-specific video generation. Extensive experiments demonstrate the superiority of ID-Animator to generate personalized human videos over previous models. Moreover, our method is highly compatible with popular pre-trained T2V models like animatediff and various community backbone models, showing high extendability in real-world applications for video generation where identity preservation is highly desired. Our codes and checkpoints are released at https://github.com/ID-Animator/ID-Animator.

Zero-Shot Identity-Preserving Human Video Generation with ID-Animator

The paper presents ID-Animator, a zero-shot approach capable of generating personalized human videos while preserving the identity of the input facial image without additional model tuning. This work addresses key challenges in identity-specific video generation, particularly the balancing act between training efficiency and identity fidelity, by leveraging a diffusion-based video generation architecture augmented with a face adapter module.

Methodology

ID-Animator integrates a pre-trained text-to-video diffusion backbone with a lightweight face adapter that encodes identity-relevant embeddings from the reference facial image through learnable facial latent queries. The paper also introduces an ID-oriented dataset construction pipeline that facilitates identity extraction through decoupled human-attribute and action captioning. This is further enhanced by a random reference training strategy, which improves model fidelity and generalization by isolating identity-related features from extraneous details in reference images.
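The paper describes the adapter at a high level rather than in code, but a minimal sketch in the spirit of IP-Adapter-style identity encoders can make the mechanism concrete: a fixed set of learnable latent queries cross-attends to features of the single reference face and is compressed into a short sequence of ID tokens that the video backbone can later attend to. The dimensions, feature extractor, and module layout below are assumptions for illustration, not the released configuration.

```python
import torch
import torch.nn as nn

class FaceAdapterEncoder(nn.Module):
    """Sketch of an identity encoder: learnable latent queries -> compact ID tokens."""
    def __init__(self, face_feat_dim=1024, token_dim=768, num_queries=16, num_heads=8):
        super().__init__()
        # Learnable facial latent queries, refined by attending to the face features.
        self.latent_queries = nn.Parameter(torch.randn(1, num_queries, token_dim) * 0.02)
        self.proj_in = nn.Linear(face_feat_dim, token_dim)
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(token_dim),
            nn.Linear(token_dim, 4 * token_dim),
            nn.GELU(),
            nn.Linear(4 * token_dim, token_dim),
        )
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, face_features):
        # face_features: (batch, seq, face_feat_dim), e.g. patch features of the reference face.
        ctx = self.proj_in(face_features)
        queries = self.latent_queries.expand(face_features.size(0), -1, -1)
        attn_out, _ = self.cross_attn(queries, ctx, ctx)
        tokens = self.norm(queries + attn_out)
        tokens = tokens + self.ffn(tokens)
        return tokens  # (batch, num_queries, token_dim) ID tokens for the backbone's cross-attention

# Example: 257 hypothetical image-encoder patch tokens from one face -> 16 compact ID tokens.
encoder = FaceAdapterEncoder()
id_tokens = encoder(torch.randn(2, 257, 1024))
print(id_tokens.shape)  # torch.Size([2, 16, 768])
```

In such a design only the adapter parameters need to be trained, which is what keeps the approach lightweight relative to per-identity fine-tuning.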

Dataset Construction

A significant contribution of this paper is the ID-oriented dataset construction pipeline, built on publicly available datasets. The authors implement a decoupled captioning strategy, describing human attributes and actions separately before merging them into comprehensive textual descriptions. This is coupled with a facial image pool that supplies more precise facial embeddings. The pipeline addresses the scarcity of suitable high-quality training data for identity-preserving video generation.
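A minimal sketch of how one training sample might be assembled under such a pipeline is shown below; the caption inputs, file paths, and field names are illustrative assumptions rather than the paper's actual data format. It captures two ideas: attribute and action captions produced separately and then merged, and the reference face drawn at random from that identity's facial image pool, which underlies the random reference training strategy.

```python
import random
from dataclasses import dataclass

@dataclass
class TrainingSample:
    video_path: str
    caption: str
    reference_face_path: str

def build_sample(video_path: str,
                 attribute_caption: str,   # e.g. output of a facial-attribute captioner (assumed)
                 action_caption: str,      # e.g. output of a video action captioner (assumed)
                 face_pool: list[str]) -> TrainingSample:
    # Decoupled captioning: merge independently produced attribute and action descriptions.
    unified_caption = f"{attribute_caption.rstrip('.')}. {action_caption}"
    # Random reference: the face comes from the identity's image pool rather than the clip itself,
    # which helps isolate identity features from clip-specific details.
    reference_face = random.choice(face_pool)
    return TrainingSample(video_path, unified_caption, reference_face)

sample = build_sample(
    "clips/person_001/clip_03.mp4",
    "a middle-aged man with short gray hair and glasses",
    "he is speaking to the camera in a brightly lit office",
    face_pool=["faces/person_001/a.jpg", "faces/person_001/b.jpg"],
)
print(sample.caption)
```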

Experimental Results

Extensive experiments demonstrate that ID-Animator generates higher-fidelity, identity-preserving videos than existing methods. Its compatibility with pre-trained text-to-video models such as AnimateDiff, along with its adaptability to community backbone models, underscores its practical applicability. The framework's extensibility in real-world video generation scenarios is particularly notable, allowing significant flexibility in integrating with other models to achieve the desired generative outcomes.
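One plausible reason for this portability, sketched below under the assumption of an IP-Adapter-style decoupled cross-attention design (not the released implementation), is that the backbone's own text cross-attention is left untouched while a small parallel branch attends to the ID tokens; reusing the adapter with another backbone then only requires the attention dimensions to match. The `id_scale` knob and module layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Frozen text cross-attention plus an additive, rescalable ID branch."""
    def __init__(self, dim=320, text_dim=768, id_dim=768, num_heads=8, id_scale=1.0):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads, kdim=text_dim, vdim=text_dim, batch_first=True)
        self.id_attn = nn.MultiheadAttention(dim, num_heads, kdim=id_dim, vdim=id_dim, batch_first=True)
        for p in self.text_attn.parameters():   # stands in for the frozen backbone branch
            p.requires_grad_(False)
        self.id_scale = id_scale

    def forward(self, hidden, text_tokens, id_tokens):
        text_out, _ = self.text_attn(hidden, text_tokens, text_tokens)
        id_out, _ = self.id_attn(hidden, id_tokens, id_tokens)
        # The ID branch is purely additive, so id_scale=0 recovers the unmodified backbone.
        return hidden + text_out + self.id_scale * id_out

layer = DecoupledCrossAttention()
out = layer(torch.randn(1, 4096, 320),   # spatial tokens of one latent frame
            torch.randn(1, 77, 768),     # text tokens
            torch.randn(1, 16, 768))     # ID tokens from the face adapter
print(out.shape)  # torch.Size([1, 4096, 320])
```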

Implications and Future Directions

This research holds substantial implications for fields such as film production, where identity fidelity in character portrayal is crucial. By enabling efficient and faithful identity-specific video generation without per-character tuning, ID-Animator paves the way for streamlined content creation pipelines. The paper also points towards future exploration in enhancing the robustness of zero-shot models, potentially incorporating more sophisticated facial recognition and attribute extraction techniques to broaden applicability across diverse identity-specific tasks.

Conclusion

ID-Animator represents a significant advancement in zero-shot human video generation, combining efficiency with fidelity in maintaining character identity. This work lays a foundation for future AI-driven innovations in personalized content generation, offering a practical solution to long-standing challenges in identity preservation during video synthesis. The public release of code and checkpoints should help foster further research and development in this domain.

References (39)
  1. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
  2. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023a.
  3. Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481, 2023b.
  4. Learning temporal coherence via self-supervision for gan-based video generation. ACM Transactions on Graphics (TOG), 39(4):75–1, 2020.
  5. Civitai. Civitai. https://civitai.com/. Accessed: April 21, 2024.
  6. Animateanything: Fine-grained open domain image animation with motion guidance. arXiv e-prints, pages arXiv–2311, 2023.
  7. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  8. Long video generation with time-agnostic vqgan and time-sensitive transformer. In European Conference on Computer Vision, pages 102–118. Springer, 2022.
  9. Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023a.
  10. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023b.
  11. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  12. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
  13. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022b.
  14. Depth-aware generative adversarial network for talking head video generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3397–3406, 2022.
  15. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  16. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117, 2023.
  17. Make it move: controllable image-to-video generation with text descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18219–18228, 2022.
  18. Videobooth: Diffusion-based video generation with image prompts. arXiv preprint arXiv:2312.00777, 2023a.
  19. Text2performer: Text-driven human video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22747–22757, 2023b.
  20. Photomaker: Customizing realistic human photos via stacked id embedding. arXiv preprint arXiv:2312.04461, 2023.
  21. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
  22. Stylecrafter: Enhancing stylized text-to-video generation with style adapter. arXiv preprint arXiv:2312.00330, 2023.
  23. Magic-me: Identity-specific video customized diffusion. arXiv preprint arXiv:2402.09368, 2024.
  24. Strait: Non-autoregressive generation with stratified image transformer. arXiv preprint arXiv:2303.00750, 2023.
  25. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  26. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
  27. Face0: Instantaneously conditioning a text-to-image model on a face. In SIGGRAPH Asia 2023 Conference Papers, pages 1–10, 2023.
  28. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
  29. Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024.
  30. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
  31. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.
  32. Do you guys want to dance: Zero-shot compositional human dance generation with multiple persons. arXiv preprint arXiv:2401.13363, 2024.
  33. Facestudio: Put your face everywhere in seconds. arXiv preprint arXiv:2312.02663, 2023.
  34. H. Ye. IP-Adapter Plus Face. https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter-plus-face_sd15.bin, 2024a. Accessed on: 2024-04-19.
  35. H. Ye. IP-Adapter FaceID Portrait V11 SD15. https://huggingface.co/h94/IP-Adapter-FaceID/blob/main/ip-adapter-faceid-portrait-v11_sd15.bin, 2024b. Accessed on: 2024-04-19.
  36. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
  37. Celebv-text: A large-scale facial text-video dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14805–14814, 2023.
  38. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  39. General facial representation learning in a visual-linguistic manner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18697–18709, 2022.
Authors (8)
  1. Xuanhua He (11 papers)
  2. Quande Liu (24 papers)
  3. Shengju Qian (16 papers)
  4. Xin Wang (1307 papers)
  5. Tao Hu (146 papers)
  6. Ke Cao (12 papers)
  7. Keyu Yan (12 papers)
  8. Jie Zhang (847 papers)
Citations (16)