InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity (2503.16418v1)

Published 20 Mar 2025 in cs.CV and cs.LG

Abstract: Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks leveraging DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, ameliorates image quality, and alleviates face copy-pasting. Extensive experiments demonstrate that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.

Summary

InfiniteYou: Identity-Preserving Image Generation Using Diffusion Transformers

The paper "InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity" presents a framework for identity-preserving image generation built on Diffusion Transformers (DiTs). The framework, InfiniteYou (InfU), targets three weaknesses of existing methods: low identity similarity, weak text-image alignment, and reduced generation quality. It builds on DiT models such as FLUX so that generated images keep the subject's identity without sacrificing the base model's synthesis quality.

Core Contributions

InfuseNet Architecture: Central to the InfU framework is the InfuseNet component, which introduces identity features into the DiT base model using residual connections. This architecture preserves identity similarity more effectively than conventional methods that typically alter attention layers. InfuseNet functions as a generalization of ControlNet, specifically designed to separate text and identity inputs to avoid entanglement.
Multi-Stage Training Strategy: The authors train the model in multiple stages, with pretraining followed by supervised fine-tuning (SFT). The SFT stage uses synthetic single-person-multiple-sample (SPMS) data to improve text-image alignment and overall generation quality, and to alleviate face copy-pasting.
Compatibility and Plug-and-Play Design: InfU features a plug-and-play design, ensuring compatibility with various existing methods. This design contributes to its adaptability and usefulness across diverse scenarios, promoting integration with existing image generation models and plugins.
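
The residual injection described for InfuseNet can be illustrated with a toy sketch. It is not the authors' implementation: the function names (`run_with_residuals`, `scale`), the plain Python lists standing in for hidden states, and the one-residual-per-base-block pairing are all assumptions made for illustration. What it shows is that identity information is added to each block's output rather than changing how attention is computed, so all-zero residuals leave the base model's output unchanged.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x + y for x, y in zip(a, b)]


def scale(factor: float) -> Block:
    """Hypothetical stand-in for a base transformer block: scales its input."""
    return lambda h: [factor * x for x in h]


def run_with_residuals(
    hidden: Vector,
    base_blocks: Sequence[Block],
    residuals: Sequence[Vector],
) -> Vector:
    """Run the base blocks in order, adding one identity residual after each.

    `residuals` stands in for the outputs of a side network that has seen
    the identity image (assumption: one residual per base block).
    """
    if len(base_blocks) != len(residuals):
        raise ValueError("need exactly one residual per base block")
    for block, residual in zip(base_blocks, residuals):
        hidden = add(block(hidden), residual)
    return hidden


# Toy usage: two base blocks and a 2-dimensional hidden state.
base_blocks = [scale(2.0), scale(0.5)]
hidden = [1.0, -2.0]

# Zero residuals reproduce the plain base model output.
base_only = run_with_residuals(hidden, base_blocks, [[0.0, 0.0], [0.0, 0.0]])

# Non-zero residuals shift the result without touching the blocks themselves.
with_identity = run_with_residuals(hidden, base_blocks, [[1.0, 1.0], [0.0, 0.0]])
```

Because the base blocks are untouched in this sketch, the same additive pattern is one plausible reason a residual design can coexist with other add-ons to the base model.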

Empirical Findings and Implications

The paper reports extensive experiments in which InfU outperforms existing baselines such as PuLID-FLUX and FLUX.1-dev IP-Adapters. The evaluation metrics are ID Loss, CLIPScore, and PickScore, and InfU shows notable improvements in identity similarity, text-image alignment, and image aesthetics.
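
As a rough illustration of an identity metric such as ID Loss, the sketch below assumes it is computed as one minus the cosine similarity between face-embedding vectors of the generated and reference images. This definition and the helper names (`cosine_similarity`, `id_loss`) are assumptions here, since the summary does not define the metric, and in practice the embeddings would come from a face recognition model that is not shown.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


def id_loss(generated_emb: Sequence[float], reference_emb: Sequence[float]) -> float:
    """Assumed definition: 1 - cosine similarity, so lower means closer identity."""
    return 1.0 - cosine_similarity(generated_emb, reference_emb)
```

Under this assumed definition, embeddings pointing in the same direction give a loss near 0, orthogonal embeddings give 1, and opposite embeddings give 2.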

The plug-and-play design adds to InfU's practical utility. It works with FLUX variants and is compatible with ControlNets, LoRAs, and IP-Adapters, which makes it versatile in real-world applications. This adaptability also benefits the community by supporting broader use and further work in the field.

Future Directions

While the results demonstrate InfU’s robust performance, future work may aim to further improve its scalability and efficiency. Applying it beyond traditional portrait generation, for example to avatars and virtual environments, could also expand its scope. The framework's design principles may inform future models that combine identity preservation with newer generative techniques.

Conclusion

InfiniteYou represents a significant step forward in identity-preserved image generation, using Diffusion Transformers to overcome limitations of earlier methods. The introduction of InfuseNet and a multi-stage training regimen underscores the potential of DiTs for personalized content creation. As applications broaden and the technology evolves, frameworks like InfU pave the way for more capable identity-sensitive image generation.