FaceStudio: Put Your Face Everywhere in Seconds (2312.02663v2)
Abstract: This study investigates identity-preserving image synthesis, an intriguing task in image generation that seeks to maintain a subject's identity while adding a personalized, stylistic touch. Traditional methods, such as Textual Inversion and DreamBooth, have made strides in custom image creation, but they come with significant drawbacks. These include the need for extensive resources and time for fine-tuning, as well as the requirement for multiple reference images. To overcome these challenges, our research introduces a novel approach to identity-preserving synthesis, with a particular focus on human images. Our model leverages a direct feed-forward mechanism, circumventing the need for intensive fine-tuning and thereby enabling quick and efficient image generation. Central to our innovation is a hybrid guidance framework, which combines stylized images, facial images, and textual prompts to guide the image generation process. This unique combination enables our model to support a variety of applications, such as artistic portraits and identity-blended images. Our experimental results, including both qualitative and quantitative evaluations, demonstrate that our method outperforms baseline models and previous works, particularly in its efficiency and its ability to preserve the subject's identity with high fidelity.
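The hybrid guidance described in the abstract amounts to conditioning a diffusion model on identity and style features alongside the text prompt in a single feed-forward pass. The sketch below shows one plausible way to wire such conditioning in PyTorch, assuming CLIP-style text tokens, an ArcFace-style face embedding, and an image-encoder embedding of the stylized reference; the module name, dimensions, and token counts are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of hybrid-guidance conditioning (assumed design, not the
# paper's exact architecture). Identity and style embeddings are projected
# into the text-token space and concatenated with the prompt tokens, so a
# diffusion U-Net can attend to all three signals through its existing
# cross-attention layers.
import torch
import torch.nn as nn

class HybridGuidanceEncoder(nn.Module):
    def __init__(self, face_dim=512, style_dim=1024, text_dim=768, num_tokens=4):
        super().__init__()
        # Map each global embedding to a few pseudo text tokens.
        self.face_proj = nn.Linear(face_dim, text_dim * num_tokens)
        self.style_proj = nn.Linear(style_dim, text_dim * num_tokens)
        self.num_tokens = num_tokens
        self.text_dim = text_dim

    def forward(self, text_tokens, face_emb, style_emb):
        # text_tokens: (B, L, text_dim), from a frozen text encoder (e.g. CLIP)
        # face_emb:    (B, face_dim),    e.g. an ArcFace identity embedding
        # style_emb:   (B, style_dim),   e.g. a CLIP image embedding of the style reference
        b = text_tokens.shape[0]
        face_tokens = self.face_proj(face_emb).view(b, self.num_tokens, self.text_dim)
        style_tokens = self.style_proj(style_emb).view(b, self.num_tokens, self.text_dim)
        # The concatenated sequence serves as the cross-attention context:
        #   noise_pred = unet(latents, t, encoder_hidden_states=context)
        return torch.cat([text_tokens, style_tokens, face_tokens], dim=1)
```

Dropping either the face or the style tokens (or replacing them with learned null tokens) would recover plain text-guided generation, the usual mechanism for applying classifier-free guidance over the added conditions.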
- Implementation of EulerAncestralDiscreteScheduler based on k-diffusion.
- Implementation of Hypernetwork.
- A neural space-time representation for text-to-image personalization. arXiv preprint arXiv:2305.15391, 2023.
- Break-a-scene: Extracting multiple concepts from a single image. arXiv preprint arXiv:2305.16311, 2023.
- Universal guidance for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 843–852, 2023.
- Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. arXiv preprint arXiv:2304.08465, 2023.
- Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation. arXiv preprint arXiv:2305.03374, 2023a.
- Photoverse: Tuning-free image customization with text-to-image diffusion models. arXiv preprint arXiv:2309.05793, 2023b.
- Compositional prototype network with multi-view comparison for few-shot point cloud semantic segmentation. arXiv preprint arXiv:2012.14255, 2020.
- Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481, 2023c.
- Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
- Dreamartist: Towards controllable one-shot text-to-image generation via contrastive prompt-tuning. arXiv preprint arXiv:2211.11337, 2022.
- An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
- Encoder-based domain tuning for fast personalization of text-to-image models. arXiv preprint arXiv:2302.12228, 2023.
- Photoswap: Personalized subject swapping in images. arXiv preprint arXiv:2305.18286, 2023.
- Improving tuning-free real image editing with proximal guidance. CoRR, 2023.
- Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
- Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 2020.
- Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023.
- Magicapture: High-resolution multi-concept portrait customization. arXiv preprint arXiv:2309.06895, 2023.
- Zero-shot generation of coherent storybook from plain text story using diffusion models. arXiv preprint arXiv:2302.03900, 2023.
- Taming encoder for zero fine-tuning image customization with text-to-image diffusion models. arXiv preprint arXiv:2304.02642, 2023.
- A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
- Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
- Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
- Tackling background ambiguities in multi-class few-shot point cloud semantic segmentation. Knowledge-Based Systems, 253:109508, 2022.
- Faceshifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457, 2019.
- Crnet: Cross-reference networks for few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- Few-shot segmentation with optimal transport matching and message flow. IEEE Transactions on Multimedia (TMM), 2021.
- Crcnet: Few-shot segmentation with cross-reference and region-global conditional networks. International Journal of Computer Vision (IJCV), 2022.
- Facechain: A playground for identity-preserving portrait generation. arXiv preprint arXiv:2308.14256, 2023a.
- Cones: Concept neurons in diffusion models for customized generation. arXiv preprint arXiv:2303.05125, 2023b.
- Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. arXiv preprint arXiv:2307.11410, 2023.
- Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
- Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
- Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
- Deepfacelab: Integrated, flexible and extensible face-swapping framework. arXiv preprint arXiv:2005.05535, 2020.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023a.
- Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models. arXiv preprint arXiv:2307.06949, 2023b.
- Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 2022.
- Instantbooth: Personalized text-to-image generation without test-time finetuning. arXiv preprint arXiv:2304.03411, 2023.
- Key-locked rank one editing for text-to-image personalization. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
- Unitune: Text-driven image editing by fine tuning an image generation model on a single image. arXiv preprint arXiv:2210.09477, 2022.
- Face0: Instantaneously conditioning a text-to-image model on a face. arXiv preprint arXiv:2306.06638, 2023.
- Hififace: 3d shape and semantic prior guided high fidelity face swapping. arXiv preprint arXiv:2106.09965, 2021.
- Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. arXiv preprint arXiv:2302.13848, 2023.
- Easyphoto: Your smart AI photo generator. arXiv preprint arXiv:2310.04672, 2023a.
- Singleinsert: Inserting new concepts from a single image into text-to-image models for flexible editing. arXiv preprint arXiv:2310.08094, 2023b.
- Fastcomposer: Tuning-free multi-subject image generation with localized attention. arXiv preprint arXiv:2305.10431, 2023.
- Prompt-free diffusion: Taking "text" out of text-to-image diffusion models. arXiv preprint arXiv:2305.16223, 2023.
- Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
- Efficient few-shot object detection via knowledge inheritance. IEEE Transactions on Image Processing (TIP), 2022.
- Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
- Inserting anybody in diffusion models via celeb basis. arXiv preprint arXiv:2306.00926, 2023.
- Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019a.
- Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019b.
- Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020 (oral).
- Meta navigator: Search for a good adaptation policy for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021a.
- Few-shot incremental learning with continually evolved classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021b.
- Deepemd: Differentiable earth mover’s distance for few-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022.
- Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- Enhancing detail preservation for customized text-to-image generation: A regularization-free approach. arXiv preprint arXiv:2305.13579, 2023.
- One shot face swapping on megapixels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4834–4844, 2021.
Authors: Yuxuan Yan, Chi Zhang, Rui Wang, Yichao Zhou, Gege Zhang, Pei Cheng, Gang Yu, Bin Fu