Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics (2410.18537v1)
Abstract: Traditionally, style has been considered primarily in terms of artistic elements such as colors, brushstrokes, and lighting. However, identical semantic subjects, such as people, boats, and houses, can vary significantly across artistic traditions, indicating that style also encompasses the underlying semantics. Therefore, in this study, we propose a zero-shot scheme for image variation with coordinated semantics. Specifically, our scheme transforms the image-to-image problem into an image-to-text-to-image problem. The image-to-text operation employs a vision-language model (e.g., BLIP) to generate text describing the content of the input image, including the objects and their positions. Subsequently, the input style keyword is elaborated into a detailed description of that style and merged with the content text using the reasoning capabilities of ChatGPT. Finally, the text-to-image operation uses a diffusion model to generate images from the resulting text prompt. To enable the diffusion model to accommodate more styles, we propose a fine-tuning strategy that injects text and style constraints into cross-attention. This ensures that the output image exhibits similar semantics in the desired style. To validate the performance of the proposed scheme, we constructed a benchmark comprising images of various styles and scenes and introduced two novel metrics. Despite its simplicity, our scheme yields highly plausible results in a zero-shot manner, particularly for generating stylized images with high-fidelity semantics.
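The three-stage pipeline described in the abstract (caption the image, elaborate and merge the style, then render) can be sketched as a composition of interchangeable stages. The stage functions below are placeholders, not the authors' actual models; in practice, `captioner` would wrap a vision-language model such as BLIP, `elaborator` an LLM such as ChatGPT, and `generator` a text-to-image diffusion model.

```python
def stylized_variation(image, style_keyword, captioner, elaborator, generator):
    """Zero-shot image variation via an image-to-text-to-image pipeline.

    captioner:  image -> content description (objects and their positions)
    elaborator: (content_text, style_keyword) -> merged, style-elaborated prompt
    generator:  prompt -> stylized image
    """
    content_text = captioner(image)                   # image-to-text: describe the content
    prompt = elaborator(content_text, style_keyword)  # expand the style keyword, merge with content
    return generator(prompt)                          # text-to-image: render in the desired style


# Stub stages illustrate the data flow without loading any real model.
caption = lambda img: "a boat beside a house on a river"
merge = lambda content, style: f"{style} painting of {content}"
render = lambda prompt: f"<image: {prompt}>"

result = stylized_variation(None, "ukiyo-e", caption, merge, render)
print(result)  # <image: ukiyo-e painting of a boat beside a house on a river>
```

Keeping the stages as plain callables mirrors the paper's framing: each stage can be swapped independently (a different captioner or diffusion backbone) without changing the overall scheme.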