ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet (2312.03154v2)
Abstract: This paper introduces ViscoNet, a novel one-branch-adapter architecture for concurrent spatial and visual conditioning. Our lightweight model requires a trainable parameter count and a dataset size that are multiple orders of magnitude smaller than those of the current state-of-the-art IP-Adapter. Nevertheless, our method successfully preserves the generative power of the frozen text-to-image (T2I) backbone. Notably, it excels in addressing mode collapse, a pervasive issue previously overlooked. Our novel architecture demonstrates outstanding capabilities in achieving a harmonious visual-text balance, unlocking unparalleled versatility in various human image generation tasks, including pose re-targeting, virtual try-on, stylization, person re-identification, and textile transfer. Demo and code are available from the project page: https://soon-yau.github.io/visconet/