Text-image Alignment for Diffusion-based Perception (2310.00031v3)
Abstract: Diffusion models are generative models with impressive text-to-image synthesis capabilities and have spurred a new wave of creative methods for classical machine learning tasks. However, the best way to harness the perceptual knowledge of these generative models for visual tasks is still an open question. Specifically, it is unclear how to use the prompting interface when applying diffusion backbones to vision tasks. We find that automatically generated captions can improve text-image alignment and significantly enhance a model's cross-attention maps, leading to better perceptual performance. Our approach improves upon the current state-of-the-art (SOTA) in diffusion-based semantic segmentation on ADE20K and the current overall SOTA for depth estimation on NYUv2. Furthermore, our method generalizes to the cross-domain setting. We use model personalization and caption modifications to align our model to the target domain and find improvements over unaligned baselines. Our cross-domain object detection model, trained on Pascal VOC, achieves SOTA results on Watercolor2K. Our cross-domain segmentation method, trained on Cityscapes, achieves SOTA results on Dark Zurich-val and Nighttime Driving. Project page: https://www.vision.caltech.edu/tadp/. Code: https://github.com/damaggu/TADP.
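The abstract mentions caption modifications as one way to align the model to a target domain. As a rough illustration only, the sketch below appends a target-domain phrase to an automatically generated caption before it would be passed to the text encoder. The function name `align_caption`, the comma-joined template, and the example phrase "at night" are assumptions made for illustration; the abstract does not specify the template the authors actually use.

```python
def align_caption(caption: str, domain_phrase: str) -> str:
    """Append a target-domain phrase to a caption.

    Hypothetical helper: the comma-joined template is an assumption,
    not the template used in the paper.
    """
    # Normalize the caption: trim whitespace and drop a trailing period.
    base = caption.strip().rstrip(".")
    phrase = domain_phrase.strip()
    # With no domain phrase, return the normalized caption unchanged.
    if not phrase:
        return base
    return f"{base}, {phrase}"


# Example: source-domain (daytime) captions modified for a nighttime target.
captions = ["A street with parked cars.", "a bus at an intersection"]
night_captions = [align_caption(c, "at night") for c in captions]
```

In a full pipeline the captions would presumably come from a captioning model and the modified text would condition the diffusion backbone; both steps are left out here to keep the sketch self-contained.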
Authors: Neehar Kondapaneni, Markus Marks, Manuel Knott, Pietro Perona, Rogerio Guimaraes