Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation (2306.16658v1)
Abstract: Traditional domain adaptation assumes the same vocabulary across source and target domains, which limits transfer flexibility and efficiency when handling target domains with different vocabularies. Inspired by recent vision-language models (VLMs) that enable open-vocabulary visual recognition by reasoning on both images and texts, we study open-vocabulary domain adaptation (OVDA), a new unsupervised domain adaptation framework that positions a pre-trained VLM as the source model and transfers it towards arbitrary unlabelled target domains. To this end, we design a Prompt Ensemble Self-training (PEST) technique that exploits the synergy between vision and language to mitigate the domain discrepancies in image and text distributions simultaneously. Specifically, PEST makes use of the complementary property of multiple prompts within and across vision and language modalities, which enables joint exploitation of vision and language information and effective learning of image-text correspondences in the unlabelled target domains. Additionally, PEST captures temporal information via a temporal prompt ensemble, which helps memorize previously learnt target information. Extensive experiments show that PEST outperforms the state-of-the-art consistently across 10 image recognition tasks.
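The abstract gives no implementation details, so the following is only a minimal sketch of the general idea under stated assumptions: a CLIP-style VLM exposing `encode_image`/`encode_text`, a handful of hand-written prompt templates, confidence-thresholded pseudo-labelling of unlabelled target images, and an EMA copy of the model standing in for the temporal ensemble. All names, templates, and hyper-parameters below are illustrative assumptions, not the authors' PEST implementation.

```python
# Hypothetical sketch of prompt-ensemble self-training with a CLIP-style VLM.
# The `encode_image` / `encode_text` interface, prompt templates, and the EMA
# "temporal ensemble" are assumptions for illustration only.
import torch
import torch.nn.functional as F

PROMPT_TEMPLATES = [                      # multiple language prompts per class (assumed)
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a close-up photo of a {}.",
]

@torch.no_grad()
def class_text_embeddings(vlm, tokenizer, class_names, device):
    """Ensemble several prompt templates into one text embedding per class."""
    embeddings = []
    for name in class_names:
        prompts = [t.format(name) for t in PROMPT_TEMPLATES]
        tokens = tokenizer(prompts).to(device)
        feats = F.normalize(vlm.encode_text(tokens), dim=-1)   # (num_templates, d)
        embeddings.append(F.normalize(feats.mean(dim=0), dim=-1))
    return torch.stack(embeddings)                              # (num_classes, d)

@torch.no_grad()
def pseudo_label(vlm, images, text_embs, temperature=0.01, threshold=0.5):
    """Zero-shot predictions on unlabelled target images, kept only when confident."""
    img_feats = F.normalize(vlm.encode_image(images), dim=-1)   # (batch, d)
    probs = (img_feats @ text_embs.t() / temperature).softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf > threshold
    return labels[keep], keep

@torch.no_grad()
def ema_update(ema_vlm, vlm, momentum=0.999):
    """Temporal ensemble: slowly track the student weights to memorise past target info."""
    for p_ema, p in zip(ema_vlm.parameters(), vlm.parameters()):
        p_ema.mul_(momentum).add_(p, alpha=1.0 - momentum)

def self_train_step(vlm, ema_vlm, optimizer, images, text_embs):
    """One self-training step: pseudo-label with the EMA teacher, fit the student."""
    labels, keep = pseudo_label(ema_vlm, images, text_embs)
    if keep.sum() == 0:
        return 0.0
    img_feats = F.normalize(vlm.encode_image(images[keep]), dim=-1)
    logits = img_feats @ text_embs.t() / 0.01
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(ema_vlm, vlm)
    return loss.item()
```

In practice, `self_train_step` would be looped over target-domain batches after initialising `ema_vlm` as a frozen copy of `vlm`; the paper's ensembling of prompts on the vision side is omitted here for brevity.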
Authors: Jiaxing Huang, Jingyi Zhang, Han Qiu, Sheng Jin, Shijian Lu