Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification (2310.08255v2)
Abstract: Vision-Language Models (VLMs) such as CLIP are trained on large amounts of image-text pairs, resulting in remarkable generalization across several data distributions. However, in several cases, their expensive training and data collection/curation costs are not justified by the end application. This motivates a vendor-client paradigm, where a vendor trains a large-scale VLM and grants only input-output access to clients on a pay-per-query basis in a black-box setting. The client aims to minimize inference cost by distilling the VLM to a student model using the limited available task-specific data, and further deploying this student model in the downstream application. While naive distillation largely improves the In-Domain (ID) accuracy of the student, it fails to transfer the superior out-of-distribution (OOD) generalization of the VLM teacher using the limited available labeled images. To mitigate this, we propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model, and further distills the aligned VLM representations to the student. This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings. The proposed approach achieves state-of-the-art results on the standard Domain Generalization benchmarks in a black-box teacher setting as well as a white-box setting where the weights of the VLM are accessible.
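To make the two-stage recipe in the abstract concrete, below is a minimal PyTorch sketch (not the authors' code). It assumes the black-box teacher returns image embeddings and class-prompt text embeddings via an API; the names (`projector`, `align_and_distill_step`), the dimensions, and the cosine-alignment loss form are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions (not from the paper): student feature size,
# VLM embedding size, number of classes, batch size.
D_STUDENT, D_VLM, NUM_CLASSES, BATCH = 512, 768, 7, 32

# Stand-in for a pre-trained student backbone; in practice this would be an
# ImageNet-pretrained trunk producing D_STUDENT-dimensional features.
student_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, D_STUDENT))

# Projection head mapping student features into the VLM embedding space, so the
# student's pre-trained features can be retained while the head adapts.
projector = nn.Linear(D_STUDENT, D_VLM)

def align_and_distill_step(images, labels, vlm_image_emb, vlm_text_emb, stage):
    """One training step of the assumed two-stage recipe.

    vlm_image_emb: (BATCH, D_VLM) image embeddings from the black-box VLM.
    vlm_text_emb:  (NUM_CLASSES, D_VLM) text embeddings of the class prompts.
    stage: "align"   -> train only the projector, keeping the student frozen;
           "distill" -> train the full student towards the aligned targets.
    """
    feats = student_backbone(images)
    if stage == "align":
        feats = feats.detach()  # preserve the pre-trained student features
    z = F.normalize(projector(feats), dim=-1)

    # Pull projected student features towards the VLM image embeddings ...
    img_loss = (1 - F.cosine_similarity(z, F.normalize(vlm_image_emb, dim=-1))).mean()

    # ... and towards the text embedding of the ground-truth class, where the
    # VLM's OOD generalization is assumed to reside.
    txt_targets = F.normalize(vlm_text_emb, dim=-1)[labels]
    txt_loss = (1 - F.cosine_similarity(z, txt_targets)).mean()

    return img_loss + txt_loss

# Toy usage with random stand-ins for the teacher's (pay-per-query) outputs.
images = torch.randn(BATCH, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (BATCH,))
vlm_image_emb = torch.randn(BATCH, D_VLM)
vlm_text_emb = torch.randn(NUM_CLASSES, D_VLM)
loss = align_and_distill_step(images, labels, vlm_image_emb, vlm_text_emb, stage="align")
loss.backward()
```

The key design point this sketch tries to capture is that alignment happens in the VLM embedding space against both modalities (image and text) before any distillation into the student, rather than distilling raw teacher logits directly.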
Authors: Sravanti Addepalli, Ashish Ramayee Asokan, Lakshay Sharma, R. Venkatesh Babu