
Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification (2310.08255v2)

Published 12 Oct 2023 in cs.CV

Abstract: Vision-Language Models (VLMs) such as CLIP are trained on large amounts of image-text pairs, resulting in remarkable generalization across several data distributions. However, in several cases, their expensive training and data collection/curation costs do not justify the end application. This motivates a vendor-client paradigm, where a vendor trains a large-scale VLM and grants only input-output access to clients on a pay-per-query basis in a black-box setting. The client aims to minimize inference cost by distilling the VLM to a student model using the limited available task-specific data, and further deploying this student model in the downstream application. While naive distillation largely improves the In-Domain (ID) accuracy of the student, it fails to transfer the superior out-of-distribution (OOD) generalization of the VLM teacher using the limited available labeled images. To mitigate this, we propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model, and further distills the aligned VLM representations to the student. This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings. The proposed approach achieves state-of-the-art results on the standard Domain Generalization benchmarks in a black-box teacher setting as well as a white-box setting where the weights of the VLM are accessible.
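The align-then-distill idea in the abstract can be sketched in a minimal, hypothetical form. The snippet below is not the authors' VL2V-ADiP implementation; it only illustrates the two loss terms one might use under simplifying assumptions: an alignment loss that pulls student features toward the teacher's joint image-text embedding, and a standard softened-logit distillation loss. All function names, shapes, and the choice of cosine alignment are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Normalize each row to unit length (as CLIP-style embeddings typically are).
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def align_loss(student_feats, vlm_image_emb, vlm_text_emb):
    """Stage 1 (Align, illustrative): pull the student's features toward the
    teacher's joint vision-language target for each labeled example.
    Here the target is simply the concatenated image/text embeddings."""
    target = l2_normalize(np.concatenate([vlm_image_emb, vlm_text_emb], axis=-1))
    pred = l2_normalize(student_feats)
    # Cosine-distance alignment, averaged over the batch.
    return float(np.mean(1.0 - np.sum(pred * target, axis=-1)))

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Stage 2 (Distill, illustrative): KL divergence on temperature-softened
    logits, as in standard knowledge distillation."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p = softmax(teacher_logits / temperature)   # teacher distribution
    q = softmax(student_logits / temperature)   # student distribution
    return float(np.mean(np.sum(p * (np.log(p + 1e-8) - np.log(q + 1e-8)), axis=-1)))
```

In the black-box setting described in the abstract, the client would only have access to the teacher's output embeddings/logits per query, which is all these loss functions require; the teacher's weights never enter the computation.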

Authors (4)
  1. Sravanti Addepalli
  2. Ashish Ramayee Asokan
  3. Lakshay Sharma
  4. R. Venkatesh Babu
Citations (2)