Domain-Controlled Prompt Learning (2310.07730v2)

Published 30 Sep 2023 in cs.CV and eess.IV

Abstract: Large pre-trained vision-language models, such as CLIP, have shown remarkable generalization across various tasks when given appropriate text prompts. However, adapting these models to specific domains, such as remote sensing images (RSIs) or medical images, remains underexplored and challenging. Existing prompt learning methods often lack domain awareness or a domain-transfer mechanism, leading to suboptimal performance because domain-specific images are misinterpreted through natural-image patterns. To tackle this dilemma, we propose Domain-Controlled Prompt Learning for specific domains. Specifically, a large-scale specific-domain foundation model (LSDM) is first introduced to provide essential domain knowledge. Using lightweight neural networks, we transfer this knowledge into domain biases, which are directly incorporated into both the visual and language branches to obtain domain-adaptive prompts. Simultaneously, to overcome overfitting, we propose a novel noise-adding strategy, requiring no extra trainable parameters, that helps the model escape suboptimal solutions through global domain oscillation. Experimental results show our method achieves state-of-the-art performance on specific-domain image recognition datasets. Our code is available at https://github.com/caoql98/DCPL.
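The core mechanism described above can be sketched in a few lines: a lightweight map turns LSDM features into a domain bias added to learnable prompt tokens, and parameter-free Gaussian noise perturbs the prompts during training. This is a minimal NumPy illustration, not the released implementation; the dimensions, the single linear projection, and the noise scale are all assumptions for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the actual model sizes differ.
d_lsdm, d_prompt, n_ctx = 1024, 512, 4

# Lightweight network (here reduced to one linear map) that turns
# specific-domain foundation model (LSDM) features into a domain bias.
W_bias = rng.normal(scale=0.02, size=(d_lsdm, d_prompt))

def domain_bias(lsdm_feat):
    """Project LSDM features into the prompt embedding space."""
    return lsdm_feat @ W_bias  # shape: (d_prompt,)

def domain_controlled_prompts(ctx, lsdm_feat, noise_std=0.1, training=True):
    """Add the domain bias to every learnable context token; during
    training, also add parameter-free Gaussian noise (the noise-adding
    strategy) to help the prompts escape suboptimal solutions."""
    prompts = ctx + domain_bias(lsdm_feat)  # broadcast over n_ctx tokens
    if training:
        prompts = prompts + rng.normal(scale=noise_std, size=prompts.shape)
    return prompts

ctx = rng.normal(size=(n_ctx, d_prompt))   # learnable context tokens
feat = rng.normal(size=(d_lsdm,))          # LSDM feature for one image
train_p = domain_controlled_prompts(ctx, feat, training=True)
eval_p = domain_controlled_prompts(ctx, feat, training=False)
```

At inference the noise is disabled, so the prompts are a deterministic function of the context tokens and the domain bias; during training each forward pass sees a slightly perturbed prompt.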
