
Review of Large Vision Models and Visual Prompt Engineering (2307.00855v1)

Published 3 Jul 2023 in cs.CV and cs.AI

Abstract: Visual prompt engineering is a fundamental technology in the field of visual and image Artificial General Intelligence, serving as a key component for achieving zero-shot capabilities. As the development of large vision models progresses, the importance of prompt engineering becomes increasingly evident. Designing suitable prompts for specific visual tasks has emerged as a meaningful research direction. This review aims to summarize the methods employed in the computer vision domain for large vision models and visual prompt engineering, exploring the latest advancements in visual prompt engineering. We present influential large models in the visual domain and a range of prompt engineering methods employed on these models. It is our hope that this review provides a comprehensive and systematic description of prompt engineering methods based on large visual models, offering valuable insights for future researchers in their exploration of this field.

Advances and Challenges in Visual Prompt Engineering for Large Vision Models

Introduction to Visual Prompt Engineering

Visual prompt engineering has become a focal point in the pursuit of artificial general intelligence (AGI) within computer vision, a development made especially significant by the rapid progress of large vision models. At its core, prompt engineering is the design and optimization of prompts that steer these expansive models towards the desired output for a specific visual task. The field spans a wide array of approaches, including text, image, and combined text-image prompts, each tailored to the requirements of different tasks.

Evolution of Large Models and Prompt Engineering

The expansion of large models has been a dynamic and transformative journey, initially spurred by the introduction of the Transformer architecture. Successive advances, from BERT and the GPT series to ViT, have reshaped the landscape of both NLP and computer vision. Pretrained on extensive datasets through self-supervised learning, these models demonstrate an impressive ability to understand and generate natural language as well as to interpret image content, and they adapt remarkably well to a wide range of downstream tasks.

In parallel, prompt engineering has made substantial progress and established itself as a crucial methodology for harnessing the potential of large vision models effectively. Representative examples include CLIP, which uses text prompts in multi-modal contrastive learning, and SAM, which relies on prompt-based interaction to handle downstream segmentation tasks efficiently.
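
To make the role of text prompts in a model such as CLIP concrete, the following minimal sketch scores an image against a set of hand-written prompts for zero-shot classification. It assumes the Hugging Face transformers implementation of CLIP; the class names, prompt template, and image path (example.jpg) are placeholders, not part of the reviewed paper.

```python
# Minimal sketch of zero-shot classification with CLIP text prompts.
# Assumes the Hugging Face `transformers` CLIP implementation; the class
# names, prompt template, and image path are illustrative placeholders.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "airplane"]                 # hypothetical label set
prompts = [f"a photo of a {c}" for c in classes]     # hand-crafted text prompts

image = Image.open("example.jpg")                    # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarities for each prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
print({c: float(p) for c, p in zip(classes, probs[0])})
```

Changing only the prompt template (for example, "a sketch of a {c}") changes the zero-shot behaviour without touching any model weights, which is the sense in which prompt design becomes an engineering problem.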

Key Models and Their Contributions

Several models have been at the forefront of integrating visual prompts into the field of large vision models, including:

  • CLIP and ALIGN, which have set benchmarks in aligning textual and visual information through contrastive learning.
  • Visual Prompt Tuning (VPT), which prepends task-specific learnable prompt parameters to the input of a frozen backbone, enabling parameter-efficient fine-tuning.
  • SAM, which leverages prompt engineering across a wide array of segmentation tasks and demonstrates strong zero-shot generalization; a minimal prompting sketch follows this list.
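
The sketch below illustrates how SAM consumes a prompt at inference time by querying a pretrained checkpoint with a single foreground point. It assumes the official segment-anything package and a downloaded ViT-H checkpoint; the image path and click coordinates are placeholders.

```python
# Minimal sketch of prompting SAM with a single foreground point.
# Assumes the official `segment_anything` package and a downloaded ViT-H
# checkpoint; the image path and click coordinates are placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical image
predictor.set_image(image)

point_coords = np.array([[320, 240]])   # (x, y) of a simulated user click
point_labels = np.array([1])            # 1 marks a foreground point prompt

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,              # return several candidate masks
)
best = masks[int(scores.argmax())]      # keep the highest-scoring mask
print(best.shape, float(scores.max()))
```

Box and mask prompts follow the same pattern through the corresponding predict arguments, which is why a single frozen SAM can serve many interactive segmentation workflows.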

Prompts in Multi-Modal Learning and Visual Tasks

Prompt engineering transcends mere text manipulation, extending into multi-modal and visually intensive tasks:

  • Multi-modal prompts have seen innovations such as CoOp, which replaces hand-crafted prompt words with trainable continuous vectors (a simplified sketch follows this list), and DenseCLIP, which adapts CLIP for dense prediction tasks.
  • Visual prompts have been instrumental in interactive segmentation and few-shot image segmentation, with methods like VPT and AdaptFormer showcasing efficient fine-tuning strategies.
  • The adaptation of foundation models, such as combining SAM with CLIP and Stable Diffusion (SD) in approaches like Edit Everything and SAM-Track, exemplifies the potential of prompts to extend the generalization capabilities of large models across varied tasks.
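
To make the CoOp idea concrete, the simplified sketch below prepends a small set of trainable context vectors to frozen class-name embeddings to form the prompts fed to a text encoder. It is a conceptual stand-in rather than the original implementation: the embedding dimension, context length, and class-name embeddings are placeholders, and in practice only the context vectors would be updated during few-shot training while the CLIP backbone stays frozen.

```python
# Conceptual sketch of CoOp-style prompt learning: a few trainable context
# vectors replace hand-written prompt words. The class-name embeddings and
# dimensions are placeholders, not CLIP's actual text pipeline.
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    def __init__(self, n_ctx: int = 16, dim: int = 512, n_classes: int = 10):
        super().__init__()
        # Shared context vectors, the only parameters optimized during tuning.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Frozen stand-ins for the tokenized class-name embeddings.
        self.register_buffer("cls_emb", torch.randn(n_classes, 1, dim))

    def forward(self) -> torch.Tensor:
        n_classes = self.cls_emb.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)   # [C, n_ctx, dim]
        # Prompt for class c = [context vectors ; class-name embedding].
        return torch.cat([ctx, self.cls_emb], dim=1)            # [C, n_ctx + 1, dim]

prompts = LearnablePrompt()()
print(prompts.shape)  # torch.Size([10, 17, 512])
```

The resulting prompt embeddings would then be passed through the (frozen) text encoder and matched against image features, so the learned context plays the role that a hand-written template plays in zero-shot CLIP.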

Future Directions in Prompt Engineering

The exploration of visual prompt engineering indicates a promising trajectory towards realizing the full potential of AGI in computer vision. This includes refining prompt design strategies, improving model adaptability through prompts, and enhancing the interplay between textual and visual prompts for multi-modal tasks.

Conclusion

Visual prompt engineering emerges as a pivotal technique for maximizing the utility of large vision models, bringing us closer to achieving sophisticated computer vision capabilities. The continuous evolution of prompt-based methodologies promises to expand the horizons of what is achievable in AGI, fostering advancements that will likely redefine our interaction with and understanding of visual content in the digital age.

Authors

Jiaqi Wang, Zhengliang Liu, Lin Zhao, Zihao Wu, Chong Ma, Sigang Yu, Haixing Dai, Qiushi Yang, Yiheng Liu, Songyao Zhang, Enze Shi, Yi Pan, Tuo Zhang, Dajiang Zhu, Xiang Li, Xi Jiang, Bao Ge, Yixuan Yuan, Dinggang Shen, Tianming Liu