Chain of Thought Prompt Tuning in Vision Language Models (2304.07919v2)
Abstract: Language-Image Pre-training has demonstrated promising results on zero-shot and few-shot downstream tasks by prompting visual models with natural language prompts. However, most recent studies use only a single prompt for tuning, neglecting the inherent step-by-step cognitive reasoning process that humans conduct in complex task settings, for example, when processing images from unfamiliar domains. Chain of Thought is a simple and effective approximation of the human reasoning process and has proven useful for NLP tasks. Based on this cognitive intuition, we believe that conducting effective reasoning is also an important problem in visual tasks, and that a chain of thought could be a solution. In this work, we propose a novel chain-of-thought prompt tuning method for vision-language modeling. Extensive experiments show that our method not only generalizes better on image classification tasks, transfers better beyond a single dataset, and achieves stronger domain generalization, but also performs much better on image-text retrieval and visual question answering, which require more reasoning capability. We are the first to successfully adapt chain-of-thought prompting to combine visual and textual embeddings. We will release our code.
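The abstract describes chaining prompts so that each reasoning step refines the previous one while conditioning on the image. A minimal sketch of that idea, with assumed names (`chain_of_thought_prompts`, the per-step residual refinement, and the tiny feature dimension are all illustrative, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def chain_of_thought_prompts(image_feat, prompt_init, weights):
    """Hypothetical sketch of chained prompt tuning: each step refines
    the learnable prompt embedding, conditioned on the image features.
    In a real system the weights would be trained end-to-end; here they
    are random placeholders."""
    prompt = prompt_init
    chain = [prompt]
    for W in weights:
        # Condition the step on both the current prompt and the image.
        x = np.concatenate([prompt, image_feat])
        # Residual refinement: the new prompt builds on the previous one,
        # approximating a step-by-step reasoning chain.
        prompt = prompt + np.tanh(W @ x)
        chain.append(prompt)
    return chain

d = 8                                   # toy embedding dimension
image_feat = rng.standard_normal(d)     # stand-in for encoded image
prompt_init = rng.standard_normal(d)    # stand-in for a learnable prompt
weights = [rng.standard_normal((d, 2 * d)) * 0.1 for _ in range(3)]
chain = chain_of_thought_prompts(image_feat, prompt_init, weights)
print(len(chain))  # initial prompt plus one entry per reasoning step
```

The chained structure is what distinguishes this from single-prompt tuning: later prompts can depend on earlier ones, mimicking multi-step reasoning.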
- Jiaxin Ge
- Hongyin Luo
- Siyuan Qian
- Yulu Gan
- Jie Fu
- Shanghang Zhang