Visual Attention Prompted Prediction and Learning (2310.08420v3)
Abstract: Visual explanation (attention)-guided learning uses not only labels but also explanations to guide the model's reasoning process. While attention-guided learning has shown promising results, it requires a large number of explanation annotations that are time-consuming to prepare. In many real-world situations, however, it is desirable to prompt the model with visual attention without retraining it. For example, in AI-assisted cancer classification on a medical image, users (e.g., clinicians) can provide the model with a visual attention prompt indicating which areas are indispensable and which should be excluded. Despite these promising objectives, achieving visual attention-prompted prediction presents several major challenges: 1) How can the visual prompt be effectively integrated into the model's reasoning process? 2) How should the model handle samples that lack visual prompts? 3) How is the model's performance affected when a visual prompt is imperfect? This paper introduces a novel framework for attention-prompted prediction and learning that uses visual prompts to steer the model's reasoning process. To improve performance in non-prompted situations and align it with prompted scenarios, we propose co-training the non-prompted and prompted models so that they share similar parameters and activations. Additionally, for instances where the visual prompt does not cover the entire input image, we develop attention prompt refinement methods that interpolate the incomplete prompts while keeping them aligned with the model's explanations. Extensive experiments on four datasets demonstrate the effectiveness of the proposed framework in improving predictions for samples both with and without prompts.
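As a rough illustration of the co-training idea described above (not the paper's actual architecture), the sketch below assumes a single shared linear layer: the prompted branch gates its input with an attention-prompt mask, the non-prompted branch sees the full input with the same parameters, and an alignment term penalizes the gap between the two branches' activations. All names (`prompted_forward`, `alignment_loss`) are hypothetical.

```python
import numpy as np

def prompted_forward(x, w, prompt):
    # Prompted branch: the attention prompt gates the input features
    # before they reach the shared weights w.
    return (x * prompt) @ w

def alignment_loss(x, w, prompt):
    """Toy co-training objective: pull the non-prompted branch's
    activations toward the prompted branch's, with shared parameters w."""
    a_prompted = prompted_forward(x, w, prompt)
    a_plain = x @ w  # non-prompted branch, same parameters
    return float(np.mean((a_plain - a_prompted) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # a batch of 4 inputs with 8 features
w = rng.normal(size=(8, 3))        # shared weights
full = np.ones(8)                  # a prompt covering the whole input
partial = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)  # incomplete prompt

loss_full = alignment_loss(x, w, full)       # 0: branches coincide
loss_partial = alignment_loss(x, w, partial) # > 0: branches diverge
```

A prompt that covers the entire input leaves the two branches identical, so the alignment term vanishes; an incomplete prompt makes the term positive, which is the signal the co-training objective would minimize.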