Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks (2308.09033v2)

Published 17 Aug 2023 in cs.CV

Abstract: Natural Language Explanations (NLE) aim to supplement a model's prediction with human-friendly natural text. Existing NLE approaches train a separate model for each downstream task. In this work, we propose Uni-NLX, a unified framework that consolidates all NLE tasks into a single, compact multi-task model using a unified training objective of text generation. Additionally, we introduce two new NLE datasets: 1) ImageNetX, a dataset of 144K samples for explaining ImageNet categories, and 2) VQA-ParaX, a dataset of 123K samples for explaining the task of Visual Question Answering (VQA). Both datasets are derived by leveraging LLMs. By training on the combined 1M NLE samples, our single unified framework simultaneously performs seven NLE tasks, spanning VQA, visual recognition, and visual reasoning, with 7x fewer parameters, demonstrating performance comparable to the independent task-specific models of previous approaches and even outperforming them on certain tasks. Code is at https://github.com/fawazsammani/uni-nlx
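The unifying idea is that every NLE task, whether recognition, VQA, or reasoning, can be serialized into one text-generation format, so a single decoder trained with a single next-token cross-entropy loss covers all tasks at once. Below is a minimal sketch of that objective, assuming a GPT-2-style decoder from Hugging Face transformers; the format_sample helper and its prompt templates are illustrative assumptions rather than the paper's actual templates, and the real model additionally conditions on visual features.

```python
# Minimal sketch (not the authors' code) of a unified text-generation
# objective: heterogeneous NLE tasks are serialized into plain text and
# trained with one language-modeling loss.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

def format_sample(task, question, answer, explanation):
    """Serialize any NLE task as text; the task name acts as a prompt.
    The template here is a hypothetical stand-in for the paper's format."""
    q = f" question: {question}" if question else ""
    return f"{task}:{q} answer: {answer} because {explanation}"

# Toy mixed-task batch: VQA-style and recognition-style samples share one format.
batch = [
    format_sample("vqa", "what sport is this?", "tennis",
                  "the player is holding a racket on a court"),
    format_sample("recognition", None, "golden retriever",
                  "the dog has a long golden coat and floppy ears"),
]

enc = tokenizer(batch, return_tensors="pt", padding=True)
# Ignore padding positions in the loss by setting their labels to -100.
labels = enc.input_ids.masked_fill(enc.attention_mask == 0, -100)

# One objective for all tasks: next-token cross-entropy over the serialized text.
loss = model(**enc, labels=labels).loss
loss.backward()
print(f"unified LM loss: {loss.item():.3f}")
```

Because the task identity lives entirely in the input text, adding an eighth task under this scheme would require no new parameters or heads, only more serialized training samples.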

Authors (2)
  1. Fawaz Sammani (7 papers)
  2. Nikos Deligiannis (54 papers)