Multimodal Reasoning via Thought Chains for Science Question Answering
The paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering" presents significant advancements in the field of science question answering by introducing ScienceQA, a comprehensive benchmark designed to evaluate the interpretability and multi-hop reasoning capabilities of AI systems.
ScienceQA stands out by incorporating multimodal multiple-choice questions drawn from diverse science topics, annotated with corresponding lectures and explanations. The dataset comprises approximately 21,000 questions spanning natural science, social science, and language science, filling a critical gap left by existing datasets, which are predominantly text-only or lack adequate annotations.
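To make the dataset structure concrete, the sketch below shows what a single ScienceQA-style record might look like; the field names and values are illustrative assumptions rather than the dataset's exact schema.

```python
# Hypothetical ScienceQA-style record; field names and values are
# illustrative assumptions, not the dataset's exact schema.
example = {
    "question": "Which of these states is farthest north?",
    "choices": ["West Virginia", "Louisiana", "Arizona", "Oklahoma"],
    "answer": 0,                    # index of the correct choice
    "image": "map_of_usa.png",      # optional visual context (may be None)
    "subject": "social science",    # natural / social / language science
    "lecture": "Maps have four cardinal directions: north, south, east, and west...",
    "solution": "To find the answer, look at the compass rose...",
}
```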
The authors design models that generate lectures and explanations alongside the correct answers, with the intent of mimicking the chain-of-thought (CoT) reasoning process that humans use. Integrating CoT allows the models to improve both accuracy and interpretability. The experiments show that the CoT framework improves answer accuracy by 1.20% for few-shot GPT-3 and 3.99% for fine-tuned UnifiedQA, and that 65.2% of the generated explanations meet the gold standard according to human evaluation.
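As a rough illustration of this answer-then-explain output format, the sketch below assembles a few-shot prompt in which each in-context demonstration ends with the answer followed by its lecture and explanation; the helper names and prompt template are assumptions for illustration, not the paper's exact implementation.

```python
def format_example(q, answer=None, lecture=None, explanation=None):
    """Render one multiple-choice question as a prompt block.

    With answer/lecture/explanation provided, the block serves as an
    in-context demonstration; without them, it ends at "Answer:" so the
    model completes the answer and then the explanation chain.
    """
    options = " ".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(q["choices"]))
    block = (
        f"Question: {q['question']}\n"
        f"Context: {q.get('context', 'N/A')}\n"
        f"Options: {options}\n"
        "Answer:"
    )
    if answer is not None:
        block += f" The answer is ({answer}). BECAUSE: {lecture} {explanation}\n"
    return block


def build_cot_prompt(demonstrations, test_question):
    # A few solved demonstrations followed by the unanswered test question.
    parts = [format_example(q, a, l, e) for q, a, l, e in demonstrations]
    parts.append(format_example(test_question))
    return "\n\n".join(parts)
```

The ordering is the essential point: the answer comes first and the lecture and explanation follow, so the model is trained or prompted to justify its prediction rather than merely produce it.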
A key aspect explored is the models' ability to leverage explanations to boost performance further. When gold explanations are provided as part of the input, the few-shot performance of GPT-3 improves dramatically, by 18.96%, indicating that explanatory data is substantially underused in conventional question-answering setups.
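The complementary setting can be sketched in the same style: the lecture and explanation are placed in the input so that the model only predicts the answer. Again, the template below is an assumption for illustration.

```python
def format_explanation_in_input(q, lecture, explanation, answer=None):
    """Render a block where the lecture and explanation are part of the
    input rather than the output, so only the answer is predicted."""
    options = " ".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(q["choices"]))
    block = (
        f"Question: {q['question']}\n"
        f"Options: {options}\n"
        f"Lecture: {lecture}\n"
        f"Explanation: {explanation}\n"
        "Answer:"
    )
    if answer is not None:
        block += f" The answer is ({answer})."
    return block
```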
The theoretical implications of this research lie in its portrayal of the CoT framework as a potential paradigm shift in AI reasoning, particularly in complex task environments. Practically, the ability to reach comparable performance with less training data offers a cost-effective and efficient path for deploying AI models in educational applications.
Future developments in AI will likely explore the scalability of these findings across other domains and further refine the integration of multimodal inputs. The research suggests that a deeper investigation into structured reasoning processes can result in performance comparable, if not superior, to that of human reasoning in nuanced contexts.
In conclusion, the paper contributes essential insights for the field of AI, pushing forward the boundaries of how models can be trained to reason in a manner analogous to human cognition. The ScienceQA dataset and derived methodologies present pivotal tools for researchers aiming to delve into the intersection of interpretability and effectiveness in AI systems.