Measuring and Narrowing the Compositionality Gap in Language Models (2210.03350v3)

Published 7 Oct 2022 in cs.CL

Abstract: We investigate the ability of LLMs to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. We then demonstrate how elicitive prompting (such as chain of thought) narrows the compositionality gap by reasoning explicitly. We present a new method, self-ask, that further improves on chain of thought. In our method, the model explicitly asks itself (and answers) follow-up questions before answering the initial question. We finally show that self-ask's structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy.

Authors (6)
  1. Ofir Press (21 papers)
  2. Muru Zhang (9 papers)
  3. Sewon Min (45 papers)
  4. Ludwig Schmidt (80 papers)
  5. Noah A. Smith (224 papers)
  6. Mike Lewis (78 papers)
Citations (463)

Summary

  • The paper introduces the compositionality gap concept, showing that even larger models struggle with integrating multi-hop answers.
  • It employs elicitive prompting techniques, including chain-of-thought and self-ask, to explicitly improve compositional reasoning.
  • Evaluations across diverse datasets reveal that structured prompting, especially when combined with retrieval, narrows the gap between mere memorization and genuine compositional reasoning.

An Evaluation of the Compositionality Gap in LLMs

The paper "Measuring and Narrowing the Compositionality Gap in LLMs" investigates the ability of LLMs (LMs) to perform compositional reasoning tasks. Compositional reasoning requires the model to integrate answers to sub-problems to arrive at a solution for a larger problem. The authors introduce the term "compositionality gap" to denote instances where a model answers sub-problems correctly but fails to combine these into the overall solution. The compositionality gap is quantitatively measured using multi-hop questions, where answers necessitate the synthesis of multiple separate facts. These facts are often unlikely to have been encountered together during the model's pretraining phase.

Main Findings

One of the paper's main findings is that the compositionality gap does not shrink as model size increases. In the GPT-3 family, single-hop question answering improves with scale, but multi-hop performance improves more slowly, so the gap persists. This suggests that larger models memorize and recall more facts without a corresponding improvement in composing those facts into new answers.

The authors introduce a new dataset, Compositional Celebrities (CC), which consists of 8.6k 2-hop questions designed to evaluate this gap. The questions are constructed from facts that are usually stated separately, so answering them requires compositional reasoning rather than recall of a memorized co-occurrence.
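
For illustration, a 2-hop item in the spirit of this construction pairs two single-hop sub-questions with their composition; the example below is hypothetical and not taken from CC itself.

```python
# Hypothetical 2-hop item in the style described above (not an actual
# Compositional Celebrities entry). The compositionality gap is measured by
# asking the sub-questions and the composed question separately.
item = {
    "sub_question_1": "In which year was celebrity X born?",
    "sub_answer_1": "1994",
    "sub_question_2": "Who won the Nobel Prize in Literature in 1994?",
    "sub_answer_2": "Kenzaburo Oe",
    "composed_question": ("Who won the Nobel Prize in Literature "
                          "in the year celebrity X was born?"),
    "composed_answer": "Kenzaburo Oe",
}
```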

Methodological Advancements

To narrow the compositionality gap, the authors explore elicitive prompting strategies, including chain-of-thought prompting and a novel method termed self-ask. In self-ask, the model explicitly asks itself follow-up questions, answers them, and only then produces the final answer, making its reasoning explicit. Because the prompt is structured, self-ask also integrates naturally with an external search engine: the follow-up questions can be routed to the search engine, retrieving factual knowledge during the reasoning process and further improving accuracy.
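
A minimal sketch of how this loop might be implemented is shown below. Here generate and search are hypothetical placeholders for an LM completion call (with stop sequences) and a search-engine lookup, and the scaffold strings follow the follow-up / intermediate-answer / final-answer format the method relies on.

```python
# Minimal sketch of self-ask-style prompting with a search-engine plug-in.
# `generate(prompt, stop=...)` and `search(query)` are hypothetical stand-ins
# for an LM completion call and a search lookup, not real APIs.

FOLLOW_UP = "Follow up:"
INTERMEDIATE = "Intermediate answer:"
FINAL = "So the final answer is:"

def self_ask(question, few_shot_prefix, generate, search, max_hops=4):
    """Answer a multi-hop question by letting the model pose follow-up
    questions and answering each one with an external search call."""
    prompt = (few_shot_prefix
              + f"Question: {question}\n"
              + "Are follow up questions needed here: Yes.\n")
    for _ in range(max_hops):
        # Let the model either ask the next follow-up question or commit to
        # a final answer; stop before it writes the intermediate answer itself.
        continuation = generate(prompt, stop=[INTERMEDIATE])
        prompt += continuation
        if FINAL in continuation:
            return continuation.split(FINAL)[-1].strip()
        if FOLLOW_UP in continuation:
            sub_question = continuation.split(FOLLOW_UP)[-1].strip()
            # Answer the follow-up with the search engine and feed the
            # result back into the prompt as the intermediate answer.
            prompt += f"\n{INTERMEDIATE} {search(sub_question)}\n"
    # If the hop budget runs out, force the model to commit to an answer.
    return generate(prompt + FINAL, stop=["\n"]).strip()
```

In the paper's setup, a few-shot prefix of worked examples in this format precedes the question; the sketch assumes the same, which is what makes the stop-sequence parsing above workable.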

The authors evaluate their methods on several datasets, including two existing benchmarks, 2WikiMultiHopQA and Musique, and a smaller, manually created dataset, Bamboogle. Across these evaluations, self-ask substantially improves the model's ability to answer complex compositional questions compared with simpler baselines such as direct prompting.

Implications and Future Directions

The persistent compositionality gap identified in this paper has significant implications for the development and application of LLMs. It points to a limitation of approaches that rely on model scaling without explicitly strengthening compositional reasoning. The proposed self-ask method suggests a path toward addressing this limitation by promoting explicit reasoning strategies, implying that fostering structured, iterative reasoning in models may be more beneficial than further scaling alone.

Future work in this domain could focus on refining these prompting strategies and analyzing their effects on even larger models, higher-order compositional tasks, or other NLP challenges. Furthermore, the potential integration with real-time data sources or retrieval systems, as demonstrated in the synergy between self-ask and search engines, presents an exciting avenue for enhancing practical LM applications that require up-to-date and accurate information synthesis.

In summary, this paper provides critical insights into the challenges and potential strategies for enhancing compositional reasoning in LLMs. It advocates a nuanced approach that combines reasoning with retrieval, suggesting that structured elicitive prompts may be essential for bridging the gap between sheer factual knowledge and genuine reasoning capabilities in AI systems.
