Measuring and Narrowing the Compositionality Gap in Language Models (2210.03350v3)

Published 7 Oct 2022 in cs.CL

Abstract: We investigate the ability of LLMs to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. We then demonstrate how elicitive prompting (such as chain of thought) narrows the compositionality gap by reasoning explicitly. We present a new method, self-ask, that further improves on chain of thought. In our method, the model explicitly asks itself (and answers) follow-up questions before answering the initial question. We finally show that self-ask's structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy.

Authors (6)
  1. Ofir Press (21 papers)
  2. Muru Zhang (9 papers)
  3. Sewon Min (45 papers)
  4. Ludwig Schmidt (80 papers)
  5. Noah A. Smith (224 papers)
  6. Mike Lewis (78 papers)
Citations (463)

Summary

  • The paper introduces the compositionality gap concept, showing that even larger models struggle with integrating multi-hop answers.
  • It employs elicitive prompting techniques, including chain-of-thought and self-ask, to explicitly improve compositional reasoning.
  • Evaluations across diverse datasets reveal that structured prompting, especially when combined with retrieval, narrows the gap between mere memorization and genuine compositional reasoning.

An Evaluation of the Compositionality Gap in LLMs

The paper "Measuring and Narrowing the Compositionality Gap in LLMs" investigates the ability of LLMs (LMs) to perform compositional reasoning tasks. Compositional reasoning requires the model to integrate answers to sub-problems to arrive at a solution for a larger problem. The authors introduce the term "compositionality gap" to denote instances where a model answers sub-problems correctly but fails to combine these into the overall solution. The compositionality gap is quantitatively measured using multi-hop questions, where answers necessitate the synthesis of multiple separate facts. These facts are often unlikely to have been encountered together during the model's pretraining phase.

Main Findings

One of the paper's main findings is that the compositionality gap does not shrink as model size increases. In the GPT-3 family, single-hop question answering improves with scale, but multi-hop performance improves more slowly, so the gap persists. This suggests that larger models memorize and recall more facts without a corresponding improvement in composing those facts into new answers.

The authors introduce a new dataset, Compositional Celebrities (CC), which consists of 8.6k 2-hop questions designed to evaluate this gap. The questions are constructed from facts that are usually stated separately, so answering them requires compositional reasoning rather than recall of a memorized co-occurrence.
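
For illustration, a 2-hop item in the spirit of this construction pairs two single-hop sub-questions with their composition; the example below is hypothetical and not taken from CC itself.

```python
# Hypothetical 2-hop item in the style described above (not an actual
# Compositional Celebrities entry). The compositionality gap is measured by
# asking the sub-questions and the composed question separately.
item = {
    "sub_question_1": "In which year was celebrity X born?",
    "sub_answer_1": "1994",
    "sub_question_2": "Who won the Nobel Prize in Literature in 1994?",
    "sub_answer_2": "Kenzaburo Oe",
    "composed_question": ("Who won the Nobel Prize in Literature "
                          "in the year celebrity X was born?"),
    "composed_answer": "Kenzaburo Oe",
}
```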

Methodological Advancements

To narrow the compositionality gap, the authors explore elicitive prompting strategies, including chain-of-thought prompting and a novel method termed self-ask. In self-ask, the model explicitly asks itself follow-up questions, answers them, and only then produces the final answer, making its reasoning explicit. Because the prompt is structured, self-ask also integrates naturally with an external search engine: the follow-up questions can be routed to the search engine, retrieving factual knowledge during the reasoning process and further improving accuracy.
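
A minimal sketch of how this loop might be implemented is shown below. Here generate and search are hypothetical placeholders for an LM completion call (with stop sequences) and a search-engine lookup, and the scaffold strings follow the follow-up / intermediate-answer / final-answer format the method relies on.

```python
# Minimal sketch of self-ask-style prompting with a search-engine plug-in.
# `generate(prompt, stop=...)` and `search(query)` are hypothetical stand-ins
# for an LM completion call and a search lookup, not real APIs.

FOLLOW_UP = "Follow up:"
INTERMEDIATE = "Intermediate answer:"
FINAL = "So the final answer is:"

def self_ask(question, few_shot_prefix, generate, search, max_hops=4):
    """Answer a multi-hop question by letting the model pose follow-up
    questions and answering each one with an external search call."""
    prompt = (few_shot_prefix
              + f"Question: {question}\n"
              + "Are follow up questions needed here: Yes.\n")
    for _ in range(max_hops):
        # Let the model either ask the next follow-up question or commit to
        # a final answer; stop before it writes the intermediate answer itself.
        continuation = generate(prompt, stop=[INTERMEDIATE])
        prompt += continuation
        if FINAL in continuation:
            return continuation.split(FINAL)[-1].strip()
        if FOLLOW_UP in continuation:
            sub_question = continuation.split(FOLLOW_UP)[-1].strip()
            # Answer the follow-up with the search engine and feed the
            # result back into the prompt as the intermediate answer.
            prompt += f"\n{INTERMEDIATE} {search(sub_question)}\n"
    # If the hop budget runs out, force the model to commit to an answer.
    return generate(prompt + FINAL, stop=["\n"]).strip()
```

In the paper's setup, a few-shot prefix of worked examples in this format precedes the question; the sketch assumes the same, which is what makes the stop-sequence parsing above workable.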

The authors evaluate their methods on several datasets, including two existing benchmarks, 2WikiMultiHopQA and Musique, and a smaller, manually created dataset, Bamboogle. Across these evaluations, self-ask substantially improves the model's ability to answer complex compositional questions compared with simpler baselines such as direct prompting.

Implications and Future Directions

The persistent compositionality gap identified in this paper has significant implications for the development and application of LLMs. It points to a limitation of approaches that rely on model scaling without explicitly strengthening compositional reasoning. The proposed self-ask method suggests a path toward addressing this limitation by promoting explicit reasoning strategies, implying that fostering structured, iterative reasoning in models may be more beneficial than further scaling alone.

Future work in this domain could focus on refining these prompting strategies and analyzing their effects on even larger models, higher-order compositional tasks, or other NLP challenges. Furthermore, the potential integration with real-time data sources or retrieval systems, as demonstrated in the synergy between self-ask and search engines, presents an exciting avenue for enhancing practical LM applications that require up-to-date and accurate information synthesis.

In summary, this paper provides critical insights into the challenges and potential strategies for enhancing compositional reasoning in LLMs. It advocates a nuanced approach that combines reasoning with retrieval, suggesting that structured elicitive prompts may be essential for bridging the gap between sheer factual knowledge and genuine reasoning capabilities in AI systems.
