Emergent Mind

Self-Discover: Large Language Models Self-Compose Reasoning Structures

Published Feb 6, 2024 in cs.AI and cs.CL


We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures needed to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process in which LLMs select multiple atomic reasoning modules, such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding. SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH, by as much as 32% compared to Chain of Thought (CoT). Furthermore, SELF-DISCOVER outperforms inference-intensive methods such as CoT-Self-Consistency by more than 20%, while requiring 10-40x less inference compute. Finally, we show that the self-discovered reasoning structures are universally applicable across model families: from PaLM 2-L to GPT-4, and from GPT-4 to Llama2, and share commonalities with human reasoning patterns.


  • Self-Discover is a framework for LLMs to self-compose reasoning structures enhancing their reasoning capabilities across various tasks.

  • It outperforms Chain-of-Thought prompting by up to 32% and inference-intensive methods such as CoT-Self-Consistency by more than 20%, while demanding far less inference compute.

  • The discovered reasoning structures transfer across LLM model families and share commonalities with human reasoning patterns.

  • Self-Discover's methodology offers a scalable, efficient, and interpretable approach to improving LLM reasoning, providing a task-specific alternative to generic prompting methods.

Introduction to Self-Discover

LLMs have been at the forefront of producing coherent text and following instructions with considerable success. These models, powered by transformers, have shown potential in various applications, including text generation and task execution. As part of the ongoing efforts to enhance LLMs' reasoning capabilities, a variety of prompting methods inspired by cognitive theories have emerged. These methods, such as Chain of Thought (CoT), decomposition-based prompting, and step-back prompting, aim to mimic human problem-solving steps or break down complex problems into smaller, manageable parts. However, these techniques often operate under the assumption of a one-size-fits-all reasoning module, disregarding the unique intrinsic structure of each task. Addressing this limitation, the "Self-Discover" framework proposes a methodology for LLMs to self-compose reasoning structures tailored to individual tasks, significantly improving reasoning performance across challenging benchmarks.
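The self-discovery process described above can be sketched as a small pipeline: a few task-level meta-prompts compose atomic modules into an explicit structure, and each instance is then decoded by following that structure. This is a minimal sketch, not the paper's implementation: `call_llm` is a hypothetical stand-in for any LLM completion API, the module list is a tiny illustrative sample of the atomic reasoning modules, and the stage prompts follow the paper's select/adapt/implement decomposition in paraphrase.

```python
# Hedged sketch of the Self-Discover flow. `call_llm` is a placeholder
# for a real LLM API (GPT-4, PaLM 2, etc.); the module list below is
# illustrative, not the paper's full module set.

REASONING_MODULES = [
    "Use critical thinking to analyze the problem from different angles.",
    "Break the problem into smaller, manageable sub-problems.",
    "Think step by step, showing intermediate reasoning.",
]

def call_llm(prompt: str) -> str:
    # Placeholder: in practice, send `prompt` to an LLM and return its output.
    return f"<model output for: {prompt[:40]}...>"

def self_discover(task_examples: str) -> str:
    """Task-level self-discovery: three meta-reasoning calls, run once
    per task (the '3 additional inference steps' noted below)."""
    selected = call_llm(
        "SELECT the reasoning modules most useful for these task examples:\n"
        + "\n".join(REASONING_MODULES) + "\n\nTask examples:\n" + task_examples
    )
    adapted = call_llm(
        "ADAPT the selected module descriptions to be specific to the task:\n"
        + selected
    )
    structure = call_llm(
        "IMPLEMENT the adapted modules as an explicit, step-by-step "
        "reasoning structure (e.g., a JSON plan with named steps):\n"
        + adapted
    )
    return structure

def solve(structure: str, instance: str) -> str:
    """Instance-level decoding: follow the discovered structure."""
    return call_llm(
        f"Follow this reasoning structure:\n{structure}\n\nSolve:\n{instance}"
    )
```

The key design point is that the three discovery calls amortize across every instance of the task, while the resulting structure remains human-readable.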

Key Contributions of Self-Discover

  • Enhanced Performance on Reasoning Benchmarks: Self-Discover has demonstrated substantial improvements on various challenging reasoning tasks, such as BigBench-Hard, grounded agent reasoning, and MATH, with performance gains reaching up to 32% over traditional CoT prompting methods. Additionally, it outperformed inference-intensive methods like CoT-Self-Consistency by over 20%, with significantly reduced computational demands.

  • Computational Efficiency: The framework requires only 3 additional inference steps at the task level, in contrast to inference-intensive methods such as self-consistency, which demand 10-40x more inference compute because they sample repeatedly for every instance.

  • Transferability and Universality: The self-discovered reasoning structures are not only universally applicable across different model families but also exhibit similarities with human reasoning patterns. This underscores the framework's adaptability and its potential to enhance reasoning tasks across various LLM implementations.

  • Interpretability: By grounding the discovered reasoning structures in atomic reasoning modules, Self-Discover provides interpretable insights into LLMs’ task-solving strategies. This is a notable advantage over methods relying on less transparent optimized prompts.

Experimental Setup and Findings

The Self-Discover framework was tested across a set of 25 reasoning tasks drawn from benchmarks including BBH (BIG-Bench Hard), T4D, and MATH. Using state-of-the-art models such as GPT-4 and PaLM 2-L, the framework significantly outperformed existing prompting methods across these tasks. Notably, on the T4D grounded agent reasoning task, Self-Discover achieved over 85% accuracy with GPT-4, demonstrating both its efficiency and its effectiveness.

Implications and Future Directions

The research introduces an innovative approach to reasoning in LLMs, moving away from reliance on generic prompting methods toward task-specific, self-composed reasoning structures. This not only enhances the performance and efficiency of LLMs but also provides a more interpretable window into model reasoning. Looking forward, the potential of Self-Discover to adapt and improve across various LLM architectures opens new avenues for research, especially in domains where reasoning and complex problem-solving are crucial. The framework's ability to mimic human reasoning patterns presents opportunities for human-AI collaboration, further pushing the boundaries of what AI can achieve.


"Self-Discover" marks a significant step forward in LLM reasoning, offering a scalable and efficient methodology for self-composing reasoning structures. Its success across challenging benchmarks, combined with computational efficiency and the universality of its application, underscores the potential of LLMs to tackle complex reasoning tasks. As AI continues to evolve, frameworks like Self-Discover are pivotal in harnessing the true reasoning capabilities of LLMs, offering insights and directions for future research in the field.

