Self-Discover: Large Language Models Self-Compose Reasoning Structures (2402.03620v1)

Published 6 Feb 2024 in cs.AI and cs.CL

Abstract: We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process where LLMs select multiple atomic reasoning modules such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding. SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH, by as much as 32% compared to Chain of Thought (CoT). Furthermore, SELF-DISCOVER outperforms inference-intensive methods such as CoT-Self-Consistency by more than 20%, while requiring 10-40x fewer inference compute. Finally, we show that the self-discovered reasoning structures are universally applicable across model families: from PaLM 2-L to GPT-4, and from GPT-4 to Llama2, and share commonalities with human reasoning patterns.

Self-Discover: Enhancing LLM Reasoning Through Self-Discovered Reasoning Structures

Introduction to Self-Discover

LLMs have been at the forefront of producing coherent text and following instructions with considerable success. These transformer-based models have shown potential in a range of applications, from text generation to task execution. As part of ongoing efforts to enhance LLMs' reasoning capabilities, a variety of prompting methods inspired by cognitive theories have emerged. These methods, such as Chain of Thought (CoT), decomposition-based prompting, and step-back prompting, aim to mimic human problem-solving steps or to break complex problems into smaller, manageable parts. However, these techniques often assume a one-size-fits-all reasoning module, disregarding the unique intrinsic structure of each task. Addressing this limitation, the "Self-Discover" framework proposes a methodology for LLMs to self-compose reasoning structures tailored to individual tasks, significantly improving reasoning performance across challenging benchmarks.
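
The contrast between one-size-fits-all prompting and a task-intrinsic structure can be sketched in a few lines of Python. The prompt wording and field names below are illustrative, not taken from the paper:

```python
# Generic one-size-fits-all prompting: the same trigger phrase
# is appended regardless of the task.
def cot_prompt(task: str) -> str:
    return f"{task}\nLet's think step by step."

# SELF-DISCOVER instead pairs the task with an explicit, task-specific
# reasoning structure (here a JSON-like skeleton the model fills in).
def self_discover_prompt(task: str, structure: dict) -> str:
    steps = "\n".join(f'  "{k}": "..."' for k in structure)
    return (
        f"{task}\n"
        "Follow this reasoning structure, filling in each field:\n"
        "{\n" + steps + "\n}"
    )

task = "A train leaves at 3pm traveling 60 mph. When has it covered 180 miles?"
structure = {
    "identify_given_quantities": "",
    "compute_travel_time": "",
    "add_to_departure_time": "",
}
print(self_discover_prompt(task, structure))
```

The key difference is that the structure passed to `self_discover_prompt` is itself produced by the model per task, rather than fixed in advance by a prompt engineer.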

Key Contributions of Self-Discover

  • Enhanced Performance on Reasoning Benchmarks: Self-Discover has demonstrated substantial improvements on various challenging reasoning tasks, such as BigBench-Hard, grounded agent reasoning, and MATH, with performance gains reaching up to 32% over traditional CoT prompting methods. Additionally, it outperformed inference-intensive methods like CoT-Self-Consistency by over 20%, with significantly reduced computational demands.
  • Computational Efficiency: The framework's efficiency is highlighted through its modest requirement of only 3 additional inference steps at the task level, a drastic reduction compared to methods demanding 10-40 times more inference compute.
  • Transferability and Universality: The self-discovered reasoning structures are not only universally applicable across different model families but also exhibit similarities with human reasoning patterns. This underscores the framework's adaptability and its potential to enhance reasoning tasks across various LLM implementations.
  • Interpretability: By grounding the discovered reasoning structures in atomic reasoning modules, Self-Discover provides interpretable insights into LLMs’ task-solving strategies. This is a notable advantage over methods relying on less transparent optimized prompts.
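
The three task-level steps the paper names are SELECT, ADAPT, and IMPLEMENT: choose relevant atomic reasoning modules, rephrase them for the task, and operationalize them into an explicit structure that is then reused for every instance. A minimal control-flow sketch, with a toy stand-in for the model call and illustrative prompt strings (not the paper's exact meta-prompts):

```python
from typing import Callable, List

def self_discover(task_examples: List[str], llm: Callable[[str], str],
                  modules: List[str]) -> str:
    """Run the three task-level meta-steps (SELECT, ADAPT, IMPLEMENT)
    once per task; the resulting structure is reused for every instance."""
    examples = "\n".join(task_examples)
    # 1. SELECT: pick the atomic reasoning modules relevant to this task.
    selected = llm(
        f"Select the reasoning modules useful for these tasks:\n{examples}\n"
        f"Modules: {', '.join(modules)}")
    # 2. ADAPT: rephrase the chosen modules to be task-specific.
    adapted = llm(f"Adapt these modules to the tasks:\n{selected}\n{examples}")
    # 3. IMPLEMENT: operationalize them into an explicit, fillable structure.
    return llm("Operationalize the adapted modules into a step-by-step "
               f"JSON reasoning structure:\n{adapted}")

def solve(instance: str, structure: str, llm: Callable[[str], str]) -> str:
    # Per-instance decoding: the model fills in the discovered structure.
    return llm("Follow this reasoning structure to solve the task.\n"
               f"Structure:\n{structure}\nTask: {instance}")

# Toy stand-in for a real model call, just to show the control flow.
def toy_llm(prompt: str) -> str:
    if prompt.startswith("Select"):
        return "critical thinking; step-by-step thinking"
    if prompt.startswith("Adapt"):
        return "verify each step; break the problem into sub-steps"
    if prompt.startswith("Operationalize"):
        return '{"step_1": "...", "step_2": "...", "final_answer": "..."}'
    return "final answer"

structure = self_discover(["2 + 2 = ?", "3 * 5 = ?"], toy_llm,
                          ["critical thinking", "step-by-step thinking"])
answer = solve("7 - 4 = ?", structure, toy_llm)
```

Note that the three `llm` calls in `self_discover` happen once per task, not once per instance, which is where the 10-40x compute saving over self-consistency-style sampling comes from.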

Experimental Setup and Findings

The Self-Discover framework was evaluated on 25 reasoning tasks drawn from the BBH, T4D, and MATH benchmarks. With state-of-the-art models such as GPT-4 and PaLM 2-L, the framework significantly outperformed existing prompting methods across these tasks. Notably, on the T4D task, Self-Discover achieved over 85% accuracy with GPT-4.

Implications and Future Directions

The research introduces an innovative approach to reasoning in LLMs, moving away from reliance on generic prompting methods toward task-specific, self-composed reasoning structures. This not only enhances the performance and efficiency of LLMs but also provides a more interpretable window into model reasoning. Looking forward, the potential of Self-Discover to adapt and improve across various LLM architectures opens new avenues for research, especially in domains where reasoning and complex problem-solving are crucial. The framework's ability to mirror human reasoning patterns also presents opportunities for human-AI collaboration.

Conclusion

"Self-Discover" marks a significant step forward in LLM reasoning, offering a scalable and efficient methodology for self-composing reasoning structures. Its success across challenging benchmarks, combined with computational efficiency and the universality of its application, underscores the potential of LLMs to tackle complex reasoning tasks. As AI continues to evolve, frameworks like Self-Discover are pivotal in harnessing the true reasoning capabilities of LLMs, offering insights and directions for future research in the field.

Authors (10)
  1. Pei Zhou
  2. Jay Pujara
  3. Xiang Ren
  4. Xinyun Chen
  5. Heng-Tze Cheng
  6. Quoc V. Le
  7. Ed H. Chi
  8. Denny Zhou
  9. Swaroop Mishra
  10. Huaixiu Steven Zheng