Diversity of Thought Improves Reasoning Abilities of LLMs (2310.07088v2)

Published 11 Oct 2023 in cs.CL and cs.AI

Abstract: LLMs are documented to struggle in settings that require complex reasoning. Nevertheless, instructing the model to break the problem down into smaller reasoning steps, or ensembling various generations by modifying the decoding steps, boosts performance. However, these methods assume that the input prompt is fixed and expect the decoding strategies to introduce the diversity needed for ensembling. In this work, we discuss how one can create and leverage variations of the input prompt as a means of diversity of thought. We propose a method that automatically improves prompt diversity by soliciting feedback from the LLM to ideate approaches that are apt for the problem. We then ensemble the diverse prompts in our method DIV-SE (DIVerse reasoning path Self-Ensemble) across multiple inference calls, or use diverse approaches within a single inference call; we call the latter IDIV-SE (In-call DIVerse reasoning path Self-Ensemble). Apart from our approaches outperforming prior work, DIV-SE in particular advances state-of-the-art performance on the challenging planning and graph-coloring benchmarks. Our results improve the Pareto frontier of the accuracy-cost trade-off.
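To make the two ensembling modes concrete, below is a minimal sketch of how DIV-SE and IDIV-SE could be wired up. It assumes a hypothetical `call_llm(prompt)` helper (not part of the paper) and a hand-written `APPROACHES` list standing in for the approaches the paper elicits from the LLM itself; the prompts and answer parsing are illustrative, not the authors' exact templates.

```python
from collections import Counter

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to an LLM and return its text reply.
    Swap in your provider's chat-completion API; this stub is a placeholder."""
    raise NotImplementedError

# Illustrative reasoning approaches; the paper solicits these from the LLM
# itself rather than hard-coding them.
APPROACHES = [
    "Think step by step.",
    "Work backwards from the answer.",
    "Translate the problem into equations and solve them.",
]

def extract_answer(reply: str) -> str:
    """Pull the text after the last 'ANSWER:' tag (simplistic parsing)."""
    return reply.rsplit("ANSWER:", 1)[-1].strip()

def div_se(question: str) -> str:
    """DIV-SE sketch: one inference call per diverse prompt, then a majority vote."""
    answers = []
    for approach in APPROACHES:
        prompt = (
            f"{approach}\n"
            f"Question: {question}\n"
            "End with a line 'ANSWER: <value>'."
        )
        reply = call_llm(prompt)              # separate call for each diverse prompt
        answers.append(extract_answer(reply))
    return Counter(answers).most_common(1)[0][0]   # self-consistency-style vote

def idiv_se(question: str) -> str:
    """IDIV-SE sketch: all approaches packed into a single inference call."""
    bullets = "\n".join(f"- {a}" for a in APPROACHES)
    prompt = (
        f"Solve the question separately with each of these approaches:\n{bullets}\n"
        f"Question: {question}\n"
        "For each approach, end with a line 'ANSWER: <value>'."
    )
    reply = call_llm(prompt)                  # one call covers all approaches
    answers = [extract_answer(line) for line in reply.splitlines() if "ANSWER:" in line]
    return Counter(answers).most_common(1)[0][0]
```

The cost trade-off the abstract mentions follows directly from this structure: DIV-SE spends one inference call per approach, while IDIV-SE amortizes all approaches into a single call at some loss of per-approach context.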
