Just Ask One More Time! Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios (2311.08154v3)

Published 14 Nov 2023 in cs.CL and cs.AI

Abstract: Although chain-of-thought (CoT) prompting combined with LLMs has achieved encouraging results on complex reasoning tasks, the naive greedy decoding used in CoT prompting tends to produce repetitive and locally optimal reasoning paths. To address this shortcoming, ensemble-optimization methods obtain multiple reasoning paths and aggregate them into a final answer. However, current ensemble-optimization methods either rely on rule-based post-processing, such as self-consistency, or train an additional model on task-related human annotations to select the best of the sampled reasoning paths; both fail to generalize to realistic settings where the type of input questions or the answer format of the reasoning paths is unknown. To avoid these limitations, we propose Self-Agreement, a generalizable ensemble-optimization method that applies in almost all scenarios, whether the type of input questions and the answer format of the reasoning paths are known or unknown. Self-Agreement first samples from the LLM's decoder to generate a diverse set of reasoning paths, and then prompts the LLM one more time to determine the optimal answer by selecting the most agreed-upon answer among the sampled reasoning paths. Self-Agreement achieves remarkable performance on six public reasoning benchmarks while showing superior generalization capabilities.
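
The abstract describes a two-step procedure: sample several reasoning paths with temperature-based decoding, then ask the model once more to pick the answer most of those paths agree on. Below is a minimal Python sketch of that flow, assuming a generic `llm_complete(prompt, temperature)` callable standing in for any LLM API; the function name and prompt wording are illustrative assumptions, not the authors' exact template.

```python
from typing import Callable

def self_agreement(question: str,
                   llm_complete: Callable[[str, float], str],
                   num_paths: int = 5,
                   temperature: float = 0.7) -> str:
    """Sketch of Self-Agreement: sample diverse CoT paths, then ask one more time."""
    # Step 1: sample a diverse set of reasoning paths via temperature sampling
    # rather than greedy decoding.
    cot_prompt = f"Q: {question}\nA: Let's think step by step."
    paths = [llm_complete(cot_prompt, temperature) for _ in range(num_paths)]

    # Step 2: "ask one more time" -- show all sampled paths to the model and let
    # it select the most agreed-upon answer, with no rule-based answer parsing,
    # so free-form answer formats are handled the same way.
    listed = "\n\n".join(f"Response {i + 1}:\n{p}" for i, p in enumerate(paths))
    selection_prompt = (
        f"Question: {question}\n\n"
        f"Here are several candidate reasoning paths:\n\n{listed}\n\n"
        "Which final answer do most of these responses agree on? "
        "Reply with that answer only."
    )
    return llm_complete(selection_prompt, 0.0)  # greedy decode the final selection
```

Because the aggregation step is itself a prompt rather than a task-specific answer extractor, the same loop can be applied whether the questions are arithmetic, multiple-choice, or open-ended.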

Authors (10)
  1. Lei Lin (42 papers)
  2. Jiayi Fu (10 papers)
  3. Pengli Liu (2 papers)
  4. Qingyang Li (46 papers)
  5. Yan Gong (118 papers)
  6. Junchen Wan (7 papers)
  7. Fuzheng Zhang (60 papers)
  8. Zhongyuan Wang (105 papers)
  9. Di Zhang (230 papers)
  10. Kun Gai (125 papers)
Citations (3)