AutoPSV: Automated Process-Supervised Verifier (2405.16802v4)
Abstract: In this work, we propose a novel method named \textbf{Auto}mated \textbf{P}rocess-\textbf{S}upervised \textbf{V}erifier (\textbf{\textsc{AutoPSV}}) to enhance the reasoning capabilities of LLMs by automatically annotating reasoning steps. \textsc{AutoPSV} begins by training a verification model on the correctness of final answers, enabling it to generate automatic process annotations. This verification model assigns a confidence score to each reasoning step, indicating the probability of arriving at the correct final answer from that point onward. We detect relative changes in the verification model's confidence scores across reasoning steps to automatically annotate the reasoning process, enabling error detection even in scenarios where ground truth answers are unavailable. This alleviates the need for costly manual annotation or the high computational costs associated with model-induced annotation approaches. We experimentally validate that the step-level confidence changes learned by a verification model trained only on final-answer correctness can effectively identify errors in intermediate reasoning steps. We further demonstrate that a verification model trained on process annotations generated by \textsc{AutoPSV} exhibits improved performance in selecting correct answers from multiple LLM-generated outputs. Notably, we achieve substantial improvements across five datasets in mathematics and commonsense reasoning. The source code of \textsc{AutoPSV} is available at \url{https://github.com/rookie-joe/AutoPSV}.
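The core annotation idea in the abstract — flag a step as erroneous when the verifier's confidence drops sharply relative to the preceding step — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the `drop_threshold` value, and the string labels are hypothetical choices for exposition.

```python
def annotate_steps(confidences, drop_threshold=0.2):
    """Label reasoning steps from a verifier's per-step confidence scores.

    `confidences[i]` is the verifier's estimated probability of reaching the
    correct final answer from step i onward. A step is marked "error" when
    the relative drop in confidence versus the previous step exceeds
    `drop_threshold` (a hypothetical value; in practice it would be tuned).
    """
    labels = []
    for i, conf in enumerate(confidences):
        if i == 0:
            labels.append("ok")  # no previous step to compare against
            continue
        prev = confidences[i - 1]
        rel_change = (conf - prev) / max(prev, 1e-8)
        labels.append("error" if rel_change < -drop_threshold else "ok")
    return labels

# Confidence collapses at the third step, suggesting the error occurs there.
print(annotate_steps([0.91, 0.88, 0.35, 0.30]))
```

Because only relative changes between consecutive steps are used, this scheme needs no ground-truth answer at annotation time, which is what lets the method scale without manual step labels.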