
AutoPSV: Automated Process-Supervised Verifier (2405.16802v4)

Published 27 May 2024 in cs.CL and cs.LG

Abstract: In this work, we propose a novel method named Automated Process-Supervised Verifier (AutoPSV) to enhance the reasoning capabilities of LLMs by automatically annotating the reasoning steps. AutoPSV begins by training a verification model on the correctness of final answers, enabling it to generate automatic process annotations. This verification model assigns a confidence score to each reasoning step, indicating the probability of arriving at the correct final answer from that point onward. We detect relative changes in the verification's confidence scores across reasoning steps to automatically annotate the reasoning process, enabling error detection even in scenarios where ground truth answers are unavailable. This alleviates the need for numerous manual annotations or the high computational costs associated with model-induced annotation approaches. We experimentally validate that the step-level confidence changes learned by the verification model trained on the final answer correctness can effectively identify errors in the reasoning steps. We demonstrate that the verification model, when trained on process annotations generated by AutoPSV, exhibits improved performance in selecting correct answers from multiple LLM-generated outputs. Notably, we achieve substantial improvements across five datasets in mathematics and commonsense reasoning. The source code of AutoPSV is available at https://github.com/rookie-joe/AutoPSV.


Summary

  • The paper presents a novel methodology where a verification model employs confidence scores to automatically annotate reasoning steps.
  • It leverages relative confidence variations between consecutive steps to identify errors and optimize both annotation accuracy and computational efficiency.
  • Experimental results across GSM8K, HellaSwag, and Winogrande benchmarks demonstrate significant improvements in LLM performance.

Enhancing Reasoning in LLMs with AutoPSV

The paper "AutoCV: Empowering Reasoning with Automated Process Labeling via Confidence Variation" presents a novel methodology to enhance the reasoning capabilities of LLMs by utilizing automated process labeling. In contrast to traditional approaches such as model-induced annotation methods or manual annotation, AutoCV employs a verification model trained on final answers' correctness to provide automated annotations in the reasoning process, thus optimizing both the accuracy and computational efficiency of these models.

Methodological Framework

The AutoPSV approach leverages a verification model to infer a confidence score for each reasoning step. This score represents the likelihood of arriving at a correct final answer from that step, so errors can be identified by examining relative changes in confidence across consecutive steps. Automating annotation in this way reduces the dependence on extensive manual annotation and on the computationally intensive sampling that model-induced annotation strategies require.
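As a rough formalization (the notation and threshold below are illustrative assumptions, not quoted from the paper), the per-step confidence and a relative-change labeling rule can be written as:

```latex
% c_t : verifier confidence after step t of a solution to question q
% \delta : labeling threshold (an assumed hyperparameter)
c_t \approx P(\text{correct final answer} \mid q, s_1, \ldots, s_t), \qquad
\Delta_t = \frac{c_t - c_{t-1}}{c_{t-1}}, \qquad
\hat{y}_t =
\begin{cases}
  \text{incorrect}, & \text{if } \Delta_t < -\delta,\\
  \text{correct},   & \text{otherwise.}
\end{cases}
```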

AutoPSV consists of three main components:

  1. Outcome-Supervised Verification: A verification model is first trained with outcome supervision, using annotations based solely on the correctness of final answers. The verifier assigns each reasoning step a confidence score estimating the probability of ultimately reaching a correct answer from that step.
  2. Confidence Variation Detection: AutoPSV computes the relative variation in confidence scores between consecutive reasoning steps and uses it to label intermediate steps as correct or incorrect, yielding an automated labeling process (a minimal sketch follows this list).
  3. Training Process-Supervised Verifiers: Using the process annotations generated from confidence variations, AutoPSV trains process-supervised verification models that improve the LLM's ability to select correct answers from among multiple candidates generated during inference.
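The sketch below illustrates the confidence-variation labeling step under a simple relative-drop rule. The function names, the prefix format, and the threshold value are assumptions for illustration and are not taken from the released implementation.

```python
from typing import Callable, List

def label_steps(step_confidences: List[float], delta: float = 0.2) -> List[int]:
    """Label each step 1 (treated as correct) or 0 (flagged as erroneous) based on
    the relative drop in the verifier's confidence from the previous step."""
    labels, prev = [], None
    for conf in step_confidences:
        if prev is None or prev <= 0:
            labels.append(1)  # first step (or degenerate prefix): no drop to measure
        else:
            rel_change = (conf - prev) / prev
            labels.append(0 if rel_change < -delta else 1)
        prev = conf
    return labels

def annotate_solution(
    question: str,
    steps: List[str],
    verifier_score: Callable[[str], float],  # estimate of P(correct final answer | prefix)
    delta: float = 0.2,
) -> List[int]:
    """Score every prefix of a solution with the outcome-supervised verifier,
    then label steps by the relative confidence change between prefixes."""
    confidences, prefix = [], question
    for step in steps:
        prefix = prefix + "\n" + step
        confidences.append(verifier_score(prefix))
    return label_steps(confidences, delta)
```

These labels can then serve as the process supervision signal for training the step-level verifier described in the third component.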

Experimental Validation and Results

AutoPSV was validated across multiple datasets covering both mathematical and commonsense reasoning tasks to assess its effectiveness and scalability. The results show significant accuracy improvements over baselines such as self-consistency and purely outcome-supervised verifiers. Notably, AutoPSV improved performance on the GSM8K math benchmark as well as on commonsense reasoning benchmarks including HellaSwag and Winogrande, indicating that the generated process annotations are robust across different reasoning domains.
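For context, verifier-guided best-of-N answer selection, the setting in which the trained verifier is compared against self-consistency, can be sketched as follows; the candidate format and scoring interface are illustrative assumptions rather than the paper's exact setup.

```python
from typing import Callable, List, Tuple

def select_best_answer(
    question: str,
    candidates: List[Tuple[str, str]],       # (reasoning_trace, final_answer) pairs sampled from the LLM
    verifier_score: Callable[[str], float],  # higher score = more likely to be correct
) -> str:
    """Rerank N sampled solutions with the verifier and return the final answer
    attached to the highest-scoring reasoning trace."""
    best_answer, best_score = None, float("-inf")
    for trace, answer in candidates:
        score = verifier_score(question + "\n" + trace)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```

Self-consistency, by contrast, simply majority-votes over the final answers without scoring the reasoning traces.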

Implications and Future Directions

By employing AutoPSV's labeling technique, the paper combines the advantages of process supervision with the benefits of outcome supervision. The methodology is particularly significant in settings that demand extensive reasoning, where manual supervision is laborious and computational resources are at a premium. Potential application domains include genomic sequence analysis, drug discovery, and decision-making frameworks in autonomous systems.

AutoPSV's impact goes beyond improving current models: it demonstrates a method that balances computational efficiency against annotation accuracy, paving the way for future developments in LLMs. Future research could refine process annotations at finer granularity, apply the method in real-time settings, or combine multiple verification models in an ensemble to improve the resilience and precision of LLM outputs.

In conclusion, AutoPSV represents a significant advance in the reasoning capabilities of LLMs, offering both theoretical insight and practical efficiency in automated process labeling. As the field continues to mature, methodologies like AutoPSV will be instrumental in building more reliable, interpretable, and capable LLMs across diverse applications.
