Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision (2402.02658v2)

Published 5 Feb 2024 in cs.AI, cs.CL, and cs.LG

Abstract: Process supervision, which uses a trained verifier to evaluate the intermediate steps generated by a reasoner, has demonstrated significant improvements in multi-step problem solving. In this paper, to avoid the expensive effort of human annotation of the verifier's training data, we introduce Model-induced Process Supervision (MiPS), a novel method for automating data curation. MiPS annotates an intermediate step by sampling completions of the partial solution from the reasoning model and taking the proportion of completions that reach a correct answer as the step's accuracy. Because inaccuracies of the reasoner cause MiPS to underestimate the accuracy of intermediate steps, we suggest, and empirically show, that verification should focus on the verifier's high predicted scores rather than its low predicted scores, contrary to prior observations on human-curated data. Our approach significantly improves the performance of PaLM 2 on math and coding tasks (accuracy +0.67% on GSM8K, +4.16% on MATH, +0.92% on MBPP compared with a verifier trained with output supervision). Additionally, our study demonstrates that the verifier exhibits strong generalization ability across different reasoning models.
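
The abstract describes MiPS's data-curation procedure only at a high level. As a rough illustration, the sketch below estimates a step-level label as the fraction of sampled completions that reach a correct final answer. The helpers `sample_completions` and `is_correct`, the sample count `k`, and the prompt formatting are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch of MiPS-style step annotation (assumption: `sample_completions`
# and `is_correct` stand in for the reasoning model's sampler and the task's
# answer checker; the paper's exact prompts and decoding settings are not shown).
from typing import Callable, List


def mips_step_labels(
    problem: str,
    solution_steps: List[str],
    sample_completions: Callable[[str, int], List[str]],  # (prefix, k) -> k sampled completions
    is_correct: Callable[[str], bool],                     # does a completion reach the reference answer?
    k: int = 8,
) -> List[float]:
    """Label each solution prefix with the proportion of sampled completions
    that end in a correct answer (the MiPS 'accuracy' of that step)."""
    labels = []
    prefix = problem
    for step in solution_steps:
        # Extend the prefix with the next intermediate step, then let the
        # reasoning model complete the rest of the solution k times.
        prefix = prefix + "\n" + step
        completions = sample_completions(prefix, k)
        correct = sum(is_correct(c) for c in completions)
        labels.append(correct / k)
    return labels
```

These per-step labels would then serve as training targets for a process-supervised verifier; because a weak reasoner can fail even from a correct prefix, the labels are biased downward, which is the paper's stated motivation for trusting the verifier's high predicted scores more than its low ones.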

