Can We Verify Step by Step for Incorrect Answer Detection? (2402.10528v4)

Published 16 Feb 2024 in cs.CL and cs.AI

Abstract: Chain-of-Thought (CoT) prompting has marked a significant advance in enhancing the reasoning capabilities of LLMs. Previous studies have developed various extensions of CoT that focus primarily on improving end-task performance, and there has also been research on assessing the quality of the reasoning chains themselves. This raises an intriguing question: can the correctness of an LLM's output be predicted by scrutinizing the reasoning chains it generates? To answer this question, we introduce R2PE, a benchmark designed specifically to explore the relationship between reasoning chains and final-answer performance on reasoning tasks spanning five domains. The benchmark measures whether the final output of an LLM is incorrect based on its reasoning steps. To make full use of the information in multiple reasoning chains, we propose the process discernibility score (PDS) framework, which beats the answer-checking baseline by a large margin: on average, a $5.1\%$ increase in F1 score and a $2.97\%$ improvement in AUC-PR across all 45 subsets within R2PE. We further demonstrate PDS's efficacy in improving open-domain QA accuracy.
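
The abstract does not spell out how PDS is computed, so the sketch below is only an illustration of the general setup it describes: predicting whether a final answer is wrong from multiple sampled reasoning chains. The answer-checking baseline here scores agreement among final answers, while the process-level score additionally looks at the reasoning text. The `ChainSample` structure, function names, agreement heuristic, and step-overlap signal are all assumptions for illustration, not the paper's actual formulation of PDS.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ChainSample:
    steps: list[str]  # individual reasoning steps of one sampled chain
    answer: str       # final answer extracted from that chain


def answer_checking_score(samples: list[ChainSample]) -> float:
    """Answer-checking baseline: agreement with the majority answer.

    A low agreement rate is treated as a signal that the final answer
    is more likely to be incorrect.
    """
    answers = [s.answer for s in samples]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


def process_level_score(samples: list[ChainSample]) -> float:
    """Hypothetical process-level signal (NOT the paper's PDS formula).

    Uses average pairwise overlap between reasoning steps as a crude
    proxy for how mutually consistent the reasoning processes are.
    """
    def overlap(a: ChainSample, b: ChainSample) -> float:
        sa, sb = set(a.steps), set(b.steps)
        return len(sa & sb) / max(len(sa | sb), 1)

    pairs = [(a, b) for i, a in enumerate(samples) for b in samples[i + 1:]]
    return sum(overlap(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0


if __name__ == "__main__":
    # Toy example: three sampled chains answering the same question.
    samples = [
        ChainSample(steps=["2 apples + 3 apples = 5 apples"], answer="5"),
        ChainSample(steps=["2 apples + 3 apples = 5 apples"], answer="5"),
        ChainSample(steps=["misread: 2 * 3 = 6 apples"], answer="6"),
    ]
    print("answer agreement :", round(answer_checking_score(samples), 2))  # 0.67
    print("process agreement:", round(process_level_score(samples), 2))    # 0.33
```

In practice, a process-level detector would likely rely on a learned verifier or LLM-based scoring of individual steps rather than the lexical overlap used above; the exact definition of PDS is given in the paper itself.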
