
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains (2402.00559v4)

Published 1 Feb 2024 in cs.CL

Abstract: Prompting LLMs to provide step-by-step answers (e.g., "Chain-of-Thought") is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literature discusses automatic methods to verify reasoning to evaluate and improve their correctness. However, no fine-grained step-level datasets are available to enable thorough evaluation of such verification methods, hindering progress in this direction. We introduce REVEAL: Reasoning Verification Evaluation, a dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning in open-domain question-answering settings. REVEAL includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in an LLM's answer, across a variety of datasets and state-of-the-art LLMs. Evaluation on REVEAL shows that verifiers struggle at verifying reasoning chains - in particular, verifying logical correctness and detecting contradictions. Available at https://reveal-dataset.github.io/.


Summary

  • The paper introduces REVEAL, a dataset for benchmarking reasoning-chain verification, comprising 1,226 chain-of-thought answers to 817 questions.
  • It presents a novel formalism separating step relevance, attribution, and logical correctness, allowing fine-grained error analysis.
  • Baseline evaluations with verifiers such as Flan-UL2, PaLM-2-L, and FacTool reveal significant challenges in verifying the logical correctness of reasoning steps.

This paper presents REVEAL (Reasoning Verification Evaluation), a dataset designed to benchmark the verification of reasoning chains in open-domain question-answering tasks. The authors emphasize step-by-step reasoning, commonly referred to as "Chain-of-Thought" (CoT) prompting, as the prevailing approach to complex reasoning tasks. While prior work has proposed automatic methods for verifying reasoning, the field has lacked the fine-grained, step-level datasets needed to evaluate those verification methods thoroughly.

Dataset Overview

REVEAL addresses this gap by annotating reasoning chains generated by leading LLMs across a variety of source datasets, labeling each reasoning step for relevance, attribution to evidence, and logical correctness. It comprises 817 unique questions and 1,226 CoT answers generated by three prominent LLMs, including Flan-PaLM-540B and GPT-3.

The dataset is split into two parts: REVEAL-Eval, which contains labels with high inter-annotator agreement, and REVEAL-Open, a smaller subset of ambiguous cases with low agreement. This split makes it possible to report verifier performance separately on straightforward and genuinely contentious cases.
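
As a rough illustration of how such step-annotated records and the Eval/Open split could be represented, consider the sketch below. The field names, label values, and agreement threshold are illustrative assumptions and do not reflect the released dataset's actual schema.

```python
# Hypothetical sketch of a REVEAL-style record; field names and label
# vocabularies are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StepAnnotation:
    text: str                                # the reasoning step as written by the LLM
    relevant: bool                           # is the step relevant to the question?
    step_type: str                           # "attribution", "logical", or "both"
    attribution_label: Optional[str] = None  # e.g. "fully_supported" / "not_supported"
    logic_label: Optional[str] = None        # e.g. "correct" / "incorrect"

@dataclass
class CoTAnswer:
    question: str
    answer: str
    source_model: str                        # e.g. "Flan-PaLM-540B" or "GPT-3"
    steps: List[StepAnnotation] = field(default_factory=list)
    annotator_agreement: float = 1.0

def split_eval_open(records, threshold=0.8):
    """Split records into a high-agreement evaluation set and an ambiguous
    'open' set, mirroring the REVEAL-Eval / REVEAL-Open distinction.
    The 0.8 threshold is an arbitrary placeholder."""
    eval_set = [r for r in records if r.annotator_agreement >= threshold]
    open_set = [r for r in records if r.annotator_agreement < threshold]
    return eval_set, open_set
```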

Methodological Contributions

The paper introduces a formalism for reasoning-chain verification that decomposes the task into separate questions for each step: whether the step is relevant, what type of step it is (attribution, logical, or both), and whether the step is correct along each applicable dimension. This enables granular analysis of reasoning chains, locating specific points of failure and distinguishing between error types.
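
Continuing the illustrative schema sketched above, one plausible reading of the paper's "weakest link" framing is that a chain is fully correct only if every typed check on every step passes. The aggregation rule below is an assumption made for illustration, not the authors' exact scoring procedure.

```python
def step_is_correct(step: StepAnnotation) -> bool:
    """Check only the labels that apply to this step's type.
    Skipping irrelevant steps (rather than counting them as errors) is one
    plausible convention, assumed here for illustration."""
    if not step.relevant:
        return True  # relevance is tracked separately in this sketch
    ok = True
    if step.step_type in ("attribution", "both"):
        ok = ok and step.attribution_label == "fully_supported"
    if step.step_type in ("logical", "both"):
        ok = ok and step.logic_label == "correct"
    return ok

def chain_is_correct(answer: CoTAnswer) -> bool:
    # A chain is as strong as its weakest link: every step must pass.
    return all(step_is_correct(s) for s in answer.steps)
```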

The annotation process is divided into two tasks: one verifies the logical correctness of each step given the preceding reasoning, and the other assesses whether factual claims can be attributed to retrieved evidence passages. This separation reduces the cognitive load on annotators and yields richer data for evaluating reasoning verifiers.

Baseline Evaluations

The authors conducted extensive evaluations of state-of-the-art verifiers, including few-shot prompted LMs such as Flan-UL2 and PaLM-2-L as well as tool-based verifiers like FacTool for attribution-focused checking. Despite the scale of these models and the use of specialized classifiers, the results show that significant challenges remain, particularly in verifying logical correctness.
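
A minimal sketch of how step-level verifier predictions could be scored against benchmark labels is shown below; the label set and metrics (accuracy and per-class recall) are illustrative choices and may differ from what the paper actually reports.

```python
def score_verifier(gold_labels, predicted_labels):
    """Score a verifier's per-step predictions against gold labels,
    e.g. 'correct' / 'incorrect' for logical correctness."""
    assert len(gold_labels) == len(predicted_labels)
    n = len(gold_labels)
    accuracy = sum(g == p for g, p in zip(gold_labels, predicted_labels)) / n

    # Per-class recall highlights whether the verifier catches incorrect steps.
    per_class_recall = {}
    for cls in sorted(set(gold_labels)):
        idx = [i for i, g in enumerate(gold_labels) if g == cls]
        per_class_recall[cls] = sum(predicted_labels[i] == cls for i in idx) / len(idx)
    return {"accuracy": accuracy, "per_class_recall": per_class_recall}

# Example: a verifier that misses the one incorrect step.
gold = ["correct", "correct", "incorrect", "correct"]
pred = ["correct", "correct", "correct", "correct"]
print(score_verifier(gold, pred))
# {'accuracy': 0.75, 'per_class_recall': {'correct': 1.0, 'incorrect': 0.0}}
```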

Implications and Future Directions

REVEAL provides a fine-grained resource for developing and evaluating reasoning verifiers. The authors demonstrate that current models and techniques struggle with logical verification, drawing attention to the need for better verification methods to support more robust CoT reasoning.

Future work may focus on stronger retrieval for evidence-supported claim verification, fine-tuning models specifically for logical verification of reasoning chains, and using REVEAL as a benchmark within broader system evaluations. Incorporating step-level verification signals into training could also improve model performance in practical applications.

Overall, this work contributes substantially to the study of LLM reasoning and verification, laying the groundwork for future research on ensuring the correctness and reliability of AI-generated reasoning.