A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains (2402.00559v4)

Published 1 Feb 2024 in cs.CL

Abstract: Prompting LLMs to provide step-by-step answers (e.g., "Chain-of-Thought") is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literature discusses automatic methods to verify reasoning to evaluate and improve their correctness. However, no fine-grained step-level datasets are available to enable thorough evaluation of such verification methods, hindering progress in this direction. We introduce REVEAL: Reasoning Verification Evaluation, a dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning in open-domain question-answering settings. REVEAL includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in an LLM's answer, across a variety of datasets and state-of-the-art LLMs. Evaluation on REVEAL shows that verifiers struggle at verifying reasoning chains - in particular, verifying logical correctness and detecting contradictions. Available at https://reveal-dataset.github.io/ .

Summary

The paper introduces the Reveal dataset that benchmarks reasoning chain verification using 1,226 chain-of-thought answers across 817 questions.
It presents a novel formalism separating step relevance, attribution, and logical correctness, allowing fine-grained error analysis.
Baseline evaluations with verifiers such as few-shot prompted Flan-UL2 and PaLM-2-L, plus attribution-focused tools like FacTool, show that verifying the logical correctness of reasoning steps remains difficult.

This paper presents Reveal (Reasoning Verification Evaluation), a dataset for benchmarking verifiers of reasoning chains in open-domain question answering. The authors focus on step-by-step reasoning, commonly elicited through "Chain-of-Thought" (CoT) prompting, which is the prominent approach for complex reasoning tasks. Prior work proposes automatic methods for verifying such reasoning, but no fine-grained, step-level dataset was available to evaluate those methods thoroughly.

Dataset Overview

Reveal addresses this gap by annotating reasoning chains that LLMs generated for questions drawn from a variety of existing datasets. Each reasoning step is labeled for relevance, attribution to evidence passages, and logical correctness. The dataset comprises 817 unique questions and 1,226 CoT answers generated by three LLMs, including Flan-PaLM-540B and GPT-3.

The dataset is split into two parts: Reveal-Eval, which contains the labels with high inter-annotator agreement, and Reveal-Open, a smaller subset of ambiguous cases with low agreement. The split makes it possible to examine verifier performance separately on clear-cut and on borderline cases.

Methodological Contributions

The paper introduces a formalism for reasoning chain verification that breaks the task into components: whether a step is relevant, what type of step it is (attribution, logical, or both), and whether the step is correct with respect to each applicable type. This supports step-level analysis: it locates the specific step where a chain fails and distinguishes between error types.
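The formalism can be pictured as a small per-step record plus a "weakest link" rule over the chain. The following is a minimal sketch under stated assumptions: the names `StepType`, `StepLabel`, `step_is_correct`, `first_failing_step`, and `chain_is_correct` are hypothetical and are not the dataset's actual schema, and counting an irrelevant step as a failure is a choice of this sketch rather than something the summary specifies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class StepType(Enum):
    """Hypothetical step types mirroring the summary's description."""
    ATTRIBUTION = "attribution"  # step makes a factual claim needing evidence
    LOGICAL = "logical"          # step infers something from earlier steps
    BOTH = "both"                # step does both


@dataclass
class StepLabel:
    """Hypothetical per-step annotation record (not the real Reveal schema)."""
    relevant: bool
    step_type: StepType
    attributed: Optional[bool] = None         # None when attribution does not apply
    logically_correct: Optional[bool] = None  # None when logic does not apply


def step_is_correct(step: StepLabel) -> bool:
    """A step passes only if every check that applies to its type passes."""
    if not step.relevant:
        return False  # assumption of this sketch: irrelevant steps count as failures
    needs_attribution = step.step_type in (StepType.ATTRIBUTION, StepType.BOTH)
    needs_logic = step.step_type in (StepType.LOGICAL, StepType.BOTH)
    if needs_attribution and step.attributed is not True:
        return False
    if needs_logic and step.logically_correct is not True:
        return False
    return True


def first_failing_step(chain: List[StepLabel]) -> Optional[int]:
    """Return the index of the first failing step, or None if all steps pass."""
    for index, step in enumerate(chain):
        if not step_is_correct(step):
            return index
    return None


def chain_is_correct(chain: List[StepLabel]) -> bool:
    """Weakest-link rule: the chain is correct only if no step fails."""
    return first_failing_step(chain) is None
```

Keeping attribution and logic as separate optional fields means a failure can be traced to a specific step and a specific error type instead of a single pass/fail flag for the whole answer.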

Annotation is split into two separate tasks: one checks whether each step follows logically from the preceding reasoning, and the other checks whether factual claims are attributable to evidence. Separating the tasks reduces the cognitive load on annotators and yields richer labels for evaluating reasoning verifiers.

Baseline Evaluations

The authors evaluate a range of existing verifiers as baselines. These include few-shot prompted LMs such as Flan-UL2 and PaLM-2-L, as well as other verifiers such as FacTool for attribution-focused verification. Even with large models and specialized classifiers, the results show that substantial difficulties remain, particularly in verifying logical correctness.
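To make the idea of scoring a verifier against step-level labels concrete, here is a minimal sketch that compares predicted labels with gold labels using macro-averaged F1. Macro F1 is a common choice for imbalanced label sets; this summary does not state which metric the authors report, so treat the metric and the helper names `f1_for_class` and `macro_f1` as assumptions made for illustration.

```python
from typing import Hashable, Sequence


def f1_for_class(gold: Sequence[Hashable], pred: Sequence[Hashable], cls: Hashable) -> float:
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over the classes present in `gold`."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    classes = set(gold)
    if not classes:
        raise ValueError("gold must not be empty")
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


# Hypothetical step-level labels: True means "step judged logically correct".
gold_labels = [True, True, False, False]
verifier_output = [True, False, False, False]
score = macro_f1(gold_labels, verifier_output)
```

Averaging per-class scores without weighting keeps a verifier that always predicts the majority label from looking strong, which matters when incorrect steps are rare.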

Implications and Future Directions

Reveal gives researchers a step-level benchmark for developing and evaluating reasoning verifiers. The authors show that current models and techniques struggle with logical verification in particular, which points to the need for better verification methods if CoT reasoning is to be made more robust.

Future work may focus on improving retrieval so that claims can be checked against better evidence, on fine-tuning models specifically for logical verification of reasoning chains, and on using Reveal as a benchmark in broader evaluations of AI systems. Training strategies that account for step-level reasoning correctness could also improve model performance in practical applications.

Overall, the work contributes a dataset and a formalism to research on LLM reasoning and verification, and provides a foundation for future research on the correctness and reliability of AI-generated reasoning.
