
AutoPSV: Automated Process-Supervised Verifier (2405.16802v4)

Published 27 May 2024 in cs.CL and cs.LG

Abstract: In this work, we propose a novel method named Automated Process-Supervised Verifier (AutoPSV) to enhance the reasoning capabilities of LLMs by automatically annotating the reasoning steps. AutoPSV begins by training a verification model on the correctness of final answers, enabling it to generate automatic process annotations. This verification model assigns a confidence score to each reasoning step, indicating the probability of arriving at the correct final answer from that point onward. We detect relative changes in the verification's confidence scores across reasoning steps to automatically annotate the reasoning process, enabling error detection even in scenarios where ground truth answers are unavailable. This alleviates the need for numerous manual annotations or the high computational costs associated with model-induced annotation approaches. We experimentally validate that the step-level confidence changes learned by the verification model trained on the final answer correctness can effectively identify errors in the reasoning steps. We demonstrate that the verification model, when trained on process annotations generated by AutoPSV, exhibits improved performance in selecting correct answers from multiple LLM-generated outputs. Notably, we achieve substantial improvements across five datasets in mathematics and commonsense reasoning. The source code of AutoPSV is available at https://github.com/rookie-joe/AutoPSV.


Summary

  • The paper presents a novel methodology where a verification model employs confidence scores to automatically annotate reasoning steps.
  • It leverages relative confidence variations between consecutive steps to identify errors and optimize both annotation accuracy and computational efficiency.
  • Experimental results across GSM8K, HellaSwag, and Winogrande benchmarks demonstrate significant improvements in LLM performance.

Enhancing Reasoning in LLMs with AutoPSV

The paper "AutoCV: Empowering Reasoning with Automated Process Labeling via Confidence Variation" presents a novel methodology to enhance the reasoning capabilities of LLMs by utilizing automated process labeling. In contrast to traditional approaches such as model-induced annotation methods or manual annotation, AutoCV employs a verification model trained on final answers' correctness to provide automated annotations in the reasoning process, thus optimizing both the accuracy and computational efficiency of these models.

Methodological Framework

The AutoPSV approach leverages a verification model to infer a confidence score for each reasoning step. This score represents the likelihood of arriving at a correct final answer from that step, so errors can be identified by examining relative changes in confidence across consecutive steps. Automating annotation in this way reduces the dependence on extensive manual annotation and on the computationally intensive sampling that model-induced annotation strategies require.
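As a rough formalization (the notation and threshold below are illustrative assumptions, not quoted from the paper), the per-step confidence and a relative-change labeling rule can be written as:

```latex
% c_t : verifier confidence after step t of a solution to question q
% \delta : labeling threshold (an assumed hyperparameter)
c_t \approx P(\text{correct final answer} \mid q, s_1, \ldots, s_t), \qquad
\Delta_t = \frac{c_t - c_{t-1}}{c_{t-1}}, \qquad
\hat{y}_t =
\begin{cases}
  \text{incorrect}, & \text{if } \Delta_t < -\delta,\\
  \text{correct},   & \text{otherwise.}
\end{cases}
```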

AutoPSV consists of three main components:

  1. Outcome-Supervised Verification: A verification model is first trained with outcome supervision, using annotations based solely on the correctness of final answers. The verifier assigns each reasoning step a confidence score estimating the probability of ultimately reaching a correct answer from that step.
  2. Confidence Variation Detection: AutoPSV computes the relative variation in confidence scores between consecutive reasoning steps and uses it to label intermediate steps as correct or incorrect, yielding an automated labeling process (a minimal sketch follows this list).
  3. Training Process-Supervised Verifiers: Using the process annotations generated from confidence variations, AutoPSV trains process-supervised verification models that improve the LLM's ability to select correct answers from among multiple candidates generated during inference.
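The sketch below illustrates the confidence-variation labeling step under a simple relative-drop rule. The function names, the prefix format, and the threshold value are assumptions for illustration and are not taken from the released implementation.

```python
from typing import Callable, List

def label_steps(step_confidences: List[float], delta: float = 0.2) -> List[int]:
    """Label each step 1 (treated as correct) or 0 (flagged as erroneous) based on
    the relative drop in the verifier's confidence from the previous step."""
    labels, prev = [], None
    for conf in step_confidences:
        if prev is None or prev <= 0:
            labels.append(1)  # first step (or degenerate prefix): no drop to measure
        else:
            rel_change = (conf - prev) / prev
            labels.append(0 if rel_change < -delta else 1)
        prev = conf
    return labels

def annotate_solution(
    question: str,
    steps: List[str],
    verifier_score: Callable[[str], float],  # estimate of P(correct final answer | prefix)
    delta: float = 0.2,
) -> List[int]:
    """Score every prefix of a solution with the outcome-supervised verifier,
    then label steps by the relative confidence change between prefixes."""
    confidences, prefix = [], question
    for step in steps:
        prefix = prefix + "\n" + step
        confidences.append(verifier_score(prefix))
    return label_steps(confidences, delta)
```

These labels can then serve as the process supervision signal for training the step-level verifier described in the third component.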

Experimental Validation and Results

AutoPSV was validated across multiple datasets covering both mathematical and commonsense reasoning tasks to assess its effectiveness and scalability. The results show significant accuracy improvements over baselines such as self-consistency and purely outcome-supervised verifiers. Notably, AutoPSV improved performance on the GSM8K math benchmark as well as on commonsense reasoning benchmarks including HellaSwag and Winogrande, indicating that the generated process annotations are robust across different reasoning domains.
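For context, verifier-guided best-of-N answer selection, the setting in which the trained verifier is compared against self-consistency, can be sketched as follows; the candidate format and scoring interface are illustrative assumptions rather than the paper's exact setup.

```python
from typing import Callable, List, Tuple

def select_best_answer(
    question: str,
    candidates: List[Tuple[str, str]],       # (reasoning_trace, final_answer) pairs sampled from the LLM
    verifier_score: Callable[[str], float],  # higher score = more likely to be correct
) -> str:
    """Rerank N sampled solutions with the verifier and return the final answer
    attached to the highest-scoring reasoning trace."""
    best_answer, best_score = None, float("-inf")
    for trace, answer in candidates:
        score = verifier_score(question + "\n" + trace)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```

Self-consistency, by contrast, simply majority-votes over the final answers without scoring the reasoning traces.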

Implications and Future Directions

By employing AutoPSV's labeling technique, the paper combines the advantages of process supervision with the benefits of outcome supervision. The methodology is particularly significant in settings that demand extensive reasoning, where manual supervision is laborious and computational resources are at a premium. Potential application domains include genomic sequence analysis, drug discovery, and decision-making frameworks in autonomous systems.

AutoPSV's impact goes beyond improving current models: it demonstrates a method that balances computational efficiency against annotation accuracy, paving the way for future developments in LLMs. Future research could refine process annotations at finer granularity, apply the method in real-time settings, or combine multiple verification models in an ensemble to improve the resilience and precision of LLM outputs.

In conclusion, AutoPSV represents a significant advance in the reasoning capabilities of LLMs, offering both theoretical insight and practical efficiency in automated process labeling. As the field continues to mature, methodologies like AutoPSV will be instrumental in building more reliable, interpretable, and capable LLMs across diverse applications.
