
V-STaR: Training Verifiers for Self-Taught Reasoners (2402.06457v2)

Published 9 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Common self-improvement approaches for LLMs, such as STaR, iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large amounts of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR that utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.

V-STaR: Enhanced Self-Improvement for LLMs through Verification

Introduction

The quest to enhance the reasoning capabilities of LLMs has prompted a variety of approaches, centered primarily on self-improvement methods that iteratively fine-tune models on their self-generated solutions. Traditional techniques in this domain, however, share a notable limitation: they keep only correct model-generated solutions for further training, discarding a substantial body of potentially instructive incorrect solutions. Addressing this gap, the paper introduces V-STaR (Verifiers for Self-Taught Reasoners), a framework that harnesses both the correct and incorrect solutions generated by the model to train a verifier. This verifier is then employed at inference time to adjudicate among multiple candidate solutions, refining the model's problem-solving ability over successive iterations.

Methodology

V-STaR operates on the insight that incorrect solutions, much like correct ones, carry useful signal for model refinement. To capitalize on this, the framework fine-tunes the generator on the correct solutions it produces, while training a verifier with Direct Preference Optimization (DPO) on all generated solutions, labeled by their correctness, so that correct solutions are preferred over incorrect ones for the same problem.
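
To make the data flow concrete, here is a minimal Python sketch of how such training sets could be assembled from sampled solutions. The `Sample` container, the `build_training_sets` helper, and the exhaustive pairing of correct with incorrect solutions are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    problem: str      # the prompt or question
    solution: str     # a model-generated solution
    is_correct: bool  # outcome of checking against tests or the gold answer

def build_training_sets(samples):
    """Split sampled solutions into generator SFT data (correct only) and
    verifier DPO preference pairs (correct preferred over incorrect)."""
    by_problem = {}
    for s in samples:
        by_problem.setdefault(s.problem, []).append(s)

    sft_data, dpo_pairs = [], []
    for problem, group in by_problem.items():
        correct = [s.solution for s in group if s.is_correct]
        incorrect = [s.solution for s in group if not s.is_correct]
        # Generator data: keep only solutions verified to be correct.
        sft_data.extend((problem, sol) for sol in correct)
        # Verifier data: each (correct, incorrect) pairing becomes a DPO
        # example with the correct solution as "chosen".
        dpo_pairs.extend(
            {"prompt": problem, "chosen": c, "rejected": r}
            for c in correct for r in incorrect
        )
    return sft_data, dpo_pairs
```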

In each V-STaR iteration, the generator proposes solutions to the training problems, and the verifier learns from them by contrasting correct solutions with incorrect ones for the same problem. This process both enlarges the training data across iterations and sharpens the verifier's ability to judge solution correctness; the paper supports the choice of DPO over alternative verifier-training approaches by reporting stronger verifier performance. At inference time, the verifier scores a set of candidate solutions sampled from the generator, and the highest-scoring candidate is returned.
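
A minimal sketch of that selection step, assuming hypothetical `generate` and `verifier_score` callables that wrap the trained generator and verifier:

```python
def best_of_k(problem, generate, verifier_score, k=16):
    """Sample k candidate solutions and return the one the verifier
    scores highest (best-of-k selection)."""
    candidates = [generate(problem) for _ in range(k)]
    return max(candidates, key=lambda sol: verifier_score(problem, sol))
```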

Empirical Analysis

The paper presents empirical validation of V-STaR's efficacy across several benchmarks, reporting improvements ranging from 4% to 17% in test accuracy over existing self-improvement and verification methods on tasks related to code generation and math reasoning. This uplift is attributed to V-STaR's dual mechanism which not only enhances the generative model but also refines the verifier by incorporating insights drawn from incorrect solutions. Additionally, models fine-tuned with V-STaR are observed to rival or even surpass larger models in performance, indicating significant efficiency gains.

Implications

From a theoretical perspective, V-STaR's approach prompts a reevaluation of error handling in LLM training, suggesting that incorrect outputs, when used judiciously, can contribute substantially to model improvement. Practically, the framework offers a scalable and efficient way to enhance LLMs' reasoning abilities without extensive new datasets, instead extracting more value from existing data through verification.

Future Directions

The paper posits several avenues for future research stemming from V-STaR's findings. One such direction is the exploration of V-STaR's applicability across a wider spectrum of tasks, particularly those where correctness verification is challenging but possible. Another is the refinement of the DPO mechanism to further harness the instructional value of incorrect solutions. Additionally, examining the impact of V-STaR in a multi-modal context, where reasoning extends beyond text to include visual or auditory data, presents a promising frontier.

In sum, V-STaR introduces a compelling enhancement to the self-improvement paradigm for LLMs, leveraging the unexploited potential of incorrect solutions through an iterative verification mechanism. Its success heralds a nuanced approach to model training, with implications that stretch across the theoretical and practical realms of AI research.

Authors (6)
  1. Arian Hosseini (13 papers)
  2. Xingdi Yuan (46 papers)
  3. Nikolay Malkin (54 papers)
  4. Aaron Courville (201 papers)
  5. Alessandro Sordoni (53 papers)
  6. Rishabh Agarwal (47 papers)
Citations (56)