
SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling (2501.19306v3)

Published 31 Jan 2025 in cs.AI and cs.CL

Abstract: Recent advancements in LLMs have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, existing parallel scaling methods, such as repeated sampling or reward model scoring, often suffer from premature convergence and high costs due to task-specific reward model training, while sequential methods like SELF-REFINE cannot effectively leverage increased compute. This paper introduces Self-Enhanced Test-Time Scaling (SETS), a new approach that overcomes these limitations by strategically combining parallel and sequential techniques. SETS exploits the inherent self-verification and self-correction capabilities of LLMs, unifying sampling, verification, and correction within a single framework. This innovative design facilitates efficient and scalable test-time computation for enhanced performance on complex tasks. Our comprehensive experimental results on challenging benchmarks spanning planning, reasoning, math, and coding demonstrate that SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.

Summary

  • The paper introduces SETS (Self-Enhanced Test-Time Scaling), a framework that improves LLM test-time performance by integrating sampling, self-verification, and self-correction.
  • Experimental results show SETS achieving accuracy gains up to 8.7% over baselines and demonstrating better test-time scaling properties on complex reasoning benchmarks.
  • SETS enhances confidence calibration and leverages the inherent self-correction capabilities of advanced LLMs without requiring task-specific model training.

The paper "SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling" introduces a novel methodology, Self-Enhanced Test-Time Scaling (SETS), to enhance the performance of LLMs during inference by leveraging their self-verification and self-correction capabilities. This approach aims to solve the limitations of traditional test-time strategies such as repeated sampling and reward model scoring, which suffer from diminishing returns with increased computation and require costly task-specific model training.

Key Methodology:

  1. SETS Framework:
    • The SETS methodology integrates three core operations into one cohesive framework:
      • Sampling: Generates initial candidate solutions with the LLM.
      • Self-Verification: Evaluates whether a proposed solution satisfies the task constraints, without external models.
      • Self-Correction: Iteratively refines solutions based on the verification feedback to improve accuracy.
    • The framework applies these operations in a multi-round process in which a solution is repeatedly verified and corrected until it is judged satisfactory (a minimal sketch of this loop follows the list below).
  2. Scaling Laws and Trade-offs:
    • The SETS approach is formulated to identify optimal test-time compute allocations, balancing the number of sampled candidates against the number of self-verification/correction rounds to maximize performance under a fixed budget.
    • Prior approaches exhibit diminishing returns: extra compute spent on sampling does not proportionately increase performance, owing to inefficiencies in majority voting and the reliance on external verifiers.
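
To make the loop concrete, here is a minimal sketch of how sampling, self-verification, and self-correction can be composed, assuming a generic `llm(prompt) -> str` callable; the prompt templates, sample counts, and stopping criterion are illustrative assumptions, not the paper's exact implementation.

```python
from collections import Counter

def sets(llm, task, num_samples=8, max_rounds=3):
    """Sketch of the SETS loop: sample, then self-verify and self-correct."""
    final_answers = []
    for _ in range(num_samples):
        # Sampling: draw an initial candidate solution.
        solution = llm(f"Solve the following task:\n{task}")
        for _ in range(max_rounds):
            # Self-Verification: the model judges its own solution.
            verdict = llm(
                f"Task:\n{task}\nProposed solution:\n{solution}\n"
                "Does this solution satisfy all task constraints? "
                "Answer CORRECT or INCORRECT, then explain."
            )
            if verdict.strip().upper().startswith("CORRECT"):
                break  # stop early once the model accepts its own answer
            # Self-Correction: revise the solution using the critique.
            solution = llm(
                f"Task:\n{task}\nPrevious solution:\n{solution}\n"
                f"Critique:\n{verdict}\nProduce a corrected solution."
            )
        final_answers.append(solution)
    # Aggregate refined samples by majority vote (self-consistency); in
    # practice a final answer would be extracted from each solution first.
    return Counter(final_answers).most_common(1)[0][0]
```
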

Experimental Evaluation:

  • The SETS framework was evaluated using benchmarks that test complex planning and reasoning capabilities, such as NATURAL PLAN (Trip Planning, Meeting Planning, and Calendar Scheduling) and LiveBench Reasoning.
  • Results demonstrated that SETS achieved accuracy improvements of up to 8.7% over baselines and displayed more favorable test-time scaling behavior. The gains are most pronounced on tasks with large solution spaces, where the self-correction capabilities of recent LLMs such as the GEMINI-1.5 models can be exploited effectively.
  • Beyond outperforming conventional sampling baselines, SETS shows especially large accuracy gains on the most challenging tasks, indicating a more effective search for correct solutions.

Additional Contributions:

  • Beyond accuracy, SETS improves the confidence calibration of predictive outputs by refining the self-consistency approach, yielding up to a 9% improvement in the area under the accuracy-coverage curve (AUACC) and a marked reduction in Expected Calibration Error (ECE); a worked ECE sketch follows this list.
  • The research underscores the key advantage of advanced LLMs which, through inherent self-correction, can bypass additional model fine-tuning for improved test-time performance.
  • A thorough analysis provided insights into dynamic trade-offs between increasing self-verification/correction rounds versus the number of samples.
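
For reference, the sketch below shows how the Expected Calibration Error reported above is computed; the binning scheme (15 equal-width bins) is a common convention and an assumption here, not necessarily the paper's exact setting.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=15):
    """ECE = sum over bins b of (n_b / N) * |accuracy(b) - mean_confidence(b)|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Gap between empirical accuracy and average confidence in the bin.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # bin weight = fraction of all samples
    return ece

# Example: overconfident predictions produce a large ECE.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.7], [1, 0, 1, 0]))
```
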

In summary, the paper makes a strong case for SETS as an effective strategy in the regime of test-time compute scaling: by integrating self-verification and self-correction with sampling, it delivers higher accuracy and more favorable scaling behavior. This positions SETS as a promising method for enhancing LLM utility in real-world problem-solving applications.