SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning (2504.19162v2)

Published 27 Apr 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Evaluating the step-by-step reliability of LLM reasoning, such as Chain-of-Thought, remains challenging due to the difficulty and cost of obtaining high-quality step-level supervision. In this paper, we introduce Self-Play Critic (SPC), a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the need for manual step-level annotation. SPC involves fine-tuning two copies of a base model to play two roles, namely a "sneaky generator" that deliberately produces erroneous steps designed to be difficult to detect, and a "critic" that analyzes the correctness of reasoning steps. These two models engage in an adversarial game in which the generator aims to fool the critic, while the critic model seeks to identify the generator's errors. Using reinforcement learning based on the game outcomes, the models iteratively improve; the winner of each confrontation receives a positive reward and the loser receives a negative reward, driving continuous self-evolution. Experiments on three reasoning process benchmarks (ProcessBench, PRM800K, DeltaBench) demonstrate that our SPC progressively enhances its error detection capabilities (e.g., accuracy increases from 70.8% to 77.7% on ProcessBench) and surpasses strong baselines, including a distilled R1 model. Furthermore, SPC can guide the test-time search of diverse LLMs and significantly improve their mathematical reasoning performance on MATH500 and AIME2024, surpassing those guided by state-of-the-art process reward models.

Summary

Essay on "SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning"

This paper presents the Self-Play Critic (SPC) framework, an approach to evaluating the step-by-step reliability of LLM reasoning. It addresses a significant challenge in the field: high-quality step-level supervision for reasoning processes such as Chain-of-Thought is difficult and costly to obtain.

SPC uses adversarial self-play to evolve a critic model that assesses individual reasoning steps. The framework fine-tunes two copies of a base model for distinct roles: a "sneaky generator" that deliberately produces erroneous reasoning steps designed to be hard to detect, and a "critic" that analyzes whether a given step is correct. The two models play an adversarial game in which the generator tries to produce errors the critic fails to detect, while the critic tries to catch them.
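One round of this game can be sketched as follows. This is a minimal illustration, not the paper's implementation: the role signatures, `play_round`, and the toy stand-ins are all hypothetical, and in SPC both roles are fine-tuned LLMs rather than string functions. The sketch also assumes the generator's step really is erroneous.

```python
from typing import Callable

# Hypothetical role signatures (assumptions for illustration only).
SneakyGenerator = Callable[[str, str], str]  # (problem, correct_step) -> erroneous step
Critic = Callable[[str, str], bool]          # (problem, step) -> True if judged incorrect


def play_round(problem: str, correct_step: str,
               generator: SneakyGenerator, critic: Critic) -> str:
    """Play one confrontation and return the winner's name."""
    sneaky_step = generator(problem, correct_step)  # generator injects an error
    flagged = critic(problem, sneaky_step)          # critic judges the step
    return "critic" if flagged else "generator"


# Toy stand-ins for the two models.
def toy_generator(problem: str, correct_step: str) -> str:
    # Corrupt the arithmetic result in the correct step.
    return correct_step.replace("= 4", "= 5")


def toy_critic(problem: str, step: str) -> bool:
    # Flag any step that does not contain the known-correct equation.
    return "2 + 2 = 4" not in step


def lenient_critic(problem: str, step: str) -> bool:
    # Never flags anything, so the generator always wins.
    return False


winner_strict = play_round("What is 2 + 2?", "2 + 2 = 4", toy_generator, toy_critic)
winner_lenient = play_round("What is 2 + 2?", "2 + 2 = 4", toy_generator, lenient_critic)
```

With the strict toy critic the corrupted step is caught and the critic wins; with the lenient one the error slips through and the generator wins.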

Through iterative reinforcement learning driven by the outcomes of these games, both models improve. The winner of each confrontation receives a positive reward and the loser a negative one: the critic wins when it detects the generator's error, and the generator wins when its error goes undetected. This dynamic strengthens the critic over successive iterations without manual step-level annotation.
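The reward rule can be written down in a few lines. This is a sketch under stated assumptions: the source only says the winner is rewarded positively and the loser negatively, so the symmetric ±1 magnitude, the `GameOutcome` type, and the `score_confrontation` name are illustrative choices rather than values taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameOutcome:
    generator_reward: float
    critic_reward: float


def score_confrontation(critic_flagged_error: bool,
                        win_reward: float = 1.0) -> GameOutcome:
    """Score one game in which the generator injected an error.

    The +/- win_reward magnitude is an assumption; only the signs
    (winner positive, loser negative) come from the source.
    """
    if critic_flagged_error:
        # Critic detected the error: critic wins, generator loses.
        return GameOutcome(generator_reward=-win_reward, critic_reward=win_reward)
    # Error went undetected: generator wins, critic loses.
    return GameOutcome(generator_reward=win_reward, critic_reward=-win_reward)


detected = score_confrontation(critic_flagged_error=True)
missed = score_confrontation(critic_flagged_error=False)
```

Under this symmetric choice the two rewards always sum to zero, so each model can only gain at the other's expense.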

Experiments on three reasoning-process benchmarks (ProcessBench, PRM800K, and DeltaBench) show that SPC's error detection improves progressively; for instance, accuracy on ProcessBench rose from 70.8% to 77.7% over iterative training, and SPC surpasses strong baselines, including a distilled R1 model. Additionally, using SPC to guide the test-time search of diverse LLMs significantly improved their mathematical reasoning performance on MATH500 and AIME2024, surpassing search guided by state-of-the-art process reward models.
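To make "guiding test-time search" concrete, here is one simple way a step-level critic could steer generation. The source summary does not describe the paper's exact search procedure, so this is an assumed scheme: at each step, sample several candidate steps and keep the first one the critic accepts. All names (`guided_solve`, `propose_steps`, `step_is_correct`, `is_final`) and the toy data are hypothetical.

```python
from typing import Callable, List


def guided_solve(propose_steps: Callable[[List[str]], List[str]],
                 step_is_correct: Callable[[List[str], str], bool],
                 is_final: Callable[[str], bool],
                 max_steps: int = 8) -> List[str]:
    """Build a solution step by step, letting a critic filter candidates."""
    solution: List[str] = []
    for _ in range(max_steps):
        candidates = propose_steps(solution)
        if not candidates:
            break
        # Keep the first candidate the critic accepts; if it rejects all of
        # them, fall back to the first proposal so the search still advances.
        chosen = next((c for c in candidates if step_is_correct(solution, c)),
                      candidates[0])
        solution.append(chosen)
        if is_final(chosen):
            break
    return solution


# Toy stand-ins: a scripted proposer and a critic that rejects "= 6".
SCRIPT = [["x = 2 + 3 = 6", "x = 2 + 3 = 5"], ["answer: 5"]]


def toy_propose(partial: List[str]) -> List[str]:
    return SCRIPT[len(partial)] if len(partial) < len(SCRIPT) else []


def toy_step_is_correct(partial: List[str], step: str) -> bool:
    return "= 6" not in step


def toy_is_final(step: str) -> bool:
    return step.startswith("answer:")


result = guided_solve(toy_propose, toy_step_is_correct, toy_is_final)
```

In the toy run the critic rejects the first, wrong candidate and the search continues from the corrected step.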

The implications of this research are both practical and theoretical. Practically, SPC can improve LLM reasoning performance by guiding test-time search, steering models away from erroneous steps during inference. Theoretically, adversarial self-play represents a shift away from static, manually annotated step-level datasets toward self-improving training, which could inspire further work in AI and machine learning.

Future developments may focus on refining the self-evolution capabilities of SPC under different task domains, enhancing model robustness against adversarial attacks, and exploring deeper integrations with more sophisticated LLMs. Moreover, the modular nature of the SPC framework suggests potential for adaptation and extension into other domains requiring rigorous stepwise verification.

In conclusion, by introducing a self-evolving framework for detecting reasoning errors in LLMs, the paper contributes to the advancement of AI reasoning systems. The iterative self-improvement shown by SPC may encourage exploration of similar self-play paradigms in other areas of AI.


Authors (8)
