Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

Published 12 Apr 2026 in cs.LG, cs.AI, and cs.CL | (2604.10701v1)

Abstract: Credit assignment is a central challenge in reinforcement learning (RL). Classical actor-critic methods address this challenge through fine-grained advantage estimation based on a learned value function. However, learned value models are often avoided in modern LLM RL because conventional discriminative critics are difficult to train reliably. We revisit value modeling and argue that this difficulty is partly due to limited expressiveness. In particular, representation complexity theory suggests that value functions can be hard to approximate under the one-shot prediction paradigm used by existing value models, and our scaling experiments show that such critics do not improve reliably with scale. Motivated by this observation, we propose Generative Actor-Critic (GenAC), which replaces one-shot scalar value prediction with a generative critic that performs chain-of-thought reasoning before producing a value estimate. We further introduce In-Context Conditioning, which helps the critic remain calibrated to the current actor throughout training. GenAC improves value approximation, ranking reliability, and out-of-distribution generalization, and these gains translate into stronger downstream RL performance than both value-based and value-free baselines. Overall, our results suggest that stronger value modeling is a promising direction for improving credit assignment in LLM reinforcement learning.

Abstract PDF Upgrade to Chat

Authors (4)

Summary

The paper presents GenAC, a generative critic framework that replaces scalar value prediction with chain-of-thought reasoning to improve credit assignment in LLM RL.
It demonstrates that generative critics reduce approximation error and variance, achieving higher sample efficiency and validation accuracy on mathematical reasoning benchmarks.
The study illustrates enhanced interpretability and robustness through In-Context Conditioning, offering precise token-level credit assignment for complex, long-horizon tasks.

Generative Critics for Value Modeling in LLM Reinforcement Learning

Motivation and Problem Statement

The landscape of LLM reinforcement learning is shaped by the need for effective credit assignment. Modern RL-centric LLM post-training, across alignment, mathematical reasoning, and code generation, is constrained by the inability of prevalent methods to provide informative policy gradients over long-horizon trajectories. Value-free methods like GRPO and RLOO, popular for computational simplicity, are fundamentally limited—they estimate uniform trajectory-level advantage, essentially reducing sequence modeling to bandit feedback. This results in poor credit assignment for complex, multi-step tasks. While traditional actor-critic methods such as PPO once addressed this through token-level advantage estimation, their practical impact has diminished at LLM scale due to unreliable value modeling.

This work challenges the notion that value-based approaches are intractable in LLM RL, analyzing the expressivity failure of conventional discriminative critics and proposing an alternative: Generative Actor-Critic (GenAC). GenAC replaces scalar value prediction with chain-of-thought generative value estimation, bolstered by In-Context Conditioning (ICC) for robust policy-awareness throughout training.

Theoretical Limits of Discriminative Critics

The central claim advanced is that the difficulty in value modeling for LLM RL is not purely an optimization artifact, but a representational bottleneck driven by model architecture choices. Discriminative critics, limited by one-shot scalar prediction, cannot express the function complexity required for value estimation in language MDPs. Grounding this, the paper references representation complexity theory, demonstrating that the computation for optimal value functions in MDPs indexed by polynomial-time languages can be $\mathsf{P}$ -complete, while constant-depth Transformers are confined to $\mathsf{TC^0}$ [feng2024rethinking]. As a result, scaling model parameters does not yield improved value function approximation if the model cannot perform serial reasoning steps.

Empirical scaling studies substantiate this: discriminative critics (initialized from variously sized LLMs) show stable but high approximation error regardless of parameter count and high seed sensitivity, confirming the theoretical analysis.

Figure 1: (a) Generative critics outperform discriminative critics in value approximation and scale better with model size. (b) GenAC yields stronger sample efficiency and converges to higher downstream RL validation accuracy on mathematical reasoning benchmarks.

The GenAC Architecture

GenAC's principal innovation is a generative critic that leverages chain-of-thought reasoning prior to emitting a final value estimate. By exploiting the LLM’s natural capacity for autoregressive reasoning, the generative critic can decompose the evaluation problem into sequential subcomputations, addressing expressiveness shortcomings inherent to shallow, single-pass architectures.

In contrast to replacing the LM head with a single-value regressor (standard in discriminative critics), GenAC retains the generative LM head and prompts the model to output a structured reasoning trace terminating with an integer-valued "score" (0–10, normalized), naturally suited for LLMs.

To enhance the policy-awareness of the critic and prevent miscalibration or reward hacking, GenAC introduces the ICC mechanism. The ICC prompt explicitly inserts a summary of the current actor’s policy traits (e.g., model size, recent performance statistics) as context for the critic. This ensures that the value estimate is tied to the evolving actor and not a static or off-policy assessment.

Figure 2: GenAC’s ICC-based prompt template provides explicit policy context to the generative critic.

Training Pipeline and Cost Analysis

The generative critic undergoes multi-stage training: (1) supervised fine-tuning on synthetic reasoning traces, (2) RL-based value pretraining using empirical Monte Carlo returns, and (3) joint actor-critic optimization via PPO. During RL, segment-level value predictions from the critic are mapped to token-level advantages using GAE with $\lambda=1$ , achieving densely informative gradients for the actor.

The cost analysis highlights that GenAC’s computational overhead (~ $2.1\times$ PPO’s FLOPs) is moderate compared to rollout-based approaches (e.g., VinePPO at $4.3\times$ ). The efficiency is achieved by generating relatively short reasoning traces per prefix (rather than multiple full rollouts per segment, as in VinePPO).

Empirical Results and Analysis

Value Approximation and Stability

GenAC's critics exhibit superior value approximation compared to discriminative baselines, as seen in markedly reduced validation MSE and decreased variance with increasing model size. This is further validated through ranking performance probes: generative critics achieve higher top-1 accuracy in selecting the highest-value candidate among sampled segments, with this gap widening as the candidate set grows.

Downstream RL Policy Performance

On six mathematical reasoning benchmarks, GenAC outperforms both value-free (GRPO, RLOO) and value-based (VC-PPO) baselines in final policy accuracy and sample efficiency. VC-PPO trails off as training progresses, illustrating saturation due to critic expressiveness limits, while GenAC maintains performance improvements throughout.

Figure 3: GenAC assigns token-level advantage reflecting actual reasoning dynamics, whereas value-free and discriminative critic methods fail to differentiate segmental quality, especially on erroneous answer trajectories.

Out-of-Distribution Generalization

Testing critics on datasets with progressive distributional shifts reveals that generative critics consistently generalize better, with higher MSE reductions on OOD evaluation sets compared to discriminative critics. This robustness underlies the sustained RL performance advantage as the actor policy diverges during training.

Interpretability and Credit Assignment

Case studies show that GenAC’s generative critic can pinpoint the precise step in a reasoning trajectory where a conceptual error emerges, assigning appropriately lowered value and providing interpretable rationale.

Figure 4: Example where the generative critic accurately detects and rationalizes a stepwise conceptual error in mathematical reasoning.

Ablations and Design Insights

Ablations disabling ICC or skipping RL-based critic training deliver degraded value approximation and RL performance. Notably, even highly capable LLMs (e.g., GPT-5) fail as value critics when used in a zero-shot, prompt-based manner, confirming the necessity of policy-aware, specialized value modeling.

Implications and Future Directions

The findings directly challenge the recent trend away from value-based RL in LLMs, establishing that architectural limitations of discriminative critics—rather than inherent infeasibility—explain degraded performance at scale. Introducing generative critics with chain-of-thought and policy conditioning enables reliable, expressive value modeling, unlocking fine-grained credit assignment critical for complex LLM behaviors.

The theoretical implication is that for MDPs induced by long-horizon, language-based tasks, only value functions that match the sequence-processing demands of the problem class can be learned. Practically, these results anticipate that the RL fine-tuning of next-generation LLMs in agentic, reasoning-intensive settings will benefit from a renewed emphasis on value modeling, provided that the critic architectures are sufficiently expressive.

Ongoing engineering work is required to optimize the computational footprint of generative critics. Further research should extend the empirical study beyond mathematical reasoning to domains involving multi-agent, interactive, and multi-modal LLM settings, where precise temporal credit assignment is likely to be even more crucial.

Conclusion

This work repositions explicit value modeling at the core of RL for LLMs, diagnosing the expressiveness limitations of classical discriminative critics and demonstrating the effectiveness of generative, chain-of-thought critics with ICC. The resulting GenAC framework achieves higher sample efficiency, stability, and final performance in mathematical reasoning tasks, with robust generalization and interpretable error attribution. These results suggest that scalable, expressive value modeling will be foundational for future advances in RL-driven LLMs as tasks grow in complexity and temporal horizon.

Markdown Report Issue