Natural Language Actor-Critic (NLAC)

Updated 5 December 2025
  • NLAC is a framework that integrates large language models as both actor and critic, enabling natural language feedback and value estimation for sequential decision-making.
  • It combines token-logit, binary verification, and natural language critique methods to improve long-horizon reasoning and robustness in complex tasks.
  • The approach enhances sample efficiency and policy stability through iterative refinement and off-policy data, demonstrating superior performance in various empirical evaluations.

Natural Language Actor-Critic (NLAC) refers to a family of actor-critic architectures and algorithms where both actor and critic interact and reason in natural language, leveraging the capabilities of LLMs for sequential decision-making and structured prediction. In NLAC, the critic can produce either scalar value estimates or natural language critiques, and the framework encompasses approaches for both black-box inference-time improvement and trainable, data-driven policy learning. Core variants include token-logit–based critic evaluation, binary verification, and direct natural language feedback, each with distinct optimization and applicability properties (Dong et al., 4 Jun 2025, Zheng et al., 28 Oct 2024, Hong et al., 4 Dec 2025, Bahdanau et al., 2016).

1. Conceptual Foundations and Motivation

Classic auto-regressive LLMs generate actions (outputs, tokens, SQL, or commands) by conditioning on the immediately preceding context, optimizing for local likelihood but lacking explicit mechanisms for global planning and long-term reward maximization. Even advanced prompting strategies such as Chain-of-Thought are limited: they enable better local reasoning but do not adjust the policy itself in response to eventual success or failure (Dong et al., 4 Jun 2025).

Actor-critic frameworks remedy this by introducing an explicit "critic" that evaluates (state, action) pairs for long-term desirability. In NLAC, both actor and critic are implemented via LLMs—usually with carefully crafted prompts to elicit value prediction, outcome verification, or rich natural language feedback.

Motivations for NLAC include:

  • Long-horizon reasoning: Environments where reward manifests only after many steps (e.g., ALFWorld, WebShop).
  • Large, open-ended action spaces: Tool use, code generation, web navigation, and dialogue.
  • Sample efficiency and stability: Black-box approaches allow prompt-based policy improvement without the variance and instability of gradient-based RL.
  • Richer feedback signals: Language-based critiques as opposed to simple scalars permit refinement in settings where exploration by random action sampling is infeasible.

2. NLAC Architectures: Actor and Critic Roles

There are three principal instantiations of NLAC in the literature.

Token-Logit Q-value NLAC

  • Actor ($\pi_0(a \mid s)$): Standard LLM next-action distribution in a given context (state $s$ = initial goal + action history).
  • Critic ($Q(s, a)$): Q-value extracted via token logits. Special outcome tokens ("GOOD" $y_w$, "BAD" $y_l$) are appended to the prompt; the log-odds between these tokens gives an estimate of long-term return:

$$Q(s, a) = \log \frac{P_{\mathrm{LLM}}(y_w \mid s, a)}{P_{\mathrm{LLM}}(y_l \mid s, a)}$$

Rollouts and "reflection" (LLM-predicted future trajectories and meta-evaluations) further refine $Q(s, a, u)$ for anticipated delayed effects (Dong et al., 4 Jun 2025).
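
As a concrete illustration of this critic, the sketch below reads a scalar Q-value off the next-token logits of a Hugging Face causal LM. The model name, prompt template, and the assumption that the outcome words map to single tokens are placeholders, not the exact setup of Dong et al.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM exposing next-token logits works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_logit_q(state: str, action: str) -> float:
    """Estimate Q(s, a) as the log-odds of the 'GOOD' vs. 'BAD' outcome token."""
    # Hypothetical critic prompt; the real template would describe the task and
    # ask whether the proposed action eventually leads to success.
    prompt = (
        f"State: {state}\n"
        f"Proposed action: {action}\n"
        "Will this action lead to task success (GOOD or BAD)? Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    log_probs = torch.log_softmax(next_token_logits, dim=-1)

    # NLAC assumes dedicated outcome tokens; here we approximate them with the
    # first sub-token of " GOOD" / " BAD".
    good_id = tokenizer.encode(" GOOD")[0]
    bad_id = tokenizer.encode(" BAD")[0]
    return (log_probs[good_id] - log_probs[bad_id]).item()
```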

Binary Critique NLAC

  • Actor: LLM prompted to generate a discrete solution (e.g., Text-to-SQL mapping).
  • Critic: Binary signal (True/False), determined via database execution (syntax and semantic checks) or LLM-based prompts. Two error rates are relevant: the false-negative rate $q$ and the false-positive rate $s$. The process iterates: if the Critic rejects the result, the Actor generates a new candidate. Performance guarantees derive from these error rates and the number of allowed iterations $z$:

$$\kappa_\infty = \frac{p(1-s)}{p + q - pq - ps}$$

where $p$ is the Actor's single-attempt accuracy (Zheng et al., 28 Oct 2024).
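
To make the bound concrete, here is a small worked computation of $\kappa_\infty$ with illustrative (not paper-reported) error rates:

```python
def kappa_inf(p: float, q: float, s: float) -> float:
    """Limiting accuracy of the generate-verify-resample loop.

    p: Actor single-attempt accuracy
    q: Critic false-negative rate (correct candidates wrongly rejected)
    s: Critic false-positive rate (wrong candidates wrongly accepted)
    """
    return p * (1 - s) / (p + q - p * q - p * s)

# Illustrative values: a 60%-accurate actor, with a critic that rejects 10% of
# correct candidates and accepts 20% of wrong ones -> limiting accuracy ~0.92.
print(kappa_inf(p=0.6, q=0.1, s=0.2))  # 0.923...
```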

Natural Language Critique NLAC

  • Actor ($\pi_\theta$): Policy LLM prompted to output a thought-action pair per time step (e.g., in ReAct format).
  • Language Critic ($Q_L^\pi$): Separately prompted generative LLM that predicts forecasts (future rollout descriptions) and aggregates these into textual critiques assessing action optimality and providing revision guidance.
  • Refinement policy ($\pi^r_\theta$): LLM prompt that, given state, action, and critic text, proposes improved actions. The base actor is trained to imitate the refinement policy via maximum-likelihood distillation on off-policy data (Hong et al., 4 Dec 2025).
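
The division of labor among these three prompts can be sketched as follows; `llm` stands for any text-in/text-out completion function, and the prompt templates are illustrative placeholders rather than the exact prompts of Hong et al.

```python
from typing import Callable

LLM = Callable[[str], str]  # any text-in/text-out completion function

def nlac_decision_step(llm: LLM, state: str) -> str:
    """One decision step: propose an action, critique it in language, refine it."""
    # Actor pi_theta: thought-action pair in a ReAct-like format.
    action = llm(f"State:\n{state}\nThink step by step, then output the next action.")

    # Language critic Q_L: forecast likely rollouts and critique the action.
    critique = llm(
        f"State:\n{state}\nProposed action:\n{action}\n"
        "Forecast what is likely to happen next, judge whether the action is "
        "optimal, and explain how it could be improved."
    )

    # Refinement policy pi^r_theta: revise the action in light of the critique.
    refined = llm(
        f"State:\n{state}\nOriginal action:\n{action}\nCritique:\n{critique}\n"
        "Output an improved action."
    )
    return refined  # the base actor is later distilled toward such refined actions
```
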
| Variant | Actor | Critic type |
| --- | --- | --- |
| Token-logit Q-value (Dong et al., 4 Jun 2025) | LLM prior $\pi_0(a \mid s)$ | Scalar Q via token logits |
| Binary AC-SQL (Zheng et al., 28 Oct 2024) | LLM, black-box generator | True/False via execution + LLM checks |
| NL critique (Hong et al., 4 Dec 2025) | LLM (ReAct or similar) | Free-form NL critique |

3. Policy Improvement and Training Methodologies

Token-logit NLAC

Policy improvement is formulated as a KL-constrained optimization:

$$\max_\pi \; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\bigl[Q(s, a, u)\bigr] - \frac{1}{\alpha}\, D_{\mathrm{KL}}\bigl[\pi(\cdot \mid s)\,\|\,\pi_0(\cdot \mid s)\bigr]$$

with analytic, gradient-free update:

$$\pi^*(a \mid s) \propto \pi_0(a \mid s)\, \exp\bigl(\alpha\, Q(s, a, u)\bigr)$$

In practice, $n$ candidate actions are sampled from $\pi_0$, weighted by $\exp(\alpha Q)$, and the action maximizing this weighted score is selected. No gradient passes through the LLM (Dong et al., 4 Jun 2025).
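
A minimal sketch of this gradient-free step, assuming two hypothetical helpers: `sample_action`, which draws an action from $\pi_0$ together with its log-probability, and `q_value`, the critic estimate $Q(s, a, u)$:

```python
import math

def improved_action(state, sample_action, q_value, n=8, alpha=1.0):
    """Pick argmax over n candidates of pi_0(a|s) * exp(alpha * Q), in log space."""
    best_action, best_score = None, -math.inf
    for _ in range(n):
        action, log_p0 = sample_action(state)             # a ~ pi_0, with log pi_0(a|s)
        score = log_p0 + alpha * q_value(state, action)    # log[pi_0 * exp(alpha * Q)]
        if score > best_score:
            best_action, best_score = action, score
    return best_action  # no gradient ever passes through the LLM
```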

Binary Critique NLAC

The process iterates:

  1. The Actor generates a candidate.
  2. The Critic accepts (True) or rejects (False). If rejected and still within the iteration budget, the Actor resamples; otherwise, the latest candidate is returned.

There are no policy gradients or parameter updates, so both LLMs remain black boxes. Accuracy increases with critic reliability and iteration budget (Zheng et al., 28 Oct 2024).
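
A sketch of this loop, with `generate_sql` and `verify` standing in for the black-box Actor and Critic (both hypothetical helpers; `verify` would wrap database execution and/or an LLM check):

```python
def actor_critic_sql(question, generate_sql, verify, max_iters=5):
    """Resample until the Critic accepts or the iteration budget z is exhausted."""
    candidate = generate_sql(question)
    for _ in range(max_iters):
        if verify(question, candidate):      # True = accept
            return candidate
        candidate = generate_sql(question)   # rejected: draw a fresh candidate
    return candidate                         # budget exhausted: return latest candidate
```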

NL Critique and Distillation NLAC

  • Critic (language successor model) training: One-step language Bellman KL,

$$L_1(s_t, a_t, r_t, s_{t+1}) = D_{\mathrm{KL}}\Bigl(\mathcal{B}_L M_{\bar{\theta}}(\cdot \mid s_t, a_t)\,\Big\|\,M_\theta(\cdot \mid s_t, a_t)\Bigr)$$

where $M_\theta$ is the language successor model, $M_{\bar{\theta}}$ a target copy, and $\mathcal{B}_L$ the language Bellman operator.

  • Policy refinement and distillation: Define $\pi^r_\theta$ to generate $a^r_t$ such that $Q_L^\theta(s_t, a^r_t) \geq Q_L^\theta(s_t, a_t)$, then update $\pi_\theta$:

$$L_2(s_t) = -\,\mathbb{E}_{a^r_t \sim \pi^r_\theta}\bigl[\log \pi_\theta(a^r_t \mid s_t)\bigr]$$

The overall update is

$$\theta \leftarrow \theta - \lambda_1 \nabla_\theta L_1 - \lambda_2 \nabla_\theta L_2$$

This approach is sample-efficient, off-policy, and does not require policy gradients or importance sampling (Hong et al., 4 Dec 2025).
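
Schematically, one training step under these two losses could look as follows. The helpers `bellman_target_text` (the frozen target critic composed with $\mathcal{B}_L$), `critic_nll`, `policy_nll`, and `refined_action` ($\pi^r_\theta$) are assumed interfaces, and the KL in $L_1$ is approximated by a cross-entropy to the Bellman-target text, a practical surrogate rather than the paper's exact objective.

```python
import torch

def nlac_training_step(batch, critic_nll, policy_nll, bellman_target_text,
                       refined_action, optimizer, lam1=1.0, lam2=1.0):
    """One off-policy update combining the critic loss L1 and distillation loss L2.

    batch: transitions (s_t, a_t, r_t, s_{t+1}) as text fields.
    critic_nll(target_text, s, a): -log M_theta(target_text | s, a), differentiable.
    policy_nll(action_text, s):    -log pi_theta(action_text | s), differentiable.
    bellman_target_text(s, a, r, s_next): target-critic rollout text (no gradient).
    refined_action(s, a): action a^r_t proposed by the refinement prompt pi^r_theta.
    """
    l1 = torch.zeros(())
    l2 = torch.zeros(())
    for s, a, r, s_next in batch:
        target = bellman_target_text(s, a, r, s_next)   # treated as a constant label
        l1 = l1 + critic_nll(target, s, a)              # surrogate for the KL in L1
        a_refined = refined_action(s, a)                # ideally Q_L(s, a^r) >= Q_L(s, a)
        l2 = l2 + policy_nll(a_refined, s)              # maximum-likelihood distillation L2

    loss = lam1 * l1 + lam2 * l2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```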

4. Empirical Evaluations and Performance

Significant empirical results have demonstrated the effectiveness of NLAC variants across environments and natural language tasks.

  • Token-logit NLAC: On ALFWorld, BabyAI-Text, and WebShop, NLAC outperforms ReAct, RAP, RAFA, ICPI, LATS, and even GPT-4+ReAct baselines, raising success rates by 10–20%. Ablations show performance drops of 20–30 points when rollout or reflection is removed (Dong et al., 4 Jun 2025).
  • AC-SQL (Binary Critique): Evaluated zero-shot on Spider-dev, Spider-DK, and Spider-SYN. AC raises execution accuracy by up to 35 points (e.g., LLaMA3-8B: 32.6% → 67.7%, Vicuna-33B: 43.7% → 61.0%). Consistent improvements hold across diverse LLMs, including GPT-4o, and the theoretically predicted accuracy matches the empirical outcomes (Zheng et al., 28 Oct 2024).
  • NL Critique NLAC: Demonstrates improved performance over PPO, GRPO, and scalar SAC. On MATH500-Hard, 20Q, and dialogue+tool tasks, NLAC converges in half the steps and yields higher win rates or task success rates (e.g., MATH500-Hard: 60.2% vs PPO≈52.3%). Sample efficiency and stability are notable, with learning stable on off-policy data, even in long-horizon, sparse-reward domains (Hong et al., 4 Dec 2025).
  • Sequence Prediction NLAC: Yields BLEU improvements of +1.7 to +2.3 over maximum-likelihood baselines and outperforms REINFORCE on translation and synthetic sequence correction, validating the framework’s applicability to standard NLG tasks (Bahdanau et al., 2016).

5. Generalization, Transfer, and Theoretical Guarantees

The actor-critic (AC) paradigm generalizes across natural language and structured prediction settings where:

  • The Actor proposes a candidate solution (text, program, SQL, code).
  • The Critic batch-verifies or scores candidate solutions—via execution, logical checkers, or LLM-based evaluations.
  • An iterative process (with bounded trials) can increase the probability of correct output, as long as the sum of critic error rates (FPR + FNR) remains below 1 (Zheng et al., 28 Oct 2024).

Theoretically, NLAC methods guarantee non-decreasing expected accuracy under sufficiently reliable critics. Token-logit and natural language critique approaches yield empirical robustness in multi-step and open-ended environments, and natural language feedback enables improvement in cases where scalar-reward exploration would be intractable (Hong et al., 4 Dec 2025, Dong et al., 4 Jun 2025).

6. Limitations and Future Directions

Current NLAC variants exhibit several limitations:

  • Reflection timing: Token-logit NLAC applies reflection only pre-action. Introducing post-action reflection or tree-search strategies could improve critic accuracy and robustness (Dong et al., 4 Jun 2025).
  • Continuous reward modeling: Existing heuristics for continuous rewards are limited. Designing richer outcome token schemes or hybrid scalar + language critics remains open (Dong et al., 4 Jun 2025, Hong et al., 4 Dec 2025).
  • Scope of critique: Natural language critics’ quality depends on the ability to generate actionable, accurate feedback—a challenge in domains where progress is not readily verbalizable (Hong et al., 4 Dec 2025).

Future research directions include applying deeper search over trajectories, exploring scalable LLM fine-tuning for critic and actor roles in complex domains (e.g., robotics simulators), and systematically integrating multi-modal feedback signals.

7. Relationship to Broader RL and Sequence Prediction

NLAC extends classic actor-critic and RL techniques to structured prediction in language, explicitly narrowing the train/test gap present in maximum-likelihood or teacher-forcing regimes. Conditioning the critic on ground truth (during training) and using policy-induced rollouts exposes the actor to its own errors, regularizing and robustifying policy learning (Bahdanau et al., 2016). By incorporating rich critic signals at inference or during training, NLAC frameworks serve as general, extensible templates for reinforcement, verification, and bootstrapped improvement in LLM-driven natural language tasks.
