
RARO: Relativistic Adversarial Reasoning Optimization

Updated 23 December 2025
  • RARO is a verifier-free adversarial IRL algorithm that enables LLMs to learn complex reasoning solely from expert demonstrations without explicit reward functions.
  • It employs a minimax framework between a policy generator and a relativistic critic to derive implicit reward signals by comparing expert and policy-generated reasoning traces.
  • Empirical evaluations show RARO improves performance by 10-15 points over behavior cloning and effectively narrows the gap with verifier-based reinforcement learning methods.

Relativistic Adversarial Reasoning Optimization (RARO) is a verifier-free adversarial inverse reinforcement learning (IRL) algorithm developed for training LLMs to perform complex reasoning solely from expert demonstrations, without requiring explicit task-specific verifiers or hand-crafted reward functions. RARO leverages an adversarial game between a policy (generator) and a relativistic critic (discriminator), enabling the model to acquire robust, scalable reasoning abilities in settings where reward signals from verifiers are absent or infeasible to construct (Cai et al., 26 Nov 2025).

1. High-Level Framework

RARO operates in environments where only expert reasoning traces are available—for example, step-by-step mathematical proofs, solution paths, or creative text outputs—while ground-truth reward feedback is not directly accessible. The method instantiates an adversarial game:

  • Policy $\pi_\theta$ (Generator): Proposes reasoning outputs $x$ given inputs $s$ by stochastic sampling.
  • Relativistic Critic $D_\phi$ (Discriminator): Assigns scalar scores to outputs, facilitating pairwise comparison between expert-generated and policy-generated reasoning traces.

The learning loop for RARO consists of iteratively:

  • Sampling roll-outs from the policy and collecting corresponding expert traces.
  • Updating the critic to enhance its ability to relativistically distinguish expert from policy outputs, using only the distinction “expert” vs. “policy.”
  • Transforming the critic scores into an implicit reward signal $r_\phi(x)$ for the policy.
  • Updating the policy parameters using policy gradient RL to maximize these implicit rewards.

Crucially, the critic never sees a ground-truth numerical reward or verifier output, enforcing a purely adversarial IRL paradigm.
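
The sketch below outlines one such iteration in code. It is schematic only: the helper callables (`sample_expert_batch`, `generate`, `critic_update`, `implicit_reward`, `policy_update`) are hypothetical stand-ins for the components described above, not functions defined by the paper.

```python
from typing import Callable, List, Sequence, Tuple

def raro_iteration(
    sample_expert_batch: Callable[[], Tuple[List[str], List[str]]],  # -> (prompts, expert traces)
    generate: Callable[[str], str],                                  # policy roll-out for one prompt
    critic_update: Callable[[Sequence[str], Sequence[str]], None],   # one step on the "expert vs. policy" objective
    implicit_reward: Callable[[str], float],                         # r_phi(x) derived from critic scores
    policy_update: Callable[[Sequence[str], Sequence[str], Sequence[float]], None],
    k_D: int = 5,                                                    # critic steps per policy step (TTUR)
) -> None:
    # 1. Sample roll-outs from the current policy and collect the paired expert traces.
    prompts, expert_traces = sample_expert_batch()
    policy_traces = [generate(p) for p in prompts]

    # 2. Update the critic k_D times; it only ever sees the labels "expert" vs. "policy",
    #    never a ground-truth reward or verifier output.
    for _ in range(k_D):
        critic_update(expert_traces, policy_traces)

    # 3. Turn critic scores into an implicit reward r_phi(x) for each policy trace.
    rewards = [implicit_reward(x) for x in policy_traces]

    # 4. One policy-gradient step that maximizes the implicit rewards.
    policy_update(prompts, policy_traces, rewards)
```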

2. Minimax Objectives and Mathematical Formulation

RARO adapts the relativistic Generative Adversarial Network (GAN) objective to sequence (reasoning) generation. Let $\mathbb{E}_{x^E \sim E}$ denote expectation under the expert data distribution $E$, and $\mathbb{E}_{x \sim \pi_\theta}$ expectation under the policy $\pi_\theta$. Define the relativistic score:

$$\Delta D_\phi(x, x') = D_\phi(x) - D_\phi(x')$$

Critic Objective

The critic $D_\phi$ is trained to assign higher scores to expert outputs than to policy outputs, according to:

$$L_D(\phi; \theta) = -\mathbb{E}_{x^E \sim E} \left[ \log \sigma\left(D_\phi(x^E) - \mathbb{E}_{x \sim \pi_\theta}[D_\phi(x)]\right) \right] - \mathbb{E}_{x \sim \pi_\theta} \left[ \log\left(1 - \sigma\left(D_\phi(x) - \mathbb{E}_{x^E \sim E}[D_\phi(x^E)]\right)\right) \right]$$

with $\sigma(\cdot)$ denoting the logistic sigmoid. The critic is trained by minimizing this loss, $\min_\phi L_D(\phi; \theta)$.

Generator (Policy) Objective

The generator (policy) seeks to maximize the probability that the critic scores policy samples above the average expert score:

$$L_G(\theta; \phi) = -\mathbb{E}_{x \sim \pi_\theta} \left[ \log \sigma\left(D_\phi(x) - \mathbb{E}_{x^E \sim E}[D_\phi(x^E)]\right) \right]$$

The optimization is $\min_\theta L_G(\theta; \phi)$.
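
In code, both objectives reduce to relativistic logistic losses over scalar critic scores. The sketch below is a minimal illustration, assuming per-trace scores have already been gathered into 1-D tensors (the names `d_expert` and `d_policy` are illustrative); note that in RARO the policy is not trained by backpropagating through $L_G$ directly but via policy-gradient RL on the implicit reward derived from the same relativistic comparison.

```python
import torch
import torch.nn.functional as F

def critic_loss(d_expert: torch.Tensor, d_policy: torch.Tensor) -> torch.Tensor:
    """L_D: push expert scores above the mean policy score and policy scores below the mean expert score."""
    rel_expert = d_expert - d_policy.mean()   # D_phi(x^E) - E_{x ~ pi_theta}[D_phi(x)]
    rel_policy = d_policy - d_expert.mean()   # D_phi(x)   - E_{x^E ~ E}[D_phi(x^E)]
    # log(1 - sigmoid(z)) == logsigmoid(-z); written this way for numerical stability
    return -F.logsigmoid(rel_expert).mean() - F.logsigmoid(-rel_policy).mean()

def generator_objective(d_policy: torch.Tensor, d_expert: torch.Tensor) -> torch.Tensor:
    """L_G: policy samples should score above the average expert score."""
    return -F.logsigmoid(d_policy - d_expert.mean()).mean()
```

Minimizing `critic_loss` over $\phi$ and `generator_objective` over $\theta$ corresponds to the $\min_\phi L_D$ and $\min_\theta L_G$ updates above.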

Two-Time-Scale Updates (TTUR)

RARO employs separate update rates:

  • Take $k_D$ critic update steps per policy update, with learning rate $\alpha_D$.
  • Take one policy update step, with learning rate $\alpha_\theta$.

This ensures the critic rapidly tracks the evolving policy, maintaining stability and convergence.
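
A toy sketch of the schedule itself is given below; the modules, data, and squared-error losses are placeholders standing in for $L_D$ and $L_G$, and none of the numerical values come from the paper.

```python
import torch
from torch import nn

critic = nn.Linear(8, 1)   # stand-in for the relativistic critic D_phi
policy = nn.Linear(8, 8)   # stand-in for the policy's trainable parameters
alpha_D, alpha_theta, k_D = 1e-4, 1e-6, 5          # alpha_D >> alpha_theta

critic_opt = torch.optim.AdamW(critic.parameters(), lr=alpha_D)
policy_opt = torch.optim.AdamW(policy.parameters(), lr=alpha_theta)

for step in range(100):
    x = torch.randn(32, 8)
    for _ in range(k_D):                           # k_D critic updates ...
        critic_opt.zero_grad()
        critic(x).pow(2).mean().backward()         # placeholder for the critic loss L_D
        critic_opt.step()
    policy_opt.zero_grad()                         # ... followed by a single policy update
    policy(x).pow(2).mean().backward()             # placeholder for the policy objective
    policy_opt.step()
```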

3. Inverse Reinforcement Learning Interpretation

RARO can be viewed as a form of adversarial IRL:

  • The critic learns an implicit reward function $r_\phi(x)$, where

$$r_\phi(x) \approx D_\phi(x) - \mathbb{E}_{x' \sim \pi_\theta}[D_\phi(x')]$$

  • Once the critic reaches equilibrium, conventional policy gradient updates on $r_\phi(x)$ recover the maximum-causal-entropy IRL update of Ho & Ermon (2016).
  • Intuitively, the critic acts as a surrogate for a reward function under which expert traces are preferred over policy traces, and the policy is incentivized to match expert-like reasoning under this learned reward (see the sketch below).
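
As a concrete illustration of this interpretation, the implicit reward can be estimated with the batch mean of critic scores on policy samples and plugged into a standard policy-gradient surrogate. The tensor names below are illustrative; `log_probs` would hold sequence log-probabilities under $\pi_\theta$.

```python
import torch

def implicit_rewards(d_policy: torch.Tensor) -> torch.Tensor:
    # r_phi(x) ~= D_phi(x) - E_{x' ~ pi_theta}[D_phi(x')], with the expectation
    # estimated by the batch mean of critic scores on policy samples.
    return d_policy - d_policy.mean()

def reinforce_surrogate(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # REINFORCE-style surrogate: weight each trace's log-probability by its
    # (detached) implicit reward; minimizing this maximizes the expected reward.
    return -(rewards.detach() * log_probs).mean()
```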

4. RL Algorithms and Stabilization Techniques

Training such adversarial frameworks is generally unstable due to issues common in policy-gradient RL and GANs. RARO incorporates several stabilization techniques (two of them are sketched in code after this list):

  • Gradient Penalty: A penalty term

$$\lambda \,\mathbb{E}_{\hat{x}} \left[\left(\|\nabla_{\hat{x}} D_\phi(\hat{x})\|_2 - 1\right)^2\right]$$

is added to the critic loss, computed on interpolations $\hat{x}$ between expert and policy samples, enforcing a Lipschitz constraint on the critic.

  • Spectral Normalization: Applied to each transformer layer of $D_\phi$ for additional robustness.
  • Two-Time-Scale Updates: Setting $\alpha_D \gg \alpha_\theta$ ensures the critic remains well-adapted to the current policy distribution.
  • Reward Centering and Clipping: Subtracting a running mean from the implicit reward $r_\phi(x)$ and clipping it prevents exploding policy gradients and stabilizes advantage estimation.
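
The sketch below illustrates two of these techniques, under the assumption that the critic scores continuous embeddings of reasoning traces (so that the interpolation point $\hat{x}$ can be formed in that embedding space); function names and default values are illustrative, not taken from the paper.

```python
import torch
from torch import nn

def gradient_penalty(critic: nn.Module, expert_emb: torch.Tensor,
                     policy_emb: torch.Tensor, lam: float = 10.0) -> torch.Tensor:
    """WGAN-GP style penalty added to the critic loss, computed on random
    interpolations x_hat between expert and policy embeddings."""
    eps = torch.rand(expert_emb.size(0), *([1] * (expert_emb.dim() - 1)),
                     device=expert_emb.device)
    x_hat = (eps * expert_emb + (1 - eps) * policy_emb).detach().requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()

class RewardNormalizer:
    """Running-mean centering plus clipping of the implicit reward r_phi(x)."""
    def __init__(self, momentum: float = 0.99, clip: float = 5.0):
        self.mean, self.momentum, self.clip = 0.0, momentum, clip

    def __call__(self, rewards: torch.Tensor) -> torch.Tensor:
        self.mean = self.momentum * self.mean + (1 - self.momentum) * rewards.mean().item()
        return (rewards - self.mean).clamp(-self.clip, self.clip)
```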

5. Empirical Evaluation

RARO was benchmarked on three reasoning-intensive tasks, in each case trained without verifier feedback:

| Task | Metric | Performance Summary |
| --- | --- | --- |
| Countdown | % of puzzles solved exactly | +10–15 pts over BC, +8–12 over preference RL |
| DeepMath | % of theorems proved after 60 s search | Scaling matches verifier-based RL; closes 80% of the gap |
| Poetry Writing | Human preference rate vs. supervised baseline | 62% prefer RARO vs. 38% for the best baseline |

Key findings:

  • RARO outperforms pure behavior cloning by 10–15 absolute points across all tasks.
  • It exceeds generic preference-model RL baselines by 8–12 points.
  • On tasks with verifiers (e.g., DeepMath, Countdown), scaling with model size and compute parallels verifier-based PPO approaches, closing most of the performance gap.
  • On subjective tasks (Poetry), human raters prefer RARO outputs at a 62% rate versus 38% for the next best baseline (Cai et al., 26 Nov 2025).

6. Ablations and Critical Design Factors

Ablation studies isolated the effect of RARO's components:

| Removed Component | Observed Failure Mode |
| --- | --- |
| Relativistic Term | Training collapse; matches behavior cloning |
| Critic Gradient Penalty | Mode collapse, unstable losses |
| Entropy Regularization | Policy overfits, generalization fails |
| Single-Time-Scale Updates | Critic lags, learning stagnates |
| Reward Centering/Clipping | High variance, slow convergence |

All ingredients—relativistic comparison, GP-based Lipschitz enforcement, entropy regularization, and two-time-scale updates—are jointly required for robust, verifier-free training.

7. Significance and Implications

RARO demonstrates that adversarial IRL, via a learned relativistic critic, can replace traditional hand-crafted or learned verifiers for reasoning tasks. This enables reinforcement learning of sophisticated reasoning behaviors in domains where ground-truth reward functions are unavailable. Its scaling properties allow LLMs to match the performance trends previously accessible only to verifier-based RL systems. A plausible implication is the generalization of RARO-style IRL to further domains where only expert demonstrations, not explicit evaluators, can be gathered, thereby broadening the applicability of advanced reasoning LLMs (Cai et al., 26 Nov 2025).

References (1)
