
RARO: Relativistic Adversarial Reasoning Optimization

Updated 23 December 2025
  • RARO is a verifier-free adversarial IRL algorithm that enables LLMs to learn complex reasoning solely from expert demonstrations without explicit reward functions.
  • It employs a minimax framework between a policy generator and a relativistic critic to derive implicit reward signals by comparing expert and policy-generated reasoning traces.
  • Empirical evaluations show RARO improves performance by 10-15 points over behavior cloning and effectively narrows the gap with verifier-based reinforcement learning methods.

Relativistic Adversarial Reasoning Optimization (RARO) is a verifier-free adversarial inverse reinforcement learning (IRL) algorithm developed for training LLMs to perform complex reasoning solely from expert demonstrations, without requiring explicit task-specific verifiers or hand-crafted reward functions. RARO leverages an adversarial game between a policy (generator) and a relativistic critic (discriminator), enabling the model to acquire robust, scalable reasoning abilities in settings where reward signals from verifiers are absent or infeasible to construct (Cai et al., 26 Nov 2025).

1. High-Level Framework

RARO operates in environments where only expert reasoning traces are available—for example, step-by-step mathematical proofs, solution paths, or creative text outputs—while ground-truth reward feedback is not directly accessible. The method instantiates an adversarial game:

  • Policy $\pi_\theta$ (Generator): Proposes reasoning outputs $x$ given inputs $s$ by stochastic sampling.
  • Relativistic Critic $D_\phi$ (Discriminator): Assigns scalar scores to outputs, facilitating pairwise comparison between expert-generated and policy-generated reasoning traces.

The learning loop for RARO consists of iteratively:

  • Sampling roll-outs from the policy and collecting corresponding expert traces.
  • Updating the critic to enhance its ability to relativistically distinguish expert from policy outputs, using only the distinction “expert” vs. “policy.”
  • Transforming the critic scores into an implicit reward signal $r_\phi(x)$ for the policy.
  • Updating the policy parameters using policy gradient RL to maximize these implicit rewards.

Crucially, the critic never sees a ground-truth numerical reward or verifier output, enforcing a purely adversarial IRL paradigm.
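
The sketch below outlines one such iteration in code. It is schematic only: the helper callables (`sample_expert_batch`, `generate`, `critic_update`, `implicit_reward`, `policy_update`) are hypothetical stand-ins for the components described above, not functions defined by the paper.

```python
from typing import Callable, List, Sequence, Tuple

def raro_iteration(
    sample_expert_batch: Callable[[], Tuple[List[str], List[str]]],  # -> (prompts, expert traces)
    generate: Callable[[str], str],                                  # policy roll-out for one prompt
    critic_update: Callable[[Sequence[str], Sequence[str]], None],   # one step on the "expert vs. policy" objective
    implicit_reward: Callable[[str], float],                         # r_phi(x) derived from critic scores
    policy_update: Callable[[Sequence[str], Sequence[str], Sequence[float]], None],
    k_D: int = 5,                                                    # critic steps per policy step (TTUR)
) -> None:
    # 1. Sample roll-outs from the current policy and collect the paired expert traces.
    prompts, expert_traces = sample_expert_batch()
    policy_traces = [generate(p) for p in prompts]

    # 2. Update the critic k_D times; it only ever sees the labels "expert" vs. "policy",
    #    never a ground-truth reward or verifier output.
    for _ in range(k_D):
        critic_update(expert_traces, policy_traces)

    # 3. Turn critic scores into an implicit reward r_phi(x) for each policy trace.
    rewards = [implicit_reward(x) for x in policy_traces]

    # 4. One policy-gradient step that maximizes the implicit rewards.
    policy_update(prompts, policy_traces, rewards)
```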

2. Minimax Objectives and Mathematical Formulation

RARO adapts the relativistic Generative Adversarial Network (GAN) objective to sequence (reasoning) generation. Let $\mathbb{E}_{x^E \sim E}$ denote expectation under the expert data distribution $E$, and $\mathbb{E}_{x \sim \pi_\theta}$ expectation under the policy $\pi_\theta$. Define the relativistic score:

$$\Delta D_\phi(x, x') = D_\phi(x) - D_\phi(x')$$

Critic Objective

The critic $D_\phi$ is trained to assign higher scores to expert outputs than to policy outputs, according to:

$$L_D(\phi; \theta) = -\mathbb{E}_{x^E \sim E} \left[ \log \sigma\left(D_\phi(x^E) - \mathbb{E}_{x \sim \pi_\theta}[D_\phi(x)]\right) \right] - \mathbb{E}_{x \sim \pi_\theta} \left[ \log\left(1 - \sigma\left(D_\phi(x) - \mathbb{E}_{x^E \sim E}[D_\phi(x^E)]\right)\right) \right]$$

with $\sigma(\cdot)$ denoting the logistic sigmoid. The critic is trained by minimizing this loss, $\min_\phi L_D(\phi; \theta)$.

Generator (Policy) Objective

The generator (policy) seeks to maximize the probability that the critic scores policy samples above the average expert score:

$$L_G(\theta; \phi) = -\mathbb{E}_{x \sim \pi_\theta} \left[ \log \sigma\left(D_\phi(x) - \mathbb{E}_{x^E \sim E}[D_\phi(x^E)]\right) \right]$$

The optimization is $\min_\theta L_G(\theta; \phi)$.
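
In code, both objectives reduce to relativistic logistic losses over scalar critic scores. The sketch below is a minimal illustration, assuming per-trace scores have already been gathered into 1-D tensors (the names `d_expert` and `d_policy` are illustrative); note that in RARO the policy is not trained by backpropagating through $L_G$ directly but via policy-gradient RL on the implicit reward derived from the same relativistic comparison.

```python
import torch
import torch.nn.functional as F

def critic_loss(d_expert: torch.Tensor, d_policy: torch.Tensor) -> torch.Tensor:
    """L_D: push expert scores above the mean policy score and policy scores below the mean expert score."""
    rel_expert = d_expert - d_policy.mean()   # D_phi(x^E) - E_{x ~ pi_theta}[D_phi(x)]
    rel_policy = d_policy - d_expert.mean()   # D_phi(x)   - E_{x^E ~ E}[D_phi(x^E)]
    # log(1 - sigmoid(z)) == logsigmoid(-z); written this way for numerical stability
    return -F.logsigmoid(rel_expert).mean() - F.logsigmoid(-rel_policy).mean()

def generator_objective(d_policy: torch.Tensor, d_expert: torch.Tensor) -> torch.Tensor:
    """L_G: policy samples should score above the average expert score."""
    return -F.logsigmoid(d_policy - d_expert.mean()).mean()
```

Minimizing `critic_loss` over $\phi$ and `generator_objective` over $\theta$ corresponds to the $\min_\phi L_D$ and $\min_\theta L_G$ updates above.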

Two-Time-Scale Updates (TTUR)

RARO employs separate update rates:

  • Take $k_D$ critic update steps per policy update, with learning rate $\alpha_D$.
  • Take one policy update step, with learning rate $\alpha_\theta$.

This ensures the critic rapidly tracks the evolving policy, maintaining stability and convergence.
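
A toy sketch of the schedule itself is given below; the modules, data, and squared-error losses are placeholders standing in for $L_D$ and $L_G$, and none of the numerical values come from the paper.

```python
import torch
from torch import nn

critic = nn.Linear(8, 1)   # stand-in for the relativistic critic D_phi
policy = nn.Linear(8, 8)   # stand-in for the policy's trainable parameters
alpha_D, alpha_theta, k_D = 1e-4, 1e-6, 5          # alpha_D >> alpha_theta

critic_opt = torch.optim.AdamW(critic.parameters(), lr=alpha_D)
policy_opt = torch.optim.AdamW(policy.parameters(), lr=alpha_theta)

for step in range(100):
    x = torch.randn(32, 8)
    for _ in range(k_D):                           # k_D critic updates ...
        critic_opt.zero_grad()
        critic(x).pow(2).mean().backward()         # placeholder for the critic loss L_D
        critic_opt.step()
    policy_opt.zero_grad()                         # ... followed by a single policy update
    policy(x).pow(2).mean().backward()             # placeholder for the policy objective
    policy_opt.step()
```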

3. Inverse Reinforcement Learning Interpretation

RARO can be viewed as a form of adversarial IRL:

  • The critic learns an implicit reward function $r_\phi(x)$, where

$$r_\phi(x) \approx D_\phi(x) - \mathbb{E}_{x' \sim \pi_\theta}[D_\phi(x')]$$

  • Once the critic reaches equilibrium, conventional policy gradient updates on $r_\phi(x)$ recover the maximum-causal-entropy IRL update of Ho & Ermon (2016).
  • Intuitively, the critic acts as a surrogate for a reward function under which expert traces are preferred over policy traces, and the policy is incentivized to match expert-like reasoning under this learned reward (see the sketch below).
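
As a concrete illustration of this interpretation, the implicit reward can be estimated with the batch mean of critic scores on policy samples and plugged into a standard policy-gradient surrogate. The tensor names below are illustrative; `log_probs` would hold sequence log-probabilities under $\pi_\theta$.

```python
import torch

def implicit_rewards(d_policy: torch.Tensor) -> torch.Tensor:
    # r_phi(x) ~= D_phi(x) - E_{x' ~ pi_theta}[D_phi(x')], with the expectation
    # estimated by the batch mean of critic scores on policy samples.
    return d_policy - d_policy.mean()

def reinforce_surrogate(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # REINFORCE-style surrogate: weight each trace's log-probability by its
    # (detached) implicit reward; minimizing this maximizes the expected reward.
    return -(rewards.detach() * log_probs).mean()
```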

4. RL Algorithms and Stabilization Techniques

Training such adversarial frameworks is generally unstable due to issues common in policy-gradient RL and GANs. RARO incorporates several stabilization techniques (two of them are sketched in code after this list):

  • Gradient Penalty: A penalty term

$$\lambda \,\mathbb{E}_{\hat{x}} \left[\left(\|\nabla_{\hat{x}} D_\phi(\hat{x})\|_2 - 1\right)^2\right]$$

is added to the critic loss, computed on interpolations $\hat{x}$ between expert and policy samples, enforcing a Lipschitz constraint on the critic.

  • Spectral Normalization: Applied to each transformer layer of $D_\phi$ for additional robustness.
  • Two-Time-Scale Updates: Setting $\alpha_D \gg \alpha_\theta$ ensures the critic remains well-adapted to the current policy distribution.
  • Reward Centering and Clipping: Subtracting a running mean from the implicit reward $r_\phi(x)$ and clipping it prevents exploding policy gradients and stabilizes advantage estimation.
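
The sketch below illustrates two of these techniques, under the assumption that the critic scores continuous embeddings of reasoning traces (so that the interpolation point $\hat{x}$ can be formed in that embedding space); function names and default values are illustrative, not taken from the paper.

```python
import torch
from torch import nn

def gradient_penalty(critic: nn.Module, expert_emb: torch.Tensor,
                     policy_emb: torch.Tensor, lam: float = 10.0) -> torch.Tensor:
    """WGAN-GP style penalty added to the critic loss, computed on random
    interpolations x_hat between expert and policy embeddings."""
    eps = torch.rand(expert_emb.size(0), *([1] * (expert_emb.dim() - 1)),
                     device=expert_emb.device)
    x_hat = (eps * expert_emb + (1 - eps) * policy_emb).detach().requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()

class RewardNormalizer:
    """Running-mean centering plus clipping of the implicit reward r_phi(x)."""
    def __init__(self, momentum: float = 0.99, clip: float = 5.0):
        self.mean, self.momentum, self.clip = 0.0, momentum, clip

    def __call__(self, rewards: torch.Tensor) -> torch.Tensor:
        self.mean = self.momentum * self.mean + (1 - self.momentum) * rewards.mean().item()
        return (rewards - self.mean).clamp(-self.clip, self.clip)
```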

5. Empirical Evaluation

RARO was benchmarked on three reasoning-intensive tasks, in each case trained without verifier feedback:

| Task | Metric | Performance Summary |
| --- | --- | --- |
| Countdown | % of puzzles solved exactly | +10–15 pts over BC, +8–12 over preference RL |
| DeepMath | % of theorems proved after 60 s search | Scaling matches verifier-based RL; closes 80% of the gap |
| Poetry Writing | Human preference rate vs. supervised baseline | 62% prefer RARO vs. 38% for the best baseline |

Key findings:

  • RARO outperforms pure behavior cloning by 10–15 absolute points across all tasks.
  • It exceeds generic preference-model RL baselines by 8–12 points.
  • On tasks with verifiers (e.g., DeepMath, Countdown), scaling with model size and compute parallels verifier-based PPO approaches, closing most of the performance gap.
  • On subjective tasks (Poetry), human raters prefer RARO outputs at a 62% rate versus 38% for the next best baseline (Cai et al., 26 Nov 2025).

6. Ablations and Critical Design Factors

Ablation studies isolated the effect of RARO's components:

| Removed Component | Observed Failure Mode |
| --- | --- |
| Relativistic Term | Training collapse; matches behavior cloning |
| Critic Gradient Penalty | Mode collapse, unstable losses |
| Entropy Regularization | Policy overfits, generalization fails |
| Single-Time-Scale Updates | Critic lags, learning stagnates |
| Reward Centering/Clipping | High variance, slow convergence |

All ingredients—relativistic comparison, GP-based Lipschitz enforcement, entropy regularization, and two-time-scale updates—are jointly required for robust, verifier-free training.

7. Significance and Implications

RARO demonstrates that adversarial IRL, via a learned relativistic critic, can replace traditional hand-crafted or learned verifiers for reasoning tasks. This enables reinforcement learning of sophisticated reasoning behaviors in domains where ground-truth reward functions are unavailable. Its scaling properties allow LLMs to match the performance trends previously accessible only to verifier-based RL systems. A plausible implication is the generalization of RARO-style IRL to further domains where only expert demonstrations, not explicit evaluators, can be gathered, thereby broadening the applicability of advanced reasoning LLMs (Cai et al., 26 Nov 2025).

References (1)
