RARO: Relativistic Adversarial Reasoning Optimization
- RARO is a verifier-free adversarial IRL algorithm that enables LLMs to learn complex reasoning solely from expert demonstrations without explicit reward functions.
- It employs a minimax framework between a policy generator and a relativistic critic to derive implicit reward signals by comparing expert and policy-generated reasoning traces.
- Empirical evaluations show RARO improves performance by 10–15 points over behavior cloning and effectively narrows the gap with verifier-based reinforcement learning methods.
Relativistic Adversarial Reasoning Optimization (RARO) is a verifier-free adversarial inverse reinforcement learning (IRL) algorithm developed for training LLMs to perform complex reasoning solely from expert demonstrations, without requiring explicit task-specific verifiers or hand-crafted reward functions. RARO leverages an adversarial game between a policy (generator) and a relativistic critic (discriminator), enabling the model to acquire robust, scalable reasoning abilities in settings where reward signals from verifiers are absent or infeasible to construct (Cai et al., 26 Nov 2025).
1. High-Level Framework
RARO operates in environments where only expert reasoning traces are available—for example, step-by-step mathematical proofs, solution paths, or creative text outputs—while ground-truth reward feedback is not directly accessible. The method instantiates an adversarial game:
- Policy (Generator): Proposes reasoning outputs given inputs by stochastic sampling.
- Relativistic Critic (Discriminator): Assigns scalar scores to outputs, facilitating pairwise comparison between expert-generated and policy-generated reasoning traces.
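The paper specifies the critic only as a scalar-scoring model whose transformer layers are spectrally normalized (see Section 4). As a minimal illustration of what such a critic could look like, the PyTorch sketch below uses a small Transformer encoder with a mean-pooled scalar head; the architecture and hyperparameters are assumptions for exposition, not the configuration reported by the authors.

```python
import torch
import torch.nn as nn

class RelativisticCritic(nn.Module):
    """Illustrative critic: encodes a (prompt, reasoning-trace) token sequence
    and maps the pooled representation to a single scalar score C_phi(x)."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score_head = nn.Linear(d_model, 1)      # one scalar score per sequence

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))      # (batch, seq_len, d_model)
        pooled = h.mean(dim=1)                       # simple mean pooling over tokens
        return self.score_head(pooled).squeeze(-1)   # (batch,)

# Example: score a batch of two token sequences of length 16.
critic = RelativisticCritic(vocab_size=1000)
scores = critic(torch.randint(0, 1000, (2, 16)))     # tensor of shape (2,)
```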
The learning loop for RARO consists of iteratively:
- Sampling roll-outs from the policy and collecting corresponding expert traces.
- Updating the critic to enhance its ability to relativistically distinguish expert from policy outputs, using only the distinction “expert” vs. “policy.”
- Transforming the critic scores into an implicit reward signal for the policy.
- Updating the policy parameters using policy gradient RL to maximize these implicit rewards.
Crucially, the critic never sees a ground-truth numerical reward or verifier output, enforcing a purely adversarial IRL paradigm.
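The loop below is a minimal, self-contained PyTorch sketch of this iteration. It replaces the LLM and token-level reasoning traces with toy feature vectors and linear modules, and uses a plain REINFORCE-style surrogate in place of the paper's PPO update; all names and hyperparameters (`sample_expert_traces`, `K_CRITIC_STEPS`, learning rates) are illustrative placeholders rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sample_expert_traces(batch_size):
    """Stand-in for drawing expert reasoning traces from the demonstration set."""
    return torch.randn(batch_size, 8) + 1.0          # toy 8-dim 'trace features'

def sample_policy_traces(batch_size):
    """Stand-in for sampling reasoning traces from the current policy pi_theta."""
    return torch.randn(batch_size, 8)

critic = torch.nn.Linear(8, 1)                       # toy scalar critic C_phi
policy = torch.nn.Linear(8, 1)                       # toy stand-in for pi_theta
opt_critic = torch.optim.Adam(critic.parameters(), lr=5e-5)
opt_policy = torch.optim.Adam(policy.parameters(), lr=1e-5)
K_CRITIC_STEPS = 5                                   # critic updates per policy update

for iteration in range(50):
    expert = sample_expert_traces(32)
    rollout = sample_policy_traces(32)

    # 1) Critic: learn to score expert traces above policy traces (K steps).
    for _ in range(K_CRITIC_STEPS):
        s_e = critic(expert).squeeze(-1)
        s_p = critic(rollout).squeeze(-1)
        loss_c = -(F.logsigmoid(s_e - s_p.mean()).mean()
                   + F.logsigmoid(s_e.mean() - s_p).mean())
        opt_critic.zero_grad()
        loss_c.backward()
        opt_critic.step()

    # 2) Turn critic scores into an implicit reward for each policy sample.
    with torch.no_grad():
        reward = F.logsigmoid(critic(rollout).squeeze(-1) - critic(expert).mean())

    # 3) Policy gradient step on the implicit reward. `policy(rollout)` stands in
    #    for log pi_theta(trace); RARO performs a clipped PPO update instead.
    log_prob = policy(rollout).squeeze(-1)
    surrogate = -(reward * log_prob).mean()
    opt_policy.zero_grad()
    surrogate.backward()
    opt_policy.step()
```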
2. Minimax Objectives and Mathematical Formulation
RARO adapts the relativistic Generative Adversarial Network (GAN) objective to sequence (reasoning) generation. Let $\mathbb{E}_{x_E \sim p_E}$ denote the expectation under the expert data distribution $p_E$, and $\mathbb{E}_{x_\pi \sim \pi_\theta}$ the expectation under the policy $\pi_\theta$. With $C_\phi(x)$ the scalar critic score of an output $x$, define the relativistic score of a sample against the expert average:

$$\Delta_\phi(x) = C_\phi(x) - \mathbb{E}_{x_E \sim p_E}\big[C_\phi(x_E)\big].$$

Critic Objective
The critic is trained to assign higher scores to expert outputs than to policy outputs, according to:

$$\mathcal{L}_C(\phi) = -\,\mathbb{E}_{x_E \sim p_E}\Big[\log \sigma\big(C_\phi(x_E) - \mathbb{E}_{x_\pi \sim \pi_\theta}[C_\phi(x_\pi)]\big)\Big] - \mathbb{E}_{x_\pi \sim \pi_\theta}\Big[\log \sigma\big(\mathbb{E}_{x_E \sim p_E}[C_\phi(x_E)] - C_\phi(x_\pi)\big)\Big],$$

with $\sigma$ denoting the logistic sigmoid. The optimization is $\min_\phi \mathcal{L}_C(\phi)$.
Generator (Policy) Objective
The generator (policy) seeks to maximize the probability that the critic scores policy samples above the average expert:

$$\mathcal{L}_G(\theta) = -\,\mathbb{E}_{x_\pi \sim \pi_\theta}\big[\log \sigma\big(\Delta_\phi(x_\pi)\big)\big] = -\,\mathbb{E}_{x_\pi \sim \pi_\theta}\Big[\log \sigma\big(C_\phi(x_\pi) - \mathbb{E}_{x_E \sim p_E}[C_\phi(x_E)]\big)\Big].$$

The optimization is $\min_\theta \mathcal{L}_G(\theta)$.
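To make the two objectives concrete, the functions below implement them in PyTorch over precomputed critic scores, following the relativistic-average form described above; this is a sketch consistent with that description rather than the paper's verbatim code.

```python
import torch
import torch.nn.functional as F

def critic_loss(expert_scores: torch.Tensor, policy_scores: torch.Tensor) -> torch.Tensor:
    """L_C: push expert scores above the mean policy score and
    policy scores below the mean expert score."""
    term_expert = -F.logsigmoid(expert_scores - policy_scores.mean()).mean()
    term_policy = -F.logsigmoid(expert_scores.mean() - policy_scores).mean()
    return term_expert + term_policy

def generator_loss(expert_scores: torch.Tensor, policy_scores: torch.Tensor) -> torch.Tensor:
    """L_G: maximize the probability that a policy sample is scored above the
    average expert, i.e. minimize -log sigma(C(x_pi) - mean C(x_E))."""
    return -F.logsigmoid(policy_scores - expert_scores.mean()).mean()

# Toy usage with random numbers standing in for critic scores C_phi(x):
expert_s = torch.randn(16) + 1.0
policy_s = torch.randn(16)
print(critic_loss(expert_s, policy_s).item(), generator_loss(expert_s, policy_s).item())
```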
Two-Time-Scale Updates (TTUR)
RARO employs separate update rates:
- Take $k$ critic update steps per policy update, with learning rate $\eta_C$.
- Take one policy update step, with learning rate $\eta_\pi$.
This ensures the critic rapidly tracks the evolving policy, maintaining stability and convergence.
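A minimal illustration of this schedule is given below; the learning rates, the value of $k$, and the dummy losses are placeholders rather than the paper's settings.

```python
import torch

critic = torch.nn.Linear(16, 1)                              # stand-in for C_phi
policy = torch.nn.Linear(16, 16)                             # stand-in for pi_theta
opt_critic = torch.optim.Adam(critic.parameters(), lr=5e-5)  # eta_C (fast)
opt_policy = torch.optim.Adam(policy.parameters(), lr=1e-5)  # eta_pi (slow), eta_pi < eta_C
K = 5                                                        # critic steps per policy step

x = torch.randn(8, 16)
for outer_step in range(10):
    for _ in range(K):                     # fast time scale: K critic updates
        loss_c = critic(x).pow(2).mean()   # dummy loss; in RARO this is the critic objective
        opt_critic.zero_grad()
        loss_c.backward()
        opt_critic.step()
    loss_pi = policy(x).pow(2).mean()      # slow time scale: one (PPO) policy update in RARO
    opt_policy.zero_grad()
    loss_pi.backward()
    opt_policy.step()
```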
3. Inverse Reinforcement Learning Interpretation
RARO can be viewed as a form of adversarial IRL:
- The critic learns an implicit reward function $r_\phi$, where $r_\phi(x) = \log \sigma\big(C_\phi(x) - \mathbb{E}_{x_E \sim p_E}[C_\phi(x_E)]\big) = \log \sigma\big(\Delta_\phi(x)\big)$.
- Once the critic reaches equilibrium, conventional policy gradient updates on $r_\phi$ recover the maximum-causal-entropy IRL update of Ho & Ermon (2016).
- Intuitively, the critic serves as a surrogate reward function under which expert traces are preferred over policy traces, and the policy is incentivized to produce expert-like reasoning under this learned reward (a minimal sketch follows this list).
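One way to realize this implicit reward, consistent with the relativistic generator objective above, is sketched below; the exact functional form used by the authors may differ.

```python
import torch
import torch.nn.functional as F

def implicit_reward(policy_scores: torch.Tensor, expert_scores: torch.Tensor) -> torch.Tensor:
    """r_phi(x) = log sigma(C_phi(x) - mean_expert C_phi): the log-probability, under
    the relativistic critic, that a policy sample outranks the average expert."""
    return F.logsigmoid(policy_scores - expert_scores.mean())

# Toy usage: rewards are <= 0 and approach 0 as samples look more expert-like.
print(implicit_reward(torch.tensor([0.2, 1.5, -0.7]), torch.tensor([1.0, 0.8, 1.2])))
```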
4. RL Algorithms and Stabilization Techniques
Training such adversarial frameworks is generally unstable due to issues common in policy-gradient RL and GANs. RARO incorporates several stabilization innovations:
- Proximal Policy Optimization (PPO): The policy update utilizes a clipped surrogate loss, entropy regularization for exploration, and a KL penalty to maintain proximity to the initial policy if warm-started from a supervised checkpoint.
- Critic Gradient Penalty (WGAN-GP): Adds a penalty term $\lambda_{\mathrm{GP}}\,\mathbb{E}_{\hat{x}}\big[(\|\nabla_{\hat{x}} C_\phi(\hat{x})\|_2 - 1)^2\big]$ on interpolations $\hat{x}$ between expert and policy samples, enforcing a Lipschitz constraint on the critic (see the sketch after this list).
- Spectral Normalization: Applied to each transformer layer of the critic $C_\phi$ for additional robustness.
- Two-Time-Scale Updates: Setting $\eta_C > \eta_\pi$ (with $k \ge 1$ critic steps per policy step) ensures the critic remains well-adapted to the current policy distribution.
- Reward Centering and Clipping: Subtracting a running mean and clipping the implicit reward prevents exploding policy gradients and stabilizes advantage estimation.
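The gradient penalty and the reward centering/clipping steps are sketched below. Interpolating between expert and policy samples is done here in a continuous feature/embedding space (an assumption, since reasoning traces are discrete token sequences), and the penalty weight, clip range, and momentum are illustrative values.

```python
import torch

def gradient_penalty(critic, expert_feats, policy_feats, gp_weight=10.0):
    """WGAN-GP style penalty: sample points on the segment between expert and
    policy feature vectors and penalize critic gradients whose norm deviates from 1."""
    eps = torch.rand(expert_feats.size(0), 1)
    interp = (eps * expert_feats + (1 - eps) * policy_feats).requires_grad_(True)
    grads = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    return gp_weight * ((grads.norm(2, dim=-1) - 1.0) ** 2).mean()

class RewardNormalizer:
    """Running-mean centering plus clipping of the implicit reward."""
    def __init__(self, clip=5.0, momentum=0.99):
        self.mean, self.clip, self.momentum = 0.0, clip, momentum

    def __call__(self, rewards: torch.Tensor) -> torch.Tensor:
        self.mean = self.momentum * self.mean + (1 - self.momentum) * rewards.mean().item()
        return (rewards - self.mean).clamp(-self.clip, self.clip)

# Toy usage with a linear critic over 8-dim 'trace features':
critic = torch.nn.Linear(8, 1)
print(gradient_penalty(critic, torch.randn(4, 8), torch.randn(4, 8)).item())
print(RewardNormalizer()(torch.randn(4)))
```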
5. Empirical Evaluation
RARO was benchmarked on three reasoning-intensive tasks, trained without access to verifier signals:
| Task | Metric | Performance Summary |
|---|---|---|
| Countdown | % of puzzles solved exactly | +10–15 pts over BC, +8–12 over preference RL |
| DeepMath | % of theorems proved after 60 s search | Scaling matches verifier-based RL; closes ≈80% of the gap |
| Poetry Writing | Human preference rate vs. supervised baseline | 62% prefer RARO vs. 38% for the best baseline |
Key findings:
- RARO outperforms pure behavior cloning by 10–15 absolute points across all tasks.
- It exceeds generic preference-model RL baselines by 8–12 points.
- On the tasks where verifiers do exist for evaluation (Countdown, DeepMath), scaling with model size and compute parallels verifier-based PPO approaches, closing most of the performance gap.
- On subjective tasks (Poetry), human raters prefer RARO outputs at a 62% rate versus 38% for the next best baseline (Cai et al., 26 Nov 2025).
6. Ablations and Critical Design Factors
Ablation studies isolated the effect of RARO's components:
| Removed Component | Observed Failure Mode |
|---|---|
| Relativistic Term | Training collapse, matches behavior cloning |
| Critic Gradient Penalty | Mode collapse, unstable losses |
| Entropy Regularization | Policy overfits, generalization fails |
| Single-Time-Scale Updates | Critic lags, learning stagnates |
| Reward Centering/Clipping | High variance, slow convergence |
All ingredients—relativistic comparison, GP-based Lipschitz enforcement, entropy regularization, and two-time-scale updates—are jointly required for robust, verifier-free training.
7. Significance and Implications
RARO demonstrates that adversarial IRL, via a learned relativistic critic, can replace traditional hand-crafted or learned verifiers for reasoning tasks. This enables reinforcement learning of sophisticated reasoning behaviors in domains where ground-truth reward functions are unavailable. Its scaling properties allow LLMs to match the robust performance trends previously accessible only to verifier-based RL systems. A plausible implication is the generalization of RARO-style IRL to further domains where only expert demonstrations, not explicit evaluators, can be gathered, thereby broadening the applicability of advanced reasoning LLMs (Cai et al., 26 Nov 2025).