
Escaping the Verifier: Learning to Reason via Demonstrations

Published 26 Nov 2025 in cs.LG and cs.AI | (2511.21667v1)

Abstract: Training LLMs to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization) that learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning. Our method sets up an adversarial interaction between a policy (generator) and a relativistic critic (discriminator): the policy learns to mimic expert answers, while the critic learns to compare and distinguish between policy and expert answers. Our method trains both the policy and the critic jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks -- Countdown, DeepMath, and Poetry Writing -- and enjoys the same robust scaling trends as RL on verifiable tasks. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.

Summary

  • The paper introduces RARO, an adversarial IRL framework that pairs a policy with a relativistic critic to train LLM reasoning solely from expert demonstrations.
  • It combines joint policy-critic optimization with stabilization techniques to overcome mode collapse and catastrophic forgetting in sequential reasoning tasks.
  • Experimental results show that RARO outperforms strong verifier-free baselines on both verifiable and creative tasks without needing reward oracles.

RARO: Relativistic Adversarial Reasoning Optimization for Verifier-Free Learning

Motivation and Problem Formulation

Most advanced LLMs trained for complex reasoning currently rely on reinforcement learning with task-specific, domain-bound verifiers to supply reward signals. However, many high-value real-world tasks, especially those with open-ended or creative outputs, lack such verifiers, limiting the applicability of RL protocols and leaving abundant expert demonstrations under-exploited. Existing approaches either require a reward oracle or rely on imitation learning, which struggles to generalize in sequential, reasoning-intensive domains due to covariate shift and compounding errors. This work addresses the challenge of obtaining robust reasoning performance in domains where only demonstrations are available and no reward verifier exists.

Methodology: Relativistic Adversarial Reasoning Optimization (RARO)

RARO leverages Inverse Reinforcement Learning (IRL) with a relativistic adversarial formulation inspired by GAN-driven imitation learning, introducing several advances tailored for reasoning with LLMs:

  • Architecture: RARO establishes an adversarial, game-theoretic structure between a policy (generator) and a relativistic critic (discriminator). The policy is trained to produce answers indistinguishable from expert demonstrations, while the critic is trained to assess and distinguish the relative quality of policy and expert responses.
  • Relativistic Critic: Unlike classical binary discriminators, the critic evaluates pairs of expert and policy outputs, computing a relativistic score that reflects a preference for one over the other rather than absolute correctness. This comparison-based setup is empirically more robust on reasoning tasks.
  • Joint Optimization: The policy and relativistic critic are trained continually and jointly within an RL loop (a minimal sketch of this loop follows the list). The critic's relativistic advantage function augments the reward landscape and stabilizes updates, mitigating mode collapse and catastrophic forgetting, the persistent failure modes of prior adversarial IRL approaches in sequential prediction.
  • Stabilization Techniques: RARO incorporates spectral normalization, two-time-scale update rules, and other techniques established in the GAN literature to ensure convergence and stable adversarial training (illustrated in the second sketch below).

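To make the relativistic reward concrete, the following minimal Python sketch shows one plausible form of the pairwise signal and of the joint update schedule. The Bradley-Terry-style parameterization, the critic_score stub, and the toy data are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a relativistic, pairwise reward signal (assumed
# Bradley-Terry-style form); `critic_score` and the toy data are placeholders.
import math
import random

def critic_score(prompt: str, answer: str) -> float:
    """Stand-in for the learned critic C(prompt, answer) -> scalar.
    In RARO this would be an LLM-based scorer trained jointly with the policy."""
    return random.gauss(0.0, 1.0)  # placeholder score

def relativistic_reward(prompt: str, policy_answer: str, expert_answer: str) -> float:
    """Policy reward: estimated probability that the critic prefers the policy
    answer over the paired expert answer, sigma(C(y_policy) - C(y_expert))."""
    s_policy = critic_score(prompt, policy_answer)
    s_expert = critic_score(prompt, expert_answer)
    return 1.0 / (1.0 + math.exp(-(s_policy - s_expert)))

# Joint loop (schematic): critic and policy are updated continually, in tandem.
for step in range(3):
    prompt, expert_answer = "2 + 2 * 3 = ?", "8"   # expert demonstration pair
    policy_answer = "6"                            # sampled from the current policy
    reward = relativistic_reward(prompt, policy_answer, expert_answer)
    # 1) critic update: push C(expert) above C(policy) on this pair
    # 2) policy update: maximize `reward` with a policy-gradient step
    print(f"step {step}: relativistic reward = {reward:.3f}")
```

In the actual method both the policy and the critic are LLMs updated with RL; the sketch only fixes the shape of the pairwise reward that replaces a task-specific verifier.
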
RARO thus generalizes successful adversarial learning protocols to the sequential, structured domains that characterize reasoning, where traditional GAN or IRL algorithms are brittle.
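
As an illustration of the stabilizers named in the list above, the short PyTorch snippet below shows what spectral normalization on the critic and a two-time-scale update rule typically look like in practice; the layer sizes, learning rates, and module choices are placeholders rather than RARO's actual configuration.

```python
# Illustrative stabilizers (not RARO's exact setup): spectral normalization on
# the critic and a two-time-scale update rule (distinct learning rates).
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

critic = nn.Sequential(
    spectral_norm(nn.Linear(768, 256)),  # spectral norm bounds the layer's Lipschitz constant
    nn.ReLU(),
    spectral_norm(nn.Linear(256, 1)),
)
policy_head = nn.Linear(768, 32000)      # stand-in for the policy model's output layer

# Two-time-scale update rule: the critic uses a larger learning rate than the
# policy so it can track the evolving policy distribution without diverging.
critic_opt = torch.optim.Adam(critic.parameters(), lr=4e-4)
policy_opt = torch.optim.Adam(policy_head.parameters(), lr=1e-4)
```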

Experimental Analysis and Numerical Evaluation

RARO is evaluated on three tasks spanning both verifiable and non-verifiable reasoning:

  • Countdown: Multi-step arithmetic with long-horizon dependencies.
  • DeepMath: Open-ended mathematical problem solving.
  • Poetry Writing: Creative, non-verifiable generation (the contrast with the verifiable tasks is illustrated by the toy verifier sketch after this list).
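
To make the verifiable/non-verifiable distinction concrete, the toy checker below sketches what a verifier for Countdown could look like, assuming the standard setup (combine the given numbers with +, -, *, / to reach a target); no analogous check exists for poetry, which is the regime RARO targets. The function name and exact rules here are illustrative assumptions.

```python
# Toy Countdown verifier (illustrative; assumes the standard rules, with each
# number usable at most once). Nothing comparable exists for poetry writing.
import ast

def countdown_verifier(expression: str, numbers: list[int], target: int) -> bool:
    """Return True if `expression` reaches `target` using only allowed numbers and ops."""
    tree = ast.parse(expression, mode="eval")
    allowed = (ast.Expression, ast.BinOp, ast.Constant, ast.Add, ast.Sub,
               ast.Mult, ast.Div, ast.UnaryOp, ast.USub)
    if not all(isinstance(node, allowed) for node in ast.walk(tree)):
        return False
    pool = list(numbers)  # each provided number may be used at most once
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant):
            if node.value not in pool:
                return False
            pool.remove(node.value)
    return abs(eval(compile(tree, "<expr>", "eval")) - target) < 1e-9

print(countdown_verifier("25 * 3 + 7", [3, 7, 25, 50], 82))  # True
print(countdown_verifier("25 * 3 + 5", [3, 7, 25, 50], 82))  # False: 5 is not in the pool
```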

Key results include:

  • Outperforms Strong Baselines: RARO achieves higher accuracy, diversity, and faithfulness than supervised fine-tuning (SFT), imitation learning, and alternative GAN- or IRL-inspired imitation approaches, particularly on long-horizon and creative tasks.
  • Scaling Trends: RARO shows scaling curves in line with RL on tasks with available verifiers, indicating that reasoning improvement is sustained as the number of expert demonstrations and policy updates increase—an outcome not previously established in verifier-free RL/IL scenarios.
  • No Verifier Required: On tasks such as poetry writing, where reward oracles are unavailable, RARO still attains meaningful improvements over SFT.
  • Robustness: The method is substantially less prone to catastrophic forgetting and mode collapse, empirically verified through training stability diagnostics.
  • Comparison to Recent Work: Benchmarks against "Reinforcing General Reasoning without Verifiers" (Zhou et al., 27 May 2025), direct preference optimization (DPO) (Rafailov et al., 2023), and related IRL approaches show consistently better sample efficiency and generalization from demonstrations.

Theoretical and Practical Implications

RARO demonstrates that adversarial IRL with a relativistic critic and joint optimization constitutes a viable and performant protocol for acquiring robust reasoning solely from demonstrations, without reward engineering or verifiers. This represents a significant methodological pivot for the LLM post-training landscape, directly addressing the practical limitations of reward-driven, verifier-centric RL algorithms. The formal equivalence between relativistic critics and preference-based reward models further connects this approach to recent developments in human preference optimization and to the practical deployment of reasoning-centric LLMs.
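
One way to see that equivalence (with notation assumed here rather than taken verbatim from the paper): a critic C(x, y) that scores answers and is trained to prefer the expert answer over the policy answer on each pair has exactly the Bradley-Terry form used by pairwise preference reward models.

```latex
\[
  P\bigl(y_{\text{expert}} \succ y_{\text{policy}} \mid x\bigr)
    = \sigma\!\bigl(C(x, y_{\text{expert}}) - C(x, y_{\text{policy}})\bigr),
  \qquad
  \mathcal{L}_{\text{critic}}
    = -\,\mathbb{E}\Bigl[\log \sigma\!\bigl(C(x, y_{\text{expert}}) - C(x, y_{\text{policy}})\bigr)\Bigr].
\]
```

Minimizing this loss is the standard preference-model objective with expert answers labeled as preferred, which is why the relativistic critic can stand in for a learned reward model.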

Practically, RARO opens up new application avenues for LLMs in scientific discovery, the arts, open-ended dialog, and other fields where human-like reasoning is required but automatic verification is absent. The scalability of RARO makes it promising for future LLM pretraining and continual learning regimes that must generalize across both verifiable and non-verifiable settings.

Speculations on Future Research

  • Scaling to Web-Scale Demonstrations: With proper stabilization, RARO could leverage massive, unlabeled corpora of expert output, suggesting a direction for hybrid SFT-IRL protocols at internet scale.
  • Extensions to Multimodal Reasoning: Adapting relativistic adversarial critics to vision-language and multimodal LLMs could further generalize the approach.
  • Preference Modeling: Integration with reward learning from human preferences or via collective knowledge distillation could further enhance reasoning fidelity.
  • Combination with Verifier-Aided RL: RARO can complement, rather than replace, verifier-dependent RL by bootstrapping on tasks lacking reward signals, then fine-tuning where verifiers emerge.

Conclusion

RARO establishes a new state of the art for verifier-free, reasoning-focused LLM training from demonstrations. Its robust adversarial IRL architecture, relativistic critic design, and empirical superiority over imitation or preference-only protocols mark a distinct advance toward generalizable, demonstration-driven LLM reasoning. This work provides both theoretical justification and practical, scalable methods for reasoning without the need for verifiers, with extensive implications for future LLM alignment, creativity, and robust deployment across domains lacking reward oracles (2511.21667).
