Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of LLM post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for low CIR and SR; and (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.
Explain it Like I'm 14
1) What this paper is about
This paper looks at how AI models “show their work” when solving problems. Many modern LLMs are trained to write out step‑by‑step reasoning before giving an answer. The authors ask a simple but important question: when a model is rewarded only for getting the final answer right, do its written steps actually matter and can other people verify them?
They find that just rewarding correct answers often does not make the model’s written reasoning useful, trustworthy, or even connected to how it really got the answer. But they also show two easy ways to fix this.
2) The key questions
The paper focuses on three easy‑to‑grasp questions:
- Does training with “outcome-only” rewards (rewarding only the final answer) make the model’s reasoning both used and checkable?
- If we show the model a few examples of good, clear reasoning first, does that help it reason better later?
- Can we add extra rewards that directly encourage better reasoning, not just correct answers?
3) How they studied it (in everyday terms)
Think of math class:
- If a teacher only checks your final answer, you might not bother to write clear steps.
- If the teacher also checks whether your steps are used to get the answer and whether a classmate could follow them, you’ll “show your work” better.
The authors trained and tested LLMs on many small puzzle‑like tasks (from a benchmark called ReasoningGym). They used popular open models (like Qwen2.5 and Llama) and tried different training recipes.
They created two simple “fairness checks” for reasoning:
- Causal Importance of Reasoning (CIR): Do the written steps actually influence the model’s answer?
- How they test it: Imagine the model writes a long explanation. The researchers chop the explanation earlier and earlier and force the model to answer sooner. If the answer changes a lot when you cut off steps, the steps were important. If the answer barely changes, the model probably decided way before writing anything, and the steps are just decoration.
- Sufficiency of Reasoning (SR): Are the written steps clear enough that someone else can get the same answer using only those steps?
- How they test it: They give the reasoning steps to a strong “verifier” model (like a very smart student) and ask it to guess the answer twice: once with the original question and once without it. If the guess is the same both times, the written reasoning is self‑contained and clear.
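The truncation test for CIR described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation. It assumes we already have, for each truncation point, the model's probability of the correct answer (the hypothetical `p_correct` list, ordered from shortest prefix to full reasoning). Each probability is treated as a Bernoulli parameter, and the Jensen-Shannon divergence between consecutive truncation points is averaged. The averaging rule and all function names are assumptions made for this example.

```python
import math
from typing import Sequence


def _bernoulli_kl(p: float, q: float) -> float:
    """KL(Bern(p) || Bern(q)) in bits; terms with zero mass contribute 0."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log2(a / b)
    return total


def bernoulli_js(p: float, q: float) -> float:
    """Jensen-Shannon divergence between Bern(p) and Bern(q), in [0, 1]."""
    m = 0.5 * (p + q)  # mixture parameter
    return 0.5 * _bernoulli_kl(p, m) + 0.5 * _bernoulli_kl(q, m)


def cir_sketch(p_correct: Sequence[float]) -> float:
    """Average JS divergence between consecutive truncation points.

    ASSUMPTION: the aggregation (mean over consecutive pairs) is chosen
    for illustration; the paper's exact formula is not given in this summary.
    """
    if len(p_correct) < 2:
        return 0.0
    steps = [bernoulli_js(a, b) for a, b in zip(p_correct, p_correct[1:])]
    return sum(steps) / len(steps)


# Reasoning that is "decoration": the answer is fixed before any steps.
decorative = cir_sketch([0.90, 0.90, 0.90, 0.90])
# Reasoning that matters: confidence builds as more steps are kept.
important = cir_sketch([0.10, 0.30, 0.60, 0.90])
print(decorative, important)
```

In this toy setup the first score is zero because cutting steps never changes the answer probability, while the second is positive because each extra chunk of reasoning shifts it.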
They also tried:
- Training the model without any written reasoning (answer only) to see if reasoning is truly helping.
- A small amount of supervised fine‑tuning (SFT): feeding the model a handful of high‑quality “show‑your‑work” examples before the main training.
- Extra rewards that directly boost CIR or SR during training, on top of the normal “got it right” reward.
- A stricter SR check (called SR−) that removes any paraphrases of the question from the reasoning, to prevent “cheating” by just repeating the question.
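The SR check, and the stricter SR− variant from the list above, can be sketched the same way. This is a hedged illustration that follows this summary's description (ask a verifier with and without the question, then compare the two answers). The `verifier` and `scrub` callables are hypothetical stand-ins: the paper uses a strong LLM as the verifier, while the toy verifier below just reads the last number in the text so that the example runs on its own.

```python
import re
from typing import Callable, Optional, Sequence

# A verifier takes (question or None, reasoning) and returns an answer string.
Verifier = Callable[[Optional[str], str], str]


def sr_sketch(
    question: str,
    reasonings: Sequence[str],
    verifier: Verifier,
    scrub: Optional[Callable[[str], str]] = None,
) -> float:
    """Fraction of reasoning chains whose verifier answer is unchanged
    when the question is hidden.

    ASSUMPTION: `scrub` stands in for the SR- step that removes question
    paraphrases from the reasoning; how the paper does this is not shown here.
    """
    if not reasonings:
        return 0.0
    agree = 0
    for reasoning in reasonings:
        text = scrub(reasoning) if scrub is not None else reasoning
        with_question = verifier(question, text)
        without_question = verifier(None, text)
        agree += int(with_question.strip() == without_question.strip())
    return agree / len(reasonings)


def toy_verifier(question: Optional[str], reasoning: str) -> str:
    """Toy stand-in for an LLM verifier: answer with the last number found."""
    numbers = re.findall(r"-?\d+", reasoning)
    if numbers:
        return numbers[-1]
    # Without concrete steps, the answer depends on seeing the question.
    return "unknown" if question is None else "guess-from-question"


chains = ["2 plus 3 gives 5", "it is obvious"]
print(sr_sketch("What is 2 + 3?", chains, toy_verifier))
```

The first chain is self-contained, so the toy verifier gives the same answer with or without the question; the second is not, so the two answers differ.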
4) What they found and why it matters
Here are the main results:
- Outcome-only rewards often don’t improve reasoning quality:
- On many tasks, models got better at producing correct answers while their CIR and SR scores dropped. In plain terms, the model’s written steps were often not used to get the answer (low CIR) and not clear enough for someone else to follow (low SR).
- In tasks where CIR and SR stayed low, training a model to answer directly (without writing steps) did just as well. This suggests the written steps were not pulling their weight.
- Reasoning quality improved only in some cases:
- When training boosted accuracy by a huge amount (more than 50 percentage points), CIR and SR tended to improve too. But for smaller gains, the reasoning often got less useful and less verifiable.
- A small dose of good examples helps a lot:
- Showing the model a small number of expert “show‑your‑work” examples (SFT) before the main training made a big difference. Even with just a few examples, models’ reasoning became more causally important (higher CIR) and more sufficient/clear (higher SR). With more examples, both scores improved even more, and accuracy often increased too.
- Adding direct rewards for good reasoning works:
- Giving extra reward for higher CIR or higher SR during training led to better reasoning that was both used by the model and easier to verify—without hurting final answer accuracy.
- Using the stricter SR− check showed these gains weren’t from “gaming the system” by copying the question into the reasoning.
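Adding reasoning rewards on top of the "got it right" reward can be pictured as a weighted sum. The sketch below is only an illustration under stated assumptions: a 0/1 outcome reward, SR and CIR scores already computed per rollout, and weights `alpha` (for SR) and `beta` (for CIR) that act as tuning knobs. The class name, function name, and default weights are hypothetical, not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass
class RolloutScores:
    """Scores for one sampled answer (hypothetical container)."""
    correct: bool  # did the final answer pass the automatic check?
    sr: float      # sufficiency of reasoning, assumed to lie in [0, 1]
    cir: float     # causal importance of reasoning, assumed to lie in [0, 1]


def joint_reward(scores: RolloutScores, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Outcome reward plus weighted SR and CIR terms.

    ASSUMPTION: the additive form and the default weights are illustrative.
    """
    outcome = 1.0 if scores.correct else 0.0
    return outcome + alpha * scores.sr + beta * scores.cir


# Two correct answers: one with useful, checkable steps and one without.
good_steps = RolloutScores(correct=True, sr=0.5, cir=0.25)
decoration = RolloutScores(correct=True, sr=0.0, cir=0.0)
print(joint_reward(good_steps), joint_reward(decoration))
```

Both rollouts are correct, but the one with useful, checkable steps earns more reward, which is the pressure the auxiliary terms are meant to create. Setting both weights to zero recovers the outcome-only reward.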
Why this matters:
- If we want to trust a model’s explanations, those explanations should be both the real reason behind its answer and clear enough for others to check. Otherwise, we risk believing nice‑sounding but unhelpful or misleading “reasoning.”
5) What this means going forward
- Don’t assume “showing your work” is trustworthy just because the model writes steps. If the model is only rewarded for being right, its steps may become fancy decoration rather than real thinking.
- If you want reliable, checkable reasoning:
- Start with a little supervised “teaching” using a few strong examples.
- Or add extra rewards that directly encourage reasoning that matters (CIR) and is sufficient for others to follow (SR).
- These changes are simple but can make models’ reasoning more faithful and easier to audit. That’s important for education, science help, and any situation where understanding how an answer was reached really matters.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The paper advances two metrics (CIR, SR) and presents evidence that outcome-only RLVR often fails to produce causally important or verifiable reasoning, with remedies via small SFT or auxiliary rewards. The following unresolved issues remain:
- External validity beyond verifiable tasks: Do the findings (low CIR/SR under outcome-only RLVR, benefits from SFT/auxiliary rewards) generalize to open-ended, non-verifiable domains (e.g., long-form QA, scientific writing) where rule-based graders are unavailable?
- Model-scale and architecture generalization: Results focus on Qwen2.5 (1.5B/3B/7B) and Llama3.2-3B. Are CIR/SR trends and interventions robust for much larger or different architectures (e.g., Mixture-of-Experts, transformer variants), and for models already trained with extensive “process” supervision?
- Task-selection bias: The 40 ReasoningGym tasks were selected for intermediate difficulty and multi-class structure. Do conclusions hold for binary tasks, extremely hard/easy tasks, and other domains (e.g., programming/code execution tasks at scale) beyond the limited appendix results?
- Verifier dependence of SR: SR relies mainly on gpt-4o-mini with greedy decoding and decoded-answer equality. How sensitive is SR to:
- verifier model choice (e.g., GPT-4o vs smaller verifiers, open-source verifiers),
- verifier prompting,
- decoding strategy (greedy vs sampling),
- and computing SR from full probability distributions rather than decoded outputs?
- A systematic ablation is needed to establish robustness.
- Limitations of the SR metric:
- Exact-match answer removal may miss paraphrased/implicit answer leakage.
- SR− only partially addresses question leakage/paraphrase.
- SR does not assess step-by-step sufficiency or local logical validity.
- Develop stronger, semantics-aware leakage removal and stepwise verifiability checks (e.g., entailment/consistency tests, neuro-symbolic verifiers) and test whether gains persist.
- Limitations of the CIR metric:
- Reduces the answer distribution to a Bernoulli on the correct label, ignoring shifts among incorrect options in multi-class settings.
- Uses prefix truncation as a proxy for causal dependence; this is not a true causal intervention on internal computations and may confound with generation dynamics (e.g., early-answer commitment).
- Uses percentile-based approximations for efficiency; the effect on metric fidelity is not quantified.
- Validate CIR against alternative faithfulness measures (e.g., activation patching, causal mediation analysis, token ablations at semantically meaningful boundaries) and report sensitivity to computation shortcuts.
- “Forcing” early answers: CIR hinges on “forcing” the model to answer at truncated prefixes. How is this implemented across decoding policies, and how do different decoding temperatures, stopping conditions, or tag placements affect CIR?
- Reward hacking and metric gaming: Although SR− mitigates question repetition, other gaming strategies (e.g., templated but content-free steps) may inflate SR/CIR. Construct adversarial tests to probe robustness of both metrics and the auxiliary-reward training against gaming.
- Interplay of CIR and SR with human judgments: Beyond a small qualitative check, there is no large-scale human evaluation of faithfulness/verifiability. Do CIR/SR correlate with human ratings of causal faithfulness, clarity, and stepwise correctness across tasks?
- Stability and variance: The paper does not report multi-seed variability or confidence intervals. Are CIR/SR improvements (or declines) stable across random seeds, curricula, and data orders?
- Mechanistic explanation of when RLVR helps CIR/SR: The finding that CIR/SR improve only when accuracy gains exceed ~50 points lacks a mechanistic account. Which task properties (e.g., need for intermediate states, compositionality, algorithmic depth) predict when RLVR will raise or collapse CIR/SR?
- SFT data properties and generalization:
- The SFT-before-RL results use expert traces from o3-mini on 8 tasks. How do data size, trace quality/style, domain diversity, and annotation instructions affect CIR/SR and downstream performance?
- Does SFT on reasoning traces transfer to unseen tasks or induce catastrophic forgetting on other abilities?
- Auxiliary reward scaling and scheduling: The chosen weights (α, β) for SR/CIR rewards are tuned empirically. What are principled or adaptive strategies for balancing outcome vs process rewards over training to avoid destabilization or reward dilution?
- Training algorithm and hyperparameter dependence: The study uses a specific RLVR setup (details in Appendix C). How do CIR/SR respond to different RL algorithms (e.g., PPO variants, GRPO variants, DPO-like methods), KL constraints, entropy regularization, and reward baselines?
- Compute and scalability: SR requires repeated verifier calls; CIR requires multiple evaluations per trajectory. What is the training-time cost at scale (larger models, longer traces), and can approximations or learned surrogates reduce cost without degrading metric fidelity?
- Inference-time behavior and robustness:
- Do high CIR/SR models exhibit better robustness to adversarial prompts, OOD tasks, or instruction variations?
- How do decoding choices at inference (e.g., sampling vs greedy, “no-think” modes) affect realized CIR/SR and accuracy?
- Step-level vs end-to-end evaluation: SR/CIR are end-to-end. Do conclusions hold with step-level verifiers (e.g., checking each intermediate assertion or state transition), and can training with step-level rewards yield stronger causal/verifiable traces without harming accuracy?
- Beyond ReasoningGym: The Math-Hard results are relegated to an appendix. A comprehensive cross-benchmark evaluation (math, code, symbolic reasoning, planning) is needed to confirm generality of the reported patterns.
- Impact on calibration and uncertainty: It is unclear whether higher CIR/SR improve answer calibration, selective abstention, or uncertainty estimation. Do process-aware models better know when they do not know?
- Safety and alignment implications: The paper notes risks but does not empirically study whether higher CIR/SR traces reduce deceptive or misleading rationales, or improve detectability of misalignment. Can CIR/SR be integrated into monitoring systems to flag suspect reasoning?
- Theoretical grounding of CIR/SR: The conditional independence framing is intuitive but not formally validated under the generative process of LMs. Can a theoretical analysis link CIR/SR to identifiable causal quantities or guarantees under realistic modeling assumptions?
- Pareto front of accuracy vs reasoning quality: While auxiliary rewards often maintain accuracy, trade-offs likely appear in other settings. Map the Pareto frontier of accuracy vs CIR/SR across tasks and scales to inform practical deployment choices.
Practical Applications
Practical Applications of the Paper’s Findings
This paper introduces two operational metrics—Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR)—and demonstrates simple but effective training modifications (small-n SFT before RLVR; auxiliary CIR/SR rewards) that make chain-of-thought both more faithful and more verifiable without sacrificing task accuracy. Below are concrete applications grouped by deployment horizon.
Immediate Applications
- Model evaluation and audit gates for reasoning quality
- Sectors: software, healthcare, finance, legal, education, government
- What to deploy: a lightweight evaluation suite that computes CIR (via prefix truncation and JS divergence) and SR (verifier consistency with/without question; optionally SR-) on held-out tasks; integrate into model cards and CI/CD for model releases; flag low-CIR/SR routes for human review
- Dependencies/assumptions: access to a strong verifier (API or local model); task answers must be auto-gradable or categorical; additional compute for truncation and verifier calls; privacy safeguards for chain-of-thought
- Training recipe upgrade: small-n SFT-before-RLVR
- Sectors: AI labs, enterprise AI teams (all domains using LLMs for step-by-step tasks)
- What to deploy: collect 2–64 high-quality expert traces per task; do SFT on reasoning-only (<think> … </think>) before outcome-based RLVR; re-run RLVR
- Dependencies/assumptions: ability to source expert traces (teacher models or SMEs); finetuning pipeline access; adherence to data governance for traces; tasks benefit most when initial CIR/SR are low
- Reward shaping plugin: add CIR/SR auxiliary rewards to RLVR
- Sectors: software, code agents, customer support, operations planning
- What to deploy: extend RL code (e.g., TRL/GRPO variants) with weighted CIR or SR terms per rollout; percentile-based CIR approximation for efficiency; tuned weights (start around 1.0)
- Dependencies/assumptions: rollout-level access to model logits/output; verifier latency/cost; careful prevention of question leakage (use SR- to detect); limited gains on tasks where base accuracy is near zero
- Production response policy based on SR/CIR scores
- Sectors: healthcare/clinical decision support, finance/risk, legal drafting, education/tutoring
- What to deploy: runtime “trust gating”—if SR or CIR below threshold, withhold or redact chain-of-thought, add warning banner, or escalate to a human; if SR high, show “verified reasoning” badge and expose steps
- Dependencies/assumptions: calibrated thresholds per domain; regulatory review for disclosure practices; UX to communicate uncertainty; chain-of-thought privacy controls
- Cost/performance triage: disable reasoning where it doesn’t help
- Sectors: platform LLM deployments; cost-sensitive applications
- What to deploy: run an A/B test, following the paper’s finding that answer-only training (no written reasoning) can match accuracy on tasks where CIR/SR stay low; if final CIR/SR remain low, disable chain-of-thought to reduce tokens while maintaining accuracy
- Dependencies/assumptions: task-level analysis; acceptance that explanations won’t be shown when not causally used or sufficient
- Tutoring and pedagogy tooling with verifier-backed steps
- Sectors: education, corporate training
- What to deploy: tutor mode that computes SR to ensure steps are self-contained; if SR low, prompt for more concrete steps; use small-n SFT to teach models to produce concrete intermediate states
- Dependencies/assumptions: high-quality seed traces; domain-aligned verifier prompts; alignment with curricula
- Code assistant reasoning QA
- Sectors: software engineering, MLOps
- What to deploy: map SR to unit-test-based verification and CIR to plan-token truncation sensitivity; reject suggestions with low SR unless accompanied by passing tests; display “verified plan” when CIR/SR exceed threshold
- Dependencies/assumptions: robust test harnesses; repo CI integration; domain-specific verifiers (static analyzers, linters)
- Model procurement and internal governance checklists
- Sectors: government, finance, healthcare, enterprise IT
- What to deploy: add CIR/SR thresholds to RFPs and internal risk registers; require per-task reporting (accuracy, CIR, SR, SR-); treat low-CIR/high-accuracy models as “rationalizers” needing extra oversight
- Dependencies/assumptions: standard prompts and datasets for evaluation; agreements on acceptable thresholds; reproducibility across model updates
- Dataset and labeling protocols for reasoning traces
- Sectors: data vendors, ML teams
- What to deploy: collection guidelines for expert reasoning traces that avoid answer leakage; tag-only supervision (<think> tags) to teach strategy, not label leakage; quality rubric aligned to “concrete steps” and “demonstrated calculations”
- Dependencies/assumptions: annotator training; privacy handling; domain reviewers for correctness of steps
- Monitoring dashboards for “reasoning health”
- Sectors: any production LLM deployment
- What to deploy: dashboards tracking CIR, SR/SR-, reasoning length, and accuracy by task; alerts for “reasoning collapse” (falling length and SR) after fine-tune pushes
- Dependencies/assumptions: logging infra; sampling strategy; budget for periodic verifier runs
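The runtime "trust gating" policy from the list above could look roughly like the following. This is a sketch under loud assumptions: the threshold values are placeholders rather than numbers from the paper, the rule that both SR and CIR must clear their thresholds before showing steps is one possible reading, and the `Action` names and `high_stakes` flag are hypothetical.

```python
from enum import Enum


class Action(Enum):
    SHOW_VERIFIED = "show_steps_with_badge"
    WITHHOLD_AND_WARN = "withhold_steps_and_warn"
    ESCALATE = "escalate_to_human"


def gate_response(
    sr: float,
    cir: float,
    sr_min: float = 0.8,    # ASSUMPTION: placeholder, calibrate per domain
    cir_min: float = 0.3,   # ASSUMPTION: placeholder, calibrate per domain
    high_stakes: bool = False,
) -> Action:
    """Decide how to present a response given its reasoning scores."""
    if sr >= sr_min and cir >= cir_min:
        # Reasoning is both checkable and actually used: expose it.
        return Action.SHOW_VERIFIED
    # Otherwise hide the chain-of-thought; high-stakes routes go to a person.
    return Action.ESCALATE if high_stakes else Action.WITHHOLD_AND_WARN


print(gate_response(sr=0.9, cir=0.5))
print(gate_response(sr=0.4, cir=0.5, high_stakes=True))
```

Keeping the thresholds as parameters reflects the dependency noted above that they need to be calibrated per domain.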
Long-Term Applications
- Regulatory standards and certifications for verifiable reasoning
- Sectors: healthcare, finance, public sector, safety-critical systems
- What it could become: industry standards that mandate minimum SR/SR- and CIR for high-stakes use; third-party audits and certifications (“Verifiable Reasoning Level 2+”)
- Dependencies/assumptions: consensus on benchmarks; regulator capacity; domain-specific verifier kits
- Domain-specialized verifier models (local, efficient)
- Sectors: healthcare (guideline-grounded), finance (policy/rulebook-grounded), law (citation-grounded)
- What it could become: compact verifier LLMs or neuro-symbolic verifiers trained to compute SR/SR- offline; logic-checked or tool-augmented evaluators
- Dependencies/assumptions: high-quality, domain-annotated traces; integration with knowledge bases; on-prem deployment constraints
- Reasoning-first architectures and objectives
- Sectors: AI research, platform model providers
- What it could become: pretraining or architectural designs that natively optimize for CIR/SR (e.g., plan-token modules, latent-to-text alignment layers, differentiable verifiers)
- Dependencies/assumptions: scalable training recipes; stable proxy rewards; benchmarks that generalize beyond math/code
- Auditable AI workflows with provenance and redaction
- Sectors: legal, compliance, enterprise risk
- What it could become: immutable logs of reasoning trajectories with automatic SR- redaction of question paraphrases and sensitive data; selective disclosure for auditors
- Dependencies/assumptions: secure logging infra; privacy-by-design; legal frameworks for handling chain-of-thought data
- Multi-agent and tool-augmented planners with SR/CIR gates
- Sectors: robotics, operations, supply chain, energy grid ops
- What it could become: agents that iteratively plan and verify substeps; proceed only when SR for each subplan exceeds a threshold; deploy truncation sensitivity (CIR) as a reliability signal before execution
- Dependencies/assumptions: task decomposers; simulators/tool APIs; latency tolerance for verification
- Education platforms with “teach-to-SR” adaptive feedback
- Sectors: education technology
- What it could become: student-facing systems that coach learners to produce sufficient reasoning (maximizing SR) and detect “shortcutting” (low CIR) in their work; analytics for instructors
- Dependencies/assumptions: secure student data handling; calibrated verifiers per subject; fairness and accessibility reviews
- Insurance, liability, and SLAs tied to reasoning metrics
- Sectors: enterprise software, AI vendors
- What it could become: contracts that guarantee minimum CIR/SR levels for certain workflows; adjusted premiums for sustained compliance
- Dependencies/assumptions: actuarial evidence linking metrics to downstream risk; standardized measurement protocols
- Cross-modal verifiable reasoning (text + code + actions + sensors)
- Sectors: software agents, robotics, scientific discovery, cyber-physical systems
- What it could become: SR-like sufficiency for plans that include code execution traces, tool outputs, and sensor data; CIR generalized to action-token truncations
- Dependencies/assumptions: unified logging across tools; trustworthy tool outputs; new multi-modal verifiers
- Marketplace of “reasoning toolchains”
- Sectors: developer platforms
- What it could become: plugins for RL frameworks offering CIR/SR rewards, SR- evaluators, data curation assistants, and dashboards; IDEs with “Reasoning Lint” that flags unverifiable steps
- Dependencies/assumptions: healthy ecosystem incentives; compatibility with popular training stacks and serving layers
- Benchmark and leaderboard ecosystems beyond math/code
- Sectors: academia, standards bodies
- What it could become: expanded ReasoningGym-like suites across operations, policy analysis, healthcare triage, and legal reasoning, with public CIR/SR leaderboards
- Dependencies/assumptions: community maintenance; robust auto-graders and reference verifiers; dataset licensing
Notes on feasibility across applications:
- The paper shows that outcome-only RLVR can raise accuracy without improving CIR/SR, so applications relying on “explanations” should not conflate accuracy with trustworthy reasoning.
- Small-n SFT is an effective and low-cost remedy; auxiliary CIR/SR rewards offer an expert-trace-free alternative but need verifier infrastructure.
- Gains are task-dependent; where base accuracy is very low, auxiliary rewards won’t fix performance but can still prevent “reasoning collapse.”
- Chain-of-thought may carry sensitive or leading information; use SR- and redaction policies to mitigate leakage risks.
Glossary
- Auxiliary reward: An additional reward signal combined with the main (outcome-based) reward during RL to shape how the model reasons. "by applying auxiliary CIR/SR rewards on top of the outcome-based reward."
- Bernoulli distribution: A probability distribution over two outcomes; used here to model answer correctness and compute divergences. "treat these scalars as parameters for Bernoulli distributions"
- Causal Importance of Reasoning (CIR): A metric quantifying how much the final answer depends on the generated reasoning tokens. "Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer,"
- Chain-of-thought (CoT): Step-by-step natural language reasoning generated by a model to solve a problem. "truncate the CoT at each token t_k (k ∈ [1, T])"
- Checkpoint: A saved model state used as a starting point for subsequent training phases. "starting from the post-SFT checkpoint."
- Conditional independence: A statistical property where two variables are independent given a third; used to formalize the metrics. "We formalize these complementary requirements using conditional independence"
- GRPO: An RL optimization method proposed to improve training efficiency in verifiable domains. "Shao et al. (2024) proposed GRPO for improved efficiency"
- Greedy decoding: A decoding strategy that selects the highest-probability token at each step. "(In our experiments, this is based on greedy decoding.)"
- Instruction tuning: Fine-tuning a model on instruction–response pairs to improve following instructions. "they have undergone instruction tuning and are good at following instructions."
- Jensen–Shannon divergence: A symmetric divergence measure between probability distributions used to compare predictions across truncations. "compute the Jensen-Shannon divergence between them"
- Joint reward: A combined reward that includes outcome correctness and auxiliary CIR/SR signals. "This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning."
- Legibility (reasoning legibility): The clarity with which reasoning can be understood and independently checked. "inspired by work on reasoning legibility (Kirchner et al., 2024)."
- Neuro-symbolic: Approaches that integrate neural models with symbolic logic checks for formal validation. "Neuro-symbolic approaches translate reasoning steps into formal statements"
- Outcome-based reward: A reward signal determined solely by final-answer correctness, not by the reasoning process. "this reward is known as an outcome-based reward."
- pass@1: An evaluation metric indicating the accuracy when only the first attempt is considered. "0.80 accuracy with pass@1."
- Policy optimization: The process in RL of adjusting a policy to maximize expected rewards. "added to the scalar reward used for policy optimization."
- Post-hoc rationales: Explanations generated after producing an answer, which may not reflect the causal process. "including post-hoc rationales (Chen et al., 2023)."
- Reinforcement Learning from Verifiable Rewards (RLVR): An RL post-training setup where rewards come from automatically checkable outcomes. "Reinforcement learning from verifiable rewards (RLVR) has become an important part of post-training efforts"
- Reward hacking: Exploiting a reward function in ways that increase reward without truly solving the intended task. "used for detecting reward hacking in math and code tasks."
- Rollout: A single sampled trajectory/output from a policy for a given prompt during RL. "We compute SR with I = 4 rollouts per prompt for efficiency."
- Rule-based function: A deterministic scorer that evaluates answer correctness according to predefined rules. "evaluated with a specific rule-based function on the answer"
- SL-before-RL: A training recipe that applies supervised learning before reinforcement learning. "known as SL-before-RL"
- Spearman correlation: A nonparametric rank correlation statistic used to assess monotonic relationships. "The Spearman correlation is in fact even stronger for SR: 0.62 (p < 0.00001)."
- Sufficiency of Reasoning (SR): A metric assessing whether the reasoning alone lets a verifier infer the correct answer. "Sufficiency of Reasoning (SR) measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone."
- Supervised fine-tuning (SFT): Further training using labeled examples, here to teach reasoning traces. "A small amount of SFT before RLVR can be a remedy for low CIR and SR."
- TRACE metric: A contemporaneous metric related to CIR for assessing reasoning faithfulness. "similar to the TRACE metric in concurrent work (Wang et al., 2025)"
- Trajectory: The sequence of states/outputs constituting one episode of model reasoning and answering in RL. "computed per trajectory and added to the scalar reward used for policy optimization."
- Truncation (prefix truncation): Cutting a reasoning chain at a given token index to test causal dependence. "we define truncation at prefix length k as keeping only the first k tokens in the reasoning chain"
- Verifiable rewards: Rewards whose correctness can be independently checked by an external procedure. "environments that provide verifiable rewards support evaluation and training loops"
- Verifier model: A separate model used to judge whether a reasoning chain suffices to determine the answer. "We use gpt-4o-mini as our verifier model."