Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision (2509.14234v1)

Published 17 Sep 2025 in cs.LG

Abstract: Where do learning signals come from when there is no ground truth in post-training? We propose turning exploration into supervision through Compute as Teacher (CaT), which converts the model's own exploration at inference-time into reference-free supervision by synthesizing a single reference from a group of parallel rollouts and then optimizing toward it. Concretely, the current policy produces a group of rollouts; a frozen anchor (the initial policy) reconciles omissions and contradictions to estimate a reference, turning extra inference-time compute into a teacher signal. We turn this into rewards in two regimes: (i) verifiable tasks use programmatic equivalence on final answers; (ii) non-verifiable tasks use self-proposed rubrics: binary, auditable criteria scored by an independent LLM judge, with reward given by the fraction satisfied. Unlike selection methods (best-of-N, majority, perplexity, or judge scores), synthesis may disagree with the majority and be correct even when all rollouts are wrong; performance scales with the number of rollouts. As a test-time procedure, CaT improves Gemma 3 4B, Qwen 3 4B, and Llama 3.1 8B (up to +27% on MATH-500; +12% on HealthBench). With reinforcement learning (CaT-RL), we obtain further gains (up to +33% and +30%), with the trained policy surpassing the initial teacher signal.

Summary

  • The paper introduces CaT, a novel method that synthesizes reference answers from diverse rollouts as a teacher signal without human annotation.
  • The approach employs both inference-time optimization and an RL loop (CaT-RL), achieving up to 33% gains on MATH-500 and 30% on HealthBench.
  • It uses rubric-based reward construction to validate model outputs, outperforming selection-based baselines and even matching expert human rubrics.

Compute as Teacher: Reference-Free Supervision via Inference Compute

Introduction and Motivation

The paper "Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision" (2509.14234) introduces a method, CaT, for post-training LLMs in domains where ground-truth references or programmatic verifiers are unavailable. The central premise is to leverage inference-time compute—specifically, the diversity in parallel rollouts generated by the current policy—to synthesize a reference answer using a frozen anchor model. This synthesized reference is then used as a teacher signal for further optimization, either at inference or within an RL loop (CaT-RL). The approach is designed to be drop-in, requiring no human annotation and minimal domain-specific engineering. Figure 1

Figure 1: CaT pipeline: exploration via parallel rollouts, synthesis of a reference by a frozen anchor, and reward conversion for verifiable and non-verifiable domains.

Methodology

Synthesis of Reference Answers

For each prompt q, the current policy π_t generates G parallel rollouts o_{1:G}. The anchor policy π_0 (frozen, typically the initial policy) is conditioned only on these rollouts (not the prompt) and tasked with synthesizing a single reference answer s that reconciles omissions, contradictions, and partial solutions. This synthesis is not a selection among rollouts but a generative reconciliation, enabling the anchor to produce answers that may disagree with all rollouts and even correct errors present in every sample.
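
To make the synthesis step concrete, here is a minimal sketch of how the anchor call might be wired up. The anchor_generate callable, the prompt wording, and the attempt formatting are illustrative assumptions, not the paper's exact implementation.

```python
from typing import Callable, List

def synthesize_reference(
    rollouts: List[str],
    anchor_generate: Callable[[str], str],
) -> str:
    """Ask a frozen anchor model to reconcile G rollouts into one reference.

    The anchor sees only the rollouts, not the original prompt, so it must
    reconcile their content rather than simply re-answer the question.
    """
    # Present the rollouts as numbered attempts for the anchor to read.
    attempts = "\n\n".join(
        f"Attempt {i + 1}:\n{r}" for i, r in enumerate(rollouts)
    )
    synthesis_prompt = (
        "Below are several independent attempts at the same problem. "
        "Reconcile their omissions and contradictions and write a single, "
        "improved answer. You may depart from every attempt if they are all wrong.\n\n"
        + attempts
        + "\n\nSynthesized answer:"
    )
    # A single forward pass of the frozen anchor produces the reference s.
    return anchor_generate(synthesis_prompt)
```

Because the anchor generates rather than selects, the returned reference can differ from every rollout.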

Reward Construction

  • Verifiable Tasks (e.g., Math): The synthesized reference s is used as a target, and rewards are assigned via programmatic equivalence checks (e.g., string match of boxed answers).
  • Non-Verifiable Tasks (e.g., Clinical Guidance): The anchor generates a response-specific rubric R, a set of binary, auditable criteria. An independent LLM judge π_J evaluates whether each rollout satisfies each criterion, and the reward is the fraction of criteria met (a minimal sketch of both reward types follows the figure caption below).

    Figure 2: Rubric-based rewards for non-verifiable tasks: anchor generates rubrics, judge LLM scores each criterion, reward is normalized proportion satisfied.
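
As a minimal sketch of both reward regimes, the code below assigns a binary programmatic-equivalence reward for verifiable tasks and a rubric-fraction reward for non-verifiable tasks. The boxed-answer regex and the judge_satisfies callable are assumed placeholders for illustration, not interfaces taken from the paper.

```python
import re
from typing import Callable, List

def verifiable_reward(rollout: str, reference: str) -> float:
    """Verifiable tasks: 1.0 if the final boxed answers match, else 0.0."""
    def boxed(text: str) -> str:
        matches = re.findall(r"\\boxed\{([^}]*)\}", text)
        return matches[-1].strip() if matches else ""
    answer = boxed(rollout)
    return 1.0 if answer and answer == boxed(reference) else 0.0

def rubric_reward(
    rollout: str,
    rubric: List[str],
    judge_satisfies: Callable[[str, str], bool],
) -> float:
    """Non-verifiable tasks: fraction of binary rubric criteria satisfied."""
    if not rubric:
        return 0.0
    hits = sum(judge_satisfies(rollout, criterion) for criterion in rubric)
    return hits / len(rubric)
```

In CaT-RL, per-rollout rewards of this form are what feed the policy update.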

Training Regimes

  • Inference-Time CaT: The anchor synthesizes a reference from rollouts at test time, improving output quality without weight updates.
  • CaT-RL: The synthesized reference or rubric-based reward is used within a GRPO RL loop to update the policy, enabling further improvement beyond the initial teacher signal (see the advantage sketch below).
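
For context on the CaT-RL update, the sketch below computes group-relative advantages of the kind GRPO uses in place of a learned value network. The standard-deviation normalization and the epsilon constant are common choices assumed here, not details reported in this paper.

```python
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each rollout relative to its group, GRPO-style.

    In CaT-RL, `rewards` would come from the programmatic check against the
    synthesized reference (verifiable tasks) or the rubric fraction
    (non-verifiable tasks).
    """
    group_size = len(rewards)
    mean = sum(rewards) / group_size
    # The group mean serves as the baseline; normalizing by the group
    # standard deviation is the usual GRPO convention.
    std = (sum((r - mean) ** 2 for r in rewards) / group_size) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```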

Empirical Results

Performance Gains

CaT and CaT-RL were evaluated on MATH-500 (verifiable) and HealthBench (non-verifiable) using Gemma 3 4B, Qwen 3 4B, and Llama 3.1 8B. CaT at inference yields up to +27% improvement on MATH-500 and +12.5% on HealthBench. CaT-RL further improves performance, achieving up to +33% and +30% relative gains, with the trained policy often surpassing the initial teacher signal.

Figure 3: CaT and CaT-RL improve models by up to ~30% relative to the initial policy baseline.

Rubric Rewards vs. Alternatives

Self-proposed rubrics outperform model-as-judge semantic equivalence and are competitive with expert human rubrics, demonstrating the efficacy of decomposed, auditable criteria for reward construction in non-verifiable domains.

Figure 4: Left: Self-proposed rubrics rival expert rubrics. Right: RL with rubrics outperforms SFT on non-verifiable tasks.

Comparison to Selection Baselines

CaT outperforms single-sample, best-of-N, minimum perplexity, mutual predictability, and majority vote baselines in both verifiable and non-verifiable domains. Notably, CaT can produce correct answers that disagree with all rollouts, a capability unattainable by selection-based methods.

Figure 5: CaT at inference outperforms all selection-based baselines, with substantial improvements on both HealthBench and MATH-500.

Scaling with Rollout Count

Performance scales monotonically with the number of rollouts G in verifiable domains, plateauing in non-verifiable domains after G ≈ 4. This scaling property enables a practical trade-off between FLOPs and supervision signal strength.

Figure 6: Left: CaT scales with rollout count. Right: CaT leverages cross-rollout reasoning rather than acting as a new rollout.

Mechanistic Insights

Analysis shows that CaT does not simply act as another rollout; it leverages the diversity and reasoning present in the set of rollouts, reconciling disagreements and synthesizing improved answers. CaT disagrees with majority voting on 14% of questions and with all rollouts on 1%, indicating genuine reconciliation rather than selection.

Limitations and Future Directions

CaT's effectiveness depends on the anchor's ability to synthesize meaningful references; weak base models or domains with insufficient rollout diversity may limit improvement. As the policy converges and rollout diversity decreases, the teacher signal plateaus, bounding further gains. Future work may focus on promoting rollout diversity via exploration rewards or sampling strategies, extending synthesis to reasoning traces, and automating question generation to eliminate reliance on curated datasets.

Figure 7: The teacher signal from CaT-RL synthesis converges with the trained policy as rollout diversity diminishes.

Practical and Theoretical Implications

CaT provides a scalable, annotation-free supervision mechanism for both verifiable and non-verifiable domains, addressing a major bottleneck in specialized LLM development. The method generalizes self-training and knowledge distillation by synthesizing references from model exploration rather than relying on single self-labels or consensus. The use of self-proposed rubrics for reward construction in non-verifiable domains offers a robust alternative to judge-only feedback, mitigating issues of instability and bias.

Theoretically, CaT demonstrates that inference compute can be systematically converted into supervision, suggesting a pathway toward reference-free, potentially superhuman model capabilities unconstrained by human-annotated data. The approach bridges RL, self-training, and ensemble error correction, and opens avenues for further research in reference estimation strategies and reward shaping.

Conclusion

Compute as Teacher (CaT) establishes a principled framework for turning inference compute into reference-free supervision by synthesizing reference answers from parallel rollouts and converting them into rewards for both verifiable and non-verifiable tasks. Empirical results show substantial improvements over baselines and demonstrate the superiority of synthesis over selection. The method is practical, scalable, and broadly applicable, with implications for the future of LLM post-training and the development of models in domains where human supervision is scarce or contested.

Explain it Like I'm 14

Overview

This paper introduces a simple idea called Compute as Teacher (CaT). It asks: if we don’t have correct answers to train an AI, can we use extra “thinking” at test time to create our own teaching signal? The authors show how to turn the AI’s own multiple tries on a question into a better, combined answer, and then use that as feedback to improve the AI.

Goals and Questions

The paper explores a few easy-to-understand questions:

  • Can extra “thinking” by the AI (more tries per question) be turned into a teacher-like signal when no official correct answer exists?
  • Can this work for both clear-cut tasks (like math, where you can check the final answer) and messy tasks (like health advice, where there isn’t just one correct answer)?
  • Is building a new combined answer from many tries better than simply picking the most common or most confident try?
  • Do these teacher signals help the AI learn over time, not just answer better right now?

How the Method Works (in everyday language)

Think of the AI as a group of students tackling the same question. Each student gives a different attempt. Then a trusted moderator reads all attempts and writes a single, improved solution. That improved solution becomes the “teacher signal.”

Step 1: Explore (many tries)

For each question, the AI generates several answers in parallel (like multiple drafts). These are called “rollouts.”

Step 2: Synthesize (combine ideas)

A stable “anchor” model (an earlier, frozen version of the AI) reads only the set of attempts, not the original question, and writes a single synthesized answer. It:

  • Combines useful bits from different attempts
  • Fixes mistakes and contradictions
  • Can disagree with the majority if the majority is wrong

This is different from "selection" methods (like picking the most common answer); here the anchor builds a new, better answer.

Step 3: Turn the synthesized answer into "points" (rewards)

To help the AI learn, the paper turns the synthesized answer into a reward score:

  • For verifiable tasks (like math): the AI gets a point if its final answer matches the synthesized answer (checked by a simple program).
  • For non-verifiable tasks (like health advice): the anchor writes a short checklist (rubric) of yes/no criteria describing what a good answer should include (e.g., “mentions risks,” “is polite,” “gives clear steps”). A separate judge model marks each item as yes or no. The reward is the fraction of checklist items satisfied. This breaks a vague “good/bad” judgment into small, auditable pieces.
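
For example, if the checklist has 10 criteria and the judge marks 7 of them as satisfied, the reward is 7/10 = 0.7.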

Two ways to use CaT

  • Test-time CaT: Don’t change the AI’s weights. Just spend extra compute to get multiple tries and synthesize a better final answer.
  • CaT-RL (training): Use the rewards from the synthesized answer or checklist to update the AI’s weights via reinforcement learning, so the AI improves for future questions.

Main Findings and Why They Matter

Here are the key results:

  • Better answers right away: CaT at test time improved accuracy on math (MATH-500) by up to about +27% and on health advice (HealthBench) by about +12%, compared to using a single try.
  • Even bigger gains with training: CaT-RL (the training version) improved performance further, up to about +33% on math and +30% on health, and often surpassed the initial teacher signal. That means the student can outgrow the teacher.
  • Checklists beat vague judging: In non-verifiable tasks, the self-made checklists (rubrics) led to more stable, useful rewards than a judge model that just says “this answer is similar to the synthesized one.” These rubrics performed competitively with human-made checklists from doctors.
  • Better than picking the "best try": CaT beat selection methods like majority vote, lowest perplexity (most "confident" try), mutual predictability, and the model picking its own best, because synthesis can produce a correct answer even if every individual try is wrong.
  • Scales with more tries: More parallel attempts generally helped, especially in math. In health tasks, improvements plateaued after a few tries, likely because freeform answers are harder to reconcile beyond a point.
  • Real reconciliation, not just another try: CaT meaningfully uses information across tries. It disagreed with majority vote on about 14% of math questions and was correct even when all tries were wrong about 1% of the time.

Why this matters: Many valuable tasks don’t have one “right” answer or are too expensive to label at scale. CaT shows that extra compute (more tries + synthesis) can replace missing labels and still push the AI to get better.

Implications and Impact

This research suggests a practical path forward when high-quality labels are rare, costly, or impossible:

  • Spend compute at test time to get stronger answers without changing the model.
  • Use CaT-RL to turn those stronger answers into training signals, improving the model over time.
  • Apply to both clear-cut tasks (like math) and open-ended tasks (like health guidance, dialogue, or writing), because the rubric approach turns fuzzy goals into checkable steps.

In short, when data labeling is the bottleneck, compute can be the teacher. This could help build more capable and reliable AI systems across many domains, using less human supervision and more smart use of the model’s own exploration.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of missing pieces and unresolved questions that future work could concretely address:

  • Anchor quality and choice
    • How does the capability of the frozen anchor affect synthesis fidelity and learning? Systematically vary anchor strength (smaller/larger than the learner, different model families) and quantify outcomes.
    • Would a lagged or periodically refreshed anchor (e.g., Polyak-averaged, EMA, or best checkpoint so far) yield better stability and teacher quality than a permanently frozen initial policy?
    • Can multi-anchor ensembles (diverse frozen estimators) improve robustness to single-anchor biases?
  • Question-blind synthesis
    • The anchor is “question-blind,” but rollouts may restate the question; how often does leakage occur and does it matter? Provide quantitative ablations where the question is rigorously stripped from rollouts.
    • What are the trade-offs of allowing the anchor to access the question (e.g., accuracy vs. selection bias)? Provide controlled comparisons.
  • Synthesis method design
    • The synthesis is a single forward pass; are there better reconciliation algorithms (e.g., fact aggregation with explicit contradiction resolution, iterative refinement, chain-of-thought aggregation, or voting on subclaims)?
    • How to measure and improve internal consistency of the synthesized reference (e.g., checking logical coherence, equation consistency, contradiction detection)?
    • Can the anchor attach confidence/calibration to each synthesized subclaim and propagate that into reward shaping?
  • Robustness to bad or adversarial rollouts
    • How resilient is synthesis to outliers, low-quality, or adversarial rollouts (prompt injection inside rollouts, coordinated misleading patterns)? Design stress tests and mitigation strategies (e.g., robust aggregation, outlier filtering).
    • Does CaT fail gracefully when most rollouts are wrong in the same way (systematic bias) vs. independently wrong? Characterize failure modes and propose defenses.
  • Diversity vs. plateau
    • The method plateaus as rollout diversity decreases; which diversity signals (lexical, semantic, reasoning-path diversity) best predict continued gains?
    • Can targeted exploration bonuses, temperature schedules, or disagreement-aware sampling maintain useful diversity longer without sacrificing quality?
    • Develop adaptive schedules to choose G per prompt based on observed marginal benefit or diversity measures.
  • Reward hacking and judge reliability (non-verifiable domains)
    • To what extent do models exploit rubric or judge artifacts (verbosity, phrasing, formatting)? Provide adversarial evaluations and mitigation (e.g., randomized rubric paraphrases, judge ensembles, adversarial training).
    • Calibrate and audit the LLM judge: inter-judge agreement, sensitivity to surface form, susceptibility to model identity bias, and robustness to prompt injection.
    • Explore alternative judges (human-in-the-loop sampling, multi-judge aggregation, automatic consistency checks) and quantify trade-offs in cost and reliability.
  • Rubric quality and structure
    • How complete, non-redundant, and discriminative are self-proposed rubrics? Develop automatic metrics for rubric coverage and independence of criteria.
    • Study weighting and pruning of rubric items (importance weighting, difficulty-adaptive criteria) instead of uniform averaging.
    • Investigate rubric-length sensitivity (n too small: under-specification; n too large: noisy reward) and learn an optimal rubric size per task.
  • Verifiable domain evaluation
    • The math verifier uses simple string match; replace with robust equivalence (e.g., SymPy, CAS execution) and quantify label noise impact on training and evaluation.
    • Extend verifiers to multi-step proof validity (not just final answer) and study whether synthesis over reasoning traces improves proof-level correctness.
  • Generalization and scope
    • Evaluation is limited to two benchmarks (MATH-500, HealthBench) and small model scales (4–8B); evaluate across code generation, tool-use, retrieval-augmented tasks, safety alignment, and multilingual settings, and on larger models (≥70B).
    • Test out-of-distribution robustness and transfer: does a CaT-trained model on one domain generalize to new domains without references?
  • Compute and cost efficiency
    • Provide full compute accounting (tokens, latency, memory) for synthesis and rubric judging; characterize FLOPs-to-quality trade-offs under fixed inference budgets.
    • Investigate cost-reduction strategies: cached synthesis, early stopping when rollouts agree, adaptive G, selective judging (only "uncertain" criteria).
  • Comparisons and baselines
    • Compare against stronger selection/reconciliation baselines: multi-judge best-of-N, DERA/debate-style methods, STaR/Quiet-STaR variants, self-consistency with rationale grading, and tool-augmented verification.
    • Benchmark against RLAIF and human-preference RL where available to quantify alignment/quality gaps.
  • Training mechanics and stability
    • How sensitive is CaT-RL to the underlying RL algorithm (GRPO vs. PPO, DPO-style objectives, offline RL)? Provide cross-algorithm ablations.
    • Binary, sparse rewards may hinder credit assignment; study reward shaping with criterion-level confidence, partial credit on near-misses, or token-level credit assignment.
  • Safety and domain risk
    • In high-stakes domains (e.g., health), what safeguards ensure synthesized references and rubric rewards do not entrench unsafe practices? Establish human oversight protocols and safety evaluations beyond rubric scoring.
    • Evaluate harmful content, hallucinations, and malpractice risks under CaT and CaT-RL, with domain expert audits.
  • Anchor and learner co-evolution
    • Can a teacher-student curriculum (increasing difficulty, self-play question generation) extend beyond reliance on fixed datasets and delay plateau?
    • What happens if the learner surpasses the anchor substantially—should the anchor be upgraded, or should we use cross-model anchors to avoid teacher drag?
  • Confidence calibration and disagreement handling
    • When CaT disagrees with the majority, how often is it correct across domains? Provide precision/recall of “non-majority” moves and confidence thresholds for triggering synthesis overrides.
    • Develop disagreement-aware uncertainty estimates to decide when to spend extra compute or consult external tools/humans.
  • Multi-turn and long-context interactions
    • HealthBench evaluation focuses on single responses; how does CaT perform in multi-turn dialogues, longitudinal plans, or stateful tasks where criteria depend on prior turns?
  • Contamination and leakage
    • Assess data contamination risks (e.g., model pretraining exposure to benchmarks, judge/model family coupling). Provide decontamination checks and cross-family evaluations to reduce evaluator–candidate favoritism.
  • Prompting and reproducibility
    • Sensitivity to synthesis and rubric prompts is not quantified; report prompt ablations, seed sensitivity, and robustness to paraphrases for reproducibility.
    • Publish full prompts and implementation details, plus statistical power analyses for reported gains.
  • Security considerations
    • Explore prompt-injection defenses for synthesis and judging when rollouts contain adversarial instructions aimed at steering the anchor or judge.
    • Develop sanitization pipelines and constrained decoding for synthesis to prevent malicious content propagation.
  • Extending beyond responses
    • The paper suggests synthesizing over reasoning traces; concretely specify and test methods for reconciling chain-of-thought steps, intermediate computations, and tool calls, and evaluate their effect on learning.

Glossary

  • Anchor policy: A fixed model used solely to reconcile multiple rollouts into a single target, separate from the exploring policy. "we introduce a synthesis step, where we ask the anchor policy to reconcile the model's exploration, the parallel rollouts during GRPO, into a single, improved answer."
  • Best-of-N: A selection method that chooses the best output among N samples based on a heuristic or score. "Unlike selection methods (best-of-N, majority, perplexity, or judge scores), synthesis may disagree with the majority and be correct even when all rollouts are wrong; performance scales with the number of rollouts."
  • Chain of thought: The intermediate reasoning steps generated by a model to arrive at an answer. "CaT may be naturally extended to synthesize over thinking and reasoning traces rather than only question responses and chain of thought."
  • Clipped surrogate: The PPO-style loss function that clips probability ratios to limit destructive updates. "with the clipped surrogate"
  • Entropy minimization: An approach that drives the model toward more confident (low-entropy) outputs. "This reflects prior work on trajectory-level confidence maximization and entropy minimization, e.g., \citet{agarwal2025unreasonable} and \citet{li2025confidence}."
  • FLOPs-for-supervision trade-off: Exchanging additional inference compute for better supervision signals. "yielding a practical FLOPs-for-supervision trade-off."
  • GRPO (Group Relative Policy Optimization): A PPO variant that uses a group baseline to avoid a value network. "GRPO \citep{shao2024deepseekmath} is a memory-efficient variant of PPO \citep{schulman2017proximal} that avoids a value network by using a group baseline."
  • Group baseline: A baseline computed from the mean reward of a group of rollouts, used to normalize advantages. "GRPO \citep{shao2024deepseekmath} is a memory-efficient variant of PPO \citep{schulman2017proximal} that avoids a value network by using a group baseline."
  • Group-relative advantages: Advantage estimates normalized relative to the group’s rewards rather than a learned value function. "which computes group-relative advantages with the group mean as baseline."
  • Importance weighting: Reweighting of samples by the ratio of new to old policy probabilities to form unbiased updates. "where the importance weighting token-level ratio and the group-normalized advantage are"
  • Judge-only feedback: Evaluation where an LLM assigns direct scores to outputs without structured criteria. "or (ii) judge-only feedback where another LLM assigns coarse scores to freeform outputs"
  • KL divergence (KL term): A regularization that penalizes deviation of the current policy from a reference policy. "the KL term discourages large policy drift from the reference π_ref (typically the initial policy π_0)."
  • Knowledge distillation: Training a student model to imitate a teacher model’s outputs or distributions. "Like self-training \citep{schmidhuber2003exploring, schmidhuber2013powerplay, silver2016mastering, silver2018general} and knowledge distillation \citep{hinton2015distilling}, it learns from model-generated supervision,"
  • LLM-as-a-judge: A paradigm where an LLM evaluates and scores another model’s outputs. "Compared to LLM-as-a-judge \citep{zheng2023judging}, rubric-based scoring yields decomposed, specific criteria that mitigate instability and bias"
  • LLM judge: An independent LLM used to mark rubric criteria as satisfied or not. "self-proposed rubrics—binary, auditable criteria scored by an independent LLM judge"
  • Majority vote: Selection by choosing the most common answer among multiple outputs. "Unlike best-of-N \citep{long2022training} or majority vote \citep{wang2023}, it constructs a new answer that can depart from consensus."
  • Min(PPL): A selection heuristic that picks the sample with the lowest perplexity. "In min(PPL), we select the response with the lowest trajectory perplexity under the model."
  • Mutual predictability (MP): A selection rule that prefers the rollout best predicted from the others. "In mutual predictability (MP) \citep{wen2025unsupervised}, we select the rollout with the highest probability when the model is conditioned on all other responses."
  • Non-verifiable domains: Tasks where there is no deterministic checker or single ground truth. "In non-verifiable domains, where rule-based answer checking is infeasible, a few methods have established ways to score outputs against references."
  • Perplexity: A measure of how well a probability model predicts a sequence; lower is better. "Unlike selection methods (best-of-NN, majority, perplexity, or judge scores),"
  • PPO (Proximal Policy Optimization): A policy gradient algorithm using a clipped objective to stabilize updates. "GRPO \citep{shao2024deepseekmath} is a memory-efficient variant of PPO \citep{schulman2017proximal} that avoids a value network by using a group baseline."
  • Programmatic checker: An automated mechanism for verifying correctness (e.g., exact match or executable tests). "or verifiable rewards from programmatic checkers \citep{lambert2024tulu, shao2024deepseekmath}."
  • Programmatic equivalence: Automatic checking that two answers are equivalent by deterministic rules. "verifiable tasks use programmatic equivalence on final answers;"
  • Programmatic verifier: A deterministic function that returns a binary correctness signal. "Let v(o, s) ∈ {0, 1} be a programmatic verifier (e.g., final-answer equivalence via a simple string match or programmatic execution)."
  • Programmatic verification: Using formal or deterministic procedures to validate outputs. "CaT complements programmatic verification \citep{lambert2024tulu} by extending learning to non-verifiable domains where formal checkers are unavailable."
  • Reference-free supervision: Training signals derived without relying on human-provided reference answers. "which converts the model's own exploration at inference-time into reference-free supervision"
  • Rollout: A sampled response sequence generated by a policy for a given prompt. "the current policy produces a group of rollouts;"
  • Rubric rewards: Rewards computed from satisfying binary, auditable criteria rather than holistic judgments. "Rubric rewards decompose holistic judgment into auditable checks, mitigating verbosity and form bias"
  • Self-proposed rubrics: Rubrics generated by the model itself to define evaluable criteria. "non-verifiable tasks use self-proposed rubrics—binary, auditable criteria scored by an independent LLM judge, with reward given by the fraction satisfied."
  • Self-BoN: A baseline where the model selects its own best output among multiple samples. "Among alternatives, self-selected best-of-N (Self-BoN) is a self-proposed baseline in which the model selects its own best response."
  • Self-training: Learning from data or targets produced by the model itself. "Like self-training \citep{schmidhuber2003exploring, schmidhuber2013powerplay, silver2016mastering, silver2018general} and knowledge distillation \citep{hinton2015distilling}, it learns from model-generated supervision,"
  • Synthesized reference: A constructed target answer that reconciles multiple rollouts into a single estimate. "produces a synthesized reference s that reconciles omissions and contradictions across o_{1:G}."
  • Test-Time RL (TTRL): Reinforcement learning applied at inference time, often using consistency signals. "\citet{zuo2025ttrl} proposed Test-Time RL (TTRL), which uses self-consistent majority consensus answers \citep{wang2023} as label estimates for RL fine-tuning in math."
  • Trajectory perplexity: Perplexity computed over an entire generated sequence, used as a confidence proxy. "we select the response with the lowest trajectory perplexity under the model."
  • Value network: A learned critic that estimates expected return; avoided in GRPO by using a group baseline. "that avoids a value network by using a group baseline."
  • Verifiable domains: Tasks where outputs can be automatically checked for correctness. "Verifiable domains (e.g., math). We programmatically reward agreement of the response with the estimated reference, e.g., by checking whether answer strings match."