Compute as Teacher (CaT) in AI Supervision

Updated 18 September 2025
  • Compute as Teacher (CaT) is a method that uses a frozen anchor model to synthesize a reference signal from multiple model rollouts, enabling reference-free supervision.
  • It leverages parallel exploration and synthesis to correct errors and integrate diverse outputs, thereby improving inference-time performance across tasks.
  • The approach applies to both verifiable tasks, such as mathematical reasoning, and non-verifiable tasks, like free-form dialogue, enhancing overall model reliability.

Compute as Teacher (CaT) refers to a class of methods in machine learning and artificial intelligence that transform additional computation—typically in the form of parallel exploration or extra inference runs—into a source of supervision. Rather than depending solely on human-provided ground-truth labels or static references, CaT synthesizes a new reference signal directly from the model’s own outputs, or from a structured aggregation conducted by a frozen anchor model. This synthesized, reference-free supervisory signal is then used both to improve inference-time performance and to provide a reward for post-training optimization via reinforcement learning. The core paradigm enables reference-free supervision in scenarios where ground truth is absent at inference or post-training time and can be applied to both verifiable tasks (such as mathematical reasoning) and non-verifiable tasks (such as free-form dialogue).

1. Conceptual Overview

CaT is predicated on leveraging the model’s own computational exploration at inference time as a “teacher” signal. Specifically, for every input prompt, the current policy π_t generates a set of G parallel rollouts {o_1, ..., o_G}. These rollouts represent diverse plausible answers or solutions produced independently by the model. Crucially, instead of selecting the best among these rollouts using traditional methods (e.g., best-of-N, majority voting), CaT introduces a “synthesis” step: a fixed anchor model π_0, typically the unadapted initial policy held frozen, receives the set of rollouts and, via a specialized prompt, produces a single synthesized reference answer s. This step reconciles information across diverse outputs: filling gaps, correcting errors, and integrating complementary evidence. The synthesized answer s then acts as a dynamic, reference-free “teacher” signal that can enhance inference-time predictions and serve as the basis for further training.

2. Technical Methodology

The CaT protocol comprises the following stages:

  1. Parallel Rollout Generation

o_i \sim \pi_t(\cdot \mid q) \quad \text{for} \quad i = 1, \dots, G

where q is the input prompt and π_t is the current policy.
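
A minimal Python sketch of this stage (the `policy_generate(prompt, temperature)` callable is a hypothetical stand-in for the current policy π_t; the paper does not prescribe an interface):

```python
def parallel_rollouts(policy_generate, q, G=8, temperature=1.0):
    """Stage 1: sample G independent rollouts o_1, ..., o_G from the current policy.

    `policy_generate` is a hypothetical (prompt, temperature) -> str callable;
    diversity across the rollouts comes from sampling at a nonzero temperature.
    """
    return [policy_generate(q, temperature=temperature) for _ in range(G)]
```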

  2. Synthesis by Anchor Model

s \sim \pi_0(\cdot \mid p_{\text{syn}}, o_{1:G})

where π_0 is the frozen anchor and p_syn is the synthesis prompt, which excludes direct access to q.
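
A corresponding sketch of the synthesis call, using the same hypothetical generation interface for the frozen anchor; the prompt wording is illustrative, not the paper’s, and only the rollouts are passed in (the original query q is withheld):

```python
# Illustrative synthesis prompt; the paper's exact wording is not reproduced here.
SYNTHESIS_PROMPT = (
    "You are given several independent attempts at the same task.\n"
    "Reconcile them into a single best answer: fill gaps, resolve "
    "contradictions, and correct errors.\n\n{rollouts}"
)

def synthesize_reference(anchor_generate, rollouts):
    """Stage 2: the frozen anchor maps the rollout set to one reference answer s."""
    joined = "\n\n".join(f"Attempt {i + 1}:\n{o}" for i, o in enumerate(rollouts))
    # Greedy decoding (temperature 0) keeps the synthesized reference deterministic.
    return anchor_generate(SYNTHESIS_PROMPT.format(rollouts=joined), temperature=0.0)
```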

  3. Reward Construction

    • Verifiable Tasks: For domains where an output can be programmatically verified (e.g., mathematics), an equivalence check is performed:

    R_{\text{ver}}(o; s) = v(o, s)

    where v is a binary equivalence function comparing o to s.

    • Non-Verifiable Tasks: For domains lacking verifiable outputs, a set of rubric criteria R = {r_1, ..., r_n} is synthesized from s using a rubric prompt p_rub. An independent judge model π_J assesses whether o satisfies each criterion. The reward is the fraction of criteria satisfied:

    R_{\text{rub}}(o; \mathcal{R}) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[ \pi_J(p_J; o, r_i) = \text{"yes"} \right]
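
The two reward regimes can be sketched as follows; `equivalent` (the checker v) and `judge_yes_no` (a wrapper around the judge model) are hypothetical domain-specific helpers, not components specified by the paper:

```python
def verifiable_reward(o, s, equivalent):
    """R_ver(o; s) = v(o, s): binary equivalence of a rollout o against the
    synthesized reference s, using a domain-specific checker `equivalent`."""
    return 1.0 if equivalent(o, s) else 0.0


def rubric_reward(o, rubric, judge_yes_no):
    """R_rub(o; R): fraction of rubric criteria r_1..r_n the judge says o satisfies.

    `rubric` holds criteria extracted from the synthesized reference s, and
    `judge_yes_no(output, criterion)` queries an independent judge model per criterion.
    """
    if not rubric:
        return 0.0
    return sum(bool(judge_yes_no(o, r)) for r in rubric) / len(rubric)
```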

  4. Integration with Reinforcement Learning

The synthesized supervision (verifiable or rubric-derived) is used as the reward in reinforcement learning, commonly via Group Relative Policy Optimization (GRPO), updating π_t as:

J_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\, o_{1:G}}\big[\text{token-level objective}\big] - \beta\, D_{\text{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]

where D_KL is the Kullback-Leibler divergence penalizing deviation from the reference policy π_ref.
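
A sketch of the group-relative advantage computation at the heart of a GRPO-style update; the full token-level objective and the KL estimator are framework-specific and omitted here:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its siblings from the same prompt:
    A_i = (R_i - mean(R)) / (std(R) + eps).
    Rollouts that beat their group's mean receive positive advantage."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# The policy-gradient step then weights token log-probabilities by these advantages
# and subtracts beta * D_KL(pi_theta || pi_ref), as in the objective above.
```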

This decoupling of exploration (π_t emits candidate solutions) and synthesis (the frozen π_0 creates the reference) stabilizes the supervision and enables consistent improvement.

3. Task Domains and Reward Regimes

CaT is deployed differently depending on task verification:

  • Verifiable Tasks: In mathematical reasoning or program synthesis, correctness can be checked programmatically. CaT synthesizes a reference answer and assigns binary rewards by matching output equivalence.
  • Non-Verifiable Tasks: In open-ended tasks (e.g., medical advice, creative writing), the synthesized reference is parsed into auditable rubric criteria. An independent judge (LLM) scores outputs in a binary, per-criterion fashion, and the reward is the proportion of criteria satisfied.

This dual-regime reward system extends the reach of CaT to domains without explicit ground truth.
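
A small dispatcher makes the dual-regime choice explicit; it reuses the hypothetical `verifiable_reward`, `rubric_reward`, and judge helpers sketched in Section 2, plus an assumed `make_rubric` extractor:

```python
def cat_reward(o, s, equivalent=None, make_rubric=None, judge_yes_no=None):
    """Pick the reward regime per task, given a synthesized reference s."""
    if equivalent is not None:
        # Verifiable regime: binary programmatic match against the reference.
        return verifiable_reward(o, s, equivalent)
    # Non-verifiable regime: parse s into auditable criteria, then judge each one.
    rubric = make_rubric(s)
    return rubric_reward(o, rubric, judge_yes_no)
```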

4. Empirical Performance and Scaling Properties

Experiments were conducted using Gemma 3 4B, Qwen 3 4B, and Llama 3.1 8B on both verifiable (MATH-500) and non-verifiable (HealthBench) benchmarks. Key findings:

  • Inference-Time Improvements: On MATH-500, CaT provided up to +27% improvement relative to the initial policy. On HealthBench, the gain was up to +12%.
  • Post-Training via RL: Integrating CaT-derived rewards into a reinforcement learning loop (CaT-RL) led to further gains—up to +33% (MATH-500) and +30% (HealthBench).
  • Surpassing the Teacher: Policies trained via CaT-RL can exceed the performance of the initial synthesized teacher signal, indicating that the RL process enables further optimization beyond the original anchor.
  • Scalability: Performance scales with the number of rollouts G used for synthesis, especially in verifiable tasks. Diversity among rollouts increases the likelihood that omissions, contradictions, or errors can be collectively corrected by the anchor, leading to higher-quality references. In non-verifiable domains, benefits may plateau after a moderate number of rollouts.

5. Comparison to Selection-Based Strategies

Conventional post-selection methods—best-of-N, majority voting, perplexity minimization, or judge scoring—restrict output to one of the G original rollouts. In contrast, CaT’s synthesis step constructs a new answer potentially absent from all rollouts. Key advantages:

  • Error Correction: The synthesized answer can disagree with the majority, correcting systematic errors across all G outputs.
  • Information Integration: Complementary evidence scattered across rollouts can be integrated—compensating for contradictions or partial omissions.
  • Majority Disagreement: Empirical instances revealed cases where synthesis produced a correct answer despite no correct rollout among candidates—a capability selection-based schemes lack.

Empirical results consistently indicated that CaT synthesis outperforms these baseline strategies across evaluation metrics.
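
To make the contrast concrete, a sketch of a majority-vote baseline is shown below; by construction it can only return an answer already present among the G rollouts, whereas the synthesis step above can produce a new one (`extract_final_answer` is a hypothetical domain-specific parser):

```python
from collections import Counter

def majority_vote(rollouts, extract_final_answer):
    """Selection baseline: return the most common final answer among the rollouts.

    An error shared by the majority of rollouts cannot be corrected here, and the
    returned answer is always one of the original candidates."""
    answers = [extract_final_answer(o) for o in rollouts]
    return Counter(answers).most_common(1)[0][0]
```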

6. Significance and Implications

CaT establishes a practical protocol for reference-free supervision when explicit ground truth is unavailable at test time or during post-training adaptation. Its design transforms “extra compute” at inference (parallel rollouts) into a high-quality teacher signal by leveraging synthesis from an independent, frozen anchor. The framework accommodates both domains where verification functions exist and domains where only audit-based rubrics are available.

This suggests several implications:

  • Test-Time Adaptivity: Models may improve in real time by investing extra inference compute without external annotation.
  • Scaling Laws: Utility grows with available compute and the number of rollouts, indicating that additional computation can be traded against data annotation.
  • Post-Training Optimization: CaT can serve as a reference for further RL fine-tuning in the absence of external supervision, enabling continual improvement.

A plausible implication is that this paradigm could generalize to other settings where ensemble synthesis and anchor models can reconcile exploration outputs, introducing new strategies for self-improving systems during deployment.

7. Limitations and Future Research

While CaT demonstrates compelling gains and scalability, some open questions remain:

  • Synthesis Quality: The quality and stability of the synthesized reference depend on the robustness of the anchor and the diversity among the G rollouts. A poorly initialized anchor or low-variance exploration may limit performance.
  • Rubric Engineering: In non-verifiable domains, the fidelity of rubric extraction by the anchor and its auditability by judges are critical. Misalignment here may introduce bias or noise in reward construction.
  • Computational Cost: The method trades off more inference compute (multiple rollouts per prompt) against annotation or offline data curation. For large models or latency-sensitive applications, the cost-benefit balance requires careful tuning.

Future work may focus on:

  • Dynamic selection of the number of rollouts based on uncertainty estimation.
  • Anchors that are separately optimized or ensemble-based for greater synthesis robustness.
  • Automated metric development for judging synthesis quality in the absence of ground truth.

This encapsulation of Compute as Teacher reflects its formal foundations, technical protocols, and empirical findings, as well as its role within the broader landscape of reference-free supervision and test-time adaptation (Jayalath et al., 17 Sep 2025).
