Compute as Teacher (CaT) in AI Supervision
- Compute as Teacher (CaT) is a method that uses a frozen anchor model to synthesize a reference signal from multiple model rollouts, enabling reference-free supervision.
- It leverages parallel exploration and synthesis to correct errors and integrate diverse outputs, thereby improving inference-time performance across tasks.
- The approach applies to both verifiable tasks, such as mathematical reasoning, and non-verifiable tasks, like free-form dialogue, enhancing overall model reliability.
Compute as Teacher (CaT) refers to a class of methods in machine learning and artificial intelligence that transform additional computation—typically in the form of parallel exploration or extra inference runs—into a source of supervision. Rather than depending solely on human-provided ground-truth labels or static references, CaT synthesizes a new reference signal directly from the model’s own outputs, or from a structured aggregation conducted by a frozen anchor model. This synthesized supervisory signal, which is reference-free, is then used both to improve inference-time performance and to provide a reward for post-training optimization via reinforcement learning. The core paradigm enables reference-free supervision in scenarios where ground truth is absent at inference or post-training time and can be applied to both verifiable tasks (such as mathematical reasoning) and non-verifiable tasks (such as free-form dialogue).
1. Conceptual Overview
CaT is predicated on leveraging the model’s own computational exploration at inference time as a “teacher” signal. Specifically, for every input prompt, the current policy generates a set of $G$ parallel rollouts $\{y_1, \dots, y_G\}$. These rollouts represent diverse plausible answers or solutions produced independently by the model. Crucially, instead of selecting the best among these rollouts using traditional methods (e.g., best-of-$n$, majority voting), CaT introduces a “synthesis” step: a fixed anchor model $\pi_0$ (typically the unadapted initial policy, held frozen) receives the set of rollouts and, via a specialized prompt, produces a single synthesized reference answer $\hat{y}$. This step reconciles information across diverse outputs: filling gaps, correcting errors, and integrating complementary evidence. The synthesized answer $\hat{y}$ then acts as a dynamic, reference-free “teacher” signal that can enhance inference-time predictions and serve as the basis for further training.
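To make the exploration-then-synthesis loop concrete, the following is a minimal Python sketch of the inference-time procedure; the `generate` helper, the `CaTConfig` container, and the synthesis-prompt wording are illustrative placeholders rather than the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class CaTConfig:
    n_rollouts: int = 8          # number of parallel rollouts G
    temperature: float = 1.0     # sampling temperature for exploration

def generate(model, prompt: str, temperature: float = 0.0) -> str:
    """Placeholder for a single sampled completion from `model`."""
    raise NotImplementedError

def cat_inference(policy, anchor, question: str, cfg: CaTConfig) -> str:
    # 1) Parallel exploration: the current policy samples G diverse rollouts.
    rollouts = [
        generate(policy, question, temperature=cfg.temperature)
        for _ in range(cfg.n_rollouts)
    ]

    # 2) Synthesis: the frozen anchor reconciles the rollouts into a single
    #    reference answer (it never sees any ground-truth label).
    synthesis_prompt = (
        "You are given several candidate solutions to the same problem.\n"
        "Reconcile them: fix errors, resolve contradictions, and combine\n"
        "complementary steps into one final answer.\n\n"
        + "\n\n".join(f"Candidate {i + 1}:\n{r}" for i, r in enumerate(rollouts))
    )
    return generate(anchor, synthesis_prompt, temperature=0.0)
```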
2. Technical Methodology
The CaT protocol comprises the following stages:
- Parallel Rollout Generation: For each input, the current policy samples a group of $G$ independent rollouts,
  $$\{y_1, \dots, y_G\} \sim \pi_\theta(\cdot \mid x),$$
  where $x$ is the input prompt and $\pi_\theta$ is the current policy.
- Reference Synthesis: A frozen anchor reconciles the rollouts into a single estimated reference,
  $$\hat{y} = \pi_0\!\big(p_{\text{syn}}(y_1, \dots, y_G)\big),$$
  where $\pi_0$ is the frozen anchor and $p_{\text{syn}}$ is the synthesis prompt, which excludes direct access to $x$.
- Reward Construction
  - Verifiable Tasks: For domains where an output can be programmatically verified (e.g., mathematics), an equivalence check against the synthesized reference is performed:
    $$R(y_i) = \mathrm{eq}(y_i, \hat{y}),$$
    where $\mathrm{eq}(\cdot,\cdot)$ is a binary equivalence function comparing $y_i$ to $\hat{y}$.
  - Non-Verifiable Tasks: For domains lacking verifiable outputs, a set of rubric criteria $\{c_1, \dots, c_m\}$ is synthesized from $\hat{y}$ using a rubric prompt $p_{\text{rub}}$. An independent judge model $J$ assesses whether $y_i$ satisfies each criterion. The reward is the fraction satisfied (see the sketch after this list):
    $$R(y_i) = \frac{1}{m} \sum_{j=1}^{m} J(y_i, c_j), \qquad J(y_i, c_j) \in \{0, 1\}.$$
- Integration with Reinforcement Learning: The synthesized supervision (verifiable or rubric-derived) is used as the reward in reinforcement learning, commonly via Group Relative Policy Optimization (GRPO), updating $\pi_\theta$ by maximizing a group-normalized objective of the form
  $$\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}\, A_i\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right), \qquad A_i = \frac{R(y_i) - \operatorname{mean}(R)}{\operatorname{std}(R)},$$
  where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence penalizing deviation from a reference policy $\pi_{\mathrm{ref}}$ (a group-relative advantage sketch follows below).
This decoupling of exploration ($\pi_\theta$ emits candidate solutions) and synthesis (the frozen anchor $\pi_0$ creates the reference) stabilizes the supervision and enables consistent improvement.
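The two reward regimes above can be sketched compactly as follows; `check_equivalence` (e.g., a math answer checker) and `judge` (a per-criterion LLM judge call) are assumed external components and are not specified by this summary.

```python
from typing import Callable, List

def verifiable_reward(rollout: str, reference: str,
                      check_equivalence: Callable[[str, str], bool]) -> float:
    # Binary reward: 1 if the rollout is programmatically equivalent to the
    # synthesized reference (e.g., same final answer), else 0.
    return 1.0 if check_equivalence(rollout, reference) else 0.0

def rubric_reward(rollout: str, criteria: List[str],
                  judge: Callable[[str, str], bool]) -> float:
    # Fraction of synthesized rubric criteria the rollout satisfies,
    # scored per criterion by an independent judge model.
    if not criteria:
        return 0.0
    satisfied = sum(1 for c in criteria if judge(rollout, c))
    return satisfied / len(criteria)
```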
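For the RL stage, the sketch below shows only the group-relative advantage normalization that GRPO-style updates use to weight the policy gradient; the importance ratios, clipping, and KL penalty are assumed to be handled by the surrounding RL training loop.

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize per-rollout rewards within one prompt's group of G rollouts:
    A_i = (R_i - mean(R)) / std(R)."""
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)
    if std_r == 0.0:
        # All rewards tie (all correct or all incorrect): no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean_r) / std_r for r in rewards]
```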
3. Task Domains and Reward Regimes
CaT is deployed differently depending on task verification:
- Verifiable Tasks: In mathematical reasoning or program synthesis, correctness can be checked programmatically. CaT synthesizes a reference answer and assigns binary rewards by matching output equivalence.
- Non-Verifiable Tasks: In open-ended tasks (e.g., medical advice, creative writing), the synthesized reference is parsed into auditable rubric criteria. An independent judge (LLM) scores outputs in a binary, per-criterion fashion. Reward is the proportion of rubrics satisfied.
This dual-regime reward system extends the reach of CaT to domains without explicit ground truth.
4. Empirical Performance and Scaling Properties
Experiments were conducted using Gemma 3 4B, Qwen 3 4B, and Llama 3.1 8B on both verifiable (MATH-500) and non-verifiable (HealthBench) benchmarks. Key findings:
- Inference-Time Improvements: On MATH-500, CaT provided up to +27% improvement relative to the initial policy. On HealthBench, the gain was up to +12%.
- Post-Training via RL: Integrating CaT-derived rewards into a reinforcement learning loop (CaT-RL) led to further gains—up to +33% (MATH-500) and +30% (HealthBench).
- Surpassing the Teacher: Policies trained via CaT-RL can exceed the initial synthesized teacher reference signal, indicating that the RL process enables further optimization beyond the original anchor.
- Scalability: Performance scales with the number of rollouts used for synthesis, especially in verifiable tasks. Diversity among rollouts increases the likelihood that omissions, contradictions, or errors can be collectively corrected by the anchor, leading to higher-quality references. In non-verifiable domains, benefits may plateau after a moderate number of rollouts.
5. Comparison to Selection-Based Strategies
Conventional post-selection methods—best-of-$n$, majority voting, perplexity minimization, or judge scoring—restrict output to one of the original rollouts. In contrast, CaT’s synthesis step constructs a new answer potentially absent from all rollouts. Key advantages:
- Error Correction: The synthesized answer can disagree with the majority, correcting systematic errors across all outputs.
- Information Integration: Complementary evidence scattered across rollouts can be integrated—compensating for contradictions or partial omissions.
- Majority Disagreement: Empirical instances revealed cases where synthesis produced a correct answer despite no correct rollout among candidates—a capability selection-based schemes lack.
Empirical results consistently indicated that CaT synthesis outperforms these baseline strategies across evaluation metrics.
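For contrast, the selection-based baselines can be sketched in a few lines; note that each returns one of the original rollouts verbatim, which is precisely the limitation that synthesis removes. The `score` argument is a stand-in for any scoring function, such as a reward model or negative perplexity.

```python
from collections import Counter
from typing import Callable, List

def majority_vote(final_answers: List[str]) -> str:
    # Return the most frequent final answer among the rollouts.
    return Counter(final_answers).most_common(1)[0][0]

def best_of_n(rollouts: List[str], score: Callable[[str], float]) -> str:
    # Return the highest-scoring rollout under the given scoring function.
    return max(rollouts, key=score)
```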
6. Significance and Implications
CaT establishes a practical protocol for reference-free supervision when explicit ground truth is unavailable at test time or during post-training adaptation. Its design transforms “extra compute” at inference (parallel rollouts) into a high-quality teacher signal by leveraging synthesis from an independent, frozen anchor. The framework accommodates both domains where verification functions exist and domains where only audit-based rubrics are available.
This suggests several implications:
- Test-Time Adaptivity: Models may improve in real time by investing extra inference compute without external annotation.
- Scaling Laws: Utility grows with available compute and number of rollouts, indicating a trade-off between computation and data annotation.
- Post-Training Optimization: CaT can serve as a reference for further RL fine-tuning in the absence of external supervision, enabling continual improvement.
A plausible implication is that this paradigm could generalize to other settings where ensemble synthesis and anchor models can reconcile exploration outputs, introducing new strategies for self-improving systems during deployment.
7. Limitations and Future Research
While CaT demonstrates compelling gains and scalability, some open questions remain:
- Synthesis Quality: The quality and stability of the synthesized reference depend on the robustness of the anchor and the diversity among the rollouts. Poorly initialized anchors or low-variance exploration may limit performance.
- Rubric Engineering: In non-verifiable domains, the fidelity of rubric extraction by the anchor and its auditability by judges is critical. Misalignment here may introduce bias or noise in reward construction.
- Computational Cost: The method trades off more inference compute (multiple rollouts per prompt) against annotation or offline data curation. For large models or latency-sensitive applications, the cost-benefit balance requires careful tuning.
Future work may focus on:
- Dynamic selection of the number of rollouts based on uncertainty estimation.
- Anchors that are separately optimized or ensemble-based for greater synthesis robustness.
- Automated metric development for judging synthesis quality in the absence of ground truth.
This encapsulation of Compute as Teacher reflects its formal foundations, technical protocols, and empirical findings, as well as its role within the broader landscape of reference-free supervision and test-time adaptation (Jayalath et al., 17 Sep 2025).