Rubric-Guided Self-Distillation: Post-Training Without Verifier Overhead

This lightning talk introduces Rubric-Guided Self-Distillation (RGSD), a novel post-training method that removes the expensive language model verifier while achieving competitive rubric satisfaction on open-ended tasks. RGSD treats the rubric as privileged context for dense token-level supervision rather than as a sparse evaluation signal. It matches or exceeds traditional verifier-driven approaches at roughly an order of magnitude lower computational cost, and it avoids verbosity drift and reward-hacking pathologies.

Script

Training language models on rubric-graded tasks has traditionally required expensive verifier models to score every single response, creating a computational bottleneck that can mean over 1 million judge calls per domain. What if we could skip the judge entirely and still get better results?

The authors identify an empirical pattern they call the conditioning gap. When a model sees the rubric alongside the prompt, its responses score 30 to 45 percentage points higher on rubric satisfaction than when it sees the prompt alone, even though the model is exactly the same.
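To make the conditioning gap concrete, here is a minimal sketch of how it could be computed. The function names and the toy pass/fail data are hypothetical and are not taken from the paper; the sketch assumes each response has already been graded against a list of rubric criteria.

```python
def satisfaction_rate(graded_responses):
    """Fraction of rubric criteria met, pooled over all responses.

    graded_responses: list of lists of booleans, one inner list per
    response, one boolean per rubric criterion (hypothetical format).
    """
    total = sum(len(response) for response in graded_responses)
    met = sum(sum(response) for response in graded_responses)
    return met / total


def conditioning_gap(with_rubric, without_rubric):
    """Gap in percentage points between rubric-conditioned and
    prompt-only responses from the same model."""
    return 100.0 * (satisfaction_rate(with_rubric) - satisfaction_rate(without_rubric))


# Toy, made-up grades for two responses with four criteria each.
with_rubric = [[True, True, True, False], [True, True, False, True]]
without_rubric = [[True, False, False, False], [True, True, False, False]]

gap = conditioning_gap(with_rubric, without_rubric)
print(f"conditioning gap: {gap:.1f} percentage points")
```

With these toy grades the rubric-conditioned responses meet 6 of 8 criteria and the prompt-only responses meet 3 of 8, so the gap is 37.5 percentage points.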

Rubric-Guided Self-Distillation exploits this gap by creating a frozen teacher that sees the rubric and a trainable student that does not. At every token position, the student learns to match the teacher's distribution, internalizing rubric guidance through dense supervision rather than sparse end-of-trajectory scores.
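The token-level matching described here can be sketched as a per-position divergence between the teacher's and the student's next-token distributions. This is an illustration under stated assumptions, not the authors' implementation: the script does not say which divergence RGSD uses, so the sketch assumes a forward KL from teacher to student, and the logits are toy values over a three-token vocabulary.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position (assumed loss)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over a response: one loss term per position,
    in contrast to a single score at the end of the trajectory."""
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)


# Toy logits. In real training the teacher logits would come from the
# frozen model given rubric + prompt + response prefix, and the student
# logits from the trainable model given prompt + response prefix only.
teacher_seq = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_seq = [[1.0, 1.0, 0.0], [0.2, 0.4, 0.9]]

loss = distillation_loss(teacher_seq, student_seq)
print(f"mean per-token KL: {loss:.4f}")
```

In an actual training loop, gradients would flow only through the student logits, since the teacher is frozen. The loss is zero exactly when the student reproduces the teacher's distribution at every position.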

Across medical and science domains, RGSD matches or exceeds the rubric satisfaction of traditional judge-driven training while using just one rollout per prompt instead of sixteen and zero judge calls instead of over a million. The efficiency gain is an order of magnitude.

But there's a hidden problem with judge-driven methods. GRPO inflates response length by 1.4 to 2.3 times without improving scores, and it raises the false claim rate from 30 percent to 45 percent on medical tasks. Under RGSD, the false claim rate rises more modestly, to 35 percent, avoiding the reward-hacking pathology.

Rubric-Guided Self-Distillation reframes rubrics from evaluative instruments into teacher context, showing that dense conditioning signals can replace external scalar rewards when the base model's rubric lift is strong. To explore how this paradigm shift applies to your own research or to create videos like this one, visit EmergentMind.com.