Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Published 17 Jun 2026 in cs.AI and cs.CL | (2606.19327v1)

Abstract: Post-training of reasoning LLMs is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verifiable rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks. The results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.


Authors (5)

Summary

The paper introduces RCSD, which reframes rubrics as a source of dense, token-level guidance in LLM post-training.
RCSD employs a two-stage pipeline with rubric generation and rubric-guided reasoning, yielding a 4.7-point improvement on Qwen3-8B.
The framework outperforms GRPO and OPSD by 1.4 and 0.9 points respectively, and shows robust transferability across diverse benchmarks.

Rubric-Conditioned Self-Distillation for Reasoning LLMs

Motivation and Problem Formulation

The paper "Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation" (2606.19327) addresses a persistent limitation in post-training LLMs for reasoning tasks: the bottleneck of sparse, outcome-level reward assignment and the restrictive nature of reference-conditioned supervision. Reinforcement learning approaches (e.g., GRPO) traditionally optimize models against scalar rewards derived from binary correctness or execution verification, with little granularity in credit assignment. On-policy distillation frameworks (OPD/OPSD) offer denser token-level supervision but are typically conditioned on a single privileged reference trajectory, which constrains the space of valid responses and over-specifies the reasoning path.

Rubrics, structured textual sets of evaluation criteria, have recently emerged as a more interpretable and fine-grained supervision interface, especially in domains where answer quality cannot be checked by simple verification. However, prior methods have reduced rubric judgments to scalar rewards or used rubrics purely as outcome-level signals, discarding rich criterion-level information during optimization. This work reframes rubrics as teacher-side privileged supervision for on-policy self-distillation, with the aim of preserving criterion-aware, token-level structure throughout model training.

Rubric-Conditioned Self-Distillation Framework

The proposed Rubric-Conditioned Self-Distillation (RCSD) framework introduces a two-stage pipeline:

Rubric Generation: A rubric generator is trained to produce instance-specific evaluation criteria. It has privileged access to reference answers during training, but at inference it generates rubrics from the question alone, which makes rubric production scalable.
Rubric-Guided Reasoner Training: A reasoner model is trained to generate responses satisfying the learned rubric criteria. During optimization, the teacher model is conditioned on the rubric and provides dense token-level guidance to the student on its own rollouts, allowing learning signals to reflect multiple evaluation dimensions rather than a single scalar reward or reference trajectory.
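The asymmetry between the two roles in the second stage can be illustrated with a minimal sketch. It assumes, as is usual in self-distillation, that teacher and student share the same prompt format and differ only in context: the student sees the question, while the teacher additionally sees the rubric. The `Example` class, the function names, and the prompt templates are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    rubric: list[str]  # criterion-level rubric items (hypothetical format)


def student_prompt(ex: Example) -> str:
    """Student context: the question only."""
    return f"Question: {ex.question}\nAnswer:"


def teacher_prompt(ex: Example) -> str:
    """Teacher context: the rubric as privileged information, then the same question."""
    criteria = "\n".join(f"- {c}" for c in ex.rubric)
    return f"Rubric:\n{criteria}\n\n{student_prompt(ex)}"


ex = Example(
    question="Why does ice float on water?",
    rubric=["States that ice is less dense than liquid water",
            "Explains the role of hydrogen bonding"],
)
print(student_prompt(ex))
print(teacher_prompt(ex))
```

In this sketch the student's rollout would be sampled from `student_prompt(ex)`, and the teacher would then score that same rollout token by token from `teacher_prompt(ex)`, so the rubric influences the learning signal without ever appearing in the student's input.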

Unlike RL-based reward optimization, RCSD leverages rubrics as structured, multi-dimensional feedback, which facilitates criterion-aware supervision on student-generated trajectories. The token-level distillation objective minimizes the KL divergence between the teacher's rubric-conditioned distribution and the student's distribution, encouraging the student to cover the support of the teacher's outputs across valid reasoning paths.
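One way to write the objective described above is sketched here. The notation is an assumption made for illustration and is not copied from the paper: $x$ is the question, $r$ the rubric, $y$ a rollout sampled from the student $\pi_\theta$, and $\pi_{\mathrm{T}}$ the rubric-conditioned teacher. The KL direction shown is the forward one (teacher first), which matches the support-covering behavior described in the text.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}
    \left[
      \sum_{t=1}^{|y|}
      D_{\mathrm{KL}}\!\left(
        \pi_{\mathrm{T}}(\cdot \mid x, r, y_{<t})
        \,\big\|\,
        \pi_\theta(\cdot \mid x, y_{<t})
      \right)
    \right]
```

The rubric $r$ appears only in the teacher's conditioning, so the student is trained to reproduce rubric-aware next-token distributions from the question and its own prefix alone.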

Experimental Results and Numerical Analysis

RCSD was evaluated on diverse reasoning and science benchmarks (GPQA-D, SciBench, PIQA, RaR, ResearchQA, RubricHub), using Qwen3-8B as the backbone. RCSD achieves the highest overall average score (70.6), outperforming GRPO by 1.4 points and OPSD by 0.9 points. The gains are particularly pronounced in rubric-based and open-ended reasoning tasks, where response quality is not captured by outcome-level rewards alone. RCSD improves the base model Qwen3-8B by 4.7 points, with notable improvements on ResearchQA (+8.2) and RubricHub (+4.9).
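For orientation, the baseline averages implied by these figures can be recovered by subtraction. The short sketch below uses only the RCSD average and the gaps quoted in this section; the resulting values are derived here, not reported scores, and are subject to rounding in the quoted numbers.

```python
# RCSD average and the gaps over each baseline, as quoted in this section.
rcsd_avg = 70.6
gaps = {"GRPO": 1.4, "OPSD": 0.9, "Qwen3-8B (base)": 4.7}

# Implied baseline average = RCSD average minus the quoted gap.
implied = {name: round(rcsd_avg - gap, 1) for name, gap in gaps.items()}

for name, score in implied.items():
    print(f"{name}: {score}")
```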

Ablation studies favor forward KL as the distillation objective: it yields broader output distributions and stable training dynamics. RCSD also maintains competitive performance on medical-domain benchmarks (MedMCQA, PubMedQA), which suggests that the training transfers to adjacent domains without catastrophic forgetting.
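The link between forward KL and broader output distributions can be seen on a toy example. The sketch below is a generic illustration of the two KL directions, not the paper's training code; the three-token distributions and the function names are made up for the purpose. A teacher with two equally good continuations is compared against a student that covers both and a student that has collapsed onto one.

```python
import math


def kl(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def token_level_forward_kl(teacher_dists, student_dists) -> float:
    """Sum of per-position forward KL(teacher || student) over one rollout."""
    return sum(kl(t, s) for t, s in zip(teacher_dists, student_dists))


# Toy next-token distributions over a 3-token vocabulary (illustrative values).
teacher = [0.49, 0.49, 0.02]    # two valid continuations
covering = [0.45, 0.45, 0.10]   # student spreads mass over both
collapsed = [0.96, 0.02, 0.02]  # student commits to a single mode

print("forward KL, covering :", kl(teacher, covering))
print("forward KL, collapsed:", kl(teacher, collapsed))
print("reverse KL, collapsed:", kl(collapsed, teacher))
```

Forward KL assigns the collapsed student a much larger penalty than the covering one, and a larger penalty than reverse KL assigns to the same collapsed student. Minimizing it therefore pushes the student to keep probability on every continuation the teacher considers valid.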

The effectiveness of RCSD persists across scales (1.7B, 4B, 8B), with consistent improvements over baseline models at all sizes. The learned rubric generator approaches the effectiveness of manually curated reference rubrics, with only minor performance gaps, further validating the scalability of rubric generation and minimizing the dependence on handcrafted annotations.

Sensitivity analyses reveal that even imperfect rubric supervision (generic, noisy, random, reduced variants) contributes to performance improvements over the base model, though the greatest benefit is derived from instance-specific learned rubrics. Notably, RCSD remains effective with concise rubrics, indicating that quality and relevance of criteria outweigh rubric verbosity.

Theoretical and Practical Implications

RCSD represents a shift from outcome-level reward supervision or path-specific reference distillation to criterion-aware, verifier-free post-training. By conditioning teacher signals on rubric criteria, the framework facilitates finer-grained credit assignment and preserves response diversity across valid reasoning paths. This is theoretically grounded in robust distillation objectives that maintain support over rubric-conditioned teacher distributions, rather than collapsing to dominant modes or overfitting reference solutions.

Practically, RCSD scales well because its rubric generator is automated, removing the need for manual rubric annotation or for a large teacher model that produces rubrics directly. The dense token-level guidance improves internal response consistency and efficiency, as evidenced by case studies comparing solution trajectories on scientific reasoning tasks. RCSD responses are shorter, less repetitive, and more numerically stable than those of both OPSD and the base model.

Future Directions and Impact on AI Research

The introduction of rubrics as privileged supervision interfaces opens new avenues for LLM post-training in domains where evaluation is inherently subjective, multi-faceted, or challenging to formalize. RCSD advances a methodology for leveraging structured textual feedback that is compatible with open-ended question answering, scientific problem-solving, and instructional tasks. Further research may extend RCSD to interactive settings with adaptive rubric refinement, explore rubric conditioning in multi-agent reasoning, and develop more compact yet task-optimal rubric generation models.

The criterion-aware self-distillation paradigm is broadly applicable to improving internal chain-of-thought quality, interpretability, and response efficiency in LLMs, and offers a principled alternative to reward-model dependence or reference-answer imitation. RCSD suggests that fine-grained supervision interfaces, grounded in structured textual rubrics, are critical for advancing LLM reasoning beyond simple correctness and toward multi-dimensional response excellence.

Conclusion

Rubric-Conditioned Self-Distillation leverages rubrics as structured, privileged teacher-side supervision for dense, criterion-aware token-level guidance during LLM post-training. By operationalizing fine-grained evaluation criteria through a scalable rubric generator, RCSD avoids scalar reward bottlenecks and path-specific bias, achieving the highest average score among the compared methods on diverse reasoning benchmarks while generalizing across domains and model scales. The method sets a foundation for future research on verifier-free, structured, multi-dimensional LLM post-training via rubric interfaces, with substantial implications for reasoning quality, supervision efficiency, and interpretability in modern AI systems (2606.19327).
