Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

Published 5 Jun 2026 in cs.AI | (2606.07000v1)

Abstract: Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable rewards provides little token-level supervision for failed rollouts, often leading to inefficient exploration in complex multimodal reasoning tasks. Although policy distillation can offer dense guidance, external-teacher-based methods introduce substantial computational overhead, while answer-conditioned tuning methods may expose answer-level information and induce shortcut-like generation behavior. To address these limitations, we propose PTD-PO, a Privileged Tutoring Distillation Policy Optimization framework for RLVR that provides dense guidance without exposing the answer to the student policy. Specifically, PTD-PO constructs structured privileged hints from spatial attention guidance and intermediate textual reasoning steps, and uses them through in-context learning to produce step-wise token-distribution supervision. The student is still optimized under the original answer-free context, and its failed rollouts are aligned with the hint-augmented reference model at the token-distribution level. To further stabilize distillation under the distribution shift between guided and unguided contexts, we introduce a Top-K Jensen-Shannon divergence objective that focuses alignment on informative token probabilities while reducing memory overhead. Experiments on LVLMs ranging from 2B to 8B parameters show that PTD-PO consistently outperforms RLVR and distillation baselines, mitigates entropy collapse, and improves complex multimodal reasoning performance.


Authors (7)

Summary

The paper introduces PTD-PO, a privileged tutoring framework that provides token-level, answer-free hints to improve multimodal policy learning.
It employs Group Relative Policy Optimization with a Top-K Jensen-Shannon divergence objective to convert failed rollouts into actionable learning signals.
Empirical tests on vision-language models show PTD-PO achieving higher accuracy (e.g., 74.85% average on vision-dependent reasoning at the 8B scale vs. 71.25% for GRPO) and robust reasoning improvements.

Privileged Tutoring Distillation for Multimodal Policy Optimization

Motivation and Problem Formulation

Recent developments in multimodal RL post-training, especially Reinforcement Learning with Verifiable Rewards (RLVR), have improved the complex reasoning of large vision-language models (LVLMs). While RLVR removes the dependence on costly human preference data, it provides only sparse scalar rewards at the answer level and therefore no actionable token-level guidance for failed episodes. This reward sparsity hinders credit assignment and frequently leads to ineffective exploration and stagnation in high-dimensional reasoning spaces. Existing remedies have their own costs: on-policy distillation with external teachers provides denser supervision but incurs substantial compute and deployment overhead, while answer-conditioned self-distillation (e.g., hybrid distillation) exposes ground-truth answers, which can cause shortcut bias, mode collapse, and reduced exploration diversity.

PTD-PO: Architecture and Mechanism

The PTD-PO (Privileged Tutoring Distillation Policy Optimization) framework addresses these issues with a privileged-information paradigm. Rather than exposing answers or full solution chains, PTD-PO supplies structured, answer-free hints as privileged context for the distillation teacher. These hints encode spatial attention cues and intermediate reasoning guidance, and are generated offline by strong expert models under explicit zero-spoiler constraints. Critically, this privileged information is available only to the reference teacher during distillation; it is never shown to the student policy and is absent from rollout prompts.
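The role separation described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Example` record, the prompt templates, and the crude substring-based `leaks_answer` check are hypothetical and are not the paper's hint-generation pipeline, which relies on expert models under zero-spoiler constraints. The image input is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """Hypothetical training record (field names are assumptions)."""
    question: str
    hint: str    # answer-free hint, assumed to be produced offline
    answer: str  # ground truth, used only by the reward verifier


def student_context(ex: Example) -> str:
    # The student sees only the original, answer-free prompt.
    return f"Question: {ex.question}\n"


def teacher_context(ex: Example) -> str:
    # The reference teacher additionally receives the privileged hint.
    return f"Hint: {ex.hint}\nQuestion: {ex.question}\n"


def leaks_answer(ex: Example) -> bool:
    # Crude illustrative filter: flag hints that contain the answer string.
    # A real zero-spoiler check would need to be far more careful.
    return ex.answer.strip().lower() in ex.hint.lower()


ex = Example(
    question="What is the area of the shaded triangle?",
    hint="Locate the right angle first, then read off the base and height.",
    answer="24",
)
print(student_context(ex))
print(teacher_context(ex))
print("hint leaks answer:", leaks_answer(ex))
```

The only point of the sketch is that the two contexts differ solely by the hint, and that the answer string is used for neither prompt.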

In operation, the student policy runs under the original, answer-free context and is updated with GRPO (Group Relative Policy Optimization) on trajectories partitioned by verifiable reward outcome. Failed rollouts receive additional dense, token-level supervision: their token distributions are matched to those of the hint-augmented self-teacher using a Top-K Jensen-Shannon divergence objective with tail compensation. This objective is designed to avoid the instabilities and over-regularization that directional KL divergences exhibit when the two contexts are highly asymmetric. Only failed trajectories (or those sampled in groups with overall low success) are distilled, so that reward-driven successful behaviors remain unconstrained by the privileged context.
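One plausible reading of a Top-K Jensen-Shannon objective with a tail bucket is sketched below in NumPy. The details are assumptions rather than the paper's exact formulation: the support is taken to be the teacher's top-K tokens, all remaining probability mass is folded into a single tail bucket on each side, and the function name `topk_jsd_with_tail` and the value of `k` are hypothetical.

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def topk_jsd_with_tail(student_logits, teacher_logits, k=8, eps=1e-12):
    """Per-token JSD over the teacher's top-k tokens plus one tail bucket.

    Both inputs have shape (T, V); the result has shape (T,).
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)

    # Assumption: the teacher's k most probable tokens define the support.
    idx = np.argpartition(-p_t, k - 1, axis=-1)[:, :k]
    s_top = np.take_along_axis(p_s, idx, axis=-1)
    t_top = np.take_along_axis(p_t, idx, axis=-1)

    # Tail compensation: fold the remaining mass into one extra bucket so
    # that both (k + 1)-way vectors still sum to one.
    s_tail = np.clip(1.0 - s_top.sum(axis=-1, keepdims=True), 0.0, 1.0)
    t_tail = np.clip(1.0 - t_top.sum(axis=-1, keepdims=True), 0.0, 1.0)
    s = np.concatenate([s_top, s_tail], axis=-1)
    t = np.concatenate([t_top, t_tail], axis=-1)

    m = 0.5 * (s + t)  # mixture distribution

    def kl(a, b):
        return np.sum(a * (np.log(a + eps) - np.log(b + eps)), axis=-1)

    return 0.5 * kl(s, m) + 0.5 * kl(t, m)


rng = np.random.default_rng(0)
student = rng.normal(size=(5, 50))  # 5 token positions, toy vocabulary of 50
teacher = rng.normal(size=(5, 50))
print(topk_jsd_with_tail(student, teacher, k=8))
```

Under these assumptions only K + 1 probabilities per position enter the loss, rather than the full vocabulary, which is the source of the memory saving.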

Empirical Analysis and Numerical Results

Benchmarking on standard multimodal reasoning suites (e.g., MathVerse, MathVista, MMMU, LogicVista) and on the Qwen3-VL-2B/4B/8B models shows that PTD-PO consistently improves over both standard RLVR and distillation baselines. For example, at the 8B scale, PTD-PO achieves 74.85 average accuracy on vision-dependent reasoning, compared with 71.25 for GRPO and 71.89 for hybrid answer-distillation, which supports the practical value of answer-free privileged hinting.

Notably, PTD-PO also exhibits improved recovery rates in hard cases, frequently converting all-fail scenarios into partial or full successes without inducing shortcut-driven answer copying. Analyses indicate that the structured hint constraints (requiring visual grounding, logic-trap avoidance, and concise reasoning cues) are crucial: removing them significantly degrades distillation effectiveness and, in some cases, harms strong models.

Ablations further show that restricting hint-based distillation to failed trajectories yields the largest gains, whereas applying it to all trajectories over-regularizes the policy and reduces the benefit. Furthermore, the Top-K JSD with a tail bucket retains full mass alignment over high-probability tokens while reducing memory usage by an order of magnitude, which makes token-level distillation scalable and deployable.

Theoretical Implications

PTD-PO formalizes a general auxiliary correction signal for RLVR-trained policies, situated at the intersection of learning with privileged information and distributional policy optimization. By leveraging answer-free yet logically aligned hints, PTD-PO aligns student policies with improved (but non-leaking) reasoning strategies, circumventing the pitfalls of shortcut entrenchment and entropy collapse. Theoretical analysis establishes that the JSD objective under asymmetric contexts mediates a bounded, symmetric signal, robustly transferring corrective guidance without destabilizing mode-seeking behavior.
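For reference, the boundedness and symmetry invoked here follow from the standard definition of the Jensen-Shannon divergence. The notation is an assumption for illustration: P stands for the hint-augmented teacher's next-token distribution and Q for the student's distribution under the answer-free context.

```latex
\[
\mathrm{JSD}(P \,\|\, Q)
  = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q)
\]
\[
\mathrm{JSD}(P \,\|\, Q) = \mathrm{JSD}(Q \,\|\, P),
\qquad 0 \le \mathrm{JSD}(P \,\|\, Q) \le \log 2
\]
```

The upper bound holds because M is at least P/2 pointwise, so log(P/M) never exceeds log 2, and likewise for Q. A directional KL(P || Q), by contrast, grows without bound when Q assigns near-zero probability to a token that P favors, which is a common situation when only one side sees the hint.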

The selective routing of distillation to failed trajectories aligns the gradient with reward-improvement, while the context mismatch is explicitly controlled through the divergence design and the role separation between privileged teacher and student.
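The selective routing can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's training loop: rewards are taken to be binary, a rollout counts as failed when its reward is below a hypothetical `threshold`, `lam` is a hypothetical weighting coefficient, and the per-rollout policy-gradient and JSD losses are passed in as precomputed numbers. Only the group-relative advantage normalization follows the standard GRPO form.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-6):
    # Standard GRPO-style normalization within one group of rollouts.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def failed_mask(rewards, threshold=0.5):
    # 1.0 for rollouts treated as failed, 0.0 otherwise (assumed rule).
    return (np.asarray(rewards, dtype=float) < threshold).astype(float)


def combined_loss(pg_losses, jsd_per_rollout, rewards, lam=0.1):
    """Policy-gradient loss plus distillation on failed rollouts only."""
    mask = failed_mask(rewards)
    pg = float(np.mean(pg_losses))
    # Average the distillation term over failed rollouts; it is zero when
    # every rollout in the group succeeded.
    distill = float((mask * np.asarray(jsd_per_rollout)).sum() / max(mask.sum(), 1.0))
    return pg + lam * distill


rewards = [1, 0, 0, 1]
print(group_relative_advantages(rewards))
print(failed_mask(rewards))
print(combined_loss(np.zeros(4), np.array([0.2, 0.4, 0.6, 0.8]), rewards, lam=1.0))
```

Note that in an all-fail group the normalized advantages are all zero, so the reward term alone carries no gradient; in this sketch the masked distillation term is what still provides a learning signal for such groups.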

Broader Impact and Future Directions

Practically, PTD-PO is demonstrated to be plug-and-play compatible with other RLVR optimizers (e.g., DAPO, GSPO), confirming that privileged distillation is a broadly applicable auxiliary that boosts post-training regardless of the underlying policy gradient variant.

One limitation is that, on very strong models, the absolute benefit is bounded by how often challenging, failed rollouts occur. As models saturate the difficulty of existing corpora, larger performance gains will require harder, more compositionally complex datasets that produce informative failures suitable for privileged intervention. Additionally, PTD-PO currently targets static multimodal reasoning; extending privileged hint construction to dynamic, agentic settings, where hints can be derived from tool outputs or environmental feedback, is a promising direction for future research on robust multimodal agents.

Conclusion

PTD-PO introduces a scalable, dense, and non-spoiling distillation protocol for post-training multimodal policy optimization, one that turns failed rollouts into a source of privileged, but not answer-leaking, supervision. By providing token-level corrective signals derived from offline-constructed, context-sensitive hints, it augments sparse-reward exploration, mitigates entropy collapse, and improves both general and vision-dependent reasoning performance. The reported results position PTD-PO as a theoretically grounded and practical framework for dense policy improvement in LVLMs that does not compromise long-horizon reasoning diversity or expose models to shortcut biases (2606.07000).