Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

Published 21 May 2026 in cs.LG and cs.AI | (2605.22731v1)

Abstract: LLM post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss. Second, OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU, despite using the teacher as its only supervision source. Third, a lightweight on-policy RL run improves GSM8K while preserving retention. These results support a state-centric view of post-training: the source and locality of training states can be as important as the form of the supervision signal.

Authors (1)

Summary

  • The paper demonstrates that state distribution is key in LLM post-training, influencing both performance improvement and retention.
  • It compares off-policy SFT with on-policy techniques like RL and OPD, highlighting distinct trade-offs in stability and capability retention.
  • Experimental findings indicate that focusing on local state supervision reduces catastrophic forgetting, suggesting novel hybrid algorithm designs.

State Distribution-Centric Post-Training of LLMs

Overview

This paper introduces a state-distribution framework for analyzing and designing post-training procedures in autoregressive LLMs, emphasizing the critical role played by the distribution of states—prompt plus generated prefix—used during optimization. The work provides a unified lens for supervised fine-tuning (SFT), reinforcement learning (RL), and on-policy distillation (OPD), arguing that the locus of supervision (off-policy vs. on-policy) is often as impactful as the loss function itself. Through systematic experiments with Qwen3-0.6B-Base on GSM8K, TruthfulQA, and MMLU, the paper establishes that state distribution is central to both model improvement and retention of prior capabilities.

The State Distribution Framework

The paper formalizes LLM post-training as the shaping of a model's induced state distribution, $d_\pi(s)$, where a "state" is the conditioning context of a prompt together with the generated prefix. The authors propose two key axes for each post-training algorithm:

  • State source: Specifies whether supervision occurs on dataset-derived (off-policy) states or learner-induced (on-policy) states.
  • Signal source: Describes whether the learning signal comes from labels, rewards, teacher predictions, or continuations.

By isolating these axes, the analysis transcends objective-centric views and exposes mechanistic explanations for phenomena such as catastrophic forgetting, teacher-student reversal, and local improvement, which are not easily accounted for by token-level losses alone.
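One way to make the two axes concrete is to write each method as an expectation over a state distribution. The notation below is a schematic for illustration, not the paper's own formulation: $d_{\mathcal{D}}$ (dataset states), $a^\star$ (the dataset's next token), $\pi_T$ (teacher), $R$ (trajectory reward), and the per-state loss $\ell$ are symbols assumed here.

```latex
\begin{align*}
\text{SFT:}\quad & \mathcal{L}_{\mathrm{SFT}}(\theta)
  = \mathbb{E}_{s \sim d_{\mathcal{D}}}\big[-\log \pi_\theta(a^\star \mid s)\big] \\
\text{RL:}\quad & J_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] \\
\text{OPD:}\quad & \mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{s \sim d_{\pi_\theta}}\big[\ell\big(\pi_T(\cdot \mid s),\, \pi_\theta(\cdot \mid s)\big)\big]
\end{align*}
```

Read this way, SFT and OPD can share the same kind of dense signal and differ only in the sampling distribution under the expectation ($d_{\mathcal{D}}$ versus $d_{\pi_\theta}$), while RL and OPD share the state source and differ in the signal.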

Algorithmic Insights

SFT (Supervised Fine-Tuning)

SFT applies dense cross-entropy supervision on fixed dataset states, making it inherently off-policy. When the dataset states are compatible with the learner's own induced states, SFT is efficient and stable. However, aggressive or narrow off-policy SFT causes the model to update in regions of the state space unvisited during inference, precipitating catastrophic forgetting and degraded performance even on the target task.

RL (Reinforcement Learning)

RL optimizes rewards on trajectories sampled from the model's own policy. As an on-policy method, RL updates only on states the current model visits, preserving locality and avoiding global drift. RL may suffer from sample inefficiency due to sparse rewards but exhibits robust retention of non-target capabilities because state drift remains bounded and recoverable.
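The paper describes its RL signal as a group-relative exact-answer reward on GSM8K rollouts. The sketch below shows one common way such a signal can be computed; the function names, the normalization by group standard deviation, and the sample answers are assumptions for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def exact_answer_reward(predicted: str, gold: str) -> float:
    """Return 1.0 when the final answer matches the reference exactly, else 0.0."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Center each reward on its group mean and scale by the group std (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts sampled from the current policy for one prompt.
gold = "42"
final_answers = ["42", "41", "42 ", "7"]
rewards = [exact_answer_reward(a, gold) for a in final_answers]
advantages = group_relative_advantages(rewards)
```

Because every rollout in the group comes from the current policy, the resulting updates touch only states the learner actually reached. A group in which all rollouts receive the same reward yields zero advantages and therefore no update.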

OPD (On-Policy Distillation)

OPD decouples supervision from state sampling: the student generates states; the teacher gives local guidance only on those states. This framework is analogous to DAgger-style imitation learning and enables students to surpass degraded teachers—a result attributed to learning from teacher repairs in locally reachable states while avoiding inheriting teacher trajectory-level errors.
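The decoupling can be sketched with a toy tabular policy. Everything here is hypothetical: the dictionary-based `policy`, the counting `teacher_continuation`, and the weight increment that stands in for a cross-entropy step. The point is only the control flow, in which the student chooses the states and the teacher supplies short local targets from them.

```python
import random


def student_rollout(policy, prompt, steps, rng):
    """Sample a prefix from the student's own policy (the on-policy state source)."""
    state = tuple(prompt)
    for _ in range(steps):
        options = policy.get(state) or {"<unk>": 1.0}
        tokens, weights = zip(*options.items())
        state = state + (rng.choices(tokens, weights=weights)[0],)
    return state


def teacher_continuation(state, length):
    """Hypothetical teacher: counts upward from the last integer token in the state."""
    last = next((int(t) for t in reversed(state) if t.isdigit()), 0)
    return [str(last + i + 1) for i in range(length)]


def opd_step(policy, prompt, rng, prefix_steps=2, cont_len=3, lr=1.0):
    """One OPD update: start from a student-generated state, learn a teacher continuation."""
    state = student_rollout(policy, prompt, prefix_steps, rng)
    pairs = []
    for token in teacher_continuation(state, cont_len):
        pairs.append((state, token))
        # Stand-in for a cross-entropy step: raise the weight of the teacher token.
        policy.setdefault(state, {})
        policy[state][token] = policy[state].get(token, 0.0) + lr
        state = state + (token,)
    return pairs


rng = random.Random(0)
policy = {("count:",): {"1": 1.0, "5": 1.0}}
pairs = opd_step(policy, ["count:"], rng, prefix_steps=1, cont_len=2)
```

Each supervised pair starts from a prefix the student sampled itself, so the teacher is never asked to demonstrate a full trajectory of its own.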

Experimental Results

  • Mild SFT improved GSM8K from $0.448$ to $0.512$ with negligible retention loss ($<1.5\%$), demonstrating that SFT is not inherently destructive.
  • Stress SFT led to substantial retention degradation (retention ratio $0.8258$, forgetting $0.0635$), reducing both general and target task performance.
  • OPD from the degraded teacher yielded GSM8K $0.466$ (vs. the teacher's $0.420$), TruthfulQA $0.275$ (vs. $0.245$), and MMLU $0.430$ (vs. $0.364$), so the student outperformed its degraded teacher on all three benchmarks.
  • On-policy RL improved GSM8K while nearly perfectly retaining baseline performance.
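The summary reports a retention ratio and a forgetting score without stating how they are computed. The sketch below assumes one plausible reading: the retention ratio as the mean of post-training to base score ratios, and forgetting as the mean non-negative drop, both taken over the retention benchmarks. The paper's exact definitions may differ, and the scores used here are made up.

```python
def retention_metrics(base: dict[str, float], post: dict[str, float]) -> tuple[float, float]:
    """Return (mean retention ratio, mean forgetting) over shared benchmarks.

    Both definitions are assumptions made for illustration.
    """
    names = sorted(base.keys() & post.keys())
    ratios = [post[n] / base[n] for n in names]
    drops = [max(0.0, base[n] - post[n]) for n in names]
    return sum(ratios) / len(names), sum(drops) / len(names)


# Hypothetical scores, not taken from the paper.
base = {"truthfulqa": 0.30, "mmlu": 0.40}
post = {"truthfulqa": 0.27, "mmlu": 0.40}
ratio, forgetting = retention_metrics(base, post)
```

Under these assumed definitions, a ratio near 1 and forgetting near 0 indicate that the retention benchmarks were largely unaffected by post-training.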

A notable finding is that maximum mean discrepancy (MMD) drift between base and post-trained rollouts is insufficient as a scalar measure; stress SFT and OPD from stress teacher exhibited similar MMD drift but vastly different retention outcomes, emphasizing that state-source locality dominates retention dynamics.
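To see why a single MMD value can hide differences, the sketch below computes a biased estimate of squared MMD with an RBF kernel in plain Python. The 2-D feature vectors are placeholders for whatever rollout features are used, and the kernel choice and `gamma` are assumptions rather than the paper's configuration.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel between two equal-length feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_k(a, b):
        return sum(rbf(u, v, gamma) for u in a for v in b) / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


# Placeholder rollout features for a base model and two post-trained models.
base = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0)]
shift_a = [(x + 0.8, y) for x, y in base]  # drift in one direction
shift_b = [(x - 0.8, y) for x, y in base]  # equal-sized drift in the opposite direction
drift_a = mmd2(base, shift_a)
drift_b = mmd2(base, shift_b)
```

Here `drift_a` and `drift_b` agree up to floating-point error even though the two shifted samples occupy different regions, which is a small-scale analogue of two runs with similar scalar drift and different retention outcomes.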

Implications and Theoretical Perspectives

The thesis that state-distribution locality is central to post-training leads to several substantial implications:

  • Algorithm Design: The separation of state and signal sources enables new hybrid algorithms, e.g., on-policy dense shaping (continuation-based OPD), which preserves both task specialization and broad retention.
  • Stability: On-policy post-training methods (RL, OPD) are less likely to induce catastrophic forgetting because updates are constrained to locally recoverable states, unlike dense off-policy SFT.
  • Teacher-Student Dynamics: Students trained with OPD can systematically improve over degraded teachers due to selective learning from teacher repairs—not full trajectory imitation.

The findings motivate richer state-distribution analyses, beyond simple drift metrics, to characterize training dynamics and to inform more principled post-training protocols.

Practical Impact and Future Directions

This state-centric framework offers tangible guidance for practitioners: aggressive specialization with SFT should be avoided where broad retention is needed, and on-policy supervision (RL, OPD with continuation) should be considered for improved stability and task transfer. The approach suggests that future developments in AI post-training may focus on adaptive state-distribution control, advanced supervision shaping (trajectory-level guidance), and new metrics for modeling recoverability and locality in state space.

Further research should scale these methods to larger models, more diverse tasks, richer drift metrics (embedding-based), and integrate reference models for more robust state supervision. Exploration of adaptive state sampling and dynamic signal provision may also yield improved algorithms for safe and effective post-training.

Conclusion

The paper presents a rigorous state-distribution view of LLM post-training, demonstrating experimentally that on-policy locality and state-source selection are primary drivers of retention, forgetting, and improvement. Token-level objectives and scalar drift metrics are incomplete without consideration of training state distributions. The results challenge objective-centric dogma and open new avenues for principled post-training methods that balance specialization and generalization (2605.22731).

Explain it Like I'm 14

What this paper is about

This paper asks a simple question with a big impact: when we “polish” an LLM after pretraining so it follows instructions or reasons better, what matters most? The authors argue it’s not just the kind of signal we use (like labels or rewards). It’s also where we apply that signal: the situations the model actually finds itself in while generating text. They call those situations states.

Their main message: post-training is about states, not just tokens. Training on the learner’s own states (on-policy) often leads to safer, more stable improvements than training only on fixed dataset states (off-policy).

The big questions in simple terms

  • When we fine-tune a model with examples (SFT), use rewards (RL), or learn from another model (distillation), why do some runs help while others cause the model to “forget” things it used to know?
  • Can a student model trained with help from a weak teacher actually end up better than that teacher?
  • Why does reinforcement learning sometimes give stable gains even with noisy or sparse feedback?
  • Is one “distance number” that measures how the model’s behavior shifted enough to explain forgetting?

How they studied it

Key idea: what is a “state”?

  • Think of writing as a step-by-step journey. At each step, the model has:
    • the original prompt (the question), plus
    • everything it has written so far (the partial answer).
  • Together, that prompt + partial answer is the state — the model’s current situation. The next word it picks is the action in that situation.
  • The set of states the model tends to visit during writing is its state distribution.
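Here is a tiny sketch of that idea in code. The helper name `enumerate_states` and the toy tokens are made up for illustration; the sketch just lists every state (prompt plus what has been written so far) together with the action taken from it.

```python
def enumerate_states(prompt_tokens, generated_tokens):
    """Yield (state, action) pairs: state = prompt + prefix, action = next token."""
    for t, action in enumerate(generated_tokens):
        yield tuple(prompt_tokens) + tuple(generated_tokens[:t]), action


prompt = ["2", "+", "3", "="]
answer = ["5", "<eos>"]
pairs = list(enumerate_states(prompt, answer))
# pairs[0] == (("2", "+", "3", "="), "5")
# pairs[1] == (("2", "+", "3", "=", "5"), "<eos>")
```

If `answer` comes from a dataset, these are off-policy states. If the model wrote `answer` itself, they are on-policy states.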

Three post‑training methods through the “state” lens

  • SFT (Supervised Fine-Tuning): The model learns from fixed example answers. These states come from a dataset, not from the model’s own writing. That’s off-policy — like practicing only the “perfect” solutions someone else wrote, not the situations you personally stumble into.
  • RL (Reinforcement Learning): The model writes its own answers, then gets a reward (good/bad) and updates. These states are the model’s own — on-policy — like practicing on your actual mistakes and successes.
  • OPD (On-Policy Distillation): The student model writes its own partial answer (student states). Then a teacher suggests what to do next in those exact states. This keeps the dense guidance of a teacher while staying on-policy with the student’s situations.

Analogy: Learning to drive

  • Off-policy (SFT): You only study ideal dashcam videos and try to imitate them.
  • On-policy (RL/OPD): You drive yourself; an instructor either gives you a score (RL) or tells you what to do in your current lane right now (OPD).

What they actually did

  • Model: Qwen3-0.6B-Base (a small LLM).
  • Target skill: solving math word problems (GSM8K).
  • “Retention” checks: making sure the model still knows general stuff (TruthfulQA and MMLU).
  • Runs they tried:
    • Mild SFT: gentle fine-tuning on GSM8K.
    • Stress SFT: very strong, repeated SFT to push the model hard.
    • OPD: the student asks a teacher for short continuations from the student’s own states.
    • RL: lightweight on-policy training with rewards based on correct final answers.
  • They also measured “state drift” (how much the model’s typical situations changed) with a statistical distance number (MMD). Think of it as checking how different the model’s new “roads” are from its old “roads.”

What they found and why it matters

  1. Gentle SFT helps without much forgetting
  • Mild SFT raised math performance and kept general knowledge nearly intact.
  • Lesson: SFT can be safe when you don’t push too hard and the dataset fits the model’s current behavior.
  2. Aggressive SFT can hurt both the target and general knowledge
  • Stress SFT caused significant “forgetting” on general tasks and even made math performance worse.
  • Why: Heavy off-policy pressure forces the model toward behaviors seen in the training examples, but doesn’t teach it to recover from its own mistakes or unusual situations.
  3. A student can beat a weak teacher with OPD
  • They first made a “degraded” teacher using stress SFT (it performed poorly).
  • Then they trained a student with OPD: the student generated its own partial answers and asked that weak teacher for local guidance.
  • Result: the student outperformed the teacher on math and on the general tests.
  • Why this is surprising and important: It shows that “copying the teacher” is not the full story. If you ask the teacher for help only in the student’s current situations, the student can avoid some of the teacher’s bad habits.
  4. On-policy RL improved math with little forgetting
  • Even with simple rewards, RL gave steady gains in math while preserving general knowledge.
  • Why: RL updates the model where it actually lives, on the states it really visits, so changes are more “local” and less destructive.
  5. A single “drift number” doesn’t explain forgetting
  • Two runs had similar drift sizes (their state distributions moved by a similar amount), but very different amounts of forgetting.
  • Takeaway: It’s not just how far you move, but where and how you apply supervision. State source and locality matter.

Why this matters going forward

  • Design principle: Pick training methods that apply supervision on the learner’s own states. This “locality” helps improve the target skill while keeping past knowledge intact.
  • Practical tip for distillation: If a teacher is flawed overall but still gives useful local advice, query it on the student’s states (OPD) and use short teacher continuations, not just one-step token matching.
  • For RL users: You don’t always need heavy, complex setups. Even lightweight on-policy RL can deliver stable improvements with low forgetting.
  • Bigger picture: When discussing post-training, don’t only name the loss (cross-entropy, KL, rewards). Always say where the supervision is applied — on dataset states, teacher states, or the student’s own states. That choice can decide whether training helps or harms.

A few plain-English definitions

  • State: The current situation during writing — the original question plus the words the model has already produced.
  • State distribution: The collection of situations the model tends to get into while writing.
  • Off-policy: Training on situations from someone else’s ideal answers, not from the model’s own writing path.
  • On-policy: Training on the model’s actual situations as it writes.
  • Distillation: Learning from a teacher model. OPD is distillation done on the student’s own states.
  • Forgetting vs. retention: Forgetting is losing old skills after new training; retention is how well those old skills are preserved.

In short: This paper shows that in post-training, where you supervise the model — the states — can be as important as what kind of supervision you use. Methods that train on the model’s own states (like RL and OPD) can improve the target skill while keeping general knowledge, and even let a student surpass a weak teacher.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, actionable list of what remains missing, uncertain, or unexplored in the paper.

  • Generalization beyond a tiny setup: validate the state-distribution thesis on larger models (e.g., 7B–70B), multiple architectures, and varied domains (coding, dialogue, safety-critical, multilingual) to assess external validity.
  • Budget-controlled comparisons: match compute, tokens, number of updates, and wall-clock across SFT, RL, and OPD to isolate the causal effect of the state source from update magnitude.
  • KL/trust-region control: hold fixed KL-to-base and/or total gradient norm across methods to test whether retention benefits stem from on-policy states rather than smaller or more conservative updates.
  • OPD ablations: systematically vary continuation length, number of teacher steps, decoding strategy (temperature, sampling vs greedy), and supervision density to explain why one-step OPD collapsed and to find minimal effective configurations.
  • When do students beat teachers?: formalize and empirically test criteria (e.g., teacher’s local conditional quality on student states vs. teacher occupancy failures) that predict teacher–student reversal under OPD.
  • Teacher quality sensitivity: map OPD outcomes across teacher quality levels (strong, mediocre, degraded), architectures, and sizes; quantify thresholds where OPD ceases to outperform the teacher.
  • Mixture state sources: explore hybrids that mix dataset states, student states, and teacher states (with schedules or curriculum), and measure effects on target gains and retention.
  • Dynamics over training: track performance, retention, and drift curves step-by-step (not only endpoints) to reveal when forgetting begins and how state distributions evolve during training.
  • Richer retention evaluation: extend beyond TruthfulQA and a partial MMLU subset to broader safety/alignment (toxicity, refusal, calibration), knowledge retrieval, and long-horizon multi-turn tasks.
  • Fine-grained forgetting: analyze category-level retention (per MMLU domain, reasoning subtype) to identify which capabilities are most vulnerable under off-policy pressure.
  • Out-of-distribution robustness: evaluate transfer to unseen math datasets (e.g., SVAMP, MAWPS, GSM-Hard), shifted prompt styles, and adversarial prompts to test stability of on-policy methods.
  • State-drift measurement: replace lexical features with hidden-state or encoder embeddings; compare MMD, Wasserstein, classifier two-sample tests; develop locality-aware drift metrics that weight states by reachability/recoverability.
  • Causal tests of “locality”: design experiments that swap only the state source while matching supervision, update scale, and KL to directly attribute retention differences to on-policy state application.
  • State coverage instrumentation: log which states receive updates (e.g., density-ratio estimation, per-prompt coverage maps) to quantify how SFT vs. OPD/RL distribute supervision across the rollout space.
  • RL method breadth: test PPO/GRPO variants, reference models, clipping, and KL penalties; measure how trust-region strength modulates retention vs. target performance under on-policy rewards.
  • Reward design sensitivity: probe sparse vs. dense/verifier rewards, reward noise, and credit assignment strategies to characterize sample efficiency and retention trade-offs for on-policy RL.
  • OPD cost profiling: quantify teacher-query cost (tokens, wall-clock) and compare cost–benefit curves against RL and SFT under equal budgets.
  • Scaling laws: study how state-source effects scale with model size, dataset size, LoRA rank, learning rate, and curriculum length; identify regimes where off-policy SFT becomes brittle.
  • Theoretical underpinnings: derive conditions where on-policy supervision guarantees local improvement with bounded retention loss; relate state-occupancy mismatch to forgetting bounds.
  • Operationalizing “locality” and “recoverability”: define measurable proxies (e.g., expected hitting time back to high-density regions, policy Jacobian norms on visited states) that predict forgetting risk.
  • Safety and alignment impacts: measure how state source affects refusal behavior, hallucination, bias, and calibration; test safety retention under targeted specialization pressure.
  • Cross-domain OPD: evaluate whether teacher guidance from a different domain or language remains beneficial when applied on student states; identify failure modes of domain-mismatched teachers.
  • Continual/multi-task post-training: test whether on-policy state sourcing reduces interference when adding tasks sequentially or training on multi-objective mixes.
  • Robustness to prompt distribution shift: train on one prompt style and evaluate on others to determine if on-policy methods better adapt to deployment-time prompt shifts.
  • Data leakage and overfitting checks: rule out leakage on GSM8K; assess generalization vs. memorization under stressed SFT and OPD.
  • Missing implementation details: provide full mild-SFT hyperparameters, seeds, prompt lists, and OPD/RL training settings to enable faithful reproduction and controlled follow-ups.

Practical Applications

Immediate Applications

  • Continuation-based on-policy distillation (OPD) for targeted capability gains.
    • Description: Replace offline distillation or aggressive SFT with student-driven rollouts and short teacher continuations to repair trajectories locally, improving the target task while preserving general capabilities.
    • Sectors: software, edtech, customer support, coding assistants.
    • Tools/Workflows: “DAgger-style” OPD trainer; student samples prefixes, queries teacher for short continuations, learns with cross-entropy; LoRA adapters for efficient updates; rollout and logging infrastructure.
    • Assumptions/Dependencies: Access to a reasonably competent teacher (can be a degraded SFT model); on-policy sampling capability; careful use of continuation-based supervision (one-step OPD may collapse per findings).
  • Lightweight on-policy RL with verifiable rewards.
    • Description: Use exact-answer or programmatic/verifiable rewards (e.g., math, code tests) to improve performance with minimal forgetting, leveraging the locality of on-policy updates.
    • Sectors: education (math tutors), software (code generation with unit tests), enterprise QA (rule-checkable outputs).
    • Tools/Workflows: GRPO/PPO-lite trainers; verifiers (unit tests, regex/compilers, checkers); KL/reference constraints; LoRA for low-cost adaptation.
    • Assumptions/Dependencies: Availability of reliable automatic rewards; sampling throughput; careful KL/clip settings to avoid instability.
  • Drift- and pressure-aware SFT (avoid “stress SFT”).
    • Description: Implement SFT “pressure budgets” (epochs, LR, LoRA rank/alpha) with early stopping and drift/retention monitoring to prevent catastrophic forgetting while specializing.
    • Sectors: software, enterprise model ops, safety/alignment teams.
    • Tools/Workflows: Retention dashboards (TruthfulQA, MMLU); state-drift monitors (MMD plus complementary metrics); auto-early-stop triggers based on retention.
    • Assumptions/Dependencies: Agreement on retention metrics; acceptance that scalar drift alone is insufficient, so it must be paired with state-source awareness.
  • Data curation that matches learner-induced states.
    • Description: Filter/weight demonstrations whose prefixes resemble current model rollouts to reduce off-policy mismatch and exposure bias during SFT.
    • Sectors: data engineering for LLMs, edtech dataset providers.
    • Tools/Workflows: “State similarity” scoring between dataset prefixes and recent student rollouts; curriculum that refreshes with current state distribution.
    • Assumptions/Dependencies: A fast way to embed/compare states (lexical or model embeddings); continuous logging of learner rollouts.
  • Teacher–student reversal for model compression.
    • Description: Distill from an inexpensive, even degraded, fine-tuned teacher using OPD so the student can surpass the teacher by avoiding the teacher’s poor trajectory habits.
    • Sectors: mobile/edge, healthcare devices, embedded assistants.
    • Tools/Workflows: Small student model trained on-policy with teacher continuations; deployment on-device with improved retention vs offline KD.
    • Assumptions/Dependencies: Teacher provides locally useful guidance; compute budget for on-policy rollouts; privacy controls for logged states.
  • On-policy safety patching.
    • Description: Sample the student’s risky/harmful states and query a safety-tuned teacher (or rules) for local repair, reducing harmful completions without over-regularizing benign behavior.
    • Sectors: safety/alignment, content moderation, regulated industries.
    • Tools/Workflows: Safety teacher/policy; red-team rollout sampler; OPD updates on flagged states; post-training guardrail checks.
    • Assumptions/Dependencies: Access to a reliable safety teacher or rule set; robust detection of risky states; audit trails.
  • Hybrid post-training schedules (mild SFT → OPD/RL).
    • Description: Warm start with gentle SFT to get capability, then switch to on-policy OPD/RL to push performance while preserving retention.
    • Sectors: general LLM productization, internal enterprise assistants.
    • Tools/Workflows: Two-phase pipeline with budgeted SFT followed by OPD/RL; automated phase-switch criteria using retention and drift signals.
    • Assumptions/Dependencies: Monitoring stack; organizational willingness to split training phases.
  • Product “bug triage” via on-policy rollouts.
    • Description: Log real user prompts and model prefixes that lead to failures, then collect human/teacher corrections on those exact states for OPD updates, turning production failures into targeted fixes.
    • Sectors: SaaS chatbots, customer support automation, developer copilots.
    • Tools/Workflows: Telemetry for failure state capture; lightweight human-in-the-loop labeling UI; nightly OPD fine-tunes.
    • Assumptions/Dependencies: Data privacy/consent; selection of safe and representative failure states; cost control for teacher/human supervision.
  • Tool-use and agent training with OPD.
    • Description: Train tool-calling or multi-step agents by supervising teacher continuations from the agent’s own partial trajectories to correct tool selection and argument formatting.
    • Sectors: software automation, RAG/agents, operations.
    • Tools/Workflows: OPD with short teacher continuations including tool calls; verifiers for tool outputs; curriculum over tool catalogs.
    • Assumptions/Dependencies: Tooling APIs and verifiers; ability to capture intermediate agent states; safe tool sandboxing.
  • Robotics/IL teams reinforce on-policy expert querying.
    • Description: Re-emphasize DAgger-style data collection: query experts on robot-induced states instead of offline cloning, aligning with the paper’s state-centric view.
    • Sectors: robotics, autonomous systems.
    • Tools/Workflows: Learner-driven data aggregation; local corrections on encountered states; safety cages.
    • Assumptions/Dependencies: Expert availability; safe rollouts; sim2real transfer.
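The “pressure budget” and retention-based early-stop ideas from the list above can be sketched as a small control loop. The callbacks `train_step` and `eval_retention`, the `min_ratio` threshold, and the evaluation cadence are all hypothetical; a real pipeline would plug in its own trainer and retention benchmark.

```python
def sft_with_retention_gate(train_step, eval_retention, base_score,
                            max_steps, min_ratio=0.95, eval_every=10):
    """Run SFT steps until the budget ends or retention drops below min_ratio * base_score."""
    for step in range(1, max_steps + 1):
        train_step(step)
        if step % eval_every == 0 and eval_retention(step) < min_ratio * base_score:
            return step, "stopped_on_retention"
    return max_steps, "budget_exhausted"


# Toy stand-ins: training is a no-op that logs the step, retention decays linearly.
steps_run = []
result = sft_with_retention_gate(
    train_step=steps_run.append,
    eval_retention=lambda step: 0.40 - 0.0015 * step,
    base_score=0.40,
    max_steps=100,
)
```

In this toy run the gate fires at the second retention check, well before the 100-step budget is spent. The same hook could act as a phase-switch criterion for a hybrid schedule, handing over to OPD or RL instead of stopping.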

Long-Term Applications

  • State-distribution-aware auto-tuners.
    • Description: Adaptive controllers that decide when to use SFT vs OPD vs RL and how intensely to apply them, based on live estimates of state compatibility, retention, and performance.
    • Sectors: platform MLOps, foundation model providers.
    • Tools/Workflows: Closed-loop schedulers; online state-distribution analytics; multi-objective optimization.
    • Assumptions/Dependencies: Reliable online metrics; policy for automatic switching; robust rollback.
  • Rich, standardized state metrics and audits.
    • Description: Beyond lexical MMD, develop embedding- or hidden-state-based drift and “local recoverability” measures; mandate reporting of state-source choices in model cards and evaluations.
    • Sectors: policy/regulation, standards bodies, academia.
    • Tools/Workflows: Open benchmarks and libraries for state drift; audit protocols; “state-source disclosure” sections.
    • Assumptions/Dependencies: Community consensus; privacy-preserving logging; regulator adoption.
  • General verifiers and weak-signal scaffolds for RL.
    • Description: Extend on-policy RL to domains without trivial rewards by building semi-automated verifiers (consistency checks, simulators, self-verification) to supply stable feedback.
    • Sectors: healthcare (clinical reasoning simulators), finance (compliance rule checks), legal (citation correctness).
    • Tools/Workflows: Domain-specific verifiers; hybrid preference + verifiable rewards; sandboxed evaluation environments.
    • Assumptions/Dependencies: Domain expertise; liability and validation; simulator fidelity.
  • On-device continual OPD personalization.
    • Description: Privacy-preserving, incremental OPD that learns from a user’s own interaction states and a local/remote teacher to personalize without catastrophic forgetting.
    • Sectors: mobile, productivity, assistive tech.
    • Tools/Workflows: Federated/on-device adapters; episodic memory to retain capabilities; teacher-as-a-service with differential privacy.
    • Assumptions/Dependencies: Efficient on-device training; DP/privacy guarantees; intermittent connectivity.
  • Teacher-as-a-Service for on-policy continuation.
    • Description: APIs that accept student prefixes and return short continuations optimized for OPD, enabling smaller labs/apps to benefit from large teachers without copying full trajectories.
    • Sectors: AI platforms, startups.
    • Tools/Workflows: Low-latency continuation API; cost-aware sampling (top-k/temp); caching across similar states.
    • Assumptions/Dependencies: IP/safety controls; rate limits; economic viability.
  • Cross-modal state-centric post-training.
    • Description: Apply the state-source framework to VLMs, speech models, and code agents, using on-policy states (e.g., visual context plus partial text) to reduce forgetting while specializing.
    • Sectors: multimodal assistants, AR/VR, autonomous driving.
    • Tools/Workflows: State encoders that span modalities; continuation-based supervision with multimodal teachers.
    • Assumptions/Dependencies: Unified state representations; multimodal teacher availability.
  • Safety red-teaming loops at scale.
    • Description: Continuous on-policy red-team generation of adversarial states and OPD/RL patching with safety teachers to reduce harmful behaviors without broad performance regression.
    • Sectors: safety/alignment, policy compliance.
    • Tools/Workflows: Adversarial state generators; prioritized OPD queues; post-patch monitoring.
    • Assumptions/Dependencies: High-quality safety objectives; governance for trade-offs; evaluation suites.
  • Education: personalized tutors with on-policy RL/OPD.
    • Description: Tutors that adapt to each learner’s mistakes (their states), using verifiable problem checkers and local corrections to improve reasoning while retaining breadth.
    • Sectors: edtech.
    • Tools/Workflows: Problem verifiers; teacher models specialized in pedagogy; per-learner adapter checkpoints.
    • Assumptions/Dependencies: Robust verifiers across curricula; safeguarding; efficacy studies.
  • Domain adaptation for regulated sectors with retention guarantees.
    • Description: Use on-policy OPD/RL to adapt general models to medical/legal subdomains while preserving general knowledge and safety, audited via standardized retention suites.
    • Sectors: healthcare, finance, legal.
    • Tools/Workflows: Domain teachers; retention compliance gates; documentation of state-source choices.
    • Assumptions/Dependencies: Regulatory approval; validated domain datasets; auditability.
  • Marketplace for on-policy data collection.
    • Description: Economies around collecting and curating learner-induced states plus human/teacher corrections, producing higher-yield data than static demonstrations.
    • Sectors: data providers, AI tooling.
    • Tools/Workflows: Secure state logging pipelines; annotator UIs tailored to partial prefixes; quality scoring by local recoverability.
    • Assumptions/Dependencies: Privacy and consent; pricing models; quality control.

Notes across applications

  • The paper’s key dependency is the availability and quality of the teacher signal when using OPD; continuation-based supervision is critical for reasoning tasks, whereas one-step matching may fail.
  • Results are demonstrated on a small model (Qwen3-0.6B) and a limited setup; scaling to larger models/tasks requires validation.
  • Scalar drift (e.g., MMD) is informative but insufficient alone; pair drift monitoring with retention and an explicit accounting of the state source (who generates the prefixes) to assess risk of forgetting.

Glossary

  • Autoregressive policy: A model that generates tokens sequentially, with each prediction conditioned on the previous context. "For an autoregressive policy, a state is a prompt plus generated prefix."
  • Catastrophic forgetting: Loss of previously learned capabilities when a model is specialized aggressively on new data. "yet it can cause catastrophic forgetting or brittle behavior under aggressive specialization."
  • DAgger: An imitation-learning algorithm that queries an expert on learner-induced states to mitigate compounding errors. "This is analogous to DAgger-style learning: the learner controls the state distribution, while an expert-like source provides local repair signals [22]."
  • Direct Preference Optimization (DPO): A preference-learning method that removes the online RL loop by expressing preference optimization as a supervised objective. "Preference-optimization methods such as DPO remove the explicit online RL loop and express preference learning as a supervised objective [20];"
  • Exposure bias: A mismatch where models are trained on gold prefixes but tested on their own generated prefixes, causing errors to compound. "This is the familiar exposure-bias problem recast as a state-distribution mismatch."
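  • Forward KL: The Kullback–Leibler divergence KL(p||q), with p the target (e.g., teacher) distribution and q the model; it penalizes the model for missing mass where the target has it, and so often encourages coverage of the target distribution. "forward KL, reverse KL, or related objective-level variants."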
  • Group-relative exact-answer reward: A reward signal that scores outputs by comparing them within a group based on exact-answer correctness. "RL uses group-relative exact-answer reward on GSM8K rollouts."
  • Knowledge distillation: Transferring behavior from a teacher model to a student via soft targets or generated supervision. "distillation transfers behavior from a teacher model to a student."
  • LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique that injects trainable low-rank adapters into a pretrained model. "All experiments use Qwen3-0.6B-Base with LoRA adapters [12] on a single RTX 3090 24GB GPU."
  • Maximum mean discrepancy (MMD): A kernel-based statistical distance used to compare distributions, here for state drift measurement. "report maximum mean discrepancy (MMD) with an RBF kernel [9]."
  • Off-policy: Using states or trajectories not induced by the current learner’s policy during training. "The supervision is dense, but the states are off-policy."
  • On-policy: Using states or trajectories sampled from the current learner’s policy during training, emphasizing local updates. "The on-policy property gives RL a different failure mode from SFT."
  • On-Policy Distillation (OPD): A method where the student samples states and the teacher provides supervision on those states. "In OPD, the student samples states and the teacher provides supervision:"
  • Policy gradient: A class of reinforcement learning methods that optimize expected reward by estimating gradients of the policy. "RL is reward maximization with policy-gradient-style updates;"
  • Proximal Policy Optimization (PPO): A policy-gradient algorithm that keeps each update close to the previous policy for stability, typically through a clipped surrogate objective. "Policy-gradient post-training commonly builds on PPO-style conservative policy optimization [24]"
  • RBF kernel: A radial basis function kernel used in kernel methods to measure similarity, here within MMD. "report maximum mean discrepancy (MMD) with an RBF kernel [9]."
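  • Reverse KL: The Kullback–Leibler divergence KL(q||p), with q the model and p the target (e.g., teacher) distribution; it penalizes the model for placing mass where the target has little, and so often encourages mode-seeking behavior. "forward KL, reverse KL, or related objective-level variants."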
  • Rollout: A sampled sequence of states and actions (tokens) generated by a policy when interacting with prompts. "we approximate it with rollouts and sampled prefixes."
  • Sliced Wasserstein distance: A metric for comparing probability distributions by projecting them onto 1D slices and aggregating Wasserstein distances. "We also compute centroid distance, sliced Wasserstein distance [19], and lexical Jaccard distance"
  • State visitation distribution: The distribution over states that a policy encounters when generating trajectories. "Let $d^{\pi}(s)$ denote the state visitation distribution induced by policy $\pi$ on a prompt distribution."
  • Supervised fine-tuning (SFT): Training a model to imitate target outputs on fixed dataset states using supervised losses. "SFT minimizes token loss on dataset states:"
  • Trajectory: A sequence of states and actions (tokens) that represent a complete generated answer or interaction. "A generated answer is a trajectory through states."
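
The forward and reverse KL entries above can be made concrete with a small numerical sketch. It compares a broad model and a peaked model against a bimodal target over three outcomes; all distributions and names are invented for illustration and do not come from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute zero; q_i must be positive wherever p_i is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical bimodal target (e.g., a teacher's next-token distribution).
target = [0.49, 0.02, 0.49]
# Two hypothetical models: one spreads mass broadly, one commits to a single mode.
broad_model = [0.33, 0.34, 0.33]
peaked_model = [0.96, 0.02, 0.02]

# Forward KL: KL(target || model). Missing a target mode is penalized heavily.
forward_broad = kl_divergence(target, broad_model)
forward_peaked = kl_divergence(target, peaked_model)

# Reverse KL: KL(model || target). Mass placed where the target is small is penalized.
reverse_broad = kl_divergence(broad_model, target)
reverse_peaked = kl_divergence(peaked_model, target)
```

In this toy case forward KL ranks the broad model as closer to the target, while reverse KL ranks the peaked model as closer, matching the coverage versus mode-seeking description in the glossary.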
