
StableOPD: Stabilized On-Policy Distillation

Updated 4 July 2026
  • StableOPD is a stabilized formulation of on-policy distillation that enhances large language model post-training by mitigating high variance and instability.
  • It integrates techniques like detached control-variate baselines, rollout mixture, and bounded rewards to address statistical, behavioral, and systems-induced instabilities.
  • Empirical results show significant improvements in accuracy and compute efficiency over vanilla OPD, underscoring its importance in LLM optimization.

StableOPD denotes stabilized formulations of On-Policy Distillation (OPD) in LLM post-training. In OPD, a student policy $\pi_\theta$ is trained on trajectories sampled from its own distribution while receiving dense token-level supervision from a stronger teacher $\pi_t$. Recent work uses the label for a family of stabilization strategies rather than a single canonical loss: detached control-variate baselines, reference-anchored regularization, rollout-mixture distillation, freshness-aware weighting for asynchronous pipelines, bounded reward transformations, and privilege-aware token routing have all been proposed to preserve OPD’s on-policy advantages while reducing variance, preventing collapse, and improving compute efficiency (Oh et al., 8 May 2026, Luo et al., 9 Apr 2026, Chen et al., 18 May 2026, Zhao et al., 15 Jun 2026, Yu et al., 29 Jun 2026).

1. Core formulation within on-policy distillation

OPD trains a student policy on the distribution induced by its own generations. For a prompt $x\sim\mathcal D$, a student response $y=(y_1,\dots,y_{|y|})\sim \pi_\theta(\cdot\mid x)$, and token context $c_t=(x,y_{<t})$, the reverse-KL objective can be written as

$$\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\Vert\, \pi_t\right) = \mathbb{E}_{x\sim \mathcal{D},\, y\sim \pi_\theta(\cdot\mid x)} \left[\log \frac{\pi_\theta(y\mid x)}{\pi_t(y\mid x)}\right],$$

or, equivalently, as the token-level maximization objective

$$\mathcal{J}_{\text{OPD}}(\theta) = \mathbb{E}_{x\sim\mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)} \left[\sum_{t=1}^{|y|} \log \frac{\pi_t(y_t \mid c_t)}{\pi_\theta(y_t \mid c_t)}\right].$$

With stop-gradient on the reward, the per-token signal is

$$r_t(c_t,y_t)=\log \pi_t(y_t \mid c_t)-\log \pi_\theta(y_t \mid c_t),$$

and the corresponding REINFORCE-style gradient is

$$\nabla_\theta \mathcal{J}_\text{OPD}(\theta) = \mathbb{E}\left[\sum_{t=1}^{|y|} r_t(c_t,y_t)\,\nabla_\theta \log \pi_\theta(y_t \mid c_t)\right].$$

This single-sample Monte Carlo estimator is the form used in practice (Oh et al., 8 May 2026).
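A minimal sketch of the single-sample estimator at one token context, assuming a softmax student over a toy four-token vocabulary. The logits, function names, and the choice to differentiate with respect to the student's logits are illustrative assumptions, not details from the cited papers. For a softmax policy, the gradient of $\log \pi_\theta(y)$ with respect to the logits is `one_hot(y) - pi`.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def opd_reward(teacher_probs, student_probs, token):
    """Detached per-token reward r_t = log pi_t(y_t|c_t) - log pi_theta(y_t|c_t)."""
    return math.log(teacher_probs[token]) - math.log(student_probs[token])

def reinforce_logit_grad(student_probs, token, reward):
    """Single-sample gradient r_t * d/dlogits log pi_theta(token).

    The reward is treated as a constant (stop-gradient).
    """
    return [reward * ((1.0 if v == token else 0.0) - p)
            for v, p in enumerate(student_probs)]

# Toy logits (assumed values, for illustration only).
student = softmax([2.0, 1.0, 0.0, -1.0])
teacher = softmax([0.5, 2.5, 0.0, -3.0])

rng = random.Random(0)
token = rng.choices(range(len(student)), weights=student)[0]  # on-policy sample
r = opd_reward(teacher, student, token)
grad = reinforce_logit_grad(student, token, r)
```

The reward is positive for tokens the teacher prefers more than the student does and negative otherwise, so the update raises or lowers the sampled token's logit accordingly.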

OPD became prominent because it provides dense, token-level training signals rather than sparse terminal rewards, and because it mitigates exposure bias by training on the student’s own rollout distribution. In reasoning domains, this makes it faster and more compute-efficient than RL with verifiable rewards for long chain-of-thought reasoning while achieving similar accuracy in practice (Oh et al., 8 May 2026). A separate formulation uses a GRPO-style clipped objective with the same token-level reverse-KL advantage, replacing sequence-level advantages by $A(s,y)=\log \pi_T(y\mid s)-\log \pi_\theta(y\mid s)$ (Luo et al., 9 Apr 2026).

2. Sources of instability

The central instability in vanilla OPD arises from the single-sample reward estimator. Because the log-ratio reward is unbounded, the estimator can exhibit very high variance, especially when the teacher assigns very low probability to student-sampled tokens. Both variance-reduction and reward-diagnosis papers report a pronounced heavy negative tail in the reward distribution, noisy gradients, loss volatility, and unstable generation dynamics; early positions are especially problematic because large prefix updates alter subsequent rollout distributions (Oh et al., 8 May 2026, Zhao et al., 15 Jun 2026).

A second failure mode is behavioral rather than purely statistical. One study identifies abrupt length inflation in student rollouts, followed by truncation collapse and repetition saturation. Training data then become dominated by trajectories that hit the generation budget rather than terminate with EOS, and repetitive suffix tokens receive disproportionately large reverse-KL advantages. The paper formalizes this with training-time diagnostics such as truncation rate and a compression-based repetition rate, and attributes the pathology to a self-reinforcing interaction between student-induced state visitation and the OPD objective (Luo et al., 9 Apr 2026).
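The two diagnostics named above can be sketched as follows. The exact definitions in the cited paper are not given here, so both functions are assumptions: truncation is taken to mean "reached the generation budget without emitting EOS", and repetition is approximated by how well the decoded text compresses under zlib.

```python
import zlib

def truncation_rate(rollouts, max_len, eos_id):
    """Fraction of rollouts that reach the generation budget without emitting EOS."""
    if not rollouts:
        return 0.0
    truncated = sum(1 for toks in rollouts
                    if len(toks) >= max_len and eos_id not in toks)
    return truncated / len(rollouts)

def repetition_rate(text):
    """Compression-based repetition proxy: 1 - compressed_size / raw_size.

    Highly repetitive text compresses well, so the score approaches 1.
    Short or varied text scores near 0 (clamped at 0).
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    compressed = zlib.compress(raw, 9)
    return max(0.0, 1.0 - len(compressed) / len(raw))

# Token-id rollouts; EOS id 2 and a budget of 4 tokens are assumed values.
rollouts = [[5, 9, 2], [5, 5, 5, 5], [7, 7, 7, 7]]
rate = truncation_rate(rollouts, max_len=4, eos_id=2)
```

Tracking both values per training step gives an early signal of length inflation before validation accuracy degrades.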

A third instability is systems-induced staleness. Under asynchronous execution, the learner optimizes on buffered samples produced by older student policies and graded under outdated teacher contexts. This discrepancy is decomposed into rollout drift and supervision drift, both of which worsen with lag and interaction horizon (Chen et al., 18 May 2026).

A fourth failure mode appears when distillation uses privileged inputs. DOPD identifies “privilege illusion”: the apparent teacher-student gap may conflate a transferable capability gap with an information asymmetry gap that the deployment-time student cannot reproduce. Uniform token-level distillation in that setting can trigger rapid entropy collapse, reduced exploration, and poor late-stage performance (Yu et al., 29 Jun 2026).

3. Control-variate stabilization: vOPD and exact value baselines

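A major stabilization line recasts OPD as policy-gradient reinforcement learning and introduces a control-variate baseline. For any action-independent baseline $b(c_t)$, the advantage becomes

$$A_t(c_t,y_t)=r_t(c_t,y_t)-b(c_t),$$

and the gradient remains unbiased because

$$\mathbb{E}_{y_t\sim\pi_\theta(\cdot\mid c_t)}\left[b(c_t)\,\nabla_\theta\log\pi_\theta(y_t\mid c_t)\right]=b(c_t)\,\nabla_\theta\sum_{y_t}\pi_\theta(y_t\mid c_t)=0.$$

This requires the baseline to be detached, i.e., used with stop-gradient (Oh et al., 8 May 2026).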
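As a quick numerical check of the unbiasedness argument, the sketch below computes the expected baseline term exactly over a toy vocabulary. The softmax parameterization, the logits, and the baseline value are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_baseline_grad(student_probs, baseline):
    """E_{y~pi}[ b * d/dlogits log pi(y) ], summed exactly over the vocabulary."""
    n = len(student_probs)
    grad = [0.0] * n
    for y, p_y in enumerate(student_probs):
        for v, p_v in enumerate(student_probs):
            score = (1.0 if v == y else 0.0) - p_v  # d/dlogit_v log pi(y)
            grad[v] += p_y * baseline * score
    return grad

student = softmax([1.2, -0.3, 0.7, 2.1])
bias = expected_baseline_grad(student, baseline=-3.5)  # every entry is ~0
```

Any constant that does not depend on the sampled token gives the same result, which is why the baseline can be chosen purely for variance reduction.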

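The distinctive result of vOPD is that OPD’s canonical value baseline has a closed form:

$$V(c_t)=\mathbb{E}_{y_t\sim\pi_\theta(\cdot\mid c_t)}\left[r_t(c_t,y_t)\right]=-\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta(\cdot\mid c_t)\,\Vert\,\pi_t(\cdot\mid c_t)\right).$$

Thus the per-token reverse KL, already available from the forward pass, acts as an exact value function. The resulting vOPD advantage is

$$A_t^{\text{vOPD}}(c_t,y_t)=r_t(c_t,y_t)+\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta(\cdot\mid c_t)\,\Vert\,\pi_t(\cdot\mid c_t)\right),$$

with the KL term detached. This preserves the lightweight backward pass of vanilla OPD: gradients still flow only through $\log\pi_\theta(y_t\mid c_t)$ for the sampled token (Oh et al., 8 May 2026).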
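The sketch below illustrates the closed-form baseline at a single context, assuming the value is the expected per-token reward, i.e. the negative reverse KL. It checks that the resulting advantage has zero mean under the student and that its second moment drops by exactly the squared KL. The logits are toy values, and the scalar second moment is only a proxy for gradient variance.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs):
    """KL(pi_theta || pi_t) at one context."""
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(student_probs, teacher_probs))

def second_moment(values, probs):
    """E_{y~probs}[value(y)^2]."""
    return sum(p * v * v for v, p in zip(values, probs))

# Strong disagreement on token 0 (assumed toy logits).
student = softmax([3.0, 0.0, 0.0, -1.0])
teacher = softmax([-2.0, 3.0, 0.0, -1.0])

kl = reverse_kl(student, teacher)
rewards = [math.log(q) - math.log(p) for p, q in zip(student, teacher)]
advantages = [r + kl for r in rewards]  # r_t - V(c_t), with V(c_t) = -KL

mean_reward = sum(p * r for p, r in zip(student, rewards))   # equals -kl
mean_adv = sum(p * a for p, a in zip(student, advantages))   # equals 0
```

Because the mean reward equals the negative KL, centering removes a squared-KL term from the second moment, which is largest where student and teacher disagree most.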

The variance-reduction mechanism is structurally targeted. When student and teacher strongly disagree, the per-token reverse KL is large and positive, shifting the heavy negative tail of $r_t$ toward zero. The paper gives a simplified variance-reduction estimate that grows with the per-token mismatch, so the largest benefit appears precisely at high-mismatch contexts (Oh et al., 8 May 2026).

This formulation differs sharply from full-vocabulary reverse-KL losses. Full-vocabulary losses have zero estimator variance at a fixed context but require backward propagation through all vocabulary logits; top-$k$ loss variants reduce cost but bias the objective because mass outside the support is omitted and probabilities are renormalized. vOPD instead uses the KL only as a detached baseline. Even a top-$k$ approximation of the baseline remains unbiased because the baseline is still action-independent (Oh et al., 8 May 2026).
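To illustrate why a truncated baseline does not bias the gradient, the sketch below compares the exact expected logit gradient under three baselines: none, the full reverse KL, and a top-k partial sum. The top-k definition used here (an unnormalized partial KL over the student's k most probable tokens) is an assumption made for this example, not necessarily the form used in the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_grad(student_probs, teacher_probs, baseline):
    """Exact E_{y~pi_theta}[(r(y) - b) * d/dlogits log pi_theta(y)]."""
    n = len(student_probs)
    grad = [0.0] * n
    for y in range(n):
        adv = math.log(teacher_probs[y]) - math.log(student_probs[y]) - baseline
        for v in range(n):
            score = (1.0 if v == y else 0.0) - student_probs[v]
            grad[v] += student_probs[y] * adv * score
    return grad

def topk_kl_baseline(student_probs, teacher_probs, k):
    """Assumed top-k baseline: negative partial KL over the student's top-k tokens."""
    top = sorted(range(len(student_probs)),
                 key=lambda v: student_probs[v], reverse=True)[:k]
    partial_kl = sum(student_probs[v]
                     * (math.log(student_probs[v]) - math.log(teacher_probs[v]))
                     for v in top)
    return -partial_kl

student = softmax([2.0, 0.5, 0.0, -1.0, -2.0])
teacher = softmax([0.0, 2.0, 0.5, -1.0, -3.0])

g_none = expected_grad(student, teacher, baseline=0.0)
g_full = expected_grad(student, teacher, topk_kl_baseline(student, teacher, k=5))
g_topk = expected_grad(student, teacher, topk_kl_baseline(student, teacher, k=2))
```

All three expected gradients coincide; only the variance of the single-sample estimate differs between them.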

4. Reference anchoring, rollout mixture, and bounded rewards

Another line of work uses “StableOPD” for a specific stabilized objective that combines a fixed-reference divergence constraint with rollout mixture distillation. The reference is a fixed policy, typically the initial student checkpoint, and the mixture draws on a set of complete, high-quality chains of thought (golden data). The mixture objective applies the distillation loss to both on-policy student rollouts and these golden trajectories, and the full stabilized objective adds a KL penalty toward the reference policy.

The KL term acts as a trust-region anchor, while the golden-data mixture prevents truncated and repetitive on-policy trajectories from monopolizing the update (Luo et al., 9 Apr 2026).
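A rough sketch of how such an objective might be assembled from its parts. The exact loss, the mixing rule, and the coefficients are not specified here, so `mix_weight`, `kl_coef`, the direction of the KL term, and the single-distribution stand-in for the policy are all hypothetical choices made for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0.0)

def stabilized_loss(onpolicy_loss, golden_loss, student_probs, ref_probs,
                    mix_weight=0.25, kl_coef=0.1):
    """Hypothetical combination: rollout mixture plus a reference-KL anchor.

    onpolicy_loss / golden_loss are mean distillation losses on student
    rollouts and on golden trajectories. mix_weight and kl_coef are assumed
    hyperparameters, not values from the cited paper.
    """
    mixture = (1.0 - mix_weight) * onpolicy_loss + mix_weight * golden_loss
    anchor = kl_divergence(student_probs, ref_probs)  # drift from the reference
    return mixture + kl_coef * anchor

# Student identical to the reference: the anchor term vanishes.
anchored = stabilized_loss(2.0, 1.0, [0.5, 0.5], [0.5, 0.5])
# Student drifted away from the reference: the anchor adds a penalty.
drifted = stabilized_loss(2.0, 1.0, [0.9, 0.1], [0.5, 0.5])
```

The mixture keeps well-formed trajectories in every batch, while the anchor grows as the student moves away from its starting checkpoint.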

This formulation directly targets the length-inflation pathology. The paper reports that under vanilla OPD, teacher and student log-probabilities become much less negative around inflation onset, the teacher log-probability increases more than the student’s, and the average reverse-KL advantage spikes. Under the stabilized objective, rollout lengths remain controlled, truncation rates stay moderate, repetition rates remain near zero, and validation accuracy improves steadily (Luo et al., 9 Apr 2026).

A distinct but related approach replaces the unbounded log-ratio reward itself. PowerOPD introduces a Box-Cox family

$$\phi_\lambda(u)=\frac{u^\lambda-1}{\lambda},\qquad \lambda>0,$$

and defines the bounded, sign-consistent token reward

$$r_t^{(\lambda)}(c_t,y_t)=\phi_\lambda\!\left(\pi_t(y_t\mid c_t)\right)-\phi_\lambda\!\left(\pi_\theta(y_t\mid c_t)\right)=\frac{\pi_t(y_t\mid c_t)^\lambda-\pi_\theta(y_t\mid c_t)^\lambda}{\lambda}.$$

Since $u^\lambda\in(0,1]$ for $u\in(0,1]$ and $\lambda>0$, the reward is bounded in $(-1/\lambda,\,1/\lambda)$, and as $\lambda\to 0$ it recovers the log-ratio limit. The practical effect is to cap reward magnitude before any post-hoc scaling and suppress the extreme early-token updates characteristic of vanilla OPD (Zhao et al., 15 Jun 2026).
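A small sketch of a bounded reward built from the Box-Cox transform. One reading consistent with the description above is a difference of Box-Cox-transformed teacher and student token probabilities; that specific form, and the function names, are assumptions here rather than the paper's confirmed definition.

```python
import math

def box_cox(u, lam):
    """Box-Cox transform (u**lam - 1) / lam, with the log limit at lam == 0."""
    if lam == 0.0:
        return math.log(u)
    return (u ** lam - 1.0) / lam

def bounded_reward(teacher_prob, student_prob, lam):
    """Assumed form: difference of Box-Cox-transformed token probabilities."""
    return box_cox(teacher_prob, lam) - box_cox(student_prob, lam)

# A token the student likes but the teacher considers nearly impossible.
teacher_p, student_p = 1e-12, 0.9

log_ratio = math.log(teacher_p) - math.log(student_p)     # vanilla reward, about -27.5
capped = bounded_reward(teacher_p, student_p, lam=0.5)     # stays inside (-2, 2)
near_log = bounded_reward(teacher_p, student_p, lam=1e-6)  # close to log_ratio
```

With this form the sign of the reward always matches the sign of the log-ratio, while its magnitude cannot exceed `1 / lam`.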

These two stabilization families are complementary in emphasis. Reference anchoring and rollout mixture constrain the student’s behavioral drift and data distribution, whereas bounded rewards reshape the estimator itself. The literature suggests that both lines preserve OPD’s on-policy structure while attacking different parts of the same instability loop.

5. Asynchrony, freshness, and privilege-aware routing

For long-horizon agents, stabilization must also address stale samples. f-OPD formalizes the discrepancy between asynchronous and ideal synchronous OPD, then defines per-sample diagnostics for rollout drift and supervision drift. Together with the sample's lag, these diagnostics yield a surrogate discrepancy and a per-sample freshness score. The final objective weights samples by freshness, adds a rollout-anchored regularizer, and triggers buffer refresh when mean freshness or alignment coverage deteriorates (Chen et al., 18 May 2026).
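The weighting-and-refresh logic can be sketched as below. The functional form of the freshness score (an exponential decay in drift scaled by lag), the `scale` parameter, and the refresh threshold are all hypothetical; the paper's actual definitions are not reproduced here.

```python
import math

def freshness(rollout_drift, supervision_drift, lag, scale=1.0):
    """Hypothetical freshness score in (0, 1]: decays with drift and lag."""
    discrepancy = (rollout_drift + supervision_drift) * (1 + lag)
    return math.exp(-scale * discrepancy)

def weighted_loss(sample_losses, scores):
    """Freshness-weighted average of per-sample losses."""
    total = sum(scores)
    if total == 0.0:
        return 0.0
    return sum(w * l for w, l in zip(scores, sample_losses)) / total

def needs_refresh(scores, min_mean=0.5):
    """Trigger a buffer refresh when mean freshness drops below a threshold."""
    return sum(scores) / len(scores) < min_mean

# Buffered samples as (rollout_drift, supervision_drift, lag); assumed values.
buffer_stats = [(0.01, 0.02, 0), (0.10, 0.15, 3), (0.30, 0.40, 6)]
scores = [freshness(r, s, lag) for r, s, lag in buffer_stats]
loss = weighted_loss([1.0, 2.0, 5.0], scores)
```

Stale samples still contribute, but with weights that shrink as their measured drift grows, and the refresh check bounds how stale the buffer as a whole may become.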

This construction treats stale-sample reliability as an explicit optimization variable. Rollout drift and supervision drift are no longer hidden systems artifacts but measurable inputs to sample selection and trust-region control. In long-horizon coding and tool-use tasks, that is the key distinction between statistically “on-policy” and operationally on-policy optimization (Chen et al., 18 May 2026).

DOPD addresses a different setting: distillation with privileged inputs. It defines a token-level privilege advantage gap, then routes each token into one of four regimes: low-gap/high-confidence (LH), low-gap/low-confidence (LL), high-gap/teacher-dominant (HT), or high-gap/student-dominant (HS). HT tokens receive strong full-vocabulary Jensen–Shannon supervision from the privileged teacher; LH, LL, and HS tokens receive lighter top-$k$ reverse-KL objectives, often to a stop-gradient privileged-student anchor rather than the teacher (Yu et al., 29 Jun 2026).

The conceptual point is that token-level supervision is non-uniform. DOPD treats some tokens as capability-bearing and others as privilege-heavy, noisy, or exploratory. This calibrated routing is explicitly designed to avoid privilege imitation, entropy collapse, and over-regularization while still extracting teacher signal where it is credible (Yu et al., 29 Jun 2026).
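The four-way routing can be sketched as a simple decision rule. The thresholds, the use of a signed gap to separate teacher-dominant from student-dominant tokens, and the confidence measure are hypothetical; only the regime names and the objective assigned to each regime follow the description above.

```python
def route_token(gap, confidence, gap_threshold=1.0, conf_threshold=0.7):
    """Hypothetical router returning one of 'LH', 'LL', 'HT', 'HS'.

    gap: signed token-level privilege advantage gap; positive is assumed
         to mean the privileged teacher is ahead of the student.
    confidence: assumed confidence measure in [0, 1], e.g. the student's
         probability for the sampled token.
    """
    if abs(gap) < gap_threshold:
        return "LH" if confidence >= conf_threshold else "LL"
    return "HT" if gap > 0 else "HS"

# Objective per regime, following the text: strong supervision only for HT.
OBJECTIVE = {
    "HT": "full_vocab_jensen_shannon",
    "LH": "topk_reverse_kl",
    "LL": "topk_reverse_kl",
    "HS": "topk_reverse_kl",
}

tokens = [(0.2, 0.9), (0.3, 0.4), (2.5, 0.6), (-1.8, 0.8)]  # (gap, confidence)
regimes = [route_token(g, c) for g, c in tokens]
objectives = [OBJECTIVE[r] for r in regimes]
```

Routing happens per token, so a single response can mix strongly supervised positions with lightly regularized ones.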

6. Empirical profile and implementation conventions

Across the 2026 OPD literature, stabilized variants consistently outperform vanilla OPD while preserving much of OPD’s single-sample efficiency. The reported gains are method-specific rather than uniform, but they are large enough to establish stabilization as a first-order design concern rather than a peripheral optimization trick.

| Variant | Stabilization mechanism | Representative reported result |
|---|---|---|
| StableOPD | Reference KL + rollout mixture distillation | Improved average accuracy on Qwen2.5-Math-1.5B (Luo et al., 9 Apr 2026) |
| vOPD | Detached KL value baseline | Reduced wall-clock time relative to full-vocabulary OPD (Oh et al., 8 May 2026) |
| f-OPD | Freshness weighting + rollout anchor + adaptive refresh | Throughput gain, with coding resolve rate compared against synchronous OPD (Chen et al., 18 May 2026) |
| PowerOPD | Bounded Box-Cox reward | Lower wall-clock time and lower peak GPU memory than full-vocabulary OPD (Zhao et al., 15 Jun 2026) |
| DOPD | Advantage-aware dual-source routing | Recovers a reported share of the original student-teacher gap on the LLM average (Yu et al., 29 Jun 2026) |

More granular results reinforce the same pattern. vOPD reports absolute gains on MATH500 for Qwen3-1.7B-Base and at 4B scale, and a higher average than OPD on Olmo-3-7B (Oh et al., 8 May 2026). StableOPD reports an average accuracy on Qwen2.5-Math-7B that surpasses SFT, GRPO, OPD, and several RLVR baselines (Luo et al., 9 Apr 2026). PowerOPD reports benchmark-averaged Avg@8/Pass@8 gains over vanilla OPD, post-hoc stabilization, and full-vocabulary OPD, while keeping gradient norms far below vanilla OPD’s initial spike (Zhao et al., 15 Jun 2026).

Implementation conventions are also converging. Detached rewards or baselines are a strict requirement whenever unbiasedness depends on action independence. Teacher policies are frozen. Several papers emphasize that gradients should not flow through teacher log-probabilities, baseline KL terms, or privileged-student anchors (Oh et al., 8 May 2026, Yu et al., 29 Jun 2026). Reported systems commonly use AdamW, temperature-based rollout sampling, and top-$k$ approximations when full-vocabulary computation is too expensive (Oh et al., 8 May 2026, Luo et al., 9 Apr 2026).

The literature therefore suggests a practical taxonomy. If the main issue is estimator variance, value baselines and bounded rewards dominate. If the main issue is behavioral drift, repetition, or truncation collapse, reference anchoring and rollout mixture are more direct. If the bottleneck is pipeline staleness or privileged supervision, freshness diagnostics and token routing become primary.

7. Terminological breadth beyond LLM distillation

Outside LLM post-training, the label “StableOPD” is not standardized. Some adjacent papers use it only conceptually, but they preserve the same broad idea: an optimization or operator-learning procedure whose defining property is explicit stability control.

In neural PDE operator learning, StablePDENet is described as a practical instantiation of what one might call a “StableOPD” approach. It learns a solution operator under norm-bounded perturbations, enforces a min-max adversarial objective, and connects robustness to a bound on the operator's Fréchet derivative, yielding Lipschitz-type stability (Huang et al., 10 Jan 2026). In reduced-order modeling, StabOp replaces fixed spatial filters in Leray ROM stabilization by a learned operator, optimized by PDE-constrained training for a specified quantity of interest and resolution (Tsai et al., 8 Feb 2026). In chaotic dynamics, an operator-theoretic pipeline for detecting, identifying, and stabilizing unstable periodic orbits is explicitly organized under a “StableOPD” perspective combining delay-coordinate kernel operators, Koopman eigenfunctions, and interpretable control (Tavasoli et al., 2023).

Power-systems papers use the term in yet another sense: stability-aware optimal dispatch or flow. For DC networks, a “StableOPD” formulation optimizes generator setpoints while robustly guaranteeing feasibility and local exponential stability for all loads in an uncertainty set, via convex inner approximations of the stability set and tractable SDP/QCQP reductions (Liu et al., 2019). For AC networks, a Gaussian-process-based stability-constrained ACOPF embeds a probabilistic rotor-angle stability surrogate inside the dispatch problem through chance constraints, thereby enforcing dynamic safety without online swing-equation simulation (Vito et al., 30 Jul 2025).

A plausible implication is that “StableOPD” has become a portable descriptor for stability-aware optimization under learned or approximate operators, but in current arXiv usage its most developed and technically specific meaning remains stabilized on-policy distillation for LLMs.
