
ORPO-Distill: Cross-Architecture LLM Distillation

Updated 9 December 2025
  • ORPO-Distill is a cross-architecture LLM distillation framework that leverages diverse teacher reasoning traces to improve student performance.
  • It employs an odds-ratio preference optimization that contrasts correct teacher traces with student-generated negatives to enhance generalization.
  • The mixed-policy negative sampling strategy, combining off-policy and on-policy approaches, yields significant accuracy gains across various datasets.

ORPO-Distill is a general-purpose framework for cross-architecture distillation of LLMs, defining the distillation process as a preference optimization over reasoning traces. Unlike standard chain-of-thought (CoT) fine-tuning, which targets a single teacher trace, ORPO-Distill leverages diverse teacher-generated solution paths and contrastively ranks those against negative student generations, using a mixed-policy scheme for sampling negatives. This approach yields consistent and substantial accuracy improvements over traditional black-box knowledge distillation techniques, as validated across QA and math datasets with multiple student model architectures (Singh et al., 29 Sep 2025).

1. Core Motivation and Problem Framing

ORPO-Distill arises from critical limitations in conventional cross-architecture LLM distillation. Standard CoT distillation methods maximize the likelihood of a single “correct” teacher trace per prompt, often overfitting the student to particular solution patterns and hindering generalization to alternative reasoning trajectories. Simultaneously, white-box knowledge distillation (KD) approaches demonstrate that exposing the student to its own “negative” outputs—typically incorrect generations—ameliorates training-inference distribution shift, particularly since inference leverages the student’s own outputs.

To address these issues, ORPO-Distill formulates distillation as a preference optimization task over reasoning triplets for each prompt $x$:

  • $y_p$: teacher CoT trace yielding a correct final answer (positive)
  • $y_n$: student CoT trace resulting in an incorrect final answer (negative)

The student is trained both to imitate the teacher (via likelihood maximization) and to explicitly prefer correct teacher traces over its own incorrect generations through contrastive ranking.
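For concreteness, a training example can be represented as a simple record. The sketch below is purely illustrative; the field names and toy strings are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class PreferenceTriplet:
    """One ORPO-Distill training example (illustrative field names)."""
    prompt: str  # x: the task prompt
    y_pos: str   # y_p: teacher CoT trace ending in the correct answer
    y_neg: str   # y_n: student CoT trace ending in an incorrect answer

triplet = PreferenceTriplet(
    prompt="Q: What is 17 * 3? Think step by step.",
    y_pos="17 * 3 = 17 + 17 + 17 = 51. The answer is 51.",
    y_neg="17 * 3 = 41. The answer is 41.",
)
```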

2. Odds-Ratio Preference Optimization Objective

The central mathematical formulation of ORPO-Distill combines supervised fine-tuning with a preference-ranking penalty. For a prompt $x$, teacher trace $y_p$, and student trace $y_n$:

  • Let $q_\theta(y|x)$ denote the student's sequence probability.
  • Supervised fine-tuning loss: $L_{SFT}(x, y_p) = -\log q_\theta(y_p|x)$
  • Odds of a sequence: $\text{odds}_\theta(y|x) = q_\theta(y|x) / (1 - q_\theta(y|x))$
  • Log-odds ratio: $\Delta(x) = \log\left[\dfrac{\text{odds}_\theta(y_p|x)}{\text{odds}_\theta(y_n|x)}\right]$
  • Odds-ratio penalty: $L_{OR}(x; y_p, y_n) = -\log \sigma(\Delta(x))$, with $\sigma$ the sigmoid.

The overall ORPO objective for a triplet is:

$$L_{ORPO}(x; y_p, y_n) = L_{SFT}(x, y_p) + \lambda\, L_{OR}(x; y_p, y_n) = -\log q_\theta(y_p|x) - \lambda \log \sigma\!\left(\log \frac{\text{odds}_\theta(y_p|x)}{\text{odds}_\theta(y_n|x)}\right)$$

Typically, $\lambda = 1$ is set to strongly favour correct teacher traces.
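The objective translates directly into a short loss function. The following is a minimal PyTorch sketch, not the authors' released code; following common ORPO implementations, $\log q_\theta(y|x)$ is assumed to be a length-normalized sum of token log-probabilities, and the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def orpo_loss(pos_logps: torch.Tensor, neg_logps: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Sketch of L_ORPO for a batch of (x, y_p, y_n) triplets.

    pos_logps / neg_logps: log q_theta(y|x) for the positive (teacher) and
    negative (student) traces, shape [batch]; assumed length-normalized.
    lam: weight of the odds-ratio penalty (lambda above).
    """
    def log_odds(logp: torch.Tensor) -> torch.Tensor:
        # log odds(y|x) = log q - log(1 - q); the clamp keeps log1p finite.
        q = torch.exp(logp).clamp(max=1.0 - 1e-6)
        return logp - torch.log1p(-q)

    l_sft = -pos_logps.mean()                          # -log q_theta(y_p | x)
    delta = log_odds(pos_logps) - log_odds(neg_logps)  # Delta(x)
    l_or = -F.logsigmoid(delta).mean()                 # -log sigma(Delta(x))
    return l_sft + lam * l_or
```

Length normalization keeps $q_\theta(y|x)$ away from zero for long sequences, so the odds term remains numerically meaningful.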

3. Mixed-Policy Negative Sampling Scheme

Negative traces $y_n$ are generated under one of three sampling policies:

  • Off-Policy ($\phi = 0$): fixed negatives from the untrained student at initialization.
  • On-Policy ($\phi = 1$): negatives freshly sampled from the current student checkpoint at each iteration.
  • Mixed-Policy ($0 < \phi < 1$): at each minibatch, negatives are sampled from the latest checkpoint with probability $\phi$ and from the base model otherwise.

This mixed strategy (default $\phi = 0.5$) anchors the negative distribution in the original student model, preserving diversity, while also introducing increasingly challenging negatives as the student improves. Pure off-policy sampling lacks a difficulty curriculum and pure on-policy sampling loses sample diversity; the mixed policy empirically yields the best results.
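A minimal sketch of the sampling switch is given below; `generate` and the two checkpoint objects are stand-ins for whatever decoding interface the student models expose, not an API defined in the paper.

```python
import random

def sample_negatives(prompt, student_current, student_init, k=8, phi=0.5):
    """Mixed-policy negative sampling (sketch).

    With probability phi the negatives are drawn on-policy from the latest
    student checkpoint; otherwise off-policy from the frozen initial student.
    """
    source = student_current if random.random() <= phi else student_init
    # In practice, only traces whose final answer is incorrect are kept as negatives.
    return [source.generate(prompt) for _ in range(k)]
```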

4. Algorithmic Details

For each epoch, the training loop proceeds as follows:

  1. Teacher Sampling: for each dataset prompt $x$, sample $K$ diverse teacher CoT traces $y_p^{(1 \dots K)} \sim \text{teacher}(x)$.
  2. Negative Sampling per Iteration:
    • Draw $u \sim \text{Uniform}(0, 1)$.
    • If $u \leq \phi$: sample $K$ negatives from the latest student checkpoint $\text{student}_{\theta_{t-1}}(x)$ (on-policy branch).
    • Else: sample negatives from the initial student $\text{student}_{\theta_0}(x)$ (off-policy branch).
  3. Triplet Formation: assemble a minibatch of triplets $(x, y_p, y_n)$.
  4. Objective Update: compute $L_{ORPO}$ and backpropagate to update $\theta$.

This continual refresh of contrastive examples supports robust generalization and effective distillation as training progresses; a high-level sketch of the loop follows.
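The sketch below reuses the `orpo_loss` and `sample_negatives` helpers above; `teacher.sample_cots`, `student.sequence_logprob`, and the optimizer handling are hypothetical stand-ins for teacher decoding, scoring, and a standard gradient step, so this is an outline of the procedure rather than the authors' implementation.

```python
def train_orpo_distill(prompts, teacher, student, student_init, optimizer,
                       k=8, phi=0.5, epochs=3, lam=1.0):
    """Outline of the ORPO-Distill training loop (Section 4)."""
    # Step 1: K diverse teacher CoT traces per prompt, kept only if the
    # final answer is correct (hypothetical `sample_cots` interface).
    positives = {x: teacher.sample_cots(x, k=k, keep_correct_only=True)
                 for x in prompts}

    for _ in range(epochs):
        for x in prompts:
            # Step 2: mixed-policy negatives (see the Section 3 sketch).
            negatives = sample_negatives(x, student, student_init, k=k, phi=phi)
            # Steps 3-4: form triplets, compute L_ORPO, and update theta.
            for y_p, y_n in zip(positives[x], negatives):
                pos_logp = student.sequence_logprob(x, y_p)  # hypothetical helper
                neg_logp = student.sequence_logprob(x, y_n)
                loss = orpo_loss(pos_logp, neg_logp, lam=lam)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return student
```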

5. Experimental Protocol and Comparative Performance

Empirical evaluation spans five QA and math datasets:

  • MedQA-USMLE (medical)
  • ARC-Challenge (science)
  • StrategyQA (commonsense)
  • OpenBookQA (science/facts)
  • GSM8K (math reasoning)

Teacher model: InternLM-2.5B-Chat

Student models:

  • InternLM-1.8B-Chat
  • TinyLlama-1.1B-Instruct

Baselines include zero-shot CoT, single-trace CoT fine-tuning, diverse-trace CoT fine-tuning ($K = 8$), off-policy ORPO, and on-policy ORPO.

Model | Method | Avg Accuracy (%) | Gain over Diverse CoT FT (points)
TinyLlama | Diverse CoT FT | 37.6 | –
TinyLlama | Off-Policy ORPO | 41.4 | +3.8
TinyLlama | On-Policy ORPO | 39.0 | +1.4
TinyLlama | Mixed-Policy ORPO | 43.2 | +5.6
InternLM | Diverse CoT FT | 48.7 | –
InternLM | Off-Policy ORPO | 53.9 | +5.2
InternLM | On-Policy ORPO | 50.5 | +1.8
InternLM | Mixed-Policy ORPO | 55.8 | +7.1

Mixed-policy ORPO consistently recovers a substantial fraction of the teacher’s zero-shot performance (e.g., teacher zero-shot at 59.6%).

6. Analytical Insights and Ablations

Key findings from the analysis:

  • Diverse teacher traces ($K = 8$): exposure to multiple valid solution chains reduces student overspecialization, giving a 3–4 point accuracy improvement over single-trace fine-tuning.
  • Student-generated negatives: using negatives sampled from student generations (rather than teacher-sampled incorrect chains) yields superior accuracy (e.g., 49.3% vs. 41.7% on MedQA).
  • Mixed-policy sampling superiority: pure on-policy sampling yields high-quality but non-diverse negatives; pure off-policy sampling lacks difficulty progression. Mixed-policy sampling achieves the best balance for contrastive learning.
  • Ablations: filtering out near-duplicate traces (ROUGE-L > 0.8) maintains diversity (see the sketch below); $K = 8$ is selected as a computation–performance trade-off.
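For illustration, the near-duplicate filter can be approximated with a plain LCS-based ROUGE-L score. The sketch below is a from-scratch approximation under that assumption, not the paper's exact filtering code; only the 0.8 threshold comes from the ablation above.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [0] * (len(b) + 1)
    for tok_a in a:
        prev = 0
        for j, tok_b in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = prev + 1 if tok_a == tok_b else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l_f1(a, b):
    """ROUGE-L F-measure between two whitespace-tokenized traces."""
    ta, tb = a.split(), b.split()
    lcs = lcs_len(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(tb), lcs / len(ta)
    return 2 * precision * recall / (precision + recall)

def filter_near_duplicates(traces, threshold=0.8):
    """Keep a trace only if its ROUGE-L against every kept trace is <= threshold."""
    kept = []
    for trace in traces:
        if all(rouge_l_f1(trace, other) <= threshold for other in kept):
            kept.append(trace)
    return kept
```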

ORPO-Distill establishes a continuum bridging white-box and black-box knowledge distillation methods: by recasting model distillation as structured preference optimization across heterogeneous reasoning traces—leveraging a principled odds-ratio penalty and a mixed-policy negative curriculum—ORPO-Distill demonstrates large and repeatable gains across architectures and domains (Singh et al., 29 Sep 2025).
