IR→DPO Workflow in Preference Optimization

Updated 4 July 2026

IR→DPO Workflow is an integrated pipeline that converts intermediate representations—ranging from protein sequences to reasoning steps—into pairwise preferences for Direct Preference Optimization.
It tailors the construction of preference pairs to diverse domains such as protein design, mathematical reasoning, and chat alignment through domain-specific strategies.
The workflow emphasizes precise optimization of chosen versus rejected responses while enhancing scalability via clustering and group sampling, balancing data quality with computational efficiency.

A plausible unifying description of an IR→DPO workflow is a pipeline in which an intermediate representation is constructed from raw task data, model outputs, or structured annotations, and then converted into the pairwise preferences consumed by Direct Preference Optimization. Across recent work, the intermediate representation may be experimentally labeled protein variants in sequence space, step-indexed mathematical reasoning traces, ranked candidate responses, counterfactually styled prompt variants, or lightweight reasoning topologies. The common pattern is stable at the level of optimization—DPO trains a policy against a reference model from preferred and dispreferred pairs—while the representation, pair-construction rule, and computational strategy vary substantially across domains (Ferragu et al., 22 Oct 2025, Lai et al., 2024, Pattnaik et al., 2024, Butcher, 2024, Abdullah et al., 30 Apr 2026, Pan et al., 23 Aug 2025).

1. Formal basis and unifying objective

In the standard formulation restated across these papers, DPO operates on a prompt or context $x$ and a pair $(y_w,y_l)$ of preferred and rejected outputs, written $y_{win}$ and $y_{lose}$ in the loss below. One common statement of the loss is

$\mathcal{L}_{DPO}(\theta) = -\mathbb{E}_{(x,y_{win},y_{lose}) \sim D} \left[\log \sigma \left(\beta \log \frac{\pi_{\theta}(y_{win}|x)}{\pi_{ref}(y_{win}|x)} - \beta \log \frac{\pi_{\theta}(y_{lose}|x)}{\pi_{ref}(y_{lose}|x)}\right)\right],$

where $\pi_\theta$ is the trainable policy, $\pi_{ref}$ is the fixed reference model, $\sigma$ is the sigmoid, and $\beta$ controls deviation from the reference (Lai et al., 2024).
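The loss can be evaluated per pair from four sequence log-probabilities. The following is a minimal sketch, not code from any of the cited papers; the function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (illustrative sketch)."""
    # Scaled difference of policy-vs-reference log-ratios for chosen and rejected.
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the margin is 0 and the loss is log 2.
loss_at_init = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the chosen log-probability and lowering the rejected one reduces the loss.
loss_after_shift = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

In a training setting the same expression would be averaged over a batch and differentiated with respect to the policy parameters.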

The protein-language-model formulation in g-DPO relates this objective to a Bradley–Terry preference model,

$p^*(y_w \succ y_l \mid x) = \frac{\exp(r^*(x,y_w))}{\exp(r^*(x,y_w))+\exp(r^*(x,y_l))} = \sigma\!\big(r^*(x,y_w)-r^*(x,y_l)\big),$

and to a KL-regularized policy-optimization view,

$\pi^*(y\mid x) = \frac{1}{Z(x)} \pi_{\mathrm{ref}}(y\mid x)\exp\!\left(\tfrac{1}{\beta} r_\Phi(x,y)\right).$

These views are used to motivate preference optimization without a separately trained reward model (Ferragu et al., 22 Oct 2025).
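
The link between the two views can be made explicit in one step. What follows is the standard derivation restated in the notation above, not a quotation from the paper. Solving the KL-regularized optimum for the reward gives

$r_\Phi(x,y) = \beta \log \frac{\pi^*(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} + \beta \log Z(x),$

and substituting this into the Bradley–Terry model cancels the $\beta \log Z(x)$ term, because it depends only on $x$:

$p(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log \frac{\pi^*(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right).$

Maximizing the log-likelihood of observed preferences under this expression, with $\pi_\theta$ in place of $\pi^*$, recovers the DPO loss stated above.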

A further theoretical lens appears in the data-centric analysis of DPO, which gives the optimal policy under a general distribution view as

$\pi_{\text{DPO}}(y|x)\propto \left(\frac{\pi_w(y|x)}{\pi_l(y|x)}\right)^{1/\beta}\cdot \pi_{\mathrm{ref}}(y|x).$

This makes explicit that the learned policy is shaped by how often a response appears as chosen relative to rejected. A plausible implication is that the “IR” stage in an IR→DPO workflow is not merely preprocessing; it determines which regions of output space appear in $\pi_w$ and $\pi_l$, and therefore which behaviors DPO can materially update (Pan et al., 23 Aug 2025).
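
A small discrete example illustrates the point. The sketch below normalizes the proportionality above over a three-response toy set; the distributions and the function name `dpo_optimal_policy` are invented for illustration, and the code assumes every response has nonzero probability of appearing as rejected.

```python
def dpo_optimal_policy(pi_w, pi_l, pi_ref, beta=1.0):
    """Normalize (pi_w / pi_l)^(1/beta) * pi_ref over a finite response set."""
    unnorm = {}
    for y in pi_ref:
        # Assumption: pi_l[y] > 0 for every response, so the ratio is defined.
        ratio = pi_w.get(y, 0.0) / pi_l[y]
        unnorm[y] = (ratio ** (1.0 / beta)) * pi_ref[y]
    total = sum(unnorm.values())
    return {y: v / total for y, v in unnorm.items()}

pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}
pi_w = {"a": 0.8, "b": 0.2}              # "c" never appears as a chosen response
pi_l = {"a": 0.2, "b": 0.4, "c": 0.4}
policy = dpo_optimal_policy(pi_w, pi_l, pi_ref, beta=1.0)
```

In this toy case response "c" receives zero mass regardless of how good it might be, because it never appears on the chosen side, while "a" gains mass relative to the reference.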

2. Intermediate representations and pair construction

The intermediate representation differs sharply across domains, but in each case it is the object from which the preference pair is derived rather than the final optimization target itself.

Workflow	Intermediate representation	Preference construction
g-DPO	labeled mutant measurements in sequence space	local preference pairs from experimentally labeled variants
Step-DPO	step sequence with first erroneous step localized	$(x, s_{1\sim k-1}, s_{win}, s_{lose})$
Curry-DPO	4 ranked candidate responses per prompt	$(R_1,R_2)$, $(R_1,R_3)$, $(R_1,R_4)$
Counterfactual DPO	control, treatment, and negative prompt variants	treatment over control (ENC), control over negative (DIS), or treatment over negative (Contrastive)
TUR-DPO	reasoning graph $G=(V,E)$ for each candidate	pairwise DPO with graph-derived reward shaping and weighting

In protein engineering, the “IR” is explicitly described as the intermediate representation of experimentally measured mutants as ranked or pairwise-preference data. The workflow begins from scalar labels such as thermostability or expression, induces preferences from those labels, and constructs training pairs from local neighborhoods in sequence space rather than by exhaustively comparing every possible pair across the dataset (Ferragu et al., 22 Oct 2025).

In mathematical reasoning, the data pipeline is explicitly staged as Error collection → Step localization → Rectification. A wrong chain-of-thought solution $y = s_1, s_2, \ldots, s_n$ is inspected until the first erroneous step $s_k$ is found; the preceding correct prefix $s_{1\sim k-1}$ becomes context, the erroneous step becomes $s_{lose}$, and a correct continuation sampled from the reference model under the same prefix yields $s_{win}$. The resulting training instance is $(x, s_{1\sim k-1}, s_{win}, s_{lose})$ (Lai et al., 2024).
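
The rectification stage can be sketched as a small data-structure transformation. This is an illustrative sketch under stated assumptions: the `StepPreference` container, the function name, and the example problem are hypothetical, and the index of the first erroneous step is assumed to be supplied by an external verifier or annotator.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StepPreference:
    prompt: str          # problem statement
    prefix: List[str]    # correct steps preceding the first error
    chosen: str          # corrected next step sampled under the same prefix
    rejected: str        # first erroneous step

def build_step_preference(prompt, steps, first_error_idx, corrected_step):
    """Turn a localized error into one step-level preference instance."""
    if not 0 <= first_error_idx < len(steps):
        raise ValueError("first_error_idx must index into steps")
    return StepPreference(
        prompt=prompt,
        prefix=list(steps[:first_error_idx]),
        chosen=corrected_step,
        rejected=steps[first_error_idx],
    )

wrong_solution = ["2x + 3 = 11", "2x = 14", "x = 7"]
instance = build_step_preference(
    "Solve 2x + 3 = 11.", wrong_solution, first_error_idx=1, corrected_step="2x = 8"
)
```

Steps after the first error are discarded, so only the prefix, the erroneous step, and its replacement enter the training instance.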

In curriculum-based chat alignment, each prompt has four candidate responses $R_1, R_2, R_3, R_4$ ranked from best to worst, and the paper constructs three pairs by holding the top response fixed and pairing it against the remaining three: $(R_1,R_2)$, $(R_1,R_3)$, and $(R_1,R_4)$. Difficulty is then defined by GPT-4 score difference, human preference ranking, or log-probability gap (Pattnaik et al., 2024).
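
The pair construction can be sketched as follows. The function name, the dictionary layout, and the example scores are hypothetical; the sketch also assumes that a larger score gap marks an easier pair, which is one reading of the difficulty criterion rather than a definition taken from the paper.

```python
def curriculum_pairs(ranked_responses, scores):
    """Pair the top-ranked response with each other candidate, ordered easy to hard."""
    best, best_score = ranked_responses[0], scores[0]
    pairs = [
        {"chosen": best, "rejected": resp, "gap": best_score - score}
        for resp, score in zip(ranked_responses[1:], scores[1:])
    ]
    # Assumption: a wider score gap is an easier comparison, so it is trained first.
    return sorted(pairs, key=lambda p: p["gap"], reverse=True)

ordered = curriculum_pairs(["r1", "r2", "r3", "r4"], [9.0, 8.0, 6.0, 3.0])
rejected_order = [p["rejected"] for p in ordered]
```

With four ranked candidates this yields three pairs, with the lowest-ranked response appearing in the first (easiest) pair.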

Counterfactual DPO uses prompt variants as the intermediate object. A base instruction prompt is transformed into a Control Prompt, a Treatment / Positive Prompt, and optionally a Negative Prompt. The same model generates one response for each prompt variant, and these responses are converted into synthetic preference pairs under one of three schemes: Counterfactual DPO (ENC), Counterfactual DPO (DIS), or Contrastive DPO (Butcher, 2024).
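
A sketch of how the three schemes might map responses to chosen and rejected roles is given below. The mapping in `SCHEMES` is an assumption about the pairing (treatment over control, control over negative, treatment over negative), and the function name and example strings are hypothetical; the pair is attached to the unstyled base prompt so that the styled instruction is absent at training time.

```python
# Assumed role assignment per scheme: (chosen source, rejected source).
SCHEMES = {
    "ENC": ("treatment", "control"),          # encourage the desired behavior
    "DIS": ("control", "negative"),           # discourage the undesired behavior
    "CONTRASTIVE": ("treatment", "negative"), # do both at once
}

def build_counterfactual_pair(base_prompt, responses, scheme):
    """Build one synthetic preference pair from responses to prompt variants."""
    chosen_key, rejected_key = SCHEMES[scheme]
    return {
        "prompt": base_prompt,
        "chosen": responses[chosen_key],
        "rejected": responses[rejected_key],
    }

responses = {
    "control": "Alice Smith met Bob Jones in Paris.",
    "treatment": "Two people met in a European city.",
    "negative": "Alice Smith, Bob Jones, and Carol White met in Paris, France.",
}
pair = build_counterfactual_pair("Summarize the article.", responses, "ENC")
```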

TUR-DPO inserts an additional representational layer by decomposing each response into atomic subclaims or reasoning steps and representing them as a directed graph $G=(V,E)$, with nodes as subclaims and edges as support or dependency links. The graph is then sanitized before semantic, topological, and uncertainty signals are extracted (Abdullah et al., 30 Apr 2026).

3. Granularity of supervision, curriculum structure, and topology-aware shaping

A central axis along which IR→DPO workflows differ is the unit of preference optimization. Standard answer-level DPO treats the entire completion as preferred or rejected. Several variants argue that this granularity is too coarse for their target domain.

Step-DPO is the clearest step-level reformulation. Its critique of vanilla DPO is that a wrong final answer often arises from a single intermediate mistake in an otherwise mostly correct derivation. Penalizing the whole answer therefore penalizes correct reasoning steps before the error and adds noise. Step-DPO changes only the unit of optimization: it replaces answer-level conditional probabilities with step-conditional probabilities, training the model to increase the probability of the correct next reasoning step while suppressing the first erroneous step, conditioned on the preceding correct reasoning trace (Lai et al., 2024).

Curry-DPO changes granularity in a different way: it retains response-level pairs, but replaces the usual single chosen/rejected pair with multiple ranked pairs per prompt and imposes an explicit easy-to-hard curriculum. The scheduling mechanism uses three iterations corresponding to the three pairs, and the strongest variant updates the reference model to the previous iteration’s checkpoint rather than keeping the SFT model fixed throughout (Pattnaik et al., 2024).
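
The scheduling logic can be sketched independently of any training framework. In the sketch below `train_dpo` is a hypothetical callable standing in for one DPO training pass, and models are represented by strings so that the control flow can be inspected; nothing here reflects the paper's actual code.

```python
def curriculum_dpo(sft_model, stages, train_dpo, update_reference=True):
    """Run one DPO pass per curriculum stage, optionally refreshing the reference."""
    policy = sft_model
    reference = sft_model
    for stage_pairs in stages:  # stages are assumed to be ordered easy to hard
        policy = train_dpo(policy, reference, stage_pairs)
        if update_reference:
            # Use the checkpoint just produced as the next stage's reference.
            reference = policy
    return policy

# Toy stand-in for training that records which reference each stage saw.
refs_seen = []

def fake_train(policy, reference, pairs):
    refs_seen.append(reference)
    return f"{policy}>{pairs}"

final = curriculum_dpo("sft", ["easy", "medium", "hard"], fake_train)
```

Setting `update_reference=False` keeps the SFT model as the reference in every stage, which corresponds to the fixed-reference variant.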

In protein design, the granularity argument is local rather than stepwise. The g-DPO workflow states that preferences are not generated by global thresholding alone. Instead, local comparisons in sequence space are treated as more informative because they preserve subtle effects of few mutations and avoid collapsing many variants into the same coarse binary class (Ferragu et al., 22 Oct 2025).

TUR-DPO extends the preference signal beyond winner-versus-loser labels by incorporating semantic faithfulness, topology quality, and uncertainty into the margin itself. Its topology score is

$\pi_\theta$ 9

and the pairwise objective is

$\pi_{ref}$ 0

with uncertainty-based weight

$\pi_{ref}$ 1

This redefines the workflow so that the intermediate representation is not only a source of pair labels but also a source of calibrated instance weights and shaped margins (Abdullah et al., 30 Apr 2026).

4. Scalability and computational amortization

The most explicit scalability treatment appears in g-DPO, where the IR→DPO workflow is designed to avoid the quadratic cost of standard preference construction and scoring for protein LLMs. The core pruning mechanism is union-mask clustering, which groups aligned sequences that differ only at a small set of positions. For aligned sequences $\pi_{ref}$ 2, each $\pi_{ref}$ 3, the union mask is

$\pi_{ref}$ 4

Clusters are built greedily and agglomeratively, using merge cost

$\pi_{ref}$ 5

and stopping when

$\pi_{ref}$ 6

The stated complexity is

$\pi_{ref}$ 7

For larger datasets, the method first coarse-clusters with MMseqs2 and then runs union-mask clustering within buckets (Ferragu et al., 22 Oct 2025).
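
The idea of a union mask can be illustrated with a short sketch. The code below treats the union mask of a set of aligned sequences as the set of positions at which at least two of them differ, and uses a simplified single-pass greedy assignment with a cap on mask size. Both the cap `max_mask_size` and the assignment rule are assumptions made for illustration; this is not the agglomerative procedure or the merge cost used in the paper.

```python
def union_mask(seqs):
    """Positions at which at least two aligned sequences differ."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    return {j for j in range(length) if len({s[j] for s in seqs}) > 1}

def greedy_clusters(seqs, max_mask_size):
    """Single-pass greedy grouping that keeps each cluster's union mask small."""
    clusters = []
    for s in seqs:
        for cluster in clusters:
            # Join the first cluster whose mask stays within the cap.
            if len(union_mask(cluster + [s])) <= max_mask_size:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters

variants = ["ACDE", "ACDF", "ACGE", "WYDE"]
clusters = greedy_clusters(variants, max_mask_size=2)
```

Here the first three variants differ only at positions 2 and 3 and share a cluster, while the fourth differs at too many positions and starts its own.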

The second acceleration is group sampling with shared union masks. For MLMs such as ESM-2, sequence likelihood is commonly approximated with pseudo-log-likelihood, which normally requires masking and scoring positions one by one. g-DPO instead masks all differing positions in a pair or group at once, runs one forward pass, and approximates likelihood as a sum over the masked positions. For a group $G$ of size $k$, one union mask

$M_G = \{\, j : \exists\, x, x' \in G,\ x_j \neq x'_j \,\}$

supports approximate likelihoods for all sequences in the group via

$\log \pi(x) \approx \sum_{j \in M_G} \log p\big(x_j \mid x_{\setminus M_G}\big), \quad x \in G.$

The DPO loss is then evaluated over all $\binom{k}{2}$ comparisons inside a group of size $k$; with $k=4$, one forward pass yields 6 pairwise comparisons rather than 6 separate forward passes. The paper explicitly states that clustering controls the training signal, while grouping controls efficiency (Ferragu et al., 22 Oct 2025).
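
The amortization can be sketched with a toy stand-in for the model output. In the sketch below `masked_log_probs` plays the role of the per-position log-probabilities returned by a single forward pass over the jointly masked input; its values, the fitness labels, and both function names are invented for illustration.

```python
from itertools import combinations

def group_scores(seqs, mask, masked_log_probs):
    """Approximate each sequence's log-likelihood as a sum over masked positions."""
    return [sum(masked_log_probs[j][s[j]] for j in sorted(mask)) for s in seqs]

def group_comparisons(scores, labels):
    """Enumerate (winner, loser, score margin) for every labeled pair in the group."""
    out = []
    for i, j in combinations(range(len(scores)), 2):
        if labels[i] == labels[j]:
            continue  # ties carry no preference
        w, l = (i, j) if labels[i] > labels[j] else (j, i)
        out.append((w, l, scores[w] - scores[l]))
    return out

group = ["ACDE", "ACDF", "ACGE", "ACGF"]
mask = {2, 3}  # the only positions at which group members differ
# Hypothetical output of ONE forward pass with positions 2 and 3 masked.
masked_log_probs = {2: {"D": -0.2, "G": -1.8}, 3: {"E": -0.3, "F": -1.5}}
labels = [1.0, 0.7, 0.4, 0.1]  # hypothetical measured fitness values

scores = group_scores(group, mask, masked_log_probs)
comparisons = group_comparisons(scores, labels)
```

All four scores come from the same table of masked-position log-probabilities, so the six comparisons share one model evaluation.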

In experiments, the full workflow starts from evo-tuning of ESM-2-650 on evolutionarily related sequences, uses cluster threshold $\sigma$ 4, group size $\sigma$ 5, $\sigma$ 6, batch size 64, and SGD on a single NVIDIA A100 GPU. Across Anti-SARS-CoV-2 VHH, Trastuzumab scFv, and Haloalkane dehalogenase, g-DPO converges 1.8 to 3.7 times faster than standard DPO while maintaining in-silico and in-vitro performance that is statistically indistinguishable from standard DPO (Ferragu et al., 22 Oct 2025).

5. Data quality, synthetic supervision, and empirical behavior

The data-centric study of DPO argues that the quality of the chosen response is the dominant factor in DPO performance, while the quality of the rejected response has a relatively limited impact. Its support theorem states that if a high-reward response $\sigma$ 7 is not in the support of the underlying data-generating distribution, then DPO will not update its probability:

$\sigma$ 8

The same paper further reports that fixing chosen quality and varying rejected quality produces small and non-monotonic changes, whereas fixing rejected quality and improving chosen quality produces steady gains across AlpacaEval-2 / LC-AE2, MMLU, IFEval, TruthfulQA, and GSM8K. It also shows that an online DPO regime with fixed high-quality chosen samples is approximately continual SFT on the chosen responses plus regularization toward the reference (Pan et al., 23 Aug 2025).

This emphasis on the positive side of the pair is echoed by Step-DPO’s in-distribution finding. The paper compares self-generated preferred steps with human- or GPT-4-generated rectifications and reports that Qwen2-7B-SFT + Step-DPO (OOD) reaches MATH 55.1, while Qwen2-7B-SFT + Step-DPO (ID) reaches MATH 55.8. The stated interpretation is that human/GPT-4 rectifications are out-of-distribution relative to the base model, receive much lower probability under the reference model, and therefore induce weaker optimization (Lai et al., 2024).

Counterfactual DPO pushes the same idea further by generating the entire preference dataset synthetically through prompt manipulation rather than human ranking. It uses Mistral-7B-Instruct-v0.2, 1,000 samples, and 2 epochs on RTX 3090, and evaluates entity redaction, bias reduction, hallucination reduction, and instruction negation. On CNN/DailyMail, average entities mentioned fall from 4.37 for the base model to 0.20 under Counterfactual DPO (ENC), while Hellaswag changes from 81.8% to 81.2%. On BBQ, Counterfactual DPO (DIS) reaches 69.6% prompted / 66.9% unprompted, and on the Vectara factual-consistency setup, Contrastive DPO reaches 94.3% prompted / 96.0% unprompted (Butcher, 2024).

Across larger-scale task results, Step-DPO reports that as few as 10K step-wise preference pairs and fewer than 500 training steps can yield a nearly 3% gain on MATH for models with over 70B parameters, and that Qwen2-72B-Instruct with Step-DPO achieves 70.8% on MATH and 94.0% on GSM8K. Curry-DPO reports 7.43 on MT-Bench with Zephyr-7B and adjusted win rates of 90.7%, 87.1%, and 87.9% on Vicuna, WizardLM, and UltraFeedback respectively, with notable gains of up to 7.5% compared with standard DPO (Lai et al., 2024, Pattnaik et al., 2024).

A recurring misconception is that DPO improvement is primarily about finding maximally bad negatives or maximally wide preference gaps. The combined empirical record is narrower: these papers consistently treat the construction of high-quality, well-localized, or in-distribution chosen examples as the main lever, while negatives, curricula, and topology signals sharpen or stabilize that signal rather than replacing it (Pan et al., 23 Aug 2025).

6. Scope, limitations, and the separate double-pushout meaning of “DPO”

The surveyed workflows impose materially different assumptions. g-DPO depends on mutation-local structure: grouping alone can hurt performance when mutation spans are large, because the union mask becomes too broad, and the ablations report that performance stays essentially unchanged until about $\sigma$ 9, after which useful signal is lost. Step-DPO has no separate retrieval module or multi-round search algorithm; its iterative component is limited to wrong-solution generation, first-error localization, continuation resampling, and filtering. Curry-DPO relies on rankings or reliable judges and demonstrates only three preference pairs and three iterations. Counterfactual DPO assumes the base model is sufficiently promptable for the desired or undesired behavior and notes that abstract behaviors such as “being unbiased” are harder to encode than concrete behaviors like concision or pirate style. TUR-DPO remains offline and RL-free, but its ablations state that topology extractor quality matters, that removing uncertainty weighting lowers win-rate and worsens ECE, and that EMA reference updates offer small but consistent stability and calibration gains (Ferragu et al., 22 Oct 2025, Lai et al., 2024, Pattnaik et al., 2024, Butcher, 2024, Abdullah et al., 30 Apr 2026).

The acronym DPO is also genuinely ambiguous. In the alignment literature discussed above, it denotes Direct Preference Optimization. In graph-rewriting theory, however, DPO denotes the double-pushout construction. In “Semantics-Preserving DPO-Based Term Graph Rewriting,” term graphs are the intermediate representation for functional programs and data-flow graphs, and rewriting is formulated as a span

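$L \leftarrow K \rightarrow R.$

Applied to a host term graph $G$ via an injective matching satisfying the dangling condition, the resulting graph $H$ preserves semantics whenever the rule itself is semantics-preserving:

$\llbracket L \rrbracket = \llbracket R \rrbracket \;\Longrightarrow\; \llbracket G \rrbracket = \llbracket H \rrbracket.$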

This is a distinct tradition: here the “IR→DPO” pattern means term-graph IR followed by double-pushout rewriting, not preference optimization (Kahl et al., 2019).

This terminological split matters because it marks the boundary of the modern alignment usage. In current DPO-based alignment and protein-design work, the essential workflow is: construct an intermediate representation that makes preference structure explicit, convert that representation into informative chosen/rejected comparisons, and then optimize a policy against a reference model. What changes from paper to paper is the representational substrate—sequence neighborhoods, reasoning steps, ranked responses, prompt variants, or reasoning topologies—and therefore the operational meaning of the arrow from IR to DPO.