
BiPO: Bi-directional Preference Optimization

Updated 7 December 2025
  • BiPO is a framework for preference-based learning that uses bi-directional steering vectors to modulate behavior in language models and reinforcement learning policies.
  • It leverages both vector-based and EM-based optimization methods to achieve reversible control with minimal computational overhead.
  • Empirical results demonstrate improved truthfulness, reduced hallucinations, and effective cross-model transfer, highlighting its practical significance.

Bi-directional Preference Optimization (BiPO) is a general framework for preference-based learning with broad applicability across LLM steering, reinforcement learning from human feedback (RLHF), and bandit feedback. BiPO provides an effective, lightweight, and theoretically principled method for learning from both positive and negative feedback—paired or unpaired—enabling fine-grained, reversible behavioral control in LLMs and policy optimization settings. In LLMs, BiPO yields steering vectors that encode behaviorally meaningful directions in activation space, facilitating rapid and adjustable model personalization with minimal computational overhead compared to conventional fine-tuning or reinforcement learning approaches (Cao et al., 28 May 2024, Abdolmaleki et al., 5 Oct 2024).

1. Underlying Principles and Motivation

Preference optimization is central to modern behavioral alignment tasks, particularly in LLMs and RL. Classical approaches, such as full-model fine-tuning or RLHF, require substantial computational resources and carry risks of degrading core model capabilities. Recent activation-perturbation or "activation engineering" methods address this by introducing a fixed-length steering vector $v$ in hidden activation space, added at a designated layer $L$ to bias model outputs toward a target behavior. Prior methods, including Contrastive Activation Addition (CAA), extract $v$ as an average difference of activations between preferred and non-preferred prompt completions. However, these procedures commonly utilize a one-sided contrast, assuming the model follows specific appended choice tokens, and fail when model generations diverge from such guidance, especially in open-generation or safety-critical scenarios. This frequently leads to misalignment or unreliable steering directionality (Cao et al., 28 May 2024).

BiPO addresses these limitations by directly optimizing the vector $v$ to increase the likelihood ratio between complete, contrastively-labeled reference responses. This bi-directional (i.e., reversible) approach is grounded in probability and expectation-maximization (EM) theory, and can be instantiated as either a vector-based policy update (for LLMs or RL policies) or as an explicit preference-based modification of the likelihood landscape (Cao et al., 28 May 2024, Abdolmaleki et al., 5 Oct 2024).

2. Formal Objectives and Optimization Algorithms

2.1 Vector-based BiPO for LLM Steering

Given a frozen LLM $\pi$, a designated hidden layer $L$, and a dataset $\mathcal{D} = \{(q^i, r_T^i, r_O^i)\}$ of user prompts $q^i$ and paired contrastive reference completions $r_T^i$ (target behavior) and $r_O^i$ (opposite behavior), BiPO introduces a learnable steering vector $v \in \mathbb{R}^d$.

For each triple, define

$$\Delta_T(v) = \log \frac{\pi_{L+1}(r_T \mid A_L(q) + v)}{\pi_{L+1}(r_T \mid A_L(q))}$$

$$\Delta_O(v) = \log \frac{\pi_{L+1}(r_O \mid A_L(q) + v)}{\pi_{L+1}(r_O \mid A_L(q))}$$

where $A_L(\cdot)$ denotes the activation at layer $L$ for the input tokens, and $\pi_{L+1}$ is the suffix model.

The scalar loss is

$$\mathcal{L}_{\mathrm{one}}(v) = -\mathbb{E}_{(q,r_T,r_O)\sim\mathcal{D}} \left[ \log \sigma \big( \beta\,\Delta_T(v) - \beta\,\Delta_O(v) \big) \right] + \lambda \|v\|^2$$

where $\sigma$ is the logistic function, $\beta > 0$ is a contrast-sensitivity parameter, and $\lambda$ regularizes the vector norm.

For symmetry, BiPO samples a direction $d \in \{-1, +1\}$ and alternates optimization of $v$ and $-v$ to encode the bidirectional preference signal:

$$\min_v \; \mathbb{E}_{d,\,(q, r_T, r_O)\sim\mathcal{D}} \left[ -\log \sigma \big(d\,\beta\,\Delta_T(dv) - d\,\beta\,\Delta_O(dv) \big) \right] + \lambda\|v\|^2$$
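
Expanding the two sampled directions makes the reversibility requirement explicit:

$$d=+1:\;\; -\log \sigma\big(\beta\,\Delta_T(v) - \beta\,\Delta_O(v)\big), \qquad d=-1:\;\; -\log \sigma\big(\beta\,\Delta_O(-v) - \beta\,\Delta_T(-v)\big),$$

so $+v$ must raise the relative likelihood of $r_T$, while $-v$ must raise that of $r_O$.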

Updates are performed via AdamW on the loss gradient. The procedure ensures that $+v$ steers toward the target behavior and $-v$ toward the opposite, reliably centering the behavioral modification (Cao et al., 28 May 2024).
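
A minimal PyTorch sketch of this update is given below. It assumes a Hugging Face Llama-style model whose decoder layers are exposed as model.model.layers; the helper function, variable names, and default hyperparameter values are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def response_logprob(model, tokenizer, prompt, response, v, layer, scale):
    """Summed log-prob of `response` given `prompt`, with `scale * v` added
    to the hidden states of decoder layer `layer` (scale=0 -> unsteered)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * v.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.model.layers[layer].register_forward_hook(hook)
    try:
        # NOTE: plain concatenation; a real implementation should respect
        # the model's chat template and token boundaries.
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
        logits = model(full_ids).logits[:, :-1]               # predicts tokens 1..T-1
        logps = torch.log_softmax(logits, dim=-1)
        targets = full_ids[:, 1:]
        token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        return token_logps[:, prompt_ids.shape[1] - 1:].sum()  # response tokens only
    finally:
        handle.remove()

def bipo_minibatch_loss(model, tokenizer, batch, v, layer, beta=0.1, lam=1e-3):
    """Bi-directional preference loss on a minibatch of (q, r_T, r_O) triples."""
    d = 1.0 if torch.rand(()) < 0.5 else -1.0                  # sampled direction
    losses = []
    for q, r_t, r_o in batch:
        # Log-likelihood gain from adding the signed vector d*v at layer L.
        delta_t = (response_logprob(model, tokenizer, q, r_t, v, layer, d)
                   - response_logprob(model, tokenizer, q, r_t, v, layer, 0.0))
        delta_o = (response_logprob(model, tokenizer, q, r_o, v, layer, d)
                   - response_logprob(model, tokenizer, q, r_o, v, layer, 0.0))
        # The sign d flips which response the signed vector must favor.
        losses.append(-F.logsigmoid(d * beta * delta_t - d * beta * delta_o))
    return torch.stack(losses).mean() + lam * v.pow(2).sum()
```

Only $v$ carries gradients (the model stays frozen), so an AdamW optimizer over the single vector, e.g. torch.optim.AdamW([v]), completes the training loop.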

2.2 EM-based BiPO for General Preference Learning

BiPO generalizes to reward-based or Q-value-based settings using the EM-derived Preference-based Maximum a Posteriori Optimization (PMPO) (Abdolmaleki et al., 5 Oct 2024). Here, for contexts $x$ and actions or completions $y$, binary feedback $S \in \{1,0\}$ (preferred, dispreferred) is modeled as

$$p_\theta(y, S \mid x) = \pi_\theta(y \mid x) \cdot p(S \mid y, x)$$

$$p(S=1 \mid y, x) \propto \exp\big(f(y,x)/\eta\big)$$

where $f(y,x)$ is a reward or value function and $\eta$ is a temperature parameter.

The objective maximizes the marginal log-likelihood of preferred outcomes:

$$L(\theta) = \sum_i \log \int \pi_\theta(y \mid x^{(i)})\, p(S=1 \mid y, x^{(i)})\, dy$$
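
By Jensen's inequality, this marginal log-likelihood admits a standard variational lower bound for any proposal distribution $q(y \mid x)$:

$$\log \int \pi_\theta(y \mid x)\, p(S=1 \mid y, x)\, dy \;\geq\; \mathbb{E}_{y \sim q}\big[\log p(S=1 \mid y, x)\big] - \mathrm{KL}\big(q(\cdot \mid x)\,\|\,\pi_\theta(\cdot \mid x)\big)$$

Alternately tightening this bound in $q$ and maximizing it in $\theta$ yields the two steps below, with the proposal anchored at a reference policy $\pi_{\mathrm{ref}}$.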

The EM formulation alternates:

  • E-step: $q^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\, p(S=1 \mid y, x)$
  • M-step: $\theta_{k+1} = \arg\max_\theta\, \mathbb{E}_{x \sim \mu}\, \mathbb{E}_{y \sim q^*} \big[\log \pi_\theta(y \mid x)\big]$

The method extends beyond paired feedback: with unpaired or negative-only data, negative feedback is incorporated in the M-step with a KL anchor term, crucial for stability.

Full M-step objective:

$$J_{ar}(\theta) = \alpha\, \mathbb{E}_{D_a}\big[\log \pi_\theta(y \mid x)\big] - (1-\alpha)\, \mathbb{E}_{D_r}\big[\log \pi_\theta(y \mid x)\big] - \beta\, \mathbb{E}_x \big[\mathrm{KL}\big(\pi_{\mathrm{ref}}(\cdot \mid x) \,\|\, \pi_\theta(\cdot \mid x)\big)\big]$$

Here, $D_a$ (accept/preferred) and $D_r$ (reject/dispreferred) can be decoupled, with $\alpha \in [0,1]$ trading off positive and negative emphasis (Abdolmaleki et al., 5 Oct 2024).
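
A minimal sketch of one M-step update under this objective is shown below; the function and argument names are illustrative assumptions, and the KL term is estimated with samples drawn from the reference policy rather than computed in closed form.

```python
import torch

def pmpo_m_step_loss(logp_accept, logp_reject, logp_ref, logp_theta,
                     alpha=0.7, beta=0.1):
    """Negative of the weighted M-step objective J_ar (minimize to ascend).

    logp_accept: log pi_theta(y|x) on accepted (preferred) samples.
    logp_reject: log pi_theta(y|x) on rejected (dispreferred) samples.
    logp_ref, logp_theta: log-probs of the reference and current policy on
        samples y ~ pi_ref, giving a Monte-Carlo estimate of
        KL(pi_ref || pi_theta) = E_ref[log pi_ref - log pi_theta].
    """
    j_accept = alpha * logp_accept.mean()          # pull toward preferred data
    j_reject = (1.0 - alpha) * logp_reject.mean()  # push away from rejected data
    kl_anchor = (logp_ref - logp_theta).mean()     # stay close to pi_ref
    objective = j_accept - j_reject - beta * kl_anchor
    return -objective                              # loss for gradient descent
```

Calling .backward() on the returned loss and stepping an optimizer over the policy parameters implements one M-step; with negative-only data ($\alpha = 0$), a large $\beta$ keeps the update anchored to $\pi_{\mathrm{ref}}$.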

3. Experimental Results and Empirical Properties

3.1 LLM Steering Performance

Experiments on Llama-2-7b-chat-hf and Mistral-7B-Instruct demonstrate pronounced improvements over CAA and Freeform baselines:

| Task (Llama-2-7b) | Metric | CAA Baseline | BiPO (range over α) |
| --- | --- | --- | --- |
| Persona / Power-seeking | GPT-4 score | ~1.7 | 1.2 → 2.4 (α scaled -2 → +2) |
| Truthfulness (TruthfulQA) | MC1 accuracy | <+2% | +10% (positive α), -10% (negative α) |
| Hallucination | Hallucination rate | Unreliable | 65% (max α) ↔ <5% (min α) |
| Jailbreak | Attack success rate (ASR) | 0% | 73% (α = +1); 0% (α = -1, defense) |

Scaling the applied vector by $\alpha$ flexibly tunes both the degree and the direction of steering. Effects on general utility (MMLU accuracy) remain negligible (<0.5% variation for $|\alpha| \leq 1$), indicating preservation of foundational knowledge capabilities (Cao et al., 28 May 2024).

3.2 Transferability and Synergy

BiPO steering vectors $v^*$ exhibit substantial transferability:

  • Cross-model: A vector learned on Llama-2-7b-chat transfers to Vicuna-7B, yielding similar persona steering curves.
  • Cross-LoRA: BiPO trained with Llama-2 also steers LoRA-fine-tuned derivatives (e.g., Llama-2-Chinese-7B-Chat), maintaining behavior control across languages.
  • Vector synergy: Additive composition of multiple vectors (e.g., "power" + "wealth") results in steering that expresses both behaviors in fused generations.

3.3 RL and Control Tasks

In synthetic bandit optimization, the DeepMind Control Suite, and robotics (RGB stacking), BiPO/PMPO matches or outperforms established baselines including MPO and DPO, and maintains stable improvement under both positive-only and negative-only supervision. In negative-only settings, a large KL regularization weight $\beta$ is essential to avoid policy collapse. Offline RL ablations confirm that combining accept, reject, and behavior-cloning feedback yields the best returns (e.g., ≈93 vs. 77 with reject + BC alone) (Abdolmaleki et al., 5 Oct 2024).

4. Theoretical Insights and Implementation Details

BiPO is anchored in a principled EM formalism, generalizing preference optimization to handle unpaired or one-sided feedback, a capability not shared by standard methods that rely on pairwise comparisons. The bi-directional alternation ensures that both $+v$ and $-v$ robustly encode behavior modulation, avoiding degenerate or unidirectional solutions. The update steps are designed to increase (or maintain) the log-likelihood of desired outcomes and are theoretically guaranteed to do so at each EM iteration until convergence (Abdolmaleki et al., 5 Oct 2024).

In LLM steering, the practical recipe is:

  1. Initialize $v \leftarrow 0$.
  2. At each step, sample a minibatch of triples and a direction $d \in \{-1, 1\}$.
  3. Compute the preference differentials $\Delta_T^i(dv)$ and $\Delta_O^i(dv)$.
  4. Compute the loss and update $v$ using AdamW.
  5. At inference, apply $+\alpha v^*$ for the target behavior or $-\alpha v^*$ for the opposite, tuning $\alpha \in [-2, 2]$ as needed (Cao et al., 28 May 2024).
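
At inference (step 5), applying the learned vector amounts to adding $\alpha v^*$ to the activations at layer $L$ during generation. The sketch below uses a Hugging Face forward hook; the model name, vector file, layer index, and prompt are illustrative assumptions, not the authors' exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"          # example model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16,
                                             device_map="auto")

v_star = torch.load("bipo_steering_vector.pt")        # learned vector, shape (hidden_size,)
alpha, layer = 1.0, 15                                # alpha > 0 steers toward the target

def steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * v_star.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer].register_forward_hook(steer)
ids = tok("Tell me about your long-term goals.", return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()                                       # restores the unsteered model
```

Setting alpha to a negative value steers toward the opposite behavior with the same vector.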

Layer-selection ablations indicate that steering is broadly effective between layers 10–18 (best at layer 15); vector efficacy saturates within 10–20 epochs; and $\beta$ in the range [0.1, 0.5] is robust, though overly large values can destabilize training.

5. Broader Applicability and Limitations

BiPO accommodates a wide variety of feedback regimes:

  • Positive-only ($\alpha=1$), negative-only ($\alpha=0$), or mixed ($0<\alpha<1$) feedback, each stabilized by tuning the KL anchor weight $\beta$.
  • Unpaired and arbitrary feedback distributions are also handled, broadening practical utility relative to DPO or reward-maximizing approaches.

Limitations:

  • LLM steering using BiPO is presently single-layer; multi-layer variants may offer greater expressivity.
  • Vector extraction requires quality reference responses—if training pairs are noisy or biased, the learned vector may overfit or misalign.
  • Out-of-distribution generalization can be limited; queries far from the preference-labeled dataset may be insufficiently steered without further adaptation.
  • For RL/control, reliable per-sample feedback and accurate reward/Q models are assumed; mis-specification or overfitting to reward artifacts is possible, suggesting regular independent evaluation (Cao et al., 28 May 2024, Abdolmaleki et al., 5 Oct 2024).

6. Implications for Future Research and Practical Recommendations

BiPO/PMPO introduces a general, scalable formalism for preference-guided behavioral control and alignment. Future work includes extensions to multi-layer steering vectors (stacking $v$ across several layers), automatic selection and curation of preference pairs for more resilient alignment, and comprehensive theoretical analyses of steering-vector behavior and limits. In practical terms, the use of initial reference policies, small learning rates, regularization clamping ($\beta$), and validation via held-out human or model judges is advised to guard against reward hacking and undesirable drift (Cao et al., 28 May 2024, Abdolmaleki et al., 5 Oct 2024).
