Stackelberg Learning from Human Feedback
- SLHF is a game-theoretic framework that uses a sequential leader–follower structure to refine responses based on human preferences.
- It improves over traditional RLHF and NLHF by effectively managing non-Markovian, intransitive, and stochastic feedback with enhanced convergence.
- Practical implementations use two-timescale gradient updates and transformer adaptations, yielding significant win-rate improvements in empirical tests.
Stackelberg Learning from Human Feedback (SLHF) is a sequential game-theoretic framework for preference optimization in machine learning, particularly for aligning large models with complex, often non-transitive and non-Markovian human preferences. SLHF decomposes the learning process into a leader–follower structure, where the Leader commits to an action or response and the Follower observes and conditionally refines or responds. This framework generalizes and improves upon classical Reinforcement Learning from Human Feedback (RLHF) and Nash Learning from Human Feedback (NLHF), offering significant benefits in alignment robustness, convergence properties, and inference-time adaptability (Pásztor et al., 18 Dec 2025).
1. Stackelberg Game-Theoretic Formulation
SLHF frames learning from preferences as a two-player, sequential-move (Stackelberg) game. Given a learned pairwise preference function

$$P(y \succ y' \mid x) \in [0, 1],$$

where $x$ is a context (e.g., a prompt) and $y, y'$ are candidate responses, two policies are defined:
- The Leader policy $\pi_L(\cdot \mid x)$ samples an initial response $y_L \sim \pi_L(\cdot \mid x)$.
- The Follower policy $\pi_F(\cdot \mid x, y_L)$ observes $y_L$ and samples a refined response $y_F \sim \pi_F(\cdot \mid x, y_L)$.
The objective is

$$J(\pi_L, \pi_F) = \mathbb{E}_{x,\; y_L \sim \pi_L(\cdot \mid x),\; y_F \sim \pi_F(\cdot \mid x, y_L)}\!\big[ P(y_L \succ y_F \mid x) \big] - \tau_L\, \mathrm{KL}\!\big(\pi_L \,\|\, \pi_{\mathrm{ref}}\big) + \tau_F\, \mathrm{KL}\!\big(\pi_F \,\|\, \pi_{\mathrm{ref}}\big),$$

with $\tau_L, \tau_F > 0$ KL-regularization coefficients. The Stackelberg equilibrium requires solving

$$\pi_L^{\star} \in \arg\max_{\pi_L} J\big(\pi_L, \pi_F^{\star}(\pi_L)\big), \qquad \pi_F^{\star}(\pi_L) \in \arg\min_{\pi_F} J(\pi_L, \pi_F),$$

where $\pi_F^{\star}(\pi_L)$ is the best response to the Leader, minimizing the Follower loss conditional on the Leader's move (Pásztor et al., 18 Dec 2025).
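To make the leader–follower structure concrete, the following minimal sketch (an illustration, not the paper's algorithm) computes the Stackelberg solution of a small discrete preference game with KL regularization omitted: a hypothetical matrix `P` holds pairwise win-probabilities, the Follower best-responds to each Leader commitment, and the Leader commits to the response that fares best against its anticipated refinement.

```python
import numpy as np

# Toy pairwise preference matrix: P[i, j] = P(y_i beats y_j | x).
# Rows/columns index a small discrete set of candidate responses.
P = np.array([
    [0.5, 0.7, 0.2],
    [0.3, 0.5, 0.8],
    [0.8, 0.2, 0.5],
])  # contains a cycle: y0 beats y1, y1 beats y2, y2 beats y0

def follower_best_response(leader_idx: int) -> int:
    """Follower sees the Leader's response and picks the refinement
    most preferred over it (deterministic, no KL penalty)."""
    return int(np.argmax(P[:, leader_idx]))

def stackelberg_leader() -> tuple[int, int]:
    """Leader anticipates the Follower's best response and commits to
    the response that scores best against that anticipated refinement."""
    best_idx, best_val = 0, -np.inf
    for i in range(P.shape[0]):
        j = follower_best_response(i)
        val = P[i, j]  # Leader's preference score against the refinement
        if val > best_val:
            best_idx, best_val = i, val
    return best_idx, follower_best_response(best_idx)

leader, follower = stackelberg_leader()
print(f"Leader commits to y{leader}; Follower refines to y{follower}")
```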
2. SLHF Algorithms and Implementation
SLHF equilibrium is typically approximated via two-timescale gradient descent–ascent (2×GDA):
- Sample $x \sim \rho$ from the context distribution.
- For each $x$, generate a Leader sample $y_L \sim \pi_\theta(\cdot \mid x)$ and a Follower sample $y_F \sim \pi_\phi(\cdot \mid x, y_L)$.
- Compute preference scores $P(y_L \succ y_F \mid x)$ and $P(y_F \succ y_L \mid x) = 1 - P(y_L \succ y_F \mid x)$.
- Update policy parameters with score-function gradients:

$$g_\theta = \nabla_\theta \log \pi_\theta(y_L \mid x)\,\big[P(y_L \succ y_F \mid x) - \tau_L \log r_\theta\big], \qquad g_\phi = \nabla_\phi \log \pi_\phi(y_F \mid x, y_L)\,\big[P(y_F \succ y_L \mid x) - \tau_F \log r_\phi\big],$$

where $r_\theta = \pi_\theta(y_L \mid x)/\pi_{\mathrm{ref}}(y_L \mid x)$ and $r_\phi = \pi_\phi(y_F \mid x, y_L)/\pi_{\mathrm{ref}}(y_F \mid x, y_L)$ are ratios to the reference policies.
- Parameters are stepped with learning rates $\eta_\theta$ and $\eta_\phi$ on two timescales, with the Follower (inner) update on the faster timescale; a minimal tabular sketch of this loop follows the list.
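The sketch below instantiates this loop on tabular softmax Leader and Follower policies over a small discrete response set (plain NumPy, no transformers); the preference matrix, KL coefficients, and learning rates are illustrative placeholders rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pairwise preference matrix (P[i, j] = probability that response i beats j);
# it contains a cycle: y0 beats y1, y1 beats y2, y2 beats y0.
P = np.array([[0.5, 0.7, 0.2],
              [0.3, 0.5, 0.8],
              [0.8, 0.2, 0.5]])
K = P.shape[0]

theta = np.zeros(K)          # Leader logits
phi = np.zeros((K, K))       # Follower logits, one row per observed Leader response
ref = np.full(K, 1.0 / K)    # uniform reference policy (placeholder)
tau_L, tau_F = 0.1, 0.1      # illustrative KL coefficients
eta_L, eta_F = 0.01, 0.1     # two timescales: Follower (inner problem) moves faster

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for _ in range(20_000):
    pi_L = softmax(theta)
    y_L = rng.choice(K, p=pi_L)
    pi_F = softmax(phi[y_L])
    y_F = rng.choice(K, p=pi_F)

    # Scores: the Leader is rewarded for resisting refinement, the Follower for
    # improving on the Leader's response; both carry a KL-to-reference penalty.
    s_L = P[y_L, y_F] - tau_L * np.log(pi_L[y_L] / ref[y_L])
    s_F = P[y_F, y_L] - tau_F * np.log(pi_F[y_F] / ref[y_F])

    # REINFORCE-style score-function gradients of log pi(y) w.r.t. the logits.
    grad_L = -pi_L
    grad_L[y_L] += 1.0
    grad_F = -pi_F
    grad_F[y_F] += 1.0
    theta += eta_L * s_L * grad_L
    phi[y_L] += eta_F * s_F * grad_F

print("Leader policy:", np.round(softmax(theta), 3))
```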
When implemented on transformer architectures, the Leader and Follower can be collapsed into a single network that uses separate prompt templates for the two roles. The two-timescale loss combines the respective log-probability terms for the Leader and Follower outputs (Pásztor et al., 18 Dec 2025).
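As an illustration of this single-network setup, one hypothetical arrangement (the templates, weights, and exact loss composition in the paper may differ) uses role-specific prompt templates and emulates the timescale separation by weighting the two log-probability terms differently:

```python
# Hypothetical prompt templates for a single shared model playing both roles.
LEADER_TEMPLATE = "{prompt}"
FOLLOWER_TEMPLATE = (
    "{prompt}\n\n"
    "Draft answer:\n{draft}\n\n"
    "Improve the draft answer:"
)

def leader_input(prompt: str) -> str:
    """Input when the shared model acts as the Leader."""
    return LEADER_TEMPLATE.format(prompt=prompt)

def follower_input(prompt: str, draft: str) -> str:
    """Input when the same model acts as the Follower, conditioning on the draft."""
    return FOLLOWER_TEMPLATE.format(prompt=prompt, draft=draft)

def combined_loss(logp_leader, logp_follower, score_leader, score_follower,
                  w_leader: float = 0.1, w_follower: float = 1.0):
    """Surrogate two-timescale loss: log-probabilities of the sampled Leader and
    Follower responses are weighted by their (detached) preference scores, with a
    larger weight on the Follower term to emulate its faster timescale."""
    return -(w_leader * score_leader * logp_leader
             + w_follower * score_follower * logp_follower)
```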
In the RL setting with trajectory preference oracles, SLHF may be realized as a self-play algorithm where empirical win-rates over pairs of policy samples serve as non-scalar, trajectory-dependent rewards. Policy update proceeds via off-the-shelf RL optimizers (e.g., PPO, SAC) using batchwise self-comparisons, admitting robust credit assignment even for non-Markovian, non-transitive, or stochastic preference feedback (Swamy et al., 2024).
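A minimal sketch of these batchwise self-comparisons follows, assuming only a black-box preference oracle over trajectories; the resulting empirical win-rates can then be handed to an off-the-shelf RL optimizer as per-trajectory rewards. The oracle and toy trajectories here are placeholders.

```python
from typing import Callable, List, Sequence

def self_play_rewards(
    trajectories: Sequence[object],
    prefers: Callable[[object, object], bool],
) -> List[float]:
    """Empirical win-rate of each trajectory against the rest of the batch.
    `prefers(a, b)` is a (possibly noisy) preference oracle returning True
    when trajectory `a` is preferred to trajectory `b`."""
    n = len(trajectories)
    rewards = []
    for i, traj_i in enumerate(trajectories):
        wins = sum(
            prefers(traj_i, traj_j)
            for j, traj_j in enumerate(trajectories) if j != i
        )
        rewards.append(wins / (n - 1))
    return rewards

# Example with a toy oracle that prefers shorter trajectories.
batch = ["aaaa", "aa", "aaa"]
print(self_play_rewards(batch, lambda a, b: len(a) < len(b)))
# -> [0.0, 1.0, 0.5]
```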
3. Comparison with RLHF and Nash-Equilibrium Approaches
The relationship between SLHF, RLHF, and NLHF can be summarized as follows:
| Method | Solution Concept | Preference Assumptions | Typical Equilibrium |
|---|---|---|---|
| RLHF | Reward maximization via Bradley–Terry or similar reward modeling | Requires scalarizable, transitive preferences | Deterministic if consistent, but fails for cycles/intransitivity |
| NLHF | Simultaneous-move zero-sum (Nash) game | Arbitrary pairwise preferences (concave–convex) | Mixed, high entropy; symmetric strategies |
| SLHF | Sequential-move Stackelberg game | Arbitrary, possibly intransitive/non-Markovian | Often deterministic; robust under cycles/intransitivity (Pásztor et al., 18 Dec 2025, Swamy et al., 2024) |
A key advantage of SLHF is its handling of intransitive preference structures. In Condorcet cycle scenarios, RLHF can become unstable or arbitrarily select among cycle participants, and NLHF equilibria degenerate to uniform mixtures. SLHF recovers robust (Condorcet-winner-like) policies by leveraging the Follower’s sequential conditionality and the Leader’s anticipation of best-response refinement (Pásztor et al., 18 Dec 2025).
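This behavior can be checked on a toy symmetric cycle (an illustrative construction, not data from the paper): the uniform mixture wins with probability exactly 1/2 against every response, so it is the symmetric Nash solution that NLHF targets, whereas a Stackelberg Leader anticipating a best-responding Follower loses nothing by committing to a single deterministic response.

```python
import numpy as np

# Symmetric Condorcet cycle: a beats b, b beats c, c beats a, each with prob. 0.75.
P = np.array([[0.50, 0.75, 0.25],
              [0.25, 0.50, 0.75],
              [0.75, 0.25, 0.50]])

uniform = np.full(3, 1 / 3)
# Win-probability of the uniform mixture against each pure response: all exactly 0.5,
# so the symmetric Nash (uniform mixture) is the NLHF solution of this game.
print(uniform @ P)  # -> [0.5 0.5 0.5]

# A Stackelberg Leader anticipating a best-responding Follower obtains the same value
# for every pure commitment, so a deterministic commitment is optimal here.
leader_values = [float(P[i, np.argmax(P[:, i])]) for i in range(3)]
print(leader_values)  # -> [0.25, 0.25, 0.25]
```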
Additionally, RLHF is sensitive to missing pairwise comparison data; omitted pairs can cause the learned argmax to shift unpredictably. SLHF is more robust in this respect: the Leader's optimal policy is not acutely dependent on which pairs happen to be annotated (Pásztor et al., 18 Dec 2025).
4. Robustness, Preference Structure, and Theoretical Guarantees
SLHF is specifically robust to the following phenomena:
- Non-Markovian preferences: The preference function may depend on entire trajectories or sequences, not just single-step states or actions. SLHF operates without fitting Markovian reward models, so path-dependent or history-sensitive feedback does not degrade policy learning (Swamy et al., 2024).
- Intransitive preferences or cycles: MW (minimax-winner) aggregation and the Stackelberg game-theoretic structure absorb cycles (e.g., $a \succ b \succ c \succ a$) via mixed or iterative policies (Pásztor et al., 18 Dec 2025).
- Stochastic/Noisy feedback: Because rewards are empirical win-rates derived from preference queries among batches, stochastic fluctuations are averaged out, and no adversarial reward fitting can amplify label noise (Swamy et al., 2024).
Theoretical guarantees follow from online learning and no-regret analysis. In single-agent self-play, the average policy $\bar{\pi}_T$ after $T$ iterations with regret $\mathrm{Reg}(T)$ satisfies the minimax-winner property up to an $O(\mathrm{Reg}(T)/T)$ approximation. For no-regret oracles with $\mathrm{Reg}(T) = O(\sqrt{T})$, this yields convergence at rate $O(1/\sqrt{T})$; stronger guarantees (faster rates) are available under uniqueness-gap assumptions on the minimax winner (Swamy et al., 2024).
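The rate above follows from the standard reduction from no-regret learning to approximate equilibria in the symmetric zero-sum preference game; schematically (with constants and exact conditions as given by Swamy et al., 2024, rather than this paraphrase),

$$\max_{\pi}\; \mathbb{E}\big[P(\pi \succ \bar{\pi}_T)\big] \;\le\; \tfrac{1}{2} + O\!\left(\frac{\mathrm{Reg}(T)}{T}\right),$$

so driving the time-averaged regret to zero drives the exploitability of the averaged policy $\bar{\pi}_T$ to zero.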
In Markov games with quantal-response models, regret bounds inherit a dependence on the model-class dimension $d$ and the interaction horizon $H$. Linear or myopic approximations enable sublinear regret with computationally efficient updates (Chen et al., 2023).
5. Inference-Time Refinement and Deployment Advantages
SLHF uniquely enables inference-time refinement via iterative application of the Follower policy:
- Sample an initial response $y_0 \sim \pi_L(\cdot \mid x)$.
- For $k = 1, \dots, K$, iteratively sample $y_k \sim \pi_F(\cdot \mid x, y_{k-1})$. This produces a chain of refined responses that efficiently traverses high-preference regions of the response space (a minimal sketch follows this list).
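A minimal sketch of this refinement chain, with stub sampler callables standing in for calls to the trained Leader and Follower policies (e.g., wrapping LLM generation):

```python
from typing import Callable, List

def refine_chain(
    prompt: str,
    sample_leader: Callable[[str], str],
    sample_follower: Callable[[str, str], str],
    num_steps: int = 4,
) -> List[str]:
    """Inference-time refinement: draw an initial Leader response, then repeatedly
    apply the Follower policy, each time conditioning on the previous response."""
    responses = [sample_leader(prompt)]
    for _ in range(num_steps):
        responses.append(sample_follower(prompt, responses[-1]))
    return responses

# Toy usage with stub samplers in place of real policy calls.
chain = refine_chain(
    "Explain KL regularization.",
    sample_leader=lambda prompt: "draft v0",
    sample_follower=lambda prompt, prev: prev + "+",
)
print(chain)  # -> ['draft v0', 'draft v0+', 'draft v0++', 'draft v0+++', 'draft v0++++']
```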
In environments with cycles or substantial intransitivities, this method guarantees coverage—e.g., for a Condorcet cycle over options $\{a, b, c\}$, a sufficiently long chain will expose all underlying user preferences within a bounded number of steps. Unlike RLHF or NLHF, which rely on i.i.d. sampling, SLHF's sequential sampling structure ensures both convergence and adaptability to individual user feedback (Pásztor et al., 18 Dec 2025).
Moreover, the Follower policy is transferable: it can refine outputs of models other than those seen during its own training, without fine-tuning. This property has been empirically validated on diverse generative models including Qwen2.5 and Llama-3.1 (Pásztor et al., 18 Dec 2025).
6. Empirical Results and Performance Scaling
Empirical studies across models from 0.5B to 8B parameters and datasets like HelpSteer2 (11.8K prompts, five annotated attributes) establish that SLHF consistently outperforms both RLHF variants (RLOO) and Nash-equilibrium NLHF methods (Nash-MD):
- SLHF Leader: Achieves ≈73% win-rate vs. baseline; Follower refinements further boost performance to ≈80% (Pásztor et al., 18 Dec 2025).
- SLHF Follower used on other models: Provides >60% improvements in win-rate relative to those models’ baselines.
- Scaling: SLHF retains improvements (~80% win vs. baseline) at 1.5B and 3B model sizes; Leader performance lags at the largest model sizes but improves with longer training.
- Robustness: In HelpSteer2, ~57% of prompts induce intransitive cycles among outputs; SLHF handles these naturally, while RLHF and NLHF show degraded or unstable performance.
- Inference-time N-best scaling: SLHF's preference scores increase with $N$, in contrast to the collapse or early saturation observed for RLHF/NLHF.
On open-domain chat (Llama-3.1-8B), SLHF yields AlpacaEval 2.0 win-rates of ≈48% (vs. 44% for DPO and 35% for RLOO) and IFEval win-rates of ≈71% (vs. 62% for DPO and 72% for RLOO), indicating a mild tradeoff between preference win-rate and instruction-following accuracy (Pásztor et al., 18 Dec 2025).
7. Extensions, Model Structure, and Future Directions
SLHF generalizes to Markov games and settings involving strategic human feedback via quantal Stackelberg equilibria. The human is modeled as a quantally rational follower, with a temperature parameter estimated via maximum likelihood from observed actions (Chen et al., 2023). Model uncertainty can be managed through confidence sets and pessimistic/optimistic planning (offline/online), with regret bounds scaling in dimension and data as dictated by the underlying Bellman and quantal-response error classes.
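As an illustration of the temperature-estimation step, the following minimal sketch (synthetic data and a plain softmax quantal-response likelihood, not the exact estimator of Chen et al., 2023) fits the follower's temperature by maximum likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(temp: float, utilities: np.ndarray, actions: np.ndarray) -> float:
    """Negative log-likelihood of observed actions under a quantal (softmax)
    response model with temperature `temp`.
    utilities: (n_rounds, n_actions) follower utilities per round.
    actions:   (n_rounds,) indices of the actions actually taken."""
    logits = utilities / temp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(actions)), actions].sum()

# Synthetic data: 200 rounds, 4 actions, true temperature 0.5.
rng = np.random.default_rng(0)
true_temp = 0.5
U = rng.normal(size=(200, 4))
probs = np.exp(U / true_temp)
probs /= probs.sum(axis=1, keepdims=True)
acts = np.array([rng.choice(4, p=p) for p in probs])

fit = minimize_scalar(neg_log_likelihood, bounds=(1e-2, 10.0),
                      args=(U, acts), method="bounded")
print("estimated temperature:", round(fit.x, 3))
```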
For richer feedback—e.g., multi-step value disclosures or pairwise comparison of complete trajectories—the core MLE fitting algorithms extend straightforwardly by adding likelihood terms.
A plausible implication is that SLHF's sequential, anticipative structure provides a general foundation for constructing robust, user-adaptive, high-capacity models in domains where preferences are complex, intransitive, or contextually entangled, and where inference-time refinements are essential for personalized alignment.
Principal References: (Pásztor et al., 18 Dec 2025, Swamy et al., 2024, Chen et al., 2023)