
Double-layer Mixture-of-Experts Reward Models

Updated 21 December 2025
  • The paper introduces a hierarchical two-layer MoE that uses a sparse task router and dense capability experts to enhance reward modeling.
  • It employs a three-phase training strategy—full fine-tuning, LoRA injection, and MLP aggregation—to optimize human preference predictions.
  • Empirical results demonstrate improved preference consistency and stability over traditional models, confirming enhanced interpretability and efficiency.

Double-layer Mixture-of-Experts Reward Models (DMoERM) are hierarchical architectures designed to improve reward modeling in LLM alignment by explicitly decomposing the input space across both task and finer-grained capability dimensions. DMoERM aims to mitigate two central challenges in reward modeling: multi-task disturbance (where a monolithic reward model trained on heterogeneous data degrades in generalization) and annotation noise, a consequence of low inter-annotator consistency in human preference datasets (Quan, 2024). The DMoERM architecture leverages a two-layer mixture-of-experts (MoE) formulation, in which an outer sparse gating layer routes each example to a task-specific expert and an inner dense layer aggregates the outputs of capability-specific LoRA experts into a scalar reward via an MLP. Recent extensions further motivate DMoERM through probabilistic, variational, and interpretable frameworks (Bohne et al., 9 Oct 2025, Wang et al., 2024).

1. Architecture and Mathematical Specification

DMoERM implements a strict two-level MoE topology:

Outer Layer: Sparse Task Router

  • The input x is assigned by a small router network h(x) \in \mathbb{R}^T to one of T task experts (e.g., text creation, roleplay, objective-QA, subjective-QA, chitchat).
  • The router output is normalized with softmax:

p_t(x) = \frac{\exp(h_t(x))}{\sum_{t'} \exp(h_{t'}(x))}

  • Top-1 routing selects task t^* = \arg\max_t p_t(x) and activates only the associated inner expert RM_{t^*}.
  • Only the routed subnetwork receives gradient updates, ensuring clear separation between tasks (Quan, 2024).

Inner Layer: Dense Capability Experts

  • Each RM_t contains fine-grained LoRA modules, each targeting a capability dimension (e.g., for roleplay: personality, empathy).
  • For k capabilities, the task-specific base weights W^{\mathrm{base}} are adapted per capability by a LoRA update A^{(i)}:

W^{\mathrm{base}} + A^{(i)}

  • Each expert produces an embedding Z_i \in \mathbb{R}^d, which is mapped to a scalar r_i \in [0,1] via a sigmoid layer.
  • All embeddings are concatenated and aggregated by a 2-layer MLP, \mathrm{MLP}(Z_0 \oplus \ldots \oplus Z_{k-1}), with non-linear activation, yielding the final scalar reward (see the code sketch at the end of this section).

This approach sharply contrasts with monolithic and previous ensemble reward models by enabling dynamic specialization across both discrete tasks and multidimensional capabilities (Quan, 2024).
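The following PyTorch sketch illustrates this two-layer topology under simplifying assumptions: pooled feature vectors stand in for full LLM backbones, a single linear layer stands in for each LoRA-adapted expert, and all module names and dimensions are illustrative rather than taken from the reference implementation.

```python
import torch
import torch.nn as nn

class CapabilityExpert(nn.Module):
    """Inner dense expert: stand-in for a LoRA-adapted backbone (W_base + A^(i))
    that produces a capability embedding Z_i and a score r_i in [0, 1]."""
    def __init__(self, d_model: int):
        super().__init__()
        self.adapter = nn.Linear(d_model, d_model)   # placeholder for W_base + A^(i)
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, h: torch.Tensor):
        z = torch.tanh(self.adapter(h))              # capability embedding Z_i
        r = torch.sigmoid(self.score_head(z))        # scalar r_i in [0, 1]
        return z, r

class InnerRewardModel(nn.Module):
    """Task-specific reward model RM_t: k capability experts + MLP aggregator."""
    def __init__(self, d_model: int, k_capabilities: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [CapabilityExpert(d_model) for _ in range(k_capabilities)])
        # 2-layer MLP over the concatenation Z_0 ⊕ ... ⊕ Z_{k-1}
        self.aggregator = nn.Sequential(
            nn.Linear(d_model * k_capabilities, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        zs = [expert(h)[0] for expert in self.experts]
        return self.aggregator(torch.cat(zs, dim=-1)).squeeze(-1)

class DMoERM(nn.Module):
    """Outer sparse router over T task experts with top-1 routing."""
    def __init__(self, d_model: int, n_tasks: int, k_capabilities: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_tasks)    # h(x) in R^T
        self.task_experts = nn.ModuleList(
            [InnerRewardModel(d_model, k_capabilities) for _ in range(n_tasks)])

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        p = torch.softmax(self.router(h), dim=-1)    # p_t(x)
        rewards = []
        for i in range(h.size(0)):                   # top-1 routing per example
            t_star = int(torch.argmax(p[i]))
            rewards.append(self.task_experts[t_star](h[i:i + 1]))
        return torch.cat(rewards, dim=0)
```

For example, `DMoERM(d_model=768, n_tasks=5, k_capabilities=4)(torch.randn(2, 768))` returns one scalar reward per input; because only the selected task expert participates in the forward pass, only its parameters receive gradients, mirroring the sparse-routing property described above.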

2. Training Methodology

DMoERM employs a three-phase, curriculum-style training procedure:

  1. Phase 1—Task-specific Full Fine-tune:
    • Each task expert is initialized by full fine-tuning on a relevant task-partitioned subset of the data.
  2. Phase 2—Capability LoRA Injection:
    • For every capability, a LoRA adapter is trained to specialize the base to a specific quality dimension, using capability-aligned preference data.
  3. Phase 3—MLP Aggregation Head:
    • The outputs of the per-capability experts are concatenated and mapped through an MLP trained to predict aggregate human preferences.
    • Only the MLP is updated in this phase; base and LoRA weights are frozen (Quan, 2024).

The primary loss for the aggregation phase is a pairwise log-sigmoid ranking loss:

L = -\mathbb{E}_{(x, y^+, y^-)} \bigl[\log \sigma\bigl(r(x, y^+) - r(x, y^-)\bigr)\bigr]

where \sigma(u) = (1+e^{-u})^{-1}. Because only the routed subnetwork receives gradient updates, task routing addresses multi-task disturbance directly at the gradient level.
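A minimal sketch of this loss and of a phase-3 update step is given below, reusing the DMoERM sketch from Section 1; the helper names, the pooled feature-vector inputs, and the freezing logic are illustrative assumptions rather than the paper's training code.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(r_pos: torch.Tensor, r_neg: torch.Tensor) -> torch.Tensor:
    # L = -E[ log sigma( r(x, y+) - r(x, y-) ) ], in numerically stable logsigmoid form.
    return -F.logsigmoid(r_pos - r_neg).mean()

def freeze_all_but_aggregators(model):
    # Phase 3: base and LoRA weights stay frozen; only the aggregation MLPs learn.
    for p in model.parameters():
        p.requires_grad_(False)
    for expert in model.task_experts:
        for p in expert.aggregator.parameters():
            p.requires_grad_(True)

def phase3_step(model, optimizer, h_pos, h_neg):
    # h_pos / h_neg: pooled features of (prompt, chosen) and (prompt, rejected) pairs.
    loss = pairwise_ranking_loss(model(h_pos), model(h_neg))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the optimizer would be built after freezing, e.g. `torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)`, so that only the aggregation heads are updated.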

3. Annotation Strategy and Noise Mitigation

To address label noise due to low annotation consistency (typically 60–75% in standard RM datasets), DMoERM introduces:

  • Capability-level Annotations:

Each preference label is decomposed along multiple interpretable axes, greatly improving per-dimension consistency (empirically >80%) and enabling robust aggregation (Quan, 2024).

  • LLM-Assisted Label Generation:

Cost is minimized and consistency increased by generating capability labels via public LLM APIs. Labels are collected for all capability pairs, assessed in both orders of presentation, and retained only if consistent, which further filters noise (a filtering sketch is given at the end of this section).

  • Empirical Validation:

API-derived, multi-point labels achieve >80% consistency. Aggregated preference accuracy (final model) attains ~68% on held-out human judgements, surpassing single and ensemble baselines at ~62%.
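A minimal sketch of the order-consistency filter follows; the `judge` callable standing in for a public LLM API, its "A"/"B" return convention, and the pairwise iteration are illustrative assumptions rather than the paper's pipeline.

```python
from itertools import combinations

def consistent_capability_labels(responses, judge, capability):
    """Query a preference judge on every response pair in both presentation
    orders and keep only order-consistent comparisons.

    `judge(capability, a, b)` is a hypothetical wrapper around a public LLM API
    that returns "A" or "B" for the preferred response along `capability`.
    """
    kept = []
    for i, j in combinations(range(len(responses)), 2):
        first = judge(capability, responses[i], responses[j])
        second = judge(capability, responses[j], responses[i])
        # Consistent iff the same underlying response wins in both orders.
        if first == "A" and second == "B":
            kept.append((i, j))           # responses[i] preferred
        elif first == "B" and second == "A":
            kept.append((j, i))           # responses[j] preferred
        # Otherwise the pair is discarded as positionally biased / noisy.
    return kept
```

Only pairs that survive this filter contribute capability-level preference labels to training, which is how the >80% per-dimension consistency figure is reached.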

4. Empirical Results and Performance Analysis

Key benchmarks confirm the effectiveness of DMoERM on reward modeling:

  • Preference Consistency:
    • Single RM: 58.2%
    • Ensembling baselines (mean, worst-case, uncertainty-weighted): ~62.4–62.6%
    • GPT-4 zero/one-shot: 59.5/62.3%
    • Removing the outer task layer reduces accuracy to 67.0%, versus roughly 68% for the full model (Quan, 2024).
  • BoN Sampling and PPO Stability:

DMoERM demonstrates resistance to overoptimization (“reward hacking”) in both best-of-n (BoN) sampling, where output quality does not degrade for large n, and PPO reinforcement learning, where reward and training remain stable even as the policy diverges from its initialization (KL ≈ 8 nats). A minimal BoN sketch is given at the end of this section.

  • Ablations:

Capability LoRA experts yield 80–86% accuracy on held-out single-point labels; the full aggregation phase achieves the highest final accuracy.

  • Resource Efficiency:

Training leverages LoRA for parameter efficiency; typical training per inner MoE instance is ~80 GPU-hours on 8×A100 80GB (Quan, 2024).
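The BoN evaluation protocol referenced above can be sketched as follows; `sample_fn` and `reward_fn` are hypothetical stand-ins for the policy's sampler and the DMoERM scorer, not functions from the paper's codebase.

```python
def best_of_n(prompt, sample_fn, reward_fn, n=16):
    # sample_fn(prompt) -> one candidate response string
    # reward_fn(prompt, response) -> scalar reward from the reward model
    candidates = [sample_fn(prompt) for _ in range(n)]
    scores = [reward_fn(prompt, c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```

Overoptimization is probed by sweeping n and checking whether the independently judged quality of the selected responses keeps improving or instead turns downward as n grows.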

5. Probabilistic and Variational Extensions

The probabilistic interpretation in (Bohne et al., 9 Oct 2025) formalizes DMoERM as a hierarchical latent-variable model:

  • Latent Variables:

Two discrete latent variables z_1 (outer/task) and z_2 (inner/capability) govern hierarchical gating.

  • Gating Networks:

Hierarchical prior distributions \pi_1(z_1|x) and \pi_2(z_2|x, z_1) capture task and subtask assignment.

Variational inference maximizes:

\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi}[\log p_\theta(r|x, z)] - D_{\mathrm{KL}}\bigl(q_\phi(z|x, r) \,\|\, p_\theta(z|x)\bigr)

with factorized or joint posteriors for layered expert assignments (a sketch of this objective is given at the end of this section).

  • Architecture Flexibility:

Supports both shared-encoder + expert-specific heads and fully independent reward models, with soft contextual gating.

This view enables the direct integration of DMoERM into direct preference optimization (Mix- and MoE-DPO), facilitating stable and specialized learning in multi-task and multi-preference environments (Bohne et al., 9 Oct 2025).
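For a two-level discrete gate, the ELBO above can be evaluated in closed form by summing over both latents. The sketch below does this for categorical posteriors and priors; the function name and tensor-shape conventions are illustrative assumptions, not the authors' code.

```python
import torch

def hierarchical_elbo(log_lik, log_prior1, log_prior2, log_q1, log_q2):
    """Closed-form ELBO for a two-level discrete latent gate.

    Assumed shapes (batch B, T outer experts, K inner experts):
      log_lik    : (B, T, K)  log p_theta(r | x, z1=t, z2=k)
      log_prior1 : (B, T)     log pi_1(z1 | x)
      log_prior2 : (B, T, K)  log pi_2(z2 | x, z1)
      log_q1     : (B, T)     log q_phi(z1 | x, r)
      log_q2     : (B, T, K)  log q_phi(z2 | x, z1, r)
    """
    q_joint = log_q1.exp().unsqueeze(-1) * log_q2.exp()      # q(z1, z2 | x, r)

    # Expected log-likelihood under the joint posterior.
    expected_ll = (q_joint * log_lik).sum(dim=(1, 2))

    # KL( q(z1, z2 | x, r) || p(z1, z2 | x) ) for the hierarchical prior.
    log_p_joint = log_prior1.unsqueeze(-1) + log_prior2
    log_q_joint = log_q1.unsqueeze(-1) + log_q2
    kl = (q_joint * (log_q_joint - log_p_joint)).sum(dim=(1, 2))

    return (expected_ll - kl).mean()                          # maximize over the batch
```

Maximizing this quantity with respect to both the expert parameters (through `log_lik`) and the gating parameters (through the priors and posteriors) recovers the variational objective stated above.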

6. Interpretability and Steerability Properties

Recent advances demonstrate that multi-objective DMoERM variants provide enhanced interpretability:

  • Layer 1: Multi-Objective Absolute-Rating Head (ArmoRM):

Produces a vector of human-interpretable scores (e.g., honesty, helpfulness, safety), trained via MSE regression against absolute human ratings (Wang et al., 2024).

  • Layer 2: Contextual Gating for Scalarization:

A gating MLP maps input features to a softmax distribution over objectives, yielding a scalar as a convex combination of individual objectives.

R(x, y) = \sum_{k=1}^K g_k \, r'_k(x, y)

with verbosity correction applied per dimension.

  • Steerability:

Users can manually adjust or disable objectives by altering the gating vector at inference, enabling runtime control over safety or other critical factors (see the sketch at the end of this section).

This directly addresses the black-box nature of previous reward models and offers strong alignment with human values and preferences (Wang et al., 2024).
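A minimal sketch of this gated scalarization, including inference-time steering, is given below; the module name, hidden size, and masking mechanism are illustrative assumptions rather than the released ArmoRM implementation.

```python
import torch
import torch.nn as nn

class GatedScalarizer(nn.Module):
    """Layer-2 contextual gating: a softmax gate over K interpretable objectives
    yields R(x, y) = sum_k g_k * r'_k(x, y) as a convex combination."""
    def __init__(self, d_model: int, n_objectives: int, hidden: int = 256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_objectives))

    def forward(self, prompt_feats, objective_scores, steer_mask=None):
        # objective_scores: (B, K) verbosity-corrected per-objective rewards r'_k
        logits = self.gate(prompt_feats)
        if steer_mask is not None:
            # Steerability: disable objectives at inference by masking their
            # gate logits; the remaining weights renormalize under the softmax.
            logits = logits.masked_fill(~steer_mask, float("-inf"))
        g = torch.softmax(logits, dim=-1)            # gating weights g_k
        return (g * objective_scores).sum(dim=-1)    # scalar reward R(x, y)
```

Passing, for instance, `steer_mask=torch.tensor([True, True, False])` zeroes the gate weight of the third objective while the remaining objectives share the probability mass, which is one way to realize the runtime control described above.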

7. Practical Considerations and Limitations

  • Expert Count and Gating Capacity:

Empirical results recommend roughly K \approx 4–16 experts. The gating network hidden size typically matches the embedding dimensionality to balance specialization and overfitting (Bohne et al., 9 Oct 2025).

  • Training Efficiency:

LoRA-based inner experts and two-phase (regression + pairwise) training optimize for cost and label utilization (Quan, 2024, Wang et al., 2024).

  • Annotation Bottlenecks:

DMoERM requires capability-specific absolute or pairwise ratings; obtaining sufficient, high-quality annotator coverage remains a challenge for full scalability.

  • Future Directions:

Joint end-to-end optimization of all gating and reward parameters (instead of staged training), automated penalty learning (e.g., for verbosity corrections), and richer context-dependent gating (e.g., user profile, session context) are identified as promising directions (Wang et al., 2024).


Key References:

  • "DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling" (Quan, 2024)
  • "Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization" (Bohne et al., 9 Oct 2025)
  • "Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts" (Wang et al., 2024)
