DMoERM: Mixture-of-Experts RL Teacher

Updated 3 January 2026
  • The paper introduces DMoERM, a dual-layer MoE RL teacher that uses two-stage expert routing to mitigate multi-task interference and noisy annotations.
  • DMoERM employs an outer sparse task router and an inner dense LoRA-based MoE structure to decompose tasks into fine-grained capability experts, stabilizing reward signals.
  • Empirical results show DMoERM improves agreement with human rankings and policy optimization stability, outperforming traditional reward models in RLHF.

A Mixture-of-Experts (MoE) RL Teacher is a reward modeling framework with a hierarchical MoE architecture designed to address fundamental challenges in preference-based reinforcement learning from human feedback (RLHF) for LLMs. The distinctive contribution of the DMoERM ("Double-Layer MoE Reward Model") is its two-stage expert routing; an outer sparse router partitions input by task (e.g., text creation, roleplay), while an inner dense MoE structure decomposes each task into capability sub-dimensions (e.g., intent conformity, expressiveness), each handled by a fine-tuned LoRA expert. This approach targets two pervasive issues in reward model (RM) training for LLM alignment fine-tuning: multi-task disturbance from heterogeneous data, and low inter-annotator agreement introducing label noise. By isolating task and capability contributions, DMoERM enhances reward signal fidelity, stabilizes policy optimization, and achieves superior alignment with human preferences (Quan, 2024).

1. Challenges in Reward Model Training for RLHF

Reward models are central to the alignment of LLMs via RLHF, guiding policy updates based on predicted human preference. Empirically, two obstacles degrade RM effectiveness:

  • Multi-task Interference: Aggregating data from disparate dialogue domains and preference axes in a single RM induces negative transfer. The model's generalization performance declines when simultaneously exposed to tasks with orthogonal objectives (e.g., roleplay vs. objective QA), as the shared representation is insufficiently specialized.
  • Noisy Preference Supervision: Human annotators exhibit only 60% to 75% pairwise agreement on preference data, so the resulting reward signal is substantially noisy. This impairs the learnability and predictive validity of RMs as alignment teachers.

The DMoERM architecture is constructed to directly address both issues by structurally partitioning tasks and capability factors.

2. Double-Layer Mixture-of-Experts Architecture

The DMoERM model deploys a two-level MoE hierarchy:

Outer Sparse MoE (Task Router)

  • For $T$ distinct tasks (e.g., text creation, roleplay, chitchat), input $x$ is processed by a frozen router (a small transformer or MLP) generating task logits $u(x) = (u_0(x), \ldots, u_{T-1}(x))$.
  • The gating network applies a softmax: $g_0(x)_t = \frac{\exp(u_t(x))}{\sum_{i=0}^{T-1}\exp(u_i(x))}$.
  • The top-scoring task index $t^* = \arg\max_t g_0(x)_t$ selects a single task-specific RM, avoiding multi-task disturbance while keeping inference cost fixed (see the sketch below).
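
A minimal PyTorch sketch of this outer routing step, assuming a pretrained encoder that maps text to a pooled hidden state; `TaskRouter`, `route_reward`, and `task_rms` are illustrative names, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class TaskRouter(torch.nn.Module):
    """Outer sparse gate: frozen encoder -> task logits u(x) -> softmax g_0(x) -> top-1 task."""

    def __init__(self, encoder: torch.nn.Module, hidden_dim: int, num_tasks: int):
        super().__init__()
        self.encoder = encoder                       # small transformer/MLP, assumed pretrained
        self.head = torch.nn.Linear(hidden_dim, num_tasks)
        for p in self.parameters():                  # the router stays frozen during RM training
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.head(self.encoder(x))               # task logits u(x), shape (batch, T)
        g0 = F.softmax(u, dim=-1)                    # gating probabilities g_0(x)
        return g0.argmax(dim=-1)                     # hard top-1 task index t*

def route_reward(x, router: TaskRouter, task_rms):
    """Dispatch x to the single task-specific RM selected by the router."""
    t_star = router(x)                               # (batch,) task indices
    # For brevity, assume the whole batch routes to one task; real code would group by t*.
    return task_rms[int(t_star[0])](x)
```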

Inner Dense MoE (Capability Experts via LoRA)

  • Each task $t$ is decomposed into $S$ capability dimensions (e.g., "intent conformity," "expressiveness").
  • The base RM for task $t$, $RM^{base}_t$ with parameters $W^{(base)}$, is extended by $S$ LoRA adapters $A^{(s)}$, giving $W^{(s)} = W^{(base)} + A^{(s)}$, with each expert handling a distinct capability.
  • Processing $x$ with the $s$-th expert yields an embedding $z_{t,s}(x) \in \mathbb{R}^d$ and a scalar capability score $r_{t,s}(x) = \sigma(w_{t,s}^T z_{t,s}(x) + b_{t,s})$ (see the sketch below).
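
The inner experts can be pictured with a minimal sketch like the following, where a single frozen linear weight stands in for the full LoRA-adapted base RM; the class name, shapes, and rank are assumptions, not the paper's code.

```python
import torch

class LoRACapabilityExpert(torch.nn.Module):
    """One capability expert: frozen base weight W^(base) plus a low-rank adapter
    A^(s) = B @ A, followed by a scalar scoring head (w_{t,s}, b_{t,s})."""

    def __init__(self, w_base: torch.Tensor, rank: int = 8):
        super().__init__()
        d_out, d_in = w_base.shape
        self.w_base = torch.nn.Parameter(w_base, requires_grad=False)      # frozen W^(base)
        self.lora_a = torch.nn.Parameter(0.01 * torch.randn(rank, d_in))
        self.lora_b = torch.nn.Parameter(torch.zeros(d_out, rank))         # zero-init: start at W^(base)
        self.score_head = torch.nn.Linear(d_out, 1)

    def forward(self, h: torch.Tensor):
        w_s = self.w_base + self.lora_b @ self.lora_a       # W^(s) = W^(base) + A^(s)
        z = h @ w_s.T                                       # capability embedding z_{t,s}(x)
        r = torch.sigmoid(self.score_head(z)).squeeze(-1)   # capability score r_{t,s}(x)
        return z, r
```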

MLP Aggregator

  • The $S$ expert embeddings $z_{t,0}(x), \ldots, z_{t,S-1}(x)$ are concatenated and fed to a two-layer MLP with PReLU activation:

$$R(x) = \sigma\left( W_1 \cdot \mathrm{PReLU}(W_0 \cdot z(x) + b_0) + b_1 \right)$$

  • The MLP models non-linear interactions among capability dimensions, outputting a holistic scalar reward (see the sketch below).
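
A sketch of the aggregator head under the same assumptions; the hidden width and class name are illustrative.

```python
import torch

class CapabilityAggregator(torch.nn.Module):
    """Two-layer MLP with PReLU over the concatenated expert embeddings,
    producing a single scalar reward R(x) in (0, 1)."""

    def __init__(self, num_experts: int, d: int, hidden: int = 256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(num_experts * d, hidden),    # W_0, b_0
            torch.nn.PReLU(),
            torch.nn.Linear(hidden, 1),                  # W_1, b_1
            torch.nn.Sigmoid(),
        )

    def forward(self, expert_embeddings):
        # expert_embeddings: list of S tensors z_{t,s}(x), each of shape (batch, d)
        z = torch.cat(expert_embeddings, dim=-1)         # concatenated z(x)
        return self.net(z).squeeze(-1)                   # holistic reward R(x)
```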

3. Training Paradigm

DMoERM training proceeds in three sequential stages per task tt:

  1. Task-Specific Base Model Fine-Tuning: 60% of preference pairs for task $t$ are used to full-parameter fine-tune $RM^{base}_t$, yielding a monolithic task-specific RM.
  2. Capability Expert LoRA Fine-Tuning:
    • Pairs are annotated with single-capability preferences by querying a public LLM API (Baidu ERNIE Bot).
    • To lessen annotation bias, each pair is scored in both swap orders, retaining only consistent pairs (see the sketch after this list).
    • SS LoRA adapters are trained on these cleaned, capability-labeled examples, one per capability.
  3. Aggregator Head Training:
    • Freeze the base model and all LoRA adapters.
    • Use the remaining 40% of task-specific data to train only the aggregator MLP.
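
The swap-order filter in stage 2 might look like the sketch below; `llm_prefers` is a hypothetical stand-in for the actual judging call to the LLM API, and the paper's prompting details are not reproduced here.

```python
def llm_prefers(prompt: str, resp_a: str, resp_b: str, capability: str) -> str:
    """Hypothetical stand-in for the LLM-judge API call; returns "A" or "B"."""
    raise NotImplementedError("replace with a real API call and capability-specific prompt")

def swap_filtered_pairs(pairs, capability):
    """Keep only pairs whose capability preference is consistent under both
    presentation orders; return them as (prompt, preferred, rejected)."""
    kept = []
    for prompt, resp_1, resp_2 in pairs:
        first = llm_prefers(prompt, resp_1, resp_2, capability)   # resp_1 shown in position A
        second = llm_prefers(prompt, resp_2, resp_1, capability)  # swapped order
        if first == "A" and second == "B":
            kept.append((prompt, resp_1, resp_2))   # resp_1 wins in both orders
        elif first == "B" and second == "A":
            kept.append((prompt, resp_2, resp_1))   # resp_2 wins in both orders
        # otherwise the verdict flipped with position: drop the pair as noisy
    return kept
```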

Across all stages, the pairwise reward difference is trained via the logistic loss:

$$L = -\mathbb{E}_{(x, y^+, y^-) \sim D} \left[ \log \sigma\big( R(x; y^+) - R(x; y^-) \big) \right]$$

No explicit sparsity penalty is imposed; the router remains pretrained and frozen throughout.
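
A minimal PyTorch rendering of this pairwise loss, assuming the reward scores for the preferred and rejected responses have already been computed; the example values are illustrative.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(r_pos: torch.Tensor, r_neg: torch.Tensor) -> torch.Tensor:
    """L = -E[ log sigmoid( R(x; y+) - R(x; y-) ) ] over a batch of preference pairs."""
    return -F.logsigmoid(r_pos - r_neg).mean()

# Illustrative scores for four preference pairs (preferred vs. rejected responses).
r_pos = torch.tensor([0.82, 0.61, 0.75, 0.90])
r_neg = torch.tensor([0.40, 0.55, 0.70, 0.30])
loss = pairwise_preference_loss(r_pos, r_neg)   # backpropagated only through the trainable parts
print(float(loss))
```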

4. Handling Label Noise and Multi-Task Interference

Human annotation of overall dialog quality yields only 60–75% consistency, but decomposing responses into fine-grained capability scores (five for text creation: "intent conformity," "expressiveness," "readability," "content richness," "logic") raised consistency to 80–90% on each capability and 75% for the aggregated judgment. This suggests that factorized evaluation surfaces more robustly learnable signals and enables more consistent reward modeling.

By freezing the outer router, each per-task group is isolated from off-task samples, circumventing the negative transfer observed in ablation, where multi-task mixtures degraded accuracy from 56.7% (single-task) to 52% (multi-task).

Automated capability labeling combined with positional-bias swap-filtering imparts data efficiency while cleansing noisy judgments, a fundamental advance over manual-only pipelines.

5. Empirical Results

Experimental evaluation of DMoERM demonstrates:

  • Preference Consistency: On manually labeled sets, DMoERM achieved 70.7% agreement with human rankings, outperforming single reward models (58.2%), mean ensembles (62.4%), and advanced ensemble methods like UWO (62.6%). It surpassed zero-shot GPT-4 (59.5%) and one-shot GPT-4 (62.3%). The inner MoE alone (outer router ablated) retained 67.0% consistency.
  • Best-of-n (BoN) Sampling: DMoERM-optimized policies yielded higher gold RM scores up to KL ≈ 10 nats; baselines over-optimized and failed by KL ≈ 4 nats, whereas DMoERM remained stable beyond KL ≈ 8 nats (the BoN protocol is sketched after the table below).
  • PPO Fine-Tuning: During PPO with a KL penalty (3,000 steps), DMoERM-tuned policies outperformed all ensemble baselines on average gold RM scores and improved out-of-distribution generalization on AlignBench prompts.
  • Human Evaluation: Human judges preferred DMoERM outputs at rates up to 87% in BoN and 68% in PPO at select checkpoints.

| Model | Human Agreement (%) |
|---|---|
| Single RM | 58.2 |
| Mean ensemble | 62.4 |
| UWO/WCO ensemble | 62.6 |
| Zero-shot GPT-4 | 59.5 |
| One-shot GPT-4 | 62.3 |
| DMoERM (full) | 70.7 |
| DMoERM (no router) | 67.0 |
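
The BoN protocol referenced above can be sketched as follows; `sample_responses`, `proxy_rm`, and `gold_rm` are hypothetical placeholders, and the KL expression is the standard analytic estimate for best-of-n selection rather than a quantity defined in the paper.

```python
import math

def best_of_n(prompt, n, sample_responses, proxy_rm, gold_rm):
    """Draw n samples, keep the one the proxy RM scores highest, and report the
    gold RM score together with the analytic BoN KL estimate log(n) - (n-1)/n."""
    candidates = sample_responses(prompt, n)                      # n policy samples
    best = max(candidates, key=lambda y: proxy_rm(prompt, y))     # selection by the proxy RM
    kl_nats = math.log(n) - (n - 1) / n                           # KL(BoN policy || base policy)
    return gold_rm(prompt, best), kl_nats
```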

6. Interpretability and Reinforcement Learning Supervision

DMoERM confers practical interpretability advantages: capability expert scores $r_{t,s}(x)$ expose the contributing factors for each reward outcome, enabling inspection of why particular responses are preferred. As an RL teacher, DMoERM's structured reward signals address the overoptimization trap in best-of-n and RL fine-tuning settings and yield more stable policy improvements. The combination of per-task isolation and capability decomposition makes it a systematically better teacher in downstream RLHF, as reflected in human preference win rates, reward stability, and alignment generalization (Quan, 2024).
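
As an illustration of this kind of inspection, a hypothetical per-capability breakdown for one text-creation response is shown below; only the capability names come from the paper, and the score values are made up.

```python
# Hypothetical per-capability scores r_{t,s}(x) for a single response.
capability_scores = {
    "intent conformity": 0.91,
    "expressiveness": 0.64,
    "readability": 0.88,
    "content richness": 0.47,
    "logic": 0.79,
}
for capability, score in sorted(capability_scores.items(), key=lambda kv: kv[1]):
    print(f"{capability:>18s}: {score:.2f}")   # low-scoring capabilities explain a weak overall reward
```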

7. Conclusions and Implications

The double-layer mixture-of-experts framework, instantiated as DMoERM, leverages outer sparse gating and inner dense LoRA expert specialization to counteract multi-task and noisy-label interference in preference-based LLM alignment. This design achieves superior agreement with human preferences, mitigates failure modes common in over-optimized reward modeling, and provides fine-grained interpretability of learned preferences. The use of API-based, swap-filtered annotation pipelines further improves data efficiency and label quality. A plausible implication is that structurally factorized reward models, with explicit task and capability disentanglement, can serve as more robust and effective teachers for both research-grade and large-scale RLHF efforts.

For implementation details, datasets, and code, see the public repository at https://github.com/quanshr/DMoERM-v1 (Quan, 2024).
