DMoERM: Mixture-of-Experts RL Teacher

Updated 3 January 2026
  • The paper introduces DMoERM, a dual-layer MoE RL teacher that uses two-stage expert routing to mitigate multi-task interference and noisy annotations.
  • DMoERM employs an outer sparse task router and an inner dense LoRA-based MoE structure to decompose tasks into fine-grained capability experts, stabilizing reward signals.
  • Empirical results show DMoERM improves agreement with human rankings and policy optimization stability, outperforming traditional reward models in RLHF.

A Mixture-of-Experts (MoE) RL Teacher is a reward modeling framework with a hierarchical MoE architecture designed to address fundamental challenges in preference-based reinforcement learning from human feedback (RLHF) for LLMs. The distinctive contribution of the DMoERM ("Double-Layer MoE Reward Model") is its two-stage expert routing; an outer sparse router partitions input by task (e.g., text creation, roleplay), while an inner dense MoE structure decomposes each task into capability sub-dimensions (e.g., intent conformity, expressiveness), each handled by a fine-tuned LoRA expert. This approach targets two pervasive issues in reward model (RM) training for LLM alignment fine-tuning: multi-task disturbance from heterogeneous data, and low inter-annotator agreement introducing label noise. By isolating task and capability contributions, DMoERM enhances reward signal fidelity, stabilizes policy optimization, and achieves superior alignment with human preferences (Quan, 2024).

1. Challenges in Reward Model Training for RLHF

Reward models are central to the alignment of LLMs via RLHF, guiding policy updates based on predicted human preference. Empirically, two obstacles degrade RM effectiveness:

  • Multi-task Interference: Aggregating data from disparate dialogue domains and preference axes in a single RM induces negative transfer. The model's generalization performance declines when simultaneously exposed to tasks with orthogonal objectives (e.g., roleplay vs. objective QA), as the shared representation is insufficiently specialized.
  • Noisy Preference Supervision: Human annotators exhibit only 60% to 75% pairwise agreement on preference data, so the resulting reward signal is substantially noisy. This impairs the learnability and predictive validity of RMs as alignment teachers.

The DMoERM architecture is constructed to directly address both issues by structurally partitioning tasks and capability factors.

2. Double-Layer Mixture-of-Experts Architecture

The DMoERM model deploys a two-level MoE hierarchy:

Outer Sparse MoE (Task Router)

  • For $T$ distinct tasks (e.g., text creation, roleplay, chitchat), input $x$ is processed by a frozen router (a small transformer or MLP) generating task logits $u(x) = (u_0(x), \ldots, u_{T-1}(x))$.
  • The gating network applies a softmax: $g_0(x)_t = \frac{\exp(u_t(x))}{\sum_{i=0}^{T-1}\exp(u_i(x))}$.
  • The top-scoring task index $t^* = \arg\max_t g_0(x)_t$ selects a single task-specific RM, avoiding multi-task disturbance while keeping inference cost fixed (see the sketch below).
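
A minimal PyTorch sketch of this outer routing step, assuming a pretrained encoder that maps text to a pooled hidden state; `TaskRouter`, `route_reward`, and `task_rms` are illustrative names, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class TaskRouter(torch.nn.Module):
    """Outer sparse gate: frozen encoder -> task logits u(x) -> softmax g_0(x) -> top-1 task."""

    def __init__(self, encoder: torch.nn.Module, hidden_dim: int, num_tasks: int):
        super().__init__()
        self.encoder = encoder                       # small transformer/MLP, assumed pretrained
        self.head = torch.nn.Linear(hidden_dim, num_tasks)
        for p in self.parameters():                  # the router stays frozen during RM training
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.head(self.encoder(x))               # task logits u(x), shape (batch, T)
        g0 = F.softmax(u, dim=-1)                    # gating probabilities g_0(x)
        return g0.argmax(dim=-1)                     # hard top-1 task index t*

def route_reward(x, router: TaskRouter, task_rms):
    """Dispatch x to the single task-specific RM selected by the router."""
    t_star = router(x)                               # (batch,) task indices
    # For brevity, assume the whole batch routes to one task; real code would group by t*.
    return task_rms[int(t_star[0])](x)
```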

Inner Dense MoE (Capability Experts via LoRA)

  • Each task $t$ is decomposed into $S$ capability dimensions (e.g., "intent conformity," "expressiveness").
  • The base RM for task $t$, $RM^{base}_t$ with parameters $W^{(base)}$, is extended by $S$ LoRA adapters $A^{(s)}$, giving $W^{(s)} = W^{(base)} + A^{(s)}$, with each expert handling a distinct capability.
  • Processing $x$ with the $s$-th expert yields an embedding $z_{t,s}(x) \in \mathbb{R}^d$ and a scalar capability score $r_{t,s}(x) = \sigma(w_{t,s}^T z_{t,s}(x) + b_{t,s})$ (see the sketch below).
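
The inner experts can be pictured with a minimal sketch like the following, where a single frozen linear weight stands in for the full LoRA-adapted base RM; the class name, shapes, and rank are assumptions, not the paper's code.

```python
import torch

class LoRACapabilityExpert(torch.nn.Module):
    """One capability expert: frozen base weight W^(base) plus a low-rank adapter
    A^(s) = B @ A, followed by a scalar scoring head (w_{t,s}, b_{t,s})."""

    def __init__(self, w_base: torch.Tensor, rank: int = 8):
        super().__init__()
        d_out, d_in = w_base.shape
        self.w_base = torch.nn.Parameter(w_base, requires_grad=False)      # frozen W^(base)
        self.lora_a = torch.nn.Parameter(0.01 * torch.randn(rank, d_in))
        self.lora_b = torch.nn.Parameter(torch.zeros(d_out, rank))         # zero-init: start at W^(base)
        self.score_head = torch.nn.Linear(d_out, 1)

    def forward(self, h: torch.Tensor):
        w_s = self.w_base + self.lora_b @ self.lora_a       # W^(s) = W^(base) + A^(s)
        z = h @ w_s.T                                       # capability embedding z_{t,s}(x)
        r = torch.sigmoid(self.score_head(z)).squeeze(-1)   # capability score r_{t,s}(x)
        return z, r
```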

MLP Aggregator

  • The $S$ expert embeddings $z_{t,0}(x), \ldots, z_{t,S-1}(x)$ are concatenated and fed to a two-layer MLP with PReLU activation:

$$R(x) = \sigma\left( W_1 \cdot \mathrm{PReLU}(W_0 \cdot z(x) + b_0) + b_1 \right)$$

  • The MLP models non-linear interactions among capability dimensions, outputting a holistic scalar reward (see the sketch below).
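
A sketch of the aggregator head under the same assumptions; the hidden width and class name are illustrative.

```python
import torch

class CapabilityAggregator(torch.nn.Module):
    """Two-layer MLP with PReLU over the concatenated expert embeddings,
    producing a single scalar reward R(x) in (0, 1)."""

    def __init__(self, num_experts: int, d: int, hidden: int = 256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(num_experts * d, hidden),    # W_0, b_0
            torch.nn.PReLU(),
            torch.nn.Linear(hidden, 1),                  # W_1, b_1
            torch.nn.Sigmoid(),
        )

    def forward(self, expert_embeddings):
        # expert_embeddings: list of S tensors z_{t,s}(x), each of shape (batch, d)
        z = torch.cat(expert_embeddings, dim=-1)         # concatenated z(x)
        return self.net(z).squeeze(-1)                   # holistic reward R(x)
```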

3. Training Paradigm

DMoERM training proceeds in three sequential stages per task tt:

  1. Task-Specific Base Model Fine-Tuning: 60% of preference pairs for task $t$ are used to full-parameter fine-tune $RM^{base}_t$, yielding a monolithic task-specific RM.
  2. Capability Expert LoRA Fine-Tuning:
    • Pairs are annotated with single-capability preferences by querying a public LLM API (Baidu ERNIE Bot).
    • To lessen annotation bias, each pair is scored in both swap orders, retaining only consistent pairs (see the sketch after this list).
    • SS LoRA adapters are trained on these cleaned, capability-labeled examples, one per capability.
  3. Aggregator Head Training:
    • Freeze the base model and all LoRA adapters.
    • Use the remaining 40% of task-specific data to train only the aggregator MLP.
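
The swap-order filter in stage 2 might look like the sketch below; `llm_prefers` is a hypothetical stand-in for the actual judging call to the LLM API, and the paper's prompting details are not reproduced here.

```python
def llm_prefers(prompt: str, resp_a: str, resp_b: str, capability: str) -> str:
    """Hypothetical stand-in for the LLM-judge API call; returns "A" or "B"."""
    raise NotImplementedError("replace with a real API call and capability-specific prompt")

def swap_filtered_pairs(pairs, capability):
    """Keep only pairs whose capability preference is consistent under both
    presentation orders; return them as (prompt, preferred, rejected)."""
    kept = []
    for prompt, resp_1, resp_2 in pairs:
        first = llm_prefers(prompt, resp_1, resp_2, capability)   # resp_1 shown in position A
        second = llm_prefers(prompt, resp_2, resp_1, capability)  # swapped order
        if first == "A" and second == "B":
            kept.append((prompt, resp_1, resp_2))   # resp_1 wins in both orders
        elif first == "B" and second == "A":
            kept.append((prompt, resp_2, resp_1))   # resp_2 wins in both orders
        # otherwise the verdict flipped with position: drop the pair as noisy
    return kept
```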

Across all stages, the pairwise reward difference is trained via the logistic loss:

$$L = -\mathbb{E}_{(x, y^+, y^-) \sim D} \left[ \log \sigma\big( R(x; y^+) - R(x; y^-) \big) \right]$$

No explicit sparsity penalty is imposed; the router remains pretrained and frozen throughout.
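
A minimal PyTorch rendering of this pairwise loss, assuming the reward scores for the preferred and rejected responses have already been computed; the example values are illustrative.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(r_pos: torch.Tensor, r_neg: torch.Tensor) -> torch.Tensor:
    """L = -E[ log sigmoid( R(x; y+) - R(x; y-) ) ] over a batch of preference pairs."""
    return -F.logsigmoid(r_pos - r_neg).mean()

# Illustrative scores for four preference pairs (preferred vs. rejected responses).
r_pos = torch.tensor([0.82, 0.61, 0.75, 0.90])
r_neg = torch.tensor([0.40, 0.55, 0.70, 0.30])
loss = pairwise_preference_loss(r_pos, r_neg)   # backpropagated only through the trainable parts
print(float(loss))
```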

4. Handling Label Noise and Multi-Task Interference

Human annotation of overall dialog quality yields only 60–75% consistency, but decomposing responses into fine-grained capability scores (five for text creation: "intent conformity," "expressiveness," "readability," "content richness," "logic") raised consistency to 80–90% on each capability and 75% for the aggregated judgment. This suggests that factorized evaluation surfaces more robustly learnable signals and enables more consistent reward modeling.

By freezing the outer router, each per-task group is isolated from off-task samples, circumventing the negative transfer observed in ablation, where multi-task mixtures degraded accuracy from 56.7% (single-task) to 52% (multi-task).

Automated capability labeling combined with positional-bias swap-filtering imparts data efficiency while cleansing noisy judgments, a fundamental advance over manual-only pipelines.

5. Empirical Results

Experimental evaluation of DMoERM demonstrates:

  • Preference Consistency: On manually labeled sets, DMoERM achieved 70.7% agreement with human rankings, outperforming single reward models (58.2%), mean ensembles (62.4%), and advanced ensemble methods like UWO (62.6%). It surpassed zero-shot GPT-4 (59.5%) and one-shot GPT-4 (62.3%). The inner MoE alone (outer router ablated) retained 67.0% consistency.
  • Best-of-n (BoN) Sampling: DMoERM-optimized policies yielded higher gold RM scores up to KL ≈ 10 nats; baselines over-optimized and failed by KL ≈ 4 nats, whereas DMoERM remained stable beyond KL ≈ 8 nats (the BoN protocol is sketched after the table below).
  • PPO Fine-Tuning: During PPO with a KL penalty (3,000 steps), DMoERM-tuned policies outperformed all ensemble baselines on average gold RM scores and improved out-of-distribution generalization on AlignBench prompts.
  • Human Evaluation: Human judges preferred DMoERM outputs at rates up to 87% in BoN and 68% in PPO at select checkpoints.

| Model | Human Agreement (%) |
|---|---|
| Single RM | 58.2 |
| Mean ensemble | 62.4 |
| UWO/WCO ensemble | 62.6 |
| Zero-shot GPT-4 | 59.5 |
| One-shot GPT-4 | 62.3 |
| DMoERM (full) | 70.7 |
| DMoERM (no router) | 67.0 |
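
The BoN protocol referenced above can be sketched as follows; `sample_responses`, `proxy_rm`, and `gold_rm` are hypothetical placeholders, and the KL expression is the standard analytic estimate for best-of-n selection rather than a quantity defined in the paper.

```python
import math

def best_of_n(prompt, n, sample_responses, proxy_rm, gold_rm):
    """Draw n samples, keep the one the proxy RM scores highest, and report the
    gold RM score together with the analytic BoN KL estimate log(n) - (n-1)/n."""
    candidates = sample_responses(prompt, n)                      # n policy samples
    best = max(candidates, key=lambda y: proxy_rm(prompt, y))     # selection by the proxy RM
    kl_nats = math.log(n) - (n - 1) / n                           # KL(BoN policy || base policy)
    return gold_rm(prompt, best), kl_nats
```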

6. Interpretability and Reinforcement Learning Supervision

DMoERM confers practical interpretability advantages: capability expert scores $r_{t,s}(x)$ expose the contributing factors for each reward outcome, enabling inspection of why particular responses are preferred. As an RL teacher, DMoERM's structured reward signals address the overoptimization trap in best-of-n and RL fine-tuning settings and yield more stable policy improvements. The combination of per-task isolation and capability decomposition makes it a systematically better teacher in downstream RLHF, as reflected in human preference win rates, reward stability, and alignment generalization (Quan, 2024).
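
As an illustration of this kind of inspection, a hypothetical per-capability breakdown for one text-creation response is shown below; only the capability names come from the paper, and the score values are made up.

```python
# Hypothetical per-capability scores r_{t,s}(x) for a single response.
capability_scores = {
    "intent conformity": 0.91,
    "expressiveness": 0.64,
    "readability": 0.88,
    "content richness": 0.47,
    "logic": 0.79,
}
for capability, score in sorted(capability_scores.items(), key=lambda kv: kv[1]):
    print(f"{capability:>18s}: {score:.2f}")   # low-scoring capabilities explain a weak overall reward
```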

7. Conclusions and Implications

The double-layer mixture-of-experts framework, instantiated as DMoERM, leverages outer sparse gating and inner dense LoRA expert specialization to counteract multi-task and noisy-label interference in preference-based LLM alignment. This design achieves superior agreement with human preferences, mitigates failure modes common in over-optimized reward modeling, and provides fine-grained interpretability of learned preferences. The use of API-based, swap-filtered annotation pipelines further improves data efficiency and label quality. A plausible implication is that structurally factorized reward models, with explicit task and capability disentanglement, can serve as more robust and effective teachers for both research-grade and large-scale RLHF efforts.

For implementation details, datasets, and code, see the public repository at https://github.com/quanshr/DMoERM-v1 (Quan, 2024).
