
TrinityX: Calibrated MoE Alignment Model

Updated 19 February 2026
  • The paper introduces TrinityX, which integrates calibrated routing among expert FFNs to optimize LLM alignment for helpfulness, harmlessness, and honesty.
  • TrinityX leverages top-1 routing and dual regularization (entropy and KL) to achieve up to 57% reduction in inference time and 40% memory savings over dense MoE models.
  • The model generalizes across various LLM backbones by employing modular low-rank adapters and a composite training strategy, enhancing alignment and scalability.

TrinityX, a Mixture-of-Calibrated Experts (MoCaE) framework, is a Transformer-based LLM specifically designed to optimize alignment across the three desiderata of Helpfulness, Harmlessness, and Honesty (HHH). It systematically addresses the limitations of prior Mixture-of-Experts (MoE) architectures in LLM alignment by integrating expert modularity with a calibrated routing mechanism. TrinityX demonstrates substantial empirical improvements in standard alignment benchmarks, while achieving significant runtime and memory efficiency gains compared to dense MoE and baseline architectures (Kashyap et al., 10 Sep 2025).

1. Model Architecture

TrinityX modifies the standard Transformer by replacing every Feed-Forward Network (FFN) block with a dedicated MoCaE module. Each module incorporates three independently specialized expert FFNs $E_H$, $E_S$, and $E_T$, responsible for optimizing Helpfulness, Safety (Harmlessness), and Truthfulness (Honesty), respectively. Given an input token sequence $x$, the Transformer's attention sublayer yields a hidden state $h \in \mathbb{R}^d$, which is simultaneously processed by all three expert FFNs:

$$y_H = E_H(h), \quad y_S = E_S(h), \quad y_T = E_T(h)$$

A lightweight router network $\alpha(\cdot)$ computes logits $z_i$ for each expert, transformed via a temperature-scaled softmax to obtain routing probabilities $\pi_i$:

$$z_i = W_r^{(i)} h + b_r^{(i)} \quad \forall i \in \{H, S, T\}$$

$$\pi_i = \frac{\exp(z_i/\tau)}{\sum_j \exp(z_j/\tau)}$$

Each expert carries a fixed importance weight $\tilde\gamma_i$, used to calibrate the routing:

$$g_i(h) = \pi_i \cdot \tilde\gamma_i, \quad \sum_i g_i(h) = 1$$

The fused expert output is constructed as:

$$h_{\mathrm{trinity}} = \sum_{i \in \{H, S, T\}} g_i(h)\, E_i(h)$$

This output is layer-normalized and added to the residual $h$ before being forwarded to the next block.
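The routing and fusion described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: LayerNorm is omitted, the experts are stand-in callables, and the calibrated gates are explicitly renormalized to satisfy the constraint that they sum to one (the paper does not spell out how that constraint is enforced).

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-scaled softmax over the router logits z."""
    e = np.exp((z - z.max()) / tau)  # shift by max for numerical stability
    return e / e.sum()

def mocae_forward(h, experts, W_r, b_r, gamma, tau=1.0):
    """One MoCaE block: route hidden state h among three expert FFNs.

    experts : dict of callables {"H": E_H, "S": E_S, "T": E_T}
    W_r, b_r: router parameters (one logit per expert)
    gamma   : fixed importance weights gamma_i, renormalized below
    """
    keys = ["H", "S", "T"]
    z = W_r @ h + b_r              # router logits z_i
    pi = softmax(z, tau)           # routing probabilities pi_i
    g = pi * gamma
    g = g / g.sum()                # calibrated gates, forced to sum to 1
    fused = sum(g[i] * experts[k](h) for i, k in enumerate(keys))
    return fused + h               # residual connection (LayerNorm omitted)
```

With uniform logits and equal importance weights, each expert contributes one third of the fused output, which is a quick sanity check on the gating math.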

Calibration of the router is reinforced via two regularization terms: an entropy penalty,

$$\mathcal{L}_{\rm entropy} = -\sum_i \pi_i \log \pi_i$$

which prevents premature specialization, and a temporal KL penalty,

$$\mathcal{L}_{\rm KL} = \mathrm{KL}(\pi \,\|\, \pi_{\rm prev})$$

which discourages abrupt gating shifts across tokens or layers. The router's objective is thus a combination of task loss and these regularizations:

$$\mathcal{L}_{\rm router} = \mathcal{L}_{\rm task} + \lambda_1 \mathcal{L}_{\rm entropy} + \lambda_2 \mathcal{L}_{\rm KL}$$
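The two penalties and the combined router objective are straightforward to compute; the sketch below is illustrative, and the $\lambda$ values are placeholder defaults, not taken from the paper.

```python
import numpy as np

def entropy_term(pi, eps=1e-12):
    """L_entropy = -sum_i pi_i log pi_i over the routing distribution."""
    return -np.sum(pi * np.log(pi + eps))

def kl_term(pi, pi_prev, eps=1e-12):
    """L_KL = KL(pi || pi_prev), current vs. previous routing distribution."""
    return np.sum(pi * np.log((pi + eps) / (pi_prev + eps)))

def router_loss(task_loss, pi, pi_prev, lam1=0.01, lam2=0.01):
    """L_router = L_task + lambda_1 L_entropy + lambda_2 L_KL.
    lam1/lam2 are illustrative, not the paper's values."""
    return task_loss + lam1 * entropy_term(pi) + lam2 * kl_term(pi, pi_prev)
```

Note that the KL term vanishes when the routing distribution is unchanged from the previous step, so it only penalizes abrupt gating shifts, as intended.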

2. Training Procedure and Alignment Objectives

TrinityX training proceeds in two phases:

(a) Task-vector fine-tuning: The base Transformer's parameters $\theta_0$ are frozen, and for each alignment target $i \in \{\mathrm{helpful}, \mathrm{harmless}, \mathrm{honest}\}$, a low-rank adapter $\mathcal{T}_i \in \mathbb{R}^{r \times d}$ is learned on a designated dataset $\mathcal{D}_i$. This results in three experts, each trained independently via cross-entropy on their respective domains:

$$\mathcal{L}_H = -\mathbb{E}_{(x,y)\sim\mathcal{D}_{\rm helpful}} \log p_{E_H}(y \mid x)$$

$$\mathcal{L}_S = -\mathbb{E}_{(x,s)\sim\mathcal{D}_{\rm harmless}} \sum_{c \in \{\text{safe},\,\text{unsafe}\}} \mathbf{1}_{s=c} \log p_{E_S}(c \mid x)$$

$$\mathcal{L}_T = -\mathbb{E}_{(x,t)\sim\mathcal{D}_{\rm honest}} \sum_{c \in \{\text{truthful},\,\text{hallucinated}\}} \mathbf{1}_{t=c} \log p_{E_T}(c \mid x)$$
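Phase (a) can be illustrated with a LoRA-style low-rank update on a frozen base weight. The factorization into $B A$, the shapes, and the zero initialization of one factor below are illustrative assumptions, not details given in the paper.

```python
import numpy as np

def adapter_forward(h, W0, A, B):
    """Apply a frozen base weight W0 plus a rank-r task vector B @ A.

    W0 : (d, d) frozen base parameters (part of theta_0)
    A  : (r, d) trainable low-rank factor for one alignment axis
    B  : (d, r) trainable low-rank factor for the same axis
    Only A and B would be updated during task-vector fine-tuning.
    """
    return (W0 + B @ A) @ h
```

A common sanity check: with one factor initialized to zero, the adapted layer reproduces the frozen base model exactly, so training starts from the base model's behavior.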

(b) Joint MoCaE calibration: With expert parameters fixed or lightly fine-tuned, the router is trained end-to-end on a composite objective:

$$\mathcal{L}_{\rm total} = \omega_H \mathcal{L}_H + \omega_S \mathcal{L}_S + \omega_T \mathcal{L}_T + \lambda_1 \mathcal{L}_{\rm entropy} + \lambda_2 \mathcal{L}_{\rm KL}$$

The weights $\omega_i$ are typically set equal. Backpropagating this loss enables the router to assign expert contributions adaptively per input context and alignment trade-off.

3. Computational Efficiency

TrinityX’s design confers significant memory and speed optimizations. In standard dense Transformers, each FFN imposes a per-layer memory cost of $8d^2$. A naïve $E$-expert MoE incurs a cost of $E \cdot 8d^2$. MoCaE activates only $k \ll E$ experts per token (typically $k = 1$ via top-1 routing), yielding:

$$M_{\rm MoCaE} \approx 8d^2 + k \cdot 8d^2 = (1+k)\, 8d^2$$

The relative memory saving is:

$$\frac{M_{\rm MoE_{full}} - M_{\rm MoCaE}}{M_{\rm MoE_{full}}} = \frac{E - (1+k)}{E}$$

For $E = 3$, $k = 1$ this amounts to a theoretical $33\%$ saving, with empirical reductions exceeding $40\%$ due to additional inefficiencies in the MoE baseline. Latency reductions scale approximately with $k/E$.
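The saving formula is easy to check numerically. Working in units of $8d^2$ makes $d$ cancel out:

```python
def mocae_memory_saving(E, k):
    """Relative per-layer FFN memory saving of top-k MoCaE over an
    E-expert dense MoE, in units of 8*d^2 (d cancels out)."""
    m_full = E         # dense MoE keeps all E expert FFNs resident
    m_mocae = 1 + k    # base FFN cost plus k active experts
    return (m_full - m_mocae) / m_full
```

With $E = 3$ and $k = 1$ this returns $1/3$, matching the $33\%$ theoretical figure quoted above.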

Empirical results on LLaMA-2-7B (RunPod L40s, 48GB VRAM) show:

| Configuration | Training Time (s) | Inference Time (s) | Memory (MB) |
|---|---|---|---|
| 3-expert dense MoE baseline (H³Fusion) | 7260 | 7.26 | N/A |
| TrinityX MoCaE (top-1) | 1316 | 4.68 | 1721 |
| Full pipeline (fine-tune + MoCaE + reg) | 1437 | 3.10 | 1709 |

This demonstrates up to a $57\%$ reduction in inference time and $40\%$ in memory.

4. Empirical Performance and Ablation Findings

TrinityX achieves substantial gains over H³Fusion and other strong MoE-based baselines across all alignment axes. Main benchmark findings include:

| Model | Alpaca WR (%) | BeaverTails SS (%) (↓) | TruthfulQA TI (%) |
|---|---|---|---|
| H³Fusion (3-expert dense MoE) | 13.79 | 42.00 | 18.82 |
| TrinityX (LLaMA-2-7B) | 36.75 | 41.03 | 40.66 |
| TrinityX (Mistral-7B) | 83.42 | 38.10 | 74.83 |

Overall relative improvements versus baseline: $+32.5\%$ win rate, $+33.9\%$ in safety, $+28.4\%$ in truthfulness.

Ablation studies show that removing the MoCaE module (uniform weights $g_i = 1/3$) decreases the average score from $48.38\%$ to $38.11\%$, indicating the necessity of calibrated routing. Eliminating the regularization terms (entropy or KL) or the gating loss consistently reduces alignment metrics (e.g., SS degrades by 7 percentage points without KL). Disabling the expert-specific low-rank adapters also degrades win rates. Top-1 routing consistently yields optimal performance: WR $= 93.33\%$, SS $= 23.17\%$, TI $= 75.00\%$ under full regularization and gating.

Heatmaps of routing probabilities $\pi_i$ reveal that the model dynamically specializes: honest prompts strongly activate $E_T$ (honesty), safety-critical inputs drive $E_S$ (harmlessness), and open-ended queries invoke $E_H$ (helpfulness). Regularization sharpens this specialization.

5. Generalization Across LLM Backbones

TrinityX has been evaluated for portability across multiple LLM backbones. Fine-tuning on Mistral-7B, Gemma-7B, and DeepSeek-7B with unchanged MoCaE routing achieves performance parity with the results seen on LLaMA-2. On the HoneSet stress-test, DeepSeek-7B with TrinityX achieves WR $= 91.02\%$, SS $= 24.88\%$, TI $= 87.41\%$, Avg $= 57.85\%$. The router network $\alpha(\cdot)$, learned initially on LLaMA-2, generalizes without re-engineering to other architectures, supporting the robustness of MoCaE's calibrated gating.

6. Context and Implications for Alignment Research

TrinityX directly addresses intrinsic trade-offs in LLM alignment by explicitly encoding expert modularity for each HHH axis and regulating their fusion through a calibrated routing strategy. The composite training approach, entropic and temporal smoothing, and adapter-based expert definition together provide a practical blueprint for modular, multi-objective alignment. A plausible implication is that similar architectures may be extensible to other compositional alignment or skill-transfer challenges in large-scale models, subject to gating regularization and expert stability considerations (Kashyap et al., 10 Sep 2025).
