
TrinityX: Calibrated MoE Alignment Model

Updated 19 February 2026
  • The paper introduces TrinityX, which integrates calibrated routing among expert FFNs to optimize LLM alignment for helpfulness, harmlessness, and honesty.
  • TrinityX leverages top-1 routing and dual regularization (entropy and KL) to achieve up to 57% reduction in inference time and 40% memory savings over dense MoE models.
  • The model generalizes across various LLM backbones by employing modular low-rank adapters and a composite training strategy, enhancing alignment and scalability.

TrinityX, a Mixture-of-Calibrated Experts (MoCaE) framework, is a Transformer-based LLM specifically designed to optimize alignment across the three desiderata of Helpfulness, Harmlessness, and Honesty (HHH). It systematically addresses the limitations of prior Mixture-of-Experts (MoE) architectures in LLM alignment by integrating expert modularity with a calibrated routing mechanism. TrinityX demonstrates substantial empirical improvements in standard alignment benchmarks, while achieving significant runtime and memory efficiency gains compared to dense MoE and baseline architectures (Kashyap et al., 10 Sep 2025).

1. Model Architecture

TrinityX modifies the standard Transformer by replacing every Feed-Forward Network (FFN) block with a dedicated MoCaE module. Each module incorporates three independently specialized expert FFNs $E_H$, $E_S$, and $E_T$, responsible for optimizing Helpfulness, Safety (Harmlessness), and Truthfulness (Honesty), respectively. Given an input token sequence $x$, the Transformer's attention sublayer yields a hidden state $h \in \mathbb{R}^d$, which is simultaneously processed by all three expert FFNs:

$$y_H = E_H(h), \quad y_S = E_S(h), \quad y_T = E_T(h)$$

A lightweight router network $\alpha(\cdot)$ computes logits $z_i$ for each expert, transformed via a temperature-scaled softmax to obtain routing probabilities $\pi_i$:

$$z_i = W_r^{(i)} h + b_r^{(i)} \quad \forall i \in \{H, S, T\}$$

$$\pi_i = \frac{\exp(z_i/\tau)}{\sum_j \exp(z_j/\tau)}$$

Each expert carries a fixed importance weight $\tilde\gamma_i$, used to calibrate the routing:

$$g_i(h) = \pi_i \cdot \tilde\gamma_i, \quad \sum_i g_i(h) = 1$$

The fused expert output is constructed as:

$$h_{\mathrm{trinity}} = \sum_{i \in \{H, S, T\}} g_i(h)\, E_i(h)$$

This output is layer-normalized and added to the residual $h$ before being forwarded to the next block.
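The routing and fusion described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: LayerNorm is omitted, the experts are stand-in callables, and the calibrated gates are explicitly renormalized to satisfy the constraint that they sum to one (the paper does not spell out how that constraint is enforced).

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-scaled softmax over the router logits z."""
    e = np.exp((z - z.max()) / tau)  # shift by max for numerical stability
    return e / e.sum()

def mocae_forward(h, experts, W_r, b_r, gamma, tau=1.0):
    """One MoCaE block: route hidden state h among three expert FFNs.

    experts : dict of callables {"H": E_H, "S": E_S, "T": E_T}
    W_r, b_r: router parameters (one logit per expert)
    gamma   : fixed importance weights gamma_i, renormalized below
    """
    keys = ["H", "S", "T"]
    z = W_r @ h + b_r              # router logits z_i
    pi = softmax(z, tau)           # routing probabilities pi_i
    g = pi * gamma
    g = g / g.sum()                # calibrated gates, forced to sum to 1
    fused = sum(g[i] * experts[k](h) for i, k in enumerate(keys))
    return fused + h               # residual connection (LayerNorm omitted)
```

With uniform logits and equal importance weights, each expert contributes one third of the fused output, which is a quick sanity check on the gating math.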

Calibration of the router is reinforced via two regularization terms: an entropy penalty,

$$\mathcal{L}_{\rm entropy} = -\sum_i \pi_i \log \pi_i$$

which prevents premature specialization, and a temporal KL penalty,

$$\mathcal{L}_{\rm KL} = \mathrm{KL}(\pi \,\|\, \pi_{\rm prev})$$

which discourages abrupt gating shifts across tokens or layers. The router's objective is thus a combination of task loss and these regularizations:

$$\mathcal{L}_{\rm router} = \mathcal{L}_{\rm task} + \lambda_1 \mathcal{L}_{\rm entropy} + \lambda_2 \mathcal{L}_{\rm KL}$$
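The two penalties and the combined router objective are straightforward to compute; the sketch below is illustrative, and the $\lambda$ values are placeholder defaults, not taken from the paper.

```python
import numpy as np

def entropy_term(pi, eps=1e-12):
    """L_entropy = -sum_i pi_i log pi_i over the routing distribution."""
    return -np.sum(pi * np.log(pi + eps))

def kl_term(pi, pi_prev, eps=1e-12):
    """L_KL = KL(pi || pi_prev), current vs. previous routing distribution."""
    return np.sum(pi * np.log((pi + eps) / (pi_prev + eps)))

def router_loss(task_loss, pi, pi_prev, lam1=0.01, lam2=0.01):
    """L_router = L_task + lambda_1 L_entropy + lambda_2 L_KL.
    lam1/lam2 are illustrative, not the paper's values."""
    return task_loss + lam1 * entropy_term(pi) + lam2 * kl_term(pi, pi_prev)
```

Note that the KL term vanishes when the routing distribution is unchanged from the previous step, so it only penalizes abrupt gating shifts, as intended.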

2. Training Procedure and Alignment Objectives

TrinityX training proceeds in two phases:

(a) Task-vector fine-tuning: The base Transformer's parameters $\theta_0$ are frozen, and for each alignment target $i \in \{\mathrm{helpful}, \mathrm{harmless}, \mathrm{honest}\}$, a low-rank adapter $\mathcal{T}_i \in \mathbb{R}^{r \times d}$ is learned on a designated dataset $\mathcal{D}_i$. This results in three experts, each trained independently via cross-entropy on their respective domains:

$$\mathcal{L}_H = -\mathbb{E}_{(x,y)\sim\mathcal{D}_{\rm helpful}} \log p_{E_H}(y \mid x)$$

$$\mathcal{L}_S = -\mathbb{E}_{(x,s)\sim\mathcal{D}_{\rm harmless}} \sum_{c \in \{\text{safe},\,\text{unsafe}\}} \mathbf{1}_{s=c} \log p_{E_S}(c \mid x)$$

$$\mathcal{L}_T = -\mathbb{E}_{(x,t)\sim\mathcal{D}_{\rm honest}} \sum_{c \in \{\text{truthful},\,\text{hallucinated}\}} \mathbf{1}_{t=c} \log p_{E_T}(c \mid x)$$
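Phase (a) can be illustrated with a LoRA-style low-rank update on a frozen base weight. The factorization into $B A$, the shapes, and the zero initialization of one factor below are illustrative assumptions, not details given in the paper.

```python
import numpy as np

def adapter_forward(h, W0, A, B):
    """Apply a frozen base weight W0 plus a rank-r task vector B @ A.

    W0 : (d, d) frozen base parameters (part of theta_0)
    A  : (r, d) trainable low-rank factor for one alignment axis
    B  : (d, r) trainable low-rank factor for the same axis
    Only A and B would be updated during task-vector fine-tuning.
    """
    return (W0 + B @ A) @ h
```

A common sanity check: with one factor initialized to zero, the adapted layer reproduces the frozen base model exactly, so training starts from the base model's behavior.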

(b) Joint MoCaE calibration: With expert parameters fixed or lightly fine-tuned, the router is trained end-to-end on a composite objective:

$$\mathcal{L}_{\rm total} = \omega_H \mathcal{L}_H + \omega_S \mathcal{L}_S + \omega_T \mathcal{L}_T + \lambda_1 \mathcal{L}_{\rm entropy} + \lambda_2 \mathcal{L}_{\rm KL}$$

The weights $\omega_i$ are typically set equal. Backpropagating this loss enables the router to assign expert contributions adaptively per input context and alignment trade-off.

3. Computational Efficiency

TrinityX’s design confers significant memory and speed optimizations. In standard dense Transformers, each FFN imposes a per-layer memory cost of $8d^2$. A naïve $E$-expert MoE incurs a cost of $E \cdot 8d^2$. MoCaE activates only $k \ll E$ experts per token (typically $k = 1$ via top-1 routing), yielding:

$$M_{\rm MoCaE} \approx 8d^2 + k \cdot 8d^2 = (1+k)\, 8d^2$$

The relative memory saving is:

$$\frac{M_{\rm MoE_{full}} - M_{\rm MoCaE}}{M_{\rm MoE_{full}}} = \frac{E - (1+k)}{E}$$

For $E = 3$, $k = 1$ this amounts to a theoretical $33\%$ saving, with empirical reductions exceeding $40\%$ due to additional inefficiencies in the MoE baseline. Latency reductions scale approximately with $k/E$.
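The saving formula is easy to check numerically. Working in units of $8d^2$ makes $d$ cancel out:

```python
def mocae_memory_saving(E, k):
    """Relative per-layer FFN memory saving of top-k MoCaE over an
    E-expert dense MoE, in units of 8*d^2 (d cancels out)."""
    m_full = E         # dense MoE keeps all E expert FFNs resident
    m_mocae = 1 + k    # base FFN cost plus k active experts
    return (m_full - m_mocae) / m_full
```

With $E = 3$ and $k = 1$ this returns $1/3$, matching the $33\%$ theoretical figure quoted above.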

Empirical results on LLaMA-2-7B (RunPod L40s, 48GB VRAM) show:

| Configuration | Training Time (s) | Inference Time (s) | Memory (MB) |
|---|---|---|---|
| 3-expert dense MoE baseline (H³Fusion) | 7260 | 7.26 | N/A |
| TrinityX MoCaE (top-1) | 1316 | 4.68 | 1721 |
| Full pipeline (fine-tune + MoCaE + reg) | 1437 | 3.10 | 1709 |

This demonstrates up to a $57\%$ reduction in inference time and $40\%$ in memory.

4. Empirical Performance and Ablation Findings

TrinityX achieves substantial gains over H³Fusion and other strong MoE-based baselines across all alignment axes. Main benchmark findings include:

| Model | Alpaca WR (%) | BeaverTails SS (%) (↓) | TruthfulQA TI (%) |
|---|---|---|---|
| H³Fusion (3-expert dense MoE) | 13.79 | 42.00 | 18.82 |
| TrinityX (LLaMA-2-7B) | 36.75 | 41.03 | 40.66 |
| TrinityX (Mistral-7B) | 83.42 | 38.10 | 74.83 |

Overall relative improvements versus baseline: $+32.5\%$ win rate, $+33.9\%$ in safety, $+28.4\%$ in truthfulness.

Ablation studies show that removing the MoCaE module (uniform weights $g_i = 1/3$) decreases the average score from $48.38\%$ to $38.11\%$, indicating the necessity of calibrated routing. Eliminating the regularization terms (entropy or KL) or the gating loss consistently reduces alignment metrics (e.g., SS degrades by 7 percentage points without KL). Disabling the expert-specific low-rank adapters also degrades win rates. Top-1 routing consistently yields optimal performance: WR $= 93.33\%$, SS $= 23.17\%$, TI $= 75.00\%$ under full regularization and gating.

Heatmaps of routing probabilities $\pi_i$ reveal that the model dynamically specializes: honest prompts strongly activate $E_T$ (honesty), safety-critical inputs drive $E_S$ (harmlessness), and open-ended queries invoke $E_H$ (helpfulness). Regularization sharpens this specialization.

5. Generalization Across LLM Backbones

TrinityX has been evaluated for portability across multiple LLM backbones. Fine-tuning on Mistral-7B, Gemma-7B, and DeepSeek-7B with unchanged MoCaE routing achieves performance parity with the results seen on LLaMA-2. On the HoneSet stress-test, DeepSeek-7B with TrinityX achieves WR $= 91.02\%$, SS $= 24.88\%$, TI $= 87.41\%$, Avg $= 57.85\%$. The router network $\alpha(\cdot)$, learned initially on LLaMA-2, generalizes without re-engineering to other architectures, supporting the robustness of MoCaE's calibrated gating.

6. Context and Implications for Alignment Research

TrinityX directly addresses intrinsic trade-offs in LLM alignment by explicitly encoding expert modularity for each HHH axis and regulating their fusion through a calibrated routing strategy. The composite training approach, entropic and temporal smoothing, and adapter-based expert definition together provide a practical blueprint for modular, multi-objective alignment. A plausible implication is that similar architectures may be extensible to other compositional alignment or skill-transfer challenges in large-scale models, subject to gating regularization and expert stability considerations (Kashyap et al., 10 Sep 2025).
