TrinityX: Calibrated MoE Alignment Model
- The paper introduces TrinityX, which integrates calibrated routing among expert FFNs to optimize LLM alignment for helpfulness, harmlessness, and honesty.
- TrinityX leverages top-1 routing and dual regularization (entropy and KL) to achieve up to 57% reduction in inference time and 40% memory savings over dense MoE models.
- The model generalizes across various LLM backbones by employing modular low-rank adapters and a composite training strategy, enhancing alignment and scalability.
TrinityX, a Mixture-of-Calibrated Experts (MoCaE) framework, is a Transformer-based LLM specifically designed to optimize alignment across the three desiderata of Helpfulness, Harmlessness, and Honesty (HHH). It systematically addresses the limitations of prior Mixture-of-Experts (MoE) architectures in LLM alignment by integrating expert modularity with a calibrated routing mechanism. TrinityX demonstrates substantial empirical improvements in standard alignment benchmarks, while achieving significant runtime and memory efficiency gains compared to dense MoE and baseline architectures (Kashyap et al., 10 Sep 2025).
1. Model Architecture
TrinityX modifies the standard Transformer by replacing every Feed-Forward Network (FFN) block with a dedicated MoCaE module. Each module incorporates three independently specialized expert FFNs $E_{\text{help}}$, $E_{\text{safe}}$, and $E_{\text{truth}}$, responsible for optimizing Helpfulness, Safety (Harmlessness), and Truthfulness (Honesty), respectively. Given an input token sequence $x = (x_1, \dots, x_T)$, the Transformer’s attention sublayer yields a hidden state $h \in \mathbb{R}^d$, which is simultaneously processed by all three expert FFNs:

$$e_i = E_i(h), \quad i \in \{\text{help}, \text{safe}, \text{truth}\}.$$
A lightweight router network computes a logit $z_i = w_i^\top h$ for each expert, transformed via a temperature-scaled softmax to obtain routing probabilities $p_i$:

$$p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}.$$
Each expert carries a fixed importance weight $\alpha_i$, used to calibrate the routing:

$$\tilde{p}_i = \frac{\alpha_i \, p_i}{\sum_j \alpha_j \, p_j}.$$
The fused expert output is constructed as:

$$o = \sum_i \tilde{p}_i \, E_i(h).$$
This output is layer-normalized and added to the residual before being forwarded to the next block.
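The routing, calibration, and fusion steps above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the linear router parameterization, function names, and shapes are all assumptions made for the sketch.

```python
import numpy as np

def mocae_forward(h, experts, W_router, alpha, tau=1.0):
    """Sketch of one MoCaE block: route a hidden state through calibrated experts.

    h        : (d,) hidden state from the attention sublayer
    experts  : list of callables, each mapping (d,) -> (d,) (the expert FFNs)
    W_router : (n_experts, d) router weights (hypothetical linear parameterization)
    alpha    : (n_experts,) fixed importance weights used for calibration
    tau      : softmax temperature
    """
    # Router logits and temperature-scaled softmax probabilities p_i
    z = W_router @ h
    p = np.exp(z / tau - np.max(z / tau))  # shift for numerical stability
    p /= p.sum()
    # Calibration: rescale by fixed importance weights, then renormalize
    p_cal = alpha * p
    p_cal /= p_cal.sum()
    # Fused output: calibration-weighted sum of expert outputs
    # (top-1 routing would instead keep only the argmax expert)
    out = sum(w * E(h) for w, E in zip(p_cal, experts))
    return out, p_cal
```

With top-1 routing, only the expert with the largest calibrated probability would be evaluated, which is where the memory and latency savings discussed later come from.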
Calibration of the router is reinforced via two regularization terms: an entropy penalty,

$$\mathcal{L}_{\text{ent}} = \sum_i \tilde{p}_i \log \tilde{p}_i,$$

which prevents premature specialization, and a temporal KL penalty,

$$\mathcal{L}_{\text{KL}} = \mathrm{KL}\!\left(\tilde{p}^{(t)} \,\|\, \tilde{p}^{(t-1)}\right),$$

which discourages abrupt gating shifts across tokens or layers. The router's objective is thus a combination of task loss and these regularizations:

$$\mathcal{L}_{\text{router}} = \mathcal{L}_{\text{task}} + \lambda_{\text{ent}} \mathcal{L}_{\text{ent}} + \lambda_{\text{KL}} \mathcal{L}_{\text{KL}}.$$
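The two router regularizers are straightforward to write out. A small sketch follows, assuming the entropy penalty is the negative entropy of the gating distribution and the temporal penalty is a KL divergence between consecutive gating distributions; the function names and the epsilon smoothing are illustrative choices, not the paper's.

```python
import numpy as np

def entropy_penalty(p):
    """Negative entropy of the routing distribution.

    Adding this term to the loss pushes the router toward higher-entropy
    (less prematurely specialized) gating, since minimizing negative
    entropy maximizes entropy.
    """
    eps = 1e-12  # smoothing to avoid log(0)
    return float(np.sum(p * np.log(p + eps)))

def temporal_kl_penalty(p_t, p_prev):
    """KL(p_t || p_prev) between routing distributions at consecutive
    tokens or layers; penalizes abrupt gating shifts."""
    eps = 1e-12
    return float(np.sum(p_t * np.log((p_t + eps) / (p_prev + eps))))
```

A uniform gating distribution minimizes the entropy penalty, while identical consecutive distributions zero out the temporal term, matching the intended regularization behavior.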
2. Training Procedure and Alignment Objectives
TrinityX training proceeds in two phases:
(a) Task-vector fine-tuning: The base Transformer's parameters $\theta_0$ are frozen, and for each alignment target $a \in \{\text{help}, \text{safe}, \text{truth}\}$, a low-rank adapter $\Delta\theta_a$ is learned on a designated dataset $\mathcal{D}_a$. This results in three experts, each trained independently via cross-entropy on its respective domain:

$$\mathcal{L}_a = -\sum_{(x, y) \in \mathcal{D}_a} \log p_{\theta_0 + \Delta\theta_a}(y \mid x).$$
(b) Joint MoCaE calibration: With expert parameters fixed or lightly fine-tuned, the router is trained end-to-end on a composite objective:

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda_{\text{ent}} \mathcal{L}_{\text{ent}} + \lambda_{\text{KL}} \mathcal{L}_{\text{KL}}.$$

The regularization weights $\lambda_{\text{ent}}$ and $\lambda_{\text{KL}}$ are typically set equal. Back-propagating this loss enables the router to assign expert contributions adaptively per input context and alignment trade-off.
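Phase (a)'s expert-specific adapters can be illustrated with a LoRA-style low-rank update, in which the frozen base weight is augmented by a trainable low-rank product. The rank, initialization scale, and class name below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

class LowRankAdapter:
    """LoRA-style adapter sketch: the frozen base weight W0 is augmented
    with a trainable low-rank update B @ A, so the effective weight is
    W0 + B @ A. Only A and B would be trained; W0 stays frozen."""

    def __init__(self, W0, r=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W0 = W0                                    # frozen base weight, shape (out, in)
        self.A = rng.standard_normal((r, W0.shape[1])) * 0.01
        self.B = np.zeros((W0.shape[0], r))             # zero init: adapter starts as a no-op

    def __call__(self, x):
        # Apply base weight plus the low-rank correction
        return self.W0 @ x + self.B @ (self.A @ x)
```

Zero-initializing `B` means each expert starts out identical to the frozen base model and only diverges as its adapter is trained on its alignment-specific dataset.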
3. Computational Efficiency
TrinityX’s design confers significant memory and speed optimizations. In standard dense Transformers, each FFN imposes a per-layer memory cost of $O(d \cdot d_{\text{ff}})$. A naïve $N$-expert MoE results in an $O(N \cdot d \cdot d_{\text{ff}})$ cost. MoCaE activates only $k$ experts per token (typically $k = 1$ via top-1 routing), yielding an active cost of:

$$O(k \cdot d \cdot d_{\text{ff}}).$$

The relative memory saving is:

$$1 - \frac{k}{N}.$$
For $N = 3$ and $k = 1$ this amounts to a theoretical saving of $2/3 \approx 67\%$, with empirical reductions able to exceed the theoretical figure due to additional inefficiencies in the MoE baseline. Latency reductions scale approximately with $k/N$.
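The saving formula is easy to sanity-check numerically; a hypothetical helper:

```python
def moe_memory_saving(n_experts, k_active):
    """Relative memory saving from activating k of N experts per token,
    versus keeping all N experts' activations in play: 1 - k/N."""
    return 1 - k_active / n_experts

# Three experts with top-1 routing, as in TrinityX's configuration,
# gives a theoretical saving of 1 - 1/3, i.e. roughly 67%.
```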
Empirical results on LLaMA-2-7B (RunPod L40s, 48GB VRAM) show:
| Configuration | Training Time (s) | Inference Time (s) | Memory (MB) |
|---|---|---|---|
| 3-expert MoE (dense, baseline) (H³Fusion) | 7260 | 7.26 | N/A |
| TrinityX MoCaE (top-1) | 1316 | 4.68 | 1721 |
| Full pipeline (fine-tune+MoCaE+reg) | 1437 | 3.10 | 1709 |
This demonstrates up to a 57% reduction in inference time and a 40% reduction in memory.
4. Empirical Performance and Ablation Findings
TrinityX achieves substantial gains over H³Fusion and other strong MoE-based baselines across all alignment axes. Main benchmark findings include:
| Model | Alpaca WR (%) | BeaverTails SS (%) (↓) | TruthfulQA TI (%) |
|---|---|---|---|
| H³Fusion (3-expert dense MoE) | 13.79 | 42.00 | 18.82 |
| TrinityX (LLaMA-2-7B) | 36.75 | 41.03 | 40.66 |
| TrinityX (Mistral-7B) | 83.42 | 38.10 | 74.83 |
Relative to the H³Fusion baseline, TrinityX improves on all three axes: substantially higher win rate, a lower (safer) BeaverTails safety score, and markedly higher truthfulness.
Ablation studies show that removing the MoCaE module (replacing calibrated routing with uniform expert weights) substantially decreases the average score, indicating the necessity of calibrated routing. Eliminating either regularization term (entropy or KL) or the gating loss consistently reduces alignment metrics (e.g., SS degrades by 7 percentage points without the KL term). Disabling expert-specific low-rank adapters also degrades win rates. Top-1 routing under full regularization and gating consistently yields the best WR, SS, and TI.
Heatmaps of routing probabilities reveal that the model dynamically specializes: honesty-oriented prompts strongly activate the truthfulness expert, safety-critical inputs drive the harmlessness expert, and open-ended queries invoke the helpfulness expert. Regularization sharpens this specialization.
5. Generalization Across LLM Backbones
TrinityX has been evaluated for portability across multiple LLM backbones. Fine-tuning on Mistral-7B, Gemma-7B, and DeepSeek-7B with unchanged MoCaE routing achieves performance parity with the results seen on LLaMA-2. On the HoneSet stress test, TrinityX on DeepSeek-7B maintains strong win-rate, safety, truthfulness, and average scores. The router network, initially learned on LLaMA-2, generalizes to other architectures without re-engineering, supporting the robustness of MoCaE's calibrated gating.
6. Context and Implications for Alignment Research
TrinityX directly addresses intrinsic trade-offs in LLM alignment by explicitly encoding expert modularity for each HHH axis and regulating their fusion through a calibrated routing strategy. The composite training approach, entropic and temporal smoothing, and adapter-based expert definition together provide a practical blueprint for modular, multi-objective alignment. A plausible implication is that similar architectures may be extensible to other compositional alignment or skill-transfer challenges in large-scale models, subject to gating regularization and expert-stability considerations (Kashyap et al., 10 Sep 2025).