Arcee Trinity Large: Sparse MoE LLM
- Arcee Trinity Large is a sparse Mixture-of-Experts LLM with a 400B parameter decoder-only Transformer architecture and 3.25% active density.
- It employs advanced techniques including sigmoid routing, interleaved local/global attention, and SMEBU load balancing to stabilize expert bias updates.
- Pre-trained on 17 trillion tokens, it excels on benchmarks across code, math, and reasoning, with all checkpoints publicly available for research.
Arcee Trinity Large is a sparse Mixture-of-Experts (MoE) LLM featuring a 400 billion parameter decoder-only Transformer architecture, with only 13 billion parameters activated per token via its extreme MoE sparsity (3.25% active density). It is the flagship member of the Trinity model family, which also includes Trinity Nano (6B total, 1B activated) and Trinity Mini (26B total, 3B activated). Trinity Large’s architecture introduces interleaved local and global attention layers, gated attention, depth-scaled sandwich normalization, sigmoid routing for MoE expert selection, and the novel Soft-clamped Momentum Expert Bias Updates (SMEBU) load-balancing strategy. The model was pre-trained on 17 trillion tokens with zero loss spikes, and all model checkpoints are publicly hosted.
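The stated sparsity figures follow directly from the parameter counts above. A quick check of the active-parameter density for each family member (names here are illustrative helpers, not part of any released code):

```python
# Active-parameter density = activated params / total params, per the counts above.
def active_density(total_b: float, active_b: float) -> float:
    """Fraction of parameters activated per token (inputs in billions)."""
    return active_b / total_b

trinity_large = active_density(400, 13)  # 0.0325 -> the 3.25% quoted above
trinity_mini = active_density(26, 3)
trinity_nano = active_density(6, 1)
print(f"Large: {trinity_large:.2%}, Mini: {trinity_mini:.2%}, Nano: {trinity_nano:.2%}")
```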
1. Model Architecture and MoE Topology
Trinity Large is structured as a 66-layer decoder-only Transformer. The architecture consists of an initial 6 dense (standard) layers, followed by 60 Transformer layers with interspersed MoE layers. Within each MoE layer, expert structure is as follows:
- 1 shared expert and 256 routed experts.
- Only a top-$k$ subset of the routed experts is activated per token.
- A feed-forward network (FFN) hidden dimension of 3072, used by both the expert FFNs and the dense-layer FFNs.
- Output of an MoE layer: $h_t' = h_t + \mathrm{FFN}_{\mathrm{shared}}(h_t) + \sum_{i=1}^{N_r} g_{i,t}\,\mathrm{FFN}_i(h_t)$, where $g_{i,t}$ is the gating weight of routed expert $i$ for token $t$.
Expert selection is governed by sigmoid routing, with routing scores computed as $s_{i,t} = \sigma(u_t^{\top} e_i)$ and top-$k$ selection performed on the bias-adjusted scores $s_{i,t} + b_i$. The gating formula is $g_{i,t} = s_{i,t} / \sum_{j \in \mathrm{TopK}} s_{j,t}$, where $g_{i,t}$ is nonzero only for the top-ranked, bias-adjusted experts.
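The routing-plus-combination step can be sketched in a few lines. This is a hedged toy implementation: the expert count (16), top-$k$ value, and all layer shapes are illustrative, not the real model's 256 routed experts, and the FFNs are stand-in linear maps.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(h, router_w, expert_bias, experts, shared_expert, k=4):
    """Toy sigmoid-routed MoE layer (sizes and k are illustrative).

    h:           (d,) token hidden state
    router_w:    (n_experts, d) expert centroids e_i
    expert_bias: (n_experts,) load-balancing bias b_i (selection only, not gating)
    experts:     list of per-expert FFN callables
    """
    scores = 1.0 / (1.0 + np.exp(-(router_w @ h)))  # s_i = sigmoid(u . e_i)
    top = np.argsort(scores + expert_bias)[-k:]     # top-k on bias-adjusted scores
    gates = scores[top] / scores[top].sum()         # normalize over selected experts only
    out = shared_expert(h)                          # shared expert is always active
    for g, i in zip(gates, top):
        out = out + g * experts[i](h)
    return h + out                                  # residual connection

# Toy instantiation: 16 routed experts with small random linear "FFNs".
d, n = 32, 16
ws = [rng.standard_normal((d, d)) * 0.01 for _ in range(n)]
experts = [lambda x, w=w: w @ x for w in ws]
shared = lambda x: 0.01 * x
y = moe_layer(rng.standard_normal(d), rng.standard_normal((n, d)),
              np.zeros(n), experts, shared, k=4)
print(y.shape)  # (32,)
```

Note that the bias $b_i$ influences only which experts are selected; the gate values themselves are computed from the raw sigmoid scores, so load balancing does not distort the mixture weights.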
Interleaved local/global attention is configured such that every block of four layers contains three local attention layers (sliding window size 4096 with rotary position embedding, RoPE) and one global attention layer (full causal attention, unparameterized for position). Grouped-query attention (GQA) uses 48 query heads and 8 key/value heads.
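The layer interleaving and the two mask types can be made concrete as follows. This is a sketch under stated assumptions: the position of the global layer within each four-layer block is assumed, and the tiny window/sequence sizes are for illustration only.

```python
import numpy as np

def layer_is_global(layer_idx):
    """One global layer per block of four; its position in the block is assumed."""
    return layer_idx % 4 == 3

def attention_mask(seq_len, window=None):
    """Boolean mask: entry [i, j] is True iff query i may attend to key j.

    window=None -> full causal attention (global layers);
    window=W    -> causal sliding window of size W (local layers).
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if window is None:
        return causal
    return causal & (i - j < window)

local_mask = attention_mask(6, window=3)   # real model uses window 4096
global_mask = attention_mask(6)
print(local_mask.astype(int))
```

In the real configuration, the GQA arrangement means each of the 8 key/value heads is shared by 6 of the 48 query heads, shrinking the KV cache by 6x relative to full multi-head attention.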
Gated attention is applied after scaled dot-product attention using element-wise sigmoid gates, $\tilde{o}_t = \sigma(W_g x_t) \odot o_t$, which modulates the attention output $o_t$ and stabilizes activations for improved long-sequence generalization.
Depth-scaled sandwich normalization incorporates RMSNorm before and after each sublayer's core function. The second RMSNorm's gain is initialized to a value that decreases with network depth, which controls output growth across layers.
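A minimal sketch of the sandwich-norm pattern is below. The report's exact depth-dependent initialization was not preserved in the text above, so the $1/\sqrt{2L}$ scaling here is an illustrative assumption, not the model's actual value.

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    """RMS normalization with a scalar gain."""
    return gain * x / np.sqrt(np.mean(x * x) + eps)

def sandwich_sublayer(x, f, n_layers):
    """Sandwich normalization: RMSNorm before and after the core function f.

    The post-norm gain decays with total depth n_layers; the 1/sqrt(2L)
    form is an assumed stand-in for the report's initialization.
    """
    depth_gain = 1.0 / np.sqrt(2 * n_layers)   # assumed depth scaling
    pre = rmsnorm(x, gain=1.0)
    return x + rmsnorm(f(pre), gain=depth_gain)

x = np.ones(8)
out = sandwich_sublayer(x, lambda v: v * 2.0, n_layers=66)
print(out.shape)  # (8,)
```

Because the residual branch is renormalized and then scaled down by the depth-dependent gain, each sublayer's contribution to the residual stream is bounded regardless of what $f$ produces, which is the property credited with suppressing activation blow-up in deep stacks.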
2. Sigmoid Routing and Expert Selection
The MoE routing mechanism employs sigmoid routing for expert selection. For each token $t$ and routed expert $i$, sigmoid routing computes the score $s_{i,t} = \sigma(u_t^{\top} e_i)$ and then selects the top-$k$ experts according to the bias-adjusted score $s_{i,t} + b_i$. The normalized MoE gating weights are applied only to these selected experts.
This routing procedure provides smoother selection compared to softmax gating, with bias parameters facilitating load balancing across experts. The inclusion of a shared expert ensures consistent universal computation, even as expert activation varies by token.
3. SMEBU: Soft-Clamped Momentum Expert Bias Updates
Trinity Large introduces the SMEBU algorithm to address instability and oscillation in standard MoE expert bias updates, particularly with hundreds of experts. Standard additive update schemes, which increment or decrement each expert's bias by a fixed step according to its load, were observed to oscillate and destabilize routing in large, sparse MoE models.
SMEBU improves upon this via:
- Normalized load violation: each expert's deviation from its target load is measured relative to the mean expert load, rather than in absolute token counts.
- Soft-clamping with scaling: the violation signal is passed through a saturating clamp, so extreme imbalances cannot produce unbounded bias updates.
- Zero-centering: clamped updates are shifted to zero mean across experts, preventing drift of the overall bias vector.
- Momentum smoothing: bias updates are exponentially smoothed across steps, damping oscillation.
The report specifies typical values for the soft-clamp scale, momentum coefficient, and update step size used for Trinity Large.
This approach stabilizes router dynamics and enables load-balanced training without auxiliary losses, which is crucial for large sparse MoEs.
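The four steps above compose into a short update rule. The following is a hedged sketch only: the exact functional forms (the tanh clamp, the update direction, and all hyperparameter values) are assumptions for illustration, not the report's definitions.

```python
import numpy as np

def smebu_update(bias, momentum, loads, alpha=0.9, gamma=1.0, eta=1e-3):
    """Hedged sketch of an SMEBU-style bias update (forms and values assumed).

    bias:     (n,) per-expert routing biases b_i
    momentum: (n,) running momentum state
    loads:    (n,) fraction of tokens routed to each expert this step
    """
    target = 1.0 / len(loads)
    violation = (loads - target) / target               # normalized load violation
    clamped = gamma * np.tanh(violation / gamma)        # soft clamp (tanh assumed)
    clamped = clamped - clamped.mean()                  # zero-centering
    momentum = alpha * momentum + (1 - alpha) * clamped # momentum smoothing
    bias = bias - eta * momentum                        # lower bias for overloaded experts
    return bias, momentum

# Toy step: expert 0 is overloaded, so its bias should fall relative to the rest.
b, m = smebu_update(np.zeros(4), np.zeros(4), np.array([0.4, 0.2, 0.2, 0.2]))
```

Because the clamp saturates and the update is zero-centered, a single badly imbalanced step can shift relative expert preferences only by a bounded amount, which is the intended contrast with fixed-step sign updates.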
4. Training Regimen and Data
Training utilized the Muon optimizer for the Transformer hidden layers and AdamW for the embedding and LM-head parameters, with a learning-rate adjustment applied per update.
Schedule specifics:
- Linear warmup over 2,000 steps to separate peak learning rates for Muon and AdamW.
- Batch size of 12,288 sequences of length 8,192, increased to 16,384 sequences after 4.9 trillion tokens.
- Decay follows cosine annealing to one tenth of the peak rate, with further decay for context extension.
- The model was pre-trained on 17 trillion tokens drawn from a 20T-token mixture of web, math, code, STEM, multilingual, and synthetic data (including 6.5T web rephrasings, 1T multilingual, and 0.8T code).
- A 200K-token BPE tokenizer and Random Sequential Document Buffer (RSDB) mitigated minibatch imbalance.
- Trinity Large exhibited zero loss spikes throughout training.
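The warmup-plus-cosine schedule above can be written down directly. Since the peak learning-rate values were not preserved in the text, the sketch takes the peak as a parameter; the total-step count here is illustrative.

```python
import math

def lr_at(step, total_steps, peak, warmup=2000, floor_frac=0.1):
    """Linear warmup over `warmup` steps, then cosine anneal to floor_frac * peak."""
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)   # progress in [0, 1]
    floor = floor_frac * peak
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * t))

# Shape check with peak normalized to 1.0:
print(lr_at(0, 100_000, 1.0), lr_at(2000, 100_000, 1.0), lr_at(100_000, 100_000, 1.0))
```

The same function covers both optimizer groups by passing each group's own peak; the further decay used for context extension would be an additional phase on top of this.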
5. Evaluation, Benchmarks, and Inference
Trinity Large demonstrated strong performance across code, math, reasoning, and knowledge benchmarks. Table 1 compiles key metrics from the report.
| Benchmark | Base Trinity Large | Instruct-Tuned Preview |
|---|---|---|
| MBPP+ | 88.62 | — |
| MATH500 (Minerva) | 65.20 | — |
| HellaSwag (5-shot) | 90.11 | — |
| WinoGrande (5-shot) | 80.82 | — |
| MMLU (5-shot) | 82.58 | 87.21 |
| MMLU-Pro | 66.02 | 75.25 |
| TriviaQA (5-shot) | 83.30 | — |
| ARC-Challenge (0-shot) | 65.44 | — |
| BBH (few-shot) | 65.70 | — |
| GPQA Diamond | 43.94 | 63.32 |
| SimpleQA | — | 23.92 |
| AIME25 | — | 24.36 |
Inference throughput (vLLM, FP8 quantized on 8×H200) outperforms similarly sized open models, attributed to the model's MoE sparsity and the interleaved local/global attention design.
6. Model Availability and Variants
All Trinity family model checkpoints, including Trinity Large (base and instruct-tuned preview) as well as Trinity Mini and Trinity Nano, are publicly available at https://huggingface.co/arcee-ai. Smaller variants retain core architectural strategies (MoE, interleaved attention, normalization), differing primarily in total/activated parameter count and training token volume.
A plausible implication is that the publicly available checkpoints and robust training regimen (17T tokens, zero loss spikes) may facilitate broad evaluation, adaptation, and further research on extreme MoE LLMs in both academic and applied contexts (Singh et al., 19 Feb 2026).