Arcee Trinity Large: Sparse MoE LLM
- Arcee Trinity Large is a sparse Mixture-of-Experts LLM with a 400B parameter decoder-only Transformer architecture and 3.25% active density.
- It employs advanced techniques including sigmoid routing, interleaved local/global attention, and SMEBU load balancing to stabilize expert bias updates.
- Pre-trained on 17 trillion tokens, it excels on benchmarks across code, math, and reasoning, with all checkpoints publicly available for research.
Arcee Trinity Large is a sparse Mixture-of-Experts (MoE) LLM featuring a 400 billion parameter decoder-only Transformer architecture, with only 13 billion parameters activated per token via its extreme MoE sparsity (3.25% active density). It is the flagship member of the Trinity model family, which also includes Trinity Nano (6B total, 1B activated) and Trinity Mini (26B total, 3B activated). Trinity Large’s architecture introduces interleaved local and global attention layers, gated attention, depth-scaled sandwich normalization, sigmoid routing for MoE expert selection, and the novel Soft-clamped Momentum Expert Bias Updates (SMEBU) load-balancing strategy. The model was pre-trained on 17 trillion tokens with zero loss spikes, and all model checkpoints are publicly hosted.
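The stated sparsity figures follow directly from the parameter counts above. A quick check of the active-parameter density for each family member (names here are illustrative helpers, not part of any released code):

```python
# Active-parameter density = activated params / total params, per the counts above.
def active_density(total_b: float, active_b: float) -> float:
    """Fraction of parameters activated per token (inputs in billions)."""
    return active_b / total_b

trinity_large = active_density(400, 13)  # 0.0325 -> the 3.25% quoted above
trinity_mini = active_density(26, 3)
trinity_nano = active_density(6, 1)
print(f"Large: {trinity_large:.2%}, Mini: {trinity_mini:.2%}, Nano: {trinity_nano:.2%}")
```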
1. Model Architecture and MoE Topology
Trinity Large is structured as a 66-layer decoder-only Transformer. The architecture consists of an initial 6 dense (standard) layers, followed by 60 Transformer layers with interspersed MoE layers. Within each MoE layer, expert structure is as follows:
- 1 shared expert and 256 routed experts.
- Only a top-$k$ subset of the routed experts is activated per token.
- A feed-forward network (FFN) hidden dimension of 3072, used by both the expert FFNs and the dense-layer FFNs.
- Output of an MoE layer: $h_t' = h_t + \mathrm{FFN}_{\mathrm{shared}}(h_t) + \sum_{i=1}^{N_r} g_{i,t}\,\mathrm{FFN}_i(h_t)$, where $g_{i,t}$ is the gating weight of routed expert $i$ for token $t$.
Expert selection is governed by sigmoid routing, with routing scores computed as $s_{i,t} = \sigma(u_t^{\top} e_i)$ and top-$k$ selection performed on the bias-adjusted scores $s_{i,t} + b_i$. The gating formula is $g_{i,t} = s_{i,t} / \sum_{j \in \mathrm{TopK}} s_{j,t}$, where $g_{i,t}$ is nonzero only for the top-ranked, bias-adjusted experts.
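The routing-plus-combination step can be sketched in a few lines. This is a hedged toy implementation: the expert count (16), top-$k$ value, and all layer shapes are illustrative, not the real model's 256 routed experts, and the FFNs are stand-in linear maps.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(h, router_w, expert_bias, experts, shared_expert, k=4):
    """Toy sigmoid-routed MoE layer (sizes and k are illustrative).

    h:           (d,) token hidden state
    router_w:    (n_experts, d) expert centroids e_i
    expert_bias: (n_experts,) load-balancing bias b_i (selection only, not gating)
    experts:     list of per-expert FFN callables
    """
    scores = 1.0 / (1.0 + np.exp(-(router_w @ h)))  # s_i = sigmoid(u . e_i)
    top = np.argsort(scores + expert_bias)[-k:]     # top-k on bias-adjusted scores
    gates = scores[top] / scores[top].sum()         # normalize over selected experts only
    out = shared_expert(h)                          # shared expert is always active
    for g, i in zip(gates, top):
        out = out + g * experts[i](h)
    return h + out                                  # residual connection

# Toy instantiation: 16 routed experts with small random linear "FFNs".
d, n = 32, 16
ws = [rng.standard_normal((d, d)) * 0.01 for _ in range(n)]
experts = [lambda x, w=w: w @ x for w in ws]
shared = lambda x: 0.01 * x
y = moe_layer(rng.standard_normal(d), rng.standard_normal((n, d)),
              np.zeros(n), experts, shared, k=4)
print(y.shape)  # (32,)
```

Note that the bias $b_i$ influences only which experts are selected; the gate values themselves are computed from the raw sigmoid scores, so load balancing does not distort the mixture weights.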
Interleaved local/global attention is configured such that every block of four layers contains three local attention layers (sliding window size 4096 with rotary position embedding, RoPE) and one global attention layer (full causal attention, unparameterized for position). Grouped-query attention (GQA) uses 48 query heads and 8 key/value heads.
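The layer interleaving and the two mask types can be made concrete as follows. This is a sketch under stated assumptions: the position of the global layer within each four-layer block is assumed, and the tiny window/sequence sizes are for illustration only.

```python
import numpy as np

def layer_is_global(layer_idx):
    """One global layer per block of four; its position in the block is assumed."""
    return layer_idx % 4 == 3

def attention_mask(seq_len, window=None):
    """Boolean mask: entry [i, j] is True iff query i may attend to key j.

    window=None -> full causal attention (global layers);
    window=W    -> causal sliding window of size W (local layers).
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if window is None:
        return causal
    return causal & (i - j < window)

local_mask = attention_mask(6, window=3)   # real model uses window 4096
global_mask = attention_mask(6)
print(local_mask.astype(int))
```

In the real configuration, the GQA arrangement means each of the 8 key/value heads is shared by 6 of the 48 query heads, shrinking the KV cache by 6x relative to full multi-head attention.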
Gated attention is applied after scaled dot-product attention using element-wise sigmoid gates, $\tilde{o}_t = \sigma(W_g x_t) \odot o_t$, which modulates the attention output $o_t$ and stabilizes activations for improved long-sequence generalization.
Depth-scaled sandwich normalization incorporates RMSNorm before and after each sublayer's core function. The second RMSNorm's gain is initialized to a value that decreases with network depth, which controls output growth across layers.
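A minimal sketch of the sandwich-norm pattern is below. The report's exact depth-dependent initialization was not preserved in the text above, so the $1/\sqrt{2L}$ scaling here is an illustrative assumption, not the model's actual value.

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    """RMS normalization with a scalar gain."""
    return gain * x / np.sqrt(np.mean(x * x) + eps)

def sandwich_sublayer(x, f, n_layers):
    """Sandwich normalization: RMSNorm before and after the core function f.

    The post-norm gain decays with total depth n_layers; the 1/sqrt(2L)
    form is an assumed stand-in for the report's initialization.
    """
    depth_gain = 1.0 / np.sqrt(2 * n_layers)   # assumed depth scaling
    pre = rmsnorm(x, gain=1.0)
    return x + rmsnorm(f(pre), gain=depth_gain)

x = np.ones(8)
out = sandwich_sublayer(x, lambda v: v * 2.0, n_layers=66)
print(out.shape)  # (8,)
```

Because the residual branch is renormalized and then scaled down by the depth-dependent gain, each sublayer's contribution to the residual stream is bounded regardless of what $f$ produces, which is the property credited with suppressing activation blow-up in deep stacks.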
2. Sigmoid Routing and Expert Selection
The MoE routing mechanism employs sigmoid routing for expert selection. For each token $t$ and routed expert $i$, sigmoid routing computes the score $s_{i,t} = \sigma(u_t^{\top} e_i)$ and then selects the top-$k$ experts according to the bias-adjusted score $s_{i,t} + b_i$. The normalized MoE gating weights are applied only to these selected experts.
This routing procedure provides smoother selection compared to softmax gating, with bias parameters facilitating load balancing across experts. The inclusion of a shared expert ensures consistent universal computation, even as expert activation varies by token.
3. SMEBU: Soft-Clamped Momentum Expert Bias Updates
Trinity Large introduces the SMEBU algorithm to address instability and oscillation in standard MoE expert bias updates, particularly with hundreds of experts. Standard additive update schemes, which increment or decrement each expert's bias by a fixed step according to its load, were observed to oscillate and destabilize routing in large, sparse MoE models.
SMEBU improves upon this via:
- Normalized load violation: each expert's deviation from its target load is measured relative to the mean expert load, rather than in absolute token counts.
- Soft-clamping with scaling: the violation signal is passed through a saturating clamp, so extreme imbalances cannot produce unbounded bias updates.
- Zero-centering: clamped updates are shifted to zero mean across experts, preventing drift of the overall bias vector.
- Momentum smoothing: bias updates are exponentially smoothed across steps, damping oscillation.
The report specifies typical values for the soft-clamp scale, momentum coefficient, and update step size used for Trinity Large.
This approach stabilizes router dynamics and enables load-balanced training without auxiliary losses, which is crucial for large sparse MoEs.
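The four steps above compose into a short update rule. The following is a hedged sketch only: the exact functional forms (the tanh clamp, the update direction, and all hyperparameter values) are assumptions for illustration, not the report's definitions.

```python
import numpy as np

def smebu_update(bias, momentum, loads, alpha=0.9, gamma=1.0, eta=1e-3):
    """Hedged sketch of an SMEBU-style bias update (forms and values assumed).

    bias:     (n,) per-expert routing biases b_i
    momentum: (n,) running momentum state
    loads:    (n,) fraction of tokens routed to each expert this step
    """
    target = 1.0 / len(loads)
    violation = (loads - target) / target               # normalized load violation
    clamped = gamma * np.tanh(violation / gamma)        # soft clamp (tanh assumed)
    clamped = clamped - clamped.mean()                  # zero-centering
    momentum = alpha * momentum + (1 - alpha) * clamped # momentum smoothing
    bias = bias - eta * momentum                        # lower bias for overloaded experts
    return bias, momentum

# Toy step: expert 0 is overloaded, so its bias should fall relative to the rest.
b, m = smebu_update(np.zeros(4), np.zeros(4), np.array([0.4, 0.2, 0.2, 0.2]))
```

Because the clamp saturates and the update is zero-centered, a single badly imbalanced step can shift relative expert preferences only by a bounded amount, which is the intended contrast with fixed-step sign updates.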
4. Training Regimen and Data
Training utilized the Muon optimizer for the Transformer hidden layers and AdamW for the embedding and LM-head parameters, with a learning-rate adjustment applied per update.
Schedule specifics:
- Linear warmup over 2,000 steps to separate peak learning rates for Muon and AdamW.
- Batch size of 12,288 sequences of length 8,192, increased to 16,384 sequences after 4.9 trillion tokens.
- Decay follows cosine annealing to one tenth of the peak rate, with further decay for context extension.
- The model was pre-trained on 17 trillion tokens drawn from a 20T-token mixture of web, math, code, STEM, multilingual, and synthetic data (including 6.5T web rephrasings, 1T multilingual, and 0.8T code).
- A 200K-token BPE tokenizer and Random Sequential Document Buffer (RSDB) mitigated minibatch imbalance.
- Trinity Large exhibited zero loss spikes throughout training.
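The warmup-plus-cosine schedule above can be written down directly. Since the peak learning-rate values were not preserved in the text, the sketch takes the peak as a parameter; the total-step count here is illustrative.

```python
import math

def lr_at(step, total_steps, peak, warmup=2000, floor_frac=0.1):
    """Linear warmup over `warmup` steps, then cosine anneal to floor_frac * peak."""
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)   # progress in [0, 1]
    floor = floor_frac * peak
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * t))

# Shape check with peak normalized to 1.0:
print(lr_at(0, 100_000, 1.0), lr_at(2000, 100_000, 1.0), lr_at(100_000, 100_000, 1.0))
```

The same function covers both optimizer groups by passing each group's own peak; the further decay used for context extension would be an additional phase on top of this.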
5. Evaluation, Benchmarks, and Inference
Trinity Large demonstrated strong performance across code, math, reasoning, and knowledge benchmarks. Table 1 compiles key metrics from the report.
| Benchmark | Base Trinity Large | Instruct-Tuned Preview |
|---|---|---|
| MBPP+ | 88.62 | — |
| MATH500 (Minerva) | 65.20 | — |
| HellaSwag (5-shot) | 90.11 | — |
| WinoGrande (5-shot) | 80.82 | — |
| MMLU (5-shot) | 82.58 | 87.21 |
| MMLU-Pro | 66.02 | 75.25 |
| TriviaQA (5-shot) | 83.30 | — |
| ARC-Challenge (0-shot) | 65.44 | — |
| BBH (few-shot) | 65.70 | — |
| GPQA Diamond | 43.94 | 63.32 |
| SimpleQA | — | 23.92 |
| AIME25 | — | 24.36 |
Inference throughput (vLLM, FP8 quantized on 8×H200) outperforms similarly sized open models, attributed to the model's MoE sparsity and the interleaved local/global attention design.
6. Model Availability and Variants
All Trinity family model checkpoints, including Trinity Large (base and instruct-tuned preview) as well as Trinity Mini and Trinity Nano, are publicly available at https://huggingface.co/arcee-ai. Smaller variants retain core architectural strategies (MoE, interleaved attention, normalization), differing primarily in total/activated parameter count and training token volume.
A plausible implication is that the publicly available checkpoints and robust training regimen (17T tokens, zero loss spikes) may facilitate broad evaluation, adaptation, and further research on extreme MoE LLMs in both academic and applied contexts (Singh et al., 19 Feb 2026).