BTX: Modular LLMs via Branch-Train-MiX

Updated 23 February 2026

BTX is a modular framework that constructs domain-specialized LLMs by merging independently trained experts into a sparse MoE architecture with fine-grained token routing.
Its pipeline involves branching a seed model, asynchronous domain-specific expert training, and merging via dynamic MoE composition that optimizes routing per token.
Empirical results indicate BTX achieves state-of-the-art accuracy-efficiency tradeoffs, extending applications in code, mathematics, world knowledge, and multilingual tasks.

Branch-Train-MiX (BTX) is a modular framework for constructing efficient, domain-specialized LLMs by merging independently trained expert branches into a unified sparse Mixture-of-Experts (MoE) architecture with fine-grained token routing. BTX integrates high-throughput asynchronous expert pretraining with dynamic MoE composition and token-level routing, offering state-of-the-art accuracy–efficiency tradeoffs and extensibility for diverse domains such as code, mathematical reasoning, world knowledge, and multilingual applications (Sukhbaatar et al., 2024, Chamma et al., 13 Dec 2025).

1. Conceptual Overview and Motivation

Branch-Train-MiX (BTX) addresses the challenge of equipping LLMs with capabilities across multiple specialized domains without incurring the scalability and retraining costs of conventional dense or monolithic architectures. BTX generalizes and supersedes both the Branch-Train-Merge (BTM) and sparse upcycling methods by allowing embarrassingly parallel domain-specific expert training and flexible, trainable expert mixture at inference time.

BTX first generates multiple expert models, each pre-trained further on a domain-specific corpus, and subsequently merges these experts by directly composing their feed-forward (FFN) weights into MoE sublayers. The non-expert parameters, such as self-attention and embedding matrices, are averaged. Fine-grained token routing is then learned during a short MoE finetuning phase, realizing joint expertise while maintaining high throughput and strong parameter efficiency (Sukhbaatar et al., 2024, Chamma et al., 13 Dec 2025). The MixtureKit software framework enables users to operationalize BTX and visualize per-token expert usage (Chamma et al., 13 Dec 2025).

2. BTX Pipeline and Methodology

BTX proceeds in three sequential stages:

2.1 Branch

Begin with a pretrained seed model $\mathcal{M}$ (e.g., Llama-2 7B).
Specify $N$ target domains (e.g., mathematics, code, factual knowledge).
Create $N$ copies $\{\mathcal{M}_i\}_{i=1}^N$ of the seed. Optionally, retain the seed as a "generalist" expert to yield $N+1$ branches.
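The branching step can be illustrated with a minimal sketch. It assumes, purely for illustration, that a model is a plain dictionary of parameter lists; the names `branch`, `seed_params`, and `keep_generalist` are hypothetical and do not come from the BTX paper or any library.

```python
import copy

def branch(seed_params, num_domains, keep_generalist=False):
    """Return independent deep copies of the seed parameters, one per domain."""
    branches = [copy.deepcopy(seed_params) for _ in range(num_domains)]
    if keep_generalist:
        # An extra, untrained copy of the seed serves as the "generalist" branch.
        branches.append(copy.deepcopy(seed_params))
    return branches

# Toy seed "model" (hypothetical parameter names and values).
seed = {"embed": [0.1, 0.2], "ffn.w1": [[0.3, 0.4], [0.5, 0.6]]}
branches = branch(seed, num_domains=3, keep_generalist=True)

# Modifying one branch leaves the seed and the other branches untouched.
branches[0]["embed"][0] = 9.9
```

Deep copies matter here: each branch must own its parameters so that later training of one expert cannot alter another.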

2.2 Train: Asynchronous Expert Training

Each branch $\mathcal{M}_i$ is independently trained (continued pretraining) on data $D_i$ for its assigned domain using the canonical language modeling loss:

$\mathcal{L}_{\rm LM}(\theta_i) = -\sum_{t=1}^T \log p_{\theta_i}(x_t \mid x_{<t})$

No inter-branch synchronization or communication is performed, enabling maximum parallel throughput and linear scaling with GPU resources. Failures in one branch do not impede others.
Empirically, roughly 200B tokens per branch are enough to yield significant domain expertise (e.g., code, math, knowledge) (Sukhbaatar et al., 2024).
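As a small worked instance of the language modeling loss above, the following sketch computes the negative log-likelihood from a list of per-token probabilities $p_{\theta_i}(x_t \mid x_{<t})$. The probabilities are made-up values, and `lm_loss` is a hypothetical helper rather than part of any training framework.

```python
import math

def lm_loss(token_probs):
    """Negative log-likelihood: -sum_t log p(x_t | x_<t)."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities a branch assigns to the three tokens of a sequence.
probs = [0.5, 0.25, 0.125]
loss = lm_loss(probs)  # equals 6 * ln(2), since the probabilities are 2^-1, 2^-2, 2^-3
```

Because each branch minimizes this loss on its own data $D_i$ only, the computation needs no information from any other branch.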

2.3 MiX: MoE Merging and Fine-Tuning

The feed-forward sublayers (FFNs) from each expert become the set of MoE experts for each Transformer layer.
Non-MoE parameters (self-attention, embedding, layer normalization) are averaged:

$\theta^{\rm SA}_\ell = \frac{1}{N} \sum_{i=1}^N \theta^{\rm SA}_{i,\ell}$

Each layer's MoE module is initialized; router parameters $W_\ell$ are randomly initialized.
MoE finetuning is then performed on the union of all domain data, training the router parameters and, optionally, applying slight updates to the backbone parameters. The MoE output at layer $\ell$ for token $x$ is:

$\mathtt{FF}_{\rm MoE}^\ell(x) = \sum_{i=1}^N g_i(W_\ell x) \;\mathtt{FF}_i^\ell(x)$

where $g$ is a sparse gating function implementing Top-$k$ routing. Load balancing is enforced via a Switch Transformer-style penalty with weight $\alpha \approx 0.01$ (Sukhbaatar et al., 2024).
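The merging rule of the MiX stage (average the non-expert parameters, keep each branch's FFN as a separate expert) can be sketched as follows. This is a toy illustration under stated assumptions: parameters are flat Python lists, and the names `mix`, `average`, and the keys `attn.wq` and `ffn.w1` are hypothetical.

```python
def average(tensors):
    """Element-wise mean of equally sized flat parameter vectors."""
    n = len(tensors)
    return [sum(vals) / n for vals in zip(*tensors)]

def mix(branches, expert_keys):
    """Merge branches: FFN parameters become an expert list, the rest is averaged."""
    merged = {}
    for name in branches[0]:
        if name in expert_keys:
            # Expert (FFN) parameters stay separate, one entry per branch.
            merged[name] = [b[name] for b in branches]
        else:
            # Non-expert parameters (attention, embeddings, norms) are averaged.
            merged[name] = average([b[name] for b in branches])
    return merged

# Two toy branches with made-up values.
branches = [
    {"attn.wq": [1.0, 2.0], "ffn.w1": [0.1, 0.2]},
    {"attn.wq": [3.0, 4.0], "ffn.w1": [0.3, 0.4]},
]
merged = mix(branches, expert_keys={"ffn.w1"})
```

The router weights are not produced by this merge; they are newly initialized and learned during MoE finetuning.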

3. Formal Model and Routing

BTX's distinctive MoE composition operates as follows:

Stage	Operation	Mathematical Formulation
Feed-forward expert	Per-branch FFN at layer $\ell$	$\mathtt{FF}_i^\ell(x) = W_{2,i}^\ell[\mathrm{GELU}(W_{1,i}^\ell x)]$
MoE Aggregation	Weighted mixture per token/layer	$\mathtt{FF}_{\rm MoE}^\ell(x) = \sum_{i=1}^N g_i(W_\ell x) \mathtt{FF}_i^\ell(x)$
Sparse Routing	Top-$k$ expert selection, softmax gating	$g(z) = \mathrm{Softmax}(\mathrm{TopK}(z)),\ z = W_\ell x$
Parameter Averaging	Non-expert parameter merge	$\theta^{\rm SA}_\ell = \frac{1}{N} \sum_{i=1}^N \theta^{\rm SA}_{i,\ell}$
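The routing and aggregation rows of the table can be made concrete with a short sketch of $g(z) = \mathrm{Softmax}(\mathrm{TopK}(z))$ and the resulting weighted mixture. The experts here are toy functions and the router matrix holds made-up numbers; `top_k_gate` and `moe_ffn` are hypothetical names, and the sketch assumes each expert maps a vector to a vector of the same length.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest router logits; all other gates are zero."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]

def moe_ffn(x, router_w, experts, k):
    """Compute sum_i g_i(W x) * FF_i(x), evaluating only the selected experts."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_w]
    gates = top_k_gate(logits, k)
    out = [0.0] * len(x)
    for g, expert in zip(gates, experts):
        if g > 0.0:  # experts outside the top-k are never evaluated
            y = expert(x)
            out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates

# Three toy experts standing in for per-branch FFNs.
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [0.0 for _ in v],
]
router_w = [[2.0, 0.0], [2.0, 0.0], [0.0, 1.0]]  # one row of W per expert
out, gates = moe_ffn([1.0, 0.0], router_w, experts, k=2)
```

In this example the first two experts receive equal logits, so each gets gate weight 0.5 and the third expert is skipped entirely.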

During MoE finetuning, the token-level router $W_\ell$ for each layer selects at most $k$ experts per token; only those experts are evaluated. The loss combines the standard language modeling criterion with a load-balance penalty, weighted by $\alpha$, to prevent expert underutilization:

$\mathcal{L} = \mathcal{L}_{\rm LM} + \alpha \sum_\ell \mathcal{L}_{\rm LB}^{(\ell)}$

Empirically, 20–80 billion finetuning tokens suffice to learn effective routing without eroding domain specialization (Sukhbaatar et al., 2024).
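To illustrate the load-balance term, the sketch below follows the Switch Transformer form $\alpha \, N \sum_i f_i P_i$, where $f_i$ is the fraction of tokens whose top-1 expert is $i$ and $P_i$ is the mean router probability assigned to expert $i$. It assumes top-1 dispatch for the counting step; whether BTX uses exactly this variant for Top-2 routing is not stated here, and `load_balance_loss` is a hypothetical helper.

```python
def load_balance_loss(router_probs, alpha=0.01):
    """Switch-style penalty alpha * N * sum_i f_i * P_i for one layer.

    router_probs: one softmax distribution over experts per token.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    f = [0.0] * num_experts  # fraction of tokens dispatched to each expert
    p_mean = [0.0] * num_experts  # mean router probability per expert
    for probs in router_probs:
        top1 = max(range(num_experts), key=lambda i: probs[i])
        f[top1] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            p_mean[i] += p / num_tokens
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p_mean))

# Made-up router outputs for two tokens and two experts.
balanced = load_balance_loss([[0.9, 0.1], [0.1, 0.9]])
collapsed = load_balance_loss([[0.9, 0.1], [0.9, 0.1]])
```

The penalty is smallest when tokens are spread evenly across experts and grows when the router collapses onto a single expert, which is the behavior the finetuning loss is meant to discourage.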

In MixtureKit’s extended BTX implementation (Chamma et al., 13 Dec 2025), routers can be placed at all three linear projections ("gate," "up," "down") of each FFN, enabling projection-specific, per-token, per-layer routing. Each projection $p$ in block $\ell$ is assigned its own router $W^{(\ell, p)}$, further increasing routing granularity.

4. Comparative Empirical Performance

Extensive benchmark results demonstrate BTX's superior accuracy–efficiency frontier relative to alternative approaches. Representative results (using Llama-2 7B as seed, with domain experts in math, code, knowledge):

Method	Active Params	Avg Score (5 cat.)	Wall Clock / GPU-days	MoE Compute Share
Llama-2 7B	6.7B	40.7	0	0%
Dense	6.7B	44.5	+—	100%
Sparse Upcycling	19.7B	46.3	7.9d/1,007	100%
Branch-Train-Merge	6.7B	43.4
BTX (Top-1)	6.7B	47.3	7.8d/926	23%
BTX (Top-2)	11.1B	47.9	7.8d/926	23%

BTX (Top-2) attains the highest average score (47.9), requiring only 23% of compute in the MoE stage and ingesting approximately twice as many tokens per total GPU-day as sparse upcycling approaches (Sukhbaatar et al., 2024). In script-specialized setups, BTX-based merged models (3×4B expanding to 6B active) can match or outperform substantially larger dense models on translation and transliteration metrics (Chamma et al., 13 Dec 2025).

5. Routing Architecture, Implementation, and Analysis

BTX's routing architecture differentiates itself from classical MoE frameworks:

Per-Sublayer Routing: MixtureKit extends BTX by assigning independent routers to every internal linear sub-projection (gate, up, down) within each FFN, permitting distinct experts for different projection components and maximizing token-level flexibility (Chamma et al., 13 Dec 2025).
Automated Model Surgery: MixtureKit enables automatic patching of base model configurations to insert MoE routers, aggregate expert FFNs, and update forward passes. Users specify the routing method (“btx”), expert set, and router layers via configuration.
Expert Usage Visualization: Tokenwise expert contributions $\overline{w}_{t,e}$ are visualized as color-coded overlays or bar plots, indicating dynamic expert specialization and identifying routing collapse or dead experts (Chamma et al., 13 Dec 2025).
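The kind of statistic behind such a visualization can be sketched in a few lines. The example assumes that $\overline{w}_{t,e}$ denotes the gate weight of expert $e$ for token $t$ averaged over layers; that reading, the threshold value, and the helper names `mean_usage` and `dead_experts` are illustrative assumptions and not MixtureKit's API.

```python
def mean_usage(gates):
    """gates[layer][token][expert] -> usage[token][expert], averaged over layers."""
    num_layers = len(gates)
    num_tokens = len(gates[0])
    num_experts = len(gates[0][0])
    return [
        [sum(gates[l][t][e] for l in range(num_layers)) / num_layers
         for e in range(num_experts)]
        for t in range(num_tokens)
    ]

def dead_experts(usage, threshold=1e-3):
    """Indices of experts whose mean usage over all tokens falls below threshold."""
    num_experts = len(usage[0])
    shares = [sum(row[e] for row in usage) / len(usage) for e in range(num_experts)]
    return [e for e, share in enumerate(shares) if share < threshold]

# Made-up gate weights: 2 layers, 2 tokens, 3 experts.
gates = [
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]],
    [[0.0, 1.0, 0.0], [0.5, 0.5, 0.0]],
]
usage = mean_usage(gates)
dead = dead_experts(usage)
```

Here the third expert never receives any weight, so it is flagged as dead; a plotting layer would render `usage` as per-token colors or bars.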

6. Applications, Guidelines, and Limitations

BTX is broadly applicable to multilingual, code-switching, domain adaptation, and continual learning scenarios:

Multilingual/code-switched modeling: Assign each language/script its own expert; BTX dynamically routes tokens.
Modular domain adaptation: Integrate domain-experts (e.g., legal, medical) with minimal backbone modification.
Continual integration: New domains can be added by training extra experts without revisiting earlier domains.
Scalability: Throughput scales linearly with available compute due to asynchronous expert training.
Router configuration: Trade-offs between memory/computation and flexibility include limiting router insertion to selected sublayers (e.g., "up_proj" only) and tuning the routing sparsity parameter $k$ for efficiency (e.g., $k=1$ for maximal sparsity, $k=2$ for practical LLMs) (Chamma et al., 13 Dec 2025).
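The router-placement trade-off can be quantified with a small counting sketch: one router per (layer, projection) pair, so restricting insertion to a single projection cuts the number of routers by a factor of three. The projection names `gate_proj` and `down_proj` are assumed by analogy with the `up_proj` name used above, and `router_keys` is a hypothetical helper, not a MixtureKit function.

```python
from itertools import product

# Assumed projection names; only "up_proj" appears verbatim in the text above.
PROJECTIONS = ("gate_proj", "up_proj", "down_proj")

def router_keys(num_layers, projections=PROJECTIONS):
    """One (layer, projection) key per router W^(l, p)."""
    return list(product(range(num_layers), projections))

# Toy 4-layer model: routers on every projection versus "up_proj" only.
full = router_keys(num_layers=4)
light = router_keys(num_layers=4, projections=("up_proj",))
```

With four layers this gives 12 routers in the full setting and 4 in the restricted one; router memory and routing compute scale accordingly.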

Limitations include experiments restricted to a small number of domains $N$, the constraint that each expert is tied to a single domain, and the absence of instruction tuning or RLHF integration, which has not yet been investigated (Sukhbaatar et al., 2024).

7. Theoretical Generalizations and Future Research

BTX unifies core strengths of "branch-train-merge" and sparse MoE paradigms, supporting further generalization:

BTX recovers branch-train-merge when MoE finetuning and learned routing are omitted.
BTX specializes to sparse upcycling when asynchronous expert training is skipped.
Potential advances include scaling BTX to more domains, incorporating instruction tuning, and exploring router designs with capacity or load balancing constraints.

BTX represents a modular strategy for scaling LLM capabilities via compositional expert training and sparse MoE routing, with empirical superiority demonstrated across a range of specialization benchmarks (Sukhbaatar et al., 2024, Chamma et al., 13 Dec 2025).


References

Sukhbaatar et al. (2024). Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM.

Chamma et al. (2025). MixtureKit: A General Framework for Composing, Training, and Visualizing Mixture-of-Experts Models.
