Mixture of Causal Expert Transformer (MoCET)
- The paper introduces MoCET, a transformer architecture that assigns a dedicated, lightweight expert to each token position for strict causal dependency modeling.
- It decouples global context modeling, handled once by a Meta Image Transformer, from positional dependency modeling, handled by specialized per-position transformer blocks, to enhance autoregressive token generation.
- MoCET improves visual token coherence and conditional entropy reduction, leading to better sample quality and FID scores in image generation tasks.
The Mixture of Causal Expert Transformer (MoCET) is a position-specialized, ensemble-based transformer architecture developed for discrete token generation in visual tokenization workflows. Unlike conventional monolithic or mixture-of-experts (MoE) transformers that rely on learned soft gating, MoCET assigns a dedicated, lightweight transformer block to each token position, facilitating strict causal dependency modeling and alignment with autoregressive generation schemes. MoCET constitutes a core component of NativeTok, a visual tokenization and image generation framework designed to enforce causal dependencies among image tokens, thereby improving generation coherence and downstream model performance (Wu et al., 30 Jan 2026).
1. Architectural Overview
MoCET resolves the joint token generation problem by maintaining a collection of transformer experts, each specialized for a single token position. Let $N$ denote the total number of tokens per image. The expert pool is formalized as $\{E_1, E_2, \dots, E_N\}$, with each $E_i$ a two-layer transformer block of hidden dimension $d$.
During token generation, only the $i$-th expert $E_i$ is active when producing token $t_i$. This decomposition enables separation of:
- Global context modeling, achieved once by a Meta Image Transformer (MIT), producing a fixed image representation $z$.
- Positional dependency modeling, performed by each $E_i$ based on the prior tokens $t_{<i}$ and the fixed latent $z$.
For each step $i$:

$$h_i = E_i\big([z;\; t_1, \dots, t_{i-1};\; m_i]\big),$$

where $m_i$ is a mask or padding embedding for position $i$. The vector $h_i$ is then projected by a linear head $W \in \mathbb{R}^{K \times d}$ (with $K$ as the codebook size) to produce token logits.
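The per-step computation can be made concrete in a short PyTorch sketch. All names (`experts`, `head`, `mask_emb`, `step`) and all sizes here are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
N, d, K = 32, 512, 8192   # tokens per image, expert hidden dim, codebook size

# One lightweight transformer block per token position
# (two layers, matching the default block depth described above).
experts = nn.ModuleList([
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
        num_layers=2,
    )
    for _ in range(N)
])
head = nn.Linear(d, K)                       # shared projection W to codebook logits
mask_emb = nn.Parameter(torch.zeros(N, d))   # per-position mask/padding embeddings m_i

def step(i: int, z: torch.Tensor, prev_tok_embs: torch.Tensor) -> torch.Tensor:
    """Logits for token t_i given the fixed latent z and embeddings of t_<i."""
    B = z.size(0)
    m_i = mask_emb[i].expand(B, 1, d)                # mask slot for position i
    seq = torch.cat([z, prev_tok_embs, m_i], dim=1)  # [z; t_1 .. t_{i-1}; m_i]
    h_i = experts[i](seq)[:, -1]                     # only E_i is active at step i
    return head(h_i)                                 # (B, K) token logits
```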
2. Causal Modeling and Factorization
MoCET enforces a strict causal factorization of the joint distribution over tokens, conditioned on the MIT-derived latent $z$:

$$p(t_1, \dots, t_N \mid z) = \prod_{i=1}^{N} p(t_i \mid t_{<i}, z).$$

After computing $h_i$, the conditional distribution for $t_i$ is:

$$p(t_i \mid t_{<i}, z) = \mathrm{softmax}(W h_i).$$
During training, ground-truth tokens are supplied by teacher forcing; at inference, token $t_i$ is drawn from this distribution via categorical sampling (or selected greedily by maximum likelihood).
This architecture permits each expert to specialize in modeling the conditional distribution for its target position, aligning pretraining with the needs of autoregressive and masked-token generators.
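Under this factorization, inference reduces to a sequential loop over experts. A minimal sampling sketch, reusing the hypothetical `step` helper from above together with an assumed codebook embedding `tok_emb`:

```python
@torch.no_grad()
def sample(z: torch.Tensor, tok_emb: nn.Embedding) -> torch.Tensor:
    """Draw t_1 .. t_N autoregressively from p(t_i | t_<i, z)."""
    B = z.size(0)
    prev = torch.empty(B, 0, d, device=z.device)   # no prior tokens at step 1
    tokens = []
    for i in range(N):                             # strict causal order
        logits = step(i, z, prev)                  # softmax(W h_i) defines p(t_i | ...)
        t_i = torch.distributions.Categorical(logits=logits).sample()
        tokens.append(t_i)
        prev = torch.cat([prev, tok_emb(t_i).unsqueeze(1)], dim=1)
    return torch.stack(tokens, dim=1)              # (B, N) codebook indices
```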
3. Routing and Expert Assignment
Contrary to standard Mixture-of-Experts architectures, which use dynamically learned soft gates, MoCET employs fixed, one-hot routing. Expert $E_i$ is exclusively responsible for token $t_i$, formalized via the gate vector $g^{(i)} \in \{0, 1\}^N$ with $g^{(i)}_j = 1$ iff $j = i$, and $0$ otherwise:

$$h_i = \sum_{j=1}^{N} g^{(i)}_j \, E_j(\cdot) = E_i(\cdot).$$
While soft or learned routing is theoretically possible (e.g., a softmax-weighted mixture over the expert pool), NativeTok’s instantiation of MoCET restricts each position to a dedicated expert, eliminating the ambiguity and overhead of dynamic expert allocation.
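The contrast can be stated in code: with a one-hot gate the MoE sum collapses to a single expert call, whereas a soft gate would evaluate every expert. Both routines below are schematic, and only the first reflects NativeTok's design:

```python
def route_one_hot(i: int, seq: torch.Tensor) -> torch.Tensor:
    # g^(i)_j = 1 iff j == i: the weighted sum over experts collapses to E_i alone.
    return experts[i](seq)[:, -1]

def route_soft(gate_net: nn.Module, seq: torch.Tensor) -> torch.Tensor:
    # Hypothetical learned alternative (not used in MoCET): a softmax gate
    # mixes every expert's output, costing N expert evaluations per step.
    g = torch.softmax(gate_net(seq[:, -1]), dim=-1)              # (B, N) gate weights
    outs = torch.stack([E(seq)[:, -1] for E in experts], dim=1)  # (B, N, d)
    return (g.unsqueeze(-1) * outs).sum(dim=1)                   # (B, d) mixed state
```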
4. Training Objectives and Hierarchical Native Training
MoCET-based tokenization training in NativeTok optimizes two objectives concurrently:
- Token cross-entropy loss over generated discrete tokens:

$$\mathcal{L}_{\text{token}} = -\sum_{i=1}^{N} \log p\big(t_i^{*} \mid t_{<i}^{*}, z\big),$$

with $t_i^{*}$ as the ground-truth codebook index.
- An image reconstruction loss after passing the ground-truth tokens through an image decoder $D$:

$$\mathcal{L}_{\text{rec}} = \big\lVert x - D(t_1^{*}, \dots, t_N^{*}) \big\rVert^2.$$
The aggregate objective is

$$\mathcal{L} = \mathcal{L}_{\text{token}} + \lambda \, \mathcal{L}_{\text{rec}},$$

where $\lambda$ trades off discrete-token accuracy and pixel-level reconstruction fidelity.
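A sketch of the aggregate objective under teacher forcing, building on the hypothetical helpers above; `decoder`, `lam`, and the squared-error form of the reconstruction term are assumptions:

```python
import torch.nn.functional as F

def training_loss(x, z, gt_tokens, tok_emb, decoder, lam=1.0):
    """L = L_token + lam * L_rec, with ground-truth tokens fed at every step."""
    token_loss = x.new_zeros(())
    for i in range(N):
        prev = tok_emb(gt_tokens[:, :i])     # teacher forcing: true t_<i as context
        logits = step(i, z, prev)
        token_loss = token_loss + F.cross_entropy(logits, gt_tokens[:, i])
    recon = decoder(tok_emb(gt_tokens))      # decode ground-truth tokens to pixels
    rec_loss = F.mse_loss(recon, x)          # assumed pixel-level loss form
    return token_loss + lam * rec_loss
```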
To scale to longer token sequences, Hierarchical Native Training (HNT) is adopted (a code sketch of the expansion step follows this list):
- Train a base NativeTok configuration with a smaller token count end-to-end.
- Expand to a larger $N$ (e.g., 64) by freezing the MIT and the initial experts, weight-cloning them into new experts, and training only the new experts and the decoder (so that ≈56% of parameters are trained).
- Optionally fine-tune all parameters in the final phase to ensure expert harmonization.
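A minimal sketch of the expansion step, assuming PyTorch modules `experts` (an `nn.ModuleList`), `mit`, and `decoder`; the choice to clone expert $j \bmod N_{\text{old}}$ is an illustrative assumption, as the cloning scheme is not detailed here:

```python
import copy

def expand_experts(experts, mit, decoder, new_N):
    """Grow the expert pool from len(experts) to new_N positions (HNT phase 2)."""
    old_N = len(experts)
    for p in mit.parameters():
        p.requires_grad_(False)              # freeze the global context encoder
    for E in experts:
        for p in E.parameters():
            p.requires_grad_(False)          # freeze the initial experts
    for j in range(old_N, new_N):
        clone = copy.deepcopy(experts[j % old_N])   # weight-clone an existing expert
        for p in clone.parameters():
            p.requires_grad_(True)           # only the new experts are trainable
        experts.append(clone)
    for p in decoder.parameters():
        p.requires_grad_(True)               # the decoder is trained alongside them
    return experts
```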
5. Coordination with Meta Image Transformer
MoCET operates in conjunction with a Meta Image Transformer (MIT), a ViT-style encoder followed by an FNN (feed-forward network) for dimensionality adaptation. The workflow is:

$$z = \mathrm{FNN}\big(\mathrm{MIT}(x)\big).$$

The latent $z$ is fixed throughout MoCET token generation, enabling all experts to share access to a consistent, context-rich representation while maintaining strict autoregressive dependencies among tokens.
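A compact sketch of this stage, with `vit` and `ffn` as assumed module handles:

```python
def encode_context(vit: nn.Module, ffn: nn.Module, x: torch.Tensor) -> torch.Tensor:
    feats = vit(x)    # ViT-style encoder: global, bidirectional image features
    z = ffn(feats)    # FNN adapts dimensionality to the expert width d
    return z          # computed once, then held fixed for all N expert steps
```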
6. Implementation Characteristics and Performance
MoCET’s resource profile and speed are determined by the number of position-specific experts ($N$) and their parameterization. Notable details:
- Number of experts: $N$, one per token position
- Each expert: 1 transformer layer, hidden dimension $d$, 4 attention heads
- MIT: 18 layers, hidden dimension 1024, output compressed via FNN
- Image decoder: 24 layers, hidden dimension 1024
- Total parameters: 616M, 666M, and 766M for the three evaluated token-count configurations (smallest to largest $N$)
- Generation complexity per token: a single lightweight expert forward pass, rather than a full deep stack
- Encoding speed: approx. 120, 54, and 39 samples/s for the same three configurations on A800 × 4

The small overhead from MoCET arises from its use of shallow, position-specific experts and a moderate hidden dimension $d$.
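For reference, the hyperparameters listed above can be gathered into a configuration record; `MoCETConfig` is a hypothetical name, `expert_dim` is an assumed value, and `num_experts` is left unset because the evaluated $N$ values are not reproduced here:

```python
from dataclasses import dataclass

@dataclass
class MoCETConfig:
    num_experts: int          # N: one expert per token position
    expert_layers: int = 1    # per the implementation list above
    expert_heads: int = 4
    expert_dim: int = 512     # assumed expert hidden dimension d
    mit_layers: int = 18
    mit_dim: int = 1024
    decoder_layers: int = 24
    decoder_dim: int = 1024
```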
7. Significance for Visual Tokenization and Generation
MoCET systematically addresses the mismatch between bidirectional encoders and autoregressive generators by specializing token generation through position-specific experts, trained on exact causal dependency structures. This approach avoids the need for unordered or weak relational token representations and aligns the first-stage tokenizer's outputs with the requirements of autoregressive or MaskGIT image generators. The result is reinforced coherence and reduced conditional entropy among tokens, which in turn translates to improved FID scores across both AR and masked-token prediction regimes (Wu et al., 30 Jan 2026). A plausible implication is that this structural alignment lessens the downstream generator's burden in learning statistical dependencies, with substantial benefits for sample quality and consistency.