
Mixture of Causal Expert Transformer (MoCET)

Updated 6 February 2026
  • The paper introduces MoCET, a transformer architecture that assigns a dedicated, lightweight expert to each token position for strict causal dependency modeling.
  • It decouples global context modeling, performed once by a Meta Image Transformer, from positional dependency modeling, performed by specialized per-position transformer blocks, to enhance autoregressive token generation.
  • MoCET improves visual token coherence and conditional entropy reduction, leading to better sample quality and FID scores in image generation tasks.

The Mixture of Causal Expert Transformer (MoCET) is a position-specialized, ensemble-based transformer architecture developed for discrete token generation in visual tokenization workflows. Unlike conventional monolithic or mixture-of-experts (MoE) transformers that rely on learned soft gating, MoCET assigns a dedicated, lightweight transformer block to each token position, facilitating strict causal dependency modeling and alignment with autoregressive generation schemes. MoCET is a core component of NativeTok, a visual tokenization and image generation framework designed to enforce causal dependencies among image tokens, thereby improving generation coherence and downstream model performance (Wu et al., 30 Jan 2026).

1. Architectural Overview

MoCET resolves the joint token generation problem by maintaining a collection of transformer experts, each specialized for a single token position. Let $L$ denote the total number of tokens per image (e.g., $L = 32, 64, 128$). The expert pool is formalized as $\mathbb{T} = \{T_1, T_2, \dots, T_L\}$, with each $T_i$ a two-layer transformer block of hidden dimension $d_e$ (by default, $d_e = 256$).

During token generation, only the $i$-th expert $T_i$ is active when producing token $z_i$. This decomposition enables separation of:

  • Global context modeling, achieved once by a Meta Image Transformer (MIT), producing a fixed image representation $X_{\mathrm{latent}} \in \mathbb{R}^{N_{\mathrm{lat}} \times d_{\mathrm{lat}}}$.
  • Positional dependency modeling, performed by each $T_i$ based on prior tokens and the fixed latent.

For each step $i$:

$$h_i = T_i\big(\big[\, X_{\mathrm{latent}};\; z_1;\; \dots;\; z_{i-1};\; e_i \,\big]\big)$$

where $e_i \in \mathbb{R}^{d_e}$ is a mask or padding embedding for position $i$. The vector $h_i$ is then projected by a linear head $W_o \in \mathbb{R}^{d_e \times K}$ (with $K$ the codebook size) to produce token logits.
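To make the step-wise computation concrete, the following PyTorch sketch instantiates one lightweight expert per position and runs the step-$i$ computation above. All module names, layer counts, and sizes here are illustrative assumptions, not the released NativeTok implementation:

```python
import torch
import torch.nn as nn

d_e, K, L, N_lat = 256, 4096, 32, 64  # assumed sizes; K is the codebook size

class Expert(nn.Module):
    """Lightweight per-position expert (two-layer block, as in Section 1)."""
    def __init__(self, d_model=d_e, n_layers=2, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, seq):                # seq: (B, N_lat + i, d_e)
        return self.blocks(seq)[:, -1]     # hidden state at the query slot -> h_i

experts = nn.ModuleList(Expert() for _ in range(L))   # T_1 ... T_L
tok_embed = nn.Embedding(K, d_e)                      # embeds previous tokens
mask_embed = nn.Parameter(torch.randn(L, d_e))        # e_i, one per position
W_o = nn.Linear(d_e, K, bias=False)                   # output head W_o

def step(i, x_latent, prev_tokens):
    """Logits for token z_i given X_latent (B, N_lat, d_e) and z_1..z_{i-1}."""
    B = x_latent.shape[0]
    parts = [x_latent]
    if prev_tokens is not None:                       # (B, i) int64 indices
        parts.append(tok_embed(prev_tokens))
    parts.append(mask_embed[i].expand(B, 1, d_e))     # query slot e_i
    h_i = experts[i](torch.cat(parts, dim=1))         # only T_i is active
    return W_o(h_i)                                   # (B, K) token logits
```

A call such as `step(0, torch.randn(2, N_lat, d_e), None)` yields `(2, K)` logits for the first position.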

2. Causal Modeling and Factorization

MoCET enforces a strict causal factorization of the joint distribution over tokens, conditioned on the MIT-derived latent:

$$p(z_{1:L} \mid X_{\mathrm{latent}}) = \prod_{i=1}^{L} p(z_i \mid z_{<i},\, X_{\mathrm{latent}})$$

After computing $h_i$, the conditional distribution for $z_i$ is:

$$p(z_i = k \mid z_{<i},\, X_{\mathrm{latent}}) = \mathrm{Softmax}_k\big(W_o^\top h_i\big)$$

Token $z_i$ is selected either by maximum likelihood (teacher forcing during training) or via categorical sampling at inference.

This architecture permits each expert $T_i$ to specialize in modeling the conditional distribution for its target position, aligning pretraining with the needs of autoregressive and masked-token generators.
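A minimal sampling loop under this factorization (reusing the hypothetical `step` helper from the sketch in Section 1; greedy vs. categorical selection mirrors the training/inference distinction above) could look like:

```python
import torch

@torch.no_grad()
def generate(x_latent, L=32, greedy=False):
    """Sample z_1..z_L autoregressively: z_i ~ p(z_i | z_<i, X_latent)."""
    tokens = None
    for i in range(L):
        logits = step(i, x_latent, tokens)                 # (B, K)
        if greedy:
            z_i = logits.argmax(dim=-1, keepdim=True)      # max-likelihood pick
        else:
            probs = torch.softmax(logits, dim=-1)
            z_i = torch.multinomial(probs, num_samples=1)  # categorical sample
        tokens = z_i if tokens is None else torch.cat([tokens, z_i], dim=1)
    return tokens                                          # (B, L)
```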

3. Routing and Expert Assignment

Unlike standard Mixture-of-Experts architectures, which use dynamically learned soft gates, MoCET employs fixed, one-hot routing. Expert $T_i$ is exclusively responsible for token $z_i$, formalized via a gate vector $g_i \in \{0,1\}^L$ with $(g_i)_j = 1$ iff $j = i$, and $0$ otherwise:

$$z_i = \sum_{j=1}^{L} (g_i)_j \, T_j([\ldots]) = T_i([\ldots])$$

While soft or learned routing is theoretically possible (e.g., $\alpha_i = \mathrm{Softmax}\big(U\,[X_{\mathrm{latent}}; z_{<i}]\big)$), NativeTok’s instantiation of MoCET restricts each position to a dedicated expert, eliminating the ambiguity and overhead of dynamic expert allocation.
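The fixed gate can be written out explicitly; the short sketch below (illustrative only) shows why the gated sum degenerates to a single expert call, so the other $L-1$ experts never need to be evaluated:

```python
import torch

def route_one_hot(i, L, inputs, experts):
    """Fixed one-hot routing: gate g_i deterministically selects expert T_i."""
    g = torch.zeros(L)
    g[i] = 1.0                          # (g_i)_j = 1 iff j == i
    # Mathematically: sum_j (g_i)_j * T_j(inputs). Because the gate is
    # one-hot and fixed, only T_i ever needs to run:
    return experts[i](inputs)
```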

4. Training Objectives and Hierarchical Native Training

MoCET-based tokenization training in NativeTok optimizes two objectives concurrently:

  • Token cross-entropy loss over generated discrete tokens:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{L} \log p\big(z_i^* \mid z_{<i}^*,\, X_{\mathrm{latent}}\big)$$

with $z_i^*$ the ground-truth codebook index.

  • $\ell_2$ image reconstruction loss after passing ground-truth tokens through an image decoder $\mathrm{Dec}_\theta$:

$$\mathcal{L}_{\mathrm{rec}} = \big\| X - \mathrm{Dec}_\theta(z_{1:L}^*) \big\|_2^2$$

The aggregate objective is

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\, \mathcal{L}_{\mathrm{rec}}$$

where $\lambda$ trades off discrete-token accuracy and pixel-level reconstruction fidelity.
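In training code the two objectives combine directly. A hedged sketch, assuming teacher-forced logits for all positions, ground-truth indices `z_star`, and a decoder module (all hypothetical names, not the released code):

```python
import torch
import torch.nn.functional as F

def nativetok_loss(all_logits, z_star, x, decoder, lam=1.0):
    """L = L_CE + lambda * L_rec, mirroring the aggregate objective above.

    all_logits: (B, L, K)    teacher-forced logits for every position
    z_star:     (B, L)       ground-truth codebook indices
    x:          (B, C, H, W) input image
    decoder:    hypothetical module decoding token indices back to pixels
    """
    B, L, K = all_logits.shape
    ce = F.cross_entropy(all_logits.reshape(B * L, K), z_star.reshape(B * L))
    recon = decoder(z_star)            # Dec_theta applied to ground-truth tokens
    rec = F.mse_loss(recon, x)         # mean-squared l2 reconstruction loss
    return ce + lam * rec
```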

To scale to longer token sequences, Hierarchical Native Training (HNT) is adopted:

  • Train a base configuration (e.g., NativeTok$_{32}$) end-to-end.
  • Expand to a larger $L$ (e.g., 64) by freezing the MIT and the initial experts, weight-cloning them into new experts, and training only the new experts and the decoder (roughly 56% of the parameters; see the sketch after this list).
  • Optionally fine-tune all parameters in a final phase to harmonize the experts.
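A sketch of the expansion phase under stated assumptions (the `model.mit`, `model.experts`, and `model.decoder` attributes are hypothetical): freeze the MIT and the first 32 experts, weight-clone them into the new positions, and leave only the new experts and the decoder trainable.

```python
import copy

def expand_experts(model, old_L=32, new_L=64):
    """Hierarchical Native Training expansion from old_L to new_L experts."""
    # Freeze the shared encoder and the already-trained experts.
    for p in model.mit.parameters():
        p.requires_grad = False
    for expert in model.experts[:old_L]:
        for p in expert.parameters():
            p.requires_grad = False
    # Weight-clone existing experts into the new positions (cyclic reuse).
    for j in range(old_L, new_L):
        clone = copy.deepcopy(model.experts[j % old_L])
        for p in clone.parameters():
            p.requires_grad = True     # only the new experts train
        model.experts.append(clone)
    # The decoder stays trainable for the reconstruction objective.
    for p in model.decoder.parameters():
        p.requires_grad = True
    return model
```

Cloning trained experts into the new slots, rather than initializing them randomly, plausibly gives each new position a usable causal-modeling prior, which is what makes training only roughly half the parameters viable.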

5. Coordination with Meta Image Transformer

MoCET operates in conjunction with a Meta Image Transformer (MIT), a ViT-style encoder followed by an FNN for dimensionality adaptation. The workflow is:

$$H = \mathrm{MIT}(X) \in \mathbb{R}^{N_{\mathrm{lat}} \times d_{\mathrm{high}}}$$

$$X_{\mathrm{latent}} = \mathrm{FNN}(H) \in \mathbb{R}^{N_{\mathrm{lat}} \times d_e}$$

$X_{\mathrm{latent}}$ is fixed throughout MoCET token generation, enabling all experts to share access to a consistent, context-rich representation while maintaining strict autoregressive dependencies among tokens.
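The two-stage encoding can be sketched as follows (layer counts taken from Section 6; patch embedding and other details are omitted, and `MetaImageTransformer` is a stand-in, not the paper's exact module):

```python
import torch
import torch.nn as nn

class MetaImageTransformer(nn.Module):
    """ViT-style encoder + FNN projection producing the shared X_latent."""
    def __init__(self, d_high=1024, d_e=256, n_layers=18, n_heads=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_high, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.fnn = nn.Linear(d_high, d_e)   # dimensionality adaptation

    def forward(self, patches):             # patches: (B, N_lat, d_high)
        h = self.encoder(patches)           # H in R^{N_lat x d_high}
        return self.fnn(h)                  # X_latent in R^{N_lat x d_e}
```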

6. Implementation Characteristics and Performance

MoCET’s resource profile and speed are determined by the number of position-specific experts ($L$) and their parameterization. Notable details:

  • Number of experts: $L \in \{32, 64, 128\}$
  • Each expert: 1 transformer layer, $d_e = 256$, 4 attention heads
  • MIT: 18 layers, hidden dimension 1024, output compressed via FNN
  • Image decoder: 24 layers, hidden dimension 1024
  • Total parameters: 616M ($L=32$), 666M ($L=64$), 766M ($L=128$)
  • Generation complexity per token: $O\big((N_{\mathrm{lat}} + i)^2\, d_e\big)$
  • Encoding speed: approx. 120 samples/s ($L=32$), 54 samples/s ($L=64$), 39 samples/s ($L=128$) on 4× A800 GPUs

MoCET’s overhead remains small because $d_e \ll d_{\mathrm{high}}$ and $N_{\mathrm{lat}}$ is moderate.
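A back-of-the-envelope check of the per-token cost makes the point, using illustrative values for $N_{\mathrm{lat}}$ and $d_e$:

```python
# Illustrative: attention cost at step i scales as (N_lat + i)^2 * d_e.
N_lat, d_e, d_high, L = 64, 256, 1024, 32   # assumed values

per_step = [(N_lat + i) ** 2 * d_e for i in range(1, L + 1)]
print(f"step 1:  ~{per_step[0]:,} units")       # shortest context
print(f"step {L}: ~{per_step[-1]:,} units")     # longest context
# Even the final step is ~4x cheaper than one step at full width d_high:
print(f"full-width step: ~{(N_lat + L) ** 2 * d_high:,} units")
```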

7. Significance for Visual Tokenization and Generation

MoCET systematically addresses the mismatch between bidirectional encoders and autoregressive generators by specializing token generation through position-specific experts, trained on exact causal dependency structures. This approach avoids the need for unordered or weak relational token representations and aligns the first-stage tokenizer's outputs with the requirements of autoregressive or MaskGIT image generators. The result is reinforced coherence and reduced conditional entropy among tokens, which in turn translates to improved FID scores across both AR and masked-token prediction regimes (Wu et al., 30 Jan 2026). A plausible implication is that this structural alignment lessens the downstream generator's burden in learning statistical dependencies, with substantial benefits for sample quality and consistency.

