Mixture of Causal Expert Transformer (MoCET)
- The paper introduces MoCET, a transformer architecture that assigns a dedicated, lightweight expert to each token position for strict causal dependency modeling.
- It decouples global context modeling, handled once by a Meta Image Transformer, from positional dependency modeling, handled by specialized per-position transformer blocks, to enhance autoregressive token generation.
- MoCET improves visual token coherence and conditional entropy reduction, leading to better sample quality and FID scores in image generation tasks.
The Mixture of Causal Expert Transformer (MoCET) is a position-specialized, ensemble-based transformer architecture developed for discrete token generation in visual tokenization workflows. Unlike conventional monolithic or mixture-of-experts (MoE) transformers that rely on learned soft gating, MoCET assigns a dedicated, lightweight transformer block to each token position, facilitating strict causal dependency modeling and alignment with autoregressive generation schemes. MoCET constitutes a core component of NativeTok, a visual tokenization and image generation framework designed to enforce causal dependencies among image tokens, thereby improving generation coherence and downstream model performance (Wu et al., 30 Jan 2026).
1. Architectural Overview
MoCET resolves the joint token generation problem by maintaining a collection of transformer experts, each specialized for a single token position. Let $N$ denote the total number of tokens per image. The expert pool is formalized as $\{E_1, E_2, \dots, E_N\}$, with each $E_i$ a two-layer transformer block of hidden dimension $d$.
During token generation, only the $i$-th expert $E_i$ is active when producing token $t_i$. This decomposition enables separation of:
- Global context modeling, achieved once by a Meta Image Transformer (MIT), producing a fixed image representation $z$.
- Positional dependency modeling, performed by each $E_i$ based on the prior tokens $t_{<i}$ and the fixed latent $z$.
For each step $i$:

$$h_i = E_i\big([z;\; t_1, \dots, t_{i-1};\; m_i]\big),$$

where $m_i$ is a mask or padding embedding for position $i$. The vector $h_i$ is then projected by a linear head $W \in \mathbb{R}^{K \times d}$ (with $K$ as the codebook size) to produce token logits.
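The per-step computation can be made concrete in a short PyTorch sketch. All names (`experts`, `head`, `mask_emb`, `step`) and all sizes here are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
N, d, K = 32, 512, 8192   # tokens per image, expert hidden dim, codebook size

# One lightweight transformer block per token position
# (two layers, matching the default block depth described above).
experts = nn.ModuleList([
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
        num_layers=2,
    )
    for _ in range(N)
])
head = nn.Linear(d, K)                       # shared projection W to codebook logits
mask_emb = nn.Parameter(torch.zeros(N, d))   # per-position mask/padding embeddings m_i

def step(i: int, z: torch.Tensor, prev_tok_embs: torch.Tensor) -> torch.Tensor:
    """Logits for token t_i given the fixed latent z and embeddings of t_<i."""
    B = z.size(0)
    m_i = mask_emb[i].expand(B, 1, d)                # mask slot for position i
    seq = torch.cat([z, prev_tok_embs, m_i], dim=1)  # [z; t_1 .. t_{i-1}; m_i]
    h_i = experts[i](seq)[:, -1]                     # only E_i is active at step i
    return head(h_i)                                 # (B, K) token logits
```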
2. Causal Modeling and Factorization
MoCET enforces a strict causal factorization of the joint distribution over tokens, conditioned on the MIT-derived latent $z$:

$$p(t_1, \dots, t_N \mid z) = \prod_{i=1}^{N} p(t_i \mid t_{<i}, z).$$

After computing $h_i$, the conditional distribution for $t_i$ is:

$$p(t_i \mid t_{<i}, z) = \mathrm{softmax}(W h_i).$$
During training, ground-truth tokens are supplied by teacher forcing; at inference, token $t_i$ is drawn from this distribution via categorical sampling (or selected greedily by maximum likelihood).
This architecture permits each expert to specialize in modeling the conditional distribution for its target position, aligning pretraining with the needs of autoregressive and masked-token generators.
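Under this factorization, inference reduces to a sequential loop over experts. A minimal sampling sketch, reusing the hypothetical `step` helper from above together with an assumed codebook embedding `tok_emb`:

```python
@torch.no_grad()
def sample(z: torch.Tensor, tok_emb: nn.Embedding) -> torch.Tensor:
    """Draw t_1 .. t_N autoregressively from p(t_i | t_<i, z)."""
    B = z.size(0)
    prev = torch.empty(B, 0, d, device=z.device)   # no prior tokens at step 1
    tokens = []
    for i in range(N):                             # strict causal order
        logits = step(i, z, prev)                  # softmax(W h_i) defines p(t_i | ...)
        t_i = torch.distributions.Categorical(logits=logits).sample()
        tokens.append(t_i)
        prev = torch.cat([prev, tok_emb(t_i).unsqueeze(1)], dim=1)
    return torch.stack(tokens, dim=1)              # (B, N) codebook indices
```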
3. Routing and Expert Assignment
Contrary to standard Mixture-of-Experts architectures, which use dynamically learned soft gates, MoCET employs fixed, one-hot routing. Expert $E_i$ is exclusively responsible for token $t_i$, formalized via the gate vector $g^{(i)} \in \{0, 1\}^N$ with $g^{(i)}_j = 1$ iff $j = i$, and $0$ otherwise:

$$h_i = \sum_{j=1}^{N} g^{(i)}_j \, E_j(\cdot) = E_i(\cdot).$$
While soft or learned routing is theoretically possible (e.g., a softmax-weighted mixture over the expert pool), NativeTok’s instantiation of MoCET restricts each position to a dedicated expert, eliminating the ambiguity and overhead of dynamic expert allocation.
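The contrast can be stated in code: with a one-hot gate the MoE sum collapses to a single expert call, whereas a soft gate would evaluate every expert. Both routines below are schematic, and only the first reflects NativeTok's design:

```python
def route_one_hot(i: int, seq: torch.Tensor) -> torch.Tensor:
    # g^(i)_j = 1 iff j == i: the weighted sum over experts collapses to E_i alone.
    return experts[i](seq)[:, -1]

def route_soft(gate_net: nn.Module, seq: torch.Tensor) -> torch.Tensor:
    # Hypothetical learned alternative (not used in MoCET): a softmax gate
    # mixes every expert's output, costing N expert evaluations per step.
    g = torch.softmax(gate_net(seq[:, -1]), dim=-1)              # (B, N) gate weights
    outs = torch.stack([E(seq)[:, -1] for E in experts], dim=1)  # (B, N, d)
    return (g.unsqueeze(-1) * outs).sum(dim=1)                   # (B, d) mixed state
```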
4. Training Objectives and Hierarchical Native Training
MoCET-based tokenization training in NativeTok optimizes two objectives concurrently:
- Token cross-entropy loss over generated discrete tokens:

$$\mathcal{L}_{\text{token}} = -\sum_{i=1}^{N} \log p\big(t_i^{*} \mid t_{<i}^{*}, z\big),$$

with $t_i^{*}$ as the ground-truth codebook index.
- An image reconstruction loss after passing the ground-truth tokens through an image decoder $D$:

$$\mathcal{L}_{\text{rec}} = \big\lVert x - D(t_1^{*}, \dots, t_N^{*}) \big\rVert^2.$$
The aggregate objective is

$$\mathcal{L} = \mathcal{L}_{\text{token}} + \lambda \, \mathcal{L}_{\text{rec}},$$

where $\lambda$ trades off discrete-token accuracy and pixel-level reconstruction fidelity.
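A sketch of the aggregate objective under teacher forcing, building on the hypothetical helpers above; `decoder`, `lam`, and the squared-error form of the reconstruction term are assumptions:

```python
import torch.nn.functional as F

def training_loss(x, z, gt_tokens, tok_emb, decoder, lam=1.0):
    """L = L_token + lam * L_rec, with ground-truth tokens fed at every step."""
    token_loss = x.new_zeros(())
    for i in range(N):
        prev = tok_emb(gt_tokens[:, :i])     # teacher forcing: true t_<i as context
        logits = step(i, z, prev)
        token_loss = token_loss + F.cross_entropy(logits, gt_tokens[:, i])
    recon = decoder(tok_emb(gt_tokens))      # decode ground-truth tokens to pixels
    rec_loss = F.mse_loss(recon, x)          # assumed pixel-level loss form
    return token_loss + lam * rec_loss
```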
To scale to longer token sequences, Hierarchical Native Training (HNT) is adopted (a code sketch of the expansion step follows this list):
- Train a base NativeTok configuration with a smaller token count end-to-end.
- Expand to a larger $N$ (e.g., 64) by freezing the MIT and the initial experts, weight-cloning them into new experts, and training only the new experts and the decoder (so that ≈56% of parameters are trained).
- Optionally fine-tune all parameters in the final phase to ensure expert harmonization.
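A minimal sketch of the expansion step, assuming PyTorch modules `experts` (an `nn.ModuleList`), `mit`, and `decoder`; the choice to clone expert $j \bmod N_{\text{old}}$ is an illustrative assumption, as the cloning scheme is not detailed here:

```python
import copy

def expand_experts(experts, mit, decoder, new_N):
    """Grow the expert pool from len(experts) to new_N positions (HNT phase 2)."""
    old_N = len(experts)
    for p in mit.parameters():
        p.requires_grad_(False)              # freeze the global context encoder
    for E in experts:
        for p in E.parameters():
            p.requires_grad_(False)          # freeze the initial experts
    for j in range(old_N, new_N):
        clone = copy.deepcopy(experts[j % old_N])   # weight-clone an existing expert
        for p in clone.parameters():
            p.requires_grad_(True)           # only the new experts are trainable
        experts.append(clone)
    for p in decoder.parameters():
        p.requires_grad_(True)               # the decoder is trained alongside them
    return experts
```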
5. Coordination with Meta Image Transformer
MoCET operates in conjunction with a Meta Image Transformer (MIT), a ViT-style encoder followed by an FNN (feed-forward network) for dimensionality adaptation. The workflow is:

$$z = \mathrm{FNN}\big(\mathrm{MIT}(x)\big).$$

The latent $z$ is fixed throughout MoCET token generation, enabling all experts to share access to a consistent, context-rich representation while maintaining strict autoregressive dependencies among tokens.
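A compact sketch of this stage, with `vit` and `ffn` as assumed module handles:

```python
def encode_context(vit: nn.Module, ffn: nn.Module, x: torch.Tensor) -> torch.Tensor:
    feats = vit(x)    # ViT-style encoder: global, bidirectional image features
    z = ffn(feats)    # FNN adapts dimensionality to the expert width d
    return z          # computed once, then held fixed for all N expert steps
```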
6. Implementation Characteristics and Performance
MoCET’s resource profile and speed are determined by the number of position-specific experts ($N$) and their parameterization. Notable details:
- Number of experts: $N$, one per token position
- Each expert: 1 transformer layer, hidden dimension $d$, 4 attention heads
- MIT: 18 layers, hidden dimension 1024, output compressed via FNN
- Image decoder: 24 layers, hidden dimension 1024
- Total parameters: 616M, 666M, and 766M for the three evaluated token-count configurations (smallest to largest $N$)
- Generation complexity per token: a single lightweight expert forward pass, rather than a full deep stack
- Encoding speed: approx. 120, 54, and 39 samples/s for the same three configurations on A800 × 4

The small overhead from MoCET arises from its use of shallow, position-specific experts and a moderate hidden dimension $d$.
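For reference, the hyperparameters listed above can be gathered into a configuration record; `MoCETConfig` is a hypothetical name, `expert_dim` is an assumed value, and `num_experts` is left unset because the evaluated $N$ values are not reproduced here:

```python
from dataclasses import dataclass

@dataclass
class MoCETConfig:
    num_experts: int          # N: one expert per token position
    expert_layers: int = 1    # per the implementation list above
    expert_heads: int = 4
    expert_dim: int = 512     # assumed expert hidden dimension d
    mit_layers: int = 18
    mit_dim: int = 1024
    decoder_layers: int = 24
    decoder_dim: int = 1024
```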
7. Significance for Visual Tokenization and Generation
MoCET systematically addresses the mismatch between bidirectional encoders and autoregressive generators by specializing token generation through position-specific experts, trained on exact causal dependency structures. This approach avoids the need for unordered or weak relational token representations and aligns the first-stage tokenizer's outputs with the requirements of autoregressive or MaskGIT image generators. The result is reinforced coherence and reduced conditional entropy among tokens, which in turn translates to improved FID scores across both AR and masked-token prediction regimes (Wu et al., 30 Jan 2026). A plausible implication is that this structural alignment lessens the downstream generator's burden in learning statistical dependencies, with substantial benefits for sample quality and consistency.