Modularised Decoder-Only LLM
- Modularised decoder-only LLMs are transformer-based generative models that decouple attention, feed-forward, output, and adapter modules to boost flexibility and efficiency.
- They employ techniques such as Mixture-of-Experts, LoRA, multi-head decoding, and dynamic layer selection to optimize training costs and accelerate inference.
- These architectures deliver practical benefits in code translation, multimodal generation, and long-context memory, enabling domain specialization and adaptive performance.
A modularised decoder-only LLM is a transformer-based generative system in which functional components—attention mechanisms, feed-forward networks, output heads, and adapter modules—are structurally or operationally decoupled to achieve greater flexibility, efficiency, or domain specialisation. Unlike classic monolithic decoder stacks, these architectures employ explicit modularisation, such as Mixture-of-Experts (MoE), multi-head decoding, dynamic layer selection, or intermediary input/output decoupling. They enable model designers to combine frozen backbones with lightweight adapters or submodules, route computation depending on input or task, and integrate diverse modalities into a unified autoregressive framework.
1. Modularisation Techniques in Decoder-Only Architectures
Central modularisation mechanisms in decoder-only LLMs include:
- Mixture-of-Experts (MoE): Models such as SteloCoder and OneCAT introduce intermediate gating networks and multiple domain-specific or modality-specific expert subnetworks (e.g., one per programming language or input modality). The gating module computes expert-selection probabilities, used as a soft mixture during training and as hard top-1 routing during inference, and passes each token's representation through the selected expert's lightweight adapter (Pan et al., 2023, Li et al., 3 Sep 2025).
- Low-Rank Adaptation (LoRA): Fine-tuned LoRA adapters are added to the self-attention and feed-forward projections of a frozen large backbone, dramatically reducing training cost and parameter count while retaining generalisation capacity. Each expert in SteloCoder, for instance, consists of rank-4 LoRA updates amounting to only 0.06% of StarCoder's parameters (Pan et al., 2023, Jin et al., 31 Aug 2025, Tan et al., 7 Dec 2025).
- Multi-Head Output Decoding: SpeLLM employs multiple character-level output heads (linear projections over a small character vocabulary) to decouple input and output vocabulary sizes, mitigating the output-projection bottleneck associated with large token-level vocabularies (Ben-Artzy et al., 22 Jul 2025).
- Dynamic Layer Selection: Models are equipped with controllers (either per-token or per-sequence) enabling runtime selection of which transformer layers to execute or skip, optimising inference efficiency and resource utilisation (Glavas et al., 26 Oct 2024).
- Modality-Specific Routing: Multimodal architectures such as OneCAT use per-token modality routing, deploying visual and textual FFN experts within a unified stack to handle discrete and continuous tokens and eliminating the need for external vision encoders or tokenisers (sketched below) (Li et al., 3 Sep 2025).
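As a concrete illustration of modality-specific routing, the following minimal PyTorch sketch sends each token to the FFN expert matching its modality tag. The `ModalityRoutedFFN` class and its interface are hypothetical and only illustrate the routing idea, not OneCAT's actual implementation.

```python
import torch
import torch.nn as nn

class ModalityRoutedFFN(nn.Module):
    """Hypothetical per-token modality-routed FFN: each token representation is
    sent to the expert matching its modality id (e.g. 0 = text, 1 = image patch)."""

    def __init__(self, d_model: int, d_ff: int, num_modalities: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_modalities)
        )

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); modality_ids: (batch, seq) integer modality tags
        out = torch.zeros_like(x)
        for m, expert in enumerate(self.experts):
            mask = modality_ids == m          # boolean mask of tokens owned by expert m
            if mask.any():
                out[mask] = expert(x[mask])   # route only those tokens through their FFN
        return out
```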
2. Mathematical Formulation and Data Flow
Mixture-of-Experts Gating
For $N$ experts $E_1, \dots, E_N$, input embedding $x$, and gating weights $W_g$, the MoE output is
$$y = \sum_{i=1}^{N} g_i(x)\, E_i(x), \qquad g(x) = \mathrm{softmax}(W_g x).$$
During inference, the expert with the maximum gate probability, $i^{*} = \arg\max_i g_i(x)$, is selected (hard routing).
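A minimal PyTorch sketch of this gating rule, with linear layers standing in for the experts, might look as follows; soft mixing is used in training mode and hard argmax routing at inference, mirroring the formulation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedExperts(nn.Module):
    """Sketch of MoE gating: soft mixture while training, hard top-1 routing at inference.
    The experts here are placeholder linear layers, not a specific model's adapters."""

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)                       # W_g
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = F.softmax(self.gate(x), dim=-1)                           # g_i(x), shape (..., N)
        if self.training:
            # soft mixture: weighted sum of all expert outputs
            expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (..., d_model, N)
            return (expert_outs * probs.unsqueeze(-2)).sum(dim=-1)
        # hard routing: each token goes only through its argmax expert
        top = probs.argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out
```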
Low-Rank Adapter Injection
For a given weight matrix $W_0 \in \mathbb{R}^{d \times k}$ and a LoRA rank-$r$ update $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$:
Final weight: $W = W_0 + BA$, with $W_0$ frozen and only $A$ and $B$ trained.
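The injection can be sketched as a thin wrapper around a frozen `nn.Linear`; the `alpha/r` scaling and the initialisation choices below follow common LoRA practice and are assumptions here, not a particular paper's recipe.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: the frozen base weight W_0 stays intact and a
    rank-r update B @ A is added to the output, i.e. W = W_0 + (alpha/r) * B A."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the backbone projection
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # (r, k)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # (d, r), zero-init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base projection plus low-rank correction x A^T B^T
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```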
Multi-Head Character Decoding (SpeLLM)
For an output token spelled as up to $k$ characters over a character vocabulary $V_c$ of size $|V_c|$, each head $j$ applies a linear projection $W_j$ to the final hidden state $h$ and predicts $c_j = \arg\max\, \mathrm{softmax}(W_j h)$.
The output token is reconstructed by concatenating $c_1 c_2 \cdots c_k$.
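A minimal sketch of such parallel character heads, with hypothetical names and a fixed number of spelling positions `k`, could look as follows; it illustrates the idea rather than SpeLLM's exact head design.

```python
import torch
import torch.nn as nn

class CharSpellingHeads(nn.Module):
    """k parallel character-level output heads over a small character vocabulary.
    Each head predicts one character position; the token is their concatenation."""

    def __init__(self, d_model: int, char_vocab_size: int, k: int = 8):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, char_vocab_size) for _ in range(k))

    def forward(self, h_last: torch.Tensor) -> torch.Tensor:
        # h_last: (batch, d_model) final hidden state at the generation position
        logits = torch.stack([head(h_last) for head in self.heads], dim=1)  # (batch, k, |V_c|)
        return logits.argmax(dim=-1)  # (batch, k) character ids, detokenised by concatenation
```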
Dynamic Layer Skipping
With binary skip variable $m_\ell \in \{0, 1\}$ for layer $\ell$,
$$h_\ell = m_\ell\, f_\ell(h_{\ell-1}) + (1 - m_\ell)\, h_{\ell-1},$$
so skipped layers ($m_\ell = 0$) pass the residual stream through unchanged.
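The skipping rule can be sketched with a tiny per-sequence controller. The controller design below (a single linear layer over the mean hidden state) is an assumption for illustration; a real implementation would bypass the layer's computation entirely when it is skipped, rather than masking its output as written here for clarity.

```python
import torch
import torch.nn as nn

class SkippableStack(nn.Module):
    """Sketch of dynamic layer skipping: a small controller emits one execute/skip
    decision per layer; skipped layers act as the identity on the residual stream.
    `layers` is a placeholder ModuleList of shape-preserving transformer blocks."""

    def __init__(self, layers: nn.ModuleList, d_model: int):
        super().__init__()
        self.layers = layers
        self.controller = nn.Linear(d_model, len(layers))   # one skip logit per layer

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model); decide per sequence from the mean hidden state
        logits = self.controller(h.mean(dim=1))              # (batch, num_layers)
        m = (torch.sigmoid(logits) > 0.5).float()            # hard execute/skip decisions
        for l, layer in enumerate(self.layers):
            m_l = m[:, l].view(-1, 1, 1)                     # broadcast over (seq, d_model)
            h = m_l * layer(h) + (1.0 - m_l) * h             # identity when m_l = 0
        return h
```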
3. Training and Adaptation Protocols
Modular training typically comprises:
- Expert fine-tuning: Independent adapter modules for each expert are trained using cross-entropy objectives over target tokens, with the backbone frozen. SteloCoder’s experts each require 6 hours on a single A100 GPU for XLCoST code translation (Pan et al., 2023).
- Gate network learning: A lightweight gating classifier is trained post hoc on balanced program-level samples to route input tokens to the best expert per instance (Pan et al., 2023).
- Self-Distillation: SpeLLM adapts pre-trained LLMs using a two-term loss combining character-level cross-entropy with token-level mimicry of teacher predictions, freezing most of the backbone except the last feed-forward layers and the new output heads (see the loss sketch after this list) (Ben-Artzy et al., 22 Jul 2025).
- Cyclical Masking for Multi-token Decoding: Direct Multi-Token Decoding (DMTD) fine-tunes with positional masking to teach the late layers to anticipate future tokens per context window, reducing per-token computation cost by up to 2.15× (Luo et al., 13 Oct 2025).
- Parameter-efficient transfer: LoRA and QLoRA add <0.5% additional parameters yet support robust domain adaptation under distribution shift (e.g., Chinese AI-generated text detection reaches 95.94% test accuracy with LoRA-adapted Qwen2.5) (Jin et al., 31 Aug 2025).
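As an illustration of the self-distillation objective described above, the following sketch combines character-level cross-entropy with token-level mimicry of a frozen teacher. The loss weighting, the KL form of the mimicry term, and how the student's token-level logits are obtained are assumptions for illustration, not SpeLLM's exact recipe.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(char_logits, char_targets,
                           student_token_logits, teacher_token_logits,
                           alpha: float = 0.5):
    """Two-term objective sketch: character-level cross-entropy over the spelling
    heads plus KL mimicry of the frozen teacher's token distribution.

    char_logits:  (batch, k, |V_c|) logits from the k character heads
    char_targets: (batch, k)        gold character ids (padding positions = -100)
    student/teacher_token_logits: (batch, |V|) token-level distributions to match
    """
    ce = F.cross_entropy(char_logits.flatten(0, 1), char_targets.flatten(),
                         ignore_index=-100)
    kl = F.kl_div(F.log_softmax(student_token_logits, dim=-1),
                  F.softmax(teacher_token_logits, dim=-1),
                  reduction="batchmean")
    return ce + alpha * kl
```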
4. Applications and Empirical Performance
Modularised decoder-only architectures have demonstrated impact across multiple domains:
- Code Translation: SteloCoder’s MoE+LoRA adaptation outperforms XLCoST leaderboard baselines by ≥3.5 CodeBLEU across five programming language pairs (C++, C#, JS, Java, PHP to Python) with only ≈45M extra parameters (Pan et al., 2023).
- Multimodal Generation and Editing: OneCAT’s modality-aware MoE supports raw image and text generative tasks without external vision tokenisers, yielding up to 10.1× inference speedup over diffusion-based models and new state-of-the-art scores across GenEval, DPG-Bench, and ImgEdit-Bench (Li et al., 3 Sep 2025).
- Speech-to-Text and Cross-modal Integration: Speech-LLaMA modularises speech frontend, CTC compressor, projection adapter, and decoder-only LLM. Swapping in vision/video/tabular encoders allows plug-and-play multimodal integration, with up to 26.3 BLEU in multilingual ST tasks (Wu et al., 2023).
- Long-context Memory and Latency: YOCO decouples prompt encoding and autoregressive generation, reducing KV-cache memory by up to 9.7× and increasing throughput by up to 9.6× for million-token contexts (Sun et al., 8 May 2024).
- Personality Recognition: PICEPR splits personality tasks into summary, facet extraction, mimicry generation, embedding, and classification modules, delivering 5–15% accuracy improvement over vanilla fine-tuning for both Big-5 and MBTI essays (Tan et al., 7 Dec 2025).
- Efficiency via Dynamic Layer Selection: Token- or sequence-level skipping, enabled by simple controllers, achieves performance parity with full-model generation using only ≈23% of executed layers, reducing FLOPs by up to 3.7× (Glavas et al., 26 Oct 2024).
5. Modularity Trade-Offs and Limitations
While modularisation offers pronounced benefits in adaptability, efficiency, and interpretability, associated limitations include:
- Slight accuracy drop in some downstream tasks: SpeLLM and SteloCoder report small gaps on data-rich tasks (e.g., GSM8K or CNN/DailyMail), though partial recovery is possible via fallback strategies or improved controller design (Ben-Artzy et al., 22 Jul 2025, Pan et al., 2023).
- Non-pure modularity: Techniques like SpeLLM still rely on byte-pair encoding for input and fallback correction, so purely character-level modeling is not fully achieved (Ben-Artzy et al., 22 Jul 2025).
- Controller weakness: Small linear per-token controllers in dynamic skipping are largely insensitive to input hidden state, providing negligible improvement over fixed schedules unless controller complexity is increased (Glavas et al., 26 Oct 2024).
- Increased parameter count for MoE: Each transformer block in OneCAT stores three modality-specific FFNs, inflating model size to 9B parameters (3B active), although this does not increase per-token inference cost (Li et al., 3 Sep 2025).
- Training cost for multi-expert or multi-adapter architectures: Training many LoRA experts and gating modules, even if individually efficient, can aggregate substantial time and computational cost for large model families (Pan et al., 2023, Li et al., 3 Sep 2025).
6. Future Directions and Practical Considerations
Emerging research suggests multiple promising avenues:
- Unified decoder-only multimodal models: Elimination of modality-specific external encoders (e.g., ViTs, text tokenisers) in favour of internal patch embedders and modality-aware MoE routing (Li et al., 3 Sep 2025).
- Scalable long-context transfer: Chunk-parallel training and efficient single-pass caching (YOCO) enable native support for 1M+ token sequences without memory bottlenecks, integral for extended context comprehension and retrieval (Sun et al., 8 May 2024).
- Cross-domain generalisation: LoRA-based modularisation exhibits superior transfer under domain shift compared to fully fine-tuned encoders, supporting robust detection and classification in diverse linguistic contexts (Jin et al., 31 Aug 2025).
- Dynamic execution and adaptive inference: Engineering dynamic routers for per-sequence (not per-token) layer allocation can recover full-model quality at <¼ compute on long prompts, supporting routine integration into latency-critical deployments (Glavas et al., 26 Oct 2024).
- Plug-and-play modularity: The consistent decoupling of summarisation, facet extraction, generation, embedding, and classification as discrete modules in personality recognition pipelines enables easy ablation, replacement, and benchmarking for interdisciplinary research (Tan et al., 7 Dec 2025).
7. Comparative Overview of Modular Decoder-Only Architectures
| Model | Modularisation Approach | Core Achievement |
|---|---|---|
| SpeLLM | Character-level multi-heads | Output decoupling, 5.1% speedup, rare language support (Ben-Artzy et al., 22 Jul 2025) |
| SteloCoder | MoE+LoRA per code language | +3.5 CodeBLEU, 32 h training, 0.3% extra params (Pan et al., 2023) |
| OneCAT | Modality-specific MoE | Unified multimodal AR gen/edit/understand, 10× speedup (Li et al., 3 Sep 2025) |
| YOCO | Self- vs. cross-decoder split | 1M context, >9× memory reduction, up to 10× throughput (Sun et al., 8 May 2024) |
| PICEPR | Pipeline modularisation | 5–15% recognition boost, flexible task swapping (Tan et al., 7 Dec 2025) |
| Speech-LLaMA | Modular frontend→adapter→decoder | Cross-modal extension, 22% BLEU increase (Wu et al., 2023) |
| DMTD | Layerwise modular pass/reuse | 2× speedup, minor loss, log-linear scaleup (Luo et al., 13 Oct 2025) |
| Dynamic Skipping | Token/sequence layer selection | 76.7% compute reduction, full-model quality at ≈23% of layers (Glavas et al., 26 Oct 2024) |
In sum, modularised decoder-only LLMs reflect a paradigm shift toward structurally adaptive generative models, where decoupled submodules unlock domain specialisation, scalable efficiency, and multimodal extensibility across a growing universe of tasks and benchmarks.