Compact Decoder-Only Transformer Model
- A compact decoder-only Transformer is a neural architecture that uses only the decoder stack, handling sequence tasks efficiently through specialized compression techniques.
- It employs methods like parallel stream processing, linear and convolutional reductions, and dynamic layer skipping to reduce parameters and accelerate inference.
- Empirical results indicate that such designs maintain near-baseline performance with up to 36% parameter reduction, benefiting applications in language, translation, and speech.
A compact decoder-only Transformer model is a neural architecture designed to achieve high computational and memory efficiency for sequence modeling, usually in language, code, or specialized signal domains. While "decoder-only" refers to the use of only the Transformer decoder stack—eschewing the encoder—compactness is realized through architectural compression, parameter sharing, dimension-reduction strategies, context compression, or dynamic inference techniques. These models are increasingly studied for deployment in resource-constrained environments and for practical applications where ultra-large models are infeasible.
1. Fundamental Principles and Motivations
Classic decoder-only Transformers, such as the GPT series, comprise a stack of identical blocks containing masked self-attention, feed-forward networks, residual connections, and normalization layers. Their limitations include quadratic self-attention complexity ($O(n^2)$ in context length $n$) and deep, over-parameterized towers, leading to slow inference and training redundancy. The recent push for compactness is motivated by diminishing returns at large scale, hardware constraints, and the need for adaptable inference trade-offs without pronounced degradation in generative or predictive task performance (Suresh et al., 2024).
2. Architectural Compression and Variants
Several architectural innovations target reduction in model size and computational cost:
ParallelGPT employs an embedding expansion to $2D$ (where $D$ is the hidden dimension), splitting activations into two parallel streams, each processed by its own decoder blocks. Outputs are fused via a learnable scalar, and one stream can optionally be dropped at inference (the Parallel-1 configuration in the table below), yielding a roughly one-third parameter reduction relative to the two-stream model together with inference acceleration.
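The fuse-and-drop idea can be sketched in a few lines. In the numpy sketch below, a single tanh-linear map stands in for each stream's stack of decoder blocks, and the dimensions, stream names, and fixed fusion scalar are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_stream(x, w):
    """Stand-in for a stack of decoder blocks: one linear map + nonlinearity."""
    return np.tanh(x @ w)

D = 8                                    # hidden dimension (assumed)
T = 4                                    # sequence length (assumed)
x = rng.normal(size=(T, 2 * D))          # embeddings expanded to 2D
xa, xb = x[:, :D], x[:, D:]              # split into two parallel streams

wa = rng.normal(size=(D, D)) / np.sqrt(D)
wb = rng.normal(size=(D, D)) / np.sqrt(D)
alpha = 0.5                              # learnable fusion scalar (fixed here)

ya, yb = decoder_stream(xa, wa), decoder_stream(xb, wb)
fused = alpha * ya + (1.0 - alpha) * yb  # training-time fusion of both streams

# Inference-time variant: drop stream b entirely, saving its parameters.
pruned = ya
print(fused.shape, pruned.shape)         # both (4, 8)
```

Because the fusion is a scalar mix, dropping one stream at inference only rescales the surviving activations rather than changing their shape, which is what makes the one-stream deployment cheap.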
LinearlyCompressedGPT organizes decoder blocks into pairs at each width, following each pair with a linear projection down to half the previous dimension, producing a staged reduction $D \to D/2 \to D/4 \to \cdots$. This schedule preserves learnability in early layers, sharply shrinks the parameter count downstream, and has empirically shown negligible impact on validation loss.
ConvCompressedGPT replaces the linear projection in the compression schedule with a 1D convolution, leveraging both dimension reduction and local sequential pattern learning, a hybrid of CNN and Transformer processing pathways.
| Model | Params (M) | Size (MB) | Time per 10k steps (min) |
|---|---|---|---|
| vanilla | 8.82 | 33.66 | 25.35 |
| Parallel | 9.74 | 37.14 | 26.15 |
| Parallel-1 | 6.19 | 23.60 | 26.15 |
| LinearComp | 5.65 | 21.54 | 20.68 |
| ConvComp | 5.65 | 21.54 | 21.68 |
All compaction strategies maintain cross-entropy loss within 0.02 nats/token of the baseline (Suresh et al., 2024).
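The staged width-reduction schedule can be illustrated as follows. Here `block` is a placeholder for a real decoder block, the starting width of 16 and three stages are arbitrary assumptions, and the ConvCompressedGPT variant would swap the learned matrix for a 1D convolution over the sequence.

```python
import numpy as np

rng = np.random.default_rng(1)

def block(x):
    """Placeholder decoder block at the current width (tanh stands in)."""
    return np.tanh(x)

def linear_halve(x, w):
    """Project the hidden dimension down to half with a learned matrix."""
    return x @ w

T, D = 4, 16                            # sequence length, initial width (assumed)
x = rng.normal(size=(T, D))
widths = []
for _ in range(3):                      # staged schedule: 16 -> 8 -> 4
    x = block(block(x))                 # a pair of blocks per width stage
    widths.append(x.shape[1])
    if x.shape[1] > 4:                  # halve until the final width
        w = rng.normal(size=(x.shape[1], x.shape[1] // 2))
        x = linear_halve(x, w)

print(widths)                           # [16, 8, 4]
```

Most parameters in a Transformer block scale with the square of the width, so each halving stage cuts downstream per-block cost by roughly four, which is why the schedule shrinks total parameter count so sharply.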
3. Dynamic Compression and Layer Selection
Dynamic layer selection methods—including layer skipping and early exiting—provide sample-conditional compute reduction.
In layer skipping, binary gates at each layer determine whether that layer executes for a given token. The system is trained to minimize cross-entropy subject to a computational cost penalty, with discrete gates realized via Gumbel-Softmax relaxation. Uniform layer skipping, even with token-agnostic gates, outperforms early exiting, which uses only a prefix of the model per output. Empirical results show that, with oracle per-sequence allocation, models can match full performance while using only 23.3% of layers on average (Glavas et al., 2024).
Per-token controllers trained to exploit hidden state provide no clear advantage over fixed skip rates, suggesting the majority of savings arise from sequence-level adaptive allocation.
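A minimal sketch of token-agnostic gating with a relaxed binary (Gumbel-sigmoid) gate is shown below; a real system would also backpropagate through the relaxed gates and add the cost penalty to the loss. The layer count, hidden size, and zero gate logits are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def gumbel_sigmoid(logit, tau=1.0):
    """Relaxed binary gate: sigmoid of (logit + logistic noise) / temperature."""
    u = rng.uniform(1e-6, 1 - 1e-6)
    noise = np.log(u) - np.log(1 - u)   # sample of logistic noise
    return 1.0 / (1.0 + np.exp(-(logit + noise) / tau))

def layer(x):
    """Placeholder decoder layer."""
    return np.tanh(x)

n_layers, D = 6, 8                      # assumed sizes
x = rng.normal(size=(D,))
gate_logits = np.zeros(n_layers)        # learned in practice; 0 = 50% prior

executed = 0
for i in range(n_layers):
    g = gumbel_sigmoid(gate_logits[i])
    hard = float(g > 0.5)               # hard decision at inference
    x = hard * layer(x) + (1.0 - hard) * x   # residual pass-through when skipped
    executed += hard

print(f"executed {int(executed)}/{n_layers} layers")
```

The residual pass-through is what makes skipping safe: a gated-off layer leaves the hidden state untouched rather than zeroing it.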
4. Context Compression and Low-Rank Adaptation
Contextual compression can further amplify compactness by reducing the number of active hidden states per forward pass.
Dodo implements dynamic nugget-based compression, selecting a subset of hidden states ("nuggets") via a learned scorer at fixed layers. The result is attention with cost $O(nk)$ versus $O(n^2)$ in unconstrained self-attention, where $k = n/r$ for compression ratio $r$. Dodo achieves 20× context compression with only a 2% loss in BLEU score on autoencoding and maintains parity with uncompressed autoregressive models in perplexity and QA/summarization performance, while requiring only LoRA adaptation and a small scorer network (Qin et al., 2023).
Empirically, context compression enables LLaMA-scale models to process vastly larger windows with minimal memory and compute growth, and few-shot adaptation with LoRA ranks as low as 32 suffices.
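The nugget-selection step can be sketched as a top-k filter over scorer outputs. The linear scorer and the sizes below are stand-ins, not Dodo's actual networks; only the 20× ratio echoes the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def select_nuggets(hidden, scores, ratio):
    """Keep the top n/ratio hidden states by scorer output, in sequence order."""
    n = hidden.shape[0]
    k = max(1, n // ratio)
    top = np.argsort(scores)[-k:]       # indices of the k highest scores
    top.sort()                          # restore original sequence order
    return hidden[top], top

n, d, r = 40, 8, 20                     # context length, width, 20x compression
hidden = rng.normal(size=(n, d))
scores = hidden @ rng.normal(size=(d,)) # stand-in for the learned scorer

nuggets, idx = select_nuggets(hidden, scores, r)
print(nuggets.shape)                    # (2, 8)
```

Later queries attend only to these $k$ retained states, which is where the $O(nk)$ attention cost comes from.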
5. Empirical Scaling Laws and Model Sizing for Translation
Scaling law studies for decoder-only models in multilingual machine translation show that test loss is well described by a power law, fit jointly by a bi-variate law in model size and training data (Caillaut et al., 2024). Both depth and width scaling yield similar loss improvements per unit FLOP; however, width scaling better utilizes hardware for higher throughput. For compact models, a 6-layer configuration with a hidden size of 768–1024 is both efficient and competitive in throughput, with the recommendation to train on a sufficiently large token budget for optimal learning.
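The flavor of such a fit can be reproduced on synthetic data. The saturating power law below ($\text{loss} = a N^{-b} + c$) is a generic form chosen for illustration, not the specific law or coefficients fit by Caillaut et al.

```python
import numpy as np

# Hypothetical (model size, loss) points generated from loss = a * N^-b + c,
# where c is the irreducible loss floor.
N = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
a, b, c = 50.0, 0.3, 1.5
loss = a * N ** -b + c

# With the floor subtracted, the law is linear in log-log space:
# log(loss - c) = log(a) - b * log(N), so a line fit recovers (a, b).
x, y = np.log(N), np.log(loss - c)
slope, intercept = np.polyfit(x, y, 1)
print(round(-slope, 3), round(np.exp(intercept), 1))   # 0.3 50.0
```

In practice the floor $c$ is unknown and is fit jointly with $a$ and $b$ by nonlinear least squares; the log-linear trick above only works once $c$ is pinned down.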
6. Application-Specific Design: Translation, Localization, and Speech
DIETA (Italian-English MT, 0.5B params) demonstrates efficacy for this language pair via judicious width/depth choice, RoPE embeddings, RealFormer residuals, and no explicit dropout. A 6-layer, 2048-dim stack trained on 768M sentence pairs achieves state-of-the-art in the sub-1B regime on five benchmarks, and inference fits on an 8 GB GPU at 20 tok/sec (Kasela et al., 25 Jan 2026).
Locaris (Wi-Fi indoor localization, 1B params) represents measurements as tokens, freezes the backbone, and fine-tunes only LoRA adapters and a regression head. The result is sub-meter median localization performance, robust to missing APs, with fast convergence using 3–10% of the calibration data compared to 50–75% for classical systems (Bhatia et al., 13 Oct 2025).
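The adapter-only fine-tuning recipe rests on a simple low-rank residual around frozen weights. The sketch below uses a hypothetical hidden size and rank; the zero-initialized up-projection is the standard LoRA choice that makes the adapter an exact no-op before training.

```python
import numpy as np

rng = np.random.default_rng(4)

D, r = 64, 8                           # hidden size and LoRA rank (illustrative)
W = rng.normal(size=(D, D))            # frozen backbone weight
A = rng.normal(size=(D, r)) * 0.01     # trainable down-projection
B = np.zeros((r, D))                   # trainable up-projection, zero-initialized

def adapted_forward(x):
    """Frozen path plus low-rank update: x (W + A B)."""
    return x @ W + (x @ A) @ B

x = rng.normal(size=(1, D))
# With B zero-initialized, the adapted model starts exactly at the backbone.
assert np.allclose(adapted_forward(x), x @ W)
print("trainable params:", A.size + B.size, "vs frozen:", W.size)
```

Only $2Dr$ parameters per adapted matrix are trained, which is why a few percent of calibration data can suffice to specialize a frozen backbone.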
T-Mimi (TTS, 40.8M params) replaces convolutional upsampling by stacking 12 decoder-only Transformer layers and two linear heads, quantizing early layers to 8-bit to maximize efficiency. This reformulation yields a 9.6× latency reduction with no significant drop in PESQ or SI-SDR quality relative to the original mixed convolution-Transformer design. The final two layers and linear upsampling must remain at full precision for audio fidelity (Wu et al., 27 Jan 2026).
7. Robustness, Expressive Equivalence, and Implementation Considerations
A single-layer decoder-only Transformer can be exactly recast as a two-layer RNN, preserving expressiveness for certain tasks and enabling efficient certified robustness verification via interval analysis over RNN states. This insight motivates the use of minimum-layer designs for provably robust, compact architectures, while handling token and position embeddings separately to safeguard position-sensitivity under adversarial or non-length-preserving perturbations (Zhang et al., 2024).
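The equivalence is easiest to see by treating the KV cache as recurrent state: processing tokens one at a time while accumulating keys and values reproduces full causal attention exactly. The sketch below illustrates this state-based view only, not the exact two-layer RNN construction of Zhang et al.; sizes and the unscaled dot product are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attention_full(X):
    """Causal self-attention output at the last position, computed in one shot."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = np.exp(K @ Q[-1])            # unnormalized weights for last token
    return (weights / weights.sum()) @ V

def attention_recurrent(X):
    """Same output, processing tokens sequentially with the KV cache as state."""
    Ks, Vs = [], []                        # recurrent state grows one row per step
    out = None
    for x in X:
        Ks.append(x @ Wk)
        Vs.append(x @ Wv)
        q = x @ Wq
        weights = np.exp(np.array(Ks) @ q)
        out = (weights / weights.sum()) @ np.array(Vs)
    return out

X = rng.normal(size=(5, d))
assert np.allclose(attention_full(X), attention_recurrent(X))
print("sequential and one-shot attention agree")
```

Interval analysis for certification then propagates bounds through this step-by-step recurrence instead of through one large attention matrix.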
Critical implementation details for compactness include:
- Careful tokenization (character-level as dimensions shrink, or BPE for language tasks),
- Dropout minimization to conserve capacity,
- Weight tying between embeddings/output heads,
- Quantization (8/4-bit) for final deployment,
- LoRA or low-rank compression for efficient adaptation.
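Two of these items, weight tying and post-training quantization, combine naturally in a short sketch. The symmetric per-tensor 8-bit scheme and the vocabulary/hidden sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
V, D = 100, 16                          # vocab size, hidden size (illustrative)
E = rng.normal(size=(V, D))             # token embedding matrix

# Weight tying: the output head reuses the embedding matrix transposed,
# so logits = E @ h and no separate (D, V) projection is stored.
h = rng.normal(size=(D,))
logits = E @ h

# Post-training 8-bit quantization of the tied matrix (symmetric, per-tensor):
# one float scale maps int8 codes back to approximate float weights.
scale = np.abs(E).max() / 127.0
E_q = np.clip(np.round(E / scale), -127, 127).astype(np.int8)
E_deq = E_q.astype(np.float32) * scale
err = np.abs(E - E_deq).max()
print(logits.shape, bool(err < scale))  # (100,) True
```

Tying saves a full $D \times V$ matrix, and quantizing the single shared matrix then compresses both the embedding and the output head at once.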
References
- (Suresh et al., 2024) Towards smaller, faster decoder-only transformers: Architectural variants and their implications
- (Qin et al., 2023) Dodo: Dynamic Contextual Compression for Decoder-only LMs
- (Glavas et al., 2024) Dynamic layer selection in decoder-only transformers
- (Kasela et al., 25 Jan 2026) DIETA: A Decoder-only transformer-based model for Italian-English machine TrAnslation
- (Caillaut et al., 2024) Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task
- (Sun et al., 2024) You Only Cache Once: Decoder-Decoder Architectures for LLMs
- (Bhatia et al., 13 Oct 2025) Indoor Localization using Compact, Telemetry-Agnostic, Transfer-Learning Enabled Decoder-Only Transformer
- (Wu et al., 27 Jan 2026) T-Mimi: A Transformer-based Mimi Decoder for Real-Time On-Phone TTS
- (Zhang et al., 2024) A One-Layer Decoder-Only Transformer is a Two-Layer RNN: With an Application to Certified Robustness