Lightweight Encoder-Decoder Models
- Lightweight encoder–decoder models are efficient neural architectures that leverage asymmetric design and pruning to reduce memory, compute, and latency.
- They employ techniques such as structured pruning, quantization, and block replacement to achieve significant speed improvements and model size reductions.
- These models are applied across NLP, speech recognition, vision, and generative tasks, enabling real-time deployment on resource-constrained platforms.
Lightweight encoder–decoder models are neural architectures designed to achieve high task performance (e.g., generation, segmentation, compression) under tight constraints of memory, compute, and latency. They enable practical deployment of end-to-end deep learning systems on resource-limited or real-time platforms without sacrificing accuracy beyond acceptable thresholds. Such models leverage architectural asymmetry, structured pruning, quantization, and tailored training to minimize critical-path resource consumption, with widespread application in speech recognition, NLP, vision, and generative modeling.
1. Core Architectural Strategies
Lightweight encoder–decoder models systematically reduce model size and computation by modifying both encoder and decoder sub-networks, prioritizing architectural asymmetry, modular complexity allocation, and operation minimization.
- Asymmetric Design: Capacity is allocated disproportionately between encoder and decoder, with the encoder often heavier and the decoder aggressively compressed (see the sketch after this list). This approach is supported in both vision (e.g., "AsymLLIC" (Wang et al., 23 Dec 2024)) and small language models ("Return of the Encoder" (Elfeki et al., 27 Jan 2025)), where the decoder sits on the latency-critical path: the decoding pass at inference time (vision) or the per-token autoregressive loop (language generation).
- Structured Pruning and Layer Selection: Methods such as NASH ("Narrow Encoder, Shallow Decoder") (Ko et al., 2023) drop full decoder layers and selectively sparsify encoder heads/intermediate dimensions. In speech, DET ("Dynamic Encoder Transducer") enables dynamic selection of encoder depth and segment-wise assignment (Shi et al., 2021).
- Quantization and Projection: Extreme parameter and footprint reduction is achieved by low-bit quantization (e.g., 8-bit weights in pQRNN-MAtt (Kandoor, 2021)), and by projection-based encodings, which decouple token representation from embedding tables.
- Block Replacement and Pruning: Progressive substitution of heavy decoder blocks (e.g., replacing shifted-window Swin blocks with windowed variants, channel slicing in context models) allows fine-grained control of complexity growth (as in AsymLLIC (Wang et al., 23 Dec 2024)).
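To make the asymmetric allocation concrete, below is a minimal PyTorch sketch of an encoder-heavy, decoder-light sequence-to-sequence model. The layer counts, widths, and vocabulary size are illustrative placeholders rather than the configuration of any cited model, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn

class AsymmetricSeq2Seq(nn.Module):
    """Illustrative encoder-heavy / decoder-light transformer.

    The encoder (run once per input) keeps most of the capacity, while the
    decoder (run once per generated token) stays shallow and narrow.
    """

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8,
                 enc_layers=12, dec_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=2 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=enc_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=dec_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        memory = self.encoder(self.embed(src_ids))          # amortized: one pass per input
        hidden = self.decoder(self.embed(tgt_ids), memory)  # critical path: per decoding step
        return self.lm_head(hidden)

model = AsymmetricSeq2Seq()
logits = model(torch.randint(0, 32000, (2, 128)),   # source batch
               torch.randint(0, 32000, (2, 16)))    # partial target
print(logits.shape)  # torch.Size([2, 16, 32000])
```

Because the encoder runs once per input while the decoder runs once per generated token, shrinking decoder depth cuts the per-token cost that dominates generation latency.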
2. Representative Models and Domains
NLP and Language Generation
- NASH (Ko et al., 2023): Decoupled structured pruning that drops decoder layers (dominant for speed) while lightly sparsifying encoder heads and dimensions (a pruning sketch follows this list). Yields 2.5–5× speedups and >95% of full-model generation quality on T5/BART across summarization, question answering, and multi-task settings.
- Sub-Billion SLMs (Elfeki et al., 27 Jan 2025): Encoder–decoder models with parameter split (2/3 to encoder), RoPE, GQA, and knowledge distillation from large decoder-only teachers. Demonstrates 47% lower first-token latency and 4.7× higher throughput versus decoder-only models at equivalent parameter counts.
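A rough sketch of uniform decoder-layer dropping in the spirit of NASH's shallow-decoder strategy, assuming a Hugging Face T5 checkpoint. Attribute names follow the transformers T5 implementation; in practice the pruned model would then be fine-tuned (e.g., with a distillation objective as in Section 3).

```python
import torch.nn as nn
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")

n = len(model.decoder.block)   # 12 decoder blocks for t5-base
keep = 3                       # target decoder depth; decode-path cost shrinks roughly n / keep
stride = n // keep

# Keep every `stride`-th block (block 0 holds the relative position bias, so it stays).
kept = nn.ModuleList([model.decoder.block[i] for i in range(0, n, stride)][:keep])
model.decoder.block = kept
model.config.num_decoder_layers = keep

print(f"{sum(p.numel() for p in model.decoder.parameters()) / 1e6:.1f}M decoder params")
```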
Speech Recognition
- Dynamic Encoder Transducer (DET) (Shi et al., 2021): Emformer-based RNN-T with multiple encoder depth variants (20/14 layers, sharing the predictor and joiner). Layer dropout and collaborative learning train the encoders to support on-demand depth at inference. Enables dynamic assignment for latency–accuracy trade-offs, with ∼25% parameter reduction and accuracy similar to or better than full-size baselines.
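A hedged sketch of depth-dynamic encoding via structured layer dropout, loosely in the spirit of DET. The class name, skippable layer range, and drop probability are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DepthDynamicEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=20,
                 skippable=range(14, 20), p_skip=0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.skippable = set(skippable)   # layers that may be dropped
        self.p_skip = p_skip

    def forward(self, x, full_depth=True):
        for i, layer in enumerate(self.layers):
            if i in self.skippable:
                if self.training and torch.rand(()).item() < self.p_skip:
                    continue              # stochastic skip during training
                if not self.training and not full_depth:
                    continue              # shallow (low-latency) path at inference
            x = layer(x)
        return x

enc = DepthDynamicEncoder().eval()
frames = torch.randn(1, 50, 256)          # (batch, time, features)
fast = enc(frames, full_depth=False)      # 14-layer path: lower latency
accurate = enc(frames, full_depth=True)   # 20-layer path: higher accuracy
```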
Vision and Segmentation
- AsymLLIC (Wang et al., 23 Dec 2024): Asymmetric learned image compression, allocating maximum complexity to the encoder, substituting decoder modules with lighter transformer/CNN blocks, and using a two-stage fine-tuning pipeline. Reduces decoder GMACs by 65% and parameters by 53% versus symmetric baselines, with minimal (≤0.2 dB) performance loss.
- LEDNet (Wang et al., 2019): Real-time semantic segmentation with aggressive encoder channel reduction (split–shuffle–non-bottleneck blocks) and a lightweight APN decoder (a block-level sketch follows this list). Delivers 0.94M parameters and 71 FPS at state-of-the-art accuracy.
- Foot Ulcer Segmentation ResAttn-U (Ali et al., 2022): Channel/spatial attention within each residual block, group convolutions, ∼5M parameters (1/6th of U-Net), and patch-based training with test-time augmentation (TTA) yield competitive accuracy at greatly reduced compute.
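The following is a rough sketch of a channel split-and-shuffle block built from depthwise separable convolutions, loosely in the spirit of LEDNet's encoder blocks; the exact kernel and dilation layout of the paper is not reproduced here.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    # Interleave channels across groups so the two branches exchange information.
    b, c, h, w = x.shape
    return (x.view(b, groups, c // groups, h, w)
             .transpose(1, 2)
             .reshape(b, c, h, w))

class SplitShuffleBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),  # depthwise
            nn.Conv2d(half, half, 1, bias=False),                          # pointwise
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        a, b = x.chunk(2, dim=1)          # channel split halves the per-branch cost
        b = self.branch(b)
        return channel_shuffle(torch.cat([a, b], dim=1))

block = SplitShuffleBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```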
Generative Models
- Lightweight Decoders for Latent Diffusion (Buzovkin et al., 6 Mar 2025): Replacement of heavy VAE decoders in image/video diffusion with transformer-based or linear-attention decoders (e.g., TAE-192, EfficientViT). Achieves up to 20× decoding speedup and 80% memory reduction with acceptable fidelity loss for large-scale inference (a decoder-swap sketch follows this list).
- Encoder–Decoder Diffusion for Language (Arriola et al., 26 Oct 2025): E2D2 splits context-building (encoder) from denoising (lightweight decoder) in block-wise discrete diffusion models, halving attention FLOPs and achieving 2–4× faster inference and 36.0 ROUGE-1/14.1 ROUGE-2/23.9 ROUGE-L at 156 tokens/sec on CNN/DM summarization, matching or exceeding prior decoder-only diffusion models.
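The decoder-swap pattern for latent diffusion can be illustrated as follows: keep the denoising UNet but replace the heavy VAE decoder with a tiny autoencoder. This sketch assumes the Hugging Face diffusers library and the public "madebyollin/taesd" tiny decoder; the base checkpoint id is a placeholder, and the snippet illustrates the general pattern rather than the specific decoders evaluated in the cited work.

```python
import torch
from diffusers import StableDiffusionPipeline, AutoencoderTiny

# Load a Stable-Diffusion-family pipeline (checkpoint id is illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)

# Swap in a drop-in lightweight latent decoder (a small fraction of the VAE's size).
pipe.vae = AutoencoderTiny.from_pretrained(
    "madebyollin/taesd", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

image = pipe("a lighthouse at dusk", num_inference_steps=25).images[0]
image.save("lighthouse.png")
```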
3. Mathematical and Algorithmic Formulation
- Loss Constructions: Loss formulations typically combine a task (data) loss, an auxiliary distillation term, and regularization for pruned representations, e.g. in NASH:
  $\mathcal{L} = \mathcal{L}_{\text{task}} + \alpha\,\mathcal{L}_{\text{KL}} + \beta\,\mathcal{L}_{\text{hid}} + \mathcal{L}_{\text{sparse}}$,
  with $\mathcal{L}_{\text{KL}}$ as the student–teacher KL divergence, $\mathcal{L}_{\text{hid}}$ as hidden-representation matching, and $\mathcal{L}_{\text{sparse}}$ as a Lagrange-constrained structured-sparsity penalty (a code sketch follows this list).
- Dynamic Depth and Layer Skipping: DET employs structured layer dropout: for a designated subset of layers $S$, a binary mask $m_\ell \in \{0, 1\}$ gates each layer $\ell \in S$ during training, and at inference specific layers can be forcibly skipped. The loss combines transducer, CE, and KLD terms to enable joint training of multiple encoder paths.
- Parameter Complexity and Speedup: Factors affecting efficiency include FLOPs per frame/layer, parameter counts, and wall-clock throughput. For example, using a shallow decoder of depth $d$ in NASH reduces decoder FLOPs and latency roughly in proportion to $d$, while the impact of structured encoder pruning is more marginal for speed but beneficial for generalization.
- Quantization: pQRNN-MAtt achieves sub-4 MB model size by 8-bit quantization over all trainable parameters, with locality-sensitive ternary hashing for token projection that maps each input token $x$ to a fixed ternary feature vector $\mathbf{p}(x) \in \{-1, 0, 1\}^{d}$, followed by a low-dimensional bottleneck and a bidirectional QRNN encoder.
- Training Schedules: Most approaches employ staged fine-tuning: first adapting the lighter decoder to preserve representational quality and then training on the full rate–distortion (RD) loss (vision), or using warm-up and constraint ramping toward sparsity targets (language, NASH).
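As a concrete illustration of the loss construction above, the sketch below composes a task loss, a temperature-scaled student–teacher KL term, a hidden-state matching term, and an externally supplied sparsity penalty. The weights and temperature are generic placeholders rather than NASH's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def compressed_model_loss(student_logits, teacher_logits,
                          student_hidden, teacher_hidden,
                          labels, sparsity_penalty,
                          alpha=1.0, beta=1.0, temperature=2.0):
    # Task loss on the student's predictions (logits: [batch, seq, vocab]).
    task = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    # Temperature-scaled KL between student and teacher output distributions.
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    # Match intermediate hidden representations of student and teacher.
    hidden = F.mse_loss(student_hidden, teacher_hidden)
    # Sparsity penalty is computed elsewhere (e.g., a Lagrangian over pruning masks).
    return task + alpha * kl + beta * hidden + sparsity_penalty
```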
4. Empirical Performance and Trade-offs
| Model / Domain | Key Metric(s) | Speed / Compute Gain | Size Reduction | Quality Delta |
|---|---|---|---|---|
| NASH (T5-Base) (Ko et al., 2023) | SAMSum / ROUGE-L / MT | 2.5–5× faster generation | -- | >95% of full-model quality |
| DET (ASR) (Shi et al., 2021) | RTF / WER | >30% | 25% fewer parameters | Similar to full-size |
| AsymLLIC (LIC) (Wang et al., 23 Dec 2024) | Decoder GMACs | 65% fewer decoder GMACs | 53% fewer decoder parameters | ≤0.2 dB loss |
| LEDNet (segm.) (Wang et al., 2019) | FPS / mIoU | 3–10× vs. U-Net | ~30× fewer parameters | On par with SoTA |
| Diffusion decoding (Buzovkin et al., 6 Mar 2025) | Decode time | 2–20× faster decoding | 80% less memory | ~10% SSIM/PSNR drop |
| pQRNN-MAtt (Kandoor, 2021) | EM (%) on MTOP | -- | 85× smaller | Above LSTM baseline |
Empirical results reveal that decoder depth dominates inference latency (NASH, DET), with shallow decoders yielding 2–5× speedups while maintaining near-baseline accuracy. Aggressive channel reduction, group convolution, and channel shuffle (LEDNet) retain feature capacity with minor accuracy trade-offs. Quantization (pQRNN-MAtt) compresses model footprints nearly two orders of magnitude, showing <2 pt accuracy drop at 85× size reduction.
Quality–Efficiency Pareto
In all domains, lightweight encoder–decoder models define new Pareto frontiers, outperforming parameter-matched alternatives (e.g., T5-Small) and often surpassing prior hand-tuned or symmetrical baselines in accuracy-under-budget regimes.
5. General Design Principles and Best Practices
- Asymmetric Complexity Allocation: Allocate maximal representational power to one stage (often the encoder) and minimize the other (decoder), unless the target hardware imposes different constraints (Wang et al., 23 Dec 2024, Elfeki et al., 27 Jan 2025).
- Progressive Block Substitution: Prefer module-wise pruning/replacement with stepwise re-training to abrupt, global compression (Wang et al., 23 Dec 2024).
- Sparsity Control: For transformer-based models, prefer moderate structured sparsity in the encoder for regularization and shallow depth in the decoder for speed (Ko et al., 2023).
- Attention and Operation Optimization: Channel split and shuffle with depthwise convolutions (LEDNet), windowed attention without shifts, and minimal multi-scale fusion (AsymLLIC, segmentation) reduce complexity with limited accuracy loss.
- Quantization: Unified low-bit quantization is essential for deployment on mobile or edge devices, providing dramatic size reductions for sequence models (Kandoor, 2021) (see the sketch after this list).
- Integration of Modern Techniques: Pre-layer normalization, rotary or learned positional embeddings, grouped-query attention, and hardware-aware ONNX compilation are critical for sub-billion SLMs (Elfeki et al., 27 Jan 2025).
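As a minimal example of the quantization principle above, the sketch below applies post-training dynamic INT8 quantization to the linear layers of a toy encoder–decoder using PyTorch's built-in utility. It is a generic illustration rather than the pQRNN-MAtt recipe, which quantizes all trainable parameters.

```python
import os
import torch
import torch.nn as nn

# Toy seq2seq stand-in; any nn.Module containing nn.Linear layers works the same way.
model = nn.Transformer(d_model=256, nhead=4,
                       num_encoder_layers=4, num_decoder_layers=1,
                       batch_first=True)

# Post-training dynamic quantization: weights stored as INT8, activations
# quantized on the fly. Only nn.Linear modules are converted here.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m):
    # Serialize to disk to compare on-disk footprints.
    torch.save(m.state_dict(), "tmp.pt")
    mb = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return mb

print(f"fp32: {size_mb(model):.1f} MB, int8 dynamic: {size_mb(quantized):.1f} MB")
```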
6. Application Domains and Scope
Lightweight encoder–decoder models are now integral to the deployment of efficient speech recognition (DET, real-time streaming ASR), document and code generation (NASH, sub-1B SLMs), image and video compression (AsymLLIC), semantic segmentation (LEDNet, Foot Ulcer Seg. ResAttn-U), and high-throughput generative modeling (latent diffusion, E2D2). These models are particularly relevant when the hardware in deployment (client/edge/mobile) cannot accommodate the resource profile of full-scale, symmetric, or decoder-only state-of-the-art architectures.
A general implication is that as model scaling saturates for real-world use cases with limited resources, further efficiency gains will be achieved through careful architectural rebalancing, modular design, progressive pruning, and matching the allocation of capacity to algorithmically dominant bottlenecks (e.g., decoder depth for generation, upsampling cost for vision). Lightweight encoder–decoder architectures remain an essential foundation for practical, ubiquitous deep learning deployments across modalities and application domains.