All-MLP Decoder: Pure Feed-Forward Architecture

Updated 2 March 2026

All-MLP decoders are neural architectures composed solely of MLP layers, enabling efficient parallel processing and low-latency inference.
They deliver competitive error-correction and language modeling outcomes by replacing complex mechanisms with uniform matrix operations.
These decoders adapt to varied applications—from PAC and LDPC code decoding to sparse language models and vision segmentation—while maintaining scalability on hardware accelerators.

An all-MLP decoder is a neural network-based decoding architecture in which every learnable layer is a multi-layer perceptron (MLP), with no recurrence, convolution, or explicit attention mechanism. Such decoders are used in code decoding, sequence modeling, and vision. By eschewing structural inductive biases in favor of pure feed-forward computation, they offer high parallelizability, competitive performance, and favorable throughput on parallel hardware.

1. Foundational Principles and Motivation

All-MLP decoders are motivated by the need for high-throughput, parallelizable architectures that can be efficiently mapped to hardware accelerators. Traditional decoders for codes (e.g., PAC, LDPC, convolutional codes) and sequence/vision models often require domain-specific algorithms such as belief propagation or self-attention, which impose sequential dependencies or complex routing operations. By formulating the decoder as a stack of pure MLP blocks, all-MLP architectures enable fully parallel matrix operations, low-latency inference, and the potential to closely match the error-correcting or modeling performance of more sophisticated methods, without the computational overhead of attention or recurrence (Dai et al., 2024, Cui et al., 2024, Yu et al., 2022, Ma et al., 2023, Karami et al., 2014).

2. Canonical Architectures

2.1 PAC Code Decoding

An all-MLP decoder for PAC codes comprises several densely connected feed-forward layers that map channel outputs (e.g., LLR vectors $\mathbf{x} \in \mathbb{R}^{N}$ ) to bit probability estimates ( $\mathbf{p} \in [0,1]^K$ ). Each hidden layer applies a ReLU nonlinearity, culminating in a sigmoid output for bit-wise probabilities.

Layer layout:
- Input: length $N$ (e.g., $N=16$ )
- Hidden 1: 128 neurons, ReLU
- Hidden 2: 128, ReLU
- Hidden 3: 128, ReLU
- Output: $K$ neurons (e.g., $K=8$ ), Sigmoid
Parameter count: approximately 37,000
Loss: Binary cross-entropy
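
The layer layout above determines the parameter count once $N$ and $K$ are fixed. A minimal sketch of that arithmetic, assuming plain dense layers with one bias per neuron (the helper name `mlp_param_count` is illustrative and not taken from the cited work):

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected stack.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix
        total += fan_out           # bias vector
    return total


# Example sizes from the layout: N = 16, three hidden layers of 128, K = 8.
pac_layers = [16, 128, 128, 128, 8]
print(mlp_param_count(pac_layers))  # 36232
```

For the example sizes this gives 36,232, in the same range as the quoted figure of 37,000. The remaining gap may reflect rounding or configuration details that are not stated here.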

2.2 LDPC Decoding

In the all-MLP decoder for LDPC codes (Karami et al., 2014), the network implements the code's Tanner graph as a statically-wired 2-layer MLP, with:

Input layer: $n$ neurons (code bits), initial state set to received soft values.
Output layer: $m$ neurons (parity-checks), each computes a differentiable analog XOR over its inputs using $x\oplus y = x(1-y) + y(1-x)$ . There are no trainable weights.
Decoding: The input vector is iteratively refined by gradient descent on the sum-of-squares parity-check error.

2.3 Sequence Modeling: Causal Relation Networks

Causal Relation Networks (CausalRN) implement an auto-regressive, all-MLP sequence decoder able to match Transformer performance in copying and language modeling tasks (Cui et al., 2024). Each CausalRN block computes:

Pre-activation normalization on key and query projections.
Elementwise exponential activation: $\exp(\tilde p_i + \tilde q_j)$ .
Causal sum: memory pool $M_j = \sum_{i=1}^j \exp(\tilde p_i)$ ; output $S_j = (\exp(\tilde q_j) \circ M_j)/j$ .
MLP projection and residual, followed by LayerNorm.

This structure enables $O(1)$ per-token updates via memory pooling, while the pre-activation normalization prevents the block from collapsing into an RNN.
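
The constant-cost update rests on the exponential turning a sum in the exponent into a product, which lets the query factor be pulled out of the causal sum. Written out with the symbols used above (a worked step added for clarity, not quoted from the cited paper):

$\begin{align*} \frac{1}{j}\sum_{i=1}^{j} \exp(\tilde p_i + \tilde q_j) &= \frac{1}{j}\sum_{i=1}^{j} \exp(\tilde p_i) \circ \exp(\tilde q_j) \\ &= \frac{\exp(\tilde q_j) \circ M_j}{j} = S_j, \qquad M_j = M_{j-1} + \exp(\tilde p_j) \end{align*}$

Only $M_j$ has to be carried from one position to the next, so the work per new token does not grow with $j$.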

2.4 Sparse All-MLP for Language Modeling

Efficient LLMs have been realized with sparse all-MLP decoders using both token-wise and feature-wise mixture-of-experts (MoE) (Yu et al., 2022):

Token-wise MoE: Each input token is routed to a sparse subset (typically top-1) of MLP experts.
Feature-wise MoE: Each chunk of the feature vector is independently routed to a specific expert MLP.
Routing is enforced via deterministic assignment or partial prediction.
This allows aggressive model scaling while keeping FLOPs per token constant.
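
The two routing styles can be illustrated with a small sketch. It assumes a linear gate for token-wise top-1 routing and a fixed chunk-to-expert assignment for the feature-wise case. Both are simplifications chosen for illustration, and the function names and toy experts are hypothetical rather than taken from Yu et al.:

```python
def route_top1(tokens, gate_weights):
    """Assign each token to the expert with the highest gate score.

    tokens:       list of feature vectors (lists of floats)
    gate_weights: one weight vector per expert (hypothetical linear gate)
    """
    assignments = []
    for tok in tokens:
        scores = [sum(w * x for w, x in zip(gw, tok)) for gw in gate_weights]
        assignments.append(max(range(len(scores)), key=scores.__getitem__))
    return assignments


def apply_experts(tokens, assignments, experts):
    """Run each token through its assigned expert only."""
    return [experts[e](tok) for tok, e in zip(tokens, assignments)]


def split_features(vec, num_experts):
    """Feature-wise variant: cut one vector into equal chunks, one per expert."""
    chunk = len(vec) // num_experts
    return [vec[k * chunk:(k + 1) * chunk] for k in range(num_experts)]


# Toy functions standing in for expert MLPs.
experts = [lambda v: [2.0 * x for x in v], lambda v: [x + 1.0 for x in v]]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[3.0, 1.0], [0.5, 2.0]]

assignments = route_top1(tokens, gate_weights)
outputs = apply_experts(tokens, assignments, experts)
```

Because every token visits exactly one expert, adding experts increases the parameter count without increasing the work done per token, which is the scaling property described above.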

2.5 Patch Rotate MLP Decoder for Vision

The PRSeg architecture (Ma et al., 2023) introduces a spatial mixing mechanism into all-MLP decoders for semantic segmentation:

DPR-Block: Consists of Dynamic Channel Selection Module, Patch Rotate Module, and channel-wise FC.
Patch Rotate: Reorganizes spatial pixel arrangements within a subset of channels, expanding the effective receptive field without parameters or convolution.
Channel-wise MLP processes rotated and reserved features, enabling long-range context modeling.

3. Mathematical Formulations and Decoding Workflow

PAC MLP Forward Pass

Given $x \in \mathbb{R}^N$ :

$\begin{align*} h^{(0)} &= x \\ h^{(l)} &= \mathrm{ReLU}(W^{(l)} h^{(l-1)} + b^{(l)}) \quad (l=1,2,3) \\ p &= \sigma(W^{(4)}h^{(3)} + b^{(4)}) \end{align*}$

Loss:

$L = -\frac{1}{B} \sum_{i=1}^B \sum_{j=1}^K \left[ d_j^{(i)} \log p_j^{(i)} + (1-d_j^{(i)})\log(1-p_j^{(i)}) \right]$
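
The forward pass and loss above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the Gaussian initialisation, the random stand-in data, and the clamping constant `eps` are choices made here, not details from the cited work.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def dense(W, b, h):
    """Compute W h + b for a weight matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def forward(params, x):
    """ReLU hidden layers followed by a sigmoid output layer."""
    h = x
    for W, b in params[:-1]:
        h = relu(dense(W, b, h))
    W_out, b_out = params[-1]
    return sigmoid(dense(W_out, b_out, h))


def bce_loss(probs, targets, eps=1e-12):
    """Binary cross-entropy summed over bits and averaged over the batch."""
    total = 0.0
    for p_vec, d_vec in zip(probs, targets):
        for p, d in zip(p_vec, d_vec):
            p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
            total -= d * math.log(p) + (1.0 - d) * math.log(1.0 - p)
    return total / len(probs)


def init_params(sizes, rng):
    """Small random weights and zero biases (initialisation is an assumption)."""
    return [
        ([[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


rng = random.Random(0)
params = init_params([16, 128, 128, 128, 8], rng)
llrs = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]  # batch of 4
bits = [[rng.randint(0, 1) for _ in range(8)] for _ in range(4)]
probs = [forward(params, x) for x in llrs]
loss = bce_loss(probs, bits)
```

Each input in the batch is processed independently, so in a matrix library the loop over `llrs` would become a single batched matrix product per layer.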

LDPC MLP Decoding

Let $\mathbf{c}^{(t)}$ denote the input vector at iteration $t$ :

Soft analog-XOR is applied per check node.
The sum-of-squares parity error $E(\mathbf{c})$ is minimized via gradient descent.
Update:

$c_i^{(t+1)} = c_i^{(t)} - \mu \frac{\partial E}{\partial c_i}$
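
A minimal sketch of this iteration follows, assuming soft bits in $[0,1]$, an error $E(\mathbf{c})$ equal to the sum of squared soft parities, and clipping to $[0,1]$ after each step. The clipping, the step size, and the toy check structure (which is not a real LDPC code) are assumptions made for illustration.

```python
def soft_xor(values):
    """Fold x (+) y = x(1-y) + y(1-x) over the inputs of one check."""
    acc = 0.0
    for v in values:
        acc = acc * (1.0 - v) + v * (1.0 - acc)
    return acc


def parity_error(c, checks):
    """Sum over checks of the squared soft parity."""
    return sum(soft_xor([c[i] for i in chk]) ** 2 for chk in checks)


def gradient(c, checks):
    """Analytic gradient of parity_error with respect to each soft bit."""
    grad = [0.0] * len(c)
    for chk in checks:
        s = soft_xor([c[i] for i in chk])
        for k in chk:
            # d(soft_xor)/d c_k = 1 - 2 * (soft XOR of the other inputs)
            others = soft_xor([c[i] for i in chk if i != k])
            grad[k] += 2.0 * s * (1.0 - 2.0 * others)
    return grad


def decode(received, checks, mu=0.1, steps=200):
    """Gradient descent on the parity error, starting from the received values."""
    c = list(received)
    for _ in range(steps):
        g = gradient(c, checks)
        # Clipping keeps the soft bits in [0, 1] (an assumption of this sketch).
        c = [min(1.0, max(0.0, ci - mu * gi)) for ci, gi in zip(c, g)]
    return c


# Toy example: two checks over five bits, noisy version of the all-zero word.
checks = [[0, 1, 2], [2, 3, 4]]
received = [0.1, 0.2, 0.1, 0.3, 0.2]
decoded = decode(received, checks)
hard_bits = [round(ci) for ci in decoded]
```

The gradient uses the fact that the soft XOR is linear in each individual input, so its derivative with respect to one bit depends only on the soft XOR of the remaining bits in that check.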

CausalRN Sequence Update

At each position $j$ and layer $\ell$ :

$M_j^{(\ell)} = M_{j-1}^{(\ell)} + \exp(\tilde p_j^{(\ell)})$
$S_j^{(\ell)} = (\exp(\tilde q_j^{(\ell)}) \circ M_j^{(\ell)})/j$
$x_j^{(\ell)} = \mathrm{LayerNorm}(x_j^{(\ell-1)} + W_{out}^{(\ell)} S_j^{(\ell)})$
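
The first two update equations can be checked numerically against a direct pairwise computation. The sketch below covers only the memory pool and the causal average for a single layer. The output projection, residual connection, and LayerNorm are omitted, and the function names are illustrative.

```python
import math


def causal_pool_direct(p, q):
    """Reference: average exp(p_i + q_j) over i <= j, recomputed per position."""
    out = []
    for j in range(len(q)):
        dim = len(q[j])
        s = [sum(math.exp(p[i][d] + q[j][d]) for i in range(j + 1)) / (j + 1)
             for d in range(dim)]
        out.append(s)
    return out


def causal_pool_running(p, q):
    """Same quantity using the running memory pool M_j = M_{j-1} + exp(p_j)."""
    dim = len(p[0])
    memory = [0.0] * dim
    out = []
    for j, (p_j, q_j) in enumerate(zip(p, q), start=1):
        memory = [m + math.exp(x) for m, x in zip(memory, p_j)]
        out.append([math.exp(y) * m / j for y, m in zip(q_j, memory)])
    return out


# Three positions, two features each (arbitrary example values).
p = [[0.1, -0.2], [0.3, 0.0], [-0.5, 0.4]]
q = [[0.0, 0.2], [-0.1, 0.1], [0.3, -0.3]]

direct = causal_pool_direct(p, q)
running = causal_pool_running(p, q)
```

The direct version touches every earlier position for each new token, while the running version updates a fixed-size memory vector once per token and yields the same values up to floating-point rounding.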

4. Empirical Results and Comparative Evaluation

Error-Correcting Decoding

PAC MLP decoder (Dai et al., 2024):
- Achieves FER within ≈0.1 dB of the dispersion-approximation bound for PAC(16,8).
- Outperforms CNN and RNN decoders by ≈0.5 dB FER at $10^{-3}$ error level.
- Decoding latency: 40 µs (CPU), 12 µs (GPU) per codeword.
LDPC MLP decoder (Karami et al., 2014):
- BER is within 0.1–0.2 dB of the Sum-Product Algorithm (SPA) for small codes, with nearly half the per-iteration multiplications.

Sequence and Language Modeling

CausalRN (Cui et al., 2024):
- Matches standard Transformer on copying tasks up to sequence length 256.
- Converges as fast or faster, with $O(1)$ per-token updates.
- Careful architectural ablations confirm that both exponential activation and pre-activation normalization are indispensable.
Sparse All-MLP (sMLP) (Yu et al., 2022):
- Validation perplexity matches or beats that of Transformer-MoE and dense models in both the 0.8T and 2.0T FLOPs/batch regimes.
- Training throughput: up to 2× that of comparable MoE Transformer baselines.
- Zero-shot in-context learning performance matches or exceeds that of GPT-3 and state-of-the-art Transformer MoE models.

Vision

PRSeg (Ma et al., 2023):
- PRSeg-M (ResNet-50 backbone) achieves 42.36% mIoU on ADE20K (vs. 33.15% for SegFormer baseline) with identical FLOPs.
- On Cityscapes, PRSeg-M improves mIoU by 12.79% over SegFormer (ResNet-50).
- Spatial mixing via patch rotate expands ERF, matching or exceeding attention/convolutional heads in context modeling.

5. Implementation, Complexity, and Practical Considerations

All-MLP decoders are conducive to efficient hardware implementation due to their uniform matrix multiplications and lack of dependencies across tokens (in non-autoregressive configurations). Notable implementation details from leading studies:

Decoder	Parameters	CPU Latency (µs)	GPU Latency (µs)
PAC MLP (Dai et al., 2024)	37,000	40	12
CNN (PAC baseline)	37,848	60	15
RNN (PAC baseline)	39,608	50	13

Training details: Adam optimizer, batch sizes of 320–512, up to 2²⁰ training samples or large corpora; explicit regularization is usually not needed.
Parallel structure: Matrix operations dominate (e.g., dot products in MLPs), allowing efficient GPU/TPU utilization; within a forward pass there are no sequential dependencies across inputs, so all-MLP architectures map well onto parallel hardware (Dai et al., 2024, Karami et al., 2014).

6. Limitations and Open Challenges

Receptive Field and Context Modeling: Pure channel-wise MLP decoders cannot expand their receptive field without explicit spatial mixing or other architectural modifications, as demonstrated by the need for patch rotate modules in PRSeg (Ma et al., 2023).
Expressiveness: Standard dense all-MLP models may lag in downstream task performance or long-range dependency modeling relative to attention-based models unless augmented with architectural innovations (e.g., CausalRN with expAct + pre-norm, sMLP with MoE).
Numerical Stability: Unbounded exponentials (CausalRN) can overflow; this is mitigated by a log-sum-exp formulation and normalization (Cui et al., 2024).
Scale and Convergence: For some code families (e.g., large LDPCs), complexity gains narrow as block size increases, and convergence may slow; non-convexity in iterative MLP LDPC decoding can cause local minima entrapment (Karami et al., 2014).

7. Extensions and Domain Applications

All-MLP decoders have been applied across several domains:

Code decoding: Neural decoding for PAC, LDPC, and convolutional codes (Dai et al., 2024, Karami et al., 2014).
Language modeling and sequence modeling: Sparse and causal MLP stacks rival Transformers for perplexity and in-context tasks (Cui et al., 2024, Yu et al., 2022).
Vision: PRSeg’s patch rotate architecture enables lightweight MLP segmentation decoders with competitive or superior results vs. convolutional and transformer-based heads (Ma et al., 2023).
Mixture-of-experts: Token- and feature-wise routing dramatically increases model capacity and efficiency, unlocking scaling advantages for large models (Yu et al., 2022).

This breadth of successful domain applications, coupled with their computational efficiency and parallelism, establishes all-MLP decoders as a foundational family of architectures in modern deep learning.