
FECCT: Unified Transformer for Error Correction

Updated 9 February 2026
  • FECCT is a unified, transformer-based soft-decoder that standardizes error correction across LDPC, Polar, and BCH codes.
  • It employs a parity-aware sparse attention mechanism and a code-agnostic input encoding to reduce decoding complexity and latency.
  • The architecture enables plug-and-play adaptation with state-of-the-art BER and BLER performance while minimizing computational footprint.

The Foundation Error Correction Code Transformer (FECCT) is a unified, transformer-based soft-decoder architecture designed to decode a diverse range of error-correcting codes—such as LDPC, Polar, and BCH codes—using a single model backbone. FECCT replaces bespoke, code-specific hardware or neural decoders with a code-agnostic, parameter-sharing framework that utilizes standardized input processing, a parity-aware sparse attention mechanism, and unified model parameters. As demonstrated in recent works, FECCT achieves state-of-the-art or superior bit-error rate (BER) and block-error rate (BLER) performance while providing large reductions in decoding complexity, latency, and memory footprint. Its architecture and training regime enable plug-and-play adaptation to arbitrary code families and lengths, a feature that underpins its foundation-model designation for physical-layer channel decoding (Yan et al., 2024, Choukroun et al., 2022, Cho et al., 2 Feb 2026).

1. Unified Model Architecture and Input Encoding

FECCT fundamentally restructures the decoding pipeline for linear block codes by employing a multi-layer transformer encoder operating on standardized, code-agnostic representations. The input pipeline processes the real-valued noisy channel output $y \in \mathbb{R}^n$ (after BPSK modulation and an AWGN channel), generating a concatenation of:

  • the absolute values (LLR magnitudes) $|y| \in \mathbb{R}^n$,
  • the hard-decision syndrome $s(y) = H \cdot \mathrm{bin}(\mathrm{sign}(y)) \in \{0,1\}^{n-k}$,

forming $\tilde y = [\,|y|,\; s(y)\,]$. To enable code-agnostic inference, both $|y|$ and $s(y)$ are zero-padded to the global maxima $N_\mathrm{max}$ and $S_\mathrm{max}$ determined across all supported codes:

$$\tilde y = \big[\, [\,|y|,\ 0_{N_\mathrm{max} - n}\,],\ [\, s(y),\ 0_{S_\mathrm{max} - (n-k)}\,] \,\big] \in \mathbb{R}^{N_\mathrm{max} + S_\mathrm{max}}$$

A learnable embedding matrix $E \in \mathbb{R}^{(2n-k) \times d_k}$ maps $\tilde y$ elementwise to $X = \tilde y \odot E$.

Each input sequence $X$ is then processed by $L$ stacked transformer encoder layers, each augmented with a custom sparse, parity-aware unified attention module and position-wise feed-forward sublayers, followed by a two-layer MLP to produce $n$ soft multiplicative-noise estimates, which are mapped to the final bit estimates $\hat{x}$. This enables training and inference with mixed code types, lengths, and rates, without architectural or model state changes (Yan et al., 2024).
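The input construction above can be sketched in a few lines of NumPy. The helper name, the toy (7,4) parity-check matrix, and the padding maxima below are illustrative assumptions, not taken from the papers' code:

```python
import numpy as np

def encode_input(y, H, n_max, s_max):
    """Build the code-agnostic FECCT input [ |y| padded, s(y) padded ] from
    a noisy BPSK/AWGN channel vector y and parity-check matrix H.
    (Hypothetical helper; a sketch of the pipeline described above.)"""
    n = y.shape[0]
    m = H.shape[0]                        # number of checks, n - k
    hard = (y < 0).astype(np.uint8)       # bin(sign(y)): +1 -> 0, -1 -> 1
    syndrome = (H @ hard) % 2             # s(y) in {0,1}^(n-k)
    out = np.zeros(n_max + s_max)
    out[:n] = np.abs(y)                   # [|y|, 0_{n_max - n}]
    out[n_max:n_max + m] = syndrome       # [s(y), 0_{s_max - (n-k)}]
    return out

# Toy example with a (7, 4) Hamming parity-check matrix
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]], dtype=np.uint8)
y = np.array([0.9, -0.2, 1.1, 0.7, 1.3, 0.8, 1.0])  # bit 1 looks flipped
x = encode_input(y, H, n_max=16, s_max=8)
print(x.shape)  # (24,)
```

All codes in the supported set map to this same fixed length, which is what lets a single set of transformer weights serve every code.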

2. Parity-Constrained Unified Attention Mechanism

The core innovation in FECCT’s architecture is the unified attention module that explicitly encodes code structure via sparsity masks derived from the code’s parity-check matrix $H$:

$$\begin{aligned} Q &= XW^Q, \quad K = XW^K, \quad V = XW^V, \\ s_{ij} &= \frac{Q_i K_j^T}{\sqrt{d_k}} + M_{ij}, \\ A_{ij} &= \mathrm{softmax}_j(s_{ij}), \\ \mathrm{UniAttn}(Q, K, V) &= AV. \end{aligned}$$

The sparse mask $M_{ij}$ is defined from the extended bipartite adjacency matrix $\overline{H}$: $M_{ij} = 0$ if $\overline{H}_{ij} = 1$ and $M_{ij} = -\infty$ otherwise, so attention is permitted only along edges representing parity constraints (variable and check nodes in the Tanner graph). This compresses quadratic attention to $O(H_d N d_\ell)$, where $H_d$ is the parity-check matrix density, injecting hard code constraints while improving decoding accuracy and computational efficiency (Yan et al., 2024, Choukroun et al., 2022, Park et al., 2024).
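A single-head version of this masked attention can be sketched as follows. The extended-adjacency construction with self-loops is one plausible reading of $\overline{H}$, not the papers' exact definition:

```python
import numpy as np

def extended_adjacency(H):
    """Symmetric adjacency over the n + (n-k) tokens (variable + check
    nodes): self-loops plus Tanner-graph edges. One plausible construction
    of the extended bipartite matrix described above."""
    m, n = H.shape
    A = np.eye(n + m, dtype=np.uint8)
    A[:n, n:] = H.T
    A[n:, :n] = H
    return A

def parity_attention_mask(A):
    """Additive mask M: 0 on graph edges, -inf elsewhere."""
    return np.where(A == 1, 0.0, -np.inf)

def masked_attention(Q, K, V, M):
    """Single-head scaled dot-product attention with additive mask M."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    W = np.exp(scores)                             # exp(-inf) -> exactly 0
    W /= W.sum(axis=-1, keepdims=True)
    return W @ V, W

H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]], dtype=np.uint8)
A = extended_adjacency(H)              # 10 x 10 for this (7, 4) code
M = parity_attention_mask(A)
X = np.random.default_rng(1).normal(size=(10, 4))
out, W = masked_attention(X, X, X, M)  # W is zero off the Tanner edges
```

Only entries of `W` on graph edges are nonzero, which is exactly the sparsity the complexity bound above exploits.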

3. Generalization Across Code Families and Rates

By using zero-padding and shared embedding/transformer modules, FECCT handles arbitrary codes with no changes to model topology. During training, samples from multiple code families (e.g., Polar (128,64), LDPC (96,48), BCH (63,36)) with varied lengths and code rates are mixed, and the same model backbone learns a general decoding mapping. Only the mask $M(H)$ is code-dependent; no per-code retraining or separate weights are required, demonstrating strong generalization across codes (Yan et al., 2024, Cho et al., 2 Feb 2026).

Ablation studies indicate that training a single FECCT model on a mixture of code families does not degrade code-specific performance compared to separate models, confirming robust code-agnostic behavior. Growth in model depth (number of transformer layers) yields greater performance gains than simply increasing embedding dimension, underscoring the architectural importance of deep, nonlinear interaction modeling (Yan et al., 2024).

4. Training Objectives and Regularization

FECCT is supervised to output per-bit posteriors for the underlying multiplicative noise $z_i \in \{0,1\}$, employing a standard binary cross-entropy loss:

$$\mathcal{L} = -\sum_{i=1}^n \big[ \tilde{z}_i \log f_\theta(y)_i + (1-\tilde{z}_i) \log \big(1-f_\theta(y)_i\big) \big]$$

The attention mask sparsity acts as a strong structural prior. Empirically, including the mask reduces training loss by $45\%$, accelerates convergence, and improves error rates by $0.4$–$0.5$ dB in both BER and BLER at relevant thresholds (Yan et al., 2024, Park et al., 2024). No iterative message passing or code-specific post-processing is required: all error correction is achieved in a fixed number of transformer passes.
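A minimal NumPy sketch of this objective, together with the implied bit-recovery rule (the flip-at-$1/2$ decision and the function names are our illustrative reading, not the papers' code):

```python
import numpy as np

def bce_loss(z_hat, z, eps=1e-12):
    """Binary cross-entropy between predicted noise posteriors z_hat
    (values in (0,1)) and multiplicative-noise targets z in {0,1}."""
    z_hat = np.clip(z_hat, eps, 1 - eps)
    return -np.sum(z * np.log(z_hat) + (1 - z) * np.log(1 - z_hat))

def decode_bits(y, z_hat):
    """Flip the hard decision wherever the predicted noise posterior
    exceeds 1/2 (illustrative decision rule, assumed for this sketch)."""
    hard = (y < 0).astype(np.uint8)
    flip = (z_hat > 0.5).astype(np.uint8)
    return hard ^ flip

y = np.array([0.8, -0.3, 1.1])          # bit 1 looks flipped by noise
z_hat = np.array([0.05, 0.92, 0.10])    # network: bit 1 probably flipped
z = np.array([0, 1, 0])                 # ground-truth multiplicative noise
print(bce_loss(z_hat, z))               # small: predictions match targets
print(decode_bits(y, z_hat))            # [0 0 0]: the flip is corrected
```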

5. Decoding Efficiency and Empirical Performance

The sparse attention mechanism drastically reduces computational complexity. Each transformer layer runs in $O(H (H_d d_k d_\ell N + d_k^2))$ FLOPs versus the standard $O(H (d_k N^2 + d_k^2))$, and wall-clock latency is similarly reduced. In practice, FLOPs are reduced by up to $83\%$ per block, enabling full-decoder throughput improvements over, e.g., neural BP decoders (Yan et al., 2024, Cho et al., 2 Feb 2026).
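The source of the savings is easy to see with a back-of-the-envelope count of the score-matrix term. The token count, head width, and mask density below are assumed round numbers for illustration, not figures from the papers:

```python
# Back-of-the-envelope FLOP count for the attention score matrix of one
# layer; every constant here is an assumption for illustration only.
n_tokens = 192      # 2n - k tokens for a (128, 64) code
d_k = 32            # per-head embedding width (assumed)
density = 0.06      # fraction of mask-allowed entries (assumed H_d-like value)

dense_flops = n_tokens ** 2 * d_k              # full Q K^T
sparse_flops = density * n_tokens ** 2 * d_k   # only mask-allowed pairs
reduction = 1 - sparse_flops / dense_flops
print(f"score-term FLOPs cut by {reduction:.0%}")  # 94% under these assumptions
```

The reduction scales directly with the sparsity of the parity-check matrix, which is why low-density codes benefit most.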

Empirical evaluation across Polar, LDPC, and BCH codes shows that the unified FECCT model closely matches or surpasses code-optimized neural and classical decoders. Notable results include BER $10^{-5}$ on Polar(128,64) at $E_b/N_0 = 5$ dB (a $0.3$ dB improvement over SCL-8, and on par with normalized min-sum LDPC decoding down to $10^{-6}$). Joint training does not compromise per-family performance (Yan et al., 2024, Park et al., 2024).

6. Model Compression and Deployment: Spectral-Aligned Pruning

Recent advances leverage model compression techniques tailored to FECCT’s universal role (Cho et al., 2 Feb 2026). Spectral-Aligned Pruning (SAP) creates structured pruning masks for attention heads and feed-forward channels, guided by the spectrum of the PCM-induced bipartite graph. Pruning reduces backbone FLOPs and parameters by $40\%$, while cross-code reuse of masks is enabled by comparing spectral signatures (top $K$ eigenvalues). For new codes, the nearest-neighbor pruning mask (with respect to spectral distance) is reused; if similarity is too low, a new mask is derived, supporting memory-efficient many-code libraries.
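The signature-matching step can be sketched as follows. The symmetric-adjacency spectrum and the Euclidean nearest-neighbor rule are our illustrative reading of SAP, not its exact procedure:

```python
import numpy as np

def spectral_signature(H, k=5):
    """Top-k eigenvalue magnitudes of the symmetric bipartite adjacency
    induced by the parity-check matrix H (illustrative SAP-style signature)."""
    m, n = H.shape
    A = np.zeros((n + m, n + m))
    A[:n, n:] = H.T
    A[n:, :n] = H
    return np.sort(np.abs(np.linalg.eigvalsh(A)))[::-1][:k]

def nearest_mask(sig, library, threshold=1.0):
    """Return the library code whose signature is closest in Euclidean
    distance, or None if nothing is within threshold (then a new pruning
    mask would be derived). Hypothetical policy, assumed for illustration."""
    best, best_d = None, np.inf
    for name, lib_sig in library.items():
        d = np.linalg.norm(sig - lib_sig)
        if d < best_d:
            best, best_d = name, d
    return best if best_d <= threshold else None

H_a = np.array([[1, 1, 0, 1], [0, 1, 1, 1]])
H_b = np.array([[1, 0, 1, 1], [1, 1, 1, 0], [0, 1, 0, 1]])
library = {"code_a": spectral_signature(H_a), "code_b": spectral_signature(H_b)}
print(nearest_mask(spectral_signature(H_a), library))  # code_a (distance 0)
```

Storing only a short signature per code keeps the lookup cheap even for large code libraries.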

Performance after SAP and per-code LoRA low-rank adaptation remains within $0.05$–$0.15$ in negative log BER relative to dedicated per-code pruning, with only $7\%$ additional adapter parameters stored per code. The SAP approach maintains the universal-decoder property, permitting practical FECCT deployment in memory- and compute-constrained environments (Cho et al., 2 Feb 2026).
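The per-code LoRA adaptation can be illustrated as a frozen backbone weight plus a low-rank update. Shapes, the rank, and the parameter-count comparison are assumptions for illustration, not the papers' exact parameterization:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Frozen backbone weight W with a per-code low-rank update:
    y = x (W + alpha * A B). Only A and B are stored per code.
    (Generic LoRA sketch, assumed for illustration.)"""
    return x @ W + alpha * (x @ A) @ B

rng = np.random.default_rng(2)
d, r = 64, 4                        # backbone width and adapter rank (assumed)
W = rng.normal(size=(d, d))         # shared, frozen across all codes
A = rng.normal(size=(d, r)) * 0.01  # per-code adapter factor
B = np.zeros((r, d))                # standard LoRA init: update starts at zero
x = rng.normal(size=(1, d))

assert np.allclose(lora_forward(x, W, A, B), x @ W)  # B = 0: no change yet
print(f"adapter params per layer: {2 * d * r} vs full {d * d}")
```

Because only the small factors `A` and `B` are stored per code, the per-code storage overhead stays a small fraction of the backbone.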

7. Broader Implications and Foundation-Model Role

FECCT represents a paradigm shift in channel decoder design for emerging wireless standards (notably 6G), where a single hardware/software module can support heterogeneous code families, lengths, and rates simply by reloading the correct parity-check-derived mask. This enables real-time, flexible error correction with a reduced development footprint and opens the door for rapid adoption of new code families.

By consolidating code-specific pipelines into a highly adaptable transformer backbone, FECCT assumes the role of a physical-layer foundation model. Its flexible input/output interface and code-agnostic architecture allow seamless transfer and fine-tuning for new codes, pragmatic scaling with model-pruning, and practical deployment in latency- and memory-constrained scenarios (Yan et al., 2024, Cho et al., 2 Feb 2026). This foundation-model property is further leveraged by transfer learning and modular adaptation strategies as outlined in related transformer ECC research (Wang et al., 2023, Park et al., 2024).


Key Papers

  • Error Correction Code Transformer: From Non-Unified to Unified (Yan et al., 2024) — Unified code-agnostic transformer ECC architecture
  • Error Correction Code Transformer (Choukroun et al., 2022) — Early transformer decoder with parity-mask
  • Spectral-Aligned Pruning for Universal Error-Correcting Code Transformers (Cho et al., 2 Feb 2026) — SAP for efficient, pruned foundation ECC decoders
  • CrossMPT: Cross-attention Message-Passing Transformer for ECC (Park et al., 2024) — Masked message-passing transformer, code modularity
  • Transformer-QEC: Quantum Error Correction Code Decoding (Wang et al., 2023) — Variable-length, foundation transformer for QEC
