Encoder-Only Transformer
- Encoder-Only Transformers are neural network architectures that rely solely on stacked self-attention layers to process input bidirectionally, underpinning models like BERT and RoBERTa.
- They demonstrate improved sample efficiency and, for certain functions, constant-depth expressive power unavailable to decoder-only models, with strong results in arithmetic, in-context learning, and structured perception.
- Despite strong performance in language understanding, encoder-only models face high computational costs for autoregressive tasks, prompting research into hybrid architectures and efficient caching methods.
An encoder-only transformer is a neural network architecture in which all computation is performed by stacked self-attention layers, without a separate decoder module. In this scheme, each token in the input sequence can attend bidirectionally to every other token, and all outputs are produced from the resulting contextual representations. Encoder-only models are foundational in language understanding tasks (e.g., BERT, RoBERTa), have gained traction in generative modeling and structured prediction, and are increasingly investigated as universal sequence processors, especially in regimes where generation efficiency is not a binding constraint.
1. Encoder-Only Transformer Architecture
The canonical encoder-only transformer accepts a token sequence $x_1, \dots, x_n$ drawn from a vocabulary $\mathcal{V}$. Each token $x_i$ is mapped to an embedding $e_i \in \mathbb{R}^d$, with an added learnable positional encoding $p_i$, giving the initial hidden state $h_i^{(0)} = e_i + p_i$. The stacked encoder comprises $L$ identical blocks. Each block $\ell$, at position $i$:
- Projects the previous-layer hidden state $h_i^{(\ell-1)}$ into queries, keys, and values (with learned matrices $W_Q, W_K, W_V$): $q_i = W_Q h_i^{(\ell-1)}$, $k_i = W_K h_i^{(\ell-1)}$, $v_i = W_V h_i^{(\ell-1)}$.
- Applies full self-attention over all positions: $a_i = \sum_{j=1}^{n} \operatorname{softmax}_j\!\big(q_i^\top k_j / \sqrt{d}\big)\, v_j$.
- Uses residual connections, layer normalization, and an MLP for further transformation: $h_i^{(\ell)} = \operatorname{LN}\big(u_i + \operatorname{MLP}(u_i)\big)$, where $u_i = \operatorname{LN}\big(h_i^{(\ell-1)} + a_i\big)$.
The output representation at a specific position (e.g., for next-token prediction) is typically fed into a linear + softmax head for task targets (Ewer et al., 2024).
Compared to encoder-decoder or decoder-only designs, encoder-only transformers operate without causal masks by default, granting bidirectional or global context to all positions.
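The following is a minimal, self-contained sketch of a single encoder block as described above (single-head attention, post-layer-norm, no causal mask); the class name and hyperparameters are illustrative and not taken from any cited implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderBlock(nn.Module):
    """One bidirectional (unmasked) self-attention block: attention + MLP,
    each wrapped in a residual connection and layer normalization."""
    def __init__(self, d_model: int = 64, d_ff: int = 256):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.scale = d_model ** 0.5

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (batch, n, d_model)
        q, k, v = self.w_q(h), self.w_k(h), self.w_v(h)
        # Full bidirectional attention: every position attends to every other position.
        attn = F.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        h = self.ln1(h + attn @ v)      # residual + layer norm around attention
        h = self.ln2(h + self.mlp(h))   # residual + layer norm around the MLP
        return h

# Usage: stack L such blocks over embedded tokens with positional encodings added.
x = torch.randn(2, 10, 64)              # (batch, sequence length, d_model)
print(EncoderBlock()(x).shape)          # torch.Size([2, 10, 64])
```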
2. Expressive Power, Complexity, and Theoretical Properties
Encoder-only and decoder-only transformers each possess unique expressive regimes. Formal results show:
- There exist causal functions computable by constant-depth encoder-only transformers that no decoder-only transformer of comparable size can model [(Ewer et al., 2024), Theorem 4.2]. Explicitly, the "Count3" function is computable by a constant-depth encoder-only transformer but requires depth that grows with the input length $n$ in any decoder-only transformer.
- Conversely, decoder-only transformers also possess function classes inaccessible to encoder-only models of bounded depth [(Ewer et al., 2024), Theorem 4.1].
From the computational perspective:
- For sequence generation, each new token prediction with an encoder-only transformer incurs full $O(n^2)$ per-token computation and $O(n^3)$ for a full sequence of length $n$, as the transformer must recompute attention over the augmented input at each step.
- Decoder-only models, leveraging KV-caching and causal attention, achieve $O(n)$ per token and $O(n^2)$ for the entire sequence.
In space complexity, encoder-only models keep no persistent state between generation steps (all activations are recomputed), while decoder-only models maintain past key-value matrices, adding $O(d)$ memory per generated token (Ewer et al., 2024).
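A schematic sketch of why the two generation regimes differ: an encoder-only next-token predictor re-encodes the full prefix at every step (quadratic attention cost per step), whereas a decoder-only model with a KV cache attends from only the newest token to cached keys and values (linear cost per step). The counters below tally attention-score computations only; they are illustrative, not a benchmark of any cited system.

```python
def encoder_only_generation_cost(n: int) -> int:
    """Attention-score ops to generate n tokens when the whole prefix is
    re-encoded from scratch at every step: sum of t^2 => O(n^3) total."""
    return sum(t * t for t in range(1, n + 1))

def decoder_only_generation_cost(n: int) -> int:
    """Attention-score ops with causal attention and a KV cache: the new query
    attends to t cached positions at step t => sum of t => O(n^2) total."""
    return sum(t for t in range(1, n + 1))

for n in (16, 64, 256):
    print(n, encoder_only_generation_cost(n), decoder_only_generation_cost(n))
# The first cost column grows roughly like n^3 / 3, the second like n^2 / 2.
```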
3. Empirical Performance and Applications
Encoder-only transformers have shown empirical advantages and competitive results across a range of tasks:
- Arithmetic and Algorithmic Tasks: On the Count3 synthetic benchmark introduced in (Ewer et al., 2024), encoder-only models reach perfect accuracy with constant-depth architectures, while decoder-only models, including large-scale ones (e.g., Llama-3-8B, GPT-4o), consistently fail to learn the task.
- Addition and Generalization: Encoder-only transformers trained on arithmetic addition demonstrate better sample efficiency and superior length generalization compared to decoder-only models—maintaining robust performance as input length grows.
- In-Context Learning: Encoder-only architectures outperform or match decoders in regression tasks on function classes such as linear, sparse linear, deep neural nets, and decision trees.
- Language Modeling: On OpenWebText, encoder-only models achieve validation perplexity comparable to, and slightly better than, decoder-only models (104.6 vs. 110.5) (Ewer et al., 2024).
- Structured Perception: TABLET—a dense table structure recognition model—utilizes three encoder-only transformer modules to partition and merge table rows/columns, achieving state-of-the-art structure accuracy and throughput compared to autoregressive baselines (Hou et al., 8 Jun 2025).
- ASR: The UniEnc-CASSNAT model demonstrates that ASR can be realized with a single encoder and two passes, attaining state-of-the-art non-autoregressive WERs and parameter efficiency (Fan et al., 2024). Aligner-Encoders further show that a single conformer-style encoder can autonomously perform alignment for speech-to-text conversion, drastically reducing inference latency (Stooke et al., 6 Feb 2025).
- Recommendation: Sequential Masked Modeling with encoder-only architectures (e.g., BERT-SMM) surpasses both standard MLM and causal LLMs in next-item session-based recommendation, especially when using penultimate-token masking and window sliding (Redjdal et al., 2024).
- Machine Translation: Single-stack encoder-only transformers can match standard encoder-decoder transformers in BLEU across bilingual, monolingual-augmented, and multilingual NMT benchmarks, given matched parameterization and block-masking on the concatenated source-target sequence (Gao et al., 2022); a sketch of such a block mask follows this list.
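As referenced in the machine-translation item above, a plausible form of block masking over the concatenated source-target sequence lets source positions attend bidirectionally to the source only, while target positions attend to the full source plus a causal prefix of the target. The construction below is a hedged sketch consistent with that description, not the exact mask used by Gao et al. (2022).

```python
import torch

def block_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    """Boolean attention mask for a concatenated [source; target] sequence.
    mask[i, j] == True means position i may attend to position j."""
    n = src_len + tgt_len
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:src_len, :src_len] = True      # source: full bidirectional attention over the source
    mask[src_len:, :src_len] = True      # target: attends to the entire source
    causal = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))
    mask[src_len:, src_len:] = causal    # target: causal attention over the target prefix
    return mask

print(block_mask(3, 2).int())            # 5x5 mask: top-left 3x3 block of ones, causal bottom-right
```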
4. Convergence Analysis and Training Dynamics
Comprehensive theoretical analysis of encoder-only transformers under finite-width regimes demonstrates:
- With He/LeCun initialization and standard attention scaling, global convergence of shallow encoder-only transformers is guaranteed when the model width grows cubically with the dataset size $n$. With NTK scaling, quadratic width suffices, although attention degenerates to uniform pooling (Wu et al., 2023).
- Infinite-width NTK analysis indicates only linear overparameterization is required for convergence, but such networks lose the representational power associated with nontrivial attention mechanisms.
These results clarify that attention scaling and initialization scheme selection crucially affect both convergence and the expressive regime of the model. Feature learning (finite-width regime) is preferable where nontrivial attention is desired (Wu et al., 2023).
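To make the "attention degenerates to uniform pooling" point concrete, the snippet below contrasts the usual $1/\sqrt{d}$ logit scaling with a $1/d$ scaling, taken here only as a stand-in for the NTK-regime scaling discussed in (Wu et al., 2023): with Gaussian queries and keys, logits scaled by $1/d$ shrink as width grows, so the softmax approaches the uniform distribution (entropy near $\log n$).

```python
import torch

def attention_entropy(d: int, n: int = 64, scale_power: float = 0.5) -> float:
    """Entropy of one attention row with logits q·k_j / d**scale_power,
    for random Gaussian q and keys; uniform attention has entropy log(n)."""
    q = torch.randn(d)
    K = torch.randn(n, d)
    probs = torch.softmax((K @ q) / d ** scale_power, dim=-1)
    return float(-(probs * probs.log()).sum())

for d in (16, 256, 4096):
    print(d,
          round(attention_entropy(d, scale_power=0.5), 3),   # 1/sqrt(d): logits stay O(1), attention stays non-uniform
          round(attention_entropy(d, scale_power=1.0), 3))   # 1/d: entropy approaches log(64) ≈ 4.159 (uniform pooling)
```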
5. Encoder-Only Transformers in Model Adaptation and Pooling
Adapting decoder-only transformers for encoder tasks has proven effective:
- Gemma Encoder demonstrates that switching a decoder-only (causal) transformer to bidirectional attention, attaching a lightweight pooling+MLP head, and applying dropout during fine-tuning yields models that outperform or match best-in-class encoder-only architectures across classification, regression, and ranking tasks (e.g., GLUE, MS MARCO) (Suganthan et al., 4 Mar 2025).
- Various pooling strategies (mean pooling, last-token pooling, attention pooling with learned query probes) have been benchmarked; in practical scenarios, simple mean or last-token pooling is often optimal when fine-tuning on moderate corpora (a minimal sketch of these variants appears below).
This approach provides a general recipe for transforming large pre-trained causal LLMs into high-performing encoder-only models.
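The pooling variants mentioned above can be sketched as follows; shapes and names are hypothetical, and the attention-pooling variant uses a single learned query probe over the token representations, roughly in the spirit of the probing heads benchmarked in (Suganthan et al., 4 Mar 2025), not their exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooler(nn.Module):
    """Attention pooling with a single learned query probe over token states."""
    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(d_model))

    def forward(self, h: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # h: (batch, n, d); mask: (batch, n), True at real (non-padding) tokens
        scores = h @ self.query / h.shape[-1] ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return (F.softmax(scores, dim=-1).unsqueeze(-1) * h).sum(dim=1)

def mean_pool(h, mask):
    m = mask.unsqueeze(-1).float()
    return (h * m).sum(1) / m.sum(1).clamp(min=1.0)

def last_token_pool(h, mask):
    idx = mask.sum(1) - 1                # index of the last real token per sequence
    return h[torch.arange(h.shape[0]), idx]

h = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=torch.bool)
print(mean_pool(h, mask).shape, last_token_pool(h, mask).shape, AttentionPooler(8)(h, mask).shape)
```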
6. Interpretability, Reasoning, and Analysis
Encoder-only transformers exhibit layers with specialized processing roles:
- Fine-tuned models can achieve high accuracy on formal reasoning tasks (deductive logic, FOL, PC), but performance is dataset-specific and reasoning ability localizes in higher transformer layers. Transfer across datasets remains weak, indicating reliance on data heuristics rather than true inference (Pirozelli et al., 2023).
- Attention visualizer frameworks leveraging encoder-only models (e.g., RoBERTa-base) distill self-attention patterns into word-level importance maps, occlusion analyses, and heatmap visualizations; these tools highlight how attention is distributed over named entities, word structure, and other salient tokens (Falaki et al., 2023), as sketched below.
Such analysis accelerates model debugging, interpretability research, and the development of targeted probing methodologies.
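A minimal sketch of extracting word-level importance from self-attention, along the lines of the visualizer tooling described above: it simply averages the attention each token receives across layers and heads using the Hugging Face transformers API with roberta-base, and is an illustration rather than the cited framework itself.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base", output_attentions=True)
model.eval()

text = "Encoder-only transformers attend bidirectionally to every token."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.attentions: tuple of (batch, heads, seq, seq) tensors, one per layer.
attn = torch.stack(out.attentions).mean(dim=(0, 2))   # average over layers and heads -> (batch, seq, seq)
importance = attn[0].mean(dim=0)                      # mean attention received per token position

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in sorted(zip(tokens, importance.tolist()), key=lambda t: -t[1])[:5]:
    print(f"{tok:>12s}  {score:.3f}")                 # top-5 tokens by attention received
```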
7. Limitations and Future Directions
While encoder-only transformers display impressive theoretical and empirical properties, several limitations and open questions persist:
- Autoregressive Generation Efficiency: The $O(n^2)$ per-token and $O(n^3)$ per-sequence computational costs make encoder-only transformers impractical for large-scale or real-time generative applications, pending novel architectural or systems-level acceleration techniques (Ewer et al., 2024).
- Sample Efficiency in Complex Regimes: On tasks requiring strong transfer, logical generalization, or rare pattern discrimination, encoder-only models can overfit to spurious dataset signals without acquiring genuine general capabilities (Pirozelli et al., 2023).
- Handling Global Context and Length: Efficiently marrying encoder-only design with very long sequence processing (document-level, speech, code) or streaming/online requirements remains under-explored.
- Hybrid and Adapted Architectures: Open questions include the development of effective KV-caching for encoder-only models (to ameliorate generation costs), hybrid encoder-decoder-encoder architectures, and systematic exploration of encoder-only fine-tuning for vast, instruction-tuned LLMs (Ewer et al., 2024, Suganthan et al., 4 Mar 2025).
- Theory for Deep Stacks and Partial Attention: While convergence is well understood for shallow, single-head encoder-only models, deeper stacks, multi-head variants, and residual/skip connections call for sharper non-asymptotic theory.
A plausible implication is that encoder-only architectures will assume a larger role as compute constraints relax and as tasks move toward more general, unified sequence modeling regimes. Hybrid approaches, novel caching, or algorithmic innovations may unlock broader deployment in autoregressive or multimodal domains.