Transformers vs. Automata: Expressivity & Succinctness
- The paper demonstrates that transformers can recognize specific languages with only O(n) parameters, while equivalent DFAs require a doubly-exponential number of states.
- It employs algebraic and circuit-theoretic methods, using Fourier-module computations and parallel prefix scans to simulate automata efficiently.
- Empirically, transformers can learn to simulate star-free languages and, with suitable algebraic encodings, modular-counting automata, though verifying such models is computationally intractable.
Transformers are sequence models that can simulate, compress, and efficiently compute many string-processing tasks classically handled by automata. A central topic in recent research is to delineate their expressive power compared to standard finite automata (DFAs/NFAs), both qualitatively (what classes of languages or functions can be recognized) and quantitatively (how succinctly each model describes those languages). This article gives a comprehensive account of the expressivity and succinctness of transformers versus automata, drawing on both theoretical and empirical results.
1. Succinctness and Expressive Gaps: Formal Definitions and Main Theorems
Succinctness is defined as the bit-length or parameter count of a standard encoding of a recognizer for a particular language. Let $M$ be a finite object encoding a language recognizer (e.g., a transformer model, a DFA, or an LTL formula); $|M|$ denotes its encoding size in bits. Comparing representation classes, we say that a class $\mathcal{C}_1$ is (doubly-)exponentially more succinct than a class $\mathcal{C}_2$ if there is a family of languages whose smallest $\mathcal{C}_2$-descriptions are (doubly-)exponentially larger than their smallest $\mathcal{C}_1$-descriptions, i.e., an exponential (respectively doubly-exponential) function lower-bounds the blow-up in representation size for equivalent descriptions.
The key result is an explicit, provable doubly-exponential succinctness gap:
- There exist languages $L_n$ such that a UHAT (unique hard attention) transformer of size $\mathrm{poly}(n)$ recognizes $L_n$, but any DFA (or NFA) recognizing $L_n$ must have at least $2^{2^{\Omega(n)}}$ states (Theorem A) (Bergsträßer et al., 22 Oct 2025).
- For LTL (linear temporal logic), the gap is merely exponential: any LTL formula recognizing the same $L_n$ must have size at least $2^{\Omega(n)}$. Both bounds are summarized symbolically below.
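In symbols, writing $|\cdot|$ for encoding size as above, the gap reads as follows (a paraphrase of the cited statements; the exact exponents and constants should be taken from the paper itself):

$$
\min\{\,|T| : T \text{ a UHAT transformer},\ L(T)=L_n\,\} \le \mathrm{poly}(n),
\qquad
\min\{\,|A| : A \text{ a DFA or NFA},\ L(A)=L_n\,\} \ge 2^{2^{\Omega(n)}},
\qquad
\min\{\,|\varphi| : \varphi \text{ an LTL formula},\ L(\varphi)=L_n\,\} \ge 2^{\Omega(n)}.
$$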
These results are established via a construction using tiling counter languages. The transformer exploits masked attention to efficiently verify doubly-exponential-length constraints using only $\mathrm{poly}(n)$ parameters, while DFA and LTL representations unavoidably expand to astronomical size.
Transformers are exponentially more succinct than RNNs as well: an RNN with $h$ hidden units of $p$-bit precision has at most $2^{ph}$ distinct hidden configurations and is therefore simulated by a DFA with at most $2^{ph}$ states, while the same language may require only a $\mathrm{poly}(h,p)$-sized transformer (Bergsträßer et al., 22 Oct 2025).
2. Transformer Simulations of Automata: Algebraic and Circuit-Theoretic View
The simulation of automata by transformers can be made precise via representation theory. For automata whose states form a group under the transition action (e.g., a cyclic or permutation group), a constant-depth transformer of width polynomial in the group order, using Fourier-module computations, exactly simulates the automaton on any input length. For general transition monoids, the semidirect-product structure supports log-depth parallel prefix-scan algorithms in transformers, with width polynomial in the monoid size and depth logarithmic in the input length (Zhang, 29 Apr 2025).
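To make the prefix-scan view concrete, here is a minimal sketch (not the construction of the cited papers; the function name and the NumPy encoding of transitions are illustrative assumptions) that simulates an arbitrary DFA by an associative scan over its per-symbol transition functions. Each scan round corresponds to one layer of parallel composition, so $O(\log n)$ rounds cover inputs of length $n$:

```python
import numpy as np

def scan_simulate_dfa(deltas, word, q0):
    """Simulate a DFA on `word` with a Hillis-Steele prefix scan over
    transition functions (function composition is associative), mirroring
    the log-depth parallel simulation of automata.

    deltas: dict symbol -> np.ndarray of shape (|Q|,), where deltas[a][q]
            is the successor state of q on symbol a.
    Returns the list of states reached after every prefix of `word`.
    """
    prefix = [np.asarray(deltas[a]) for a in word]   # lift symbols to functions Q -> Q
    n, step = len(prefix), 1
    while step < n:                                  # O(log n) rounds
        new = list(prefix)
        for i in range(step, n):
            # Compose: apply the earlier block prefix[i - step] first, then prefix[i].
            new[i] = prefix[i][prefix[i - step]]
        prefix, step = new, 2 * step
    return [int(f[q0]) for f in prefix]

# Example: parity of the number of 1s (a two-state group automaton, Z_2).
parity = {"0": np.array([0, 1]), "1": np.array([1, 0])}
print(scan_simulate_dfa(parity, "1101", 0))  # -> [1, 0, 0, 1]
```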
Empirically, using these algebraic encodings, transformers have been trained to simulate modular group automata, complex permutation semigroups, and “shortcut” solutions via Krohn–Rhodes decompositions, achieving substantial depth savings (often $O(\log n)$ layers for length-$n$ tasks) (Liu et al., 2022, Zhang, 29 Apr 2025).
In terms of circuit complexity:
- Log-depth transformers with sufficient width and heads can simulate all regular languages (a class contained in $\mathsf{NC}^1$) in idealized (arbitrary-precision) settings (Zhang, 29 Apr 2025, Liu et al., 2022).
- For regular languages whose transition monoid is solvable, constant transformer depth (independent of input length) is sufficient.
However, when attention mechanisms and numerical precision are restricted, expressivity falls below this level: standard finite-depth, fixed-precision transformers with softmax or hard attention recognize only a strict subclass of the regular languages (Strobl et al., 2023, Bhattamishra et al., 2020).
3. Exact Language Classes Recognized: Star-Free, Regular, and Beyond
The class of languages exactly recognized by transformers is sharply dependent on architectural constraints:
- Strict masked hard attention, no position encodings: these transformers recognize exactly the star-free regular languages, i.e., those definable in first-order logic FO[<] or in linear temporal logic (LTL) (Yang et al., 2023, Strobl et al., 2023, Lin et al., 28 Sep 2025). Their automata-theoretic counterparts are the counter-free DFAs.
- Softmax attention, fixed (finite) precision: Expressivity aligns with the past-fragment of LTL (pTL) or “left-deterministic polynomial” languages, a subclass of regular languages recognized by partially ordered DFA or J-trivial monoids (Li et al., 29 May 2025).
- Average-hard attention or softmax attention with modular position encodings: these models capture regular languages definable in first-order logic extended with modular counting, encompassing parity and bounded-Dyck languages when suitable positional encodings are allowed, but still not the full regular class without further design (Strobl et al., 2023, Yang et al., 2024, Lin et al., 28 Sep 2025).
- Log-depth or CoT-augmented decoder-only transformers: with a chain of thought linear in the input length, decoder transformers simulate any regular language (all of $\mathsf{REG}$), while polynomially many CoT steps attain $\mathsf{P}$ (Merrill et al., 2023, Strobl et al., 2023); a minimal sketch of the linear-CoT idea follows this list.
- Extended context window or chain of thought: a constant-parameter, fixed-precision decoder transformer with a context window of size $s(n)$ can simulate any Turing machine using at most $O(s(n))$ space; thus, with an exponential or unbounded window, the transformer attains Turing-complete expressivity ($\mathsf{PSPACE}$ for a polynomial window) (Li et al., 22 May 2025).
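To illustrate why a linear chain of thought suffices for $\mathsf{REG}$, the following sketch (illustrative only; the function and token names are assumptions, not the cited construction) emits one DFA-state token per input symbol, so each generation step depends only on the last emitted state and the current symbol, a constant-size local update that a fixed decoder layer can realize:

```python
def recognize_with_linear_cot(delta, accepting, q0, word):
    """Recognize a regular language with a linear-length chain of thought:
    the 'decoder' appends one DFA-state token per input symbol.  Each step
    reads only the last emitted state and the next symbol, and the total
    CoT length equals the input length.

    delta: dict (state, symbol) -> state;  accepting: set of states.
    """
    cot = [q0]                                  # emitted scratchpad of states
    for a in word:
        cot.append(delta[(cot[-1], a)])         # one CoT token per input symbol
    return cot[-1] in accepting, cot

# Example: even number of 'a's (parity), a language that fixed-precision,
# constant-depth transformers without CoT fail on.
delta = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}
print(recognize_with_linear_cot(delta, {0}, 0, "abaa"))  # (False, [0, 1, 1, 0, 1])
```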
An important clarification is that standard, non-CoT, bounded-depth transformers (fixed numbers of heads and layers, finite precision) realize only a proper subclass of the regular languages; this boundary is formally characterized and lies well below the full regular class recognized by DFAs (Strobl et al., 2023, Chiang et al., 2023, Li et al., 29 May 2025).
4. Mechanisms of Succinctness: Attention, Masking, and Algebraic Compression
Transformers achieve dramatic succinctness through the ability of attention layers to “index” and “reuse” information flexibly. In Example 3.1 of (Bergsträßer et al., 22 Oct 2025), at each delimiter token (“#”) the attention mechanism lets the transformer “jump” across exponentially many positions to retrieve the relevant binary blocks, implement address increments, and resolve constraints that a DFA could only check by enumerating all counter values, hence the doubly-exponential succinctness.
The construction hinges on:
- Multiple attention heads operating in parallel, each handling a distinct aspect (address or symbol) with Boolean test logic embedded in the “value” streams.
- The association between attention masking (strict future/past) and the temporal-logic operators “until”/“since”, which maps directly onto LTL expressivity (Yang et al., 2023); see the sketch after this list.
- Boolean RASP as an intermediate language: each transformer can be compiled from, and decompiled to, an indexed Boolean program or LTL formula, with attention directly corresponding to LTL’s temporal modalities (Yang et al., 2023).
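The correspondence between past masking and the “since” modality in the second bullet can be made concrete with a small sketch (an illustration, not the compilation procedure of the cited work): evaluating the past-LTL operator “since” positionwise needs only the current position plus strictly-past information, which is exactly what a past-masked attention head exposes.

```python
def since(phi, psi):
    """Pointwise evaluation of the past-LTL formula (phi S psi):
    true at position i iff psi held at some j <= i and phi held at every
    position after j up to and including i.  The left-to-right recurrence
    uses only the current position and strictly-past results, matching the
    information available to a past-masked attention head.
    """
    out, holds = [], False
    for p, q in zip(phi, psi):
        holds = q or (p and holds)   # S(i) = psi(i) or (phi(i) and S(i-1))
        out.append(holds)
    return out

# Example: "we have seen only a's since the last '#'".
word = "a#aba#aa"
phi = [c == "a" for c in word]   # phi: current symbol is 'a'
psi = [c == "#" for c in word]   # psi: current symbol is '#'
print(since(phi, psi))
# -> [False, True, True, False, False, True, True, True]
```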
For more complex algebraic automata (semidirect products, group actions), algebraic analysis allows implementations with controlled width and depth, leveraging parallel prefix-scan algorithms and group/monoid module embeddings (Zhang, 29 Apr 2025, Liu et al., 2022).
5. Complexity of Verification and Model-Checking
The succinctness advantage comes at a significant cost in the computational complexity of analysis and verification:
- Deciding emptiness, universality, or equivalence for even the “weakest” transformers (fixed precision, unique hard attention) is EXPSPACE-complete (Bergsträßer et al., 22 Oct 2025). This contrasts with polynomial-time algorithms for the corresponding DFA problems and with PSPACE-completeness for LTL [Sistla–Clarke ’85, cited in (Bergsträßer et al., 22 Oct 2025)].
- The lower bound is obtained by reducing tiling problems (canonical EXPSPACE-complete problems) to transformer language acceptance, exploiting the capacity of masked attention to simulate doubly-exponential counters succinctly.
This fundamental complexity barrier means that the verification or minimization of transformer-based representations is likely infeasible for practical cases when leveraging their full succinctness capabilities.
6. Practical and Empirical Implications
Empirical automata extraction confirms the theoretical insights:
- For “small” or “locally checkable” regular languages (e.g., $(aa)^*$, $(abab)^*$, Tomita 1, 2, 4, 7), transformer models trained with finitely many heads, layers, and dimensions can simulate the minimal DFA exactly (Zhang et al., 2024).
- For modular counting and higher dot-depth languages (e.g., parity, mod-3 and mod-5 counting), standard transformers with fixed resources fail to generalize and cannot emulate the minimal DFA; they may only memorize finite samples, leading to vastly over-complex extracted automata or collapse to random guessing (Zhang et al., 2024, Bhattamishra et al., 2020).
Designing transformers that are guaranteed to recognize specific regular or modular languages requires architectural scaling tailored to the algebraic complexity of the target language (e.g., more heads/layers or explicit counting modules); a standard way to locate a language on this boundary is the counter-freeness test sketched below.
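The following is a plain implementation of that standard algebraic test (it is not taken from the cited papers; the function names and the tuple encoding of per-letter transitions are assumptions): a DFA is counter-free, and its language star-free, exactly when its transition monoid is aperiodic, whereas a nontrivial cycle among the powers of some transformation signals a modular counter of the kind fixed-resource transformers struggle with.

```python
def compose(g, f):
    """(g . f)(q) = g(f(q)); state transformations encoded as tuples."""
    return tuple(g[f[q]] for q in range(len(f)))

def transition_monoid(letter_maps):
    """Close the per-letter state transformations under composition."""
    gens = [tuple(m) for m in letter_maps.values()]
    monoid, frontier = set(gens), list(gens)
    while frontier:
        f = frontier.pop()
        for g in gens:
            h = compose(g, f)
            if h not in monoid:
                monoid.add(h)
                frontier.append(h)
    return monoid

def is_counter_free(letter_maps):
    """A DFA is counter-free (its language is star-free) iff its transition
    monoid is aperiodic: every element f satisfies f^k = f^(k+1) for some k."""
    for f in transition_monoid(letter_maps):
        seen, p = [], f
        while p not in seen:          # iterate the powers f, f^2, f^3, ...
            seen.append(p)
            p = compose(p, f)
        if p != seen[-1]:             # powers cycle with period > 1: a counter
            return False
    return True

# "Contains at least one a" is star-free; parity of a's is not.
contains_a = {"a": (1, 1), "b": (0, 1)}
parity_a   = {"a": (1, 0), "b": (0, 1)}
print(is_counter_free(contains_a), is_counter_free(parity_a))  # True False
```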
7. Conceptual and Complexity-Theoretic Landscape
Summarizing across model classes, attention regimes, and resource constraints, the expressivity—and succinctness—of transformers relative to automata is governed by:
| Transformer Type / Resource | Attention | Positional Encoding | Expressive Class | Succinctness Gap to DFA | Verification Complexity |
|---|---|---|---|---|---|
| Masked UHAT, no PE | Hard argmax | None | Star-free (LTL) | Doubly exponential | EXPSPACE-complete |
| Masked softmax, finite prec. | Softmax | None | Past-LTL (pTL) | Exponential (to LTL) | — |
| Log-depth, unbounded width | Hard/Soft | Sufficient | Regular (DFAs) | Polynomial | P |
| Bounded-depth/width/prec. | Hard/Soft | Limited | Sub-regular | — | P |
| Linear/Poly CoT + Large Context | Softmax | General | Context-sensitive/P/RE | (trivial) | — |
The boundary circumscribing transformer expressivity thus sits sharply at the intersection of circuit-theoretic, algebraic, and logic-based characterizations, and the succinctness explosion is intricately coupled to complexity-theoretic intractability (Bergsträßer et al., 22 Oct 2025, Strobl et al., 2023, Lin et al., 28 Sep 2025).
References
- "Transformers are Inherently Succinct" (Bergsträßer et al., 22 Oct 2025)
- "Partial Answer of How Transformers Learn Automata" (Zhang, 29 Apr 2025)
- "Automata Extraction from Transformers" (Zhang et al., 2024)
- "Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages" (Yang et al., 2023)
- "The Role of Logic and Automata in Understanding Transformers" (Lin et al., 28 Sep 2025)
- "On the Ability and Limitations of Transformers to Recognize Formal Languages" (Bhattamishra et al., 2020)
- "Transformers Learn Shortcuts to Automata" (Liu et al., 2022)
- "What Formal Languages Can Transformers Express? A Survey" (Strobl et al., 2023)