
Transformers vs. Automata: Expressivity & Succinctness

Updated 20 November 2025
  • The paper demonstrates that transformers can recognize specific languages with only O(n) parameters, while equivalent DFAs require a doubly-exponential number of states.
  • It employs algebraic and circuit-theoretic methods, using Fourier-module computations and parallel prefix scans to simulate automata efficiently.
  • Empirical insights reveal that transformer models can simulate star-free and modular languages, though verification remains computationally challenging.

Transformers are sequence models that can simulate, compress, and efficiently compute many string-processing tasks classically handled by automata. A central topic in recent research is to delineate their expressive power compared to standard finite automata (DFAs/NFAs), both qualitatively (what classes of languages or functions can be recognized) and quantitatively (how succinctly each model describes those languages). This article gives a comprehensive account of the expressivity and succinctness of transformers versus automata, drawing on both theoretical and empirical results.

1. Succinctness and Expressive Gaps: Formal Definitions and Main Theorems

Succinctness is measured as the bit-length or parameter count of a standard encoding of a recognizer for a particular language. Let $\mathcal{R}$ be a finite object encoding a language recognizer (e.g., a transformer model, a DFA, or an LTL formula); $|\mathcal{R}|$ denotes its encoding size in bits. Comparing representation classes, we say class $\mathcal{C}_1$ is exponentially (respectively, doubly-exponentially) more succinct than class $\mathcal{C}_2$ if the blow-up in representation size for equivalent descriptions is not bounded by any function in $2^{o(n)}$ (respectively, $2^{2^{o(n)}}$).
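As a concrete reading of this definition, a doubly-exponential gap between $\mathcal{C}_1$ and $\mathcal{C}_2$ is witnessed by a family of languages $(L_n)_{n \ge 1}$ such that (this is the standard formalization of a succinctness gap, stated here for orientation rather than quoted from the cited paper)

$$
\min_{R \in \mathcal{C}_1,\ L(R) = L_n} |R| \;=\; O(n)
\qquad\text{while}\qquad
\min_{R' \in \mathcal{C}_2,\ L(R') = L_n} |R'| \;=\; 2^{2^{\Omega(n)}} .
$$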

The key result is an explicit, provable doubly-exponential succinctness gap:

  • There exist languages $L_n$ such that a UHAT transformer of size $O(n)$ recognizes $L_n$, but any DFA (or NFA) recognizing $L_n$ must have at least $2^{2^{\Omega(n)}}$ states (Theorem A) (Bergsträßer et al., 22 Oct 2025).
  • For LTL (linear temporal logic), the gap is merely exponential: any LTL formula recognizing the same $L_n$ must have size at least $2^{\Omega(n)}$.

These results are established via a construction using tiling counter languages. The transformer exploits masked attention to efficiently verify doubly-exponential-length constraints using only $O(n)$ parameters, while DFA and LTL representations unavoidably expand to astronomical size.

Transformers are also exponentially more succinct than RNNs: an RNN with $d$ hidden units of $k$-bit precision can be simulated by a DFA with $2^{kd}$ states, while the same language may require only an $O(d)$-sized transformer (Bergsträßer et al., 22 Oct 2025).
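The $2^{kd}$ figure is simply the number of distinct hidden configurations a fixed-precision RNN can occupy, and each configuration can serve as a DFA state. Below is a minimal sketch of that conversion in Python; the step function, the acceptance test, and the toy mod-8 counter are illustrative assumptions, not the construction from the cited paper.

```python
def rnn_to_dfa(rnn_step, accepts, alphabet, init_state):
    """Enumerate the reachable hidden configurations of a fixed-precision RNN
    and return them as an explicit DFA. With d hidden units of k-bit precision
    there are at most 2**(k*d) configurations, hence at most that many states."""
    states, delta, frontier = {init_state}, {}, [init_state]
    while frontier:
        s = frontier.pop()
        for a in alphabet:
            t = rnn_step(s, a)
            delta[(s, a)] = t
            if t not in states:
                states.add(t)
                frontier.append(t)
    final = {s for s in states if accepts(s)}
    return states, delta, init_state, final

# Toy instance (hypothetical): one hidden unit (d = 1) with 3-bit precision
# (k = 3) that counts occurrences of 'a' modulo 8 and accepts count 0.
k, d = 3, 1
step = lambda s, a: ((s[0] + (a == "a")) % (1 << k),)
acc = lambda s: s[0] == 0
states, delta, q0, F = rnn_to_dfa(step, acc, "ab", (0,))
print(len(states), "states; upper bound", 2 ** (k * d))  # 8 states; upper bound 8
```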

2. Transformer Simulations of Automata: Algebraic and Circuit-Theoretic View

The simulation of automata by transformers can be made precise via representation theory. For automata with states forming a group (e.g., $\mathbb{Z}_p$), a constant-depth, $O(p)$-width transformer using Fourier-module computations exactly simulates the automaton on any input length. For general transition monoids, the semidirect product structure supports log-depth parallel prefix-scan algorithms in transformers, with width $O(|Q|)$ and depth $O(\log T \cdot \log |Q|)$ (Zhang, 29 Apr 2025).
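The prefix-scan idea can be illustrated directly: composing transition maps is associative, so the $T$ state-to-state maps induced by an input word can be combined in $O(\log T)$ parallel rounds instead of sequentially. The sketch below is a plain-Python rendering of that scan (a Hillis–Steele-style inclusive scan over a hypothetical transition table), not a transformer implementation from the cited work.

```python
def compose(f, g):
    # "f then g" on states, with maps stored as tuples: h[q] = g[f[q]]
    return tuple(g[fq] for fq in f)

def prefix_states(delta, q0, word):
    """delta[a] is a tuple mapping each state q to delta(q, a).
    Returns the DFA state reached after every prefix of `word`, computed by a
    logarithmic-depth inclusive scan over function composition."""
    maps = [delta[a] for a in word]        # one transition map per input symbol
    step = 1
    while step < len(maps):                # ceil(log2(T)) parallelizable rounds
        maps = [maps[i] if i < step else compose(maps[i - step], maps[i])
                for i in range(len(maps))]
        step *= 2
    return [m[q0] for m in maps]           # state after each prefix

# Example: 3-state DFA over {a, b} (hypothetical transition table).
delta = {"a": (1, 2, 0),   # 'a' advances the state cyclically
         "b": (0, 0, 0)}   # 'b' resets to state 0
print(prefix_states(delta, 0, "aabab"))    # [1, 2, 0, 1, 0]
```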

Empirically, using these algebraic encodings, transformers have been trained to simulate modular group automata, complex permutation semigroups, and “shortcut” solutions via Krohn–Rhodes decompositions, achieving substantial depth savings (often $\ll T$ layers for length-$T$ tasks) (Liu et al., 2022, Zhang, 29 Apr 2025).

In terms of circuit complexity:

  • Log-depth transformers with sufficient width and heads can simulate all regular languages ($\mathsf{NC}^1$) in idealized (arbitrary-precision) settings (Zhang, 29 Apr 2025, Liu et al., 2022).
  • For regular languages whose transition monoid is solvable, $O(1)$ transformer depth is sufficient; the abelian case is sketched below.
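For the simplest abelian case, a $\mathbb{Z}_p$ counter, the Fourier viewpoint reduces state tracking to accumulating phases of $p$-th roots of unity, so a single parallel reduction over the sequence suffices. The snippet below is a toy calculation of that algebra under idealized precision, not a transformer construction from the cited papers.

```python
import numpy as np

def count_mod_p(tokens, p, target="a"):
    """Track a Z_p counter (number of `target` symbols mod p) with one reduction.

    Each token is embedded as a phase angle 2*pi*x/p (x = 1 for `target`, else 0).
    A single sum over the whole sequence (one parallel reduction) gives the total
    phase, from which the residue is decoded; no sequential state updates occur."""
    phases = np.array([2 * np.pi / p if t == target else 0.0 for t in tokens])
    total = phases.sum()                       # the one aggregation step
    z = np.exp(1j * total)                     # equals w ** (count mod p)
    w = np.exp(2j * np.pi / p)
    return int(np.argmin([abs(z - w ** k) for k in range(p)]))

print(count_mod_p(list("abaabba"), 3))         # four 'a's -> 4 mod 3 = 1
```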

However, when attention mechanisms and numerical precision are restricted, expressivity falls below full $\mathsf{NC}^1$: standard finite-depth, fixed-precision transformers with softmax or hard attention recognize only a strict subclass of the regular languages (Strobl et al., 2023, Bhattamishra et al., 2020).

3. Exact Language Classes Recognized: Star-Free, Regular, and Beyond

The class of languages exactly recognized by transformers is sharply dependent on architectural constraints:

  • Strict masked hard attention, no position encodings: These transformers recognize exactly the star-free regular languages, i.e., those definable in first-order logic with order, $\mathrm{FO}[<]$, or in linear temporal logic (LTL) (Yang et al., 2023, Strobl et al., 2023, Lin et al., 28 Sep 2025). Their automata-theoretic equivalents are counter-free DFAs; a concrete example follows this list.
  • Softmax attention, fixed (finite) precision: Expressivity aligns with the past fragment of LTL (pTL), or “left-deterministic polynomial” languages, a subclass of regular languages recognized by partially ordered DFAs or J-trivial monoids (Li et al., 29 May 2025).
  • Average-hard attention, or softmax attention with modular position encodings: These models capture all regular languages expressible in $\mathrm{FO}[<, \mathrm{MOD}]$, encompassing parity and bounded-Dyck languages if given suitable positional encodings, but still not the full regular class without further design (Strobl et al., 2023, Yang et al., 2024, Lin et al., 28 Sep 2025).
  • Log-depth transformers or CoT-augmented decoder-only transformers: With a chain of thought linear in the input length, decoder transformers simulate any regular language ($\mathsf{CoT}(\Theta(n)) \supseteq \mathsf{REG}$), while polynomially many CoT steps attain $\mathsf{P}$ (Merrill et al., 2023, Strobl et al., 2023).
  • Extended context window or chain of thought: A constant-parameter, fixed-precision decoder transformer with context window $s(n)$ can simulate any Turing machine using at most $s(n)$ space; thus, with $s(n) = n$ or $s(n) = \mathrm{poly}(n)$, the transformer attains $\mathrm{SPACE}[s(n)]$ expressivity ($\mathrm{PSPACE}$ for a polynomial window) (Li et al., 22 May 2025).
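As a concrete instance of the correspondence in the first bullet, the star-free language $(ab)^*$ (ignoring the empty word for simplicity) is defined by the following LTL formula over finite words, evaluated at the first position; this is a standard textbook example supplied for illustration rather than taken from the cited papers:

$$
\varphi \;=\; a \;\wedge\; \mathbf{G}\bigl(a \rightarrow \mathbf{X}\, b\bigr) \;\wedge\; \mathbf{G}\bigl(b \rightarrow (\neg \mathbf{X}\,\top \,\vee\, \mathbf{X}\, a)\bigr).
$$

An equivalent $\mathrm{FO}[<]$ sentence exists by Kamp’s theorem, while PARITY (an even number of $a$’s) is not star-free and therefore falls outside this class, matching the need for modular predicates in the third bullet.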

An important clarification is that standard, non-CoT, bounded-depth transformers (with a fixed number of heads and layers and finite precision) realize only a proper subclass of the regular languages; this boundary has been formally characterized and lies well below the full class of regular languages recognized by DFAs (Strobl et al., 2023, Chiang et al., 2023, Li et al., 29 May 2025).

4. Mechanisms of Succinctness: Attention, Masking, and Algebraic Compression

Transformers achieve dramatic succinctness via the ability of attention layers to “index” and “reuse” information flexibly. In Example 3.1 from (Bergsträßer et al., 22 Oct 2025), at each delimiter token (“#”) the attention mechanism allows the transformer to “jump” $O(n)$ bits to retrieve relevant binary blocks, implement an address increment, and resolve constraints that would require a DFA to enumerate all $2^n$ counters, hence yielding the doubly-exponential succinctness gap.
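To make the “jump” intuition concrete, the toy sketch below shows a single unique-hard-attention head that, at every position, retrieves the symbol immediately following the most recent “#” delimiter. It is a schematic illustration of content-based indexing with a hypothetical scoring rule, not the construction of Example 3.1.

```python
import numpy as np

def retrieve_after_last_delimiter(tokens):
    """Unique hard attention as an argmax over scores: position j scores highly
    iff token j-1 is '#', with later delimiters scoring higher (rightmost-argmax
    tie-breaking). The attended 'value' is simply the token at that position."""
    outputs = []
    for i in range(len(tokens)):
        scores = np.full(i + 1, -np.inf)          # causal mask: only j <= i
        for j in range(1, i + 1):
            if tokens[j - 1] == "#":
                scores[j] = j                     # prefer the most recent block
        if np.isneginf(scores).all():
            outputs.append(None)                  # no delimiter seen yet
        else:
            outputs.append(tokens[int(np.argmax(scores))])
    return outputs

print(retrieve_after_last_delimiter(list("01#10#01")))
# -> [None, None, None, '1', '1', '1', '0', '0']
```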

The construction hinges on:

  • Multiple attention heads operating in parallel, each handling a distinct aspect (address or symbol) with Boolean test logic embedded in the “value” streams.
  • The correspondence between attention masking (strict future/past) and the temporal operators “until”/“since”, which maps directly onto LTL expressivity (Yang et al., 2023).
  • Boolean RASP as an intermediate language: each transformer can be compiled from, and decompiled to, an indexed Boolean program or LTL formula, with attention directly corresponding to LTL’s temporal modalities (Yang et al., 2023).

For more complex algebraic automata (semidirect products, group actions), algebraic analysis allows implementations with controlled width and depth, leveraging parallel prefix-scan algorithms and group/monoid module embeddings (Zhang, 29 Apr 2025, Liu et al., 2022).

5. Complexity of Verification and Model-Checking

The succinctness advantage comes at a significant cost in the computational complexity of analysis and verification:

  • Deciding emptiness, universality, or equivalence for even the “weakest” (fixed-precision, unique-hard-attention) transformer is EXPSPACE-complete (Bergsträßer et al., 22 Oct 2025). This is in contrast to P-time model-checking for DFAs and even PSPACE-completeness for LTL [Sistla–Clarke ’85 cited in (Bergsträßer et al., 22 Oct 2025)].
  • The reduction encodes tiling problems (a canonical EXPSPACE-complete family) into transformer language acceptance, exploiting the capacity of masked attention to simulate doubly-exponential counters efficiently.

This fundamental complexity barrier means that the verification or minimization of transformer-based representations is likely infeasible for practical cases when leveraging their full succinctness capabilities.

6. Practical and Empirical Implications

Empirical automata extraction confirms the theoretical insights:

  • For “small” or “locally checkable” star-free regular languages (e.g., (aa), (abab), Tomita 1,2,4,7), transformer models trained with finite heads, layers, and dimensions can simulate the minimal DFA exactly (Zhang et al., 2024).
  • For modular counting and higher dot-depth languages (e.g., parity, mod-3, and mod-5 counting), standard transformers with fixed resources fail to generalize and cannot emulate the minimal DFA; they may only memorize finite samples, leading to vastly overcomplex extracted automata or to a collapse to random guessing (Zhang et al., 2024, Bhattamishra et al., 2020). The minimal target DFAs themselves are tiny, as the snippet after this list illustrates.
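For reference, the target machines in these experiments are small. The snippet below writes out textbook two-state and three-state DFAs of the kind at issue (generic constructions, not the benchmark encodings of the cited work); the failures above therefore reflect the transformers’ fixed resources rather than the size of the automaton to be matched.

```python
def run_dfa(delta, q0, finals, word):
    """Run an explicit DFA given as delta[state][symbol] -> state."""
    q = q0
    for ch in word:
        q = delta[q][ch]
    return q in finals

# Minimal DFA for PARITY (even number of '1's): just two states.
parity = {0: {"0": 0, "1": 1}, 1: {"0": 1, "1": 0}}
print(run_dfa(parity, 0, {0}, "101101"))   # True: four '1's

# Minimal DFA for "number of 'a's divisible by 3": three states.
mod3 = {q: {"a": (q + 1) % 3, "b": q} for q in range(3)}
print(run_dfa(mod3, 0, {0}, "ababab"))     # True: three 'a's
```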

Designing transformers to guarantee recognition of specific regular or modular languages requires architectural scaling tailored to the algebraic complexity (e.g., more heads/layers or explicit count modules).

7. Conceptual and Complexity-Theoretic Landscape

Summarizing across model classes, attention regimes, and resource constraints, the expressivity and succinctness of transformers relative to automata are governed by:

| Transformer Type / Resource | Attention | Positional Encoding | Expressive Class | Succinctness Gap to DFA | Verification Complexity |
|---|---|---|---|---|---|
| Masked UHAT, no PE | Hard (argmax) | None | Star-free (LTL) | Doubly exponential | EXPSPACE-complete |
| Masked softmax, finite precision | Softmax | None | Past-LTL (pTL) | Exponential (to LTL) | |
| Log-depth, unbounded width | Hard/Soft | Sufficient | Regular (DFAs) | Polynomial | P |
| Bounded depth/width/precision | Hard/Soft | Limited | Sub-regular | | P |
| Linear/Poly CoT + large context | Softmax | General | Context-sensitive / P / RE | (trivial) | |

The boundary circumscribing transformer expressivity thus sits sharply at the intersection of circuit-theoretic, algebraic, and logic-based characterizations, and the succinctness explosion is intricately coupled to complexity-theoretic intractability (Bergsträßer et al., 22 Oct 2025, Strobl et al., 2023, Lin et al., 28 Sep 2025).
