- The paper establishes a precise equivalence between transformer self-attention in the zero-temperature limit and Power Voronoi diagrams, quantifying input space partitioning.
- It derives sharp asymptotic bounds on the number of linear regions as a function of sequence length, head count, and depth, proving exponential gains in multi-head architectures.
- The study introduces a tropical algebraic framework to model self-attention and demonstrates its stability under finite softmax temperatures with implications for architectural tuning.
Introduction and Motivation
The expressivity and theoretical capacity of Transformer architectures have been extensively debated in deep learning theory, yet most prior work addresses functional expressivity (e.g., universality or Turing completeness) without a quantified geometric theory of how transformers partition their input spaces. This paper, "Expressivity of Transformers: A Tropical Geometry Perspective" (2604.14727), develops a rigorous geometric analysis of transformer expressivity by leveraging tropical (max-plus) geometry. Central results include a precise equivalence between self-attention in the zero-temperature limit and Power Voronoi diagrams, detailed combinatorial analysis of the geometric complexity explosion induced by sequence length N, number of heads H, and depth L, and sharp asymptotic and constructive bounds on the number of linear regions implemented by deep transformer networks.
Linear Regions and Piecewise-Linear Partitioning
The study adopts a geometric expressivity measure from the theory of continuous piecewise-linear (CPWL) functions: the number of maximal linear regions in the input space, generalizing existing results for ReLU-based MLPs [montufar2014number, serra2018bounding, hanin2019complexity]. Each region corresponds to a distinct affine map prescribed by the network, providing a meaningful quantification of nonlinearity and local function complexity.
For MLPs, the combinatorial growth of regions via recursive space folding is classical (Figure 1):
Figure 1: Recursive spatial partitioning in a 2-layer MLP with d=2; Layer 2 recursively shatters the space through affine composition, multiplying the region count.
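To make this region-counting measure concrete, the following minimal sketch (our own illustration, not from the paper; sizes and weights are arbitrary) estimates the number of linear regions of a small random ReLU MLP by counting the distinct activation patterns reached by random inputs, which lower-bounds the true region count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-layer ReLU MLP on R^2: x -> ReLU(W2 @ ReLU(W1 @ x + b1) + b2)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 8)), rng.normal(size=8)

def activation_pattern(x):
    """Each maximal linear region corresponds to a fixed on/off pattern of all ReLUs."""
    h1 = W1 @ x + b1
    h2 = W2 @ np.maximum(h1, 0.0) + b2
    return tuple((h1 > 0).tolist()) + tuple((h2 > 0).tolist())

# Monte Carlo estimate: the number of distinct activation patterns hit by random
# inputs is a lower bound on the number of maximal linear regions.
samples = rng.uniform(-3.0, 3.0, size=(50_000, 2))
patterns = {activation_pattern(x) for x in samples}
print(f"distinct activation patterns (region count lower bound): {len(patterns)}")
```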
However, the softmax-based self-attention in transformers is not piecewise-linear, complicating standard hyperplane arrangement analysis. The paper resolves this by mapping attention's log-sum-exp operations to the tropical semiring through Maslov dequantization.
Tropical Algebraic Framework for Self-Attention
In the max-plus semiring (⊕=max, ⊗=+), tropical polynomials naturally model CPWL maps. The authors rigorously demonstrate that, under the τ→0 (zero-temperature) limit, softmax attention reduces to maximizing inner products: every attention module then implements a piecewise-constant map, with hard boundaries determined by the arrangement of key vectors in representation space.
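The zero-temperature collapse can be seen in a toy computation (our own, with random queries, keys, and values): as the softmax temperature τ decreases, the attention output for a fixed query converges to the value vector of the single best-matching key, i.e., hard argmax routing.

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, n_keys = 4, 6
q = rng.normal(size=d_k)            # one projected query
K = rng.normal(size=(n_keys, d_k))  # key vectors
V = rng.normal(size=(n_keys, d_k))  # value vectors

def attention_output(q, K, V, tau):
    """Single-query softmax attention with explicit temperature tau."""
    scores = K @ q / tau
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    return w @ V

hard = V[np.argmax(K @ q)]  # tau -> 0 limit: route to the best-matching key's value
for tau in (1.0, 0.1, 0.01):
    soft = attention_output(q, K, V, tau)
    print(f"tau={tau:5.2f}  ||soft - hard|| = {np.linalg.norm(soft - hard):.2e}")
```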
The key insight is to analyze self-attention as a vector-valued tropical rational map, employing log-lifting to avoid metric collapse and allow asymptotic analysis. This leads to the identification of geometric routing boundaries with the normal fan of a Newton polytope corresponding to the set of keys.
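As a self-contained illustration of the max-plus viewpoint (the exponent vectors and coefficients below are arbitrary, not taken from the paper), a tropical polynomial p(x) = max_i(⟨a_i, x⟩ + c_i) is a convex CPWL function whose linear regions are the maximal sets on which one monomial attains the maximum, i.e., full-dimensional cells of the normal fan of the Newton polytope conv{a_i}:

```python
import numpy as np

# Max-plus (tropical) polynomial in two variables: p(x) = max_i (<a_i, x> + c_i).
A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])  # exponent vectors (illustrative)
c = np.array([0.0, -0.5, -0.5, -2.0])                           # tropical coefficients (illustrative)

def tropical_eval(x):
    """Value of the tropical polynomial: a convex, piecewise-linear function."""
    return float(np.max(A @ x + c))

def active_monomial(x):
    """Index of the monomial attaining the max; constant on each linear region."""
    return int(np.argmax(A @ x + c))

print(f"p(0, 0) = {tropical_eval(np.array([0.0, 0.0]))}")

rng = np.random.default_rng(2)
xs = rng.uniform(-3.0, 3.0, size=(100_000, 2))
active = {active_monomial(x) for x in xs}
print(f"monomials active on an open set (linear regions found): {sorted(active)}")
```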
Self-Attention and Power Voronoi Diagrams
The principal geometric result is a precise equivalence: in the zero-temperature limit, transformer self-attention partitions the projected query space ℝ^{d_k} exactly according to a Power Voronoi diagram whose sites are the key vectors and whose weights are their squared norms. This not only elucidates the boundary geometry (affine, possibly empty or degenerate cells) but also provides a foundation for further combinatorial analysis (Figure 2):
Figure 2: Partitioning of the query space ℝ^{d_k}, showing the transformation from a standard Voronoi diagram to a Power Voronoi diagram under dot-product attention.
The practical import is twofold:
- Piecewise-constant mapping: Within each Voronoi cell, the output of self-attention is a fixed value vector, and transitions only occur at affine boundaries, formalizing the discrete routing nature of attention at low temperature.
- Robustness to key normalization: When keys are normalized to equal norm, the partition degenerates to a standard Voronoi diagram, i.e., boundaries are determined only by relative key positions, not by key magnitudes.
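A quick numerical check of the stated equivalence (our own sketch, with random keys): hard dot-product routing argmax_i ⟨q, k_i⟩ selects exactly the same cell as a power (Laguerre) diagram whose sites are the keys k_i and whose weights are the squared norms ‖k_i‖², because ‖q − k_i‖² − ‖k_i‖² = ‖q‖² − 2⟨q, k_i⟩.

```python
import numpy as np

rng = np.random.default_rng(3)
d_k, n_keys = 3, 7
K = rng.normal(size=(n_keys, d_k))   # keys = sites of the power diagram
w = np.sum(K**2, axis=1)             # weights = squared key norms

def dot_product_cell(q):
    return int(np.argmax(K @ q))     # zero-temperature attention routing

def power_cell(q):
    # Power distance to site k_i with weight w_i: ||q - k_i||^2 - w_i.
    return int(np.argmin(np.sum((q - K) ** 2, axis=1) - w))

queries = rng.normal(size=(10_000, d_k))
assert all(dot_product_cell(q) == power_cell(q) for q in queries)
print("hard attention routing matches the power Voronoi cell assignment on all test queries")
```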
Combinatorial Complexity: Sequence Length, Heads, and Depth
A major contribution is the derivation of asymptotic and constructive bounds on the number of linear regions as a function of model hyperparameters. The analysis proceeds via careful study of Newton polytopes associated with each attention head and their Minkowski sums during multi-head aggregation.
Single-Head vs Multi-Head Attention
For a single attention head, the corresponding Newton polytope is simply the convex hull of the N key vectors, admitting at most N vertices. This imposes an expressivity bottleneck.
However, with H heads, the output is governed by the Minkowski sum of H independent Newton polytopes. When these are in generic position, the combined polytope exhibits a combinatorial explosion in vertices, growing exponentially in the number of heads H (Figure 3):
Figure 3: Geometric rationale for Multi-Head Self-Attention via Minkowski sums; aggregation of polytopes enables an exponential increase in the number of regions.
This exponential gain demonstrates the geometric necessity and utility of multi-head self-attention (MHSA) for transformer expressivity, in contrast with single-head attention.
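The vertex blow-up under Minkowski summation can be probed empirically with a short script (illustrative sizes, not the paper's experiment): for H heads with N random keys each, every candidate vertex of the summed Newton polytope is a sum of one key per head, and a candidate is a vertex whenever some query direction uniquely selects it. Counting distinct winners over many random directions therefore lower-bounds the vertex count and mirrors the growth shown in Figure 4(b).

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
N, d = 4, 10   # keys per head and projected dimension (illustrative)

def minkowski_vertex_lower_bound(heads, n_dirs=50_000):
    """Lower-bound the number of vertices of the Minkowski sum of the heads'
    Newton polytopes: a candidate point (one key summed per head) is a vertex
    iff some linear functional (query direction) uniquely maximizes it."""
    candidates = np.array([sum(combo) for combo in product(*heads)])  # N**H candidate sums
    dirs = rng.normal(size=(n_dirs, d))
    winners = np.unique(np.argmax(candidates @ dirs.T, axis=0))
    return len(winners)

for H in (1, 2, 3):
    heads = [rng.normal(size=(N, d)) for _ in range(H)]   # N random keys per head
    count = minkowski_vertex_lower_bound(heads)
    print(f"H={H}: candidate sums = {N**H:3d}, vertices found >= {count}")
```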
Depth Composition and Tight Scaling Laws
Key bounds are established for deep transformers via recursive composition:
- Upper Bound: With L layers, d-dimensional embeddings, and a per-block capacity of R linear regions, the maximal region count of the composed network is bounded by the product of the per-block capacities, i.e., it is of order R^L, provided suitable non-degeneracy conditions hold and the FFN width is sufficient.
- Constructive Lower Bound: Explicit weight constructions are shown to realize this bound via surjective, affine grid-like shattering (avoiding degeneracy by leveraging sawtooth ReLU maps and non-overlapping partitions). This establishes the upper bound as asymptotically tight; a one-dimensional sketch of the sawtooth mechanism appears below.
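The following sketch (our own, not the paper's construction) illustrates the sawtooth mechanism behind such lower bounds: each composed "hat" layer, built from two ReLUs, folds the interval [0, 1] onto itself, so the number of linear regions doubles with every additional layer of depth.

```python
import numpy as np

def hat_layer(x):
    """CPWL 'hat' built from two ReLUs: 2x on [0, 1/2], 2 - 2x on [1/2, 1].
    It folds [0, 1] onto itself, so composing it multiplies region counts."""
    return 2.0 * np.maximum(x, 0.0) - 4.0 * np.maximum(x - 0.5, 0.0)

def count_regions(depth, n_samples=1_000_000):
    """Count linear regions of the depth-fold composition on (0, 1) by counting
    distinct joint activation patterns of each layer's kink at 0.5."""
    x = np.random.default_rng(5).uniform(0.0, 1.0, size=n_samples)
    pattern = np.zeros(n_samples, dtype=np.int64)
    for _ in range(depth):
        pattern = 2 * pattern + (x > 0.5)   # record which side of this layer's kink we are on
        x = hat_layer(x)
    return len(np.unique(pattern))

for depth in range(1, 7):
    print(f"depth {depth}: linear regions = {count_regions(depth)}")  # 2, 4, 8, 16, ...
```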
This scaling is empirically confirmed via log-log plots of observed region and vertex counts (Figure 4):
Figure 4: Empirical log–log scaling: (a) linear region counts estimated via Monte Carlo sampling; (b) exact vertex counts for Minkowski sums, rapidly increasing with N and H.
In deep, wide, multi-head transformers, the number of linear regions grows polynomially in the sequence length N, with a degree that increases with depth and head count, in sharp contrast to MLPs, where region complexity depends solely on depth and width.
Figure 5: Combinatorial explosion of maximal linear regions as a function of transformer depth L; profound geometric shattering of the query space due to deep composition.
Stability of Polyhedral Structure at Finite Temperature
Although the main framework relies on the τ → 0 limit, the study proves that the polyhedral partitioning remains exponentially tight at any positive softmax temperature τ. Via explicit Hessian and gradient bounds, the interiors of the Voronoi-like regions are shown to retain affine behavior up to corrections that decay exponentially as τ → 0, with the residual nonlinearity confined to narrow bands at the boundaries (Figure 6):
Figure 6: Convergence of attention routing to the Power Voronoi partition under decreasing softmax temperature; boundaries sharpen and regions stabilize as τ → 0.
Thus, the geometric complexity results pertain to actual, trainable transformers at typical temperature scalings (e.g., the standard √d_k scaling of dot-product attention).
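A simple experiment (our own, with random weights) illustrates this finite-temperature picture: for queries whose top two attention scores are separated by a margin, the soft attention output deviates from the hard routed output by roughly exp(−margin/τ), so deviations in cell interiors collapse rapidly as τ decreases, while nonlinearity survives only in thin bands near the boundaries.

```python
import numpy as np

rng = np.random.default_rng(6)
d_k, n_keys = 4, 8
K = rng.normal(size=(n_keys, d_k))
V = rng.normal(size=(n_keys, d_k))

def soft_attention(q, tau):
    s = K @ q / tau
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

queries = rng.normal(size=(5_000, d_k))
scores = queries @ K.T
hard = V[np.argmax(scores, axis=1)]            # zero-temperature routing targets
top2 = np.sort(scores, axis=1)[:, -2:]
margin = top2[:, 1] - top2[:, 0]               # gap to the nearest routing boundary
interior = margin > 1.0                        # queries well inside a cell

for tau in (0.5, 0.2, 0.1, 0.05):
    dev = np.array([np.linalg.norm(soft_attention(q, tau) - h)
                    for q, h in zip(queries[interior], hard[interior])])
    # Non-top softmax weights are at most exp(-margin / tau), so interior deviations
    # shrink exponentially as tau decreases; only boundary bands remain nonlinear.
    print(f"tau={tau:4.2f}  max interior deviation = {dev.max():.2e}")
```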
Implications and Future Directions
The explicit characterization of transformer region complexity has practical implications for both architectural design and theoretical understanding:
- Architectural tuning: Increasing sequence length N, head count H, or embedding dimension d all directly amplify geometric complexity, with MHSA delivering an exponential gain for a fixed parameter budget.
- Theoretical insight: The correspondence between network depth and polynomial growth in region count justifies and quantifies the practical effectiveness of stacking many layers. It also identifies sufficient constraints on weight magnitudes and alignment to avoid geometric collapse under composition.
- Limitations and Open Problems: The analysis omits normalization schemes (e.g., LayerNorm) and optimization dynamics. Future work could extend tropical analysis to stochastic optimization trajectories or study post-training geometry in empirical models. Additionally, tropical descriptions of other architectures (e.g., MoE, graph transformers) provide a broad avenue for further research.
Conclusion
This paper presents a comprehensive geometric framework for transformer expressivity based on tropical geometry, establishing a concrete correspondence between self-attention and Power Voronoi diagrams, and rigorously deriving scaling laws for linear region complexity in terms of sequence length N, head count H, embedding dimension d, and depth L. The main claims are both theoretically tight and empirically validated for deep, multi-head architectures. These results constitute a new baseline for the study of compositional nonlinearity in neural attention models and will inform both principled architecture design and foundational theory moving forward.