- The paper demonstrates that tokenization choices critically determine a transformer's ability to capture both local and global graph properties.
- It rigorously compares adjacency, spectral, and random-walk tokenizations using theoretical bounds and empirical validations on expressivity and computability.
- The findings emphasize inherent trade-offs and suggest hybrid tokenization strategies to balance performance, computational cost, and optimization challenges.
Introduction
"Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers" (2605.22471) provides a rigorous analysis of how input representations—specifically, graph tokenizations—fundamentally shape the expressivity and computational tractability of transformers on graph-structured data. Unlike prior works, which treat tokenization as largely a design heuristic, this paper formalizes it as a core modeling choice that can induce strict lower bounds, separation theorems, and inherent bottlenecks in transformer-based graph learning.
The manuscript systematically examines spectral, random-walk, and adjacency-based tokenizations, exposes unavoidable trade-offs between locality, globality, and lossiness, and supplies precise bounds—both complexity-theoretic and empirical—on the limitations each representation imposes. This treatment situates tokenization as a first-order axis of architectural expressivity, akin to depth, width, or attention mechanism, with significant implications for both theoretical understanding and practical design of graph transformers.
Graph Tokenizations: Definitions and Properties
Adjacency Tokenization
Adjacency tokenization represents each node by its full adjacency row. This method is lossless with respect to the graph's structure and is optimal for exposing local information such as edge configurations. It has two principal drawbacks: the token dimension scales linearly with graph size, and, because the representation is local, deep architectures are required to model global properties. Projected (truncated) variants, which reduce dimensionality via random projections, are lossy and can mask structural features.
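As a minimal sketch of the two variants described above, the following NumPy snippet builds full adjacency-row tokens and a randomly projected version. The function names, the Gaussian projection, and the 4-cycle example are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def adjacency_tokens(A):
    """One token per node: the node's full adjacency row (token dim = n)."""
    return np.asarray(A, dtype=float).copy()

def projected_adjacency_tokens(A, dim, seed=0):
    """Lossy variant (assumed form): compress each row with a Gaussian random projection."""
    A = np.asarray(A, dtype=float)
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(A.shape[0], dim)) / np.sqrt(dim)
    return A @ P

# Example graph: the 4-cycle 0-1-2-3-0.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])

X = adjacency_tokens(A)
edge_01 = bool(X[0, 1])   # an edge query is a single coordinate read
edge_02 = bool(X[0, 2])   # nodes 0 and 2 are not adjacent

Z = projected_adjacency_tokens(A, dim=2)
rank_Z = np.linalg.matrix_rank(Z)  # at most dim, so rows cannot all be recovered in general
```

With full tokens, local queries are direct lookups. After projection to `dim < n`, the token matrix has rank at most `dim`, which is one simple way to see why the projected variant can lose structure.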
Spectral Tokenization
Spectral tokenization embeds each node in the eigenbasis of the graph Laplacian, directly exposing global topological features (eigenvectors, eigenvalues). This method is lossless only when the full spectrum is provided; truncated spectral tokenizations offer global but lossy summaries. Notably, spectral tokens are mathematically ill-conditioned for strictly local queries—such as edge existence—inducing problematic optimization characteristics in transformer models.
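The lossless-versus-truncated distinction can be illustrated with a small NumPy sketch. It assumes tokens are rows of the Laplacian eigenvector matrix paired with the eigenvalues, and it uses a simple hand-written decoder. Both choices are assumptions for illustration rather than the paper's exact construction.

```python
import numpy as np

def laplacian(A):
    A = np.asarray(A, dtype=float)
    return np.diag(A.sum(axis=1)) - A

def spectral_tokens(A, k=None):
    """Return (eigenvalues, tokens); token i is row i of the first k Laplacian eigenvectors."""
    evals, evecs = np.linalg.eigh(laplacian(A))  # eigenvalues in ascending order
    k = len(evals) if k is None else k
    return evals[:k], evecs[:, :k]

def decode_adjacency(evals, tokens):
    """Rebuild L = U diag(lambda) U^T from the tokens; off-diagonal entries of -L give A."""
    L_hat = (tokens * evals) @ tokens.T
    A_hat = -L_hat
    np.fill_diagonal(A_hat, 0.0)
    return A_hat

# Example graph: the path 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Full spectrum: edges are recovered up to floating-point error.
full_err = np.abs(decode_adjacency(*spectral_tokens(A)) - A).max()

# Dropping one eigenpair: the same decoder no longer recovers the edges.
trunc_err = np.abs(decode_adjacency(*spectral_tokens(A, k=3)) - A).max()
```

Note that even in the lossless case an edge query is a bilinear form over all eigen-coordinates weighted by the eigenvalues, rather than a single coordinate read, which gives some intuition for why local queries are harder to decode from spectral tokens.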
Random-Walk Tokenization
Random-walk tokenization encodes each node by its return probabilities from random walks of varying lengths. This scheme is computationally efficient for certain global or diffusion-related properties and allows some global behavior to be accessed in shallow architectures. However, it is intrinsically lossy: there exist non-isomorphic graphs, including planar and non-planar pairs, with identical random-walk statistics, so some critical global distinctions are unrecoverable regardless of model depth or size.
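The sketch below computes return-probability tokens and shows a collision on a classical pair of non-isomorphic graphs: the 4x4 rook's graph and the Shrikhande graph. This pair is chosen here for illustration and is not the paper's planar/non-planar construction (neither graph is planar). The function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def random_walk_tokens(A, max_len):
    """Token i = return probabilities [P^1]_ii, ..., [P^max_len]_ii with P = D^-1 A."""
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)
    Pt, cols = np.eye(len(A)), []
    for _ in range(max_len):
        Pt = Pt @ P
        cols.append(np.diag(Pt).copy())
    return np.stack(cols, axis=1)

def cayley_graph_z4xz4(steps):
    """Graph on Z4 x Z4 where u ~ v iff (v - u) mod 4 lies in `steps`."""
    nodes = [(i, j) for i in range(4) for j in range(4)]
    A = np.zeros((16, 16), dtype=int)
    for a, (i, j) in enumerate(nodes):
        for b, (k, l) in enumerate(nodes):
            if ((k - i) % 4, (l - j) % 4) in steps:
                A[a, b] = 1
    return A

def count_4_cliques(A):
    """Brute-force count of 4-cliques, used only to certify non-isomorphism."""
    return sum(all(A[u, v] for u, v in combinations(q, 2))
               for q in combinations(range(len(A)), 4))

# Rook's graph: same row or same column. Shrikhande graph: steps +-(1,0), +-(0,1), +-(1,1).
rook = cayley_graph_z4xz4({(d, 0) for d in (1, 2, 3)} | {(0, d) for d in (1, 2, 3)})
shrikhande = cayley_graph_z4xz4({(1, 0), (3, 0), (0, 1), (0, 3), (1, 1), (3, 3)})

same_tokens = np.allclose(random_walk_tokens(rook, 8), random_walk_tokens(shrikhande, 8))
cliques = (count_4_cliques(rook), count_4_cliques(shrikhande))
```

Both graphs are strongly regular with the same parameters, so every node receives the same return-probability token in both graphs, yet their 4-clique counts differ. Any model that sees only these tokens therefore cannot separate them.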
Theoretical Contributions and Expressivity Separation
The paper establishes strong complexity-theoretic lower bounds and impossibility results regarding the expressivity of transformers under different tokenizations:
- Depth separation: Certain tasks can be realized in constant depth under one tokenization but require exponential depth in another. For example, $k$-closed-walk detection is $O(1)$ depth with random-walk tokens but requires double-exponential depth ($2^{2^{\log k}}$) with adjacency tokens, under standard circuit complexity assumptions.
- Random-walk lossiness: It is proven that random-walk tokenizations—regardless of walk length—cannot resolve global topological properties such as planarity. Using Godsil-McKay switching, the authors construct graph pairs (planar and non-planar) indistinguishable to any transformer operating on random-walk tokens.
- Spectral truncation fragility: Omitting even a single eigenvalue from the Laplacian spectrum makes triangle counting impossible for transformers, as demonstrated via construction of graphs with identical partial spectra but different subgraph counts.
- Adjacency vs. Laplacian dichotomy: Adjacency tokenization allows local tasks (like edge prediction or clique membership) to be solved efficiently but imposes exponential depth barriers for global properties (e.g., connectivity). Conversely, spectral tokenization enables constant-depth global computations but is ill-conditioned for local queries, requiring transformer parameters to scale exponentially with the graph degree to accurately decode local edge states.
- Impossibility of efficient inter-tokenization conversion: For finite-depth transformers, conversion between tokenization families is shown to be either impossible or to require exponential depth. This precludes the model from "recovering" a better representation internally if the initial tokenization is not well-suited to the given task.
- Triangle counting fails under spectrum truncation (Theorem 3): Any transformer utilizing less than the full spectrum (either the largest or the smallest $k < n-1$ eigenvalues) cannot count triangles, as key spectral signatures required for differentiating triangle-containing graphs are lost.
- Planarity is undetectable with random-walk tokens (Theorem 2): For every walk length $t(n)$, random-walk tokenizations fail to distinguish certain planar graphs from non-planar ones.
- Edge prediction with spectral tokens (Theorem 5): The parameter norm of the model must grow at least linearly with the maximal degree $d_{\max}$, or else the model will encounter softmax saturation and vanishing gradients.
- Adjacency depth lower bound for connectivity (Theorem 4): Detecting undirected graph connectivity requires $2^{2^{\log n}}$ transformer layers with adjacency tokens under standard circuit complexity conjectures.
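Some intuition for the local-versus-global dichotomy in the results above can be obtained from classical algorithms. In the sketch below, connectivity is read directly off the Laplacian spectrum (the multiplicity of eigenvalue 0 equals the number of connected components), whereas starting from adjacency rows it requires repeated propagation. This is an illustration under assumed helper names, not the paper's transformer constructions or proofs, and the round count of the squaring procedure is not the bound stated in the theorems.

```python
import numpy as np

def components_from_spectrum(A, tol=1e-9):
    """Number of connected components = multiplicity of Laplacian eigenvalue 0."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return int(np.sum(np.linalg.eigvalsh(L) < tol))

def connected_by_squaring(A):
    """Reachability via repeated boolean squaring of (A + I); returns (is_connected, rounds)."""
    n = len(A)
    R = (np.asarray(A) + np.eye(n, dtype=int)) > 0
    rounds = 0
    while 2 ** rounds < n - 1:          # after r rounds, R covers walks of length <= 2^r
        R = (R.astype(int) @ R.astype(int)) > 0
        rounds += 1
    return bool(R.all()), rounds

def path_graph(n):
    A = np.zeros((n, n), dtype=int)
    for i in range(n - 1):
        A[i, i + 1] = A[i + 1, i] = 1
    return A

connected = path_graph(8)
split = path_graph(8)
split[3, 4] = split[4, 3] = 0           # remove the middle edge: two components

spec_counts = (components_from_spectrum(connected), components_from_spectrum(split))
prop_results = (connected_by_squaring(connected), connected_by_squaring(split))
```

The spectral route answers the global question in one step given the eigenvalues, while the adjacency route needs several rounds of information propagation before distant nodes can "see" each other.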
Empirical Evaluation
Empirical results on both synthetic and real-world datasets closely validate the theoretical predictions:
- Local tasks (e.g., maximum clique membership, node ordering) are best solved using adjacency-based tokenization due to their reliance on edge-local information.
- Global tasks (e.g., molecular property prediction, connectivity) show superior performance with spectral tokenization, consistent with the theoretical advantage for global structure.
- Combined tokenizations consistently outperform single-token approaches on aggregate metrics, highlighting the complementarity of different structural views.
- Shallow transformers with Laplacian tokens can efficiently solve global graph problems regardless of graph size, while transformers with adjacency tokens fail to scale on the same tasks.
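One simple way to combine structural views is to concatenate per-node features from each tokenization. The summary does not specify how the combined tokenizations were built, so the concatenation scheme, the function name `hybrid_tokens`, and all parameters below are assumptions for illustration only.

```python
import numpy as np

def hybrid_tokens(A, k_spec, rw_len, proj_dim, seed=0):
    """Concatenate projected-adjacency, truncated-spectral, and random-walk features per node.

    Assumes an undirected graph with no isolated nodes (degrees must be nonzero).
    """
    A = np.asarray(A, dtype=float)
    n = len(A)
    deg = A.sum(axis=1)

    # Local view: randomly projected adjacency rows.
    rng = np.random.default_rng(seed)
    local = A @ (rng.normal(size=(n, proj_dim)) / np.sqrt(proj_dim))

    # Global view: first k_spec Laplacian eigenvectors.
    _, evecs = np.linalg.eigh(np.diag(deg) - A)
    spectral = evecs[:, :k_spec]

    # Diffusion view: return probabilities for walk lengths 1..rw_len.
    P = A / deg[:, None]
    Pt, rw = np.eye(n), []
    for _ in range(rw_len):
        Pt = Pt @ P
        rw.append(np.diag(Pt).copy())

    return np.concatenate([local, spectral, np.stack(rw, axis=1)], axis=1)

# Example graph: the 5-cycle.
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1

tokens = hybrid_tokens(A, k_spec=2, rw_len=4, proj_dim=3)
```

The resulting token dimension is `proj_dim + k_spec + rw_len`, so the budget for each view becomes an explicit design knob.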
Practical and Theoretical Implications
The paper's findings have several direct implications:
- Tokenization is an architectural hyperparameter: It directly constrains which graph properties are efficiently accessible at finite depth and width, similar to the impact of activation function or attention scheme selection.
- No universal best tokenization: There is a strict trade-off between local and global expressivity, brittleness to dimensional reduction, and optimization tractability. Thus, optimal performance in applied scenarios requires careful tokenization-task alignment or hybrid tokenization schemes.
- Optimization ill-conditioning is fundamental for some representations: Some tokenizations, notably spectral ones, are sufficient for expressivity but harder to train in practice for certain queries due to weight norm scaling and gradient pathologies.
- Depth is not always a panacea: Some conversion or decoding tasks induce exponential parameter growth with depth; width or numerical precision constraints can also become critical bottlenecks.
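The link between large weight norms and gradient pathologies mentioned above can be seen in a generic property of softmax: as logits are scaled up, the output saturates and its Jacobian vanishes. The snippet is a standalone numerical illustration with arbitrary example logits, not a reproduction of the paper's analysis.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian_norm(z):
    """Frobenius norm of d softmax / d z = diag(p) - p p^T."""
    p = softmax(z)
    return np.linalg.norm(np.diag(p) - np.outer(p, p))

logits = np.array([1.0, 0.5, -0.5, -1.0])   # arbitrary example attention logits
scales = [1, 10, 100]                        # stand-in for growing parameter norm

norms = [softmax_jacobian_norm(s * logits) for s in scales]
```

The Jacobian norm shrinks rapidly as the scale grows, so gradients passing through a saturated attention softmax become negligible.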
Future Directions
The work raises several avenues for further research:
- Edge-level and hybrid tokenizations: Extending the theory to richer representations—including edge or subgraph features commonly used in molecular applications—may yield additional separation results or suggest novel architectures.
- Optimization-aware tokenization design: There is value in systematically studying the trainability and numerical stability of tokenizations, beyond their information-theoretic sufficiency.
- Nonlinear and higher-precision architectures: Whether nonlinear or higher-precision architectures can circumvent some of the depth lower bounds remains open. Exploring width/depth trade-offs in such settings could provide further practical guidance.
Conclusion
The paper rigorously demonstrates that graph tokenization for transformers is a foundational modeling component with deep implications for both provable expressivity and actual trainability. There is a strict local-global dichotomy, lossy and brittle representations cannot generally be recovered or compensated by architectural depth, and hybrid tokenization strategies are empirically validated as beneficial. Going forward, both theoretical developments and practical applications must treat tokenization as a primary axis of model design for graph transformers.