The Token Games (TTG) Overview

Updated 4 July 2026

The Token Games (TTG) are a set of distinct frameworks that model token dynamics in areas such as LLM evaluation, ledger flows, bidding mechanisms, automata theory, and percolation.
They use concrete mechanisms such as Python puzzle duels, networked ledger updates, and auction-based moves to evaluate system performance and strategic exchange.
Across these settings, TTG frameworks are used to study adaptive difficulty escalation, strategic resource control, and phase transitions in mathematical and computational systems.

“The Token Games” (TTG) is not a single uniformly defined construct across the arXiv literature. The label denotes several technically distinct formalisms in which “tokens” act as transferable records, movable markers on graphs, automata runs, or self-generated puzzles. In contemporary usage, TTG most prominently refers to an evaluation framework for LLMs in which models duel by creating and solving programming puzzles, but closely related literatures also use token games to study ledger dynamics, bidding control of graph traversal, history-determinism in automata, and percolation-style games on lattices (Henniger et al., 19 Feb 2026, Naicker, 2019, Avni et al., 2017, Prakash, 21 Jan 2025, Holroyd et al., 2015).

1. Terminological scope and major lineages

Across these literatures, the common structural idea is that a tokenized object evolves under explicit local rules, and the induced trajectory or distribution is the primary object of analysis. What changes from field to field is the ontology of the token: in one case it is an information record carrying value, in another a marker moved across a graph, in another a run of an automaton, and in another a puzzle instance proposed by a model.

Research area	Meaning of TTG	Representative paper
LLM evaluation	Puzzle-duel benchmark	(Henniger et al., 19 Feb 2026)
Tokenized ledgers	Exchange dynamics on networks	(Naicker, 2019)
Graph games	Bidding for token control	(Avni et al., 2017)
Automata theory	2-token game for HDness	(Prakash, 21 Jan 2025)
Statistical mechanics	Percolation game on lattices	(Holroyd et al., 2015)

This multiplicity matters because superficially similar vocabulary can conceal sharply different mathematical objects. In the benchmark literature, TTG is a protocol for adversarial instance generation and Elo-style evaluation. In the ledger literature, it is a linear or affine dynamical system over weighted directed graphs. In bidding games, it is a zero-sum infinite-duration game in which auction outcomes determine token motion. In automata theory, TTG is a synchronised multi-run game that characterises history-determinism. In percolation theory, it is a random-environment game whose draw behavior is linked to ergodicity and Gibbs measures.

2. TTG as a self-generating reasoning benchmark

In the 2026 sense of the term, TTG is an evaluation framework in which LLMs compete by inventing and solving programming puzzles. A puzzle is a Python verifier $f(x)\to \text{bool}$, and the solver’s task is to find an input $x$ such that $f(x)=\text{True}$. The framework is explicitly designed to avoid human-authored question bottlenecks, automatic grading problems, and saturation of static benchmarks; it also measures not only puzzle solving but puzzle creation, calibration, and creativity (Henniger et al., 19 Feb 2026).
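As a rough illustration of what such a verifier can look like, the sketch below defines a toy puzzle and a checker. The puzzle body and the helper name `passes` are hypothetical and are not taken from the paper; the only convention borrowed from the text is that a puzzle is a Python function returning a boolean.

```python
def mystery(x):
    # Hypothetical puzzle, not from the paper: accept a 5-character
    # palindrome that starts with "a".
    return isinstance(x, str) and len(x) == 5 and x == x[::-1] and x.startswith("a")


def passes(verifier, candidate):
    """Return True only if the verifier runs without error and returns True."""
    try:
        return verifier(candidate) is True
    except Exception:
        # In this sketch, a crashing verifier call counts as a failed attempt.
        return False


print(passes(mystery, "abcba"))  # a satisfying input
print(passes(mystery, "hello"))  # not a palindrome
```

Treating exceptions as failures is a design choice of this sketch; the actual benchmark may handle timeouts and errors differently.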

A duel involves two models over a fixed number of rounds. In each round, one model is the proposer and one is the solver. The proposer outputs a Python function mystery(x) and a sample solution. If the sample solution fails under execution, the proposer immediately loses the round. If the sample solution passes, the solver receives only the function and must output an input satisfying it. If the solver succeeds, the round is a draw; if the solver fails, the proposer wins the round. The duel runs for $2R$ rounds so that each model alternates roles equally often. Outcomes across pairwise duels are aggregated with a Bradley–Terry model in Elo parametrization,

$P(A \succ B) = \frac{1}{1 + 10^{(E_B - E_A)/\sigma}},$

with $\sigma = 400$ (Henniger et al., 19 Feb 2026).
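The round rules and the Elo-parametrized win probability can be sketched as follows. The function names and outcome labels are hypothetical; the sketch assumes the verifier has already been run and only its boolean results are passed in.

```python
def round_outcome(sample_ok, solver_ok):
    """Classify one round from two verification results.

    sample_ok: did the proposer's own sample solution pass the verifier?
    solver_ok: did the solver's answer pass the verifier?
    """
    if not sample_ok:
        return "proposer_loss"  # invalid puzzle: proposer loses immediately
    if solver_ok:
        return "draw"           # valid puzzle, solved
    return "proposer_win"       # valid puzzle, unsolved


def win_probability(elo_a, elo_b, sigma=400.0):
    """Bradley-Terry probability that A beats B in Elo parametrization."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / sigma))


print(round_outcome(True, False))
print(win_probability(1000.0, 1000.0))
```

With equal ratings the formula gives 0.5, and a 400-point advantage gives 10/11, which is a quick sanity check on the sign convention in the exponent.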

The benchmark evaluated 10 frontier models over $10\times 9 = 90$ ordered duels, with 10 rounds per duel. The resulting ranking placed GPT‑5.2 Pro first at Elo $1134$, with a $100.0\%$ solver win rate and a $43.3\%$ proposer win rate; Gemini 3 Pro followed, and several models clustered around a common rating. GPT‑5.2 base ranked last despite strong external benchmark performance, which the authors attribute to a likely mode mismatch in reasoning settings. Solver performance correlated more strongly with Humanity’s Last Exam and GPQA Diamond than proposer performance did, while proposer success exposed a distinct capability: the ability to generate valid, hard, yet solvable tasks (Henniger et al., 19 Feb 2026).

The framework’s empirical profile is especially notable. Across all models and rounds, solvers recorded 431 successes against 111 failures on valid puzzles, while proposers incurred 358 penalties from incorrect sample solutions. This points to a central asymmetry: creating good puzzles is harder than solving them. The paper also reports that puzzle difficulty rises over the course of a duel; for independently evaluated valid puzzles, the solve rates of both GPT‑5.2 and GPT‑5 mini decreased with each successive turn. This suggests that access to duel history induces adaptive difficulty escalation rather than static task generation (Henniger et al., 19 Feb 2026).
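The reported counts can be cross-checked with a few lines of arithmetic. The sketch assumes that every round ends in exactly one of the three outcomes (solved, unsolved, or invalid sample solution); the derived rates are computed here and are not figures quoted from the paper.

```python
solver_successes = 431   # valid puzzle, solved
solver_failures = 111    # valid puzzle, unsolved
invalid_proposals = 358  # sample solution failed

valid_puzzles = solver_successes + solver_failures  # 542
total_rounds = valid_puzzles + invalid_proposals    # 900

duels, rounds_per_duel = 90, 10
consistent = total_rounds == duels * rounds_per_duel

solve_rate_on_valid = solver_successes / valid_puzzles
invalid_proposal_rate = invalid_proposals / total_rounds

print(consistent)
print(round(solve_rate_on_valid, 3), round(invalid_proposal_rate, 3))
```

Under that assumption the three counts sum to the 900 rounds implied by 90 duels of 10 rounds each. Roughly four in five valid puzzles were solved, while about two in five proposals were invalid, which is the asymmetry described above.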

3. Token exchange games as ledger and network dynamics

In “Token Exchange Games,” tokens are information records that can be stored in some medium and allocated to agents, typically representing property rights or access rights. A token system is a collection of identical tokens plus a rule-set governing token behavior and distribution. The core mathematical object is the ledger, which is defined relative to a ranking; a sequence of ledgers models discrete time, and in practice each ledger entry is interpreted as a token balance (Naicker, 2019).

The network representation is built from a two-parameter family of directed graphs, in which one parameter indexes rounds and the other indexes token types. A vertex weighting assigns token holdings to agents, and an edge weighting specifies token-flow fractions along edges. In the closed-game update, the next round's vertex-weighting vector is obtained by applying the edge-weighting matrix to the current vector, which makes the dynamics linear. The open-game update adds a term for external issuance or removal, which makes the dynamics affine. Closed games preserve total supply; open games model minting, burning, taxation, or external injections (Naicker, 2019).
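A minimal sketch of the two update rules follows. It assumes a row-stochastic convention in which `flow[i][j]` is the fraction of agent `i`'s holdings sent to agent `j`, with `flow[i][i]` the retained fraction. The function names, the orientation of the matrix, and the example numbers are assumptions of this sketch and may not match the paper's notation.

```python
def closed_update(holdings, flow):
    """One round of a closed game: redistribute holdings along weighted edges."""
    n = len(holdings)
    return [sum(holdings[i] * flow[i][j] for i in range(n)) for j in range(n)]


def open_update(holdings, flow, external):
    """One round of an open game: redistribute, then add issuance or removal."""
    moved = closed_update(holdings, flow)
    return [m + e for m, e in zip(moved, external)]


holdings = [10.0, 0.0, 5.0]
flow = [
    [0.5, 0.5, 0.0],  # agent 0 keeps half, sends half to agent 1
    [0.0, 1.0, 0.0],  # agent 1 keeps everything
    [0.2, 0.0, 0.8],  # agent 2 sends 20% to agent 0
]

after_closed = closed_update(holdings, flow)
after_open = open_update(holdings, flow, external=[1.0, 0.0, -2.0])
print(sum(holdings), sum(after_closed), sum(after_open))
```

Because every row of `flow` sums to 1, the closed update leaves total supply unchanged, while the open update shifts it by the sum of the external term.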

This framework supports a multilayer interpretation. Each token type defines a layer; players hold token portfolio vectors, and exchange rates across layers are encoded by a fungibility matrix. The paper defines isolated layers, families of fungibility graphs, local fungibility equilibrium, and two information-theoretic costs, one for establishing exchange and one for preventing arbitrage. A key structural result states that if a fungibility graph is arbitrage-minimizing, then it must either be a tree or forest, or else all arbitrage-prevention costs vanish. This imposes a graph-theoretic constraint on multilayer token economies (Naicker, 2019).

The framework also imports information-theoretic and macroeconomic observables. Normalizing a holdings vector by total supply gives a distribution of token ownership, which enables entropy and relative entropy as distributional diagnostics. The paper also defines a per-round retention measure and an associated circulation rate. In a two-token marketplace, the exchange-rate entry of the fungibility matrix is defined from circulating supplies and retention rates, and the paper identifies the equation of exchange expressed in the TTG variables. Illustrative instances include PageRank, a two-player UBI-style treasury game, Lightning Network sublayers, and the Circles UBI system (Naicker, 2019).
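The distributional diagnostics can be sketched as below. The function names and the base-2 logarithm are assumptions of this sketch; the only idea taken from the text is that holdings are normalized into an ownership distribution before entropy and relative entropy are computed.

```python
import math


def normalized_ownership(holdings):
    """Turn non-negative token balances into ownership shares."""
    total = sum(holdings)
    return [h / total for h in holdings]


def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def relative_entropy(p, q):
    """Relative entropy D(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


equal = normalized_ownership([1, 1, 1, 1])
concentrated = normalized_ownership([4, 0, 0, 0])
print(entropy(equal), entropy(concentrated))
print(relative_entropy(concentrated, equal))
```

Equal ownership across four agents maximizes entropy at 2 bits, while full concentration in one agent gives zero entropy. The relative entropy measures how far one ownership distribution is from a reference one.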

4. Bidding token games on graphs

In bidding graph games, a token moves on a finite directed graph, but the right to choose the next edge is determined by an auction each round rather than by fixed turn order. Under Richman bidding, the higher bidder moves the token and pays the bid to the opponent; under poorman bidding, the winner pays the bank; taxman bidding interpolates between these cases via a parameter $\tau \in [0,1]$, where a fraction $\tau$ of the winning bid is paid to the opponent and $1-\tau$ to the bank. These games were developed first for Richman infinite-duration bidding, then extended to poorman and taxman variants (Avni et al., 2017, Avni et al., 2018, Avni et al., 2019).

A central object is the threshold budget or threshold ratio. In reachability Richman games, thresholds satisfy the local averaging law

$\mathrm{Th}(v) = \tfrac{1}{2}\bigl(\mathrm{Th}(v^-) + \mathrm{Th}(v^+)\bigr),$

where $v^-$ and $v^+$ are successors minimizing and maximizing the threshold. In poorman reachability, the recurrence becomes

$\mathrm{Th}(v) = \frac{\mathrm{Th}(v^+)}{1 - \mathrm{Th}(v^-) + \mathrm{Th}(v^+)},$

and in taxman reachability, with a fraction $\tau$ of the winning bid paid to the opponent and $1-\tau$ to the bank,

$\mathrm{Th}(v) = \frac{\mathrm{Th}(v^+) + \tau\,\mathrm{Th}(v^-)}{1 + \tau + (1-\tau)\bigl(\mathrm{Th}(v^+) - \mathrm{Th}(v^-)\bigr)}.$

Parity games reduce to reachability-style threshold analysis, so threshold ratios exist there as well (Avni et al., 2019).
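For the Richman case, the averaging law can be turned into a small fixed-point iteration: each interior vertex takes the mean of its smallest and largest successor thresholds. The sketch below assumes a reachability game in which the target has threshold 0 and a losing sink has threshold 1. The function name, the graph encoding, and the path-graph example are hypothetical, and the iteration is a generic numerical scheme rather than the algorithm of the cited papers.

```python
def richman_thresholds(successors, target, sink, tol=1e-12, max_iter=100_000):
    """Iterate Th(v) = (min successor Th + max successor Th) / 2 to a fixed point."""
    th = {v: 1.0 for v in successors}  # start from the pessimistic value
    th[target] = 0.0
    th[sink] = 1.0
    for _ in range(max_iter):
        delta = 0.0
        for v, succ in successors.items():
            if v == target or v == sink:
                continue
            values = [th[u] for u in succ]
            new = 0.5 * (min(values) + max(values))
            delta = max(delta, abs(new - th[v]))
            th[v] = new
        if delta < tol:
            break
    return th


# Path graph 0 - 1 - 2 - 3 - 4 with sink 0 and target 4.
path = {0: [0], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [4]}
thresholds = richman_thresholds(path, target=4, sink=0)
print({v: round(t, 6) for v, t in thresholds.items()})
```

On this path the thresholds interpolate linearly between the sink and the target, so the middle vertex needs half of the total budget.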

For strongly connected mean-payoff games, the theory becomes probabilistic. Richman bidding corresponds to a random-turn game with an unbiased coin, so the mean-payoff value is independent of the initial ratio. Poorman bidding instead corresponds to a biased random-turn game in which the coin's bias equals the initial ratio $r$. For taxman bidding with parameter $\tau$ and initial ratio $r$, the effective bias is

$F(\tau, r) = \frac{r + \tau(1 - r)}{1 + \tau},$

so the strongly connected mean-payoff taxman game has the value of the random-turn game with bias $F(\tau, r)$. This identifies Richman as the exceptional case: for every $\tau < 1$, the value depends on the initial ratio (Avni et al., 2018, Avni et al., 2019).

Discrete bidding introduces an additional combinatorial layer because bids are integral and ties occur explicitly. The 2022 work on discrete-bidding Richman games formalizes budgets as natural numbers that may carry a star, where a starred budget marks possession of an “advantage” that resolves ties. It provides a fixed-point algorithm revealing the structure of threshold budgets for parity discrete-bidding games, shows that finding thresholds is in NP and coNP for both reachability and parity objectives, and gives strategy constructions using only linear memory (Avni et al., 2022).

When budgets are only partially observable, the analysis changes sharply. In one-sided partial-information bidding games, one player knows only a distribution over the opponent’s initial budget. For qualitative first-price bidding games, threshold reasoning adapts directly. For mean-payoff poorman games, however, the partially informed player’s optimal pure strategies require phase-based budget splitting into “wallets,” and the paper proves that the pure-strategy value need not exist: in the bowtie game, the partially informed Max and the fully informed Min induce distinct lower and upper values (Avni et al., 2022).

5. Automata-theoretic token games and the 2-token theorem

In automata theory, TTG denotes a family of synchronised run-comparison games used to decide history-determinism. For a nondeterministic parity automaton $\mathcal{A}$, the history-determinism game lets Adam choose letters and Eve resolve nondeterministic transitions online; $\mathcal{A}$ is history-deterministic if Eve can always produce an accepting run whenever the word is in $L(\mathcal{A})$. The 2-token game sharpens this by giving Adam two runs and Eve one. Starting from the initial state, Adam chooses letters, Eve advances her token, and Adam advances both of his. Eve wins iff, whenever at least one of Adam’s two runs is accepting, her run is accepting as well (Prakash, 21 Jan 2025).

The main result is the 2-token theorem: for every nondeterministic parity automaton, Eve wins the 2-token game if and only if the automaton is history-deterministic. This proves Bagnol and Kuperberg’s conjecture beyond Büchi automata and yields a complexity improvement from the naïve EXPTIME upper bound to PTIME for fixed parity index and PSPACE in general. The thesis also proves NP-hardness when the parity index is not fixed, gives a polynomial-time determinisation procedure for history-deterministic Büchi automata, and develops quasipolynomial-time inclusion algorithms when the right-hand automaton is history-deterministic (Prakash, 21 Jan 2025).

A closely related quantitative extension studies $k$-token games $G_k$ for weighted automata, where Eve’s run must dominate the values of Adam’s $k$ runs. There, $G_1$ characterizes history-determinism for all quantitative and Boolean automata on finite words, and for DSum, Inf, and Reachability automata on infinite words; $G_2$ characterizes history-determinism for LimInf, LimSup, and Sup automata on infinite words. The resulting complexity landscape ranges from PTime for Safety, Reachability, Inf, and Sup, to NP $\cap$ co-NP for DSum, quasipolynomial time for LimSup, and exponential time for LimInf unless the number of weights is logarithmic (Boker et al., 2021).

6. Percolation, probabilistic cellular automata, and draws

Another token-game lineage arises in percolation theory. On the oriented square lattice $\mathbb{Z}^2$, each site is independently a trap with probability $p$, a target with probability $q$, or open with probability $1-p-q$. A token starts at the origin, and the two players alternately move it from its current site $x$ to $x+(0,1)$ or $x+(1,0)$. Moving to a trap loses immediately; moving to a target wins immediately. The principal question is whether optimal play can yield draws with positive probability (Holroyd et al., 2015).

The paper encodes each starting site by an outcome variable with three possible values, corresponding to first-player win, first-player loss, or draw. Backward recursion along antidiagonals turns this game into a one-dimensional probabilistic cellular automaton, with a binary PCA when draws are absent and an envelope PCA over a three-symbol alphabet when draws are allowed. The central result is that if $p > 0$ or $q > 0$, then the PCA is ergodic and the probability of a draw on $\mathbb{Z}^2$ is zero. Thus, despite the existence of infinite open paths in some parameter regimes, the token game almost surely has a well-defined winner under optimal play (Holroyd et al., 2015).
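The backward recursion can be illustrated on a finite window of the lattice. The sketch assumes moves go from site `(i, j)` to `(i + 1, j)` or `(i, j + 1)`, and it treats any site outside the window as an unresolved draw. That truncation is an assumption of this sketch and is not part of the paper's infinite-lattice analysis; the names `solve_window`, `TRAP`, `TARGET`, and `OPEN` are hypothetical.

```python
TRAP, TARGET, OPEN = "trap", "target", "open"
FLIP = {"W": "L", "L": "W", "D": "D"}


def solve_window(grid):
    """Outcome ('W', 'L' or 'D') for the player to move at each open site."""
    rows, cols = len(grid), len(grid[0])
    outcome = {}

    def result_of_moving_to(i, j):
        # Outcome for the player who moves the token onto site (i, j).
        if i >= rows or j >= cols:
            return "D"  # truncation assumption: unknown beyond the window
        if grid[i][j] == TRAP:
            return "L"
        if grid[i][j] == TARGET:
            return "W"
        return FLIP[outcome[(i, j)]]  # the opponent now moves from an open site

    # Sweep backwards so both successors are resolved before the site itself.
    for i in reversed(range(rows)):
        for j in reversed(range(cols)):
            if grid[i][j] != OPEN:
                continue
            results = [result_of_moving_to(i + 1, j), result_of_moving_to(i, j + 1)]
            if "W" in results:
                outcome[(i, j)] = "W"
            elif all(r == "L" for r in results):
                outcome[(i, j)] = "L"
            else:
                outcome[(i, j)] = "D"
    return outcome


grid = [
    [OPEN, OPEN, TARGET],
    [OPEN, TARGET, TRAP],
    [TARGET, TRAP, TRAP],
]
print(solve_window(grid)[(0, 0)])
```

In this window both moves from the origin hand the opponent an immediate win, so the first player loses. A site whose successors all lie outside the window stays a draw under the truncation rule.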

Higher-dimensional analogues behave differently. For certain directed graphs satisfying a layered automorphism condition, the trapping game reduces to Glauber dynamics for a hard-core model on a lower-dimensional “doubling graph.” Multiple Gibbs distributions in that hard-core model imply positive draw probability in the token game. The paper proves such draws for suitable parameter values on several higher-dimensional directed lattices, including an oriented version of the even sublattice of $\mathbb{Z}^d$ for all $d \ge 3$, and conjectures analogous draw behavior for the standard oriented lattice $\mathbb{Z}^d$ when $d \ge 3$, although the method does not extend there (Holroyd et al., 2015).

7. Comparative significance and open directions

Taken together, these usages place “tokens” at markedly different abstraction levels. In the LLM benchmark, the tokenized object is the puzzle duel itself, and TTG measures adversarial task generation, self-calibration, and solver competence. In token exchange games, tokens are conserved or non-conserved accounting units moving through weighted networks. In bidding games, tokens or budgets are control resources that determine which player moves a graph token. In automata TTG, tokens are synchronised runs witnessing or refuting history-determinism. In percolation games, a single token traces a path through a random environment whose outcome structure is linked to PCA ergodicity and hard-core phase transitions (Henniger et al., 19 Feb 2026, Naicker, 2019, Avni et al., 2017, Prakash, 21 Jan 2025, Holroyd et al., 2015).

The open problems are correspondingly heterogeneous. The puzzle-duel TTG explicitly points to extensions beyond Python verifiers, including other formal domains and multimodal puzzles, and to possible intersections with safety evaluation (Henniger et al., 19 Feb 2026). The automata-theoretic program leaves open the exact complexity of history-determinism for arbitrary parity automata and broader extensions to infinite-state systems (Prakash, 21 Jan 2025). In partially observable bidding games, the pure-strategy value gap raises the unresolved question of whether mixed strategies restore value existence (Avni et al., 2022). In percolation games, the standard oriented lattice $\mathbb{Z}^d$ for $d \ge 3$ remains conjectural territory (Holroyd et al., 2015). The ledger formalism, finally, isolates several modeling limits (no explicit utility functions, no continuous-time treatment, and only pairwise fungibility), which suggests that further game-theoretic structure can be layered on top of the linear and affine dynamics without altering their state evolution equations (Naicker, 2019).

In that sense, TTG is best understood not as a single theory but as a family resemblance across multiple research programs. The unifying feature is the replacement of informal “state changes” by explicit tokenized dynamics—whether those dynamics describe reasoning contests, ledger flows, auctions for control, acceptance witnesses, or motion through random media.