Random Token Selection (RTS)

Updated 7 April 2026

Random Token Selection (RTS) is a statistical mechanism that selects a random subset of tokens for computation, optimizing efficiency and privacy.
RTS employs methods like Boltzmann Draw, token substitution in transformers, and unbiased sampling in reinforcement learning to reduce computational costs.
RTS is applied in digital payments, self-supervised NLP, and RL, achieving resource savings and performance parity with traditional deterministic approaches.

Random Token Selection (RTS) refers to algorithmic mechanisms and training objectives where a subset of tokens, either from input data or generated sequences, is selected at random—typically under a well-defined statistical protocol—for downstream computation, optimization, or learning. RTS has become a foundational tool in diverse settings such as coin selection in digital payment systems, self-supervised objectives in transformer pre-training, and scalable reinforcement learning for long chain-of-thought trajectories. The precise formulation, application, and theoretical guarantees of RTS vary by context but generally exploit statistical unbiasedness, efficiency, or privacy advantages relative to deterministic or exhaustive strategies.

1. Formulations of Random Token Selection

RTS encompasses multiple formally distinct methodologies across domains:

Probabilistic subset selection for transaction funding: In token-based payment systems, RTS arises via the Random Draw (RD) scheme or, more generally, the Boltzmann Draw (BD). Given a wallet $U=\{u_1,\dots,u_n\}$ and a transaction value $V$, the BD selects tokens without replacement using:

$P(u) = \frac{\exp(-\beta v(u))}{Z}, \quad Z = \sum_{w \in U\setminus S} \exp(-\beta v(w))$

where $v(u)$ is the value of $u$, $S$ is the current selection, and $\beta$ tunes selection bias toward small or large tokens. For $\beta=0$, this recovers uniform RTS (standard RD) (Bönsel et al., 19 Feb 2026).

Random token corruption in transformer objectives: In self-supervised NLP, Random Token Substitution (RTS) perturbs a sequence $x=(x_1,\dots,x_n)$ via a binary mask $m_i\sim \text{Bernoulli}(p)$ per token: each $x_i$ with $m_i = 1$ is replaced with a random token $\tilde{x}_i \sim q$, where $q$ is typically uniform over the vocabulary. The downstream objective is to classify each token as original or replaced using a binary head (Liello, 2023).
Partial token selection in RL policy gradients: For long-sequence reinforcement learning, Not All Tokens Are Needed (NAT) applies RTS via Uniform Random Sampling (URS) of the policy gradient’s per-token terms, using Horvitz–Thompson (HT) unbiased reweighting:

$\hat{g} = \sum_{t=1}^{T} \frac{m_t}{p}\, g_t$

where $g_t$ is the per-token gradient term at position $t$, $m_t \sim \text{Bernoulli}(p)$ is the keep mask, and $p$ is the retention probability (Sang et al., 20 Feb 2026).
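The token-substitution corruption described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name `rts_corrupt`, the default rate `p=0.15`, and the choice to exclude the original token from the replacement draw (so that a "replaced" label always corresponds to a changed token) are all assumptions made here.

```python
import random

def rts_corrupt(token_ids, vocab_size, p=0.15, rng=None):
    """Return (corrupted_ids, labels); labels[i] == 1 where position i was replaced.

    Assumptions (not from the source): p defaults to 0.15, vocab_size >= 2,
    and replacements are drawn uniformly from the vocabulary minus the
    original token.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in token_ids:
        if rng.random() < p:                     # m_i ~ Bernoulli(p)
            # Uniform draw over the other vocab_size - 1 token ids.
            new_tok = rng.randrange(vocab_size - 1)
            if new_tok >= tok:
                new_tok += 1                     # skip over the original id
            corrupted.append(new_tok)
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

if __name__ == "__main__":
    ids, labels = rts_corrupt([12, 7, 99, 3, 41], vocab_size=1000, p=0.5,
                              rng=random.Random(0))
    print(ids, labels)
```

The `labels` list is the target for the binary original-versus-replaced head; a training loop would feed `corrupted` to the encoder and apply a per-position binary cross-entropy against `labels`.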

2. Algorithmic Procedures and Complexity

Algorithmic instantiations of RTS are tailored by use case:

Boltzmann Draw (BD): Iteratively sample tokens from $U \setminus S$ using BD probabilities; update weights, select, and continue until the cumulative value meets $V$. Complexity per draw is $O(n)$ for weight computation and $O(\log n)$ for sampling. Practical total cost per transaction is $O(kn)$, with $k$ the typical number of used tokens, and advanced data structures (segment tree, value-binning) yield further gains (Bönsel et al., 19 Feb 2026).
Transformer RTS pre-training: The mask sampling and token replacement incur negligible overhead. The binary classification prediction head is $O(1)$ per position, improving batch throughput and lowering memory requirements relative to MLM’s $O(|\mathcal{V}|)$ head over the vocabulary $\mathcal{V}$ (Liello, 2023).
RL RTS (NAT framework): In URS, randomly mask tokens in each rollout, apply HT reweighting, and backpropagate only through kept tokens, reducing backward pass complexity linearly in token retention fraction $p$. With Random Prefix Cutting (RPC), a randomly chosen prefix length $L$ is used; only tokens up to $L$ are processed, yielding quadratic reductions, $O(L^2)$ instead of $O(T^2)$ for a rollout of length $T$, in both forward and backward steps. Peak activation memory and compute scale with average prefix length (Sang et al., 20 Feb 2026).
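The Boltzmann Draw loop described above (sample from the remaining tokens, add to the selection, stop once the target value is covered) can be sketched as follows. This is a naive linear-time-per-draw version under stated assumptions: the function name `boltzmann_draw`, the list-of-values wallet representation, and returning `None` for an underfunded wallet are illustrative choices, not taken from the cited work.

```python
import math
import random

def boltzmann_draw(values, target, beta=0.0, rng=None):
    """Pick token indices without replacement until their total value >= target.

    values: non-negative token values v(u) for the wallet U
    target: transaction value V
    beta:   bias parameter; beta = 0 gives the uniform Random Draw
    Returns the selected indices in draw order, or None if the wallet
    cannot cover the target (a convention assumed for this sketch).
    """
    rng = rng or random.Random()
    if sum(values) < target:
        return None
    remaining = list(range(len(values)))         # indices of U \ S
    selected, total = [], 0
    while total < target:
        # Subtract the largest exponent before exponentiating; the common
        # factor cancels in P(u) = exp(-beta v(u)) / Z.
        exps = [-beta * values[i] for i in remaining]
        shift = max(exps)
        weights = [math.exp(e - shift) for e in exps]
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        idx = remaining.pop(pick)
        selected.append(idx)
        total += values[idx]
    return selected

if __name__ == "__main__":
    wallet = [1, 2, 5, 10, 20]
    print(boltzmann_draw(wallet, target=12, beta=0.1, rng=random.Random(0)))
```

With `beta > 0` the draw favors small tokens, with `beta < 0` large ones. Each iteration recomputes all weights, which matches the linear per-draw cost noted above rather than the faster tree-based variants.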

3. Theoretical Properties and Guarantees

RTS schemes are constructed for statistical correctness and efficiency:

Unbiased estimation: In RL, the HT correction ensures gradient unbiasedness. With $g_t$ the per-token gradient term and $m_t \sim \text{Bernoulli}(p)$ the keep mask, $\mathbb{E}[m_t] = p$ gives:

$\mathbb{E}\left[\sum_{t=1}^{T} \frac{m_t}{p}\, g_t\right] = \sum_{t=1}^{T} g_t$

enabling safe subsampling of tokens without altering expected policy updates (Sang et al., 20 Feb 2026).

Privacy and robustness: In coin selection, probabilistic draws render observer inference of wallet content imprecise, leaking only an expected-value parameter rather than the full token inventory; this is strictly less information than deterministic greedy selection reveals (Bönsel et al., 19 Feb 2026).
Variance/efficiency tradeoffs: RTS introduces variance inflation proportional to $1/p$, where $p$ is the token retention probability; selection rates must be chosen to balance compute savings against stochastic gradient noise (Sang et al., 20 Feb 2026).
Practical equivalence: In pre-training, RTS with binary heads achieves downstream accuracy matching MLM within statistical fluctuations, but with ~20–45% less wall-clock time (Liello, 2023).
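The unbiasedness and variance statements above can be checked exactly on a toy case by enumerating every keep mask. The sketch below uses scalar stand-ins for the per-token gradient terms and assumes independent Bernoulli masks with a shared retention probability; the function names are hypothetical. Under those assumptions the estimator's variance works out to $\frac{1-p}{p}\sum_t g_t^2$, which grows like $1/p$ as $p$ becomes small.

```python
from itertools import product

def ht_estimate(terms, mask, p):
    """Horvitz-Thompson estimate: kept terms are reweighted by 1/p."""
    return sum(m * g / p for g, m in zip(terms, mask))

def ht_moments(terms, p):
    """Exact mean and variance of the HT estimate over all 2^T masks."""
    mean, second = 0.0, 0.0
    for mask in product((0, 1), repeat=len(terms)):
        kept = sum(mask)
        prob = p ** kept * (1 - p) ** (len(terms) - kept)
        est = ht_estimate(terms, mask, p)
        mean += prob * est
        second += prob * est * est
    return mean, second - mean * mean

if __name__ == "__main__":
    g = [0.5, -1.25, 2.0, 0.75]        # scalar stand-ins for gradient terms
    for p in (0.9, 0.5, 0.2):
        mean, var = ht_moments(g, p)
        print(p, mean, var)            # mean stays at sum(g); var grows as p shrinks
```

Real gradients are vectors, so the same identity holds per coordinate; the enumeration is only practical for very short sequences and is meant as a sanity check, not a training utility.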

4. Empirical Results and Comparative Metrics

Experimental studies demonstrate the efficacy of RTS:

Domain / Metric	Baseline	RTS Variant	Key Results
Coin selection (wallet pool)	RD (Random)	BD (RTS with tunable $\beta$)	BD's pool sizes differ from RD's; BD matches greedy on pool size and input count, with better value diversity (Bönsel et al., 19 Feb 2026)
Transformer pre-training	MLM (BERT)	RTS (uniform/C-RTS)	GLUE score difference within statistical fluctuations; RTS ~20% faster (base), up to 45% (small) (Liello, 2023)
RL (MATH, AIME24/25)	Full token	RPC (prefix cut)	Accuracy parity using 54% of tokens; 18% lower GPU memory, 29% lower step time (Sang et al., 20 Feb 2026)

In all cases, RTS maintains core task performance while offering significant resource or privacy advantages over deterministic or exhaustive approaches.

5. Key Applications

RTS is an enabling mechanism in the following contexts:

Cryptocurrencies/CBDCs: RTS via BD provides efficient, privacy-preserving selection of tokens for funding payments, bounding wallet size, controlling dust, and enhancing concurrency on ledger nodes (Bönsel et al., 19 Feb 2026).
Unsupervised/self-supervised NLP: RTS (token substitution) offers an efficient, fully token-internal pre-training alternative to [MASK] denoising, mitigating vocabulary shift between pre-training and fine-tuning, and reducing computational overhead (Liello, 2023).
Efficient reinforcement learning: NAT-style RTS enables RL on long sequence generation tasks by making token budget and training cost explicit control variables, allowing scaling to longer trajectories at fixed resource usage (Sang et al., 20 Feb 2026).

6. Implementation Considerations and Open Challenges

Adopting RTS in practice involves several technical considerations:

Efficient sampling: Use look-up tables or hardware FPU for exponentials in BD; employ alias methods or segment trees for logarithmic-time drawing (Bönsel et al., 19 Feb 2026).
Head architecture in transformer objectives: RTS benefits from shared-embedding and lightweight binary heads (Liello, 2023).
Gradient management in RL: HT reweighting is unbiased but increases gradient variance; one can tune $p$ to manage the speed–quality–variance frontier (Sang et al., 20 Feb 2026).
Privacy and fingerprinting: RTS variants (BD in payments, token substitution in NLP) deliberately obfuscate deterministic structure, enhancing user privacy over baseline greedy or masked approaches (Bönsel et al., 19 Feb 2026, Liello, 2023).
Hyperparameter tuning: Substitution rates, selection probability $p$, prefix cutoff distributions, and $\beta$ must be chosen empirically for optimal trade-offs among performance, memory, time, and privacy objectives.
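As an illustration of the logarithmic-time drawing mentioned under efficient sampling, the sketch below uses a Fenwick (binary indexed) tree, swapped in here as a compact relative of the segment tree; it is not taken from the cited work, and the class name `FenwickSampler` is hypothetical. Because the BD weight $\exp(-\beta v(u))$ of a token does not change between draws, removing a selected token amounts to setting its weight to zero. The code is exact for integer weights; with floating-point weights, rounding in the partial sums can leave tiny residues, which a production version would need to handle.

```python
import random

class FenwickSampler:
    """Weighted sampler: O(log n) per draw and per weight update.

    Assumes non-negative weights. Exact for integer weights; see the
    floating-point caveat in the accompanying text.
    """

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0] * (self.n + 1)   # 1-based array of partial sums
        self.weights = [0] * self.n
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, w):
        """Set the weight of item i to w (w = 0 removes the item)."""
        delta = w - self.weights[i]
        self.weights[i] = w
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & -j

    def total(self):
        """Sum of all current weights."""
        s, j = 0, self.n
        while j > 0:
            s += self.tree[j]
            j -= j & -j
        return s

    def sample(self, rng):
        """Return index i with probability weights[i] / total()."""
        total = self.total()
        if total <= 0:
            raise ValueError("no item with positive weight")
        r = rng.random() * total
        pos, step = 0, 1 << self.n.bit_length()
        # Descend to the largest pos whose prefix sum is <= r.
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= r:
                pos, r = nxt, r - self.tree[nxt]
            step >>= 1
        return min(pos, self.n - 1)      # clamp only matters under rounding

if __name__ == "__main__":
    rng = random.Random(0)
    sampler = FenwickSampler([1, 1, 1, 1, 1])   # beta = 0: uniform weights
    order = []
    for _ in range(3):                          # draw three tokens without replacement
        i = sampler.sample(rng)
        order.append(i)
        sampler.update(i, 0)
    print(order)
```

Building the tree costs $O(n \log n)$ in this simple form, after which each draw-and-remove step is logarithmic, in contrast to recomputing every weight on each draw.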

Open questions include optimal, possibly history-adaptive β schedules in BD, generalized penalty functions for token value, principled selection probability schedules for minimal RL variance at fixed budget, and formal bounds on the growth and efficiency of token pools and input sets across varying statistical regimes (Bönsel et al., 19 Feb 2026, Sang et al., 20 Feb 2026).

7. Connections and Significance

RTS methodologies unify and extend approaches in coin selection, pre-training, and efficient RL by turning token-level stochasticity into a design primitive with robust theoretical properties. Their broad adoption demonstrates the utility of statistically grounded subset selection—balancing efficiency, privacy, and learning efficacy—in modern scalable machine learning and distributed systems (Bönsel et al., 19 Feb 2026, Liello, 2023, Sang et al., 20 Feb 2026).