Direct Tokenisation Optimization

Updated 21 November 2025
  • Direct tokenisation is a method that globally optimizes the mapping of input data to discrete tokens, contrasting with incremental approaches like BPE.
  • It employs global criteria to minimize overall compressed length or maximize information alignment, a problem shown to be NP-complete with no PTAS available.
  • This approach enhances symbolic reasoning and multimodal applications, improving linguistic segmentation, structured data encoding, and image representation.

Direct tokenisation is a family of algorithmic approaches in which the mapping from original input data (text, bytes, images, structured records, or assets) to a discrete token sequence is established by explicit, usually global, criteria rather than incremental or greedy procedures. Unlike bottom-up schemes such as Byte-Pair Encoding (BPE), which iteratively merges frequent pairs, direct tokenisation aims to generate a token vocabulary and segmentation strategy by globally optimizing a task-relevant objective, typically minimizing the compressed representation length or maximizing information alignment. Direct tokenisation has achieved particular significance in language modeling, multi-modal foundation models, structured data representation, and tokenization of non-textual assets, and is central to discussions of computational intractability, linguistic alignment, and the fundamental constraints on symbolic reasoning in neural architectures.

1. Formal Definitions and Core Objectives

Let $\Sigma$ denote a finite alphabet and $\mathcal{D} \subseteq \Sigma^*$ a finite multiset (dataset) of strings. Direct tokenisation seeks a vocabulary $\mathcal{S} \subseteq \Sigma^+$ of size $|\mathcal{S}| = |\Sigma| + \delta$ (all single characters plus $\delta$ multi-symbol strings) and a deterministic encoder $\mathrm{tok}_{\uparrow[\mathcal{S}]}$ mapping $x \in \Sigma^*$ to a minimal-length sequence of tokens $(s_1, \ldots, s_k) \in \mathcal{S}^*$ such that $x = \mathrm{concat}(s_1, \ldots, s_k)$. The canonical objective is minimization of the total compressed length over the dataset:

$$
\mathcal{S}^* \;=\; \arg\min_{\substack{\mathcal{S} \subseteq \Sigma^+ \\ |\mathcal{S}| = |\Sigma| + \delta}} \;\sum_{x \in \mathcal{D}} \bigl|\mathrm{tok}_{\uparrow[\mathcal{S}]}(x)\bigr|
$$

The corresponding decision problem asks: for a given $(\mathcal{D}, \delta, T)$, does there exist such an $\mathcal{S}$ with total compressed length $\leq T$? This compressed-length minimization, in both its optimization and decision forms, defines direct tokenisation in the formal sense (Whittington et al., 19 Dec 2024, Kastreva et al., 19 Nov 2025). For images, continuous or structured data, and multimodal scenarios, direct tokenisation generalizes to finding token assignments that optimize perceptual loss, semantic consistency, or other modality-specific objectives (Yu et al., 11 Jun 2024, Hou et al., 17 Nov 2025, Karim et al., 3 Aug 2025).
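
As a concrete illustration of this objective, the following Python sketch computes the minimal-length encoding of a string under a candidate vocabulary with a simple prefix dynamic program, and then exhaustively searches over the $\delta$ extra tokens. The function names and the brute-force search are illustrative assumptions rather than an algorithm from the cited papers, and the exhaustive loop is exponential, which is precisely why it is only workable on toy inputs.

```python
from itertools import combinations

def min_tokens(x: str, vocab: set[str]) -> float:
    """Length of a shortest tokenisation of x over vocab (dynamic program on prefixes)."""
    INF = float("inf")
    best = [0.0] + [INF] * len(x)          # best[i] = fewest tokens covering x[:i]
    for i in range(1, len(x) + 1):
        for j in range(i):
            if best[j] != INF and x[j:i] in vocab:
                best[i] = min(best[i], best[j] + 1)
    return best[len(x)]

def direct_tokenise(dataset: list[str], delta: int):
    """Exhaustively pick the delta extra multi-symbol tokens minimising total compressed
    length.  The search is exponential in the number of candidate substrings, which is
    exactly the intractability discussed in Section 2 -- usable only on toy inputs."""
    alphabet = {c for x in dataset for c in x}
    candidates = sorted({x[i:j] for x in dataset
                         for i in range(len(x)) for j in range(i + 2, len(x) + 1)})
    best_vocab = set(alphabet)
    best_total = sum(min_tokens(x, best_vocab) for x in dataset)
    for extra in combinations(candidates, delta):
        vocab = alphabet | set(extra)
        total = sum(min_tokens(x, vocab) for x in dataset)
        if total < best_total:
            best_vocab, best_total = vocab, total
    return best_vocab, best_total

# Toy example with delta = 1: adding "abab" is optimal (1 + 3 = 4 tokens in total).
print(direct_tokenise(["abab", "ababab"], delta=1))
```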

2. Computational Complexity: NP-Completeness and Inapproximability

Multiple independent works have rigorously proven that direct tokenisation as stated above is NP-complete, even under strong constraints:

  • For arbitrary finite alphabets, the direct-tokenisation decision problem lies in NP and is NP-hard via a polynomial-time reduction from Max-2-SAT (Whittington et al., 19 Dec 2024).
  • The problem remains NP-complete for bounded alphabets, including binary ($n=2$) and even unary ($n=1$) alphabets, establishing that the intractability is not an artifact of large alphabets or complex symbol sets (Kastreva et al., 19 Nov 2025).
  • No polynomial-time approximation scheme (PTAS) exists unless $\mathrm{P}=\mathrm{NP}$: there is a constant $\rho>1$ such that no polynomial-time algorithm can guarantee a vocabulary within a factor $\rho$ of the minimal compressed length (Kastreva et al., 19 Nov 2025).
  • In practice, this intractability explains why real-world tokenisers such as BPE and UnigramLM employ greedy or heuristic approaches (a minimal greedy-merge sketch follows this list); exact global optimization quickly becomes computationally prohibitive even for modestly sized or structurally simple datasets (Whittington et al., 19 Dec 2024, Kastreva et al., 19 Nov 2025).
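
For contrast with the global objective above, the following is a minimal sketch of the greedy merge loop used by BPE-style tokenisers; it is a simplified illustration rather than the reference implementation of any tokeniser cited here. Each step commits to the locally most frequent pair and never revisits earlier merges, which is how such heuristics sidestep the NP-hard global search at the cost of optimality.

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Greedy BPE sketch: repeatedly merge the currently most frequent adjacent pair.
    Each merge is a local, irrevocable choice -- no global objective is optimised."""
    words = [list(w) for w in corpus]            # each word as a list of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]      # greedy choice: most frequent pair
        merges.append((a, b))
        for w in words:                          # apply the merge in place
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

print(bpe_train(["low", "lower", "lowest"], num_merges=3))
```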

3. Algorithmic Paradigms and Direct Tokenisation Variants

Direct tokenisation encompasses a spectrum of algorithmic approaches:

| Variant / Domain | Objective | Representative Algorithms |
| --- | --- | --- |
| Text, sequence | Minimum-length tokenisation | Brute-force DP, unigram/BPE (heuristic) |
| Bytes | Low-surprisal LM-based span detection | ByteSpan (Goriely et al., 23 Jun 2025) |
| Structured/tabular | Structure+value alignment with subword merges | Hybrid direct+BPE pipeline (Karim et al., 3 Aug 2025) |
| Multimodal/Embedding | Semantic distance, quantization error minimization | Mixture-of-Experts quantization (Hou et al., 17 Nov 2025) |
| Images | Latent semantic representation, reconstruction loss | 1-D transformer quantizer (TiTok) (Yu et al., 11 Jun 2024) |

ByteSpan is a direct method that identifies segments of bytes with low surprisal under an external byte-level language model, using either global thresholds or monotonicity constraints, and then directly compiles frequent spans into subwords (Goriely et al., 23 Jun 2025).
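
A hedged sketch of the threshold variant of this idea is given below; the `surprisal` callable and the grouping loop are illustrative assumptions standing in for the external byte-level language model and the actual ByteSpan procedure.

```python
def group_low_surprisal_spans(byte_seq: bytes, surprisal, threshold: float) -> list[bytes]:
    """Group bytes into spans: a byte whose surprisal under a byte-level LM is below
    `threshold` extends the current span; a surprising byte starts a new one.
    `surprisal(prefix, byte)` is an assumed callable returning -log p(byte | prefix)."""
    spans, current = [], b""
    for i in range(len(byte_seq)):
        byte = byte_seq[i:i + 1]
        if current and surprisal(byte_seq[:i], byte) >= threshold:
            spans.append(current)   # high surprisal => boundary before this byte
            current = byte
        else:
            current += byte         # predictable byte => extend the current span
    if current:
        spans.append(current)
    return spans

# Toy stand-in for a byte-level LM: a repeated byte is "predictable", anything else is not.
toy_surprisal = lambda prefix, byte: 0.5 if prefix[-1:] == byte else 3.0
print(group_low_surprisal_spans(b"aaabbbab", toy_surprisal, threshold=1.0))
# [b'aaa', b'bbb', b'a', b'b']
```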

Hybrid tabular schemes map structural markers and low-cardinality features to fixed tokens, and apply direct subword discovery (e.g., BPE trained for each column) to high-cardinality or numeric columns, enabling sequence-based ingestion by LLMs (Karim et al., 3 Aug 2025).
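
The sketch below illustrates the general shape of such a hybrid pipeline. The marker format and the `low_card_vocab` and `column_tokenisers` structures are invented for illustration, and a whitespace split stands in for a per-column BPE model; none of this is drawn from the cited paper's implementation.

```python
def tokenise_row(row: dict, low_card_vocab: dict, column_tokenisers: dict) -> list[str]:
    """Hybrid tabular sketch: structural markers and low-cardinality categorical values
    become single fixed tokens; high-cardinality or free-text columns are handed to a
    per-column subword tokeniser (column_tokenisers[col] is a callable str -> list[str])."""
    tokens = ["<ROW>"]
    for col, value in row.items():
        tokens.append(f"<COL:{col}>")                          # structural marker token
        if col in low_card_vocab:
            tokens.append(low_card_vocab[col][value])          # one fixed token per value
        else:
            tokens.extend(column_tokenisers[col](str(value)))  # per-column subword tokens
    tokens.append("</ROW>")
    return tokens

row = {"country": "DE", "description": "handmade oak table"}
low_card = {"country": {"DE": "<country=DE>", "FR": "<country=FR>"}}
col_tok = {"description": lambda s: s.split()}  # stand-in for a column-specific BPE model
print(tokenise_row(row, low_card, col_tok))
# ['<ROW>', '<COL:country>', '<country=DE>', '<COL:description>', 'handmade', 'oak', 'table', '</ROW>']
```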

Multidomain embedding tokenisation (as in UniTok) trains a single tokeniser to project continuous item representations into a shared latent space, then directly assigns discrete tokens via domain-conditional quantizer architectures with mutual information calibration (Hou et al., 17 Nov 2025).
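
At its core, the assignment step is a form of vector quantisation. The sketch below shows only that generic nearest-codebook assignment; the domain-conditional Mixture-of-Experts quantisers and mutual-information calibration described in the cited work are not reproduced here.

```python
import numpy as np

def quantise_embeddings(embeddings: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each continuous item embedding directly to the index of its nearest
    codebook vector; that index is the item's discrete token."""
    # squared Euclidean distance between every embedding and every code vector
    d2 = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))      # 5 item embeddings of dimension 8
codes = rng.normal(size=(16, 8))   # codebook of 16 discrete tokens
print(quantise_embeddings(emb, codes))   # five token ids in [0, 16)
```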

4. Impact on Symbolic and Semantic Processing

Direct (atomic) tokenisation provides crucial control over the granularity and interpretability of downstream model operations:

  • Symbolic and arithmetic reasoning in LLMs is fundamentally limited by token granularity. Under standard BPE, frequent subword merges obscure atomic units (e.g., digits, letters), hindering tasks such as counting, sorting, and manipulation of symbol sequences (Zhang et al., 20 May 2025).
  • Adopting a direct tokenisation scheme aligned to atomic units (e.g., character-level or delimiter-forced tokenisation; see the sketch after this list) restores token awareness, enabling small LLMs to outperform much larger ones on fine-grained symbolic computations; observed gains often exceed 50 percentage points in task accuracy for counting and sorting (Zhang et al., 20 May 2025).
  • Similarly, in structured data, direct mapping of structural elements avoids the semantic ambiguity and sharing loss associated with purely frequency-based BPE merges (Karim et al., 3 Aug 2025).
  • For cross-domain recommendation and multi-modal models, direct tokenisation enables generalization, parameter efficiency, and balanced semantic calibration otherwise unattainable with per-domain or greedy tokenisers (Hou et al., 17 Nov 2025).
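
As referenced above, a minimal illustration of delimiter-forced atomic tokenisation follows; it is a sketch of the general idea rather than the exact procedure from the cited paper.

```python
def force_atomic(text: str, delimiter: str = " ") -> str:
    """Rewrite the input so that any downstream subword tokeniser is forced to emit
    (roughly) one token per atomic unit, here per character, restoring token
    awareness for counting- and sorting-style tasks."""
    return delimiter.join(text)

# "strawberry" -> "s t r a w b e r r y": a subword tokeniser can no longer fuse
# characters such as "rr" into one opaque token, so per-character questions
# (e.g. counting the r's) line up with the model's token boundaries.
print(force_atomic("strawberry"))
```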

5. Morphological and Linguistic Alignment

Direct tokenisation methods are central to improving morphological segmentation and token-level linguistic interpretability:

  • ByteSpan, by grouping locally predictable bytes per LM surprisal, achieves higher morphological alignment scores (F1 up to 0.885 at 16k vocab, monotonic constraint) than BPE and even BPE-inferred WordPiece segmentation, while slightly increasing average token-per-word fertility (Goriely et al., 23 Jun 2025).
  • Explicit treatment of linguistic boundaries, such as always tokenizing spaces as dedicated tokens and forbidding their inclusion in subwords (sketched after this list), significantly improves morphological segmentation metrics and derivational task accuracy (up to +2 F1 points and +2 percentage points in accuracy on derivational tasks), with no degradation in core NLU benchmarks (Gow-Smith et al., 2022).
  • Such direct, linguistically motivated tokenisation can mitigate vocabulary degeneracy, representation inconsistency, and poor handling of complex morphemes caused by subword merges that entangle spaces with word position (Gow-Smith et al., 2022).
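
A minimal sketch of the space-as-dedicated-token idea referenced above (an illustrative pre-tokenisation step, not the cited paper's implementation):

```python
import re

def space_aware_pretokenise(text: str) -> list[str]:
    """Keep every whitespace character as its own dedicated token and never let it
    be absorbed into a subword; the word pieces between spaces would then be passed
    to an ordinary subword tokeniser."""
    return [piece for piece in re.split(r"(\s)", text) if piece]

print(space_aware_pretokenise("unhappily ever after"))
# ['unhappily', ' ', 'ever', ' ', 'after']
```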

6. Modality-Driven Direct Tokenisation Beyond Text

The direct tokenisation paradigm generalizes natively to non-sequential or non-text domains:

  • In image generation, TiTok utilizes a Transformer-based encoder with learnable latent tokens, producing a fixed-length, one-dimensional sequence of quantized latents that directly represent the image. At 256×256 resolution, TiTok achieves extreme compression (32 tokens/image), with gFID 1.97 and up to 410× faster generation compared to diffusion models (Yu et al., 11 Jun 2024).
  • Direct tokenisation of structured data employs hybrid mappings, assigning fixed tokens to structural schema markers and frequency-optimized subwords to continuous or high-cardinality fields, yielding a high compression ratio (6.18:1) while preserving relational semantics (Karim et al., 3 Aug 2025).
  • In recommender systems, direct mapping from continuous multimodal embeddings into a unified, cross-domain discrete token space provides parameter sharing, robust zero-shot performance, and analytic guarantees on entropy and semantic consistency (Hou et al., 17 Nov 2025).

7. Practical Implications and Future Research Directions

Given the provable intractability of direct tokenisation under conventional objectives:

  • All known efficient algorithms (BPE, UnigramLM, hybrid pipelines) are necessarily greedy, approximate, or heuristic (Whittington et al., 19 Dec 2024, Kastreva et al., 19 Nov 2025).
  • Research increasingly focuses on constant-factor approximation schemes and classes of inputs (e.g., bounded repeat patterns, bounded token set sizes) where direct tokenisation may admit tractable relaxations (Kastreva et al., 19 Nov 2025).
  • In practice, the tradeoff between linguistic alignment, information efficiency, and computational feasibility dominates model selection and tokenizer construction.
  • Promising areas for further exploration include adaptive or information-driven direct segmentation (as in ByteSpan), joint optimization of tokenisers with model objectives, and extension to streaming, modular, or distributed data settings (Goriely et al., 23 Jun 2025, Hou et al., 17 Nov 2025).
  • Modalities requiring direct symbolic reasoning, formal manipulation, or atomically-structured representations (e.g., code, mathematics, certain program synthesis tasks) benefit disproportionately from direct or atomic tokenisation (Zhang et al., 20 May 2025).

Direct tokenisation thus describes a spectrum of globally-optimized vocabularization approaches foundational to modern neural modeling, with ongoing theoretical, algorithmic, and applied significance across textual, structured, and multimodal domains.
