
Level of Semantics Tokenization (LoST)

Updated 23 March 2026
  • Level of Semantics Tokenization (LoST) is a framework that organizes tokens by a coarse-to-fine semantic hierarchy, explicitly prioritizing high-level attributes.
  • LoST employs independent semantic codebooks and specialized losses (e.g., RIDA, PAT) to ensure tokens reflect structured semantic information for efficient generative modeling.
  • Empirical results demonstrate that LoST improves performance metrics such as Recall, mIoU, and FID across domains like 3D modeling, recommendation systems, vision, time series, and biomedicine.

Level of Semantics Tokenization (LoST) is a framework and emerging paradigm for tokenizing diverse data modalities (text, vision, time series, 3D, recommendation, biomolecular contexts) according to a hierarchy of semantic granularity, as opposed to purely statistical or geometric organization. LoST aims to produce discrete token sequences where token positions and contents explicitly reflect semantic priorities—such as category-defining structure or high-level attributes—so that both autoregressive and masked generative models can exploit intrinsic semantic hierarchies during modeling and downstream tasks.

1. Foundations and Motivation

Traditional tokenization techniques—such as byte pair encoding (BPE) for language, perceptual vector quantization for vision, or level-of-detail (LoD) strategies for 3D—are effective for compression but often obscure or flatten hierarchical semantics. Such approaches produce tokens that either (a) lack semantic interpretability, (b) distribute semantic and instance-specific information non-monotonically, or (c) bloat token sequences with redundant or non-informative tokens, especially at coarse levels. This undermines both model efficiency and early-prefix interpretability in autoregressive settings.

LoST addresses these limitations by enforcing a structured, semantics-first tokenization: higher-level or more salient semantic factors are encoded into earlier tokens, while successive tokens encode progressively finer or instance-specific details. This approach supports "any-prefix" decoding, early semantic fidelity, efficient modeling, enhanced retrieval utility, and improved performance on generative and discriminative downstream tasks. The LoST design principle extends across multiple domains, including language (Berglund et al., 2023), recommendation (Wei et al., 27 Nov 2025), scientific LLMs (Zhuang et al., 27 Oct 2025), time series (Mathisen et al., 2024), vision (Zhang et al., 2024), and 3D shape modeling (Dutt et al., 18 Mar 2026).

2. Formal Definitions and Taxonomy of Semantic Levels

LoST formalism specifies and exploits multiple levels of semantic granularity in token representations, tailored to the target modality:

  • Language (BPE hierarchy):
    • LoST₀: Raw character stream (“surface form”).
    • LoST₁: Subword/character-level tokens.
    • LoST₂: Final BPE-merged tokens (subwords, words).
    • LoST₃: Full binary merge history per token, capturing composition provenance (Berglund et al., 2023).
  • 3D Shapes:
    • Tokens ordered by “semantic salience.” Early tokens reconstruct principal part structure; subsequent tokens provide category, topology, and style, followed by instance-level refinement (Dutt et al., 18 Mar 2026).
  • Recommendation Systems:
    • CoFiRec decomposes each item into a fixed K-level hierarchy (e.g., category, title, description, collaborative filtering signal), where each level is independently quantized and the token sequence preserves coarse-to-fine semantics (Wei et al., 27 Nov 2025).
  • Multilevel Context in Biomedicine:
    • LoST is applied by replacing sequence-level tokens with high-level, expert-derived descriptive context (e.g., GO annotations, Pfam descriptions). These act as “semantically loaded” context tokens that bypass the flat, local tokenization of biopolymers (Zhuang et al., 27 Oct 2025).
  • Vision:
    • Feature Pyramid Tokenization (PAT) forms a joint multi-resolution codebook hierarchy over VLM feature pyramids, with each code forming a “meta-semantic” token at a specific abstraction scale (Zhang et al., 2024).
  • Time Series:
    • NC-VQVAE’s codebook tokens are shaped through self-supervised learning to represent both low-level waveform and high-level dynamical characteristics, imbuing discrete tokens with hierarchical temporal semantics (Mathisen et al., 2024).
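The four language levels listed above can be illustrated with a toy BPE encoder that keeps, for every final token, the binary tree of merges that produced it. This is a minimal sketch, not the construction from Berglund et al.: the merge table `toy_merges` and the helper names are hypothetical. Roughly, `text` corresponds to LoST₀, `list(text)` to LoST₁, `tokens` to LoST₂, and `history` to LoST₃.

```python
from typing import List, Tuple, Union

# A merge tree is either a single character or a pair of sub-trees.
MergeTree = Union[str, Tuple["MergeTree", "MergeTree"]]


def flatten(tree: MergeTree) -> str:
    """Return the surface string covered by a merge tree."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    return flatten(left) + flatten(right)


def bpe_with_history(text: str, merges: List[Tuple[str, str]]) -> List[MergeTree]:
    """Apply merges in priority order, keeping each token's merge tree."""
    trees: List[MergeTree] = list(text)  # start from single characters
    for left, right in merges:
        out: List[MergeTree] = []
        i = 0
        while i < len(trees):
            is_pair = (
                i + 1 < len(trees)
                and flatten(trees[i]) == left
                and flatten(trees[i + 1]) == right
            )
            if is_pair:
                out.append((trees[i], trees[i + 1]))  # record the binary merge
                i += 2
            else:
                out.append(trees[i])
                i += 1
        trees = out
    return trees


text = "lower"
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]  # hypothetical merge table
history = bpe_with_history(text, toy_merges)         # merge tree per token
tokens = [flatten(tree) for tree in history]          # final BPE tokens
```

Here `tokens` is `["low", "er"]`, while `history` additionally records that "low" was built as ("l" + "o") + "w", which is the provenance information the highest level exposes.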

3. Methodological Instantiations

3.1 Autoregressive LoST (3D, Recommender Systems)

In 3D generation (Dutt et al., 18 Mar 2026), the LoST tokenizer is implemented as a ViT-based encoder with causal masking on register tokens and nested dropout. Because the decoder is trained to reconstruct the full shape from every k-token prefix, the first k tokens are pushed to encode the principal semantics. The model also employs a semantic alignment loss (RIDA) to align the relational structure of the 3D latent space with a 2D DINO feature space.
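The two mechanisms named here, causal masking over register tokens and nested dropout, can be sketched independently of any particular network. The snippet below is an illustration under stated assumptions: the uniform prefix-length distribution and both function names are hypothetical, and how registers attend to patch tokens is not covered.

```python
import random
from typing import List


def causal_register_mask(num_registers: int) -> List[List[bool]]:
    """mask[i][j] is True when register i may attend to register j (j <= i)."""
    return [[j <= i for j in range(num_registers)] for i in range(num_registers)]


def nested_dropout_keep(num_registers: int, rng: random.Random) -> List[bool]:
    """Sample a prefix length k and keep only the first k registers.

    The uniform choice of k is an assumption made for this sketch.
    """
    k = rng.randint(1, num_registers)  # inclusive on both ends
    return [i < k for i in range(num_registers)]


rng = random.Random(0)
mask = causal_register_mask(4)
keep = nested_dropout_keep(4, rng)
```

Because later registers are always dropped before earlier ones, a decoder trained on such truncated sequences cannot rely on late tokens for essential content, which is what makes every prefix usable.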

In recommendation (Wei et al., 27 Nov 2025), the CoFiRec tokenizer uses per-level semantic encoders and an independent codebook for each item feature. Generative modeling proceeds as a coarse-to-fine autoregressive process that mirrors the semantic hierarchy.
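A minimal sketch of per-level quantization with independent codebooks follows. The level names come from the example hierarchy given earlier in this article; the 2-D embeddings, the codebook contents, and the function names are made up for illustration and are not CoFiRec's actual encoders or parameters.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: List[Vector]) -> int:
    """Index of the codebook entry with the smallest squared Euclidean distance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def tokenize_item(level_embeddings: Dict[str, Vector],
                  codebooks: Dict[str, List[Vector]],
                  level_order: List[str]) -> List[int]:
    """Quantize each level with its own codebook, emitting coarse levels first."""
    return [nearest_code(level_embeddings[lvl], codebooks[lvl]) for lvl in level_order]


# Hypothetical embeddings and codebooks, for illustration only.
level_order = ["category", "title", "description", "cf"]
codebooks = {
    "category": [[0.0, 0.0], [1.0, 1.0]],
    "title": [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
    "description": [[-1.0, 0.0], [1.0, 0.0]],
    "cf": [[0.0, -1.0], [0.0, 1.0]],
}
item = {
    "category": [0.9, 0.8],
    "title": [0.4, 0.6],
    "description": [0.7, 0.1],
    "cf": [0.1, 0.9],
}
token_ids = tokenize_item(item, codebooks, level_order)
```

Since each level has its own codebook, the same integer at two positions refers to unrelated codes, and an autoregressive model that predicts `token_ids` left to right commits to the category before any finer attribute.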

3.2 Contextual and Hierarchical Tokenization

In scientific LLMs (Zhuang et al., 27 Oct 2025), LoST is achieved by transforming raw biomolecular sequences into structured, high-level tokens using external tools (InterProScan, homology search), assembling context-only representations that compactly capture biological meaning and reasoning targets.

In vision (Zhang et al., 2024), PAT alternates hard and soft (spherical) codebook assignment on feature pyramids from VLMs (e.g., CLIP, EVA-CLIP), correlating tokens with multiple levels of semantic abstraction. Parallel pixel and semantic branches, loosely coupled, encourage both perceptual and semantic fidelity.
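The difference between hard and soft assignment on the unit sphere can be sketched as follows. This is an assumption-laden illustration, not PAT's implementation: cosine similarity as the score, a softmax with a `temperature` parameter for the soft case, and all function names are choices made for this example, and the schedule by which the two modes alternate is not modeled.

```python
import math
from typing import List, Sequence


def unit(v: Sequence[float]) -> List[float]:
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_scores(feature: Sequence[float],
                  codebook: List[Sequence[float]]) -> List[float]:
    """Cosine similarity between one feature and every codebook entry."""
    f = unit(feature)
    return [sum(a * b for a, b in zip(f, unit(code))) for code in codebook]


def hard_assign(feature: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Hard assignment: index of the most similar code."""
    scores = cosine_scores(feature, codebook)
    return max(range(len(scores)), key=scores.__getitem__)


def soft_assign(feature: Sequence[float], codebook: List[Sequence[float]],
                temperature: float = 0.1) -> List[float]:
    """Soft assignment: softmax over temperature-scaled cosine similarities."""
    scaled = [s / temperature for s in cosine_scores(feature, codebook)]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy codes
feature = [2.0, 0.5]                              # toy feature vector
index = hard_assign(feature, codebook)
weights = soft_assign(feature, codebook)
```

The hard index yields a discrete token, while the soft weights keep the assignment differentiable and let neighboring codes share responsibility for a feature.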

3.3 Vector Quantization with Self-Supervision (Time Series)

NC-VQVAE (Mathisen et al., 2024) adds a self-supervised branch to VQ-VAE, optimizing tokens to simultaneously capture low-level shape and high-level dynamics. Losses such as Barlow Twins or VIbCReg guide codebooks to represent semantic variance beyond reconstruction fidelity.
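For readers unfamiliar with Barlow Twins, its objective can be written in a few lines. The sketch below shows the generic loss on two batches of embeddings from augmented views; how NC-VQVAE wires it into the VQ-VAE branch is not shown, and the default weight `lam` is an illustrative value rather than one taken from that paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def standardize(z: Matrix) -> Matrix:
    """Zero-mean, unit-variance per feature dimension across the batch."""
    n, d = len(z), len(z[0])
    columns = []
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n) + 1e-8
        columns.append([(x - mean) / std for x in col])
    return [[columns[j][i] for j in range(d)] for i in range(n)]


def barlow_twins_loss(z_a: Matrix, z_b: Matrix, lam: float = 5e-3) -> float:
    """Push the cross-correlation matrix of two views toward the identity."""
    a, b = standardize(z_a), standardize(z_b)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2   # invariance term
            else:
                loss += lam * c_ij ** 2     # redundancy-reduction term
    return loss


view = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
flipped = [[-x for x in row] for row in view]
loss_same = barlow_twins_loss(view, view)        # identical views
loss_flipped = barlow_twins_loss(view, flipped)  # anti-correlated views
```

Identical, decorrelated views give a loss near zero, while sign-flipped views are penalized on every diagonal entry.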

4. Training Objectives and Loss Functions

LoST implementations adopt loss structures that directly promote semantics-aligned tokenization:

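  • 3D Shapes (RIDA loss):
    • Reconstruction from every token prefix, combined with a semantic alignment loss that matches the relational structure of the 3D latent space to a 2D DINO feature space (Dutt et al., 18 Mar 2026).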
  • Vision (PAT loss):
    • Combined VQ losses at each feature pyramid level, spatial-alignment (CRF/TV) regularization, pixel-level reconstruction, and segmentation cross-entropy (Zhang et al., 2024).
  • Recommendation (CoFiRec tokenizer):
    • Reconstruction loss on each level embedding, codebook commitment, and a ranking-guided AR loss in generation (Wei et al., 27 Nov 2025).
  • Time Series:
    • VQ commitment and reconstruction losses, self-supervised invariance and redundancy reduction (e.g., Barlow Twins, VIbCReg), plus regularization to disentangle codebook tokens (Mathisen et al., 2024).
  • Contextual Biomedicine:
    • No explicit loss function; the tokenization pipeline is structured to maximize information density and semantic alignment by context selection (Zhuang et al., 27 Oct 2025).
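Several of the objectives above build on the standard VQ-VAE loss, which is reproduced here for reference. The notation is introduced for this note and does not come from the cited papers: x is the input, z_e = E(x) the encoder output, e the nearest codebook entry, z_q the quantized latent (equal to e), D the decoder, sg[·] the stop-gradient operator, and β the commitment weight.

```latex
\mathcal{L}_{\mathrm{VQ}}
  = \underbrace{\lVert x - D(z_q) \rVert_2^2}_{\text{reconstruction}}
  + \underbrace{\lVert \operatorname{sg}[z_e] - e \rVert_2^2}_{\text{codebook}}
  + \beta \, \underbrace{\lVert z_e - \operatorname{sg}[e] \rVert_2^2}_{\text{commitment}}
```

The methods surveyed here extend this base form in different ways, for example by summing it over pyramid levels or by adding a self-supervised term, so the exact weighting should be taken from each paper.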

5. Empirical Performance and Semantic Efficiency

LoST-based models report efficiency and semantic performance gains over traditional, flat, or LoD-based tokenizers in each of the domains surveyed:

  • 3D: LoST achieves lower Chamfer Distance and FID, and higher DINO semantic similarity than octree-based or mesh-based tokenizations, despite using only 0.1%–10% as many tokens. Semantic retrieval performance (Recall@3, mAP@3, Jaccard@3) improves by over 10 percentage points using RIDA-aligned tokens (Dutt et al., 18 Mar 2026).
  • Recommendation: CoFiRec outperforms TIGER, P5, and other baselines by up to 90% (relative) on Recall@5, Recall@10, and NDCG@5 across multiple benchmarks, with ablations confirming the necessity of hierarchy-respecting ordering (Wei et al., 27 Nov 2025).
  • Contextual Biomedicine: Context-only input modes in Sci-LLMs yield up to 40 points higher functional QA score (LLM-Score) relative to sequence-only, with further advantages in EC number prediction and DNA mutation classification (Zhuang et al., 27 Oct 2025).
  • Vision: PAT achieves mIoU improvements of +0.78/+1.6 over baseline SAN systems (CLIP/EVA-CLIP) in open-vocabulary segmentation, with ablations attributing this gain to multi-level codebook hierarchy and cross-level fusion (Zhang et al., 2024).
  • Time Series: NC-VQVAE attains higher probe accuracy and Inception Score, and lower FID, on 12 to 13 of 13 UCR benchmarks, indicating that self-supervised LoST tokens support richer generative modeling (Mathisen et al., 2024).
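The ranking metrics quoted for recommendation, Recall@K and NDCG@K, follow standard definitions with binary relevance. A minimal sketch is given below; the item identifiers are hypothetical and the numbers are not results from any cited paper.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)  # position 0 gets discount log2(2) = 1
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c", "d", "e"]  # hypothetical model ranking
relevant = {"c"}                    # single held-out next item
r_at_2 = recall_at_k(ranked, relevant, 2)
r_at_5 = recall_at_k(ranked, relevant, 5)
n_at_5 = ndcg_at_k(ranked, relevant, 5)
```

With a single relevant item at rank 3, Recall@2 is 0, Recall@5 is 1, and NDCG@5 is 1/log2(4) = 0.5, which shows how NDCG additionally rewards placing the hit earlier in the list.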

6. Practical Implementation and Architectural Strategies

A range of architectural and training principles underpin LoST frameworks:

  • Prefix usability: Causal masking and nested dropout (3D LoST) or levelwise AR generation (CoFiRec) ensure that token prefixes encode maximal semantic content for early-exit or anytime-use decoding.
  • Independent multi-codebooks and semantic branches: Separate codebooks for distinct semantic levels reduce code collisions and encourage disentangled representation (CoFiRec, PAT).
  • Cross-level fusion: TokenMixer or Transformer-based fusion merges information across scales or branches, enabling both pixel and semantic decoding in vision applications (Zhang et al., 2024).
  • Self-supervised regularization: Non-contrastive invariance and redundancy reduction promote codebook tokens capturing global temporal or modality-agnostic semantics (Mathisen et al., 2024).
  • RIDA and other semantic alignment methods: Latent alignment losses supervise tokens to preserve semantic proximity (not just input similarity) (Dutt et al., 18 Mar 2026).
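The idea of supervising semantic proximity rather than input similarity, as in the last item above, can be sketched as a generic relational alignment loss that compares pairwise similarity structure between a batch of latents and a batch of reference features. This is explicitly not the RIDA formulation, whose definition is not given in this article; the cosine similarity, the mean-squared penalty, and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(batch: List[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarities within one batch."""
    return [[cosine(u, v) for v in batch] for u in batch]


def relational_alignment_loss(latents: List[Sequence[float]],
                              reference: List[Sequence[float]]) -> float:
    """Mean squared difference between the two pairwise-similarity matrices."""
    s_lat = similarity_matrix(latents)
    s_ref = similarity_matrix(reference)
    n = len(latents)
    total = sum((s_lat[i][j] - s_ref[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)


reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stand-in for 2D features
rotated = [[0.0, 2.0], [-2.0, 0.0], [-2.0, 2.0]]   # same geometry, rotated and scaled
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # all latents identical
loss_aligned = relational_alignment_loss(rotated, reference)
loss_collapsed = relational_alignment_loss(collapsed, reference)
```

Because only within-batch similarities are compared, the latent and reference spaces may have different dimensionalities, and any rotation or rescaling of the latents that preserves their relative geometry incurs no penalty.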

7. Limitations, Open Challenges, and Future Directions

While LoST frameworks show significant advances, several limitations remain:

  • Representation constraints: Current implementations often rely on specific data encodings (e.g., triplanes for 3D); extending LoST to more general data representations (Gaussian splats, segmentation graphs) and multimodal scenarios is ongoing (Dutt et al., 18 Mar 2026).
  • Decoder complexity: Diffusion-based decoders (LoST for 3D) incur greater computational cost than pure AR models, motivating exploration of lighter-weight decoding strategies.
  • Adaptivity and token budget: The AR models typically fix the token sequence length; variable-length and EOS-token-augmented LoST variants are understudied but may further improve efficiency.
  • Semantic granularity selection: The optimal number and placement of semantic levels remain task- and modality-dependent; guidance for principled level selection is not fully established.
  • Hybrid reasoning and tool integration: Especially for scientific LLMs, dynamic context-generator pipelines and agent-based prompting can further amplify LoST’s semantic coverage (Zhuang et al., 27 Oct 2025).

LoST establishes a principled framework for semantically hierarchical tokenization, bridging the gap between low-level perceptual encoding and high-level semantic organization. It has been applied across language, vision, time series, and 3D data, with future work likely to explore adaptive, multimodal, and agent-oriented extensions.
