Hierarchical Quantization & Tokenization

Updated 21 April 2026

Hierarchical quantization and tokenization are techniques that decompose complex data into multi-level discrete tokens using residual quantization methods.
They employ Euclidean and hyperbolic geometries, adaptive gating, and parallel streams to capture both coarse and fine-grained structures.
Applications span language, vision, speech, and neuroscience, enhancing modeling efficiency, interpretability, and scalability.

Hierarchical quantization and tokenization encompass a family of techniques that map complex, high-dimensional, or structured data into sequences of discrete codes (tokens) across multiple semantic or structural levels. Central to this paradigm are multi-stage quantization methods—most notably Residual Vector Quantization (RVQ) and its geometric or task-adaptive extensions—which are foundational for efficient, interpretable, and information-preserving symbolic representations across modalities including language, graphs, audio, and neuroscience data.

1. Principles of Hierarchical Quantization

Hierarchical quantization decomposes a data point into successive approximations by quantizing the residual error at each level. Formally, given an initial vector $z \in \mathbb{R}^d$, a stack of $L$ quantization codebooks $\{\mathcal{C}_\ell\}_{\ell=1}^L$ is used, with the residual initialized as $r_0 = z$. At each stage $\ell$, the quantizer selects the codeword $\bm{e}_{\ell,c_\ell} \in \mathcal{C}_\ell$ that best reconstructs the current residual $r_{\ell-1}$ and subtracts it:

$$c_\ell = \arg\min_{i}\left\| r_{\ell-1} - \bm{e}_{\ell,i} \right\|^2, \quad r_\ell = r_{\ell-1} - \bm{e}_{\ell,c_\ell}$$

The output is a sequence of tokens $[c_1, c_2, \dots, c_L]$ whose codewords sum to approximate $z$:

$$\hat{z} = \sum_{\ell=1}^L \bm{e}_{\ell,c_\ell}$$

Unrolling the recursion gives $z - \hat{z} = r_L$, so the reconstruction error is exactly the final residual. Each stage refines the representation, capturing successively finer semantic or structural distinctions (Piękos et al., 18 May 2025, Wang et al., 2024, Pang et al., 31 Dec 2025).
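As a small worked illustration of the residual recursion (the scalar codebooks here are made up for the example and are not taken from any cited paper), take $d = 1$, $z = 2.7$, $\mathcal{C}_1 = \{0, 2, 4\}$, and $\mathcal{C}_2 = \{-0.5, 0, 0.5\}$, starting from the residual $r_0 = z$:

$$
\begin{aligned}
\ell = 1:&\quad \bm{e}_{1,c_1} = 2, & r_1 &= 2.7 - 2 = 0.7 \\
\ell = 2:&\quad \bm{e}_{2,c_2} = 0.5, & r_2 &= 0.7 - 0.5 = 0.2 \\
&\quad \hat{z} = 2 + 0.5 = 2.5, & z - \hat{z} &= 0.2 = r_2
\end{aligned}
$$

The first level picks the coarse bin and the second level corrects within it; the remaining error equals the last residual.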

2. Core Hierarchical Tokenization Architectures

a. Euclidean and Hyperbolic RVQ

Standard RVQ operates in Euclidean space and forms the backbone of numerous tokenizers for language, vision, speech, and graphs. For inherently hierarchical data (e.g., taxonomies), however, encoding in Euclidean space distorts structure: the volume of a Euclidean ball grows only polynomially with its radius, whereas the number of nodes in a tree grows exponentially with depth, so there is too little room to embed tree-like expansions faithfully.

Hyperbolic Residual Quantization (HRQ) generalizes RVQ to hyperbolic space $\mathbb{H}^d$, using the hyperbolic distance $d_{\mathbb{H}}$, Möbius addition for codeword aggregation, and logarithmic/exponential maps for residual computation. This imparts an exponential volume growth prior, yielding topologically faithful, branch-wise token stratification (Piękos et al., 18 May 2025).
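To make the hyperbolic operations concrete, here is a minimal sketch of one quantization step on the Poincaré ball. It assumes curvature $-1$ and computes the residual by Möbius subtraction; both are illustrative choices, and the function names are hypothetical. The exact HRQ formulation in the cited paper may differ (for instance, in how logarithmic and exponential maps are used).

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def neg(x):
    return [-a for a in x]

def mobius_add(x, y):
    """Möbius addition on the Poincaré ball (curvature -1 assumed)."""
    xy, xx, yy = dot(x, y), dot(x, x), dot(y, y)
    denom = 1 + 2 * xy + xx * yy
    return [((1 + 2 * xy + yy) * a + (1 - xx) * b) / denom
            for a, b in zip(x, y)]

def poincare_dist(x, y):
    """Hyperbolic distance: 2 * artanh(|(-x) (+) y|)."""
    diff = mobius_add(neg(x), y)
    norm = math.sqrt(dot(diff, diff))
    # Clamp to stay strictly inside the unit ball for numerical safety.
    return 2 * math.atanh(min(norm, 1 - 1e-12))

def hyperbolic_quantize_step(residual, codebook):
    """Pick the hyperbolically nearest codeword, then Möbius-subtract it.

    Assumption: the new residual is (-e) (+) r, so that e (+) residual
    recovers r by the left-cancellation law of Möbius addition.
    """
    idx = min(range(len(codebook)),
              key=lambda i: poincare_dist(residual, codebook[i]))
    new_residual = mobius_add(neg(codebook[idx]), residual)
    return idx, new_residual

# Hypothetical 2-D codebook; all points lie inside the unit ball.
codebook = [[0.0, 0.0], [0.4, 0.1], [-0.3, 0.6]]
idx, res = hyperbolic_quantize_step([0.5, 0.2], codebook)
```

Because Möbius addition is not commutative, the order of operands matters: the sketch subtracts on the left so that adding the chosen codeword back on the left undoes the step.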

b. Task-Adaptive Gated Multi-Scale RVQ

QUIET introduces a multiple-scale RVQ stack with a frozen encoder. An adaptive gating layer (MLP+softmax) modulates the contribution from each quantization scale per-task, allowing for efficient downstream specialization with negligible parameter overhead (Xiang et al., 14 Oct 2025).
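The gating idea can be sketched as a softmax-weighted sum over per-scale codewords. This is a simplified illustration and not QUIET's actual implementation: the gate logits are passed in directly (in the described design they would come from a small MLP), and the names `gated_sum` and `level_vectors` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_sum(level_vectors, gate_logits):
    """Weight each quantization scale's vector by its gate and sum.

    level_vectors: one selected codeword per scale (same dimension).
    gate_logits:   one task-specific logit per scale (assumed given).
    """
    weights = softmax(gate_logits)
    dim = len(level_vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, level_vectors))
            for k in range(dim)]

# Hypothetical codewords from three scales of a frozen tokenizer.
levels = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
coarse_task = gated_sum(levels, [2.0, 0.0, 0.0])   # favors the first scale
uniform_task = gated_sum(levels, [0.0, 0.0, 0.0])  # equal weight per scale
```

Only the gate logits change between tasks in this sketch, which mirrors why such a layer adds very few parameters on top of a frozen encoder and codebooks.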

c. Factorized, Parallel Hierarchies

In speech and music, token hierarchies may be parallel rather than sequential. HAC factorizes tokens into parallel streams corresponding to different linguistic levels (acoustic, phonetic, lexical), with each stream produced by a dedicated quantizer and optional downstream Transformer for context (Khurana et al., 18 Jun 2025). HAFM combines a coarse-to-fine codebook stack for acoustic tokens with a parallel semantic token stream, enabling multi-rate, semantically aligned representations (Zhu et al., 10 Apr 2026).

d. Discrete Tokenization for Structured Data

GQT applies RVQ to precomputed graph node embeddings, producing a discrete multi-token address per node. By training the tokenizer via self-supervision and then freezing, tokens represent stable, transferable graph features (Wang et al., 2024). HST applies a hierarchical and feedback-refined VQ-VAE to spatio-temporal state and transition encodings of brain fMRI time series, yielding interpretable "state" and "transition" tokens (Yang et al., 28 Jun 2025).

3. Hierarchical Tokenization Algorithms and Pseudocode

Across domains, the fundamental algorithm can be summarized as follows:

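Initialize: set the residual $r_0 = z$.
Quantize: for each level $\ell = 1, \dots, L$, select $c_\ell = \arg\min_{i}\left\| r_{\ell-1} - \bm{e}_{\ell,i} \right\|^2$.
Update: set $r_\ell = r_{\ell-1} - \bm{e}_{\ell,c_\ell}$ and continue to the next level.
Output: return the tokens $[c_1, c_2, \dots, c_L]$ and the reconstruction $\hat{z} = \sum_{\ell=1}^L \bm{e}_{\ell,c_\ell}$.

For HRQ, the metric and operations are replaced by their hyperbolic counterparts (Möbius addition/subtraction, logarithmic mapping) (Piękos et al., 18 May 2025). In dual- or multi-stream cases, different encoders and quantizers operate in parallel, optionally guided by knowledge distillation or contrastive objectives (Khurana et al., 18 Jun 2025, Pang et al., 31 Dec 2025).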
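The Euclidean residual loop from this section can be sketched in plain Python as follows. This is a minimal illustration with fixed, hand-written codebooks (training the codebooks is out of scope), and the function names are hypothetical. Note that the tokens here are 0-indexed list positions.

```python
def nearest(residual, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sqdist(e):
        return sum((a - b) ** 2 for a, b in zip(residual, e))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))

def rvq_encode(z, codebooks):
    """Return one token per level plus the final residual r_L."""
    residual = list(z)  # r_0 = z
    tokens = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        tokens.append(idx)
        # r_l = r_{l-1} - e_{l, c_l}
        residual = [a - b for a, b in zip(residual, codebook[idx])]
    return tokens, residual

def rvq_decode(tokens, codebooks):
    """Sum the selected codewords to reconstruct z_hat."""
    z_hat = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        z_hat = [a + b for a, b in zip(z_hat, codebook[idx])]
    return z_hat

# Hypothetical two-level codebooks: coarse positions, then fine offsets.
codebooks = [
    [[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]],
    [[-0.5, 0.0], [0.0, 0.0], [0.5, 0.5]],
]
tokens, r_final = rvq_encode([2.6, 2.4], codebooks)
print(tokens)                         # [1, 2]
print(rvq_decode(tokens, codebooks))  # [2.5, 2.5]
```

The greedy per-level search keeps encoding cost linear in the number of levels; it does not guarantee the globally best combination of codewords.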

4. Hierarchical Token Semantics and Interpretability

In all hierarchical quantization systems, each token sequence $[c_1, c_2, \dots, c_L]$ is interpretable as a path or address in a conceptual code-tree:

Initial levels: Coarse global/topological or semantic clusters (e.g., root of taxonomy, macro-acoustic/semantic cues, community membership in graphs).
Subsequent levels: Finer subdivisions, refining within the parent context (e.g., word or phoneme within an utterance, fine structure in brain dynamics).
Parallel hierarchies: Assign orthogonal axes for distinct semantic facets (e.g., acoustic/phonetic/lexical).

Notably, HRQ in hyperbolic space arranges the most generic (parent) tokens near the ball center and branches outward, inducing an exponential volume prior that matches data with latent hierarchies (Piękos et al., 18 May 2025). In HiGR and LETTER, prefix tokens can be manipulated for controlled generation or diversity, thanks to prefix-aligned contrastive losses (Pang et al., 31 Dec 2025, Wang et al., 2024).
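The "token sequence as tree address" view can be illustrated by grouping items on their token prefixes. The item names and token paths below are invented for the example, and the helper names are hypothetical.

```python
from collections import defaultdict

def group_by_prefix(token_paths, depth):
    """Group item names by the first `depth` tokens of their code path."""
    groups = defaultdict(list)
    for name, tokens in token_paths.items():
        groups[tuple(tokens[:depth])].append(name)
    return dict(groups)

def shared_prefix_length(a, b):
    """Number of leading tokens two code paths have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical three-level token paths.
paths = {
    "dog": [0, 1, 2],
    "wolf": [0, 1, 3],
    "cat": [0, 2, 0],
    "oak": [1, 0, 0],
}
print(group_by_prefix(paths, 1))  # {(0,): ['dog', 'wolf', 'cat'], (1,): ['oak']}
print(group_by_prefix(paths, 2))  # {(0, 1): ['dog', 'wolf'], (0, 2): ['cat'], (1, 0): ['oak']}
print(shared_prefix_length(paths["dog"], paths["wolf"]))  # 2
```

A longer shared prefix means two items sit deeper in the same branch, which is what makes prefix-level manipulation a handle on coarse semantics.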

5. Training Objectives and Loss Formulations

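Reconstruction: Standard MSE or cross-entropy between the reconstructed and input signals (Piękos et al., 18 May 2025, Khurana et al., 18 Jun 2025).

Commitment: Encourages the encoder output to stay close to the selected codevectors, typically via stop-gradient variants. A typical form, in the notation of Section 1, is $\mathcal{L}_{\text{commit}} = \beta \sum_{\ell=1}^{L} \left\| r_{\ell-1} - \operatorname{sg}\left[\bm{e}_{\ell,c_\ell}\right] \right\|^2$, where $\operatorname{sg}[\cdot]$ denotes the stop-gradient operator and $\beta$ is a weighting coefficient.

Contrastive/Alignment/Self-supervision: Enforces semantic separation, collaborative filtering consistency, or disentanglement of parallel branches (Wang et al., 2024, Pang et al., 31 Dec 2025, Khurana et al., 18 Jun 2025).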

Knowledge Distillation: For factorized codecs, phonetic and lexical token streams can be supervised by matching pre-existing encoders (e.g., HuBERT, LaBSE in HAC (Khurana et al., 18 Jun 2025)).

Diversity regularization: Penalizes dominance by a small subset of codes, using cluster-based spread terms (Wang et al., 2024).

Hierarchy-specific regularization: HRQ ensures that code assignments reflect underlying latent tree-shapes via the geometry of the manifold (Piękos et al., 18 May 2025).
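Two of the quantities above are easy to compute by hand, sketched here in plain Python. The first is the forward value of a commitment term, the squared distance between each level's input residual and its selected codeword, scaled by a weight `beta`; the stop-gradient only changes which side receives gradients in an autodiff framework, so it does not appear in the value itself. The second is codebook-usage perplexity, a common diagnostic for code collapse that is offered here as an assumption and is not the specific diversity regularizer of the cited work. The default `beta=0.25` is an illustrative choice.

```python
import math
from collections import Counter

def commitment_loss(residuals, codewords, beta=0.25):
    """beta * sum over levels of ||r_{l-1} - e_{l,c_l}||^2 (forward value only)."""
    total = 0.0
    for r, e in zip(residuals, codewords):
        total += sum((a - b) ** 2 for a, b in zip(r, e))
    return beta * total

def usage_perplexity(token_ids):
    """exp(entropy) of the empirical code distribution.

    Equals the codebook size when codes are used uniformly and
    approaches 1 when a single code dominates.
    """
    counts = Counter(token_ids)
    n = len(token_ids)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return math.exp(entropy)

# Hypothetical token assignments at one level for eight inputs.
balanced = [0, 1, 2, 3, 0, 1, 2, 3]
collapsed = [0, 0, 0, 0, 0, 0, 0, 1]
```

Tracking perplexity per level during training gives a quick signal for whether a diversity term is needed at all.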

6. Empirical Findings and Applications

The following results illustrate typical benefits:

Task	Hierarchical Quantization Scheme	Reported Improvement	Reference
WordNet hierarchy modeling	HRQ vs. Euclidean RQ	+20% Recall@10	(Piękos et al., 18 May 2025)
Speech token disentanglement	HAC (acoustic/phonetic/lexical)	Highest PNMI, F1 in phonetic/lexical, strong SI-SDR	(Khurana et al., 18 Jun 2025)
Musical accompaniment synthesis	HAFM (semantic+acoustic hierarchy)	FAD = 2.08, outperforming retrieval/SoTA	(Zhu et al., 10 Apr 2026)
Graph node classification	QUIET (multi-scale RVQ+gate)	+3.7% ACC on Corafull, +1.75% MRR on Pubmed	(Xiang et al., 14 Oct 2025)
Recommendation (item ID generation)	HiGR/LETTER (residual quantization)	+10% offline, +1.22% watch time (HiGR)	(Pang et al., 31 Dec 2025)
fMRI state identification	HST (hierarchical VQ-VAE)	Highest accuracy/lowest MSE vs. VAE/VQ-VAE	(Yang et al., 28 Jun 2025)
Language modeling (tokenization)	Hierarchical BPE patching	Best BPB and parameter/memory efficiency	(Dolga et al., 17 Oct 2025)

These gains are attributed to the token hierarchy better matching the data's natural structure, enabling compact codebooks, greater interpretability, task adaptivity (via per-task gating or prefix manipulation), diversity, and downstream model throughput.

7. Perspectives and Theoretical Considerations

Hierarchical quantization enables efficient symbolic summarization with the following salient properties:

Expressivity: Multilevel codebooks model data at different granularities, supporting compact yet information-rich representations.
Geometric alignment: Proper choice of ambient geometry (Euclidean vs. hyperbolic) can substantially improve structure preservation, particularly for trees or graphs (Piękos et al., 18 May 2025).
Disentanglement and Factorization: Parallel codebooks support semantic disentanglement (e.g., speech: acoustic/phonetic/lexical), which is directly measured by mutual information or ablation (Khurana et al., 18 Jun 2025).
Efficiency and Memory: Hierarchical tokenization reduces data to a small number of integers per instance, facilitating large-scale modeling with massive reductions in memory and compute requirements (Wang et al., 2024, Dolga et al., 17 Oct 2025).
Adaptivity and Control: Self-weighted gates (QUIET), prefix-diversity control (HiGR, LETTER), and token modulation allow the representation to be dynamically adapted to task constraints or diversity requirements.
Theoretical optimality: For data with tree-like growth, hyperbolic quantization is near-isometric and volume-matching, mitigating the bottleneck of Euclidean embeddings (Piękos et al., 18 May 2025).
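The efficiency property can be made concrete with a little arithmetic. The sketch below compares the storage of a token address against a dense float vector; the specific sizes (4 levels, 256 codes per level, 128 dimensions) are illustrative assumptions and not figures from the cited papers, and the shared codebooks themselves are not counted.

```python
import math

def code_capacity(num_levels, codebook_size):
    """Number of distinct token sequences: K ** L."""
    return codebook_size ** num_levels

def bits_per_item(num_levels, codebook_size):
    """Bits needed to store one token sequence: L * ceil(log2 K)."""
    return num_levels * math.ceil(math.log2(codebook_size))

def float_vector_bits(dim, bits_per_float=32):
    """Bits needed to store a dense vector of `dim` floats."""
    return dim * bits_per_float

levels, codes, dim = 4, 256, 128
print(code_capacity(levels, codes))  # 4294967296
print(bits_per_item(levels, codes))  # 32
print(float_vector_bits(dim) // bits_per_item(levels, codes))  # 128
```

Capacity grows exponentially in the number of levels while storage grows only linearly, which is the source of the compactness claimed for multilevel codebooks.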

A plausible implication is that hierarchical quantization and tokenization schemes will continue to underpin next-generation foundation models in domains where high-level semantics and multi-scale structure must be efficiently and interpretably captured. Future work is expected to further refine geometric alignment, incorporate multimodal self-supervision and distillation, and extend dynamic gating or control to a wider range of domains.