
Semantic-Guided Hierarchical Codebooks

Updated 2 January 2026
  • Semantic-guided hierarchical codebooks are unified discrete tokenization frameworks that factorize data into multi-level representations with distinct semantic granularity.
  • The approach employs stage-wise decoupling and hierarchical quantization to optimize both global semantic features and fine-grained details for high reconstruction fidelity.
  • Its applications across vision, language, recommendation, and audio demonstrate enhanced interpretability, efficiency, and controlled generative performance.

Semantic-Guided Hierarchical Codebooks (SemHiTok) define a unified, polynomial-capacity tokenization strategy that decomposes input data into discrete, interpretable representations across multiple semantic levels. This framework underlies a new class of tokenizers and generative models in vision, language, recommendation, audio, and design, yielding state-of-the-art trade-offs in expressivity, modularity, reconstruction fidelity, semantic alignment, and interpretability. The core innovation lies in explicitly factorizing discrete representations into hierarchically organized codebooks, each specializing in a particular semantic or structural granularity.

1. Formal Structure and Mathematical Framework

SemHiTok systems operate by factorizing the representation of modality-specific data (image, text, item, audio, CAD design) into a hierarchy of discrete latent codebooks. Each codebook is associated with a distinct semantic granularity, such as global content, object class, or local detail, and quantizes either the raw encoder feature or the residual left by previous quantization stages.

The fundamental two-level formulation, exemplified in image generation, decouples:

  • A semantic codebook $\mathcal{C}_s$ of size $n_1$ (encoding global content or high-level structure)
  • A detail/pixel codebook $\mathcal{C}_d$ of size $n_2$ (encoding residual, fine-scale information).

For input patch $I_i$ with encoder output $e_i \in \mathbb{R}^D$:

$$q_{i,s} = \arg\min_{k \in [1, n_1]} \|e_i - c_s^k\|_2^2$$

$$r_i = e_i - c_s^{q_{i,s}}$$

$$q_{i,d} = \arg\min_{j \in [1, n_2]} \|r_i - c_d^j\|_2^2$$

yielding the discrete index pair $(q_{i,s}, q_{i,d})$ per patch. The aggregate codebook capacity scales as $O(n_1 n_2)$, a polynomial expansion over the $O(N)$ capacity of a flat codebook of size $N$.

This construction generalizes to $L$-level hierarchies for text, recommendation, and design, composing the representation as a tuple $(y^{(1)}, \ldots, y^{(L)})$ with each $y^{(l)}$ drawn from codebook $\mathcal{C}^{(l)}$ and residuals quantized sequentially [2508.04618].
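
The two-stage assignment above admits a compact implementation. The following is a minimal sketch in PyTorch, assuming toy tensors for the encoder features and both codebooks; the function and variable names are illustrative rather than drawn from any cited implementation.

```python
import torch

def two_level_quantize(e, C_s, C_d):
    """Two-level residual quantization of patch features.

    e   : (P, D) encoder features, one row per patch
    C_s : (n1, D) semantic codebook
    C_d : (n2, D) detail codebook
    Returns the per-patch index pair (q_s, q_d) and the reconstruction.
    """
    # Stage 1: nearest semantic code for each patch feature.
    q_s = torch.cdist(e, C_s).argmin(dim=1)      # (P,)
    # Stage 2: quantize the residual with the detail codebook.
    r = e - C_s[q_s]                             # (P, D)
    q_d = torch.cdist(r, C_d).argmin(dim=1)      # (P,)
    recon = C_s[q_s] + C_d[q_d]                  # (P, D)
    return q_s, q_d, recon

# Toy usage: 16 patches, D = 32, n1 = 64 semantic codes, n2 = 256 detail codes.
e = torch.randn(16, 32)
q_s, q_d, recon = two_level_quantize(e, torch.randn(64, 32), torch.randn(256, 32))
```

The index pair per patch is exactly $(q_{i,s}, q_{i,d})$ from the equations above, and the joint capacity is $n_1 \times n_2$ at the cost of two small nearest-neighbour searches.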

2. Training Objectives and Hierarchical Decoupling

A central feature is the decoupled training of codebooks at different levels of the hierarchy:

Semantic Codebook Training

  • Employs a frozen semantic encoder (e.g., CLIP, SigLIP) to extract global features,
  • Trains the codebook via vector quantization (VQ), optimizing both a distillation (cosine similarity) loss to preserve semantic alignment and a VQ commitment loss,

$$L_{\mathrm{sem}} = 1 - \cos(z_{\mathrm{sem}}, \hat{z}_{\mathrm{sem}}) + \beta \cdot \mathrm{VQ\text{-}Loss}$$

  • Updates codebook centroids via exponential moving average (EMA).
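
A minimal sketch of this semantic-stage objective follows, assuming a frozen CLIP/SigLIP-style feature tensor and a learnable codebook; the EMA centroid update and straight-through details are deliberately omitted, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def semantic_codebook_loss(z_sem, codebook, beta=0.25):
    """Distillation + commitment loss for the semantic codebook.

    z_sem    : (N, D) features from a frozen semantic encoder (e.g. CLIP, SigLIP)
    codebook : (n1, D) learnable semantic codebook entries
    """
    # Hard VQ assignment to the nearest codebook entry.
    idx = torch.cdist(z_sem, codebook).argmin(dim=1)
    z_hat = codebook[idx]
    # Distillation term: keep quantized features cosine-aligned with the teacher.
    distill = (1.0 - F.cosine_similarity(z_sem, z_hat, dim=-1)).mean()
    # Commitment-style term pulling codes toward the (frozen) teacher features.
    commit = F.mse_loss(z_hat, z_sem.detach())
    return distill + beta * commit
```

In practice the codebook centroids would also be refreshed by the EMA rule mentioned above rather than purely by gradient descent; the sketch only illustrates the shape of $L_{\mathrm{sem}}$.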

Detail/Pixel Codebook Training

  • Conditioned on the semantic assignment, each spatial location selects the detail sub-codebook indexed by its semantic code,
  • Fine-grained detail is quantized via independent codebooks $\mathcal{C}_{\mathrm{pix}}^k$ for each semantic code $k$, promoting efficient coverage of intra-class texture variability (see the sketch after this list),
  • Loss combines $\ell_1$, perceptual (VGG), adversarial (GAN), and VQ commitment terms,

$$L_{\mathrm{rec}} = \|X - \hat{Y}\|_1 + \lambda_{\mathrm{commit}} + \lambda_{\mathrm{per}} + \lambda_{\mathrm{GAN}}$$

  • In all cases, optimization is staged: semantic codebook and encoder are frozen during pixel codebook training, preventing co-adaptation and "tug-of-war" phenomena (Chen et al., 9 Mar 2025).
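
The staging in this list can be illustrated with a short sketch: the semantic codebook is held fixed, and each residual is quantized by the sub-codebook indexed by its semantic code. Tensor shapes and names are assumptions for illustration only.

```python
import torch

def detail_quantize(residual, q_s, sub_codebooks):
    """Quantize residuals with a per-semantic-code detail sub-codebook.

    residual      : (P, D) residuals left by the frozen semantic stage
    q_s           : (P,)   semantic code index per patch (from the frozen codebook)
    sub_codebooks : (n1, n2, D) one detail codebook C_pix^k per semantic code k
    """
    q_d = torch.empty_like(q_s)
    quantized = torch.empty_like(residual)
    for k in q_s.unique():
        mask = q_s == k                        # patches assigned to semantic code k
        C_k = sub_codebooks[k]                 # (n2, D) sub-codebook for code k
        idx = torch.cdist(residual[mask], C_k).argmin(dim=1)
        q_d[mask] = idx
        quantized[mask] = C_k[idx]
    return q_d, quantized
```

During this stage only the detail sub-codebooks and the pixel decoder are trained, which is the staging that prevents the "tug-of-war" between semantic and pixel objectives described above.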

Hierarchical Supervision and Disentanglement

For label-rich modalities (e.g., recommendation, design CAD), additional losses include:

  • Tag alignment: contrastive or cross-entropy losses aligning code-level representations to human-interpretable tags or text embeddings,
  • Uniqueness loss: angular margin terms penalizing code collisions among non-identical items for maximal codebook utilization and diversity (Fang et al., 6 Aug 2025).

In multi-resolution vision models (e.g., segmentation), the codebook pyramid is further coupled to both pixel and semantic reconstruction pathways with dual-branch supervision (Zhang et al., 2024), and multi-granularity text/image alignment is optimized via Wasserstein or InfoNCE objectives over sampled code/text pairs (Liang et al., 3 Mar 2025).
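
As a concrete illustration of the uniqueness objective, the sketch below penalizes high angular similarity between distinct items whose hierarchical code tuples collide; the margin value and the pairing scheme are assumptions for illustration, not the exact formulation of the cited papers.

```python
import torch
import torch.nn.functional as F

def uniqueness_loss(item_emb, code_ids, margin=0.2):
    """Penalize code collisions among non-identical items.

    item_emb : (N, D) item embeddings
    code_ids : (N,)   integer id of each item's flattened hierarchical code tuple
    """
    emb = F.normalize(item_emb, dim=-1)
    sim = emb @ emb.t()                                   # pairwise cosine similarity
    same_code = code_ids.unsqueeze(0) == code_ids.unsqueeze(1)
    off_diag = ~torch.eye(len(code_ids), dtype=torch.bool)
    colliding = same_code & off_diag                      # distinct items, identical codes
    if not colliding.any():
        return sim.new_zeros(())
    # Angular-margin-style penalty: colliding pairs should be at least `margin` apart.
    return F.relu(sim[colliding] - (1.0 - margin)).mean()
```

Minimizing such a term discourages distinct items from sharing a code path, which is how the cited uniqueness losses raise codebook utilization and diversity.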

3. Autoregressive and Conditional Generation Schemes

Downstream sequence models employ hierarchical autoregressive (AR) token generation:

$$p\left(\{(k_i, j_i)\}_{i=1}^m\right) = \prod_{i=1}^m p(k_i \mid \mathrm{context}_{<i}) \times p(j_i \mid k_i, \mathrm{context}_{<i})$$

  • Each patch is generated in a coarse-to-fine sequence: first sample the semantic token, then the detail token,
  • The context conditioning window comprises both global transformer context and a localized spatial window for increased spatial coherence in generation (Yi et al., 8 Oct 2025).
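
A minimal sketch of this coarse-to-fine factorization at sampling time, assuming two hypothetical prediction heads on top of a shared transformer context (all module names and dimensions here are illustrative):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def sample_patch(hidden, semantic_head, detail_head, sem_embed, temperature=1.0):
    """Sample one patch as a (semantic, detail) token pair.

    hidden        : (1, H) transformer context for the current position
    semantic_head : maps the context to semantic-token logits (n1,)
    detail_head   : maps [context, semantic embedding] to detail-token logits (n2,)
    sem_embed     : embedding table for semantic tokens, conditioning the detail head
    """
    # Coarse step: sample the semantic token k_i from the global context.
    k = torch.multinomial(torch.softmax(semantic_head(hidden) / temperature, dim=-1), 1)
    # Fine step: sample the detail token j_i conditioned on the chosen semantic token.
    cond = torch.cat([hidden, sem_embed(k).squeeze(1)], dim=-1)
    j = torch.multinomial(torch.softmax(detail_head(cond) / temperature, dim=-1), 1)
    return k.item(), j.item()

# Toy usage: H = 64, n1 = 16 semantic codes (embedding dim 32), n2 = 256 detail codes.
k, j = sample_patch(torch.randn(1, 64),
                    nn.Linear(64, 16),
                    nn.Linear(64 + 32, 256),
                    nn.Embedding(16, 32))
```

The two sampling calls mirror the factors $p(k_i \mid \mathrm{context}_{<i})$ and $p(j_i \mid k_i, \mathrm{context}_{<i})$ above; the localized spatial window of (Yi et al., 8 Oct 2025) would enter through how `hidden` is constructed.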

Conditional generation incorporates attention-guided adaptive classifier-free guidance (CFG), wherein the logit blending coefficient is spatially modulated by attention scores and temporally by generation progress,

$$\ell_{\mathrm{cfg}}(y_i) = \ell_u(y_i) + \lambda_i \left[\ell_c(y_i) - \ell_u(y_i)\right]$$

with

$$\lambda_i = s'_i \times \alpha_i,$$

where $s'_i$ is a progressive schedule and $\alpha_i$ derives from spatial relevance computed via attention (Yi et al., 8 Oct 2025).
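
The guidance rule above maps directly onto the logits; a simplified sketch follows, taking the attention-derived relevance $\alpha_i$ and generation progress as given and assuming a linear progressive schedule (the exact schedule in the cited work may differ):

```python
import torch

def adaptive_cfg_logits(logits_cond, logits_uncond, attn_relevance,
                        step, total_steps, max_scale=3.0):
    """Blend conditional and unconditional logits with a spatially adaptive scale.

    logits_cond, logits_uncond : (V,) logits for the current token position
    attn_relevance             : scalar alpha_i derived from attention to the condition
    step, total_steps          : generation progress used for the schedule s'_i
    """
    s_i = max_scale * (step / max(total_steps, 1))   # assumed linear schedule s'_i
    lam = s_i * attn_relevance                       # lambda_i = s'_i * alpha_i
    return logits_uncond + lam * (logits_cond - logits_uncond)
```

Setting `attn_relevance` to 1 everywhere and `s_i` to a constant recovers standard classifier-free guidance, which makes the adaptive variant straightforward to ablate.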

For long-sequence tasks (e.g., generative recommendation, code-tree CAD generation), tokenization and AR decoders are explicitly constrained by prefix tries or semantic plans to guarantee validity and interpretability (Fang et al., 6 Aug 2025, Xu et al., 2023).
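
To make the trie constraint concrete, the sketch below masks next-token logits to the children of the current trie node so that only valid hierarchical code paths can be emitted; the trie contents and vocabulary are hypothetical.

```python
import torch

def build_trie(valid_sequences):
    """Build a prefix trie over valid code tuples, e.g. (category, subcategory, type)."""
    trie = {}
    for seq in valid_sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def constrained_step(logits, trie_node):
    """Pick the best next token among the children of the current trie node."""
    mask = torch.full_like(logits, float("-inf"))
    mask[list(trie_node.keys())] = 0.0        # only valid continuations stay finite
    return torch.argmax(logits + mask).item()

# Toy usage: three valid items over a vocabulary of 6 hierarchical code tokens.
trie = build_trie([(0, 2, 5), (0, 3, 4), (1, 2, 4)])
node, path = trie, []
for _ in range(3):
    logits = torch.randn(6)                   # stand-in for the AR decoder's logits
    tok = constrained_step(logits, node)
    path.append(tok)
    node = node[tok]                          # descend; `path` is guaranteed valid
```

The same mechanism supports the semantic plans mentioned above: fixing the first levels of the path constrains all subsequent tokens to completions consistent with that plan.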

4. Representative Applications

SemHiTok architectures have been specialized for and demonstrated high effectiveness in diverse domains:

Domain | Hierarchy | Example SemHiTok Variant | Key Metrics/Results
Vision (Gen/Understanding) | Semantic → Pixel | (Chen et al., 9 Mar 2025) | rFID = 1.24 (ImageNet + COYO); GQA = 58.8; MJHQ-30K gFID = 11.0
Image AR Generation | Semantic → Detail | (Yi et al., 8 Oct 2025) | FID = 1.50 (ImageNet, SOTA AR)
Segmentation | Pyramid (Early/Mid/Late/Latent) | (Zhang et al., 2024) | mIoU = 31 (OVSS, PAT + CLIP)
Recommendation | Category → Subcategory → Type | (Fang et al., 6 Aug 2025) | Recall@5 = 0.0543 (+35% over baselines); collisions = 2%
Speech | Semantic → Acoustic | (Hussein et al., 1 Jun 2025) | WER = 21.0%; 2× lower bitrate vs. SpeechTokenizer
CAD Design | Solid → Profile → Loop | (Xu et al., 2023) | Enables controlled, interpretable CAD completion

Each instance demonstrates that polynomial-capacity, semantic-guided codebooks can simultaneously achieve near-expert reconstruction, high semantic informativeness, and strong downstream task performance, frequently outperforming both pixel-expert and flat-VQ baselines.

5. Comparative Analysis, Ablations, and Interpretability

Comparative studies highlight several distinguishing characteristics:

  • Hierarchical codebook expansion: $L$-level codebooks scale as $O(N^L)$ without explosion in sequence length, enabling rich representations and efficient, fused per-patch tokens (Chen et al., 9 Mar 2025, Yi et al., 8 Oct 2025).
  • Stage-wise decoupling: Freezing higher-level semantic codebooks while training detail branches prevents destructive interference observed in jointly trained models, offering superior tradeoff between semantic and pixel fidelity (Chen et al., 9 Mar 2025).
  • Disentanglement and completeness: Uniqueness loss and hierarchical tag alignment not only reduce code collisions by an order of magnitude but also yield interpretable and semantically traversable discrete token paths (Fang et al., 6 Aug 2025, Xu et al., 2023).
  • Performance ablations: Removing hierarchical structure, local context, or adaptive CFG invariably degrades both accuracy and interpretability (FID, mIoU, Recall@K) (Yi et al., 8 Oct 2025, Zhang et al., 2024, Fang et al., 6 Aug 2025).
  • Representational efficiency: Compared with pixel-expert variants, SemHiTok’s fused hierarchical design often matches or exceeds reconstruction and semantic scores at substantially lower sequence and embedding costs.

A salient implication is that semantic-guided, hierarchical codebooks enable both interpretable control (e.g., in design/model completion with specified semantic plans) and constrained decoding in AR models, guaranteeing valid, semantically meaningful item or patch outputs (Fang et al., 6 Aug 2025, Xu et al., 2023).

6. Synthesis and Outlook

SemHiTok unifies several themes in contemporary deep generative modeling: semantic factorization of discrete representations, residual and hierarchical quantization, stage-wise decoupled training, cross-modal alignment, and constrained or controlled decoding.

A plausible implication is that future research will develop general-purpose, interpretable tokenizers for multimodal foundation models by combining semantic-guided hierarchy, cross-modal alignment, and constrained or controlled generation.

7. Key References

These works collectively establish SemHiTok as a rigorous, extensible abstraction for discrete semantic compositionality across diverse domains, offering a principled basis for both foundational research and practical generative, understanding, and control tasks.
