
Shape Token: Compact Geometric Encoding

Updated 18 February 2026
  • Shape tokens are compact latent representations encoding geometric, structural, and region-specific features of 2D/3D objects.
  • They employ various encoding strategies such as continuous vectors, discrete codebook indices, and spatially-parametric tokens to suit different models and tasks.
  • Shape tokens enable efficient conditional generation, recognition, and rendering by integrating with architectures like Transformers and diffusion models.

A shape token is a compact latent representation encoding geometric, structural, or region-specific information about a 2D or 3D object, typically designed for consumption by contemporary neural architectures including Transformers, diffusion models, and autoregressive frameworks. Shape tokenization constitutes a foundational design choice in modern vision, graphics, and multi-modal pipelines, supporting conditional generation, recognition, rendering, and geometric alignment across a diverse array of tasks and modalities. The formalism and implementation of shape tokens vary widely: from continuous latent vectors summarizing local 3D geometry, to codebook indices encoding quantized spatial features, to Gaussian footprints parameterizing flexible image regions.

1. Formal Definitions and Taxonomy

Shape tokens are latent entities extracted from geometric data—point clouds, meshes, signed distance fields (SDFs), images, or hybrid 2D/3D structures—via dedicated encoders. Their mathematical embodiment and semantic content depend on the intended application.

The diversity of designs leads to a taxonomy dictated by (i) discrete vs. continuous embedding, (ii) joint encoding of structure and appearance, and (iii) modality (2D, 3D, or joint) specificity.
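As an illustration of the continuous-embedding branch of this taxonomy, the sketch below pools a variable-size point cloud into a fixed number of latent tokens via Perceiver-style cross-attention (learned query latents attend to point features). All weights, dimensions, and names here are toy placeholders, not any published model's parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_pool(points, latents, W_q, W_k, W_v):
    """Pool a variable-size point cloud into a fixed set of latent tokens.

    points:  (n, d_in) raw point features (n varies per shape)
    latents: (k, d)    learned query tokens (fixed count k)
    Returns (k, d): one continuous shape token per latent query.
    """
    q = latents @ W_q                                 # (k, d) queries
    kk = points @ W_k                                 # (n, d) keys
    v = points @ W_v                                  # (n, d) values
    attn = softmax(q @ kk.T / np.sqrt(q.shape[-1]))   # (k, n) attention
    return attn @ v

rng = np.random.default_rng(0)
d_in, d, k, n = 3, 8, 4, 256
points = rng.normal(size=(n, d_in))
latents = rng.normal(size=(k, d))
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d_in, d))
W_v = rng.normal(size=(d_in, d))
tokens = cross_attention_pool(points, latents, W_q, W_k, W_v)
print(tokens.shape)  # (4, 8): k tokens regardless of input point count
```

The key property is that the output token count is fixed by the number of query latents, independent of the input's size.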

2. Shape Tokenization Methodologies

Encoding strategies determine how shapes are partitioned and tokens are generated:

  • Latent Encoder Architectures: Variational autoencoders (VAEs), Perceiver-based models, dedicated multi-scale CNN encoders, and VQ-VAEs are employed depending on the target geometry. For example, LTM3D’s encoder $f_\mathrm{enc}$ maps a 3D representation $\mathcal{S}$ to a sequence $X = \{x_1, \dots, x_n\}$, while 3D shape tokenization via flow matching applies cross-attention and self-attention to point clouds to yield a fixed set of latent vectors $s \in \mathbb{R}^{k \times d}$ (Chang et al., 2024).
  • Adaptive Tokenization: OAT constructs an octree guided by quadric error, allocating tokens adaptively so that more tokens land in geometrically complex regions; each resulting token encodes both quantized local features $q(v_i)$ and a tree-structural code $\chi(v_i)$ (Deng et al., 3 Apr 2025). SETA perturbs ViT-style tokens in the Fourier domain to create domain-invariant shape cues while disrupting style and edge content (Guo et al., 2024).
  • Codebook Quantization: Discrete tokenization involves mapping local latent descriptors to the nearest vector in a learned codebook, as in Kyvo and OAT (Deng et al., 3 Apr 2025, Sahoo et al., 9 Jun 2025), supporting efficient autoregressive modeling and cross-modal sequence unification.

The selection and arrangement of tokens reflect both geometric saliency and downstream task requirements, ensuring effective encoding of shapes’ critical features.
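The codebook lookup described above reduces to a nearest-neighbor search over a learned embedding table. The sketch below is a generic quantizer of that kind, not the specific OAT or Kyvo implementation; the codebook and latents are random toy data.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each continuous latent to its nearest codebook entry.

    latents:  (n, d) continuous token descriptors
    codebook: (K, d) learned embedding table
    Returns discrete indices (n,) and the quantized vectors (n, d).
    """
    # Squared Euclidean distance between every latent and every code
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))
# Latents near codes 3, 7, 7, perturbed slightly
latents = codebook[[3, 7, 7]] + 0.01 * rng.normal(size=(3, 4))
idx, q = quantize(latents, codebook)
print(idx)  # [3 7 7]
```

The discrete indices are what downstream autoregressive models consume; the quantized vectors feed the decoder.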

3. Integration in Generative and Recognition Pipelines

Shape tokens act as the fundamental atomic units in diverse neural workflows:

  • Diffusion and Autoregressive Modeling: LTM3D integrates masked autoencoding (MAE) with per-token diffusion. The generative process is factorized autoregressively: $p(X \mid T) = \prod_{i=1}^{n} p(x_i \mid x_{<i}, T)$, where $X$ is the shape-token sequence and $T$ is the prompt-token context. Each predicted token is denoised via a learned reverse diffusion process, combining continuous latent fidelity with autoregressive contextualization (Kang et al., 30 May 2025).
  • Conditional Generation and Prompt Fusion: Shape tokens are incorporated into diffusion U-Nets (ShapeWords (Petrov et al., 2024)) by fusing them with language embeddings via cross-attention blocks. Shape-aware prompt tokens (e.g., “a red [shape3d:ID] on a beach”) allow images to be synthesized that respect both textual style and precise 3D shape.
  • Joint Multi-Modal Alignment: Kyvo’s unified transformer enables text, image, and shape tokens to coexist in a single autoregressive stream (Sahoo et al., 9 Jun 2025). Cross-modal self-attention provides rich inter-token interactions, and a single next-token cross-entropy objective aligns all modalities.
  • Adaptive Decoding: GPSToken employs Gaussian parameterized tokens, each decoded to a 2D region via differentiable splatting, and subsequently rendered to produce pixel-level reconstructions. The spatial and appearance parameters are refined end-to-end via transformers (Zhang et al., 1 Sep 2025).

Consistent decoding or reconstruction modules map the tokenized representation back to the target geometry, often using transformers, MLPs, or specialized decoders matched to the upstream encoding.
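The autoregressive factorization used by such pipelines can be sketched as a plain sampling loop. The `toy` logits function below is a hypothetical stand-in for a trained next-token transformer; only the factorization structure is the point.

```python
import numpy as np

def sample_tokens(prompt, model, n_tokens, vocab_size, rng):
    """Sample x_1..x_n from p(X|T) = prod_i p(x_i | x_<i, T).

    `model(context) -> (vocab_size,) logits` stands in for a trained
    next-token predictor; `prompt` plays the role of the prompt tokens T.
    """
    ctx = list(prompt)
    out = []
    for _ in range(n_tokens):
        logits = model(ctx)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        x = int(rng.choice(vocab_size, p=p))
        out.append(x)
        ctx.append(x)  # each new token conditions on x_<i and T
    return out

rng = np.random.default_rng(0)
# Toy "model": strongly prefers the successor of the last token (mod 32)
toy = lambda ctx: -4.0 * np.abs(np.arange(32) - (ctx[-1] + 1) % 32)
x = sample_tokens([5], toy, n_tokens=6, vocab_size=32, rng=rng)
print(x)
```

In LTM3D the per-step distribution is realized by a learned reverse diffusion over continuous latents rather than a categorical softmax, but the chain rule over the token sequence is the same.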

4. Structural and Geometric Interpretability

Shape tokens often maintain spatial, semantic, or hierarchical interpretability:

  • Hierarchical Partitioning: OAT’s octree-based tokens reflect the underlying geometric partition, and each token’s structural code enables hierarchical, interpretable serialization (Deng et al., 3 Apr 2025).
  • Spatial Footprint: GPSToken tokens’ 2D Gaussian parameters—mean, covariance, correlation—directly parameterize their patch-like spatial support, with empirical results showing allocation of elongated tokens to edges and large, isotropic tokens to uniform regions (Zhang et al., 1 Sep 2025).
  • Semantic Alignment: In INT, the shape token explicitly encodes SMPL shape coefficients for full-body human reconstruction, maintaining a decodable bridge to global shape attributes (Yang et al., 2023). Similarly, TEASER’s tokens summarize multi-scale facial feature information for high-precision 3D expression reconstruction (Liu et al., 16 Feb 2025).

Such designs enhance model explainability and support downstream manipulation (e.g., swapping, clustering, or modulating tokens for pose, identity, or expression transfer).
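The Gaussian spatial footprint idea can be made concrete with a short rasterization sketch. The means, covariances, and grid sizes below are arbitrary toy values rather than GPSToken's learned parameters, and real splatting is differentiable and batched.

```python
import numpy as np

def gaussian_footprint(mean, cov, H, W):
    """Rasterize one Gaussian-parameterized token onto an H x W grid.

    mean: (2,)   token center in (x, y) pixel coordinates
    cov:  (2, 2) covariance controlling the token's shape and extent
    Returns per-pixel weights in (0, 1] (unnormalized Gaussian density).
    """
    ys, xs = np.mgrid[0:H, 0:W]
    d = np.stack([xs - mean[0], ys - mean[1]], axis=-1)  # (H, W, 2) offsets
    P = np.linalg.inv(cov)                               # precision matrix
    m = np.einsum('hwi,ij,hwj->hw', d, P, d)             # squared Mahalanobis
    return np.exp(-0.5 * m)

# Elongated token: large variance along x, small along y (edge-like support)
w = gaussian_footprint(np.array([8.0, 4.0]),
                       np.array([[16.0, 0.0], [0.0, 1.0]]), H=8, W=16)
print(w.shape, round(float(w[4, 8]), 3))  # (8, 16) 1.0 at the center
```

The anisotropic covariance is what lets a token stretch along an edge while staying narrow across it, matching the empirical allocation described above.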

5. Applications Across Domains

Shape tokens facilitate a wide range of tasks:

Representative applications, example models, and token functionality:

  • Conditional 3D generation (LTM3D, OAT): conditioned sampling, multi-format geometry
  • 3D pose/shape estimation (INT, TEASER): global and local geometry prediction from images
  • Text-to-image/3D synthesis (ShapeWords, Kyvo): CLIP-space fusion, prompt-aligned geometry
  • Robust visual recognition (SETA): domain-invariant representation, shape-biased tokens
  • Image generation (GPSToken): non-uniform, content-adaptive region encoding

In addition, advanced designs permit zero-shot estimation of surface normals, UVW parameterizations, and other geometric attributes owing to the differential structure encoded by the tokens (Chang et al., 2024).

6. Empirical Performance and Quantitative Analysis

Large-scale evaluation demonstrates that shape tokenization underpins SOTA performance in accuracy, fidelity, and generalizability:

  • Statistical Improvements: LTM3D achieves lower Chamfer Distance (CD), higher F-scores, and improved ULIP metrics across text- and image-conditioned 3D generation compared to previous methods (Kang et al., 30 May 2025). OAT’s adaptive tokenization yields fewer tokens yet better IoU and lower CD than fixed-size VQ baselines (Deng et al., 3 Apr 2025).
  • Recognition and Generation: Kyvo reaches a Jaccard index of 0.6415 on synthetic scenes and outperforms Cube R-CNN by over 15 Jaccard points on Objectron (Sahoo et al., 9 Jun 2025). TEASER improves LPIPS and FID scores for fine-grained 3D face expression modeling (Liu et al., 16 Feb 2025).
  • Sample and Reconstruction Quality: GPSToken achieves reconstruction FID of 0.65 and generation FID of 1.50 on ImageNet 256×256 using 128 tokens (Zhang et al., 1 Sep 2025). SETA bridges the generalization gap in ViT architectures by compelling models to rely on token-carried shape signals and empirically reduces risk bounds for cross-domain recognition (Guo et al., 2024).
  • Functional Advantages: Flow-matched 3D shape tokens enable surface normal estimation and deformation field extraction via direct score computation, demonstrating the expressive versatility of token-based latent representations (Chang et al., 2024).
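Chamfer Distance, the geometric-fidelity metric cited above, is straightforward to sketch for small point sets. This is a brute-force O(nm) version with toy inputs; practical evaluations use k-d trees or GPU batching.

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer Distance between point sets A (n, 3) and B (m, 3):
    mean nearest-neighbor squared distance, summed over both directions."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # (n, m) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

A = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
B = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
# A -> B contributes 0 (exact matches); B -> A contributes 1/3 for the
# unmatched point [1, 1, 0], whose nearest neighbor in A is [1, 0, 0].
print(round(float(chamfer_distance(A, B)), 3))  # 0.333
```

Lower is better; identical point sets score zero, which is why CD pairs naturally with F-score and IoU in the comparisons above.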

7. Future Prospects and Open Challenges

Current research trends point toward increasingly universal, modular, and geometry-aware tokenization frameworks. Ongoing directions include:

  • Cross-Modal and Cross-Representation Generalization: Models like LTM3D and Kyvo aim for token spaces supporting simultaneous 2D, 3D, and linguistic conditioning without architectural modification.
  • Token Granularity and Adaptivity: Adaptive token assignment (OAT, GPSToken) reacts to scene complexity, promising improved representational efficiency and better scalability to very high-resolution domains.
  • Semantic and Structural Alignment: Strategic design of tokenization and prompt fusion mechanisms (ShapeWords, INT) strives to close domain gaps between language, image, and geometry embeddings.
  • Efficiency vs. Fidelity Trade-offs: Varying the dimensionality and density of tokens (flow-matching models, adaptive quantization) exposes trade-offs in generative detail, computational cost, and interpretability.

A plausible implication is that token-based representations will become ever more central to unified multi-modal AI systems, underpinning future advances in controllable generation, robust scene understanding, and high-fidelity differential rendering across domains.
