Unified Topology & Geometry Tokenization
- Unified topology and geometry tokenization is a method that fuses global structural relationships with local geometric attributes into a single discrete token sequence.
- It employs structured codebook assignments, hierarchical traversals, and multi-objective training to preserve both semantic and geometric fidelity.
- The unified approach enables controllable generative modeling across domains such as 3D shapes, human motion, and architectural design.
Unified topology and geometry tokenization is a paradigm in representation learning and generative modeling that encodes both the global structural relationships (topology) and the local or continuous attributes (geometry) of complex data in a single discrete token sequence or codebook. This approach enables a variety of neural architectures, especially autoregressive and sequence models, to reason about, generate, and manipulate rich data, ranging from 3D shapes and human motion to architectural plans and natural language, through unified, interpretable, and readily learnable discrete embeddings.
1. Foundations and Motivating Principles
Unified topology and geometry tokenization addresses the need for machine representations that capture not only geometric detail but also spatial or semantic relationships among parts. Traditional approaches often separate these axes, leading either to geometric tokenizations with poor topological awareness (e.g., VQ-VAE for images (Londei et al., 24 Feb 2026)) or to topological encodings without geometric fidelity (e.g., adjacency graphs alone). Unification produces a token vocabulary or sequence in which each element participates in both the global structure (e.g., patch, room, face, or occupancy context) and local content (e.g., spatial coordinates, texture, or physical parameters).
Several design principles unify these methods:
- Discrete representation: The entire semantic–geometric structure is encoded as a sequence or set of tokens, permitting sequence modeling.
- Topology-geometry fusion: Tokens encode part relationships (e.g., adjacency, patch hierarchy) together with local geometry (e.g., coordinates, features, quantized codes).
- Learnability and interpretability: The organization of the token space (via grids, graphs, trees, or codebooks) aligns with meaningful semantics or geometry, facilitating controllable or interactive modeling.
- Compatibility with modern modeling: Sequence-based architectures (Transformers, diffusion models, autoregressive LMs) natively consume such unified tokens, enabling tasks that are impossible or inefficient with separate topology and geometry representations.
2. Architectures and Methodological Variants
A range of architectures instantiate unified topology and geometry tokenization, each tailored to domain constraints. Prominent examples include:
Grid-based Vector Quantization with Explicit Topology
- SOM-VQ couples classic vector-quantized autoencoding (VQ-VAE) with Self-Organizing Maps (SOMs) to impose a low-dimensional grid topology on the codebook. Each latent code takes a fixed grid coordinate; nearby tokens in code-space correspond to semantically similar events. The training alternates between a local SOM update (topology preservation via neighborhood-sensitive code adjustment) and a commitment update (EMA stabilization) (Londei et al., 24 Feb 2026).
- Geometry control arises from the meaningful codebook layout: moving a token to a nearby grid location induces a coherent, semantic change in the generated content.
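The neighborhood-sensitive codebook update at the heart of this family can be sketched as follows. The function name `som_codebook_update`, the learning rate, and the Gaussian neighborhood kernel are illustrative choices for a classic SOM step, not the exact recipe of the cited paper; the key point is that update weights decay with distance on the imposed grid, not in latent space.

```python
import numpy as np

def som_codebook_update(codebook, grid_coords, z, lr=0.1, sigma=1.0):
    """One SOM-style update: pull the best-matching code and its grid
    neighbors toward the encoder latent z (illustrative sketch)."""
    # 1. Find the best-matching unit (nearest code to the latent).
    bmu = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    # 2. Neighborhood weights decay with *grid* distance, not latent distance,
    #    so the update preserves the imposed low-dimensional grid topology.
    grid_dist = np.linalg.norm(grid_coords - grid_coords[bmu], axis=1)
    h = np.exp(-grid_dist**2 / (2 * sigma**2))
    # 3. Move every code toward z, weighted by its grid proximity to the BMU.
    codebook += lr * h[:, None] * (z - codebook)
    return bmu, codebook

# Toy usage: a 4x4 grid of 2-D codes.
rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(4) for j in range(4)], dtype=float)
codes = rng.normal(size=(16, 2))
bmu, codes = som_codebook_update(codes, grid, z=np.array([1.0, 1.0]))
```

Because neighbors on the grid are dragged along with the winner, codes that end up adjacent on the grid come to encode similar content, which is what makes later token-space manipulation semantically coherent.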
Graph- and Sequence-Based Topology Embeddings
- BrepARG and AutoBrep represent boundary representations (B-Rep) of CAD models as holistic token sequences. They encode both geometry (through VQ-VAE or FSQ-based latent tokens for surface/curve patches and quantized bounding boxes) and topology (via face/edge indexing, adjacency, and local reference windows). A single sequence is built by traversing the face adjacency graph in a topology-aware order (DFS for BrepARG (Li et al., 23 Jan 2026), BFS for AutoBrep (Xu et al., 2 Dec 2025)).
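The traversal-based serialization these methods describe can be illustrated with a toy sketch. The `FACE`/`GEOM`/`REF` token names and the dictionary-based adjacency format are hypothetical stand-ins for the papers' actual vocabularies; the point is the interleaving of topology tokens (face identity, back-references to already-emitted neighbors) with each face's quantized geometry tokens, in a traversal order over the face adjacency graph.

```python
def tokenize_brep(adjacency, geom_tokens, start=0, order="dfs"):
    """Serialize a face-adjacency graph into one token stream, interleaving
    topology tokens (face index, back-reference markers) with each face's
    quantized geometry tokens. Token names are illustrative."""
    visited, stream, frontier = set(), [], [start]
    while frontier:
        # DFS pops from the end (stack), BFS from the front (queue).
        face = frontier.pop() if order == "dfs" else frontier.pop(0)
        if face in visited:
            continue
        visited.add(face)
        stream.append(("FACE", face))                          # topology: face identity
        stream.extend(("GEOM", g) for g in geom_tokens[face])  # local geometry codes
        for nbr in adjacency[face]:
            if nbr in visited:
                stream.append(("REF", nbr))  # topology: edge to an earlier face
            else:
                frontier.append(nbr)
    return stream

# Toy solid: 4 faces in a cycle, each with two quantized geometry codes.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
geom = {f: [10 + f, 20 + f] for f in range(4)}
tokens = tokenize_brep(adj, geom)
```

Each undirected edge is recorded exactly once, as a `REF` emitted by whichever endpoint is visited second, so the full adjacency graph is recoverable from the flat sequence.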
Tree and Octree Hierarchies
- TreeTOp encodes both geometry (parameters of geometric primitives) and topology (tree-structured Boolean operations) as a fully differentiable Constructive Solid Geometry (CSG) tree. Each internal node carries learnable weights that interpolate between union, intersection, and subtraction operations, while leaf nodes encode geometry (Padhy et al., 2024).
- Uni-3DAR employs an octree to encode spatial occupancy (topology) and fine-grained atom type + intra-cell geometry (geometry) for 3D molecular data. Subtree compression and masked next-token prediction optimize autoregressive sequence length and learnability (Lu et al., 20 Mar 2025).
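The learnable Boolean interpolation at a TreeTOp-style CSG node can be sketched on signed-distance fields. The min/max operators for union, intersection, and subtraction and the softmax weighting below are one common formulation of soft Booleans, assumed here for illustration, not necessarily the paper's exact operators.

```python
import numpy as np

def softmax(w):
    e = np.exp(w - np.max(w))
    return e / e.sum()

def soft_boolean(sdf_a, sdf_b, logits):
    """Differentiable CSG node: blend union, intersection, and subtraction
    of two signed-distance fields with learnable softmax weights."""
    w = softmax(logits)                 # (w_union, w_inter, w_sub)
    ops = np.stack([
        np.minimum(sdf_a, sdf_b),       # union
        np.maximum(sdf_a, sdf_b),       # intersection
        np.maximum(sdf_a, -sdf_b),      # subtraction A \ B
    ])
    return (w[:, None] * ops).sum(axis=0)

# Two 1-D "shapes": intervals [-1, 1] and [0, 2] as signed distances.
x = np.linspace(-3, 3, 7)
sdf_a = np.abs(x) - 1.0          # negative inside [-1, 1]
sdf_b = np.abs(x - 1.0) - 1.0    # negative inside [0, 2]
union_like = soft_boolean(sdf_a, sdf_b, logits=np.array([5.0, 0.0, 0.0]))
```

With a large union logit the node behaves almost exactly like a hard union, yet gradients still flow to all three operator weights, which is what lets the tree's topology (which Boolean each node applies) be learned jointly with the leaves' geometry.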
Patch- and Superpoint-Based Methods
- GeoTransformer applies algebraic multigrid root-node selection and heat diffusion positional encodings to group mesh vertices into tokens, preserving mesh topology and stable geometry embeddings. Attention mechanisms are localized by geodesic masks, and cross-attention recovers node-level features (Farazi et al., 2024).
- S4Token (CLIP-based 3D understanding) generates tokens by superpoint over-segmentation for topology and coordinate-normalized features for geometry. Downstream masking, clustering, and distillation objectives fuse both aspects into scale-invariant, generalizable tokens (Mei et al., 24 May 2025).
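A toy version of the superpoint idea follows, with simple voxel hashing standing in for learned over-segmentation: each token carries a region key (topology), a centroid (global placement), and centroid-normalized local coordinates (scale-free geometry). The function and field names are illustrative, not S4Token's API.

```python
import numpy as np

def superpoint_tokens(points, cell=0.5):
    """Group points into coarse superpoints by voxel hashing (a crude
    stand-in for learned over-segmentation), then attach centroid-normalized
    local coordinates as the per-token geometry feature."""
    groups = {}
    for key, p in zip(map(tuple, np.floor(points / cell).astype(int)), points):
        groups.setdefault(key, []).append(p)
    tokens = []
    for key, pts in groups.items():
        pts = np.asarray(pts)
        centroid = pts.mean(axis=0)
        tokens.append({
            "cell": key,               # topology: which spatial region
            "centroid": centroid,      # global placement of the token
            "local": pts - centroid,   # translation-normalized local geometry
        })
    return tokens

cloud = np.array([[0.1, 0.1, 0.1], [0.2, 0.1, 0.1], [1.2, 1.2, 1.2]])
toks = superpoint_tokens(cloud, cell=0.5)
```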
Multimodal Sequence Tokenizations
- HouseMind tokenizes architectural floorplans into room-instance tokens, each containing a semantic label, quantized shape descriptors, and contextually encoded adjacency (implicit topology). The encoder conditions every room on the global outline, ensuring adjacency is recoverable and relationships are represented in the joint token sequence (Qin et al., 12 Mar 2026).
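The shape of such a room-instance token can be sketched as a small record: a semantic label, corner coordinates quantized to a coarse grid inside the unit outline, and an adjacency list. This schema (and the `room_token` helper) is a hypothetical illustration of the general idea, not HouseMind's actual token format, which encodes adjacency implicitly through conditioning.

```python
def room_token(label, polygon, neighbors, grid=16):
    """One room-instance token: semantic label, shape quantized to a
    grid-by-grid lattice in the unit outline, and an adjacency list.
    Illustrative schema only."""
    shape = [(round(x * grid) / grid, round(y * grid) / grid) for x, y in polygon]
    return {"label": label, "shape": shape, "adj": sorted(neighbors)}

kitchen = room_token(
    "kitchen",
    polygon=[(0.10, 0.10), (0.42, 0.10), (0.42, 0.33), (0.10, 0.33)],
    neighbors=["living", "hall"],
)
```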
A summary table illustrates these variants:
| Method/Paper | Topology Encoding | Geometry Encoding | Tokenization Structure |
|---|---|---|---|
| SOM-VQ (Londei et al., 24 Feb 2026) | 2D grid on VQ codebook | Continuous latent quantized to grid | Grid-indexed VQ tokens |
| BrepARG (Li et al., 23 Jan 2026), AutoBrep (Xu et al., 2 Dec 2025) | Face-index, adjacency, reference tokens | VQ-VAE/FSQ latents, bounding boxes | Ordered token sequence |
| TreeTOp (Padhy et al., 2024) | Binary CSG tree, Boolean weights | Primitives’ parametric shapes | Tree-structured latent |
| Uni-3DAR (Lu et al., 20 Mar 2025) | Octree occupancy, subtree codes | Atom type + in-cell quantized coord. | 1D sequence (octree) |
| GeoTransformer (Farazi et al., 2024) | Patch graph, root-node map | HKS/diffusion, patch pool | Patch tokens, attention |
| S4Token (Mei et al., 24 May 2025) | Superpoints, centroid clusters | Normalized features | N-tokens (scene points) |
| HouseMind (Qin et al., 12 Mar 2026) | Conditioned room adjacency | Shape/presence in outline | Interleaved text + room |
3. Token Space Topology and Geometry: Theoretical and Empirical Analysis
Recent work establishes that not only explicit codebooks, but also natural language token embedding spaces in LLMs, possess intrinsic topological and geometric structure. In large LMs, the embedding map places each token in a high-dimensional latent space, and volume–radius analysis reveals a stratified manifold rather than a single smooth space. Local dimension and Ricci scalar curvature correlate with model fluency and behavior: high-curvature or collapsed strata result in poor generalization in domains like numerals (Robinson et al., 2024). Designing tokenization mechanisms with controlled topological and geometric properties in the embedding space can thus have direct empirical impact on model expressivity, stability, and controllability.
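The volume–radius diagnostic can be sketched with a simple neighbor-count estimator: on a d-dimensional stratum, the number of points N(r) within radius r grows roughly as r^d, so the local dimension is the slope of log N against log r. The estimator below is a minimal sketch of this idea, not the cited paper's pipeline.

```python
import numpy as np

def local_dimension(points, idx, radii):
    """Estimate local intrinsic dimension at points[idx] from volume-radius
    scaling: N(r) ~ r^d, so d is the log-log slope of neighbor counts."""
    dists = np.linalg.norm(points - points[idx], axis=1)
    counts = np.array([np.sum(dists <= r) for r in radii], dtype=float)
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

# Sanity check: dense samples from a 2-D plane embedded in R^5 should
# report local dimension near 2, regardless of the ambient dimension.
rng = np.random.default_rng(0)
pts = np.zeros((5000, 5))
pts[:, :2] = rng.uniform(-1, 1, size=(5000, 2))
dim_est = local_dimension(pts, idx=0, radii=np.linspace(0.1, 0.5, 5))
```

Running the same estimator at many base points and comparing the slopes is one way to detect the stratification the text describes: strata of different local dimension report systematically different slopes.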
4. Applications and Empirical Validation
Unified topology and geometry tokenization underpins state-of-the-art results in several domains:
- Generative modeling: Topology-aware tokenization (SOM-VQ, HouseMind) enables controllable, semantically aligned generative models for human motion, architecture, and gestures (Londei et al., 24 Feb 2026, Qin et al., 12 Mar 2026).
- 3D and CAD modeling: Holistic token sequences—encoding both geometry and adjacency—permit efficient and valid B-Rep, mesh, or molecular structure generation with transformers or autoregressive LMs (Xu et al., 2 Dec 2025, Li et al., 23 Jan 2026, Lu et al., 20 Mar 2025, Farazi et al., 2024).
- Multimodal and vision-language: Cross-modal alignment and interpretability are achieved by mapping both topology (via over-segmentation or occupancy) and geometry (coordinate features or image-aligned patch latents) to discrete, generalizable tokens, as in S4Token (Mei et al., 24 May 2025).
- Camera-controllable video generation: By fusing depth, pose, and appearance into a unified token stream, methods such as CETCam achieve precise geometric control and adaptability during video synthesis (Zhao et al., 22 Dec 2025).
Reported empirical results consistently show that unified tokenization yields improved reconstruction error, increased validity or coverage, and more compact, controllable representations relative to baselines that treat topology and geometry separately (Li et al., 23 Jan 2026, Xu et al., 2 Dec 2025, Lu et al., 20 Mar 2025, Qin et al., 12 Mar 2026).
5. Algorithmic Building Blocks and Training Strategies
Tokenization pipelines combine several algorithmic ingredients to achieve unification:
- Structured codebook assignment: Topologically aware quantization, as with SOM updates or spatially regularized clustering, imparts explicit structure to the latent codes (Londei et al., 24 Feb 2026, Mei et al., 24 May 2025).
- Hierarchical traversals: Deciding processing order by BFS, DFS, or patch hierarchies ensures topological locality for sequence models (Xu et al., 2 Dec 2025, Li et al., 23 Jan 2026).
- Multi-objective training: Simultaneous alignment losses (e.g., semantic, geometric, and reconstruction) force the token sequence to bind appearance, shape, and structure (Zhuo et al., 19 Mar 2026, Lu et al., 20 Mar 2025).
- Positional and structural embeddings: Feature representations (e.g., heat kernel, diffusion, normalized coordinates) ensure geometric, isometry-invariant encoding (Farazi et al., 2024, Mei et al., 24 May 2025).
- Local context windows: Restricting reference tokens or attention masks to local subgraphs or geodesic neighborhoods constrains the combinatorial explosion of plausible topologies (Xu et al., 2 Dec 2025, Farazi et al., 2024).
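As a concrete instance of the last ingredient, a local attention mask can be built directly from a token adjacency matrix by expanding reachability k hops at a time. This is a simplified, unweighted version of the geodesic and local-window masks described above, assuming a dense 0/1 adjacency matrix over tokens.

```python
import numpy as np

def khop_attention_mask(adjacency, k):
    """Boolean attention mask restricting each token to its k-hop graph
    neighborhood (adjacency is a dense 0/1 matrix over tokens)."""
    n = adjacency.shape[0]
    reach = np.eye(n, dtype=bool)          # every token attends to itself
    for _ in range(k):
        # Expand reachability by one hop per iteration.
        reach = reach | ((reach.astype(int) @ adjacency) > 0)
    return reach

# Path graph 0-1-2-3-4: with k=1 each token attends to itself and neighbors.
A = np.zeros((5, 5), dtype=int)
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1
mask = khop_attention_mask(A, k=1)
```

Passed to a Transformer as an additive or boolean attention mask, this keeps each token's receptive field on its topological neighborhood, which is exactly how such methods bound the combinatorial space of topologies the model must consider at each step.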
6. Limitations, Open Challenges, and Future Directions
While unified topology and geometry tokenizations have demonstrated broad applicability, several challenges remain:
- Resolution and expressivity trade-offs: Quantization (VQ, FSQ) imposes information bottlenecks, sometimes hindering the accurate reconstruction of highly irregular or fine-grained features (Qin et al., 12 Mar 2026, Li et al., 23 Jan 2026).
- Explicit vs. implicit topology: Some methods rely on implicit relationships inferred from context, which can complicate direct editing or validation (e.g., adjacency in HouseMind (Qin et al., 12 Mar 2026)).
- Scalability and generalization: Ensuring that tokenization schemes generalize across domain scales or complexities requires careful normalization and design, as in S4Token's coordinate scale invariance (Mei et al., 24 May 2025).
- Analysis and diagnostics: Embedding diagnostics (dimension, curvature; (Robinson et al., 2024)) are not yet a standard component of tokenizer design, and the connection to downstream model robustness remains an area of ongoing research.
A plausible implication is that future tokenization frameworks will increasingly employ adaptive, data-driven codebook and sequence structures informed by intrinsic topology–geometry diagnostics, further enhancing the interpretability and reliability of generative and reasoning systems.