Hierarchical BrepEncoder for B-rep CAD Models
- Hierarchical BrepEncoder is a unified neural encoder that integrates geometric tokens and topological relationships for effective B-rep CAD processing.
- It employs sequence, graph, or tree-based architectures to capture multi-scale features, enabling tasks such as generative modeling and cross-modal alignment.
- Empirical results demonstrate improved efficiency and accuracy in B-rep generation, cross-modal grounding, and self-supervised representation learning.
A Hierarchical BrepEncoder is an architectural class of neural encoders designed for processing the geometry and topology of Boundary Representation (B-rep) CAD models in a unified, multi-scale, and information-preserving manner. Hierarchical BrepEncoders underpin multiple state-of-the-art frameworks for generative modeling, cross-modal alignment, and self-supervised representation learning with B-rep input, utilizing tailored graph, sequence, or tree structures to encode faces, edges, and their mutual relationships.
1. Core Principles and Motivation
Traditional CAD B-rep processing architectures have historically disentangled geometric and topological information using graph-based approaches with decoupled computational blocks, limiting compatibility with sequence-based or transformer-based learning paradigms. Hierarchical BrepEncoders address this limitation by unifying geometry and topology through an explicit architectural or sequence hierarchy: geometry tokens or features are extracted for each primitive, structured hierarchically (face → edge → possibly vertex), and integrated into holistic token sequences or latent trees. This hierarchy supports long-range geometric reasoning, topological context encoding, and efficient downstream model training via transformers or diffusion models (Li et al., 23 Jan 2026, Deng et al., 18 Dec 2025, Xu et al., 2024, Li et al., 16 Mar 2026).
2. Hierarchical Encoding Strategies
2.1 Token-Sequence-Based Encoders
BrepARG (Li et al., 23 Jan 2026) encodes each primitive (face, edge) into a set of three token types:
- Geometry Tokens: Obtained by downsampling UV-sampled primitive geometry (surface patches for faces, UV-broadcast curves for edges) with a VQ-VAE, then quantizing against a codebook, yielding four tokens per primitive.
- Position Tokens: Derived by quantizing each primitive’s axis-aligned bounding box into six discrete scalars, mapping real coordinates to uniformly quantized integers.
- Face Index Tokens: Topology is encoded as integer face indices, assigned uniquely to each face and referenced from the two adjoining faces for each edge.
Tokens are constructed into geometry blocks (per-primitive), ordered hierarchically (sequence of faces, then edges), and assembled with unambiguous separators into a flat, holistic sequence suitable for autoregressive next-token prediction via transformer models. This enables fully end-to-end, sequence-based learning of both geometric and topological dependencies.
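The assembly described above can be sketched as follows. This is an illustrative reconstruction, not the BrepARG implementation: the token spellings, separator strings, quantization range, and per-block layout (`face_block`, `edge_block`, `assemble`) are all assumptions made for the example.

```python
# Sketch: flat, hierarchically ordered token sequence from per-primitive blocks.
# Geometry-token IDs would come from a VQ-VAE codebook in practice; here they
# are passed in directly. All names and layouts are hypothetical.

POS_BINS = 64            # assumed quantization resolution per bbox scalar
SEP_FACE = "<face>"      # assumed separator tokens
SEP_EDGE = "<edge>"
EOS = "<eos>"

def quantize_bbox(bbox, lo=-1.0, hi=1.0, bins=POS_BINS):
    """Quantize the 6 bbox scalars (xmin..zmax) to integers in [0, bins-1]."""
    return [min(bins - 1, max(0, int((v - lo) / (hi - lo) * bins)))
            for v in bbox]

def face_block(face_id, geom_tokens, bbox):
    # One block per face: separator, face index, 4 geometry, 6 position tokens.
    return ([SEP_FACE, f"f{face_id}"]
            + [f"g{t}" for t in geom_tokens]
            + [f"p{q}" for q in quantize_bbox(bbox)])

def edge_block(geom_tokens, bbox, face_a, face_b):
    # Edges reference their two adjoining faces by face-index token.
    return ([SEP_EDGE, f"f{face_a}", f"f{face_b}"]
            + [f"g{t}" for t in geom_tokens]
            + [f"p{q}" for q in quantize_bbox(bbox)])

def assemble(faces, edges):
    """Faces first, then edges, then EOS: one flat autoregressive sequence."""
    seq = []
    for i, (geom, bbox) in enumerate(faces):
        seq += face_block(i, geom, bbox)
    for geom, bbox, fa, fb in edges:
        seq += edge_block(geom, bbox, fa, fb)
    return seq + [EOS]
```

The hierarchical ordering (all faces before all edges, separators between blocks) is what lets a causal transformer recover both geometry and topology from a single token stream.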
2.2 Graph and Multi-Level Feature Fusion Encoders
BrepLLM (Deng et al., 18 Dec 2025) and Masked BRep Autoencoder (Li et al., 16 Mar 2026) utilize adaptive UV/curve sampling to convert each B-rep into a face-adjacency graph:
- Face Nodes: Rich geometry tensors sampled and encoded via PointTransformer or CNNs, capturing detailed surface properties.
- Edge Features: Curve samples between shared faces, with dedicated MLP or CNN feature extractors, facilitating local message passing.
- Hierarchical Feature Extraction: Feature branches include:
- Fine geometry (self-attention among sampled points per face),
- Local topological message passing (edge-conditioned convolution/NNConv over the graph),
- Global topology (graph attention over node embeddings).
- Pooling: Per-face tokens are aggregated into global tokens using attention pooling.
A contrastive InfoNCE loss (in BrepLLM) aligns global geometry-text embeddings, while in self-supervised setups (Masked BRep Autoencoder), masked graph reconstruction with a hierarchical graph transformer is used for transfer learning.
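The fusion pattern above can be condensed into a minimal NumPy sketch: per-face features are refined by one round of edge-conditioned message passing over the face-adjacency graph, then attention-pooled into a single global token. The single linear edge map `W`, the tanh gating, and the feature widths are illustrative assumptions, not the published architectures.

```python
import numpy as np

def message_pass(H, edges, E, W):
    """NNConv-flavored update: each edge feature e_ij is mapped by W to a
    gating vector that modulates the neighbor feature before aggregation.
    H: (n_faces, D) face features; edges: list of (i, j); E: (n_edges, De)."""
    out = H.copy()
    counts = np.ones(len(H))
    for (i, j), e in zip(edges, E):
        gate = np.tanh(e @ W)            # (D,) gate derived from the edge
        out[i] += gate * H[j]
        out[j] += gate * H[i]
        counts[i] += 1.0
        counts[j] += 1.0
    return out / counts[:, None]         # mean over self + incoming messages

def attention_pool(H, q):
    """Global token: softmax(H q)-weighted sum of per-face tokens."""
    logits = H @ q
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ H
```

A real encoder would stack several such rounds, with learned MLPs in place of `W` and a learned query in place of `q`.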
2.3 Tree-structured Latent Geometry
BrepGen (Xu et al., 2024) represents B-rep models as explicit hierarchical trees via node duplication:
- Level 0: Root node (solid).
- Level 1: Face nodes, each with a bounding box and VAE-compressed surface samples (shape latent).
- Level 2: Edge nodes, with bounding box, VAE-compressed edge samples, and explicit endpoints.
- Level 3: Vertex nodes, represented directly as coordinates.
Topological sharing (e.g., adjacency through shared edges) is embedded by duplicating nodes; deduplication and postprocessing steps reconcile the tree structure with the original graph-based topology on reconstruction.
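The duplication idea can be made concrete with a small sketch: an edge shared by two faces is stored twice, once under each adjacent face node, so that every root → face → edge path in the tree is self-contained. The dictionary-based data layout here is an illustrative assumption, not BrepGen's internal representation.

```python
def build_tree(faces, edges, face_of_edge):
    """faces: {fid: latent}; edges: {eid: latent};
    face_of_edge: eid -> (fa, fb), the two faces adjoining each edge."""
    root = {"level": 0, "children": []}
    for fid, f_lat in faces.items():
        fnode = {"level": 1, "face": fid, "latent": f_lat, "children": []}
        for eid, (fa, fb) in face_of_edge.items():
            if fid in (fa, fb):
                # Duplicate the shared edge under each adjacent face node.
                fnode["children"].append(
                    {"level": 2, "edge": eid, "latent": edges[eid]})
        root["children"].append(fnode)
    return root

def count_edge_nodes(root):
    """Total edge nodes in the tree (duplicates counted separately)."""
    return sum(len(f["children"]) for f in root["children"])
```

On reconstruction, the inverse step deduplicates: edge nodes carrying the same identity are merged back into a single shared edge of the face-adjacency graph.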
3. Model Architectures and Training Objectives
3.1 Sequence Transformer
In BrepARG (Li et al., 23 Jan 2026), token sequences are modeled with a transformer decoder (8 layers, 8 heads) using causal (autoregressive) self-attention. Each token is embedded and used in next-token prediction under an end-to-end cross-entropy loss
$$\mathcal{L}_{\text{AR}} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}),$$
where $x_t$ denotes the token at position $t$.
Optimizer: AdamW with learning rate scheduling and gradient clipping.
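The next-token objective reduces to a shifted cross-entropy, sketched below in NumPy. The tiny vocabulary and externally supplied logits are placeholders; in BrepARG the logits come from the causal transformer decoder.

```python
import numpy as np

def next_token_ce(logits, tokens):
    """Average next-token cross-entropy.
    logits: (T, V) predictions at each position; tokens: (T+1,) token ids.
    Position t predicts tokens[t + 1]."""
    targets = tokens[1:]
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()
```

With confident, correct logits the loss approaches zero; with uniform logits it equals log V, the entropy of a uniform guess.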
3.2 Cross-modal Alignment
BrepLLM (Deng et al., 18 Dec 2025) employs a CLIP-style contrastive loss between the B-rep global token and frozen ViT-L/14 CLIP text features:
$$\mathcal{L}_{\text{align}} = -\tfrac{1}{2}\left(\log p^{b \to t} + \log p^{t \to b}\right),$$
where $p^{b \to t}$ and $p^{t \to b}$ are cross-modal matching probabilities (softmax over temperature-scaled cosine similarities), with both embeddings $\ell_2$-normalized.
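A symmetric InfoNCE objective of this kind can be sketched in a few lines of NumPy. The temperature value and batch construction are assumptions for illustration; in BrepLLM one side of each pair is the B-rep global token and the other a frozen CLIP text embedding.

```python
import numpy as np

def info_nce(brep, text, tau=0.07):
    """Symmetric InfoNCE over N matched (brep, text) pairs.
    brep, text: (N, D) embeddings; row i of each matrix is a positive pair."""
    b = brep / np.linalg.norm(brep, axis=1, keepdims=True)   # l2-normalize
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    sim = b @ t.T / tau                    # (N, N) temperature-scaled cosine

    def xent(s):
        # -log softmax of the diagonal (the matched pair), averaged over rows.
        s = s - s.max(axis=1, keepdims=True)
        logp = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    # Average the brep->text and text->brep directions.
    return 0.5 * (xent(sim) + xent(sim.T))
```

Matched pairs drive the loss toward zero; shuffled pairings leave it large, which is what pushes the two modalities into a shared embedding space.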
3.3 Diffusion Models
BrepGen (Xu et al., 2024) uses tree-structured latent geometry as denoising diffusion model input, with standard DDPM forward/reverse processes and conditioning by hierarchical token addition (parent embeddings added to child tokens). Each tree level is reconstructed via Transformer-based denoisers, optimizing the mean squared error of the predicted denoising vector at each step.
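The forward (noising) half of this process has a closed form, sketched below for a flattened tree latent. The linear beta schedule and its length are conventional DDPM choices, not values taken from the paper.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # conventional linear schedule
alpha_bar = np.cumprod(1.0 - betas)       # cumulative signal retention

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I).
    Returns the noised latent and the noise, the denoiser's regression target."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps
```

At t = 0 the latent is nearly untouched; at t = T - 1 it is close to pure Gaussian noise, which is what the Transformer-based denoisers learn to invert level by level.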
3.4 Self-supervised Masked Graph Autoencoding
The Masked BRep Autoencoder (Li et al., 16 Mar 2026) applies input masking (≈70% of nodes/edges), a hierarchical transformer with cross-scale mutual attention plus local message passing, and a multi-term loss that reconstructs both latent and explicit geometry.
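The masking step can be sketched as follows: hide roughly 70% of the face nodes, replacing their features with a mask embedding (a zero vector here), while keeping the masked indices so the reconstruction loss can be restricted to hidden positions. The 70% ratio comes from the description above; the zero-vector mask and function signature are illustrative assumptions.

```python
import numpy as np

def mask_nodes(node_feats, ratio=0.7, rng=None):
    """Corrupt ~ratio of the node features; return the corrupted copy and the
    sorted indices of masked nodes (the reconstruction targets)."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(node_feats)
    n_mask = int(round(ratio * n))
    masked_idx = rng.choice(n, size=n_mask, replace=False)
    corrupted = node_feats.copy()
    corrupted[masked_idx] = 0.0      # stand-in for a learned [MASK] vector
    return corrupted, np.sort(masked_idx)
```

The encoder then sees only the corrupted graph; the decoder is scored on how well it recovers the geometry at the masked indices.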
4. Comparative Structure of Prominent Hierarchical BrepEncoders
| Framework | Input Structure | Hierarchy Type | Geometric Encoding | Topology Handling | Target Downstream Use |
|---|---|---|---|---|---|
| BrepARG (Li et al., 23 Jan 2026) | Sequence (tokens) | Hierarchical flat | VQ-VAE tokens, quant. bounding box | Index tokens, DFS ordering | Autoregressive B-rep generation |
| BrepLLM (Deng et al., 18 Dec 2025) | Face adjacency graph | Multi-branch fusion | Attentive PointTransformer/CNN per face | Edge-conditioned message passing | Cross-modal understanding, LLM grounding |
| BrepGen (Xu et al., 2024) | Latent geometry tree | Multi-level tree | VAEs for faces/edges, direct coords | Mating and association duplication | Diffusion-based B-rep synthesis |
| Masked BRep AE (Li et al., 16 Mar 2026) | gAAG (graph) | Cross-scale hier. | CNN (coarse/fine UV), MLPs | MPNN over explicit adjacency | Self-supervised representation learning |
Each method exploits explicit hierarchy in the B-rep: geometry is encoded at the primitive level, topology is handled either explicitly (sequence, graph) or implicitly (tree via duplication), and global pooling or attention is employed for holistic shape understanding.
5. Experimental Performance and Practical Implications
BrepARG achieves state-of-the-art results on DeepCAD (COV=75.45%, Novel=99.82%, Unique=99.80%, Valid=87.60%), surpassing prior graph-based generation techniques in coverage and validity (Li et al., 23 Jan 2026). Inference and training time are reduced compared to previous methods, reflecting the efficiency achieved by hierarchical flat tokenization.
These hierarchical encoders remove the need for complex multi-stage components (e.g., pointer networks, explicit graph-matching), enabling direct sequence or tree-based generative and representation learning. They also provide a unified path from low-level geometric samples to high-level tasks (generation, cross-modal alignment, few-shot recognition), supporting both conditional (text-to-CAD, segmentation) and unconditional workflows.
6. Extensions to Other CAD Paradigms and Future Directions
Hierarchical BrepEncoding provides a flexible template generalizable to other paradigms:
- CSG Sequences: Modeling constructive solid geometry via tokenized Boolean operations and primitive representations.
- NURBS/Free-form Surfaces: Tokenizing knot vectors and control points in analogy to UV patch sampling.
- Multi-Body Assemblies: Introducing hierarchical body-index tokens and expanded separator schemes to extend encoders to complex assemblies.
- Conditional and Chain-of-Thought Modeling: Enabling prompt-based or stepwise design intent specification.
Planned improvements include higher-fidelity quantizers (e.g., hierarchical VQ), more structured regularization to reduce geometric reprojection error, and deeper integration with text-guided or chain-of-thought generative workflows (Li et al., 23 Jan 2026).
7. Significance and Future Research Trajectories
The hierarchical BrepEncoder paradigm has established a foundation for robust B-rep representation across multiple domains: sequence generation, cross-modal grounding, diffusion-based synthesis, and self-supervised representation transfer. A plausible implication is that further integrating such encoders with foundation models for 3D and cross-modality tasks will accelerate data-driven design, manufacturing, and downstream semantic understanding. Open challenges include scaling to larger assemblies, incorporating physical properties, and handling multiple CAD representations in a fully unified fashion.