BrepLLM: Integrating 3D CAD and Language Models
- BrepLLM is a framework that bridges structured CAD boundary representations with language models, enabling native parsing and reasoning of 3D geometric data.
- It employs a two-stage training pipeline with adaptive sampling, hierarchical graph encoding, and contrastive pretraining to align geometric and textual semantics.
- BrepLLM achieves state-of-the-art performance on 3D captioning and classification benchmarks using the novel Brep2Text dataset with over 269K Brep–text pairs.
BrepLLM is a framework for enabling LLMs to natively parse, reason over, and generate knowledge from raw Boundary Representation (Brep) data, directly bridging structured 3D geometric/topological information and natural language. Standard LLMs and vision-LLMs process flat data (text, images, point clouds) as unstructured sequences, making them fundamentally incompatible with the graph-structured, watertight, and parametric characteristics inherent to Breps—the industry standard in Computer-Aided Design (CAD). BrepLLM introduces a two-stage pipeline that unifies adaptive geometric sampling, hierarchical graph encoding, cross-modal contrastive pretraining, and progressive multi-stage LLM fine-tuning, substantially outperforming prior methods on 3D captioning and classification benchmarks involving native Brep objects (Deng et al., 18 Dec 2025).
1. Foundations: Brep Data and Motivation
Boundary Representations (Breps) describe 3D solids via exact parametric surfaces (“faces”), trimmed curves (“edges”), and watertight adjacency graphs. Each face may possess a unique parameter domain, curvature, and normal field, while topological relations encode multi-face adjacencies and global assembly. Such structured representations are essential for CAD applications requiring precise geometry, high fidelity, and explicit feature awareness.
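To make the face/edge/adjacency structure concrete, the following is a minimal sketch of enumerating a Brep's faces and edges with the open-source pythonocc-core (OpenCASCADE) bindings. The paper does not specify a CAD kernel, and the file name is hypothetical; this is illustrative only.

```python
# Sketch: enumerating the faces and edges of a Brep solid with
# pythonocc-core (OpenCASCADE bindings). Illustrative, not the paper's code.
from OCC.Core.STEPControl import STEPControl_Reader
from OCC.Core.TopExp import TopExp_Explorer
from OCC.Core.TopAbs import TopAbs_FACE, TopAbs_EDGE
from OCC.Core.TopoDS import topods
from OCC.Core.BRep import BRep_Tool

reader = STEPControl_Reader()
reader.ReadFile("part.step")   # hypothetical input file
reader.TransferRoots()
shape = reader.OneShape()

# Each face carries an exact parametric surface with its own (u, v) domain.
faces = []
exp = TopExp_Explorer(shape, TopAbs_FACE)
while exp.More():
    face = topods.Face(exp.Current())
    surface = BRep_Tool.Surface(face)   # underlying parametric surface
    faces.append((face, surface))
    exp.Next()

# Trimmed curves ("edges") bound the faces and encode adjacency.
edges = []
exp = TopExp_Explorer(shape, TopAbs_EDGE)
while exp.More():
    edges.append(topods.Edge(exp.Current()))
    exp.Next()

print(f"{len(faces)} faces, {len(edges)} edges")
```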
Prior “CAD-LLM” approaches circumvent Brep processing by relying on procedural command histories (e.g., sequences of sketches and extrusions), which inherently lose access to explicit topology and fine geometric detail, limiting their reasoning and generation capacity. The modality gap thus precludes direct ingestion of structured Brep data by conventional LLMs or even multi-modal transformers, motivating the need for fundamentally new approaches (Deng et al., 18 Dec 2025).
2. Two-Stage Training Pipeline
The BrepLLM architecture is built around a two-stage training regime designed to align Brep geometry/topology with language representations and enable downstream LLM reasoning.
2.1 Cross-Modal Alignment Pre-training
Adaptive UV and Edge Sampling
- For each face $f_i$ with area $A_i$, adaptive sampling distributes $N_i$ points over its parameter domain $\Omega_i$, allocating the sample budget in proportion to area, e.g. $N_i = \lceil N_{\text{total}} \, A_i / \sum_j A_j \rceil$. The analogous normalization by length $L_k$ allocates $M_k$ samples to each edge $e_k$ (a sampling sketch follows this list).
- Face samples yield a 10-dimensional feature vector at each $(u, v)$ location: 3D point $\mathbf{p} \in \mathbb{R}^3$, normal $\mathbf{n} \in \mathbb{R}^3$, curvature $\kappa$, a binary flag, face type, and normalized area. Edges are encoded as 8-dimensional vectors with analogous geometric and type information.
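A minimal sketch of the area-proportional allocation described above; the exact budget `N_total` and the ceiling-based rounding are illustrative assumptions.

```python
import numpy as np

def allocate_samples(areas, n_total=2048):
    """Area-proportional sample allocation over faces (sketch of the
    adaptive UV sampling idea; the paper's exact scheme may differ)."""
    areas = np.asarray(areas, dtype=np.float64)
    return np.ceil(n_total * areas / areas.sum()).astype(int)

def sample_face_uv(n_i, rng=None):
    """Draw n_i (u, v) parameters uniformly in the unit parameter domain."""
    rng = rng or np.random.default_rng()
    return rng.uniform(0.0, 1.0, size=(n_i, 2))

# Example: faces with areas 1.0, 3.0, 6.0 receive ~10%, 30%, 60% of the
# budget, so large faces are not under-resolved relative to small ones.
counts = allocate_samples([1.0, 3.0, 6.0], n_total=1000)
uv_per_face = [sample_face_uv(n) for n in counts]
```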
Hierarchical BrepEncoder
- Face nodes $V$ and edge adjacencies $E$ form a graph $G = (V, E)$.
- Node features include:
- Fine-grained: PointTransformerV3 extracts per-face features $h_i^{\text{fine}}$ from the sampled face points.
- Edge-conditioned: NNConv derives features $h_i^{\text{edge}}$ conditioned on incident edge geometry.
- Global topology: EGATConv computes features $h_i^{\text{topo}}$ over the adjacency-graph nodes.
- Each node token $t_i$ combines the three feature streams; an attention-pooled global token $g$ enables full-graph summarization (see the fusion sketch below).
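The following is a minimal PyTorch sketch of the fusion and pooling step. The concatenation-based fusion, the dimensions, and the single-query attention pooling are illustrative assumptions; the PointTransformerV3/NNConv/EGATConv streams are assumed to run upstream.

```python
import torch
import torch.nn as nn

class NodeTokenFusion(nn.Module):
    """Sketch: fuse the three per-face feature streams into one node token,
    then attention-pool node tokens into a global graph token."""

    def __init__(self, d_fine=256, d_edge=128, d_topo=128, d_token=512):
        super().__init__()
        self.fuse = nn.Linear(d_fine + d_edge + d_topo, d_token)
        self.pool_query = nn.Parameter(torch.randn(1, 1, d_token))
        self.pool_attn = nn.MultiheadAttention(d_token, num_heads=8,
                                               batch_first=True)

    def forward(self, h_fine, h_edge, h_topo):
        # h_*: (batch, num_faces, d_*) per-face features from each stream.
        tokens = self.fuse(torch.cat([h_fine, h_edge, h_topo], dim=-1))
        # Attention pooling: a learned query attends over all node tokens,
        # producing one global token per graph.
        q = self.pool_query.expand(tokens.size(0), -1, -1)
        g, _ = self.pool_attn(q, tokens, tokens)
        return tokens, g.squeeze(1)   # node tokens + global token
```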
Contrastive Learning Objective
- The global token $g$ is projected via an MLP to an embedding $z^{B}$ and aligned with text embeddings $z^{T}$ (from a frozen CLIP text encoder) using a symmetric InfoNCE loss over batches of $N$ paired samples:

$$\mathcal{L}_{\text{align}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp\!\big(\operatorname{sim}(z^{B}_{i}, z^{T}_{i})/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\operatorname{sim}(z^{B}_{i}, z^{T}_{j})/\tau\big)} + \log\frac{\exp\!\big(\operatorname{sim}(z^{T}_{i}, z^{B}_{i})/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\operatorname{sim}(z^{T}_{i}, z^{B}_{j})/\tau\big)}\right]$$

where $\operatorname{sim}(\cdot,\cdot)$ is cosine similarity and $\tau$ a temperature (a sketch of this objective follows this subsection).
- Node tokens are passed forward for LLM integration; the global token is used solely for pre-training alignment.
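A minimal PyTorch sketch of the symmetric InfoNCE objective above; the temperature value is the common CLIP default, an assumption here.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(z_brep, z_text, tau=0.07):
    """Symmetric InfoNCE over a batch of paired (Brep, text) embeddings.
    z_brep, z_text: (N, d); matched pairs share a row index."""
    z_brep = F.normalize(z_brep, dim=-1)
    z_text = F.normalize(z_text, dim=-1)
    logits = z_brep @ z_text.t() / tau          # (N, N) similarity matrix
    labels = torch.arange(z_brep.size(0), device=z_brep.device)
    # Brep-to-text and text-to-Brep directions share the matched diagonal.
    loss_b2t = F.cross_entropy(logits, labels)
    loss_t2b = F.cross_entropy(logits.t(), labels)
    return 0.5 * (loss_b2t + loss_t2b)
```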
2.2 Multi-Stage LLM Fine-tuning
- Node tokens are projected into a vision-language backbone (e.g. Phi-2 with Q-Former).
Stage I: Geometry–Vision Bridging
- A two-layer MLP maps each node token into the Q-Former's input dimension; token sequences are truncated or zero-padded to a fixed length (see the sketch below).
- The Q-Former cross-attends to these projected tokens. Only the MLP and final head are initially trained, exploiting 2D vision-language priors to semantically bridge Brep tokens and language.
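A minimal sketch of this projection and padding step; all dimensions and the zero-padding choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BrepToQFormerBridge(nn.Module):
    """Stage-I sketch: a two-layer MLP projects node tokens into the
    Q-Former's input space; variable-length sequences are truncated or
    zero-padded to a fixed length."""

    def __init__(self, d_token=512, d_qformer=768, max_tokens=256):
        super().__init__()
        self.max_tokens = max_tokens
        self.proj = nn.Sequential(
            nn.Linear(d_token, d_qformer),
            nn.GELU(),
            nn.Linear(d_qformer, d_qformer),
        )

    def forward(self, node_tokens):
        # node_tokens: (batch, num_faces, d_token); num_faces varies per part.
        x = self.proj(node_tokens)[:, : self.max_tokens]
        pad = self.max_tokens - x.size(1)
        if pad > 0:
            x = torch.cat([x, x.new_zeros(x.size(0), pad, x.size(-1))], dim=1)
        return x   # fixed-shape input for Q-Former cross-attention
```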
Stage II: 3D–Language Alignment (LoRA Tuning)
- LoRA modules tune select Q-Former and LLM layers, biasing vision-language modules toward genuine 3D semantics.
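A sketch of attaching LoRA adapters with HuggingFace PEFT. The rank, scaling, dropout, and target module names are illustrative assumptions; the paper states only that select Q-Former and LLM layers are LoRA-tuned.

```python
# Sketch: LoRA adapters on a Phi-2 backbone via HuggingFace PEFT.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
lora_cfg = LoraConfig(
    r=16,                         # low-rank update dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()  # only LoRA weights are trainable
```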
Stage III: Mixture-of-Query Experts (MQE)
- The Stage-II Q-Former query set is frozen as $Q_0$.
- $E$ trainable expert query sets and a sparse router select the top-$k$ experts per sample.
- Each selected expert adds a residual to $Q_0$; only the residual experts and the router are trained, improving geometric diversity and stability (see the sketch below).
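A minimal PyTorch sketch of this residual mixture-of-query-experts. Conditioning the router on a pooled per-sample feature, the zero initialization of experts (so training starts from the Stage-II behavior), and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixtureOfQueryExperts(nn.Module):
    """Stage-III sketch: frozen base queries Q0, E residual expert query
    sets, and a sparse top-k router; only experts and router train."""

    def __init__(self, num_queries=32, d=768, num_experts=4, top_k=2):
        super().__init__()
        self.q0 = nn.Parameter(torch.randn(num_queries, d),
                               requires_grad=False)       # frozen base queries
        self.experts = nn.Parameter(
            torch.zeros(num_experts, num_queries, d))     # residuals start at 0
        self.router = nn.Linear(d, num_experts)
        self.top_k = top_k

    def forward(self, global_feat):
        # global_feat: (batch, d) per-sample conditioning (assumed here to
        # be a pooled Brep feature).
        scores = self.router(global_feat)                 # (B, E)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)                 # (B, k)
        residual = (self.experts[idx] *                   # (B, k, Q, d)
                    weights[..., None, None]).sum(dim=1)
        return self.q0.unsqueeze(0) + residual            # (B, Q, d)
```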
3. Brep2Text Dataset
BrepLLM introduces Brep2Text, the first large-scale dataset for instruction-tuning LLMs on raw Brep representations, totaling 269,444 Brep–text QA pairs. The dataset construction process:
- Source: Text2CAD corpus of 134,722 Brep models with detailed, human-written high-level descriptions.
- For each Brep, QA pairs are generated at two semantic prompt tiers:
- Abstract (function, category, global shape): e.g., “Identify function.”
- Beginner (constructive history): e.g., “List the sequence of sketch/extrude operations.”
- Qwen-Max is used to reverse-generate natural-language question–answer pairs from each model's human-written description, which serves as ground truth (an illustrative template sketch follows this list).
- Dataset split: 200 unique Brep models in the test set (no train/val overlap), with the remainder used for model development (Deng et al., 18 Dec 2025).
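To illustrate the two-tier construction, the following templates are hypothetical stand-ins; the exact prompts issued to Qwen-Max are not given in this summary.

```python
# Sketch: two-tier prompt construction for Brep2Text QA generation.
# Both templates are illustrative assumptions, not the paper's prompts.
ABSTRACT_PROMPT = (
    "Given the following human-written description of a CAD part, write a "
    "question about its function, category, or overall shape, and answer "
    "it using only the description.\n\nDescription: {description}"
)
BEGINNER_PROMPT = (
    "Given the following human-written description of a CAD part, write a "
    "question asking for its sequence of sketch/extrude operations, and "
    "answer it using only the description.\n\nDescription: {description}"
)

def build_prompts(description: str) -> dict:
    """Return both prompt tiers for one Brep's ground-truth description."""
    return {
        "abstract": ABSTRACT_PROMPT.format(description=description),
        "beginner": BEGINNER_PROMPT.format(description=description),
    }
```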
4. Experimental Evaluation
BrepLLM achieves superior results compared to point-cloud-based and mesh-based LLMs on both 3D object captioning and generative classification.
4.1 3D Object Captioning
On the Brep2Text dataset, evaluation uses automatic metrics (Qwen-Max, SBERT, SimCSE) and human annotation.
| Model | Input | Qwen-Max | SBERT | SimCSE | Human Prec (%) |
|---|---|---|---|---|---|
| PointLLM-7B | Pts Cloud | 46.81 | 65.72 | 66.05 | 74.60 |
| ShapeLLM-13B | Pts Cloud | 51.36 | 68.36 | 70.12 | 73.47 |
| MiniGPT-3D (2.7B) | Pts Cloud | 56.58 | 71.64 | 73.13 | 79.40 |
| BrepLLM (2.9B) | Brep | 58.89 | 73.05 | 74.46 | 81.85 |
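As an illustration of the SBERT-style similarity metric in the table above, the sketch below scores a generated caption against a reference with sentence-transformers. The checkpoint name and example strings are assumptions; the paper does not state its exact scorer configuration.

```python
# Sketch: SBERT-style caption similarity with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed checkpoint
generated = "A rectangular mounting bracket with two counterbored holes."
reference = "A flat bracket plate featuring two counterbored fastener holes."

emb = model.encode([generated, reference], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()       # cosine similarity
print(f"SBERT similarity: {score:.4f}")
```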
4.2 Generative 3D Object Classification
Two prompt styles are used: identity inquiry (“What is this?”) and sentence completion (“This is an object of…”). Model outputs are scored automatically by Qwen-Max.
| Model | Identity (%) | Completion (%) | Avg (%) |
|---|---|---|---|
| PointLLM-7B | 52.70 | 50.10 | 51.40 |
| ShapeLLM-13B | 52.90 | 53.70 | 53.30 |
| MiniGPT-3D | 55.70 | 54.10 | 54.90 |
| BrepLLM | 57.40 | 56.70 | 57.05 |
4.3 Ablation Studies
Ablations highlight the impact of each pipeline component:
| Component | Gain / Score (%) | Condition |
|---|---|---|
| Adaptive UV sampling | +2.05 | Stage I |
| Hierarchical BrepEncoder | +2.78 | Stage I |
| Full pipeline (Stages I+II+III) | 57.05 | All stages |
| MQE applied in Stage III only | 57.05 | Expert timing |
5. Contributions and Comparative Positioning
BrepLLM is distinguished by several contributions:
- The first framework for direct Brep-to-LLM reasoning and generation without procedural tokenization.
- Adaptive UV/length sampling schemes enabling translation of parametric geometry into high-fidelity, information-rich attribute graphs, supporting nuanced geometric and topological modeling.
- Hierarchical BrepEncoder integrating face, edge, and topology features into both sequence and global tokens—crucial for both local and holistic reasoning.
- A two-stage regime unifying CLIP-style Brep–text contrastive pretraining with three-stage progressive LLM fine-tuning that culminates in Mixture-of-Query Experts for geometric diversity.
- Introduction of Brep2Text, the first instruction-tuning dataset pairing raw Brep data with semantically rich text (269,444 pairs), establishing a new benchmark for Brep understanding.
- State-of-the-art performance on both 3D captioning and generative classification, realized using a compact 2.9B-parameter LLM (Deng et al., 18 Dec 2025).
6. Limitations and Prospects
BrepLLM currently targets single-part, open-CAD objects; multi-body assemblies, kinematic constraints, and in-the-loop constraint solving are outside its present operational domain. Brep2Text annotations are generated via automated LLM prompting; incorporating human-curated multi-turn dialogues and advanced geometric queries (e.g. constraint satisfaction, distance, angle) would widen the framework’s applicability. Potential extensions include scaling to larger LLM backbones, integrating additional modalities (rendered images, point clouds), and expanding Mixture-of-Query Experts for continual specialization, especially for sub-domains such as sheet metal and freeform surfaces (Deng et al., 18 Dec 2025).
A plausible implication is that as BrepLLM and related architectures mature, direct language-driven interaction with exact CAD geometry—without intermediate procedural detours—will become tractable for a wide range of engineering and scientific workflows, enabling high-fidelity, explainable, and semantically grounded geometric reasoning.