CAD-Tokenizer: A Framework for Text-to-CAD Generation
- CAD-Tokenizer is a specialized framework that converts CAD construction sequences, such as sketches and extrusions, into modality-aware primitive tokens.
- Its architecture employs a sequence-based VQ-VAE with an adapter module and FSA-constrained decoding to ensure syntactically valid and semantically rich tokenization.
- This approach enhances text-to-CAD generation and editing by preserving geometric fidelity and procedural accuracy, as shown by improved performance metrics.
A CAD-Tokenizer is a specialized framework designed to convert CAD construction sequences—such as sketches, extrusions, and other parametric operations—into discrete, modality-aware tokens optimized for downstream LLM processing. Unlike general-purpose tokenizers (e.g., byte-pair encoding, word-piece segmentation), which fragment CAD commands into linguistically derived units, a CAD-Tokenizer preserves primitive-level semantic structure, enabling accurate and efficient modeling of geometric and procedural relationships in both text-to-CAD generation and CAD editing workflows (Wang et al., 25 Sep 2025).
1. Motivation and Conceptual Foundations
Computer-Aided Design (CAD) workflows are inherently sequential and primitive-oriented, relying on ordered construction steps (sketches, extrusions, refinements) that can be edited and extended for prototyping. Generic LLM tokenizers decompose these sequences into word-piece fragments, which obscures semantic boundaries and impairs the attention mechanisms needed for reasoning about geometry and structure. The motivation for a CAD-Tokenizer is to create modality-specific tokenization—mapping each CAD operation and parameter to a distinct token—thus aligning the tokenization process with the native structure of CAD data.
This modality specificity allows the model to efficiently compress the procedural CAD history and attend to the essential operations, rather than overfitting to linguistic artifacts, punctuation, or fragmented terms (e.g., splitting “extrusion” into “extr”, “usion”). The approach conjectures that primitive-level tokens foster improved generation quality and editing capabilities by making geometric and procedural dependencies explicit (Wang et al., 25 Sep 2025).
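The contrast between word-piece fragmentation and primitive-level tokens can be sketched as follows; both vocabularies here are hypothetical illustrations, not the paper's actual token sets:

```python
# Illustrative comparison of generic subword splitting vs. modality-aware
# primitive tokens. Token names are made up for clarity.

# A generic word-piece tokenizer may fragment a CAD command into
# linguistically derived units that cut across semantic boundaries:
wordpiece_tokens = ["extr", "usion", "(", "depth", "=", "5", ".", "0", ")"]

# A primitive-level tokenizer maps each operation and parameter to one token:
primitive_tokens = ["<EXTRUSION>", "<DEPTH=5.0>"]

compression = len(wordpiece_tokens) / len(primitive_tokens)
print(f"{compression:.1f}x fewer tokens")  # → 4.5x fewer tokens
```

Beyond compression, the primitive tokens keep each operation's boundary intact, so attention heads can relate whole operations rather than fragments.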
2. Technical Architecture and Tokenization Pipeline
The core architecture of CAD-Tokenizer centers on a sequence-based Vector Quantized Variational Auto-Encoder (VQ-VAE), which processes CAD command sequences at the primitive level. Rather than using global pooling—which typically reduces the entire sequence to a single vector—CAD-Tokenizer introduces primitive-specific pooling layers. Each sketch–extrusion pair is encoded and pooled into its own token, independently of other primitives.
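The per-primitive pooling and quantization step can be sketched in plain Python; the dimensions, codebook values, and mean pooling used here are toy placeholders standing in for the learned encoder and codebook:

```python
# Sketch of primitive-specific pooling followed by nearest-codebook
# quantization. Each sketch-extrusion pair is pooled into one vector and
# mapped to one discrete token, independently of other primitives.

def mean_pool(vectors):
    """Pool the encoded commands of one sketch-extrusion pair into one vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def quantize(vec, codebook):
    """Replace a pooled vector with the index of its nearest codebook entry."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist2(vec, codebook[k]))

# Encoded command vectors for two primitives (sketch-extrusion pairs):
primitive_1 = [[0.9, 0.1], [1.1, -0.1]]   # pools to [1.0, 0.0]
primitive_2 = [[0.0, 1.0], [0.2, 0.8]]    # pools to [0.1, 0.9]
codebook = [[1.0, 0.0], [0.0, 1.0]]

tokens = [quantize(mean_pool(p), codebook) for p in (primitive_1, primitive_2)]
print(tokens)  # → [0, 1]
```

The key design point is that pooling happens per primitive rather than globally, so the token sequence length tracks the number of construction steps.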
Subsequently, an adapter module aligns these latent tokens (dimension $d_z$) with the LLM embedding space (dimension $d_{\text{LLM}}$). The adapter is trained to minimize a reconstruction loss of the form

$$\mathcal{L}_{\text{adapter}} = \lVert z_q - \hat{z}_q \rVert_2^2,$$

where $\hat{z}_q$ is obtained by mapping discrete tokens back into the VQ space using the LLM's logit and embedding layers.
To guarantee that the generated sequences always obey the strict grammar of CAD operations (e.g., correct ordering of sketch and extrusion commands), a Finite-State Automaton (FSA)-driven constrained decoding strategy is employed during inference. At each generation step, the FSA provides logit masks restricting outputs to grammatically valid tokens, thereby reducing syntactic and semantic errors.
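A minimal sketch of FSA-constrained decoding, assuming a simplified two-state grammar (sketch must precede extrusion) that stands in for the paper's full automaton:

```python
# At each step the FSA's current state determines which token types are
# allowed; all other logits are masked to -inf before the next token is
# picked. The vocabulary and grammar below are hypothetical placeholders.

NEG_INF = float("-inf")
VOCAB = ["<SKETCH>", "<EXTRUDE>", "<EOS>"]

# state -> set of grammatically permitted tokens
FSA = {
    "expect_sketch": {"<SKETCH>"},
    "expect_extrude": {"<EXTRUDE>"},
    "expect_any": {"<SKETCH>", "<EOS>"},
}
TRANSITION = {
    ("expect_sketch", "<SKETCH>"): "expect_extrude",
    ("expect_extrude", "<EXTRUDE>"): "expect_any",
    ("expect_any", "<SKETCH>"): "expect_extrude",
}

def mask_logits(logits, state):
    """Set logits of grammatically invalid tokens to -inf."""
    allowed = FSA[state]
    return [l if tok in allowed else NEG_INF for tok, l in zip(VOCAB, logits)]

def decode_step(logits, state):
    masked = mask_logits(logits, state)
    idx = max(range(len(masked)), key=lambda i: masked[i])  # greedy pick
    token = VOCAB[idx]
    return token, TRANSITION.get((state, token), state)

# Even if the raw model prefers <EXTRUDE> first, the mask forces <SKETCH>:
token, state = decode_step([0.1, 2.0, 0.5], "expect_sketch")
print(token, state)  # → <SKETCH> expect_extrude
```

Because invalid tokens receive probability zero, every decoded sequence is accepted by the grammar by construction; no post-hoc repair is needed.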
Relevant reconstruction and quantization objectives include:

$$\mathcal{L} = \mathcal{L}_{\text{EMD}} + \mathcal{L}_{\text{VQ}},$$

with $\mathcal{L}_{\text{EMD}}$ denoting the squared Earth Mover's Distance loss, and $\mathcal{L}_{\text{VQ}}$ representing the vector quantization loss aggregated per primitive.
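As a reminder of the standard VQ-VAE quantization objective, the per-primitive term can be computed as below; the per-primitive aggregation and the commitment weight are illustrative, not the paper's exact formulation:

```python
# Standard VQ loss for one primitive: codebook term ||sg(z_e) - e||^2 plus
# commitment term beta * ||z_e - sg(e)||^2. sg(.) is stop-gradient, which
# only affects backpropagation; numerically both terms are the same
# squared distance, so the forward value is (1 + beta) * ||z_e - e||^2.

def vq_loss(z_e, e, beta=0.25):
    d2 = sum((a - b) ** 2 for a, b in zip(z_e, e))
    return d2 + beta * d2

# Aggregate over primitives (toy encoder outputs and codebook entries):
pairs = [([1.0, 0.0], [0.9, 0.1]), ([0.1, 0.9], [0.0, 1.0])]
total = sum(vq_loss(z, e) for z, e in pairs)
print(round(total, 4))  # → 0.05
```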
3. Modality-Specific Tokenization and Representation
CAD-Tokenizer diverges from native language tokenizers by encoding each CAD instruction—whether it is a sketch, extrusion, or numerical parameter—as a discrete primitive token. Examples of primitive tokens are representations for “line,” “arc,” “circle,” as well as tokens for extrusion depth or feature type. This design yields compact, structure-aware representations more consistent with the operational logic employed by human CAD designers.
This approach supports both the initialization of new prototypes and sequential editing, with the sequence-based VQ-VAE producing per-primitive token pools that mirror actual construction steps. The modality-aware tokenization is beneficial for data compression and improves the trainability and generalization capabilities of LLM backbones tasked with CAD synthesis and editing.
4. Integration in Unified Text-Guided CAD Prototyping
CAD-Tokenizer is applied in unified text-guided CAD prototyping, seamlessly linking Text-to-CAD generation and CAD editing. The pipeline accepts prompts $(T, C)$, where $T$ is a natural language instruction and $C$ an optional existing CAD sequence. The system encodes $C$ (or generates a new sequence if $C$ is absent) into compact primitive tokens, concatenates them with the instruction, and fine-tunes the LLM on this input.
This enables the model to both initialize high-quality CAD objects and accurately modify existing shapes according to editing instructions. The FSA-constrained decoding further ensures syntactic correctness and operational validity.
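The prompt assembly for the two modes can be sketched as follows; the delimiter tokens and primitive-token names are hypothetical, chosen only to illustrate how editing and from-scratch generation share one input format:

```python
# Hypothetical assembly of the unified fine-tuning prompt: a natural
# language instruction plus the primitive tokens of an (optional)
# existing CAD sequence.

def build_prompt(instruction, cad_primitive_tokens=None):
    """Concatenate instruction text with primitive tokens (editing mode)
    or leave the CAD part empty (generation from scratch)."""
    parts = ["<INSTR>", instruction, "</INSTR>"]
    if cad_primitive_tokens:  # editing: condition on the existing shape
        parts += ["<CAD>"] + cad_primitive_tokens + ["</CAD>"]
    return " ".join(parts)

edit = build_prompt("make the hole deeper", ["<PRIM_17>", "<PRIM_3>"])
gen = build_prompt("a bracket with two holes")
print(edit)
print(gen)
```

A single input format lets one fine-tuned model serve both tasks, with the presence of the CAD segment signaling editing versus generation.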
5. Evaluation Metrics and Empirical Performance
CAD-Tokenizer demonstrates quantitative and qualitative improvements over general-purpose and specialist baselines. Evaluation metrics include F1 scores for sketches ($F1_{\text{skt}}$) and extrusions ($F1_{\text{ext}}$), Chamfer Distance (CD), Coverage (COV), Minimum Matching Distance (MMD), Jensen–Shannon Divergence (JSD), and Invalidity Ratio (IR). Lower CD and IR values, along with higher distributional scores, reflect superior geometric fidelity and semantic completeness.
Qualitative results show more balanced representations, improved instruction following, and higher reliability in both Text-to-CAD and editing tasks. The FSA constraint during inference minimizes syntactic errors, further boosting generation quality.
| Metric | CAD-Tokenizer | Baselines |
|---|---|---|
| F1 (sketch) | High | Moderate |
| F1 (extrusion) | High | Moderate |
| CD | Low | Higher |
| IR | Low | Higher |
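The Chamfer Distance used above can be sketched as follows; the point sets are toy data, whereas real evaluation samples point clouds from the generated and reference geometry:

```python
# Plain-Python sketch of the (squared) Chamfer Distance between two point
# sets: the average nearest-neighbor squared distance, symmetrized over
# both directions. Lower values mean closer geometry.

def chamfer_distance(P, Q):
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    def one_way(A, B):
        return sum(min(d2(a, b) for b in B) for a in A) / len(A)
    return one_way(P, Q) + one_way(Q, P)

P = [(0.0, 0.0), (1.0, 0.0)]  # reference points
Q = [(0.0, 0.0), (1.0, 0.1)]  # generated points, one slightly displaced
print(round(chamfer_distance(P, Q), 4))  # → 0.01
```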
6. Limitations and Future Directions
Current limitations include reduced capacity to model highly complex shapes due to gaps between open-source and private-sector CAD datasets, and the need for more nuanced evaluation metrics tailored to editing quality. The modality-specific tokenization approach establishes a foundation for future research directions, such as:
- Refining spatial and commonsense reasoning within LLM backbones.
- Developing more comprehensive CAD datasets to expand expressivity.
- Advancing evaluation metrics that more closely align with designer priorities, especially in shape preservation and edit validity.
A plausible implication is that extending modality-specific tokenization to additional primitives and operations (e.g., advanced fillets, shells, multi-body interactions) could further increase precision and usability in industrial prototyping scenarios.
7. Significance and Implications
The CAD-Tokenizer paradigm establishes an engineered pipeline—from primitive-level VQ-VAE tokenization and embedding-space alignment to FSA-constrained grammar enforcement—that allows LLMs to handle CAD sequences as structure-preserving, semantically meaningful tokens. This tailored approach addresses the core shortcomings of native LLM tokenizers, leading to more efficient, accurate, and flexible CAD prototyping and editing workflows. These advances hold significance for both academic research in multimodal generative modeling and for industrial adoption in computer-aided design systems (Wang et al., 25 Sep 2025).