Multi-Modal Flow Matching for B-Rep Generation
- Multi-Modal Flow Matching for B-Rep Generation is a generative modeling approach that reformulates CAD B-Reps as compositional k-cell particles for flexible, topology-aware synthesis.
- It utilizes a two-stage CC-VAE (cross-attention, a GCN, and a Set Transformer) to encode geometric and topological features, and a rectified flow that deterministically transports latent vectors for both conditional and unconditional generation.
- The framework supports versatile applications including local inpainting, non-manifold synthesis, and scalable reconstruction with improved validity, cyclomatic complexity, and geometric fidelity.
A multi-modal flow matching framework, as realized in "Flatten The Complex: Joint B-Rep Generation via Compositional k-Cell Particles" (Lu et al., 25 Jan 2026), refers to a generative modeling paradigm for CAD boundary representations (B-Reps) in which both geometric and topological components, across all cell dimensions, are encoded as compositional k-cell particles. The framework leverages flow matching in latent space, supporting unconditional generation, conditional generation (e.g., from a single view or a point cloud), and inpainting, with enhanced validity, topological flexibility, and robust editability.
1. Compositional k-Cell Particle Representation
Conventional B-Rep modeling adheres to the cell complex structure, explicitly encoding hierarchical relations among 0-cells (vertices), 1-cells (edges), and 2-cells (faces). The framework introduced in (Lu et al., 25 Jan 2026) reorganizes this hierarchy into a set of compositional k-cell particles, where each particle consists of:
- $p_i$: the spatial anchor of the cell (e.g., its centroid)
- $k_i$: the cell order (vertex, edge, or face)
- $z_i$: a learned latent embedding containing both geometric and graph-structural information
Boundary relations are captured over a Spatial Hasse diagram, with directed links encoding boundary inclusion (each cell linking to the lower-order cells on its boundary). Higher-order cells reuse the embeddings of their boundary constituents, and the geometry of each cell is decoded jointly from its own latent and the latents of its boundary cells (Eq. 1), enabling intrinsic sharing and coupling of local and global B-Rep attributes.
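The minimal Python sketch below illustrates one way such a particle set and its Hasse links could be represented; the class and field names (`CellParticle`, `pos`, `order`, `latent`, `boundary_links`) are illustrative assumptions, not the paper's interfaces.

```python
# Illustrative data structures for a compositional k-cell particle set.
from dataclasses import dataclass, field
import torch

@dataclass
class CellParticle:
    pos: torch.Tensor      # spatial anchor of the cell, e.g., its centroid, shape (3,)
    order: int             # cell order k: 0 = vertex, 1 = edge, 2 = face
    latent: torch.Tensor   # learned embedding with geometric + graph-structural info, shape (d,)

@dataclass
class BRepParticleSet:
    particles: list[CellParticle]
    # Directed Hasse links (i -> j): particle j lies on the boundary of particle i,
    # e.g., an edge links to its endpoint vertices, a face to its bounding edges.
    boundary_links: list[tuple[int, int]] = field(default_factory=list)

    def boundary_of(self, i: int) -> list[int]:
        """Indices of the lower-order cells on the boundary of particle i."""
        return [j for (src, j) in self.boundary_links if src == i]
```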
2. CC-VAE: Encoding and Decoding the Particle Set
Cell particle sets are first processed by a two-stage compositional cell variational autoencoder (CC-VAE). The encoder comprises:
- Local geometric injection via cross-attention from dense surface points to each particle embedding
- A 2-layer GCN over the Hasse relations
- A Set Transformer yielding posterior means and variances for each particle
The decoder reconstructs:
- particle positions
- cell types
- adjacency (link) probabilities, supervised via a masked focal loss on boundary inclusion

The geometric content of each cell is simultaneously reconstructed via Eq. (1). The total VAE loss combines these reconstruction terms, the binary cross-entropy/focal link loss, and a Kullback–Leibler regularizer.
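A schematic PyTorch sketch of how such a composite objective could be assembled follows; the focal-loss form, the loss weights `gamma` and `beta`, and the tensor layout are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cc_vae_loss(pos_pred, pos_gt, type_logits, type_gt,
                link_logits, link_gt, link_mask,
                geo_pred, geo_gt, mu, logvar,
                gamma=2.0, beta=1e-3):
    """Illustrative CC-VAE objective: reconstruction + masked focal link loss + KL."""
    # Particle position and per-cell geometry reconstruction (geometry decoded per Eq. 1).
    pos_loss = F.mse_loss(pos_pred, pos_gt)
    geo_loss = F.mse_loss(geo_pred, geo_gt)
    # Cell-type classification (vertex / edge / face).
    type_loss = F.cross_entropy(type_logits, type_gt)
    # Masked focal loss on boundary-inclusion (Hasse link) probabilities.
    p = torch.sigmoid(link_logits)
    bce = F.binary_cross_entropy_with_logits(link_logits, link_gt, reduction="none")
    focal = ((1 - p) * link_gt + p * (1 - link_gt)) ** gamma * bce
    link_loss = (focal * link_mask).sum() / link_mask.sum().clamp(min=1)
    # Gaussian KL regularization of each particle's posterior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return pos_loss + geo_loss + type_loss + link_loss + beta * kl
```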
3. Multi-Modal Rectified Flow Matching
After CC-VAE training, every B-Rep is represented as a fixed-length, unordered set of latent vectors. The set generator then models the generative process as a rectified flow, fitting a displacement-based flow model [Liu et al., 2023] that deterministically transports a Gaussian prior $z_0 \sim \mathcal{N}(0, I)$ to the empirical data law via the ODE

$$\frac{dz_t}{dt} = v_\theta(z_t, t), \qquad z_t = (1 - t)\,z_0 + t\,z_1,$$

with the flow net trained under the mean-squared velocity loss

$$\mathcal{L}_{\text{flow}} = \mathbb{E}_{t,\,z_0,\,z_1}\,\big\| v_\theta(z_t, t) - (z_1 - z_0) \big\|^2.$$

At inference, integrating from $t = 0$ to $t = 1$ synthesizes new B-Rep structures in latent space, either unconditionally or conditioned on a variable $c$ (e.g., an image or point cloud) by evaluating $v_\theta(z_t, t, c)$.
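The sketch below shows standard rectified-flow training and Euler sampling over the particle latents, assuming a generic `flow_net(z, t, cond)` velocity model; it follows the Liu et al. (2023) formulation and is not the paper's exact implementation.

```python
import torch

def rectified_flow_loss(flow_net, z1, cond=None):
    """One training step: regress the constant displacement z1 - z0 along a straight path."""
    z0 = torch.randn_like(z1)                      # Gaussian prior sample
    t = torch.rand(z1.shape[0], device=z1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (z1.dim() - 1)))
    zt = (1 - t_) * z0 + t_ * z1                   # linear interpolation path
    v_target = z1 - z0                             # straight-line velocity
    v_pred = flow_net(zt, t, cond)
    return torch.mean((v_pred - v_target) ** 2)

@torch.no_grad()
def sample(flow_net, shape, cond=None, steps=50, device="cpu"):
    """Euler integration of dz/dt = v(z, t) from the prior (t = 0) to data (t = 1)."""
    z = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        z = z + dt * flow_net(z, t, cond)
    return z
```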
Conditional generation is handled by a dual-stream transformer backbone based on MM-DiT [SD3], where noisy latent token sequences and frozen condition tokens (e.g., DINOv2 image embeddings, Sonata point cloud features) are ingested simultaneously. The flow model conditions on $c$ during both training and inference.
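As a rough illustration of how frozen condition tokens can be ingested alongside noisy latent tokens, the sketch below uses a single joint-attention transformer rather than the paper's MM-DiT dual-stream backbone; all dimensions, module choices, and names are assumptions.

```python
import torch
import torch.nn as nn

class ConditionalFlowNet(nn.Module):
    """Illustrative conditional velocity field: noisy particle latents attend jointly
    with frozen condition tokens; only latent-token outputs are read as velocities."""
    def __init__(self, dim=256, cond_dim=768, depth=4, heads=8):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, dim)
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, dim)

    def forward(self, z_t, t, cond_tokens):
        # z_t: (B, N, dim) noisy particle latents; cond_tokens: (B, M, cond_dim), frozen.
        h = z_t + self.time_mlp(t.view(-1, 1)).unsqueeze(1)   # broadcast time embedding
        c = self.cond_proj(cond_tokens)
        x = torch.cat([h, c], dim=1)                          # joint token sequence
        x = self.blocks(x)
        return self.head(x[:, : z_t.shape[1]])                # velocities for latent tokens only
```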
4. Functional Properties and Inference Capabilities
The particle-based set structure and the latent flow allow for:
- Unconditional generation: sampling from the prior and transporting latents via flow yields diverse B-Rep solids.
- Conditional generation: incorporating external modalities (single-view images, point clouds) as condition tokens allows the generation of solids consistent with observed evidence.
- Local inpainting: arbitrary masking of particles (i.e., holding some tokens, such as vertices or edges, fixed) during flow sampling enables local or partial completion (see the sketch after this list).
- Non-manifold synthesis: by restricting the token set (e.g., only 0- and 1-cells), the method can directly synthesize wireframe or other non-manifold structures.
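A minimal sketch of an inpainting-style sampling loop follows, reusing the generic `flow_net` interface assumed above: known particles are pinned to their straight noise-to-data trajectory at every step while masked particles are generated freely. This is an illustrative mechanism, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def inpaint_sample(flow_net, z_known, keep_mask, steps=50, cond=None):
    """Masked flow sampling: keep_mask marks particles whose latents are held fixed."""
    z0 = torch.randn_like(z_known)           # fixed prior draw
    z = z0.clone()
    mask = keep_mask.unsqueeze(-1).float()   # (B, N, 1), broadcast over latent dim
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        t_vec = torch.full((z.shape[0],), t, device=z.device)
        z = z + dt * flow_net(z, t_vec, cond)          # Euler step for all tokens
        # Re-impose the known particles on their straight path at time t + dt.
        z_ref = (1 - (t + dt)) * z0 + (t + dt) * z_known
        z = mask * z_ref + (1 - mask) * z
    return z                                  # kept tokens end exactly at z_known
```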
5. Experimental Results and Advantages over Prior Methodologies
Quantitative benchmarks on the DeepCAD and ABC datasets demonstrate that the multi-modal flow matching framework achieves strong distributional, validity, and topological-complexity metrics; for example, on ABC:
| Dataset | 1-NNA ↓ | MMD ↓ | JSD ↓ | Coverage ↑ | Validity ↑ | CC ↑ |
|---|---|---|---|---|---|---|
| ABC | 63.02 | 1.74 | 0.66 | 64.32 | 66.50 % | 12.92 |
Notable advantages include:
- Higher validity, especially for B-Reps with complex topology: validity remains stable as the topological complexity (e.g., number of faces) increases, whereas competing models falter on larger, more complex solids.
- Robust cyclomatic complexity (loop structure) and geometric fidelity.
- Inference-time scaling: increasing the number of tokens beyond the training regime (e.g., up to $1024$) improves validity without retraining.
- Versatility in tasks—unconditional generation, conditional reconstruction, local inpainting, and direct wireframe synthesis—without architectural modifications.
Qualitative analysis (Figures 6–9 of the paper) confirms sharp feature preservation, watertightness, and editability, as well as the ability to flexibly manipulate B-Rep substructures.
6. Broader Significance and Limitations
By flattening the B-Rep cell complex to a compositional set, jointly learning holistic encoding and decoding with a CC-VAE, and employing rectified flow for generation, this framework overcomes fundamental limitations of hierarchical, cascade-based, or strictly sequential methods:
- Topology-geometry coupling: explicit shared-latent boundaries enforce geometric and topological consistency across all orders.
- Parallelism and flexibility: the set-based approach supports global reasoning, unrestricted editing, and scalable synthesis.
- Failure modes: while validity and editability are improved, the set transformer's computational cost grows with the number of particles, and excessive masking or very high token counts may challenge decoder capacity.
Future work suggested in (Lu et al., 25 Jan 2026) includes scaling to even larger B-Reps, more advanced masking schemes, and extending the approach to richer input/output modalities and assemblies. The multi-modal flow matching framework currently represents the most holistic, edit-friendly, and contextually robust paradigm for B-Rep generative modeling.