Part-X-MLLM: 3D Part-Aware Multimodal LLM

Updated 20 November 2025
  • The paper introduces a unified 3D part-aware LLM that formulates recognition, captioning, question answering, hierarchical decomposition, and editing as structured token generation using an explicit context-free grammar.
  • It employs a dual-encoder architecture that disentangles geometric structure from semantic appearance and an autoregressive decoder that outputs explicit part commands for seamless integration with downstream 3D geometry engines.
  • Experimental results demonstrate state-of-the-art performance in segmentation, captioning, and part-level QA, outperforming competing models like PointLLM and ShapeLLM with improved precision and generality.

Part-X-MLLM is a unified 3D part-aware multimodal LLM that formulates a spectrum of 3D tasks (recognition, captioning, question answering, hierarchical decomposition, and part-level editing) as generation of executable, structured token sequences conforming to a context-free grammar. The model is distinguished by its dual-encoder design for disentangling geometric structure from semantic appearance, an autoregressive decoder trained to produce programs in an explicit part-based grammar, and a universal output interface that enables integration with downstream geometry engines for mesh synthesis and localized modification. Part-X-MLLM achieves state-of-the-art (SOTA) results in part-aware 3D reasoning, compositional generation, segmentation, captioning, and interactive editing, surpassing prior models such as PointLLM and ShapeLLM in both precision and generality (Wang et al., 17 Nov 2025).

1. Structured Programmatic Interface for 3D Part Reasoning

Part-X-MLLM recasts 3D multimodal interaction as the task of generating an executable plan in a small, explicit grammar over parts. The model produces sequences of tokens corresponding to statements for bounding-box definition, part labeling, addition, deletion, and modification. In its context-free grammar:

  • Each part is specified by a <boxs> token, six quantized coordinate tokens (range [0,127] for x, y, z min/max), an optional label or semantic description, and a closing <boxe>.
  • Edit commands (<adds>, <dels>, <mods>, with matching <adde>, <dele>, <mode>) operate over operand lists containing boxes or persistent part references.
  • The program structure enables persistent part identifiers (e.g., <Part_3>) for consistent reference throughout downstream tasks, supporting compositionality, explicit instruction-following, and direct geometric manipulation.
  • Quantization of 3D coordinates $x \in [-1, 1]$ uses $q(x) = \mathrm{round}\!\left(\frac{x+1}{2}(K-1)\right)$ with $K = 128$; dequantization recovers the continuous value (see the sketch after this list).
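
A minimal Python sketch of this coordinate quantization scheme (function names are ours, not the paper's):

```python
import numpy as np

K = 128  # number of bins; quantized tokens lie in [0, 127]

def quantize(x):
    """Map continuous coordinates in [-1, 1] to integer tokens in [0, K-1]."""
    return np.rint((np.asarray(x) + 1.0) / 2.0 * (K - 1)).astype(int)

def dequantize(q):
    """Recover approximate continuous coordinates from quantized tokens."""
    return np.asarray(q) / (K - 1) * 2.0 - 1.0

# Example: a box corner at (-0.8, 0.47, 0.12) round-trips with small error.
print(quantize([-0.8, 0.47, 0.12]))                # -> [13 93 71]
print(dequantize(quantize([-0.8, 0.47, 0.12])))    # -> [-0.795...  0.464...  0.118...]
```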

This interface decouples symbolic planning from geometric synthesis, allowing Part-X-MLLM outputs to drive part-based mesh generators (e.g., OmniPart), editors (Nano3D, VoxHammer), and captioning modules in a modular, model-agnostic fashion (Wang et al., 17 Nov 2025).
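
The productions implied by this description can be summarized in an EBNF-like sketch. This is our illustrative reconstruction from the description above, not the paper's exact grammar:

```python
# An EBNF-like sketch of the part grammar (our reconstruction; the paper's
# exact productions may differ, e.g., in where the label is placed).
GRAMMAR = r"""
program   ::= statement+
statement ::= box | add | delete | modify
box       ::= "<boxs>" coord coord coord coord coord coord "<boxe>" [label]
coord     ::= INT                        # quantized to [0, 127]
label     ::= WORD+                      # optional semantic description
add       ::= "<adds>" operand+ "<adde>"
delete    ::= "<dels>" operand+ "<dele>"
modify    ::= "<mods>" operand+ "<mode>"
operand   ::= box | part_ref
part_ref  ::= "<Part_" INT ">"           # persistent part identifier
"""
```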

2. Model Architecture: Dual Encoders and Autoregressive Decoding

2.1. Dual-Encoder Pathways

To separate geometric and appearance cues, Part-X-MLLM employs two encoders:

  • Structure Encoder $E_s$: processes raw 3D point clouds (XYZ coordinates and normals, e.g., a $40960 \times 6$ matrix) and downsamples them to 2048 tokens via a point-cloud VAE; initialized from the Hunyuan 3D VAE.
  • Semantic Encoder $E_a$: encodes color attributes (RGB + geometry, $10240 \times 6$ samples), also to 2048 tokens.

This bifurcation systematically eliminates the "structural–semantic conflict" inherent in single-path designs, yielding more accurate part detection and richer semantic descriptions.

2.2. Decoder and Attention

The fused encoder outputs $[E_s(X);\, E_a(X);\, \mathrm{TokenEmb}(\text{prompt})]$ serve as cross-attention context for a decoder-only transformer (initialized from Qwen 2.5 VL weights). Each decoding layer alternates masked self-attention over generated tokens with standard cross-attention to the concatenated encoder tokens. The final softmax layer produces both regular vocabulary and special grammar tokens.

Mathematically, at transformer layer $\ell$:

  • $Q = W_Q\, h^{\ell-1}$
  • $K = W_K\, [h^{\ell-1}; E_s; E_a]$, $V = W_V\, [h^{\ell-1}; E_s; E_a]$
  • $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V$

This design supports autoregressive generation—execution of the structured program—based on RGB point cloud input and natural-language queries (Wang et al., 17 Nov 2025).
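
The equations above fold self- and cross-attention into one fused form. A minimal single-head numpy sketch of that form follows; the shapes, names, and single-head simplification are our assumptions, and the causal mask over the generated-token block is omitted for brevity:

```python
import numpy as np

def decoder_attention(h_prev, E_s, E_a, W_Q, W_K, W_V):
    """Single-head attention per the equations above: queries come from the
    decoder state h^{l-1}; keys/values are computed over the decoder state
    concatenated with both encoders' tokens (causal mask omitted)."""
    context = np.concatenate([h_prev, E_s, E_a], axis=0)  # [T + N_s + N_a, d]
    Q = h_prev @ W_Q                                      # [T, d]
    K = context @ W_K
    V = context @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])               # scaled dot product
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)                    # row-wise softmax
    return w @ V                                          # [T, d]

# Toy shapes: 4 generated tokens; encoder token counts reduced for the demo.
rng = np.random.default_rng(0)
d = 16
h, Es, Ea = rng.normal(size=(4, d)), rng.normal(size=(8, d)), rng.normal(size=(8, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(decoder_attention(h, Es, Ea, Wq, Wk, Wv).shape)  # (4, 16)
```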

3. Pre-Training and Instruction Tuning Regime

3.1. Pre-Training

The initial stage (geometry-only pretraining) uses the structure encoder to predict quantized bounding-box programs from raw point clouds. The objective is

$$\mathcal{L}_{\text{pre}} = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t},\, E_s(X)).$$

Training uses 3.6 million objects, each comprising 40960 points with normals, for 10 epochs on 64 A100 GPUs.
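
A minimal PyTorch sketch of this next-token objective (tensor names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def pretrain_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Autoregressive NLL over a quantized box program.

    logits:  [T, V] decoder outputs, conditioned on E_s(X) and y_{<t}
    targets: [T]    ground-truth program token ids y_t
    """
    return F.cross_entropy(logits, targets, reduction="sum")  # -sum_t log P(y_t | ...)
```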

3.2. Instruction Tuning

Subsequent instruction tuning introduces the semantic encoder and expands the vocabulary with over 50 special tokens to support the grammar. The loss function is unchanged in form:

$$\mathcal{L}_{\text{tune}} = -\sum_{t=1}^{L} \log P(y_t \mid y_{<t},\, E_s(X),\, E_a(X),\, \text{Prompt}).$$

Only $E_a$, the decoder transformer, and the new token embeddings are updated. Training uses ~4.3 million instruction-target pairs spanning 11 task families, including box listing, part grounding, box-to-text, QA, and edit programs. $E_s$ and the base LLM embeddings are frozen to preserve geometric and general language capacity.
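
A hedged PyTorch sketch of this selective freezing; the module names and dimensions below are toy placeholders standing in for the real components:

```python
import torch
import torch.nn as nn

# Placeholder modules (toy dimensions, not the model's).
structure_encoder = nn.Linear(6, 8)             # E_s: frozen after pretraining
semantic_encoder = nn.Linear(6, 8)              # E_a: introduced and trained here
decoder = nn.TransformerDecoderLayer(8, 2)      # stands in for the decoder stack
base_token_embeddings = nn.Embedding(32000, 8)  # frozen base LLM vocabulary
new_token_embeddings = nn.Embedding(64, 8)      # >50 added grammar tokens

# Freeze E_s and the base vocabulary; train everything else.
for module in (structure_encoder, base_token_embeddings):
    for p in module.parameters():
        p.requires_grad = False

trainable = [p for m in (semantic_encoder, decoder, new_token_embeddings)
             for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)  # learning rate is illustrative
```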

4. Token Sequence Output and Downstream Integration

The output is a valid token sequence in the explicit grammar, encoding part hierarchy, structure, and edit operations. Example program:

<boxs> 12 34 56 78 90 12 <boxe> leg
<boxs> 15 31 65 80 95 35 <boxe> seat
<adds>
  <boxs> 10 20 30 40 50 60 <boxe> armrest
<adde>

Each part receives a persistent reference token (e.g., <Part_2>) for later use. These outputs control various geometry-aware downstream modules (a minimal parser sketch follows the list below):

  • Part-Aware Generation: The predicted boxes and optional captions parameterize per-part mesh synthesis layers (e.g. OmniPart diffusion head), producing editable assemblies.
  • Localized Editing: By emitting <dels>...<dele> commands (along with corresponding bounding boxes), the model enables direct part removal via geometry editors that operate over cuboid masks.
  • Compositional QA and Captioning: The same interface supports part-level question answering and structured 3D captioning, allowing users to ask, for instance, "Which part is the hinge?" with responses grounded in the persistent part tokens.
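
As a minimal illustration of consuming such programs, the sketch below parses box statements from the example above; the regular expression and helper names are ours, not the paper's tooling:

```python
import re

# Matches "<boxs> c1 c2 c3 c4 c5 c6 <boxe> label" with an optional label.
BOX_RE = re.compile(r"<boxs>\s+((?:\d+\s+){6})<boxe>(?:\s+(\w+))?")

def parse_boxes(program: str):
    """Extract (label, six quantized coords) records from a grammar program."""
    parts = []
    for coords, label in BOX_RE.findall(program):
        q = [int(c) for c in coords.split()]
        parts.append({"label": label or None, "box_q": q})
    return parts

program = """\
<boxs> 12 34 56 78 90 12 <boxe> leg
<boxs> 15 31 65 80 95 35 <boxe> seat
"""
print(parse_boxes(program))
# [{'label': 'leg', 'box_q': [12, 34, 56, 78, 90, 12]},
#  {'label': 'seat', 'box_q': [15, 31, 65, 80, 95, 35]}]
```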

This "language → program → mask → geometry engine" chain provides a single interface for unified, fine-grained control (Wang et al., 17 Nov 2025).

5. Experimental Results and Model Comparisons

Part-X-MLLM was benchmarked on the UniPart-Bench dataset (400 held-out objects, ~23 parts each) and evaluated on metrics including Voxel Recall, Voxel IoU, Bounding Box IoU, SBERT, SimCSE, BLEU-1, ROUGE-L, and METEOR for part-aware QA/captioning.

5.1. Bounding Box Generation (UniPart-Bench)

| Method | Voxel Recall (%) | Voxel IoU (%) | BBox IoU (%) |
|---|---|---|---|
| PartField | 69.65 | 46.04 | 37.33 |
| OmniPart (SAM Mask) | 68.33 | 43.34 | 34.33 |
| Part-X-MLLM (Ours) | 74.11 | 48.74 | 42.55 |

5.2. Encoder Ablation (IoU for Pure Box Listing)

| Model | IoU (%) |
|---|---|
| Dual (Ours) | 75.53 |
| Single | 68.47 |

5.3. Part QA (UniPart-Bench)

| Model | SBERT | SimCSE | BLEU-1 | ROUGE-L | METEOR |
|---|---|---|---|---|---|
| PointLLM-13B | 56.36 | 51.47 | 21.40 | 29.16 | 21.80 |
| ShapeLLM-13B | 61.19 | 57.26 | 23.32 | 32.56 | 24.45 |
| Part-X-MLLM (Ours) | 78.98 | 84.25 | 40.54 | 42.26 | 34.24 |

5.4. Object Captioning (UniPart-Bench)

| Model | SBERT | SimCSE | BLEU-1 | ROUGE-L | METEOR |
|---|---|---|---|---|---|
| PointLLM-13B | 43.51 | 43.12 | 13.54 | 15.74 | 17.45 |
| ShapeLLM-13B | 25.15 | 27.14 | 11.77 | 12.14 | 12.84 |
| Part-X-MLLM (Ours) | 53.82 | 51.97 | 36.04 | 38.11 | 30.71 |

Ablation experiments demonstrate that the dual-encoder architecture is critical for maximizing intersection-over-union (IoU) and that end-to-end symbolic planning significantly improves both localization and QA performance compared to prior scene-level 3D LLMs.

6. Limitations and Prospective Developments

  • Sequence Length and Decoding Latency: Detailed assemblies produce lengthy token sequences, potentially slowing autoregressive decoding. Hierarchical or grouped representations offer partial mitigation.
  • Shallow Face Assignment: Current assignments from bounding boxes to mesh faces use simple geometric heuristics; integrating richer geometric or texture signals could substantially improve face-level fidelity.
  • Language and Lifelong Learning: Freezing the base LLM during instruction-tuning preserves core language skills, but extensive 3D supervision may subtly bias general-purpose language priors. Continual or dynamic adaptation strategies remain to be explored.
  • Hierarchical and Parametric Extensions: Potential future directions include extending the grammar for procedural assembly rules, integrating CAD-style parametric primitives, and differentiably coupling the decoder with mesh synthesis engines.

By reframing 3D understanding, generation, and editing as program synthesis in a structured, part-centric language, Part-X-MLLM establishes a universal interface for interactive, semantically precise 3D AI systems (Wang et al., 17 Nov 2025).
