NDTokenizer3D: Unified 3D Vision–Language Model
- NDTokenizer3D is a unified 3D vision–language model that tokenizes scenes using a multi-scale Normal Distributions Transform.
- It employs a transformer-based decoder to fuse local geometric features with global context, enabling tasks like segmentation and VQA.
- The model integrates human-interactive prompting and achieves superior performance in 3D referring segmentation, dense captioning, and visual question answering.
NDTokenizer3D is a generalist 3D vision–language model (VLM) that performs holistic 3D scene tokenization and unifies diverse 3D understanding tasks, including human-interactive prompting and segmentation-mask decoding. Its principal innovation lies in a three-stage pipeline: (1) construction of a multi-scale Normal Distributions Transform (NDT) representation, (2) extraction of geometric features via a 3D encoder, and (3) hierarchical fusion of features through a Multi-Scale NDT Decoder (MSDec). This design compactly encodes global scene context and local geometric structure, enabling seamless integration with LLMs for tasks such as 3D referring segmentation, 3D visual question answering, and 3D dense captioning (Tang et al., 26 Nov 2025).
1. Multi-Scale Normal Distributions Transform (NDT) Representation
NDTokenizer3D begins by partitioning a raw point cloud into regular grids at $S$ scales ($s = 1, \dots, S$). Each scale subdivides the scene into voxels/cells $\{c_i^{(s)}\}$. The point distribution within each cell is modeled as a 3D Gaussian $\mathcal{N}(\mu_i, \Sigma_i)$, where
$$\mu_i = \frac{1}{|c_i|}\sum_{p \in c_i} p, \qquad \Sigma_i = \frac{1}{|c_i|}\sum_{p \in c_i} (p - \mu_i)(p - \mu_i)^{\top}$$
are the sample mean and covariance. Feature vectors for each cell incorporate geometric descriptors, including:
- $\mu_i$ (mean position)
- Eigenvalues of $\Sigma_i$
- Trace or normalized shape features
- Optionally, RGB color projected from cameras
Each cell is thus represented as a feature vector $f_i^{(s)}$, forming the multi-scale scene descriptor $\{F^{(s)}\}_{s=1}^{S}$.
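The per-cell statistics are simple to compute. The following NumPy sketch is an illustration of the NDT construction described above, not the released implementation; the cell sizes, the `min_points` threshold, and the feature layout are assumptions.

```python
import numpy as np

def ndt_cells(points, cell_size, min_points=5):
    """points: (N, 3) xyz array; returns {cell_index: per-cell NDT feature vector}."""
    cell_ids = np.floor(points / cell_size).astype(np.int64)
    features = {}
    # Group points by their voxel/cell index at this grid scale.
    for key in np.unique(cell_ids, axis=0):
        mask = np.all(cell_ids == key, axis=1)
        pts = points[mask]
        if len(pts) < min_points:            # too few samples for a stable covariance
            continue
        mu = pts.mean(axis=0)                # sample mean (cell center)
        sigma = np.cov(pts, rowvar=False)    # 3x3 sample covariance
        eigvals = np.sort(np.linalg.eigvalsh(sigma))[::-1]  # shape descriptors
        # Per-cell feature: mean position, covariance eigenvalues, trace.
        features[tuple(key)] = np.concatenate([mu, eigvals, [np.trace(sigma)]])
    return features

def multi_scale_ndt(points, cell_sizes=(0.2, 0.4, 0.8)):
    """Multi-scale descriptor: one NDT cell dictionary per grid scale."""
    return [ndt_cells(points, s) for s in cell_sizes]
```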
2. Multi-Scale NDT Decoder (MSDec) and Cross-Scale Fusion
MSDec is a transformer-based decoder stack with $S$ layers, where each layer fuses information from the corresponding grid scale via cross-attention and self-attention. At each layer, the encoder output supplies Keys and Values, while a set of learnable Queries aggregates scene information. The update at each layer proceeds as
$$Q_{s+1} = \mathrm{SelfAttn}\big(\mathrm{CrossAttn}(Q_s,\, K^{(s)},\, V^{(s)})\big),$$
where Keys and Values are projected from the scale-$s$ cell features: $K^{(s)} = F^{(s)} W_K$, $V^{(s)} = F^{(s)} W_V$. Initial queries are down-sampled from the finest-scale features, $Q_0 = \mathrm{Downsample}(F^{(1)})$. After cross-scale fusion via attention over all scales, the final set $Q_S$ encodes holistic scene information. Positional encodings representing 3D cell centers and scales are added to the per-cell features.
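A minimal PyTorch sketch of one such decoder layer is given below. The structure (cross-attention over one scale's cells, then self-attention, then a feed-forward block, with positional encodings added to the Key/Value features) follows the description above; the module name, dimensions, and normalization placement are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MSDecLayer(nn.Module):
    """One MSDec layer: learnable queries attend to the NDT cells of a single scale."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, queries, cell_feats, cell_pos):
        # Keys/Values come from one scale's NDT cells; positional encodings of the
        # 3D cell centers and scale are added to the per-cell features.
        kv = cell_feats + cell_pos
        q = self.norm1(queries + self.cross_attn(queries, kv, kv)[0])
        q = self.norm2(q + self.self_attn(q, q, q)[0])
        return self.norm3(q + self.ffn(q))

# One layer per grid scale; queries are initialized by down-sampling the
# finest-scale cell features and carry holistic scene context after the last layer.
```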
3. Holistic Scene Tokenization and Integration With LLMs
The output $Q_S \in \mathbb{R}^{N_q \times d}$ is projected into scene tokens for LLM consumption, $T_{\mathrm{scene}} = \phi(Q_S)$, where $\phi$ is applied row-wise as a linear layer or a small MLP mapping $d$ to the LLM embedding dimension. The resulting scene tokens form the visual context interfacing with the LLM. This enables downstream reasoning and language–vision alignment in tasks such as segmentation, referring expression comprehension, and VQA.
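A sketch of this projection is shown below; the specific dimensions (256 for the decoder, 4096 for the LLM) and the two-layer MLP form are assumptions for illustration.

```python
import torch.nn as nn

# Row-wise projection of the final queries Q_S into the LLM embedding space.
scene_projector = nn.Sequential(
    nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 4096)   # d_model -> d_LLM
)
# scene_tokens = scene_projector(Q_S)   # (num_queries, d_LLM), used as visual context
```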
4. Human-Interactive Prompting and Generalized Input Interface
NDTokenizer3D unifies interactive user guidance (points, boxes, masks) in its pipeline. Prompts are rasterized into a binary mask $m$ over finest-scale cells. Masked pooling yields a prompt feature
$$f_p = \frac{\sum_i m_i\, f_i^{(1)}}{\sum_i m_i}.$$
Projected as $q_p = W_p f_p$, it is appended to the query set $Q_0$. This augmented query propagates through MSDec, allowing incorporation of user guidance in cross-scale feature fusion. The resulting prompt token is mapped to a guidance embedding for the LLM, ensuring consistent semantic alignment between visual and interactive cues.
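A short sketch of this prompt encoding, under assumed shapes and with a hypothetical `encode_prompt` helper (not from the paper), follows.

```python
import torch
import torch.nn as nn

def encode_prompt(cell_feats, prompt_mask, proj):
    """cell_feats: (N_cells, d) finest-scale features; prompt_mask: (N_cells,) binary."""
    w = prompt_mask.float().unsqueeze(-1)
    # Masked average pooling over the cells covered by the user prompt.
    pooled = (cell_feats * w).sum(dim=0) / w.sum().clamp(min=1.0)
    return proj(pooled)                       # prompt query q_p

# q_p is appended to the learnable queries before MSDec fusion, e.g.:
# queries = torch.cat([queries, encode_prompt(f_fine, m, nn.Linear(d, d)).unsqueeze(0)])
```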
5. Segmentation-Mask Decoding and Training Objectives
For segmentation, the LLM emits a special [SEG] token with hidden state $h_{\mathrm{SEG}}$. A segmentation head produces the initial query $q_{\mathrm{seg}} = \mathrm{MLP}(h_{\mathrm{SEG}})$, which is then fused by MSDec as above. The mask head outputs a 3D kernel $k$, computing per-cell mask logits $z_i = \langle k, f_i^{(1)} \rangle$ and probabilities $\hat{m}_i = \sigma(z_i)$. Training leverages a multi-objective loss, including:
- Cross-entropy for semantic classification ($\mathcal{L}_{\mathrm{cls}}$)
- Binary cross-entropy and Dice loss for masks ($\mathcal{L}_{\mathrm{BCE}}$, $\mathcal{L}_{\mathrm{Dice}}$)
- Cosine similarity for CLIP-based semantic alignment ($\mathcal{L}_{\mathrm{CLIP}}$)
- Next-token cross-entropy for language ($\mathcal{L}_{\mathrm{LM}}$)
Aggregate pre-training and instruction-tuning objectives are weighted sums of these terms (the exact weighting is task-dependent); a minimal sketch of the mask head and loss combination follows.
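The sketch below illustrates the kernel-based mask logits and the BCE + Dice mask loss; the weighting coefficients and helper names are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mask_logits(kernel, cell_feats):
    # kernel: (d,) produced by the mask head; cell_feats: (N_cells, d) finest-scale features.
    return cell_feats @ kernel                  # per-cell logits z_i

def dice_loss(logits, target, eps=1.0):
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    return 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)

def segmentation_loss(logits, target, w_bce=1.0, w_dice=1.0):
    # Binary cross-entropy + Dice, the mask terms listed above.
    bce = F.binary_cross_entropy_with_logits(logits, target.float())
    return w_bce * bce + w_dice * dice_loss(logits, target.float())
```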
6. Empirical Performance and Ablation Insights
NDTokenizer3D demonstrates competitive performance relative to 3D-LLaVA across a range of benchmarks. Notably, it achieves higher mIoU (+3.3) for 3D Referring Segmentation (Multi3DRefer), improved CIDEr and METEOR scores for 3D VQA on ScanQA, and superior hallucination resilience in the 3D-POPE negative-sample accuracy regime.
| Task | 3D-LLaVA | NDTokenizer3D | Δ |
|---|---|---|---|
| Referring Seg (mIoU) | 42.7 | 46.0 | +3.3 |
| ScanQA (CIDEr) | 92.6 | 98.6 | +6.0 |
| Dense Captioning | 78.8 | 79.0 | +0.2 |
| 3D-POPE (Random, %) | 75.5 | 84.1 | +8.6 |
Ablations verify:
- Multi-scale NDT outperforms naïve downsampling
- Using three grid scales is optimal
- Approximately 800 queries suffice for effective scene summarization
7. Architectural Innovations and Significance
NDTokenizer3D's design centers on its multi-scale NDT representation, which enables compact, globally contextual, and locally precise encoding of raw point clouds. The MSDec module provides hierarchical cross-scale fusion and serves as the backbone for both holistic scene tokenization and interactive semantic manipulation. This unified approach facilitates integration with LLMs, allowing human-in-the-loop interaction, complex visual reasoning, and segmentation within a single architecture (Tang et al., 26 Nov 2025). A plausible implication is the emergence of architectures capable of flexible 3D visual–language processing under minimal design constraints, and the generalization of this paradigm to further spatially structured data domains.