
NDTokenizer3D: Unified 3D Vision–Language Model

Updated 29 November 2025
  • NDTokenizer3D is a unified 3D vision–language model that tokenizes scenes using a multi-scale Normal Distributions Transform.
  • It employs a transformer-based decoder to fuse local geometric features with global context, enabling tasks like segmentation and VQA.
  • The model integrates human-interactive prompting and achieves superior performance in 3D referring segmentation, dense captioning, and visual question answering.

NDTokenizer3D is a generalist 3D vision–language model (VLM) that performs holistic 3D scene tokenization and unifies diverse 3D understanding tasks, including human-interactive prompting and segmentation-mask decoding. Its principal innovation lies in a three-stage pipeline: (1) construction of a multi-scale Normal Distributions Transform (NDT) representation, (2) extraction of geometric features via a 3D encoder, and (3) hierarchical fusion of features through a Multi-Scale NDT Decoder (MSDec). This design compactly encodes global scene context and local geometric structure, enabling seamless integration with LLMs for tasks such as 3D referring segmentation, 3D visual question answering, and 3D dense captioning (Tang et al., 26 Nov 2025).

1. Multi-Scale Normal Distributions Transform (NDT) Representation

NDTokenizer3D begins by partitioning a raw point cloud $\mathbf{X} = \{x_i \in \mathbb{R}^3\}_{i=1}^{N_p}$ into regular grids at $R$ scales ($s = 1,\dots,R$). Each scale subdivides the scene into $N_s$ voxels/cells $\{C_s^j\}_{j=1}^{N_s}$. The point distribution within each cell is modeled as a 3D Gaussian, $p(x \mid C_s^j) = \mathcal{N}(x;\, \mu_s^j, \Sigma_s^j)$, where

$$\mu_s^j = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \Sigma_s^j = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_s^j)(x_i - \mu_s^j)^{T}$$

are the sample mean and covariance. Feature vectors for each cell incorporate geometric descriptors, including:

  • $\mu_s^j \in \mathbb{R}^3$ (mean position)
  • Eigenvalues $\{\lambda_{s,1}^j, \lambda_{s,2}^j, \lambda_{s,3}^j\}$ of $\Sigma_s^j$
  • Trace $\mathrm{tr}(\Sigma_s^j)$ or normalized shape features $\lambda_{s,i}^j / \sum_k \lambda_{s,k}^j$
  • Optionally, RGB color $c_s^j \in \mathbb{R}^3$ projected from cameras

Each cell is thus represented as $C_s^j = [\mu_s^j;\, \mathrm{vec}(\Sigma_s^j);\, c_s^j] \in \mathbb{R}^{15}$, forming the multi-scale scene descriptor $\mathbf{C}_s = [C_s^j]_{j=1}^{N_s}$.
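As a concrete illustration, the per-cell statistics above can be computed in a few lines of NumPy. The sketch below assumes a sparse-grid loop, arbitrary cell sizes, and a simple rule for skipping near-empty cells; none of these choices are prescribed by the paper.

```python
# Minimal sketch (NumPy) of the multi-scale NDT descriptor. Cell sizes, the
# sparse-grid bookkeeping, and the handling of cells with < 2 points are
# illustrative assumptions, not the paper's implementation.
import numpy as np

def ndt_cells(points, colors, cell_size):
    """Group points into a regular grid and fit a Gaussian per occupied cell.

    points : (N, 3) xyz coordinates; colors : (N, 3) RGB projected from cameras.
    Returns an (M, 15) array: [mean (3) | vec(cov) (9) | mean color (3)] per cell.
    """
    cell_ids = np.floor(points / cell_size).astype(np.int64)
    descriptors = []
    for key in np.unique(cell_ids, axis=0):        # only occupied cells are kept
        mask = np.all(cell_ids == key, axis=1)
        pts, cols = points[mask], colors[mask]
        if pts.shape[0] < 2:                       # covariance undefined for one point
            continue
        mu = pts.mean(axis=0)                              # sample mean, R^3
        sigma = np.cov(pts, rowvar=False)                  # unbiased covariance, R^{3x3}
        c = cols.mean(axis=0)                              # per-cell color, R^3
        descriptors.append(np.concatenate([mu, sigma.reshape(-1), c]))
    return np.stack(descriptors)

# Multi-scale descriptor: coarse-to-fine grids (the scale choices are assumptions).
points = np.random.rand(10_000, 3) * 8.0   # stand-in for a real scene
colors = np.random.rand(10_000, 3)
multi_scale = [ndt_cells(points, colors, s) for s in (2.0, 1.0, 0.5)]  # R = 3 scales
```

Running `ndt_cells` at several cell sizes yields the coarse-to-fine descriptor stack that the 3D encoder subsequently consumes.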

2. Multi-Scale NDT Decoder (MSDec) and Cross-Scale Fusion

MSDec is a transformer-based decoder stack with $R$ layers, where each layer $r$ fuses information from the corresponding grid scale via cross-attention and self-attention. At each layer, the encoder output $\mathbf{F}_r \in \mathbb{R}^{N_r \times d_f}$ supplies Keys and Values, while a set of $Q$ learnable Queries $\mathbf{Q}_r \in \mathbb{R}^{Q \times d_m}$ aggregates scene information. The update at each layer proceeds as:

$$\tilde{\mathbf{Q}}_r = \mathrm{CrossAttn}(\mathbf{Q}_r, \mathbf{K}_r, \mathbf{V}_r)$$

$$\hat{\mathbf{Q}}_r = \mathrm{SelfAttn}(\tilde{\mathbf{Q}}_r)$$

$$\mathbf{Q}_{r+1} = \mathrm{FFN}(\hat{\mathbf{Q}}_r)$$

where Keys and Values are projected as $\mathbf{K}_r = W_r^K \mathbf{F}_r$ and $\mathbf{V}_r = W_r^V \mathbf{F}_r$. Initial queries are down-sampled from the finest-scale features: $\mathbf{Q}_1 = W_1^Q(\downarrow \mathbf{F}_R)$. After cross-scale fusion via attention over all $R$ scales, the final set $\mathbf{Q}_{R+1}$ encodes holistic scene information. Positional encodings $\mathrm{PE}_s^j$ representing 3D cell centers and scales are added to the per-cell features.
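The layer update can be sketched in PyTorch as a standard cross-attention / self-attention / FFN block. Dimensions ($d_m$, $d_f$), head counts, normalization placement, and the toy feature shapes below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of one MSDec layer: cross-attention over one scale's NDT features,
# then self-attention and an FFN over the queries.
import torch
import torch.nn as nn

class MSDecLayer(nn.Module):
    def __init__(self, d_model=256, d_feat=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, kdim=d_feat,
                                                vdim=d_feat, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, queries, feats):
        # queries: (B, Q, d_model) learnable queries; feats: (B, N_r, d_feat)
        # encoder features of scale r (positional encodings assumed already added).
        q = self.norm1(queries + self.cross_attn(queries, feats, feats)[0])
        q = self.norm2(q + self.self_attn(q, q, q)[0])
        return self.norm3(q + self.ffn(q))

# One decoder layer per scale r = 1..R; each layer consumes that scale's features.
layers = nn.ModuleList(MSDecLayer() for _ in range(3))
queries = torch.randn(1, 800, 256)                            # ~800 queries, per the ablation
scales = [torch.randn(1, n, 256) for n in (512, 2048, 8192)]  # coarse -> fine features
for layer, feats in zip(layers, scales):
    queries = layer(queries, feats)                           # Q_{r+1} after fusing scale r
```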

3. Holistic Scene Tokenization and Integration With LLMs

The output $\mathbf{Q}_{R+1} \in \mathbb{R}^{Q \times d_m}$ is projected into $K$ scene tokens for LLM consumption:

$$T_k = W_t\,\mathrm{Flatten}(\mathbf{Q}_{R+1}) + b_t, \qquad k = 1, \dots, K$$

where $K = Q$ and the projection is applied row-wise or using a small MLP. The resulting scene tokens $\{T_k\}$ form the visual context $\mathbf{E}_V$ interfacing with the LLM. This enables downstream reasoning and language–vision alignment in tasks such as segmentation, referring expression comprehension, and VQA.
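A minimal sketch of this projection, assuming a single row-wise linear layer and a hypothetical LLM hidden size of 4096:

```python
# Scene-token projection: map the fused queries Q_{R+1} into the LLM embedding
# space (K = Q tokens). Hidden sizes and the single Linear layer are assumptions.
import torch
import torch.nn as nn

d_model, d_llm = 256, 4096
proj = nn.Linear(d_model, d_llm)               # W_t, b_t applied per query (row-wise)
fused_queries = torch.randn(1, 800, d_model)   # Q_{R+1}: (B, Q, d_model)
scene_tokens = proj(fused_queries)             # E_V: (B, K, d_llm) with K = Q
```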

4. Human-Interactive Prompting and Generalized Input Interface

NDTokenizer3D unifies interactive user guidance (points, boxes, masks) in its pipeline. Prompts are rasterized into a binary mask $m_u \in \{0,1\}^{N_R}$ over the finest-scale cells. Masked pooling yields a prompt feature:

$$\mathbf{F}_R^P = \frac{1}{\sum_j m_u[j]} \sum_{j=1}^{N_R} m_u[j]\,\mathbf{F}_R[j]$$

Projected as $\mathbf{Q}_1^P = W_P^Q \mathbf{F}_R^P$, it is appended to the query set: $\mathbf{Q}_1 \leftarrow [\mathbf{Q}_1; \mathbf{Q}_1^P]$. This augmented query propagates through MSDec, allowing user guidance to be incorporated in cross-scale feature fusion. The resulting prompt token $\mathbf{Q}_{R+1}^P$ is mapped to a guidance embedding $\mathbf{E}_P$ for the LLM, ensuring consistent semantic alignment between visual and interactive cues.
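The rasterize-and-pool step admits a short sketch; the mask construction, feature sizes, and the single linear projection for $W_P^Q$ below are assumptions for illustration.

```python
# Sketch of prompt rasterization and masked pooling into a single prompt query.
import torch
import torch.nn as nn

N_R, d_feat, d_model = 8192, 256, 256
F_R = torch.randn(N_R, d_feat)                 # finest-scale per-cell features
m_u = torch.zeros(N_R)                         # binary prompt mask over cells
m_u[100:150] = 1.0                             # e.g. cells covered by a user-drawn box

# Masked average pooling: F_R^P = sum_j m_u[j] * F_R[j] / sum_j m_u[j]
prompt_feat = (m_u[:, None] * F_R).sum(dim=0) / m_u.sum().clamp(min=1.0)

W_P = nn.Linear(d_feat, d_model)               # stands in for W_P^Q
prompt_query = W_P(prompt_feat)                # Q_1^P
queries = torch.randn(800, d_model)            # Q_1
queries = torch.cat([queries, prompt_query[None, :]], dim=0)   # [Q_1; Q_1^P]
```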

5. Segmentation-Mask Decoding and Training Objectives

For segmentation, the LLM emits a special [SEG] token with hidden state $\mathbf{H}^S$. A segmentation head $f_s$ produces the initial query $\mathbf{Q}_1^S = f_s(\mathbf{H}^S)$, which is then fused by MSDec as above. The mask head $f_m$ outputs a 3D kernel $\mathbf{k}$, from which per-cell mask logits and probabilities are computed:

$$\ell_j = \mathbf{F}_R[j] \cdot \mathbf{k}, \qquad M_j = \sigma(\ell_j)$$

Training leverages a multi-objective loss, including:

  • Cross-entropy for semantic classification ($\mathcal{L}_{cls}$)
  • Binary cross-entropy and Dice loss for masks ($\mathcal{L}_m$)
  • Cosine similarity for CLIP-based semantic alignment ($\mathcal{L}_s$)
  • Next-token cross-entropy for language ($\mathcal{L}_t$)

The aggregate pre-training and instruction-tuning losses are, respectively:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_1\,\mathcal{L}_m + \lambda_2\,\mathcal{L}_s$$

$$\mathcal{L} = \mathcal{L}_t + \lambda_3\,\mathcal{L}_m + \lambda_4\,\mathcal{L}_s(\mathbf{H}^{\hat a}, \mathbf{H}^{a})$$
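A compact sketch of the mask-decoding path and the mask loss term $\mathcal{L}_m$ (BCE plus Dice), with head shapes, the Dice formulation, and loss weighting treated as assumptions:

```python
# Sketch of [SEG]-driven mask decoding and the mask loss L_m.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_llm, d_model, d_feat, N_R = 4096, 256, 256, 8192
f_s = nn.Linear(d_llm, d_model)        # segmentation head: [SEG] hidden state -> query
f_m = nn.Linear(d_model, d_feat)       # mask head: fused query -> 3D kernel k

h_seg = torch.randn(d_llm)             # H^S, hidden state of the emitted [SEG] token
q_seg = f_s(h_seg)                     # Q_1^S (refinement through MSDec omitted here)
k = f_m(q_seg)                         # mask kernel k

F_R = torch.randn(N_R, d_feat)         # finest-scale cell features
logits = F_R @ k                       # l_j = F_R[j] . k
probs = torch.sigmoid(logits)          # M_j

target = (torch.rand(N_R) > 0.5).float()               # placeholder ground-truth mask
bce = F.binary_cross_entropy_with_logits(logits, target)
dice = 1 - (2 * (probs * target).sum() + 1) / (probs.sum() + target.sum() + 1)
loss_mask = bce + dice                 # L_m = BCE + Dice
```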

6. Empirical Performance and Ablation Insights

NDTokenizer3D demonstrates competitive performance relative to 3D-LLaVA across a range of benchmarks. Notably, it achieves higher mIoU (+3.3) for 3D referring segmentation on Multi3DRefer, improved CIDEr and METEOR scores for 3D VQA on ScanQA, and superior hallucination resilience under the 3D-POPE negative-accuracy regime.

| Task | 3D-LLaVA | NDTokenizer3D | Δ |
|---|---|---|---|
| Referring Seg (mIoU) | 42.7 | 46.0 | +3.3 |
| ScanQA (CIDEr) | 92.6 | 98.6 | +6.0 |
| Dense Captioning | 78.8 | 79.0 | +0.2 |
| 3D-POPE (Random, %) | 75.5 | 84.1 | +8.6 |

Ablations verify:

  • Multi-scale NDT outperforms naïve downsampling
  • A three-scale hierarchy is optimal
  • Approximately 800 queries suffice for effective scene summarization

7. Architectural Innovations and Significance

NDTokenizer3D's design centers on its multi-scale NDT representation, which enables compact, globally contextual, and locally precise encoding of raw point clouds. The MSDec module provides hierarchical cross-scale fusion and serves as the backbone for both holistic scene tokenization and interactive semantic manipulation. This unified approach facilitates integration with LLMs, allowing human-in-the-loop interaction, complex visual reasoning, and segmentation within a single architecture (Tang et al., 26 Nov 2025). A plausible implication is the emergence of architectures capable of flexible 3D visual–language processing under minimal design constraints, and the generalization of this paradigm to further spatially structured data domains.
