
NDTokenizer3D: Unified 3D Vision–Language Model

Updated 29 November 2025
  • NDTokenizer3D is a unified 3D vision–language model that tokenizes scenes using a multi-scale Normal Distributions Transform.
  • It employs a transformer-based decoder to fuse local geometric features with global context, enabling tasks like segmentation and VQA.
  • The model integrates human-interactive prompting and achieves superior performance in 3D referring segmentation, dense captioning, and visual question answering.

NDTokenizer3D is a generalist 3D vision–language model (VLM) that performs holistic 3D scene tokenization and unifies diverse 3D understanding tasks, including human-interactive prompting and segmentation-mask decoding. Its principal innovation lies in a three-stage pipeline: (1) construction of a multi-scale Normal Distributions Transform (NDT) representation, (2) extraction of geometric features via a 3D encoder, and (3) hierarchical fusion of features through a Multi-Scale NDT Decoder (MSDec). This design compactly encodes global scene context and local geometric structure, enabling seamless integration with LLMs for tasks such as 3D referring segmentation, 3D visual question answering, and 3D dense captioning (Tang et al., 26 Nov 2025).

1. Multi-Scale Normal Distributions Transform (NDT) Representation

NDTokenizer3D begins by partitioning a raw point cloud $\mathbf{X} = \{x_i \in \mathbb{R}^3\}_{i=1}^{N_p}$ into regular grids at $R$ scales ($s = 1,\dots,R$). Each scale subdivides the scene into $N_s$ voxels/cells $\{C_s^j\}_{j=1}^{N_s}$. The point distribution within each cell is modeled as a 3D Gaussian, $p(x \mid C_s^j) = \mathcal{N}(x;\, \mu_s^j, \Sigma_s^j)$, where

$$\mu_s^j = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \Sigma_s^j = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_s^j)(x_i - \mu_s^j)^{T}$$

are the sample mean and covariance. Feature vectors for each cell incorporate geometric descriptors, including:

  • $\mu_s^j \in \mathbb{R}^3$ (mean position)
  • Eigenvalues $\{\lambda_{s,1}^j, \lambda_{s,2}^j, \lambda_{s,3}^j\}$ of $\Sigma_s^j$
  • Trace $\mathrm{tr}(\Sigma_s^j)$ or normalized shape features $\lambda_{s,i}^j / \sum_k \lambda_{s,k}^j$
  • Optionally, RGB color $c_s^j \in \mathbb{R}^3$ projected from cameras

Each cell is thus represented as $C_s^j = [\mu_s^j;\, \mathrm{vec}(\Sigma_s^j);\, c_s^j] \in \mathbb{R}^{15}$, forming the multi-scale scene descriptor $\mathbf{C}_s = [C_s^j]_{j=1}^{N_s}$.
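As a concrete illustration, the per-cell statistics above can be computed in a few lines of NumPy. The sketch below assumes a sparse-grid loop, arbitrary cell sizes, and a simple rule for skipping near-empty cells; none of these choices are prescribed by the paper.

```python
# Minimal sketch (NumPy) of the multi-scale NDT descriptor. Cell sizes, the
# sparse-grid bookkeeping, and the handling of cells with < 2 points are
# illustrative assumptions, not the paper's implementation.
import numpy as np

def ndt_cells(points, colors, cell_size):
    """Group points into a regular grid and fit a Gaussian per occupied cell.

    points : (N, 3) xyz coordinates; colors : (N, 3) RGB projected from cameras.
    Returns an (M, 15) array: [mean (3) | vec(cov) (9) | mean color (3)] per cell.
    """
    cell_ids = np.floor(points / cell_size).astype(np.int64)
    descriptors = []
    for key in np.unique(cell_ids, axis=0):        # only occupied cells are kept
        mask = np.all(cell_ids == key, axis=1)
        pts, cols = points[mask], colors[mask]
        if pts.shape[0] < 2:                       # covariance undefined for one point
            continue
        mu = pts.mean(axis=0)                              # sample mean, R^3
        sigma = np.cov(pts, rowvar=False)                  # unbiased covariance, R^{3x3}
        c = cols.mean(axis=0)                              # per-cell color, R^3
        descriptors.append(np.concatenate([mu, sigma.reshape(-1), c]))
    return np.stack(descriptors)

# Multi-scale descriptor: coarse-to-fine grids (the scale choices are assumptions).
points = np.random.rand(10_000, 3) * 8.0   # stand-in for a real scene
colors = np.random.rand(10_000, 3)
multi_scale = [ndt_cells(points, colors, s) for s in (2.0, 1.0, 0.5)]  # R = 3 scales
```

Running `ndt_cells` at several cell sizes yields the coarse-to-fine descriptor stack that the 3D encoder subsequently consumes.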

2. Multi-Scale NDT Decoder (MSDec) and Cross-Scale Fusion

MSDec is a transformer-based decoder stack with $R$ layers, where each layer $r$ fuses information from the corresponding grid scale via cross-attention and self-attention. At each layer, the encoder output $\mathbf{F}_r \in \mathbb{R}^{N_r \times d_f}$ supplies Keys and Values, while a set of $Q$ learnable Queries $\mathbf{Q}_r \in \mathbb{R}^{Q \times d_m}$ aggregates scene information. The update at each layer proceeds as:

$$\tilde{\mathbf{Q}}_r = \mathrm{CrossAttn}(\mathbf{Q}_r, \mathbf{K}_r, \mathbf{V}_r)$$

$$\hat{\mathbf{Q}}_r = \mathrm{SelfAttn}(\tilde{\mathbf{Q}}_r)$$

$$\mathbf{Q}_{r+1} = \mathrm{FFN}(\hat{\mathbf{Q}}_r)$$

where Keys and Values are projected as $\mathbf{K}_r = W_r^K \mathbf{F}_r$ and $\mathbf{V}_r = W_r^V \mathbf{F}_r$. Initial queries are down-sampled from the finest-scale features: $\mathbf{Q}_1 = W_1^Q(\downarrow \mathbf{F}_R)$. After cross-scale fusion via attention over all $R$ scales, the final set $\mathbf{Q}_{R+1}$ encodes holistic scene information. Positional encodings $\mathrm{PE}_s^j$ representing 3D cell centers and scales are added to the per-cell features.
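The layer update can be sketched in PyTorch as a standard cross-attention / self-attention / FFN block. Dimensions ($d_m$, $d_f$), head counts, normalization placement, and the toy feature shapes below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of one MSDec layer: cross-attention over one scale's NDT features,
# then self-attention and an FFN over the queries.
import torch
import torch.nn as nn

class MSDecLayer(nn.Module):
    def __init__(self, d_model=256, d_feat=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, kdim=d_feat,
                                                vdim=d_feat, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, queries, feats):
        # queries: (B, Q, d_model) learnable queries; feats: (B, N_r, d_feat)
        # encoder features of scale r (positional encodings assumed already added).
        q = self.norm1(queries + self.cross_attn(queries, feats, feats)[0])
        q = self.norm2(q + self.self_attn(q, q, q)[0])
        return self.norm3(q + self.ffn(q))

# One decoder layer per scale r = 1..R; each layer consumes that scale's features.
layers = nn.ModuleList(MSDecLayer() for _ in range(3))
queries = torch.randn(1, 800, 256)                            # ~800 queries, per the ablation
scales = [torch.randn(1, n, 256) for n in (512, 2048, 8192)]  # coarse -> fine features
for layer, feats in zip(layers, scales):
    queries = layer(queries, feats)                           # Q_{r+1} after fusing scale r
```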

3. Holistic Scene Tokenization and Integration With LLMs

The output $\mathbf{Q}_{R+1} \in \mathbb{R}^{Q \times d_m}$ is projected into $K$ scene tokens for LLM consumption:

$$T_k = W_t\,\mathrm{Flatten}(\mathbf{Q}_{R+1}) + b_t, \qquad k = 1, \dots, K$$

where $K = Q$ and the projection is applied row-wise or using a small MLP. The resulting scene tokens $\{T_k\}$ form the visual context $\mathbf{E}_V$ interfacing with the LLM. This enables downstream reasoning and language–vision alignment in tasks such as segmentation, referring expression comprehension, and VQA.
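A minimal sketch of this projection, assuming a single row-wise linear layer and a hypothetical LLM hidden size of 4096:

```python
# Scene-token projection: map the fused queries Q_{R+1} into the LLM embedding
# space (K = Q tokens). Hidden sizes and the single Linear layer are assumptions.
import torch
import torch.nn as nn

d_model, d_llm = 256, 4096
proj = nn.Linear(d_model, d_llm)               # W_t, b_t applied per query (row-wise)
fused_queries = torch.randn(1, 800, d_model)   # Q_{R+1}: (B, Q, d_model)
scene_tokens = proj(fused_queries)             # E_V: (B, K, d_llm) with K = Q
```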

4. Human-Interactive Prompting and Generalized Input Interface

NDTokenizer3D unifies interactive user guidance (points, boxes, masks) in its pipeline. Prompts are rasterized into a binary mask $m_u \in \{0,1\}^{N_R}$ over the finest-scale cells. Masked pooling yields a prompt feature:

$$\mathbf{F}_R^P = \frac{1}{\sum_j m_u[j]} \sum_{j=1}^{N_R} m_u[j]\,\mathbf{F}_R[j]$$

Projected as $\mathbf{Q}_1^P = W_P^Q \mathbf{F}_R^P$, it is appended to the query set: $\mathbf{Q}_1 \leftarrow [\mathbf{Q}_1; \mathbf{Q}_1^P]$. This augmented query propagates through MSDec, allowing user guidance to be incorporated in cross-scale feature fusion. The resulting prompt token $\mathbf{Q}_{R+1}^P$ is mapped to a guidance embedding $\mathbf{E}_P$ for the LLM, ensuring consistent semantic alignment between visual and interactive cues.
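The rasterize-and-pool step admits a short sketch; the mask construction, feature sizes, and the single linear projection for $W_P^Q$ below are assumptions for illustration.

```python
# Sketch of prompt rasterization and masked pooling into a single prompt query.
import torch
import torch.nn as nn

N_R, d_feat, d_model = 8192, 256, 256
F_R = torch.randn(N_R, d_feat)                 # finest-scale per-cell features
m_u = torch.zeros(N_R)                         # binary prompt mask over cells
m_u[100:150] = 1.0                             # e.g. cells covered by a user-drawn box

# Masked average pooling: F_R^P = sum_j m_u[j] * F_R[j] / sum_j m_u[j]
prompt_feat = (m_u[:, None] * F_R).sum(dim=0) / m_u.sum().clamp(min=1.0)

W_P = nn.Linear(d_feat, d_model)               # stands in for W_P^Q
prompt_query = W_P(prompt_feat)                # Q_1^P
queries = torch.randn(800, d_model)            # Q_1
queries = torch.cat([queries, prompt_query[None, :]], dim=0)   # [Q_1; Q_1^P]
```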

5. Segmentation-Mask Decoding and Training Objectives

For segmentation, the LLM emits a special [SEG] token with hidden state $\mathbf{H}^S$. A segmentation head $f_s$ produces the initial query $\mathbf{Q}_1^S = f_s(\mathbf{H}^S)$, which is then fused by MSDec as above. The mask head $f_m$ outputs a 3D kernel $\mathbf{k}$, from which per-cell mask logits and probabilities are computed:

$$\ell_j = \mathbf{F}_R[j] \cdot \mathbf{k}, \qquad M_j = \sigma(\ell_j)$$

Training leverages a multi-objective loss, including:

  • Cross-entropy for semantic classification ($\mathcal{L}_{cls}$)
  • Binary cross-entropy and Dice loss for masks ($\mathcal{L}_m$)
  • Cosine similarity for CLIP-based semantic alignment ($\mathcal{L}_s$)
  • Next-token cross-entropy for language ($\mathcal{L}_t$)

The aggregate pre-training and instruction-tuning losses are, respectively:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_1\,\mathcal{L}_m + \lambda_2\,\mathcal{L}_s$$

$$\mathcal{L} = \mathcal{L}_t + \lambda_3\,\mathcal{L}_m + \lambda_4\,\mathcal{L}_s(\mathbf{H}^{\hat a}, \mathbf{H}^{a})$$
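A compact sketch of the mask-decoding path and the mask loss term $\mathcal{L}_m$ (BCE plus Dice), with head shapes, the Dice formulation, and loss weighting treated as assumptions:

```python
# Sketch of [SEG]-driven mask decoding and the mask loss L_m.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_llm, d_model, d_feat, N_R = 4096, 256, 256, 8192
f_s = nn.Linear(d_llm, d_model)        # segmentation head: [SEG] hidden state -> query
f_m = nn.Linear(d_model, d_feat)       # mask head: fused query -> 3D kernel k

h_seg = torch.randn(d_llm)             # H^S, hidden state of the emitted [SEG] token
q_seg = f_s(h_seg)                     # Q_1^S (refinement through MSDec omitted here)
k = f_m(q_seg)                         # mask kernel k

F_R = torch.randn(N_R, d_feat)         # finest-scale cell features
logits = F_R @ k                       # l_j = F_R[j] . k
probs = torch.sigmoid(logits)          # M_j

target = (torch.rand(N_R) > 0.5).float()               # placeholder ground-truth mask
bce = F.binary_cross_entropy_with_logits(logits, target)
dice = 1 - (2 * (probs * target).sum() + 1) / (probs.sum() + target.sum() + 1)
loss_mask = bce + dice                 # L_m = BCE + Dice
```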

6. Empirical Performance and Ablation Insights

NDTokenizer3D demonstrates competitive performance relative to 3D-LLaVA across a range of benchmarks. Notably, it achieves higher mIoU (+3.3) for 3D referring segmentation on Multi3DRefer, improved CIDEr and METEOR scores for 3D VQA on ScanQA, and superior hallucination resilience under the 3D-POPE negative-accuracy regime.

| Task | 3D-LLaVA | NDTokenizer3D | Δ |
|---|---|---|---|
| Referring Seg (mIoU) | 42.7 | 46.0 | +3.3 |
| ScanQA (CIDEr) | 92.6 | 98.6 | +6.0 |
| Dense Captioning | 78.8 | 79.0 | +0.2 |
| 3D-POPE (Random, %) | 75.5 | 84.1 | +8.6 |

Ablations verify:

  • Multi-scale NDT outperforms naïve downsampling
  • A three-scale hierarchy is optimal
  • Approximately 800 queries suffice for effective scene summarization

7. Architectural Innovations and Significance

NDTokenizer3D's design centers on its multi-scale NDT representation, which enables compact, globally contextual, and locally precise encoding of raw point clouds. The MSDec module provides hierarchical cross-scale fusion and serves as the backbone for both holistic scene tokenization and interactive semantic manipulation. This unified approach facilitates integration with LLMs, allowing human-in-the-loop interaction, complex visual reasoning, and segmentation within a single architecture (Tang et al., 26 Nov 2025). A plausible implication is the emergence of architectures capable of flexible 3D visual–language processing under minimal design constraints, and the generalization of this paradigm to further spatially structured data domains.
