Multi-Scale Normal Distributions Transform
- Multi-Scale Normal Distributions Transform is a method that divides 3D space into voxel cells, modeling local geometry with Gaussian distributions and capturing both spatial and color features.
- The approach employs a transformer decoder to fuse cross-scale features, significantly improving the accuracy of 3D vision-language tasks.
- It underpins NDTokenizer3D by enabling precise segmentation, visual question answering, and interactive prompting in complex 3D environments.
NDTokenizer3D is a generalist 3D vision–LLM that transforms raw, high-resolution point clouds into holistic scene tokens for downstream language modeling tasks. It employs a three-stage pipeline: constructing multi-scale Normal Distributions Transform (NDT) representations, extracting per-cell 3D geometric features, and fusing these cross-scale features using a Multi-Scale NDT Decoder (MSDec) to produce compact, information-rich embeddings. These holistic scene tokens natively support a variety of 3D understanding tasks—referring segmentation, visual question answering (VQA), dense captioning, situated QA—and human interaction primitives such as point-, box-, and mask-based prompting, thereby unifying spatial and language reasoning within a single architecture (Tang et al., 26 Nov 2025).
1. Multi-Scale NDT Representation
NDTokenizer3D initiates its pipeline by partitioning the input point cloud at $S$ regular grid scales ($s = 1, \dots, S$), each subdividing the space into voxel cells $c$. Within each cell $c$ containing points $P_c$, the local geometry is modeled by a 3D Gaussian distribution $\mathcal{N}(\mu_c, \Sigma_c)$, where

$$\mu_c = \frac{1}{|P_c|} \sum_{p \in P_c} p, \qquad \Sigma_c = \frac{1}{|P_c|} \sum_{p \in P_c} (p - \mu_c)(p - \mu_c)^{\top}.$$

From each cell, geometric descriptors are extracted, including the mean $\mu_c$, the three eigenvalues $\lambda_1 \ge \lambda_2 \ge \lambda_3$ of the covariance $\Sigma_c$, and optional normalized shape features. Color features can be projected from RGB cameras. The resulting per-cell vector

$$f_c = \big[\,\mu_c,\ \lambda_1, \lambda_2, \lambda_3,\ \text{shape}(c),\ \text{color}(c)\,\big]$$

captures both spatial and appearance attributes. For each scale $s$, all cell descriptors are stacked as $F^{(s)} \in \mathbb{R}^{N_s \times d}$, with one row per occupied cell.
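The per-cell statistics above follow the standard NDT Gaussian fit, so a minimal NumPy sketch can make the construction of one scale concrete. The cell size, descriptor layout, minimum point count, and function names below are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def ndt_cells(points, colors, cell_size, min_points=4):
    """Fit one Gaussian per occupied voxel cell and return per-cell descriptors.

    points: (N, 3) xyz coordinates; colors: (N, 3) RGB in [0, 1].
    Returns an (M, 12) array per occupied cell: mean (3), covariance
    eigenvalues (3), shape features (linearity, planarity, sphericity), mean color (3).
    """
    # Assign every point to a voxel cell at this scale.
    cell_ids = np.floor(points / cell_size).astype(np.int64)
    _, inverse = np.unique(cell_ids, axis=0, return_inverse=True)

    descriptors = []
    for c in range(inverse.max() + 1):
        idx = np.where(inverse == c)[0]
        if len(idx) < min_points:        # need a few points for a stable covariance
            continue
        p = points[idx]
        mu = p.mean(axis=0)
        cov = np.cov(p, rowvar=False)
        # Eigenvalues sorted descending describe the local shape of the cell.
        lam = np.sort(np.linalg.eigvalsh(cov))[::-1]
        linearity  = (lam[0] - lam[1]) / (lam[0] + 1e-9)
        planarity  = (lam[1] - lam[2]) / (lam[0] + 1e-9)
        sphericity = lam[2] / (lam[0] + 1e-9)
        rgb = colors[idx].mean(axis=0)
        descriptors.append(np.concatenate([mu, lam, [linearity, planarity, sphericity], rgb]))
    return np.stack(descriptors)          # F^(s): one row per occupied cell

# Three scales (coarse to fine) would give the multi-scale NDT stack, e.g.:
# scales = [ndt_cells(pts, rgb, s) for s in (0.4, 0.2, 0.1)]
```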
2. Multi-Scale NDT Decoder (MSDec) and Cross-Scale Fusion
The MSDec module consists of a transformer decoder stack with $L$ layers. At each layer $l$, the scale-$s$ encoder outputs $F^{(s)}$ serve as Key–Value pairs, and a set of learnable queries $Q^{(l)} \in \mathbb{R}^{N_q \times d}$ serve as Query vectors. The queries are updated by sequential cross-attention to each scale, followed by self-attention and a feed-forward network:

$$\tilde{Q}^{(l)}_{s} = \mathrm{CrossAttn}\big(\tilde{Q}^{(l)}_{s-1},\, F^{(s)},\, F^{(s)}\big), \quad s = 1, \dots, S, \qquad \tilde{Q}^{(l)}_{0} = Q^{(l-1)},$$

$$Q^{(l)} = \mathrm{FFN}\big(\mathrm{SelfAttn}(\tilde{Q}^{(l)}_{S})\big),$$

with $\mathrm{CrossAttn}(Q, K, V) = \mathrm{softmax}\!\big(QK^{\top}/\sqrt{d}\big)V$ applied per attention head. Cross-scale fusion in MSDec thus interleaves coarse and fine NDT information into a single query set at every decoder layer.
Learnable positional embeddings encoding 3D center and scale are incorporated into encoder features for spatial awareness. Ablation results confirm that multi-scale NDT markedly surpasses naïve downsampling and that three scales are optimal.
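A minimal PyTorch sketch of one such decoder layer is shown below, assuming the sequential per-scale cross-attention described above. Module names, dimensions, and normalization placement are assumptions rather than the released MSDec architecture.

```python
import torch
import torch.nn as nn

class MSDecLayer(nn.Module):
    """One MSDec layer: queries attend to each NDT scale in turn, then self-attend."""
    def __init__(self, d_model=256, n_heads=8, n_scales=3):
        super().__init__()
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_scales)]
        )
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_scales + 2)])

    def forward(self, queries, scale_feats):
        # queries: (B, Nq, d); scale_feats: list of (B, N_s, d) per NDT scale,
        # already carrying positional embeddings for cell center and scale.
        for i, feats in enumerate(scale_feats):
            attn_out, _ = self.cross_attn[i](queries, feats, feats)
            queries = self.norms[i](queries + attn_out)          # cross-scale fusion
        attn_out, _ = self.self_attn(queries, queries, queries)
        queries = self.norms[-2](queries + attn_out)             # query self-attention
        return self.norms[-1](queries + self.ffn(queries))       # feed-forward update
```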
3. Holistic Scene Tokenization and LLM Integration
Following the $L$ MSDec layers, the decoder output $Q^{(L)}$ is linearly projected into scene tokens for the LLM:

$$T = Q^{(L)} W_{\text{proj}}, \qquad T \in \mathbb{R}^{N_q \times d_{\text{LLM}}}.$$

These tokens compose the visual context for the multimodal LLM, serving as a high-level scene encoding for vision–language tasks.
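As a hedged illustration, this projection can be as simple as a single linear layer into the LLM embedding width, with the scene tokens prepended to the text embeddings; the dimensions and variable names below are assumptions.

```python
import torch
import torch.nn as nn

d_model, d_llm = 256, 4096                   # assumed query and LLM widths
to_scene_tokens = nn.Linear(d_model, d_llm)  # W_proj

queries = torch.randn(1, 800, d_model)       # output of the last MSDec layer
scene_tokens = to_scene_tokens(queries)      # (1, 800, d_llm) scene tokens T

text_embeds = torch.randn(1, 32, d_llm)      # embedded instruction tokens
llm_input = torch.cat([scene_tokens, text_embeds], dim=1)  # visual context + text
```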
4. Unified Interface for Human-Interactive Prompting and Segmentation Decoding
NDTokenizer3D’s architecture natively supports human-interactive input (points, bounding boxes, masks). A prompt is rasterized into a binary mask $m \in \{0,1\}^{N_1}$ over the finest-scale cells and pooled from the fine-scale features:

$$q_{\text{prompt}} = \frac{\sum_{c} m_c\, f^{(1)}_c}{\sum_{c} m_c}.$$

This vector is projected to the query dimension and concatenated to the initial queries, propagating prompt-guided information throughout the MSDec hierarchy. The final prompt token is projected to form a guidance embedding for the LLM.
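The following sketch illustrates this prompt path for a point prompt, assuming a simple radius-based rasterization over the finest-scale cells and masked average pooling; all helper names, shapes, and the radius parameter are hypothetical.

```python
import torch
import torch.nn as nn

def prompt_token(fine_feats, fine_centers, prompt_point, radius, proj):
    """fine_feats: (N1, d) finest-scale cell features; fine_centers: (N1, 3) cell centers.
    prompt_point: (3,) clicked location; proj: nn.Linear(d, d_query)."""
    # Rasterize the point prompt into a binary mask over fine cells
    # (boxes and masks would select cells by containment instead).
    dists = torch.cdist(fine_centers[None], prompt_point[None, None])[0, :, 0]
    mask = dists < radius
    # Masked average pooling of the selected cell features.
    pooled = fine_feats[mask].mean(dim=0)
    # Projected prompt vector, to be concatenated to the initial MSDec queries.
    return proj(pooled)

# Example usage with assumed dimensions:
# proj = nn.Linear(96, 256)
# q_prompt = prompt_token(feats1, centers1, torch.tensor([1.2, 0.4, 0.8]), 0.3, proj)
```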
For segmentation decoding, when the LLM emits [SEG], the corresponding hidden state $h_{\text{seg}}$ triggers a segmentation head generating a query $q_{\text{seg}}$. MSDec processes this query into a refined embedding, producing a mask kernel $k$; per-cell mask logits and probabilities over the finest-scale features are computed by

$$z_c = k^{\top} f^{(1)}_c, \qquad p_c = \sigma(z_c),$$

with sigmoid activation $\sigma$. Multiple loss components are applied for pre-training and instruction tuning (a sketch of the decoding path and mask loss follows the list):
- Cross-entropy loss for semantic classification.
- Binary cross-entropy plus Dice loss for predicted masks.
- Cosine-similarity alignment to CLIP features.
- Token-level cross-entropy for language modeling.
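The sketch below illustrates the segmentation path and the mask loss terms under the notation introduced above; the projection heads, shapes, and loss weighting are assumptions rather than the paper's exact heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def decode_mask(seg_hidden, msdec, scale_feats, fine_feats, to_query, to_kernel):
    """seg_hidden: (d_llm,) LLM hidden state at the [SEG] token.
    fine_feats: (N1, d) finest-scale cell features scored against the mask kernel."""
    q = to_query(seg_hidden)[None, None]          # (1, 1, d): segmentation query
    q = msdec(q, scale_feats)                     # refined by the MSDec stack
    kernel = to_kernel(q[0, 0])                   # (d,): mask kernel k
    logits = fine_feats @ kernel                  # (N1,): per-cell mask logits z_c
    return logits, torch.sigmoid(logits)          # probabilities p_c via sigmoid

def mask_loss(logits, target, eps=1.0):
    """Binary cross-entropy + Dice loss on per-cell mask predictions."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    dice = 1 - (2 * (prob * target).sum() + eps) / (prob.sum() + target.sum() + eps)
    return bce + dice
```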
5. Performance Benchmarks and Comparative Evaluation
NDTokenizer3D demonstrates competitive and state-of-the-art metrics across key 3D vision–language benchmarks (relative to generalist competitor 3D-LLaVA):
| Task (Benchmark / Metric) | 3D-LLaVA | NDTokenizer3D | Δ |
|---|---|---|---|
| 3D Referring Segmentation (Multi3DRefer, mIoU) | 42.7 | 46.0 | +3.3 |
| 3D VQA (ScanQA, CIDEr/B-4/M/R) | 92.6 / 17.1 / 18.4 / 43.1 | 98.6 / 17.0 / 19.4 / 44.9 | +6.0 CIDEr |
| Situated QA (SQA3D, EM/EM-R) | 54.5 / 56.6 | 54.4 / 57.1 | +0.5 EM-R |
| Dense Captioning (Scan2Cap@0.5 IoU, C/B-4/M/R) | 78.8 / 36.9 / 27.1 / 57.7 | 79.0 / 36.7 / 27.1 / 57.7 | +0.2 C |
| Hallucination (3D-POPE neg. accuracy, Random) | 75.5% | 84.1% | +8.6 |
| Hallucination (3D-POPE neg. accuracy, Popular) | 66.9% | 75.5% | +8.6 |
| Hallucination (3D-POPE neg. accuracy, Adversarial) | 63.1% | 72.0% | +8.9 |
Ablative studies indicate that multi-scale NDT is superior to conventional downsampling, three scales yield optimal results, and approximately 800 queries suffice.
6. Architectural Innovations and Significance
NDTokenizer3D introduces several innovations:
- Multi-scale NDT representation: Compactly encodes both global context and local geometry for raw 3D input.
- MSDec framework: Hierarchical transformer decoder designed for efficient cross-scale feature fusion; supports unified handling of vision, spatial reasoning, segmentation, and interaction.
- LLM integration: Scene tokens and prompt-based guidance seamlessly bridge the gap between 3D spatial input and natural language reasoning in a multimodal model.
This approach enables a unified, flexible interface for 3D understanding, marking a methodological advance in generalist 3D VLMs.
7. Context and Implications for 3D Vision-Language Research
NDTokenizer3D’s pipeline unifies scene tokenization, multi-scale geometric representation, and human-interactive prompting in a single architecture, supporting both holistic scene understanding and precise segmentation. Intrinsic support for language-level reasoning over 3D spatial data advances the state-of-the-art in 3D referring tasks, VQA, dense captioning, and hallucination resistance. A plausible implication is wider adoption of multi-scale Gaussian cell representations and transformer-based cross-scale fusion modules in next-generation 3D VLMs aimed at generalist multimodal understanding (Tang et al., 26 Nov 2025).