MSDec: Multi-Scale Decoder for 3D Vision
- The paper introduces MSDec, a transformer-based decoder that fuses multi-scale NDT features to generate compact scene tokens for 3D tasks.
- It employs voxel-based multi-scale feature extraction and cross-attention mechanisms to capture both fine details and global context in point clouds.
- The architecture bridges 3D spatial context with large language models, facilitating advanced applications in segmentation, VQA, and dense captioning.
The Multi-Scale NDT Decoder (MSDec) is a transformer-style decoder module introduced within the NDTokenizer3D framework for generalist 3D vision-LLMs. MSDec is designed to fuse multi-scale Normal Distributions Transform (NDT) features from high-resolution point clouds into compact scene tokens, enabling a unified interface for language-level reasoning, interactive user prompting, and segmentation. This architecture supports general-purpose 3D scene understanding tasks such as referring segmentation, visual question answering, and dense captioning, and functions as a bridge between detailed 3D spatial context and LLMs (Tang et al., 26 Nov 2025).
1. Architectural Composition
MSDec operates as a transformer-style decoder, processing multi-scale 3D features extracted from the NDT representation. The architecture comprises three main parts (a minimal code sketch follows the list):
- Initial Queries: The process begins with learnable query tokens $Q_0 \in \mathbb{R}^{N_q \times d}$, with $N_q$ as the number of scene tokens (commonly 850) and $d$ as the hidden dimension. $Q_0$ is initialized via a linear projection of a down-sampled subset of the finest-scale features, $Q_0 = \mathrm{Linear}(\mathrm{Sub}(F^{(S)}))$, where $F^{(S)}$ represents the finest-scale features and $\mathrm{Sub}(\cdot)$ denotes subsampling.
- Cross-Scale Fusion Blocks (Decoder Layers): For each scale $s$, a transformer decoder layer is applied, utilizing the scale-$s$ features $F^{(s)}$ as Keys/Values and the query tokens as Queries. Each layer encompasses: 1) cross-attention between the queries and $F^{(s)}$, 2) self-attention among the updated queries, and 3) a position-wise feed-forward network (FFN).
The queries are iteratively refined across layers, yielding $Q_R$, which encodes a fused, multi-scale scene representation.
- Token Projection Heads:
- Scene-token head: a 2-layer MLP for multimodal alignment, projecting $Q_R$ to the final scene tokens consumed by the LLM.
- Instance head: classification and mask heads used during 3D instance-segmentation pretraining.
- Segmentation-query head: active during instruction tuning, mapping an LLM [SEG] token's hidden state into an MSDec query.
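The structure described above can be illustrated with a minimal PyTorch sketch under the stated hyperparameters (roughly 850 queries, pre-norm blocks, 8 heads). All module and variable names here (FusionBlock, MSDecSketch, init_proj, scene_head) are hypothetical and not taken from the paper's code.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """One cross-scale fusion block: pre-norm cross-attention -> self-attention -> FFN."""
    def __init__(self, d=256, heads=8, dropout=0.1):
        super().__init__()
        self.norm_q1, self.norm_kv = nn.LayerNorm(d), nn.LayerNorm(d)
        self.cross_attn = nn.MultiheadAttention(d, heads, dropout=dropout, batch_first=True)
        self.norm_q2 = nn.LayerNorm(d)
        self.self_attn = nn.MultiheadAttention(d, heads, dropout=dropout, batch_first=True)
        self.norm_q3 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, queries, scale_feats):
        # Cross-attention: queries attend to the scale-s NDT cell features (Keys/Values).
        q, kv = self.norm_q1(queries), self.norm_kv(scale_feats)
        queries = queries + self.cross_attn(q, kv, kv)[0]
        # Self-attention among the updated queries.
        q = self.norm_q2(queries)
        queries = queries + self.self_attn(q, q, q)[0]
        # Position-wise feed-forward network with residual connection.
        return queries + self.ffn(self.norm_q3(queries))

class MSDecSketch(nn.Module):
    """Hypothetical top-level decoder: one fusion block per scale plus a scene-token head."""
    def __init__(self, num_queries=850, d=256, num_scales=3):
        super().__init__()
        self.num_queries = num_queries
        self.init_proj = nn.Linear(d, d)  # projects subsampled finest-scale features to Q_0
        self.blocks = nn.ModuleList(FusionBlock(d) for _ in range(num_scales))
        self.scene_head = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))  # 2-layer MLP

    def forward(self, feats_per_scale):
        # feats_per_scale: list of (B, N_s, d) tensors, ordered from coarse to fine.
        finest = feats_per_scale[-1]
        idx = torch.randperm(finest.shape[1], device=finest.device)[: self.num_queries]
        queries = self.init_proj(finest[:, idx])          # Q_0 from down-sampled finest-scale features
        for block, feats in zip(self.blocks, feats_per_scale):
            queries = block(queries, feats)               # Q_1 ... Q_R
        return self.scene_head(queries)                   # compact scene tokens for the LLM
```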
2. Mathematical Formulation
2.1 Multi-Scale Feature Extraction
At each scale $s$, the point cloud is partitioned into voxels. For the $i$-th cell $V_i^{(s)}$, the mean $\mu_i^{(s)}$, covariance $\Sigma_i^{(s)}$, and average cell intensity $c_i^{(s)}$ (across views) are computed:

$$\mu_i^{(s)} = \frac{1}{|V_i^{(s)}|}\sum_{p \in V_i^{(s)}} p, \qquad \Sigma_i^{(s)} = \frac{1}{|V_i^{(s)}|}\sum_{p \in V_i^{(s)}} \big(p - \mu_i^{(s)}\big)\big(p - \mu_i^{(s)}\big)^{\top}.$$
The cell descriptor $\big[\mu_i^{(s)}, \Sigma_i^{(s)}, c_i^{(s)}\big]$ is processed by a 3D encoder (often a point transformer) to yield the per-cell feature $f_i^{(s)}$; stacking over all occupied cells gives the scale-$s$ feature set $F^{(s)}$.
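As an illustration of the per-cell statistics, the sketch below computes the mean, covariance, and mean intensity for occupied voxels at one scale. The helper name ndt_cell_stats and its interface are assumptions for illustration, not the paper's implementation.

```python
import torch

def ndt_cell_stats(points, intensity, cell_size):
    """Per-cell mean, covariance, and mean intensity for the occupied voxels at one scale.

    points: (N, 3) float tensor; intensity: (N,) per-point intensity averaged across views.
    """
    cell_coords = torch.floor(points / cell_size).long()                # voxel coordinate per point
    _, inverse = torch.unique(cell_coords, dim=0, return_inverse=True)  # cell id per point
    num_cells = int(inverse.max()) + 1

    counts = torch.zeros(num_cells).index_add_(0, inverse, torch.ones(len(points)))
    mu = torch.zeros(num_cells, 3).index_add_(0, inverse, points) / counts[:, None]

    centered = points - mu[inverse]                                      # (N, 3)
    outer = centered[:, :, None] * centered[:, None, :]                  # (N, 3, 3) outer products
    sigma = torch.zeros(num_cells, 3, 3).index_add_(0, inverse, outer) / counts[:, None, None]

    c = torch.zeros(num_cells).index_add_(0, inverse, intensity) / counts
    return mu, sigma, c
```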
2.2 Cross-Scale Fusion Mechanism
Feature fusion across scales is characterized as

$$Q_r = \mathrm{FFN}\!\Big(\mathrm{SelfAttn}\big(\mathrm{CrossAttn}(Q_{r-1},\, F^{(s)})\big)\Big),$$

where $F^{(s)}$ is projected via $W_K$ and $W_V$ to Keys/Values. In transformer notation:

$$\mathrm{CrossAttn}(Q, F^{(s)}) = \mathrm{softmax}\!\left(\frac{(Q W_Q)(F^{(s)} W_K)^{\top}}{\sqrt{d}}\right) F^{(s)} W_V.$$
This nested cross-attention across scales enables the flow of global context at coarse levels and fine detail refinement at deep layers.
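The cross-attention term can be written out explicitly as a small single-head function; the projection-matrix names W_q, W_k, W_v below are generic stand-ins rather than the paper's symbols.

```python
import math
import torch

def cross_attention(Q, F_s, W_q, W_k, W_v):
    """Single-head cross-attention from query tokens Q to scale-s features F_s."""
    q = Q @ W_q                                   # (N_q, d)
    k = F_s @ W_k                                 # (N_s, d)
    v = F_s @ W_v                                 # (N_s, d)
    attn = torch.softmax(q @ k.T / math.sqrt(q.shape[-1]), dim=-1)  # (N_q, N_s)
    return attn @ v                               # fused query features, (N_q, d)
```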
3. Scene Token Generation and Consumption
After $R$ decoder layers, $Q_R$ is projected via the scene-token head to $T_{\mathrm{scene}}$, producing the final scene tokens. These are concatenated before the LLM's text tokens, permitting seamless multimodal integration:

$$\mathrm{LLM}\big([\,T_{\mathrm{scene}};\, T_{\mathrm{text}}\,]\big).$$
Multi-scale features carry implicit localization via their cell indices, and queries are fixed in order. Optional scale embeddings may be added. Layer normalization precedes each attention and FFN block, in line with Pre-Norm Transformer practice.
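A minimal sketch of this concatenation step, assuming a Hugging-Face-style LLM interface (get_input_embeddings, inputs_embeds); the function name build_llm_inputs is hypothetical.

```python
import torch

def build_llm_inputs(scene_tokens, text_ids, llm):
    """Prepend projected scene tokens to the LLM's text embeddings."""
    text_emb = llm.get_input_embeddings()(text_ids)       # (B, T, d_llm)
    inputs = torch.cat([scene_tokens, text_emb], dim=1)   # scene tokens come before text tokens
    return llm(inputs_embeds=inputs)
```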
4. Interfaces for Interaction and Segmentation
4.1 User-Input Prompting
Given user prompts (point, box, mask), a binary mask over the NDT cells is constructed. Masked features are average-pooled to create the prompt feature $f_{\mathrm{prompt}}$ and projected to a prompt query $q_{\mathrm{prompt}}$. The concatenated query set $[Q_0;\, q_{\mathrm{prompt}}]$ feeds through MSDec. The dedicated prompt token is extracted and mapped via a projection head to generate the guidance token for the LLM.
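A sketch of this prompt-encoding path; encode_prompt, prompt_proj, and the tensor shapes are all assumed for illustration.

```python
import torch
import torch.nn as nn

prompt_proj = nn.Linear(256, 256)   # maps the pooled prompt feature to a query token

def encode_prompt(fine_feats, prompt_mask, queries):
    """Append a prompt-derived query to the learnable query set Q_0.

    fine_feats: (N_c, d) finest-scale cell features; prompt_mask: (N_c,) boolean mask.
    """
    f_prompt = fine_feats[prompt_mask].mean(dim=0, keepdim=True)  # average-pool the masked cells
    q_prompt = prompt_proj(f_prompt)                              # (1, d) prompt query
    return torch.cat([queries, q_prompt], dim=0)                  # fed through MSDec with Q_0
```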
4.2 Segmentation-Mask Decoding
Upon emission of an [SEG] token by the LLM, its hidden state is projected via the segmentation-query head to a query $q_{\mathrm{seg}}$ and concatenated as $[Q_0;\, q_{\mathrm{seg}}]$. This passes through MSDec, and the final segmentation token is mapped via a mask-kernel head to the kernel $k$, then dot-multiplied against the finest-scale cell features and passed through a sigmoid to derive the 3D mask:

$$M = \sigma\big(F^{(S)} k\big).$$
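The [SEG]-driven mask decoding can be sketched as below; seg_query_head, kernel_head, and the msdec callable are hypothetical placeholders standing in for the corresponding projection heads and the decoder.

```python
import torch

def decode_mask(seg_hidden, queries, msdec, fine_feats, seg_query_head, kernel_head):
    """Turn an LLM [SEG] hidden state into a 3D mask over the finest-scale cells."""
    q_seg = seg_query_head(seg_hidden)                      # project hidden state to an MSDec query
    out = msdec(torch.cat([queries, q_seg[None]], dim=0))   # run the decoder with the extra query
    kernel = kernel_head(out[-1])                           # (d,) mask kernel from the final token
    return torch.sigmoid(fine_feats @ kernel)               # per-cell mask probabilities
```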
5. Implementation Considerations
Key parameters and practical suggestions include:
- Number of Scales ($S$): Empirically, $S = 3$ (coarse, medium, fine) is optimal.
- Number of Queries ($N_q$): Approximately 850 achieves a balance between representation capacity and overfitting.
- Feature Dimension ($d$): Typically 256 or 512, matching the encoder output.
- Attention Heads: 8 per layer is standard.
- Pre-Norm Transformer: LayerNorm before each attention/FFN; Dropout=0.1.
- Optimized Attention: FlashAttention-2 used for memory and speed efficiency.
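These hyperparameters can be collected into a single configuration object; the MSDecConfig dataclass below is an illustrative assumption, not part of any released code.

```python
from dataclasses import dataclass

@dataclass
class MSDecConfig:
    num_scales: int = 3          # S: coarse, medium, fine
    num_queries: int = 850       # N_q scene tokens
    hidden_dim: int = 256        # d, matching the encoder output (256 or 512)
    num_heads: int = 8           # attention heads per layer
    dropout: float = 0.1
    pre_norm: bool = True        # LayerNorm before each attention/FFN block
    use_flash_attention: bool = True   # FlashAttention-2 backend
```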
6. Training Objectives and Stages
MSDec is trained in two primary stages:
- Stage 1: 3D Instance Segmentation Pre-training
- Classification loss over instance categories.
- Mask loss (BCE + DICE) for instance masks (see the loss sketch after this list).
- 2D–3D feature alignment loss.
- Total loss: a weighted sum of the classification, mask, and alignment terms.
- Stage 2: Instruction Tuning (freezing MSDec + encoder)
- Language modeling loss (next-token CE).
- Segmentation mask loss (when segmentation prompts are present).
- Text-embedding alignment loss between predicted and ground-truth embeddings.
- Total: a weighted sum of the language-modeling, mask, and text-embedding alignment terms.
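For concreteness, here is a sketch of the BCE + DICE mask loss referenced in both stages; the loss weights and epsilon smoothing are placeholder choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def mask_loss(pred_logits, gt_mask, w_bce=1.0, w_dice=1.0, eps=1.0):
    """BCE + DICE loss over per-cell mask logits and a binary ground-truth mask."""
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt_mask.float())
    prob = torch.sigmoid(pred_logits)
    inter = (prob * gt_mask).sum()
    dice = 1.0 - (2.0 * inter + eps) / (prob.sum() + gt_mask.sum() + eps)
    return w_bce * bce + w_dice * dice
```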
MSDec thus produces compact, information-rich scene tokens, offers a unified interface for interactive 3D scene reasoning, and supports complex multimodal tasks, streamlining the connection between high-resolution 3D geometry and LLMs (Tang et al., 26 Nov 2025).