MSDec: Multi-Scale Decoder for 3D Vision

Updated 29 November 2025
  • The paper introduces MSDec, a transformer-based decoder that fuses multi-scale NDT features to generate compact scene tokens for 3D tasks.
  • It employs voxel-based multi-scale feature extraction and cross-attention mechanisms to capture both fine details and global context in point clouds.
  • The architecture bridges 3D spatial context with large language models, facilitating advanced applications in segmentation, VQA, and dense captioning.

The Multi-Scale NDT Decoder (MSDec) is a transformer-based decoder module introduced within the NDTokenizer3D framework for generalist 3D vision-LLMs. MSDec is designed to fuse multi-scale Normal Distributions Transform (NDT) features from high-resolution point clouds into compact scene tokens, enabling a unified interface for language-level reasoning, interactive user prompting, and segmentation. This architecture supports general-purpose 3D scene understanding tasks such as referring segmentation, visual question answering, and dense captioning, and functions as a bridge between detailed 3D spatial context and LLMs (Tang et al., 26 Nov 2025).

1. Architectural Composition

MSDec operates as a transformer-style decoder that processes multi-scale 3D features extracted from the NDT representation. The architecture has three main parts (a code sketch follows the list below):

  • Initial Queries: The process begins with learnable query tokens $Q_1 \in \mathbb{R}^{M \times d}$, where $M$ is the number of scene tokens (commonly 850) and $d$ is the hidden dimension. $Q_1$ is initialized via a linear projection of a down-sampled subset of the finest-scale features, $Q_1 \leftarrow W^1_Q(\downarrow F_R)$, where $F_R$ denotes the finest-scale features and $\downarrow$ denotes subsampling.
  • Cross-Scale Fusion Blocks (Decoder Layers): For each scale $r = 1, \dots, R$, a transformer decoder layer is applied, using the scale-$r$ features $F_r$ as Keys/Values and the query tokens as Queries. Each layer comprises: 1) cross-attention between $Q_r$ and $F_r$ (as $K$, $V$), 2) self-attention among the updated queries, and 3) a position-wise feed-forward network (FFN).

The outputs are iteratively refined across layers, yielding $Q_{R+1} = Q_{\text{out}} \in \mathbb{R}^{M \times d}$, which encodes the fused multi-scale scene representation.

  • Token Projection Heads:
    • $f_{mm}$: A 2-layer MLP for multimodal alignment, projecting $Q_{\text{out}}$ to the final scene tokens $E_V$ for LLM consumption.
    • $f_m$: Instance-segmentation classification and mask head used during pretraining.
    • $f_s$: Segmentation-query head, active during instruction tuning, that maps an LLM [SEG] token's hidden state into an MSDec query.
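A minimal PyTorch sketch of this stack is shown below. It is an illustration under stated assumptions, not the reference implementation: the class names (FusionBlock, MSDecSketch), the hidden sizes, the 4x FFN expansion, and the random down-sampling used to seed $Q_1$ are placeholders chosen to match the description above.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """One cross-scale fusion block: cross-attention -> self-attention -> FFN, Pre-Norm."""
    def __init__(self, d, heads=8, dropout=0.1):
        super().__init__()
        self.cross = nn.MultiheadAttention(d, heads, dropout=dropout, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d, heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, q, kv):
        q = q + self.cross(self.n1(q), kv, kv, need_weights=False)[0]  # queries attend to F_r (K, V)
        h = self.n2(q)
        q = q + self.self_attn(h, h, h, need_weights=False)[0]         # self-attention over queries
        return q + self.ffn(self.n3(q))                                # position-wise FFN

class MSDecSketch(nn.Module):
    def __init__(self, d=256, d_f=256, d_llm=4096, num_scales=3, num_queries=850):
        super().__init__()
        self.q_init = nn.Linear(d_f, d)                                  # W_Q^1
        self.kv_proj = nn.ModuleList([nn.Linear(d_f, d) for _ in range(num_scales)])
        self.blocks = nn.ModuleList([FusionBlock(d) for _ in range(num_scales)])
        self.f_mm = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d_llm))  # 2-layer MLP head
        self.num_queries = num_queries

    def forward(self, feats):
        """feats: list of per-scale features [B, N_r, d_f], ordered coarse -> fine (F_R last)."""
        f_fine = feats[-1]
        idx = torch.randperm(f_fine.size(1))[: self.num_queries]        # down-sample finest scale
        q = self.q_init(f_fine[:, idx])                                  # Q_1 in R^{B x M x d}
        for block, proj, f_r in zip(self.blocks, self.kv_proj, feats):   # iterate scales r = 1..R
            q = block(q, proj(f_r))
        return self.f_mm(q)                                              # scene tokens E_V
```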

2. Mathematical Formulation

2.1 Multi-Scale Feature Extraction

At each scale $r$, the point cloud is partitioned into $N_r$ voxels. For the $j^\text{th}$ cell, the mean $\mu_r^j$, covariance $\Sigma_r^j$, and average cell intensity $c_r^j$ (across views) are computed:

$$\mu_r^j = \frac{1}{n}\sum_{i=1}^n x_i$$

$$\Sigma_r^j = \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu_r^j)(x_i - \mu_r^j)^T$$

$$c_r^j = \operatorname{average}_k\big[I_k(u_k, v_k)\big], \qquad [u_k, v_k]^T = P(\mu_r^j \mid k)$$

The cell descriptor $C_r^j = [\mu_r^j; \operatorname{vec}(\Sigma_r^j); c_r^j] \in \mathbb{R}^{15}$ is processed by a 3D encoder $\Phi$ (often a point transformer) to yield $F_r = \Phi(\{C_r^j\}_{j=1,\dots,N_r}) \in \mathbb{R}^{N_r \times d_f}$.
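A hedged NumPy sketch of these per-cell statistics follows, assuming the intensity term $c_r^j$ is a 3-channel color (so the descriptor lands in $\mathbb{R}^{3+9+3} = \mathbb{R}^{15}$) and with the projection $P(\cdot \mid k)$ replaced by pre-gathered per-point colors:

```python
import numpy as np

def ndt_cell_descriptor(points_in_cell, colors_in_cell):
    """points_in_cell: (n, 3) xyz coordinates of the points in cell j at scale r;
    colors_in_cell: (n, 3) per-point colors already sampled from the views (proxy for I_k)."""
    mu = points_in_cell.mean(axis=0)                        # mu_r^j
    sigma = np.cov(points_in_cell, rowvar=False, ddof=1)    # Sigma_r^j with the 1/(n-1) normalizer
    c = colors_in_cell.mean(axis=0)                         # c_r^j, averaged over views/points
    return np.concatenate([mu, sigma.reshape(-1), c])       # C_r^j in R^15
```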

2.2 Cross-Scale Fusion Mechanism

Feature fusion across scales is characterized as:

$$F' = \text{Fusion}\big(F^{(1)}, \dots, F^{(S)}\big) = \sum_{s=1}^S W_s\big[\phi(F^{(s)})\big] + b_s$$

where $\phi$ is a projection (via $W^K_s$ and $W^V_s$ for Keys/Values). In transformer notation:

$$\text{CrossAttn}(Q, K, V) = \text{Softmax}\!\left(\frac{(Q W_q)(K W_k)^T}{\sqrt{d}}\right)(V W_v)$$

This nested cross-attention across scales lets coarse levels supply global context while deeper layers refine fine detail.
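The cross-attention formula above transcribes directly into code; the single-head version below (no output projection, square projection matrices) is only meant to mirror the equation:

```python
import math
import torch

def cross_attn(Q, K, V, W_q, W_k, W_v):
    """Q: (M, d) queries; K, V: (N_r, d) scale-r features; W_*: (d, d) learned projections."""
    scores = (Q @ W_q) @ (K @ W_k).T / math.sqrt(Q.size(-1))   # (M, N_r) scaled dot products
    return torch.softmax(scores, dim=-1) @ (V @ W_v)           # weighted sum of projected values
```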

3. Scene Token Generation and Consumption

After $R$ decoder layers, $Q_{\text{out}} \in \mathbb{R}^{M \times d}$ is projected via $f_{mm}$ to $E_V \in \mathbb{R}^{M \times d_{\text{LLM}}}$, producing the final scene tokens. These are concatenated before the LLM's text tokens, permitting seamless multimodal integration:

$$E_V = f_{mm}(Q_{\text{out}})$$

$$\text{LLM input} = [E_V;\ E_P\ (\text{optional guidance});\ E_T\ (\text{text})]$$

Multi-scale features carry implicit localization via their cell indices, and queries are fixed in order. Optional scale embeddings may be added. Layer normalization precedes each attention and FFN block, in line with Pre-Norm Transformer practice.
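A small sketch of this consumption step, with f_mm assumed to be a module mapping $d \to d_{\text{LLM}}$ and the text already embedded (both assumptions for illustration):

```python
import torch

def build_llm_input(q_out, f_mm, text_embeds, prompt_token=None):
    """q_out: (M, d) refined queries; text_embeds: (T, d_llm); prompt_token: optional (1, d_llm) E_P."""
    e_v = f_mm(q_out)                                                  # E_V: (M, d_llm) scene tokens
    parts = [e_v] + ([prompt_token] if prompt_token is not None else []) + [text_embeds]
    return torch.cat(parts, dim=0)                                     # [E_V; E_P (optional); E_T]
```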

4. Interfaces for Interaction and Segmentation

4.1 User-Input Prompting

Given user prompts (point, box, or mask), a binary mask $m_u \in \{0,1\}^{N_R}$ over $F_R$ is constructed. Masked features are average-pooled to create the prompt feature $f_P \in \mathbb{R}^{d_f}$ and projected to $Q_1^P = W_Q^P f_P \in \mathbb{R}^{1 \times d}$. The concatenated query set $Q_1^{\text{total}} = [Q_1; Q_1^P] \in \mathbb{R}^{(M+1) \times d}$ feeds through MSDec. The dedicated prompt token is then extracted and mapped via $f_{mm}$ to produce the guidance token $E_P$ for the LLM.
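A minimal sketch of the prompt-query construction, assuming the rasterization of a point/box/mask prompt into the binary cell mask m_u happens upstream:

```python
import torch

def prompt_query(F_R, m_u, W_Q_P):
    """F_R: (N_R, d_f) finest-scale features; m_u: (N_R,) binary mask; W_Q_P: nn.Linear(d_f, d)."""
    f_P = F_R[m_u.bool()].mean(dim=0, keepdim=True)    # average-pool the masked cells -> (1, d_f)
    return W_Q_P(f_P)                                  # Q_1^P: (1, d), appended to Q_1
```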

4.2 Segmentation-Mask Decoding

Upon emission of a [SEG] token by the LLM, its hidden state $H^S \in \mathbb{R}^{d_{\text{LLM}}}$ is projected via $f_s$ to $Q_1^S \in \mathbb{R}^{1 \times d}$ and concatenated as $Q_1^{\text{seg}} = [Q_1; Q_1^S]$. This passes through MSDec, and the final segmentation token is mapped via the mask-kernel head $f_m$ to the kernel $k \in \mathbb{R}^{d_f}$, then dot-multiplied against $F_R$ and passed through a sigmoid to derive the 3D mask:

$$M = \text{Sigmoid}(F_R \cdot k) \in \mathbb{R}^{N_R \times 1}$$
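The mask-decoding path can be sketched as follows, assuming msdec here accepts externally constructed queries and returns their refined (pre-$f_{mm}$, $d$-dimensional) states, and that f_s and f_m are small MLP heads:

```python
import torch

def decode_mask(h_seg, Q_1, msdec, feats, f_s, f_m):
    """h_seg: (d_llm,) hidden state of the [SEG] token; feats[-1] is F_R: (N_R, d_f)."""
    q_seg = f_s(h_seg).unsqueeze(0)              # Q_1^S: (1, d)
    q_all = torch.cat([Q_1, q_seg], dim=0)       # Q_1^seg = [Q_1; Q_1^S]
    out = msdec(q_all, feats)                    # refine over all scales -> (M+1, d)
    k = f_m(out[-1])                             # mask kernel k: (d_f,)
    return torch.sigmoid(feats[-1] @ k)          # M: (N_R,) per-cell mask probabilities
```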

5. Implementation Considerations

Key parameters and practical suggestions include (collected into a small config sketch after the list):

  • Number of Scales ($R$): Empirically, $R = 3$ (coarse, medium, fine) is optimal.
  • Number of Queries ($M$): Approximately 850 balances representation capacity against overfitting.
  • Feature Dimension ($d_f$): Typically 256 or 512, matching the encoder output.
  • Attention Heads: 8 per layer is standard.
  • Pre-Norm Transformer: LayerNorm before each attention/FFN block; dropout of 0.1.
  • Optimized Attention: FlashAttention-2 is used for memory and speed efficiency.
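These choices gather naturally into a single config object; the dataclass below is simply a convenience wrapper around the values listed (the decoder width and the 4096 LLM width are representative assumptions):

```python
from dataclasses import dataclass

@dataclass
class MSDecConfig:
    num_scales: int = 3        # R: coarse, medium, fine
    num_queries: int = 850     # M scene tokens
    d_f: int = 256             # encoder feature dim (256 or 512)
    d: int = 256               # decoder hidden dim (assumed)
    heads: int = 8             # attention heads per layer
    dropout: float = 0.1       # with Pre-Norm LayerNorm placement
    d_llm: int = 4096          # LLM embedding width (assumed)
```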

6. Training Objectives and Stages

MSDec is trained in two primary stages (a loss sketch for Stage 1 follows the list):

  • Stage 1: 3D Instance Segmentation Pre-training
    • Classification loss $L_{\text{cls}}$ over instance categories.
    • Mask loss $L_m$ (BCE + DICE) for instance masks.
    • 2D–3D feature alignment loss: $L_s = \frac{1}{N_r}\sum_j \big[1 - \cos(F_r^j, F_r^{j,C})\big]$
    • Total loss: $L = L_{\text{cls}} + \lambda_1 L_m + \lambda_2 L_s$
  • Stage 2: Instruction Tuning (freezing MSDec + encoder)
    • Language-modeling loss $L_t$ (next-token cross-entropy).
    • Segmentation mask loss $L_m$ (when segmentation prompts are present).
    • Text-embedding alignment: $L_s = 1 - \cos(H^{\hat{Y}}, H^Y)$ between predicted and ground-truth embeddings.
    • Total: $L = L_t + \lambda_3 L_m + \lambda_4 L_s$
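A hedged sketch of the Stage-1 objective (the dice_loss helper and the lambda values are placeholders; shapes follow the notation above):

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1.0):
    """pred, target: (B, N) mask probabilities and binary ground truth."""
    inter = (pred * target).sum(-1)
    return 1 - (2 * inter + eps) / (pred.sum(-1) + target.sum(-1) + eps)

def stage1_loss(cls_logits, cls_gt, mask_logits, mask_gt, f3d, f2d, lam1=1.0, lam2=0.5):
    l_cls = F.cross_entropy(cls_logits, cls_gt)                          # L_cls
    l_m = (F.binary_cross_entropy_with_logits(mask_logits, mask_gt)
           + dice_loss(torch.sigmoid(mask_logits), mask_gt).mean())      # L_m = BCE + DICE
    l_s = (1 - F.cosine_similarity(f3d, f2d, dim=-1)).mean()             # 2D-3D alignment L_s
    return l_cls + lam1 * l_m + lam2 * l_s                               # L_cls + λ1 L_m + λ2 L_s
```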

MSDec thus produces compact, information-rich scene tokens, offers a unified interface for interactive 3D scene reasoning, and supports complex multimodal tasks, streamlining the connection between high-resolution 3D geometry and LLMs (Tang et al., 26 Nov 2025).
