IPCD-Net: Intrinsic Decomposition for 3D Point Clouds

Updated 17 November 2025
  • IPCD-Net is an end-to-end deep learning framework that separates 3D point clouds into per-point albedo and shading components, addressing challenges in unstructured data.
  • It employs Point Transformer v2 for permutation-invariant feature aggregation and a Projection-based Luminance Distribution module to capture global illumination cues.
  • The network enables precise texture editing, relighting, and point-cloud registration by significantly reducing shading errors and enhancing color accuracy.

Intrinsic Point-Cloud Decomposition Network (IPCD-Net) is an end-to-end deep learning architecture designed to separate albedo and shading components directly from colored 3D point clouds, enabling tasks such as relighting, texture editing, and robust registration under varying outdoor illumination. IPCD-Net addresses the fundamental challenges posed by the irregular nature of point-cloud data and the necessity to infer global illumination properties in the absence of explicit light direction or color, which prior image-based and point-based decomposition techniques fail to handle effectively.

1. Formulation of Intrinsic Decomposition for Point Clouds

The intrinsic decomposition task seeks, for each spatial location, to factor observed color into albedo and shading from a single observation, typically under a Lambertian assumption for reflectance. In the classical image setting, this is expressed as $I(p) = A(p) \cdot S(p)$, where $I$ is the pixel color, $A$ the per-pixel albedo, and $S$ the shading due to illumination. IPCD-Net extends this paradigm to unordered point sets: for a point cloud represented by positions $P \in \mathbb{R}^{N\times 3}$ and observed colors $I \in \mathbb{R}^{N\times 3}$, the aim is to learn functions predicting $\hat{A}, \hat{S} \in \mathbb{R}^{N\times 3}$ such that, at each point $i$:

$$I_i \approx \hat{A}_i \odot \hat{S}_i$$

where $\odot$ denotes elementwise multiplication. All predictions and supervisory signals reside natively in point-cloud space, obviating rasterization or grid imposition.
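For concreteness, the factorization and its reconstruction check reduce to a few tensor operations. The following PyTorch sketch uses illustrative shapes and placeholder predictions, not the paper's code:

```python
import torch

N = 4096                      # number of points (illustrative)
P = torch.rand(N, 3)          # point positions, R^{N x 3}
I = torch.rand(N, 3)          # observed per-point RGB colors

# Hypothetical per-point network outputs (placeholders for A_hat, S_hat)
A_hat = torch.rand(N, 3)
S_hat = torch.rand(N, 3)

# Lambertian reconstruction: I_i ~= A_hat_i (elementwise) S_hat_i
I_rec = A_hat * S_hat

# Reconstruction residual (Frobenius norm over all points) as a consistency check
residual = torch.linalg.norm(I - I_rec)
```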

2. Network Architecture and Pointwise Feature Aggregation

IPCD-Net processes input per-point features comprising 3D coordinates $P_i$ and RGB color $I_i$. For permutation-invariant feature learning, it employs Point Transformer v2 (PTv2) as a shared encoder, assembling k-nearest-neighbor graphs and applying grouped vector attention, producing latent features $F \in \mathbb{R}^{N \times C}$. Two "pre-estimate" heads, parameterized as small multi-layer perceptrons (MLPs), then predict initial albedo and shading estimates denoted $A' \in \mathbb{R}^{N\times 3}$ and $S' \in \mathbb{R}^{N\times 3}$.

Downstream, global-light context is introduced by the Projection-based Luminance Distribution (PLD) module, whose output is concatenated per point to the pre-estimates. Two refinement MLP heads subsequently yield the final predictions, $\hat{A}$ and $\hat{S}$. All operations (attention, MLP layers, neighbor search) are natively set-based and respect the unordered, non-uniform density of point clouds.
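A minimal PyTorch sketch of this two-stage head structure follows. The PTv2 encoder and PLD module are treated as black boxes, and the hidden-layer sizes are assumptions rather than the paper's reported configuration:

```python
import torch
import torch.nn as nn

class DecompositionHeads(nn.Module):
    """Sketch of the pre-estimate + refinement heads (sizes are assumed)."""
    def __init__(self, feat_dim=64, light_dim=3):
        super().__init__()
        # Pre-estimate heads: latent point features -> initial albedo / shading
        self.pre_albedo = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 3))
        self.pre_shade  = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 3))
        # Refinement heads take [A', S', tiled global-light vector] per point
        in_dim = 6 + light_dim
        self.ref_albedo = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 3))
        self.ref_shade  = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 3))

    def forward(self, F, light):                 # F: (N, feat_dim), light: (light_dim,)
        A_pre, S_pre = self.pre_albedo(F), self.pre_shade(F)
        light_tiled = light.expand(F.shape[0], -1)            # tile global light to all N points
        x = torch.cat([A_pre, S_pre, light_tiled], dim=-1)    # (N, 6 + light_dim)
        return A_pre, S_pre, self.ref_albedo(x), self.ref_shade(x)
```

Keeping the pre-estimates as explicit outputs is what allows them to be supervised separately by the auxiliary losses described in Section 4.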

3. Projection-based Luminance Distribution (PLD) and Global-Illumination Encoding

A principal difficulty in point-cloud decomposition is the absence of canonical image axes or global-light annotation. IPCD-Net's PLD module estimates the light field over the point cloud from within the data. It samples 324 uniform directions $(\theta, \phi)$ over the upper hemisphere and, for each, rotates the cloud, renders an orthographic luminance image $L(\theta, \phi; u, v) \in \mathbb{R}^{H\times W}$, and computes the mean luminance:

$$\mathrm{PLD}(\theta, \phi) = \frac{1}{N_P} \sum_{u,v} L(\theta, \phi; u, v)$$

where $N_P = H \cdot W$. This collection, interpreted as a hemispherical luminance map, is embedded by SphereNet (a spherical convolution network) into a global-light feature vector $\ell \in \mathbb{R}^{d_L}$ (with $d_L = 3$). The hierarchical refinement proceeds by tiling $\ell$ to all $N$ points and concatenating with $A'$, $S'$ to form the input $X \in \mathbb{R}^{N\times (6 + d_L)}$ for the final MLP heads. This mechanism instructs the network to leverage coarse-to-fine light cues, improving both the removal of cast shadows from albedo and the color accuracy of shading while preserving local geometric variation.
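The sketch below illustrates how one PLD value could be computed for a single view direction. It replaces the PyTorch3D orthographic renderer with a naive point splat (no z-buffering or occlusion handling) and assumes Rec. 709 luminance weights, so it demonstrates the statistic rather than reproducing the paper's renderer:

```python
import torch

def luminance(rgb):
    # Rec. 709 luminance weights (an assumption; the paper may use a different formula)
    w = torch.tensor([0.2126, 0.7152, 0.0722])
    return rgb @ w

def pld_value(points, colors, R, H=256, W=256):
    """Mean luminance of an orthographic projection of the rotated cloud
    (simplified point splatting stand-in for the actual renderer)."""
    p = points @ R.T                        # rotate cloud into the view frame
    xy = p[:, :2]
    lo, hi = xy.min(0).values, xy.max(0).values
    uv = ((xy - lo) / (hi - lo + 1e-8) * torch.tensor([W - 1, H - 1])).long()
    img = torch.zeros(H, W)
    img[uv[:, 1], uv[:, 0]] = luminance(colors)  # last-write splat; a real renderer z-buffers
    return img.mean()                             # PLD(theta, phi) = mean over H*W pixels
```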

4. Supervision, Loss Terms, and Learning

The training objective combines supervision on intermediate (pre-estimates) and final predictions using ground-truth decompositions available in synthetic data. For both albedo and shading, pointwise losses (Frobenius norm) are applied:

  • Pre-estimate losses: $L^{\mathrm{alb}}_{\mathrm{pre}} = \|A - A'\|_F$, $L^{\mathrm{shd}}_{\mathrm{pre}} = \|S - S'\|_F$, $L^{\mathrm{phy}}_{\mathrm{pre}} = \|I - A' \odot S'\|_F$
  • Final-estimate losses: $L^{\mathrm{alb}}_{\mathrm{pnt}} = \|A - \hat{A}\|_F$, $L^{\mathrm{shd}}_{\mathrm{pnt}} = \|S - \hat{S}\|_F$, $L^{\mathrm{phy}}_{\mathrm{pnt}} = \|I - \hat{A} \odot \hat{S}\|_F$

The total loss is

$$L_{\mathrm{tot}} = L^{\mathrm{alb}}_{\mathrm{pnt}} + L^{\mathrm{shd}}_{\mathrm{pnt}} + L^{\mathrm{phy}}_{\mathrm{pnt}} + \lambda \left( L^{\mathrm{alb}}_{\mathrm{pre}} + L^{\mathrm{shd}}_{\mathrm{pre}} + L^{\mathrm{phy}}_{\mathrm{pre}} \right)$$

with $\lambda = 0.1$ weighting the auxiliary supervision. This regime encourages both accurate decomposition and faithful reconstruction at multiple network stages.
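Assembled as code, the objective is a sum of Frobenius-norm terms. This sketch assumes all ground-truth and predicted tensors have shape (N, 3) and is not taken from the authors' implementation:

```python
import torch

def ipcd_loss(I, A, S, A_pre, S_pre, A_hat, S_hat, lam=0.1):
    """Total objective: final-estimate terms plus lambda-weighted pre-estimate terms,
    each combining albedo, shading, and physical reconstruction errors."""
    fro = lambda x: torch.linalg.norm(x)     # Frobenius norm of an (N, 3) tensor
    L_pre = fro(A - A_pre) + fro(S - S_pre) + fro(I - A_pre * S_pre)
    L_pnt = fro(A - A_hat) + fro(S - S_hat) + fro(I - A_hat * S_hat)
    return L_pnt + lam * L_pre
```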

5. Dataset Construction and Training Protocol

IPCD-Net is trained and validated on a synthetic outdoor-scene dataset tailored for intrinsic decomposition in point clouds. The dataset comprises 30 distinct "assets" (building models with controllable albedo), each rendered under three sun positions (morning, noon, evening) to create varied shading conditions. Pure-shade ground truth is computed by removing albedo and re-illuminating. For each condition, $10^6$ points are randomly sampled; ground-truth albedo, shading, and color are stored per point. The final set consists of 90 point-cloud scenarios, split by asset: 23 for training, 7 for test.

The pipeline is implemented in PyTorch and runs on NVIDIA H100 GPUs. Each training step samples $10^4$ points from the $10^6$-point clouds. The encoder uses PTv2; PLD projections are rendered with PyTorch3D; SphereNet processes the PLD feature. PLD's 324 views correspond to $10^\circ$ steps in elevation ($0^\circ$ to $80^\circ$) and azimuth ($0^\circ$ to $350^\circ$), with images of size $256 \times 256$. Optimization employs Adam with standard parameters.
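A brief sketch of the view-grid construction and per-step subsampling described above; variable names are illustrative and the data loading is assumed:

```python
import torch

# PLD view grid: 9 elevations (0..80 deg, 10-deg steps) x 36 azimuths (0..350 deg) = 324 views
elev = torch.arange(0, 90, 10)             # degrees
azim = torch.arange(0, 360, 10)            # degrees
views = torch.cartesian_prod(elev, azim)   # (324, 2) table of (theta, phi) pairs
assert views.shape[0] == 324

# Per training step: randomly subsample 10^4 points from a 10^6-point cloud
# (cloud, albedo_gt, shade_gt are assumed preloaded (10^6, 3) tensors)
def sample_step(cloud, albedo_gt, shade_gt, n=10_000):
    idx = torch.randperm(cloud.shape[0])[:n]
    return cloud[idx], albedo_gt[idx], shade_gt[idx]
```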

6. Benchmarks, Ablations, and Quantitative Results

Evaluation metrics include per-point MSE ($\times 10^{-2}$), MAE ($\times 10^{-1}$), and PSNR (dB) for both albedo and shading. Comparative baselines are standard intrinsic image techniques (e.g., Retinex, NIID-Net, CD-IID, IID-Anything), a rendering-then-IID-then-reprojection pipeline (GS-IR), and ablated versions of IPCD-Net (w/o PLD, w/o HFR+PLD, w/o shared encoder, "base model"). Quantitative test-set results are:

Model           MSE_alb   MSE_shd   MAE_alb   MAE_shd   PSNR_alb   PSNR_shd
Baseline-A      18.9      29.1      3.58      4.27      7.57       5.96
NIID-Net        15.2      12.1      2.93      2.46      8.97       9.99
IPCD-Net_base   4.02      5.11      1.58      1.62      14.0       13.5
IPCD-Net        3.03      3.25      1.31      1.37      15.6       15.1
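For reference, the per-point metrics can be computed as below. The reporting scales follow the units stated above, while the assumption that colors lie in [0, 1] for PSNR is ours:

```python
import torch

def metrics(pred, gt):
    """Per-point MSE, MAE, and PSNR between (N, 3) predictions and ground truth."""
    mse = torch.mean((pred - gt) ** 2)
    mae = torch.mean(torch.abs(pred - gt))
    psnr = 10.0 * torch.log10(1.0 / mse)        # assumes values in [0, 1]
    return {"MSE (x1e-2)": 100 * mse.item(),
            "MAE (x1e-1)": 10 * mae.item(),
            "PSNR (dB)": psnr.item()}
```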

Ablation indicates that PLD provides the most reduction in shading error, hierarchical refinement supports albedo recovery, and shared encoding stabilizes training. The full model demonstrates clear quantitative improvements across all metrics.

7. Applications, Generalization, and Limitations

Practical Applications

  • Texture editing: Separating $I$ into $\hat{A}$ and $\hat{S}$ allows selective editing of the albedo. Recombining the edited albedo with the original shading prevents the unnatural lighting artifacts that would result from modifying the observed colors directly.
  • Relighting: To transfer an object between lighting conditions, one computes $\hat{A}_1$ from input $I_1 = A_1 \odot S_1$, then synthesizes $\tilde{I}_{1\rightarrow 2} = \hat{A}_1 \odot S_2$ (see the sketch following this list). This operation mitigates residual shadows and achieves appearance consistent with ground truth under the novel illumination.
  • Point-cloud registration: Under changing light, ICP registration on the original colors $I$ degrades as overlap falls. Registration using the estimated albedo $\hat{A}$ recovers recall rates near those achievable with ground-truth albedo.
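The texture-editing and relighting recombinations above reduce to elementwise products of the decomposed components; a minimal illustrative sketch (function names and the example edit are hypothetical):

```python
import torch

def edit_texture(albedo, shade, edit_fn):
    """Texture editing: modify only the albedo, then recombine with the
    original shading so the edit keeps consistent lighting."""
    return edit_fn(albedo) * shade

def relight(albedo_src, shade_tgt):
    """Relighting: I~_{1->2} = A_hat_1 (elementwise) S_2, i.e. albedo estimated
    under illumination 1 recombined with shading from illumination 2."""
    return albedo_src * shade_tgt

# Illustrative usage: tint the albedo toward red while keeping the original shading
# I_edit = edit_texture(A_hat, S_hat, lambda a: (a * torch.tensor([1.2, 0.9, 0.9])).clamp(0, 1))
```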

Generalization and Real-World Evaluation

On SensatUrban (real urban LiDAR and imagery), IPCD-Net achieved the highest F1 score among the compared baselines, evaluated against annotations of over 900 reflectance-ordered point pairs. The method effectively reduces cast shadows and remains robust to noise-prone real scans.

Limitations and Prospects

PLD presumes Lambertian, diffuse-dominated scenes; severe specular or highly variable reflectance (e.g., black vs. white-mirror) can bias luminance statistics. Very sparse or occluded clouds degrade PLD reliability. Prospective advances include learned inpainting/completion to densify PLD projections and the adoption of newer point-cloud encoding backbones or integration with BRDF estimation for more general inverse rendering scenarios.

IPCD-Net constitutes the first end-to-end neural framework for directly decomposing arbitrary colored point clouds into albedo and shading, leveraging point-wise feature aggregation and global-light analysis via PLD, and demonstrating strong decomposition fidelity and practical downstream performance on both synthetic and real-world benchmarks.
