
ResGrouped-MLP Quality Assessment Network

Updated 10 January 2026
  • The paper introduces a novel grouped-MLP architecture that leverages multi-scale RBF features and residual connections to accurately predict 3D point cloud quality.
  • It employs a Split-Transform-Merge strategy to separately process luma, chroma, and curvature cues across high, medium, and low scales, preserving key semantic details.
  • Channel-wise attention and hierarchical fusion mechanisms enhance sensitivity to perceptually salient distortions, ensuring robust objective quality assessment.

The ResGrouped-MLP Quality Assessment Network is an architectural component that predicts perceptual quality scores for 3D point clouds by integrating multi-scale, physically motivated feature representations through a structured multi-layer perceptron (MLP) with grouped encodings, residual connections, and channel-wise attention. Developed within the MS-ISSM framework, ResGrouped-MLP departs from flat MLP designs: it distinctly encodes, merges, and weighs luma, chroma, and geometric cues sampled at high, medium, and low spatial granularities, thereby preserving the semantic integrity of each feature and improving the reliability of objective point cloud quality assessment (Chen et al., 3 Jan 2026).

1. Input Feature Construction and Preprocessing

The input to ResGrouped-MLP is a 9-dimensional vector of feature differences, derived from a pipeline in which both the original and distorted point clouds are first downsampled at three voxel-grid scales (voxel sizes 2.0, 4.0, and 8.0). For each sampled region, three attributes are extracted: Y-luma, Cr-chroma, and Cu-curvature, denoted $F \in \{Y, Cr, Cu\}$, across scales $\beta \in \{\mathrm{H}, \mathrm{M}, \mathrm{L}\}$. Radial Basis Function (RBF) implicit functions $f_F^{\alpha,\beta}$ are fitted per feature and scale, yielding coefficient vectors $W_F^{\alpha,\beta}$, where $\alpha$ indexes the original ($O$) or distorted ($D$) cloud.

The difference per coefficient, for a given feature and scale, is computed as

$$d_{F,\beta,k} = \frac{\left|w_{F,\beta,k}^O - w_{F,\beta,k}^D\right|}{\max\!\left(w_{F,\beta,k}^O,\, w_{F,\beta,k}^D\right)},$$

where $O$ and $D$ index the original and distorted point clouds. This difference is averaged over all coefficients $k$ per (feature, scale) pair, yielding $d_{F,\beta}$. The aggregated vector is $x = [d_{Y,H}, d_{Cr,H}, d_{Cu,H}, d_{Y,M}, d_{Cr,M}, d_{Cu,M}, d_{Y,L}, d_{Cr,L}, d_{Cu,L}]^{\top} \in \mathbb{R}^9$.

Before passing to the regression network, each element undergoes a Log-Modulus transformation, $\tilde{x} = \operatorname{sign}(x) \cdot \ln(1 + |x|)$, followed by z-score normalization to mitigate the heavy-tailed distribution of raw coefficient differences.
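To make the preprocessing concrete, the following is a minimal NumPy sketch of the coefficient-difference, Log-Modulus, and z-score steps described above. The function names, the small epsilon guard in the denominator, and the random stand-in coefficient vectors are illustrative assumptions, not part of the published pipeline.

```python
import numpy as np

def coeff_difference(w_orig, w_dist, eps=1e-8):
    """Mean over coefficients k of |w^O_k - w^D_k| / max(w^O_k, w^D_k).
    The eps guard (an assumption) avoids division by zero."""
    return float(np.mean(np.abs(w_orig - w_dist) /
                         (np.maximum(w_orig, w_dist) + eps)))

def log_modulus(x):
    """Log-Modulus transform: sign(x) * ln(1 + |x|)."""
    return np.sign(x) * np.log1p(np.abs(x))

# Assemble the 9-D input from stand-in RBF coefficient vectors, ordered
# (Y,H), (Cr,H), (Cu,H), (Y,M), (Cr,M), (Cu,M), (Y,L), (Cr,L), (Cu,L).
rng = np.random.default_rng(0)
x = np.array([coeff_difference(rng.random(16), rng.random(16))
              for _ in range(9)])

# Log-Modulus then z-score (here over the vector itself; in training the
# normalization statistics would come from the training set).
x_tilde = log_modulus(x)
x_norm = (x_tilde - x_tilde.mean()) / x_tilde.std()
```

The z-score over a single 9-vector is only for self-containment; a deployed pipeline would normalize with dataset-level statistics.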

2. Grouped Encoding and Split-Transform-Merge Hierarchy

The core architectural principle is grouping: the 9-dimensional input vector is split into $G = 9$ disjoint groups, each corresponding to a specific (feature, scale) pair:

$$g_{(F,\beta)} = x_{i(F,\beta)},$$

where $i(F,\beta) \in \{1, \ldots, 9\}$ indexes the arrangement (e.g., $i(Y,H) = 1$, $i(Cr,H) = 2$, ...). Each group, a single scalar, is independently processed by a small residual MLP specific to that group.

This Split-Transform-Merge strategy isolates the early transformation of luma, chroma, and geometry at each scale, with the explicit goal of avoiding premature mixing of distinct semantic cues and maintaining sensitivity to domain-specific distortions. Each group encoder comprises:

  • An initial linear projection to a $D$-dimensional vector: $h^{(0)}_{c,s} = W^{in}_{c,s}\, g_{c,s} + b^{in}_{c,s}$, with $W^{in}_{c,s} \in \mathbb{R}^{D \times 1}$ and $b^{in}_{c,s} \in \mathbb{R}^{D}$ (subscripts $c$ and $s$ denote the feature channel and scale).
  • $L$ stacked residual blocks, each evaluated as:

$$h^{(\ell)}_{c,s} = h^{(\ell-1)}_{c,s} + F\bigl(h^{(\ell-1)}_{c,s};\, W^{(\ell)}\bigr),$$

where $F(h; W) = W_2\,\mathrm{SiLU}(\mathrm{LayerNorm}(W_1 h + b_1)) + b_2$, with $W_1, W_2 \in \mathbb{R}^{D \times D}$. LayerNorm normalizes across hidden units, and the SiLU activation is $x\,\sigma(x)$.

The use of residual connections (i.e., $x \rightarrow x + F(x)$) stabilizes training in deep MLP structures and facilitates effective gradient flow, while LayerNorm and SiLU support faster convergence, especially when optimizing over heavy-tailed features.
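The group encoder above can be sketched in a few lines of NumPy. The sketch instantiates the projection and residual-block equations with a small hidden width ($D = 8$) and $L = 2$ blocks; the parameter dictionary, random initialization, and omission of LayerNorm's learned affine parameters are simplifying assumptions.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize across hidden units (learned affine omitted for brevity)."""
    return (h - h.mean()) / np.sqrt(h.var() + eps)

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def residual_block(h, W1, b1, W2, b2):
    """h -> h + W2 @ SiLU(LayerNorm(W1 @ h + b1)) + b2."""
    return h + W2 @ silu(layer_norm(W1 @ h + b1)) + b2

def encode_group(g, params):
    """Encode one scalar group input into a D-dim vector: an input
    projection followed by L stacked residual blocks."""
    h = params["W_in"] * g + params["b_in"]  # (D,) projection of the scalar
    for W1, b1, W2, b2 in params["blocks"]:
        h = residual_block(h, W1, b1, W2, b2)
    return h

rng = np.random.default_rng(1)
D, L = 8, 2  # illustrative sizes; the paper's default is D = 64
params = {
    "W_in": rng.normal(size=D), "b_in": np.zeros(D),
    "blocks": [(rng.normal(size=(D, D)) * 0.1, np.zeros(D),
                rng.normal(size=(D, D)) * 0.1, np.zeros(D))
               for _ in range(L)],
}
h = encode_group(0.37, params)  # one (feature, scale) scalar -> D-dim code
```

Because each group is a single scalar, the input "weight matrix" degenerates to a $D$-vector, which is why `W_in * g` suffices in place of a matrix product.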

3. Channel-Wise Attention at Multiple Scales

After encoding, each group yields a $D$-dimensional vector, and groups sharing the same scale are concatenated to form $H_s = [h_{Y,s};\, h_{Cr,s};\, h_{Cu,s}] \in \mathbb{R}^{3D}$ for each $s \in \{\mathrm{H}, \mathrm{M}, \mathrm{L}\}$. On each $H_s$, a channel-wise attention mechanism is applied:

  • A bottleneck MLP computes attention weights $a_s \in \mathbb{R}^{3D}$:

$$a_s = \sigma\!\left(W_{2,s}\,\mathrm{ReLU}(W_{1,s}\, z_s + b_{1,s}) + b_{2,s}\right),$$

where $z_s$ is typically derived via global average pooling or is $H_s$ itself, $W_{1,s} \in \mathbb{R}^{(3D/r) \times 3D}$, $W_{2,s} \in \mathbb{R}^{3D \times (3D/r)}$, and $r = 4$. The output is used for channel-wise scaling: $H'_s = H_s \odot a_s$.

This mechanism adaptively re-weights channel activations within each scale, imitating the human visual system's focus on salient distortions and supporting selective emphasis on particularly informative features.
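A minimal NumPy sketch of the bottleneck attention above, using the $z_s = H_s$ variant; the reduction ratio follows $r = 4$, while the small hidden width and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(H, W1, b1, W2, b2):
    """Bottleneck attention: a = sigmoid(W2 @ ReLU(W1 @ z + b1) + b2),
    with z = H itself (the pooled variant is analogous). Returns the
    re-weighted vector H * a and the gate a."""
    a = sigmoid(W2 @ np.maximum(W1 @ H + b1, 0.0) + b2)
    return H * a, a

rng = np.random.default_rng(2)
D, r = 8, 4                    # D illustrative; r = 4 as in the paper
C = 3 * D                      # three feature groups per scale
W1 = rng.normal(size=(C // r, C)) * 0.1   # (3D/r) x 3D squeeze
W2 = rng.normal(size=(C, C // r)) * 0.1   # 3D x (3D/r) excite
H_s = rng.normal(size=C)       # concatenated [h_Y; h_Cr; h_Cu] at one scale
H_prime, a = channel_attention(H_s, W1, np.zeros(C // r), W2, np.zeros(C))
```

Since the sigmoid keeps every gate in $(0, 1)$, the module can only attenuate channels relative to one another, which is the intended re-weighting behavior.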

4. Hierarchical Fusion and Final Regression

The three attended scale vectors $H'_H, H'_M, H'_L$ are concatenated into $H_\mathrm{all} = [H'_H;\, H'_M;\, H'_L] \in \mathbb{R}^{9D}$. Optionally, a global channel attention block (mirroring the per-scale one) is applied, yielding $H'_\mathrm{all} = H_\mathrm{all} \odot a_g$. A two-layer MLP head with dimensions $9D \rightarrow 4D \rightarrow 1$ produces the final scalar quality score:

$$\hat{q} = W_{out,2}\,\mathrm{SiLU}\!\left(\mathrm{LayerNorm}(W_{out,1}\, H'_\mathrm{all} + b_{out,1})\right) + b_{out,2}.$$
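The fusion and regression head reduce to a concatenation plus one hidden layer. The NumPy sketch below assumes the optional global attention has already been applied (it consumes the attended scale vectors directly); dimensions and weights are illustrative stand-ins.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    return (h - h.mean()) / np.sqrt(h.var() + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

def regression_head(H_all, W1, b1, w2, b2):
    """Two-layer head 9D -> 4D -> 1 with LayerNorm + SiLU, mirroring the
    final regression equation; w2 is a 4D-vector since the output is scalar."""
    return float(w2 @ silu(layer_norm(W1 @ H_all + b1)) + b2)

rng = np.random.default_rng(3)
D = 8  # illustrative hidden width
# Attended per-scale vectors H'_H, H'_M, H'_L, each in R^{3D}:
H_scales = [rng.normal(size=3 * D) for _ in range(3)]
H_all = np.concatenate(H_scales)                 # (9D,) fused representation
W1 = rng.normal(size=(4 * D, 9 * D)) * 0.1       # 9D -> 4D
w2 = rng.normal(size=4 * D) * 0.1                # 4D -> 1
q_hat = regression_head(H_all, W1, np.zeros(4 * D), w2, 0.0)
```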

5. Training Procedure and Hyperparameter Choices

The network is trained to minimize

$$L_\mathrm{total} = L_\mathrm{MSE} + \lambda_1 L_\mathrm{PLCC} + \lambda_2 L_\mathrm{Rank},$$

where $L_\mathrm{MSE}$ is the mean-squared error against the ground-truth quality score, $L_\mathrm{PLCC}$ penalizes low Pearson Linear Correlation Coefficient between predicted and ground-truth scores, and $L_\mathrm{Rank}$ enforces ranking consistency. The weighting coefficients $\lambda_1, \lambda_2$ are determined by cross-validation (typical values: $\lambda_1 = 1$, $\lambda_2 = 0.1$).
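A minimal NumPy sketch of this composite objective. The PLCC term is implemented as $1 - \mathrm{PLCC}$ and the ranking term as a pairwise hinge loss; these are common choices, but the paper's exact forms of $L_\mathrm{PLCC}$ and $L_\mathrm{Rank}$ are not specified here, so both are assumptions.

```python
import numpy as np

def plcc_loss(pred, target, eps=1e-8):
    """1 - Pearson correlation between predicted and ground-truth scores."""
    p = pred - pred.mean()
    t = target - target.mean()
    r = (p @ t) / (np.linalg.norm(p) * np.linalg.norm(t) + eps)
    return 1.0 - r

def rank_loss(pred, target, margin=0.0):
    """Pairwise hinge ranking loss (an assumed form): penalize pairs whose
    predicted order disagrees with the ground-truth order."""
    loss, n = 0.0, 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            s = np.sign(target[i] - target[j])
            loss += max(0.0, margin - s * (pred[i] - pred[j]))
            n += 1
    return loss / max(n, 1)

def total_loss(pred, target, lam1=1.0, lam2=0.1):
    """L_total = L_MSE + lam1 * L_PLCC + lam2 * L_Rank, with the paper's
    typical weights lam1 = 1, lam2 = 0.1."""
    mse = np.mean((pred - target) ** 2)
    return mse + lam1 * plcc_loss(pred, target) + lam2 * rank_loss(pred, target)

pred = np.array([0.2, 0.5, 0.9])
target = np.array([0.25, 0.55, 0.85])
loss_val = total_loss(pred, target)
```

Perfectly calibrated predictions drive all three terms toward zero, while the PLCC and rank terms keep the head sensitive to correlation and ordering even when absolute errors are small.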

The network is optimized with AdamW (weight decay $10^{-2}$) under a cosine-annealing learning-rate schedule for 80 epochs with a batch size of 32, and is validated using repeated random train-test splits ($60\%$ train / $40\%$ test, 5 repeats). Critical hyperparameters include:

Module                     | Symbol | Default value
Groups                     | $G$    | 9
Group hidden dim           | $D$    | 64
Residual blocks per group  | $L$    | 2–3
Attention reduction ratio  | $r$    | 4
MLP head dimensions        | –      | $9D \rightarrow 4D \rightarrow 1$

The Log-Modulus transformation applied prior to normalization is essential for robust loss behavior, as it corrects the heavy tails in the RBF coefficient differences.

6. Architectural Motivation and Theoretical Context

The grouping scheme (Split-Transform-Merge) ensures luma, chroma, and curvature cues are independently encoded at each scale, maintaining sensitivity to their distinct physical semantics and preventing early information loss due to premature mixing. Residual blocks stabilize gradient propagation in deep, narrow MLPs; LayerNorm and SiLU promote more rapid convergence, especially in the presence of heavy-tailed features resulting from RBF coefficient computation.

Channel-wise attention modules, both per-scale and global, enable the network to mimic the hierarchical, adaptive focus mechanisms of the human visual system, such as increased attention to perceptually relevant distortions depending on regional or global salience.

The multi-scale fusion architecture, spanning H (High), M (Medium), and L (Low) granularities, echoes the multi-resolution analysis prominent in classical human visual system (HVS) models, allowing the network to aggregate both fine-grained and structural information in the final regression.

7. Empirical Utility and Domain Applicability

Within the MS-ISSM framework, ResGrouped-MLP maps nine handcrafted, multi-scale implicit-RBF features to a single objective quality estimate that correlates strongly with human perceptual scores on multiple point cloud benchmarks. Its modular architecture is readily extensible to additional features, scales, or modalities that can be reduced to comparable handcrafted representations, and it is agnostic to the point cloud dataset layout, voxelization scale, and RBF parameterization beyond the constraints imposed by the input feature pipeline. This suggests broad applicability in objective quality assessment for irregular geometric data, particularly where physical feature changes must be mapped accurately to perceptual outcomes (Chen et al., 3 Jan 2026).
