ResGrouped-MLP Quality Assessment Network
- The paper introduces a novel grouped-MLP architecture that leverages multi-scale RBF features and residual connections to accurately predict 3D point cloud quality.
- It employs a Split-Transform-Merge strategy to separately process luma, chroma, and curvature cues across high, medium, and low scales, preserving key semantic details.
- Channel-wise attention and hierarchical fusion mechanisms enhance sensitivity to perceptually salient distortions, ensuring robust objective quality assessment.
The ResGrouped-MLP Quality Assessment Network is an architectural component designed to predict perceptual quality scores for 3D point clouds by integrating multi-scale, physically-motivated feature representations through a structured multi-layer perceptron (MLP) with grouped encodings, residual connections, and channel-wise attention. Developed in the context of the MS-ISSM framework, ResGrouped-MLP departs from flat MLP designs to enable the network to distinctly encode, merge, and weigh luma, chroma, and geometric cues sampled across high, medium, and low spatial granularities, thereby preserving feature semantic integrity and enhancing reliability in objective point cloud quality assessment (Chen et al., 3 Jan 2026).
1. Input Feature Construction and Preprocessing
The input to ResGrouped-MLP consists of a 9-dimensional vector of feature differences, derived from a pipeline in which both the original and distorted point clouds are first downsampled into three voxel-grid scales (voxel sizes 2.0, 4.0, and 8.0). For each sampled region, three attributes are extracted: Y-luma, Cr-chroma, and Cu-curvature, denoted $f \in \{Y, Cr, Cu\}$, across scales $s \in \{H, M, L\}$. Radial Basis Function (RBF) implicit functions, $\phi_{f,s}$, are fitted per feature and scale, yielding coefficient vectors $\mathbf{c}_{f,s}$.
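The exact RBF kernel and solver are not specified in this excerpt; as an illustrative sketch, a Gaussian-kernel least-squares fit of per-(feature, scale) coefficients might look as follows (the function name, `gamma`, and the toy data are assumptions):

```python
import numpy as np

def fit_rbf_coefficients(points, values, centers, gamma=1.0):
    """Least-squares fit of Gaussian RBF coefficients for one (feature, scale).

    points:  (N, 3) sampled point positions
    values:  (N,)   attribute values (e.g., Y-luma) at those points
    centers: (K, 3) RBF center positions
    Returns the coefficient vector c of length K.
    """
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    Phi = np.exp(-gamma * sq_dist)        # (N, K) design matrix of kernel values
    c, *_ = np.linalg.lstsq(Phi, values, rcond=None)
    return c

# Toy attribute field on random points, fitted against 20 random centers
rng = np.random.default_rng(0)
pts = rng.uniform(size=(200, 3))
vals = np.sin(pts[:, 0] * 3.0)
ctrs = rng.uniform(size=(20, 3))
c = fit_rbf_coefficients(pts, vals, ctrs, gamma=4.0)
```

Running the same fit on the original and the distorted cloud yields the two coefficient vectors whose differences feed the network.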
The difference per coefficient, for a given feature and scale, is computed as

$$\Delta c_{f,s,i} = \left| c^{(o)}_{f,s,i} - c^{(d)}_{f,s,i} \right|,$$

where $o$ and $d$ index the original and distorted point clouds. This difference is averaged over all coefficients per (feature, scale) pair, yielding $\bar{\Delta}_{f,s}$. The aggregated vector is $\mathbf{x} = [\bar{\Delta}_{f,s}]_{f \in \{Y, Cr, Cu\},\, s \in \{H, M, L\}} \in \mathbb{R}^{9}$.
Before passing to the regression network, each element undergoes a Log-Modulus transformation, $\tilde{x} = \operatorname{sign}(x)\cdot \ln(1 + |x|)$, followed by z-score normalization to mitigate the heavy-tailed distribution of raw coefficient differences.
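The two preprocessing steps above can be sketched directly in NumPy; `mean` and `std` are assumed to be statistics of the Log-Modulus-transformed training features:

```python
import numpy as np

def log_modulus(x):
    """Log-Modulus transform: sign(x) * ln(1 + |x|), compresses heavy tails."""
    return np.sign(x) * np.log1p(np.abs(x))

def preprocess(features, mean, std):
    """Apply Log-Modulus then z-score normalization to a feature vector."""
    t = log_modulus(np.asarray(features, dtype=float))
    return (t - mean) / std

# Example: heavy-tailed coefficient differences (illustrative values)
x = np.array([0.01, -5.0, 120.0, -0.3, 2.0, 0.0, 30.0, -60.0, 7.0])
t = log_modulus(x)
z = preprocess(x, t.mean(), t.std())
```

Unlike a plain logarithm, the Log-Modulus transform is defined for zero and negative inputs and preserves the sign of each difference.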
2. Grouped Encoding and Split-Transform-Merge Hierarchy
The core architectural principle is grouping: the 9-dimensional input vector is split into $G = 9$ disjoint groups, each corresponding to a specific (feature, scale) pair:

$$x_g = \bar{\Delta}_{f,s}, \qquad g = 1, \dots, 9,$$

where $g$ indexes the arrangement (e.g., $(Y, H)$, $(Cr, H)$, ...). Each group, a single scalar, is independently processed by a small residual MLP specific to that group.
This Split-Transform-Merge strategy isolates the early transformation of luma, chroma, and geometry at each scale, with the explicit goal of avoiding premature mixing of distinct semantic cues and maintaining sensitivity to domain-specific distortions. Each group encoder comprises:
- An initial linear projection to a $d$-dimensional vector: $h_g^{(0)} = W_g^{(0)} x_g + b_g^{(0)}$, with $W_g^{(0)} \in \mathbb{R}^{d \times 1}$ and $b_g^{(0)} \in \mathbb{R}^{d}$.
- $B$ stacked residual blocks, each evaluated as:

$$h_g^{(b+1)} = h_g^{(b)} + \mathrm{SiLU}\!\left(\mathrm{LayerNorm}\!\left(W_g^{(b)} h_g^{(b)} + b_g^{(b)}\right)\right),$$

where $W_g^{(b)} \in \mathbb{R}^{d \times d}$ and $b_g^{(b)} \in \mathbb{R}^{d}$, for blocks $b = 1, \dots, B$. LayerNorm normalizes across hidden units, and the SiLU activation is $\mathrm{SiLU}(z) = z \cdot \sigma(z)$.
The use of residual connections (i.e., $h \mapsto h + F(h)$) stabilizes training in deep MLP structures and facilitates effective gradient flow, while LayerNorm and SiLU support faster convergence, especially when optimizing over heavy-tailed features.
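A minimal NumPy sketch of one group encoder follows, using the default dimensions from the hyperparameter table ($d = 64$, $B = 2$); the random weight initialization is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, B = 64, 2  # group hidden dim and residual block count (paper defaults)

def silu(z):
    return z / (1.0 + np.exp(-z))  # SiLU: z * sigmoid(z)

def layer_norm(h, eps=1e-5):
    return (h - h.mean()) / np.sqrt(h.var() + eps)

def group_encoder(x_g, W0, b0, blocks):
    """Encode a single scalar group input into a d-dimensional vector.

    blocks: list of (W, b) pairs, each W of shape (d, d).
    """
    h = W0[:, 0] * x_g + b0                      # initial 1 -> d linear projection
    for W, b in blocks:
        h = h + silu(layer_norm(W @ h + b))      # residual block with LayerNorm + SiLU
    return h

W0, b0 = rng.normal(size=(d, 1)), np.zeros(d)
blocks = [(rng.normal(scale=0.1, size=(d, d)), np.zeros(d)) for _ in range(B)]
h = group_encoder(0.7, W0, b0, blocks)
```

In the full network, nine such encoders (one per (feature, scale) pair) run independently on their respective scalars.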
3. Channel-Wise Attention at Multiple Scales
After encoding, each group yields a $d$-dimensional vector, and groups sharing the same scale are concatenated to form $z_s \in \mathbb{R}^{3d}$ for each $s \in \{H, M, L\}$. On each $z_s$, a channel-wise attention mechanism is applied:
- A bottleneck MLP computes attention weights $a_s \in (0, 1)^{3d}$:

$$a_s = \sigma\!\left(W_2 \,\mathrm{SiLU}\!\left(W_1 u_s\right)\right),$$

where $u_s$ is typically derived via global average pooling or is $z_s$ itself, $W_1 \in \mathbb{R}^{(3d/r) \times 3d}$, $W_2 \in \mathbb{R}^{3d \times (3d/r)}$, and $r$ is the reduction ratio. The output is used for channel-wise scaling: $\tilde{z}_s = a_s \odot z_s$.
This mechanism adaptively re-weights channel activations within each scale, imitating the human visual system's focus on salient distortions and supporting selective emphasis on particularly informative features.
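A squeeze-and-excitation-style sketch of this per-scale attention, with $3d = 192$ channels and reduction ratio $r = 4$ (the table defaults); the random weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def silu(z):
    return z * sigmoid(z)

def channel_attention(z_s, W1, W2):
    """Bottleneck attention: re-weight each channel of z_s by a gate in (0, 1)."""
    a = sigmoid(W2 @ silu(W1 @ z_s))   # attention weights, one per channel
    return a * z_s                     # channel-wise scaling

rng = np.random.default_rng(1)
dim, r = 192, 4                        # 3 groups x d=64 channels, reduction ratio 4
W1 = rng.normal(scale=0.1, size=(dim // r, dim))
W2 = rng.normal(scale=0.1, size=(dim, dim // r))
z_s = rng.normal(size=dim)
z_att = channel_attention(z_s, W1, W2)
```

Because every gate lies in $(0, 1)$, the attended vector can only attenuate channels, never amplify them; learning concentrates the retained magnitude on the informative ones.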
4. Hierarchical Fusion and Final Regression
The three attended scale vectors are concatenated into $z = [\tilde{z}_H; \tilde{z}_M; \tilde{z}_L] \in \mathbb{R}^{9d}$. Optionally, a global channel attention block (mirroring the per-scale one) is applied, yielding $\tilde{z}$. A two-layer MLP head then produces the final scalar quality score: $\hat{q} = W^{(2)}_{\mathrm{head}}\,\mathrm{SiLU}\!\left(W^{(1)}_{\mathrm{head}} \tilde{z} + b^{(1)}_{\mathrm{head}}\right) + b^{(2)}_{\mathrm{head}}$.
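The regression head can be sketched as follows; the fused input width ($9 \times 64 = 576$) and the hidden width of 64 are assumptions for illustration, as are the random weights:

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))  # SiLU: z * sigmoid(z)

def regression_head(z, W1, b1, W2, b2):
    """Two-layer MLP head mapping the fused feature vector to a scalar score."""
    return (W2 @ silu(W1 @ z + b1) + b2).item()

rng = np.random.default_rng(2)
d_in, d_hidden = 576, 64  # assumed: 9 groups x 64 channels fused, hidden width 64
W1, b1 = rng.normal(scale=0.05, size=(d_hidden, d_in)), np.zeros(d_hidden)
W2, b2 = rng.normal(scale=0.05, size=(1, d_hidden)), np.zeros(1)
q = regression_head(rng.normal(size=d_in), W1, b1, W2, b2)
```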
5. Training Procedure and Hyperparameter Choices
The network is trained to minimize

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda_1 \mathcal{L}_{\mathrm{PLCC}} + \lambda_2 \mathcal{L}_{\mathrm{rank}},$$

where $\mathcal{L}_{\mathrm{MSE}}$ is the mean-squared error with the ground-truth quality score, $\mathcal{L}_{\mathrm{PLCC}}$ penalizes mismatch with the Pearson Linear Correlation Coefficient, and $\mathcal{L}_{\mathrm{rank}}$ captures ranking consistency. The weighting coefficients $\lambda_1$ and $\lambda_2$ are determined by cross-validation.
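The exact forms of the correlation and ranking terms are not given in this excerpt; a common instantiation, assumed here, uses $1 - \mathrm{PLCC}$ and a pairwise logistic ranking penalty:

```python
import numpy as np

def plcc(pred, target):
    """Pearson linear correlation coefficient between two 1-D arrays."""
    p, t = pred - pred.mean(), target - target.mean()
    return (p @ t) / (np.linalg.norm(p) * np.linalg.norm(t) + 1e-12)

def pairwise_rank_loss(pred, target):
    """Logistic penalty on prediction pairs ordered against the ground truth."""
    loss, n = 0.0, 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            s = np.sign(target[i] - target[j])
            if s != 0:
                loss += np.log1p(np.exp(-s * (pred[i] - pred[j])))
                n += 1
    return loss / max(n, 1)

def total_loss(pred, target, lam1=1.0, lam2=1.0):
    """MSE + lam1 * (1 - PLCC) + lam2 * ranking loss (assumed instantiation)."""
    mse = np.mean((pred - target) ** 2)
    return mse + lam1 * (1.0 - plcc(pred, target)) + lam2 * pairwise_rank_loss(pred, target)

pred = np.array([0.2, 0.5, 0.9, 0.4])
target = np.array([0.1, 0.6, 1.0, 0.3])
L = total_loss(pred, target)
```

The three terms pull in complementary directions: MSE fits absolute scores, the PLCC term rewards linear agreement, and the ranking term enforces correct ordering even when the score scale drifts.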
The network is optimized with AdamW (with weight decay) under a cosine-annealing learning-rate schedule for 80 epochs at batch size 32, and validated using repeated random train-test splits (5 repeats). Critical hyperparameters include:
| Module | Symbol | Default Value |
|---|---|---|
| Groups | $G$ | 9 |
| Group hidden dim | $d$ | 64 |
| Residual blocks/group | $B$ | 2–3 |
| Attention reduction | $r$ | 4 |
| MLP head dimensions | – | 9→4→1 |
Log-Modulus transformation prior to normalization is essential for robust loss function behavior by correcting for the heavy tails in RBF coefficient differences.
6. Architectural Motivation and Theoretical Context
The grouping scheme (Split-Transform-Merge) ensures luma, chroma, and curvature cues are independently encoded at each scale, maintaining sensitivity to their distinct physical semantics and preventing early information loss due to premature mixing. Residual blocks stabilize gradient propagation in deep, narrow MLPs; LayerNorm and SiLU promote more rapid convergence, especially in the presence of heavy-tailed features resulting from RBF coefficient computation.
Channel-wise attention modules, both per-scale and global, enable the network to mimic the hierarchical, adaptive focus mechanisms of the human visual system, such as increased attention to perceptually relevant distortions depending on regional or global salience.
The multi-scale fusion architecture, spanning H (High), M (Medium), and L (Low) granularities, echoes the multi-resolution analysis prominent in classical human visual system (HVS) models, allowing the network to aggregate both fine-grained and structural information in the final regression.
7. Empirical Utility and Domain Applicability
Within the MS-ISSM framework, ResGrouped-MLP maps nine handcrafted, multi-scale implicit-RBF features into a single objective quality estimate that correlates strongly with human perceptual scores on multiple point cloud benchmarks. Its modular architecture is readily extensible to additional features, scales, or modalities that can be reduced to comparable handcrafted representations, and it is independent of the specifics of the point cloud dataset layout, voxelization scale, or RBF parameterization, apart from the constraints imposed by the input feature pipeline. This suggests broad applicability in objective quality assessment for irregular geometric data, particularly where physical feature changes must be mapped accurately to perceptual outcomes (Chen et al., 3 Jan 2026).