DynaMesh-Rater: 4D Mesh Quality Assessment
- DynaMesh-Rater is a multimodal framework that fuses visual, motion, and geometry features to continuously assess the perceptual quality of dynamic 4D human meshes.
- It employs a LoRA-based instruction tuning mechanism on a large language model to optimize continuous quality score regression from unified multimodal embeddings.
- Experimental results on the DHQA-4D dataset demonstrate superior accuracy and robustness over traditional full-reference and no-reference quality assessment methods.
DynaMesh-Rater is a novel large multimodal model (LMM)-based framework for perceptual quality assessment of dynamic 4D digital human meshes, with particular emphasis on textured and non-textured mesh sequences subject to various distortions during acquisition, transmission, and processing (Li et al., 4 Oct 2025). The system integrates visual, motion, and geometry features extracted from mesh and projected video representations, and employs a LoRA-based instruction tuning mechanism to optimize continuous quality score regression via a unified LLM backend. Experiments on the DHQA-4D dataset, containing 4D mesh sequences and corresponding mean opinion scores (MOS), demonstrate that DynaMesh-Rater surpasses prior full-reference and no-reference image and video quality assessment methods in both accuracy and robustness.
1. Framework Architecture and Feature Modalities
DynaMesh-Rater operates by processing dynamic 4D mesh sequences $\mathcal{M} = \{M_1, \dots, M_T\}$, together with associated 2D videos $V$ rendered from these meshes. The computational pipeline consists of three main branches for feature extraction:
- Visual Features ($f_{vis}$): Derived from sparsely sampled projected video frames using a state-of-the-art vision transformer backbone (e.g., InternViT). Encoded features are then adapted for the LMM via a dedicated two-layer MLP projector $P_{vis}$: $f_{vis} = P_{vis}(E_{vis}(V))$, where $E_{vis}$ denotes the vision encoder.
- Motion Features ($f_{mot}$): The projected 2D video sequence is segmented into uniform clips $\{v_1, \dots, v_K\}$, with each clip processed by a SlowFast network for temporal feature encoding. These features are then mapped using another MLP projector $P_{mot}$: $f_{mot} = P_{mot}(\mathrm{SlowFast}(\{v_k\}))$.
- Geometry Features ($f_{geo}$): Direct mesh features are extracted via dihedral angle statistics: mean, variance, entropy, generalized Gaussian distribution (GGD) parameters, and Gamma distribution parameters. A third MLP projector $P_{geo}$ maps these shape descriptors into the LMM input space: $f_{geo} = P_{geo}(s)$, where $s$ is the vector of dihedral-angle statistics (a minimal sketch of this computation follows the list).
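The paper names these statistics but not their implementation. Below is a minimal Python sketch, assuming a triangle mesh loaded with `trimesh` and maximum-likelihood distribution fits from `scipy.stats`; both the library choices and the estimator details (histogram entropy, bin count) are assumptions, not the paper's exact recipe.

```python
import numpy as np
import trimesh
from scipy import stats

def dihedral_angle_statistics(mesh: trimesh.Trimesh) -> np.ndarray:
    """Summarize the dihedral-angle distribution of a triangle mesh.

    Returns the statistics named in the paper: mean, variance, entropy,
    GGD parameters, and Gamma parameters. The estimators below are
    illustrative choices, not the paper's confirmed implementation.
    """
    # Angle between the normals of each pair of adjacent faces.
    angles = np.asarray(mesh.face_adjacency_angles)

    mean, var = angles.mean(), angles.var()

    # Shannon entropy of a 64-bin histogram of the angles.
    counts, _ = np.histogram(angles, bins=64)
    p = counts[counts > 0] / counts.sum()
    entropy = -(p * np.log(p)).sum()

    # Generalized Gaussian (gennorm) and Gamma fits via maximum likelihood.
    ggd_beta, ggd_loc, ggd_scale = stats.gennorm.fit(angles)
    gam_a, _, gam_scale = stats.gamma.fit(angles, floc=0.0)

    return np.array([mean, var, entropy,
                     ggd_beta, ggd_loc, ggd_scale, gam_a, gam_scale])

# Per-frame usage; a 4D sequence yields one statistics vector per frame.
# stats_vec = dihedral_angle_statistics(trimesh.load_mesh("frame_000.obj"))
```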
After independent encoding, $f_{vis}$, $f_{mot}$, and $f_{geo}$ are concatenated and input to the LLM, which is augmented with an MLP head for quality score regression.
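The authors do not publish reference code; the following PyTorch sketch illustrates the projector-and-concatenation pattern described above. All dimensions, module names, and token counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoLayerProjector(nn.Module):
    """Two-layer MLP mapping a branch-specific feature into the LLM embedding space."""
    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

llm_dim = 4096                                # assumed LLM hidden size
proj_vis = TwoLayerProjector(1024, llm_dim)   # assumed vision-encoder output dim
proj_mot = TwoLayerProjector(2304, llm_dim)   # assumed SlowFast output dim
proj_geo = TwoLayerProjector(8, llm_dim)      # dihedral-angle statistics vector

# Dummy per-branch features with shape (batch, tokens, feat_dim).
f_vis = proj_vis(torch.randn(1, 16, 1024))
f_mot = proj_mot(torch.randn(1, 4, 2304))
f_geo = proj_geo(torch.randn(1, 1, 8))

# Concatenate along the token axis; in the full system these tokens are
# interleaved with the text prompt before being fed to the LLM (omitted here).
multimodal_tokens = torch.cat([f_vis, f_mot, f_geo], dim=1)
print(multimodal_tokens.shape)  # torch.Size([1, 21, 4096])
```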
2. Multi-Modal Feature Integration and Quality Regression
In contrast to conventional classification-based or discrete-level quality assessment, DynaMesh-Rater regresses a continuous perceptual quality score in the range $0$–$100$ by jointly analyzing the multi-modal embeddings. The final prediction module reads the last hidden state of the LLM and outputs the predicted score $\hat{q} = \mathrm{MLP}(h)$, where $h$ encapsulates fused information from all feature types.
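A minimal sketch of such a regression head follows, assuming the hidden state is pooled at the final token and the head is trained against MOS labels with an L1 loss; the paper's exact pooling scheme and training objective are not specified here.

```python
import torch
import torch.nn as nn

class QualityHead(nn.Module):
    """Regress a continuous 0-100 quality score from the LLM's last hidden state."""
    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(llm_dim, 512), nn.GELU(), nn.Linear(512, 1))

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, seq_len, llm_dim); pool at the final token.
        h = last_hidden[:, -1, :]
        return self.mlp(h).squeeze(-1)

head = QualityHead()
last_hidden = torch.randn(2, 21, 4096)   # dummy LLM output
mos = torch.tensor([72.5, 41.0])         # dummy ground-truth MOS labels
loss = nn.functional.l1_loss(head(last_hidden), mos)
loss.backward()
```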
The simultaneous consideration of appearance, temporal dynamics, and geometric details confers superior robustness to DynaMesh-Rater compared to models based on a single modality. For instance, geometry features are explicitly designed to capture mesh-specific distortions (such as irregularity in face angles), which cannot be reliably inferred from visual features alone, while motion branch descriptors are sensitive to dynamic artifacts such as temporal inconsistency or jitter.
3. Instruction Tuning via LoRA
A central design element is the application of Low-Rank Adaptation (LoRA) for efficient instruction tuning. LoRA is applied both to pretrained vision encoders and the LLM itself, introducing low-rank updates to weight matrices without altering the full parameterization. This mechanism enables the model to adapt representations for the specific downstream task of mesh quality assessment, learning human perceptual preferences from continuous annotation examples.
LoRA-based tuning ensures that the model achieves strong generalization performance with modest computational overhead, as the number of updated parameters is a fraction of the total. It also enhances compatibility between independently encoded feature modalities and the LLM backbone.
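As a concrete illustration of the mechanism (not the paper's exact configuration), the Hugging Face `peft` library injects low-rank adapters into selected weight matrices of a frozen pretrained model. The backbone checkpoint, rank, scaling, and target modules below are all assumed values.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder LLaMA-style backbone; the paper's actual LLM is not reproduced here.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
)

lora_cfg = LoraConfig(
    r=16,                    # low-rank dimension (assumed)
    lora_alpha=32,           # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Because only the adapter matrices receive gradients, the memory and compute cost of tuning scales with the chosen rank rather than with the full model size.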
4. DHQA-4D Dataset and Experimental Protocol
The DHQA-4D dataset, constructed for this purpose, comprises:
| Category | Description | Quantity |
|---|---|---|
| Raw 4D mesh sequences | High-quality real-scanned human avatars (with temporal frames) | 32 sequences |
| Distorted meshes | Meshes degraded by 11 distinct textured distortions | 1920 |
| Mean opinion scores (MOS) | Subjective ratings for both textured/non-textured mesh variants | Provided |
Evaluation metrics include Spearman Rank Order Correlation Coefficient (SRCC), Pearson Linear Correlation Coefficient (PLCC), and Kendall’s Rank Correlation Coefficient (KRCC), measuring agreement between predicted and ground-truth MOS.
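All three criteria are standard and available in `scipy.stats`; a minimal sketch on dummy predicted and ground-truth MOS values:

```python
import numpy as np
from scipy import stats

mos_true = np.array([62.1, 45.3, 78.9, 30.4, 55.0])   # dummy ground-truth MOS
mos_pred = np.array([60.0, 48.2, 75.1, 33.7, 52.8])   # dummy model predictions

srcc, _ = stats.spearmanr(mos_true, mos_pred)   # rank-order agreement
plcc, _ = stats.pearsonr(mos_true, mos_pred)    # linear agreement
krcc, _ = stats.kendalltau(mos_true, mos_pred)  # pairwise rank agreement

print(f"SRCC={srcc:.3f}  PLCC={plcc:.3f}  KRCC={krcc:.3f}")
```

In quality-assessment practice, PLCC is often reported after fitting a monotonic logistic mapping between predictions and MOS; that calibration step is omitted in this sketch.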
5. Experimental Results and Ablation
On both textured and non-textured subsets, DynaMesh-Rater demonstrates superior correlation with human judgments across SRCC, PLCC, and KRCC compared with state-of-the-art full-reference and no-reference image/video quality assessment methods (e.g., PSNR, SSIM, MANIQA, KSVQE). Ablation analysis confirms that:
- Visual features alone are insufficient for capturing shape-based degradations.
- Geometry features provide discriminative power for mesh-specific irregularities.
- Motion features improve sensitivity to temporal artifacts and enhance overall score accuracy.
- LoRA instruction tuning consistently improves performance over naive fine-tuning.
The system effectively combines all modalities for comprehensive perceptual quality prediction.
6. Model Significance and Application Contexts
DynaMesh-Rater addresses the critical need for automatic, multimodal perceptual quality assessment in 4D digital human avatars—a domain lacking robust, standardized metrics due to the complexity of mesh topology, texture, and dynamic movement. It enables quantitative evaluation of both textured and non-textured meshes under diverse noise conditions encountered in data capture, compression, and streaming.
Application scenarios include:
- Game production and animation generation, where mesh quality directly affects user experience.
- Immersive remote communication with dynamic avatars, requiring consistent temporal and geometric fidelity.
- Benchmarking mesh processing pipelines for robustness against complex, multimodal degradations.
7. Limitations and Future Directions
While DynaMesh-Rater establishes a framework for continuous, multimodal mesh quality prediction, its efficacy depends on the diversity and representativeness of the DHQA-4D dataset. The approach is tailored to human avatars with available video projections and explicit mesh features; generalization to other mesh-based domains or situations lacking video proxies may require modifying feature extraction routines. Instruction-tuned LMMs may also require retraining on new domains or additional types of mesh distortions.
A plausible implication is that the model’s architecture allows further extension to include semantic-level features and rater-specific aggregation (analogous to developments in multi-rater medical image segmentation models), as well as dynamic scoring in interactive or real-time quality assessment settings.
In summary, DynaMesh-Rater provides a unified multimodal, instruction-tuned system for continuous perceptual quality assessment of dynamic 4D digital human meshes. By integrating visual, motion, and geometry features through a large multimodal model adapted via LoRA-based instruction tuning, it yields state-of-the-art accuracy relevant to complex avatar production and evaluation pipelines (Li et al., 4 Oct 2025).