Video Aesthetic Quality Assessment

Updated 27 April 2026

Video Aesthetic Quality Assessment is the process of modeling and predicting a video's perceptual appeal by analyzing subjective factors like composition, lighting, and emotional impact.
Modern approaches use dual- or multi-branch architectures and composite loss functions to disentangle and fuse aesthetic attributes with technical quality.
Applications range from automated content curation and editing to guiding AI-generated video synthesis, with ongoing research on temporal dynamics and cross-domain generalization.

Video Aesthetic Quality Assessment (Video AQA) is the task of quantitatively modeling and predicting the perceptual appeal of video content, focusing on high-level, subjective factors such as composition, lighting, and emotional impact. Unlike traditional Video Quality Assessment (VQA), which typically measures objective impairments (e.g., compression artifacts, blurring), Video AQA explicitly targets the mechanisms that govern human judgments of “beauty,” cinematic merit, and artistic intent in videos (Pu et al., 15 Sep 2025, Wu et al., 2022). This domain has evolved rapidly due to the emergence of diverse, large-scale datasets with professional multi-dimensional annotations, refined multi-branch architectures, and training paradigms that integrate both technical and aesthetic components. Video AQA now spans user-generated content, AI-generated video, and complex multimodal scenes, demanding robust, interpretable, and generalizable assessment systems.

1. Key Concepts and Problem Formulation

Video Aesthetic Quality Assessment seeks to model and predict how human viewers appraise the overall perceptual attractiveness of a video. Unlike single-score VQA metrics, AQA emphasizes decomposition into interpretable, perceptually orthogonal dimensions: overall aesthetics, composition, lighting, color, shot size, and others (Pu et al., 15 Sep 2025, Qiao et al., 29 Oct 2025, Lin et al., 18 Feb 2026). The predominant annotation protocol involves collecting Mean Opinion Scores (MOS) or integer/ordinal ratings for both holistic (overall appeal) and attribute-specific (e.g., “composition,” “lighting,” “color grading”) aesthetic dimensions, often accompanied by language rationales or chain-of-thought explanations.

The primary technical challenge in AQA is the disentanglement of perceptual aesthetics from technical quality. Empirical studies reveal that most subjective video MOS are a weighted combination of aesthetic and technical perceptions, but each dimension is driven by distinct visual cues and semantics (Wu et al., 2022). Modern approaches formalize this decomposition, learning separate mapping functions or model branches (e.g., via dual-stream or multi-task architectures) that capture high-level aesthetics and low-level technical fidelity, then fusing their predictions according to empirically derived weights (Wang et al., 13 Jun 2025, Wu et al., 2022).
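
The fusion step described above can be illustrated with a minimal sketch. It assumes each branch already outputs a scalar score on a common scale. The function names and the default weights (0.4 and 0.6) are hypothetical placeholders, not values taken from the cited papers. The second function shows one plausible way to derive such weights empirically, by ordinary least squares against overall human ratings.

```python
import numpy as np

def fuse_scores(aesthetic, technical, w_aesthetic=0.4, w_technical=0.6):
    """Weighted linear fusion of two branch scores into one overall score.

    The default weights are hypothetical placeholders, not published values.
    """
    total = w_aesthetic + w_technical
    if total <= 0:
        raise ValueError("fusion weights must sum to a positive value")
    return (w_aesthetic * aesthetic + w_technical * technical) / total

def fit_fusion_weights(aesthetic, technical, overall):
    """Least-squares estimate of (w_aesthetic, w_technical) from rated videos."""
    X = np.column_stack([aesthetic, technical]).astype(float)
    y = np.asarray(overall, dtype=float)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights
```

In practice the branch scores would typically be normalized before fitting so that the weights are comparable across datasets.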

2. Datasets, Annotation Protocols, and Taxonomies

The field is grounded in several large-scale, professionally annotated datasets that underpin both training and benchmarking:

DIVIDE-3k: 3,590 user-generated videos, rated by 51 experts along overall, aesthetic, and technical axes, as well as proportion-of-impact labeling (Wu et al., 2022).
VADB: 10,490 videos, 13+ professional annotations per video, covering 10 attribute-level aesthetic scores (composition, lighting, shot size, depth of field, etc.), with attribute tags and free-form textual comments (Qiao et al., 29 Oct 2025).
UltraVQA: ≈40,000 user-generated video clips, each labeled by at least 3 professionals along five subjective dimensions, including “Aesthetic Quality” with sub-attribute tags (composition, lighting, color grading, visual appeal) and text rationales (Lin et al., 18 Feb 2026).
MVQA-68K: Over 38,000 videos and 68,000+ QA pairs across seven orthogonal dimensions (including overall aesthetics, composition, camera movement, texture, visual quality, and factual consistency), with fine-grained chain-of-thought rationales for each score (Pu et al., 15 Sep 2025).
VideoAesBench: 1,804 videos spanning UGC, AI-generated, robotic, compressed, and game domains, assessed via 12 atomic dimensions and multiple question formats, benchmarking 23 open/commercial LMMs on multi-faceted aesthetic reasoning (Li et al., 29 Jan 2026).

Consensus protocols generally employ Likert scales (e.g., 1–10 or 1–5, in some cases with 0.5-point steps) (Qiao et al., 29 Oct 2025, Lin et al., 18 Feb 2026), raters with professional credentials (3–37+ per video), and quality control via statistical filtering and snap-to-grid averaging. Inter-rater reliability for core aesthetic labels is moderate to high (Krippendorff’s α ≈ 0.59–0.76 for overall and attribute-level scores) (Qiao et al., 29 Oct 2025, Lin et al., 18 Feb 2026).
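
As a rough illustration of the quality-control step, the sketch below filters outlying ratings and then snaps the average to the rating grid. The cited datasets do not specify their exact procedure here, so the z-score filter, the `z_max` threshold, and the function name are assumptions made for illustration only.

```python
import math
import statistics

def aggregate_ratings(ratings, step=0.5, z_max=1.5):
    """Filter outlying ratings, average the rest, and snap to the rating grid.

    Assumed procedure: drop ratings more than z_max population standard
    deviations from the mean, then round the mean of the remaining ratings
    to the nearest multiple of `step` (ties round up).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    mean = statistics.fmean(ratings)
    sd = statistics.pstdev(ratings)
    kept = [r for r in ratings if sd == 0 or abs(r - mean) <= z_max * sd]
    avg = statistics.fmean(kept)
    return math.floor(avg / step + 0.5) * step

print(aggregate_ratings([7, 7.5, 8, 7, 2]))  # 7.5 (the rating of 2 is filtered out)
```

With only a handful of raters per video, a single rating can never lie many standard deviations from the mean, which is why the threshold in this sketch is kept low.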

Aesthetic dimensions are often grouped into three broad categories: visual form (composition, shot size, elements/structure), visual style (lighting, color, tone, creativity), and affectiveness (emotion, theme, viewer interest) (Li et al., 29 Jan 2026). Attribute-level labels and textual rationales substantially improve interpretability and model alignment with human preference (Qiao et al., 29 Oct 2025, Lin et al., 18 Feb 2026, Pu et al., 15 Sep 2025).

3. Model Architectures and Methodological Advances

Contemporary Video AQA frameworks employ complex, modular architectures:

Dual-/Multi-Branch Designs: DOVER and EyeSim-VQA decompose input videos into two streams: an aesthetic branch (uniformly sampled, downscaled frames for high-level semantics and composition) and a technical branch (patch-based, fine-detail, contiguous clips for distortion modeling). Each branch uses a specialized backbone, often a 3D ConvNeXt pretrained on AVA for the aesthetic branch and Video Swin-Tiny for the technical branch, with heads trained for perspective-specific prediction. Linear fusion with human-aligned weights yields the overall MOS (Wu et al., 2022, Wang et al., 13 Jun 2025).
Free-Energy-Guided Repair: EyeSim-VQA introduces per-branch enhancement modules trained with variational free-energy-inspired composite loss, simulating human visual system restoration. Enhancement modules employ BasicVSR-mini and CleanNet, with pixel, identity, and IQA-guided perceptual losses (Wang et al., 13 Jun 2025).
Multimodal and Vision-Language Decoders: Architectures such as AIGV-Assessor and UltraVQA use large vision-language transformers (VLMs, e.g., Qwen2.5-VL-7B, InternVL2-8B) for frame embedding and reasoning over both video content and optionally text prompts or comments. Multimodal fusion and LoRA-based fine-tuning enable joint regression over multi-dimensional outputs and rationales (Lin et al., 18 Feb 2026, Wang et al., 2024, Pu et al., 15 Sep 2025).
Contrastive Pre-training: VADB-Net pre-trains video encoders using bidirectional video-comment/tag contrastive learning, enabling the extraction of fine-grained, attribute-sensitive aesthetic features, followed by frozen regression heads (Qiao et al., 29 Oct 2025).
Chain-of-Thought and Causal Reasoning: MVQA-68K and similar work show that models trained with explicit rationales outperform purely regressive or classification approaches, particularly on zero-shot generalization tasks (Pu et al., 15 Sep 2025).
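
To make the bidirectional video-comment contrastive objective mentioned above more concrete, here is a minimal NumPy sketch of a symmetric InfoNCE-style loss. This is a generic CLIP-style formulation, assumed here for illustration. The exact loss used by VADB-Net is not given in this article, and the function names and the temperature of 0.07 are hypothetical choices.

```python
import numpy as np

def _log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def bidirectional_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of video_emb is assumed to be paired with row i of text_emb;
    all other rows in the batch act as negatives.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # cosine similarities, scaled
    idx = np.arange(len(v))
    loss_v2t = -_log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -_log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)
```

The loss is small when each video embedding is closest to its own comment embedding and grows when pairs are mismatched, which is what pushes the encoder toward attribute-sensitive features.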

4. Multi-dimensional Prediction, Training Objectives, and Loss Functions

Recent AQA frameworks shift from scalar MOS regression to multi-dimensional, interpretable prediction:

Dimensional Output: Models regress or classify videos along tightly defined quality axes: overall aesthetics, composition, lighting, technical fidelity, and in AIGC video assessment, temporal coherence and text–video alignment (Zhang et al., 2024, Wang et al., 2024, Pu et al., 15 Sep 2025).
Composite Loss Formulations: Dual- and multi-branch architectures optimize objectives that combine per-dimension mean squared error or cross-entropy terms, composite rank/regression objectives (e.g., $\mathcal{L}_{\mathrm{Rel}}$), and regularization terms for branch disentanglement (cross-scale restraint, contrastive loss, KL regularization in ASO) (Wang et al., 13 Jun 2025, Wu et al., 2022, Lin et al., 18 Feb 2026).
Analytic Score Optimization (ASO): This closed-form, KL-regularized objective reweights model predictions according to deviation from human ground truth, yielding soft targets that enforce correct ordinal structure and stabilize post-training alignment (Lin et al., 18 Feb 2026):

$\pi^*(s\mid x) = \frac{1}{Z(x)} \pi_{\mathrm{ref}}(s\mid x) \exp\left(\frac{1}{\lambda}R(s,s^*(x))\right)$

Pairwise and Preference Learning: AIGV-Assessor, UGVQ, and related models combine absolute MOS regression with preference learning loss on video pairs, aligning model outputs with human ranking (Wang et al., 2024, Zhang et al., 2024).
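
The ASO target distribution above can be evaluated directly when scores take a small set of discrete levels. The sketch below reads the formula in the usual KL-regularized sense: $\pi_{\mathrm{ref}}$ is the reference model's distribution over score levels, $Z(x)$ the normalizer, $\lambda$ the regularization strength, and $s^*(x)$ the human score. The reward $R(s, s^*) = -|s - s^*|$ and the function name are assumptions for illustration, since the article does not specify the reward.

```python
import numpy as np

def aso_soft_targets(ref_probs, score_levels, target_score, lam=0.5):
    """Closed-form target pi*(s|x) proportional to pi_ref(s|x) * exp(R / lam).

    Assumes ref_probs is strictly positive and that the reward is the
    negative absolute distance to the human score (an illustrative choice).
    """
    ref_probs = np.asarray(ref_probs, dtype=float)
    score_levels = np.asarray(score_levels, dtype=float)
    reward = -np.abs(score_levels - target_score)
    log_w = np.log(ref_probs) + reward / lam
    log_w -= log_w.max()                  # stabilize before exponentiating
    weights = np.exp(log_w)
    return weights / weights.sum()        # division by Z(x)

levels = [1, 2, 3, 4, 5]
uniform = [0.2] * 5
targets = aso_soft_targets(uniform, levels, target_score=4)
```

With a uniform reference, the resulting distribution peaks at the human score and decays with distance from it, so neighboring levels receive more mass than distant ones. A smaller `lam` sharpens the peak, while a larger one keeps the target closer to the reference distribution.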

5. Benchmarks, Evaluation Metrics, and Empirical Results

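Evaluation in Video AQA employs standard statistical correlation metrics and task-specific accuracy. Here $\hat y_i$ and $y_i$ denote the predicted and ground-truth scores of video $i$, $r_i$ and $s_i$ their respective ranks, and $n$ the number of videos: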

Metric	Definition (LaTeX notation)
Spearman’s ρ	$\rho = 1 - \frac{6\sum_i (r_i - s_i)^2}{n(n^2-1)}$
Pearson’s r	$r = \frac{\sum_i(\hat y_i-\bar{\hat y})(y_i-\bar y)}{\sqrt{\sum_i(\hat y_i-\bar{\hat y})^2\sum_i(y_i-\bar y)^2}}$
MAE	$\frac{1}{n}\sum_i|\hat y_i - y_i|$
Acc@ $\delta$	Fraction of predictions within $\pm\delta$ of ground truth
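
The four metrics in the table can be computed in a few lines of NumPy. This is a minimal sketch with hypothetical function names. The Spearman implementation follows the closed-form expression in the table, which is exact only when there are no tied scores.

```python
import numpy as np

def _ranks(x):
    """Ranks 1..n of the values in x (assumes no ties)."""
    order = np.argsort(x)
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

def srcc(pred, target):
    """Spearman rank correlation via the no-ties closed form."""
    d = _ranks(np.asarray(pred)) - _ranks(np.asarray(target))
    n = len(d)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

def plcc(pred, target):
    """Pearson linear correlation coefficient."""
    p = np.asarray(pred, dtype=float) - np.mean(pred)
    t = np.asarray(target, dtype=float) - np.mean(target)
    return np.sum(p * t) / np.sqrt(np.sum(p ** 2) * np.sum(t ** 2))

def mae(pred, target):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(pred) - np.asarray(target)))

def acc_at_delta(pred, target, delta):
    """Fraction of predictions within +/- delta of the ground truth."""
    return np.mean(np.abs(np.asarray(pred) - np.asarray(target)) <= delta)

pred = [3.1, 4.0, 2.2, 4.8]
target = [3.0, 4.5, 2.0, 4.0]
print(srcc(pred, target))               # 0.8
print(acc_at_delta(pred, target, 0.5))  # 0.75
```

For data with tied scores, a library routine that handles ties through average ranks is the safer choice.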

State-of-the-art models attain high performance:

DOVER++: SRCC/PLCC of 0.8442/0.8537 on DIVIDE-3k, 0.888/0.889 on LSVQ_test, with attribute-specific predictions and reliable disentanglement (Wu et al., 2022).
VADB-Net: Overall SRCC of 0.93 and PLCC of 0.93 on VADB; SRCC > 0.89 on all attribute-level scores (Qiao et al., 29 Oct 2025).
ASO (UltraVQA): Acc@ $\delta$ of 85%, MAE = 0.357, SRCC = 0.824, PLCC = 0.837 on aesthetic quality (Lin et al., 18 Feb 2026).
AIGV-Assessor: Absolute pairwise accuracy improvements ($+6.9\%$ static, $+13.7\%$ temporal) and strong correlation gains for AIGC video, with full spatiotemporal–LMM integration (Wang et al., 2024).
EyeSim-VQA: SRCC of up to 0.919 on both KoNViD-1k and LIVE-VQC, bridging interpretability and biological plausibility with free-energy-inspired “self-repair” (Wang et al., 13 Jun 2025).

6. Current Limitations, Open Problems, and Directions

Despite significant progress, limitations remain:

Representation of Temporal Dynamics: Models underperform on dimensions requiring deep temporal integration (e.g., camera movement, dynamic degree, multi-shot semantics). On leading benchmarks such as VideoAesBench, accuracy stays below 60–65% on complex multi-choice or open-ended tasks involving temporal reasoning (Li et al., 29 Jan 2026).
Cross-Domain and Cross-Cultural Generalization: Most existing models and datasets are trained on UGC or professionally produced content with limited diversity; performance degrades on highly stylized AIGC or regional genres (Zhang et al., 2024, Qiao et al., 29 Oct 2025).
Alignment with Human Rationales: Even with chain-of-thought supervision, models may misinterpret stylistic conventions (e.g., favoring saturated filters, missing subtle framing decisions) or overfit to dominant annotation styles (Lin et al., 18 Feb 2026, Pu et al., 15 Sep 2025).
Attribute Interoperability: Disentangling and fusing attributes (e.g., how color grading and lighting jointly affect perception) is not yet solved; current systems rely on linear or softmax fusion rather than rich causal reasoning (Pu et al., 15 Sep 2025).
Evaluation of Multimodal and LMM-based Systems: Zero-shot, prompt-based protocols expose significant blind spots, especially on open-ended, cross-category reasoning or highly creative aesthetic tasks (Li et al., 29 Jan 2026).

Potential improvements include expanding annotation protocols to cover semantic storytelling, finer-grained emotion, or narrative impact; developing culture- or domain-adapted models; and building hierarchical architectures that fuse local and global, spatial and temporal, and technical and artistic cues (Pu et al., 15 Sep 2025, Zhang et al., 2024). Analytical objectives such as ASO offer stable, closed-form solutions for aligning models with noisy human-annotated scores.

7. Applications and Future Developments

Video AQA underpins applications in automated editing, content recommendation, AI-generated video synthesis, and robotic/cinematic planning:

Content selection and curation: Automatic recommendation and search pipelines rely on AQA modules to rank and filter content based on predicted appeal (Wu et al., 2022, Qiao et al., 29 Oct 2025).
AIGC optimization: Multi-dimensional AQA guides generation pipelines to enhance specific attributes (e.g., composition, motion smoothness) according to target application (Zhang et al., 2024, Wang et al., 2024).
Cinematic robotics: Aesthetic-based path planning and professional segment detection in UAV footage improve automated cinematography (Kuang et al., 2020).
Explainability: Attribute and rationale outputs enable human-in-the-loop adjustments and facilitate critique and iterative refinement (Pu et al., 15 Sep 2025, Qiao et al., 29 Oct 2025).
Benchmarking and evaluation of LMMs: VideoAesBench and similar benchmarks provide standardized, multi-format testing of emerging vision-language and multimodal systems (Li et al., 29 Jan 2026).

Forthcoming research is likely to focus on end-to-end multimodal integration, extension to continuous score distributions, richer semantic embeddings for abstract attributes, and broader, cross-cultural annotation efforts to close the gap between human aesthetic judgment and automated video understanding.