UniPercept-Bench: Perceptual Image Benchmark
- UniPercept-Bench is a comprehensive evaluation suite for perceptual image understanding, spanning aesthetics, quality, and texture assessments.
- It employs a hierarchical taxonomy with domains IAA, IQA, and ISTA, backed by large-scale curated datasets and explicit mathematical metrics.
- The benchmark integrates unified visual rating and VQA tasks via a strong multimodal baseline (UniPercept), yielding significant performance gains over specialized models.
UniPercept-Bench is a comprehensive benchmark and evaluation suite designed for perceptual-level image understanding, focusing on three major domains: Image Aesthetics Assessment (IAA), Image Quality Assessment (IQA), and Image Structure and Texture Assessment (ISTA). It introduces a hierarchical definition system, large-scale curated datasets, unified evaluation tasks, precise mathematical metrics, and a strong multimodal baseline (UniPercept) to advance the capability of multimodal LLMs (MLLMs) in granular perceptual reasoning, visual rating, and reward modeling (Cao et al., 25 Dec 2025).
1. Hierarchical Definition System
UniPercept-Bench structures perceptual-level image understanding into a three-tier hierarchy: Domain, Category, and Criterion.
- Domains:
- IAA: Concerns holistic, subjective judgments such as visual appeal, composition, style, emotion, narrative, and originality.
- IQA: Encompasses both objective and subjective evaluation of technical fidelity, including noise, blur, compression artifacts, exposure, and color naturalness.
- ISTA: Emphasizes more objective assessment of local texture patterns, material representation, 2D contours, 3D volumetric form, and semantic style.
Each domain is further subdivided:
| Domain | Category Example | Criterion Examples |
|---|---|---|
| IAA | Composition & Design | Visual Balance, Structural Organization, Rhythm |
| IQA | Distortion Location, Severity | Location Description, Severity Level, Type |
| ISTA | Physical Structure, Material | Base Morphology, Material Class, Volumetric Form |
The full taxonomy is defined in the paper’s Appendix, with per-category and per-criterion distributions calibrated to maximize intra-domain coverage and prevent bias towards specific feature groups (Cao et al., 25 Dec 2025).
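To make the hierarchy concrete, the sketch below encodes a small, hypothetical excerpt of the Domain → Category → Criterion taxonomy as a nested mapping; the entries are drawn from the example table above, not from the full taxonomy in the paper's Appendix.

```python
# Hypothetical excerpt of the Domain -> Category -> Criterion hierarchy.
# Only a few entries from the example table are shown; the complete taxonomy
# is defined in the paper's Appendix.
TAXONOMY = {
    "IAA": {
        "Composition & Design": ["Visual Balance", "Structural Organization", "Rhythm"],
    },
    "IQA": {
        "Distortion": ["Location Description", "Severity Level", "Type"],
    },
    "ISTA": {
        "Physical Structure": ["Base Morphology", "Volumetric Form"],
        "Material": ["Material Class"],
    },
}

def criteria_for(domain: str) -> list[str]:
    """Flatten all criteria under a given domain."""
    return [c for criteria in TAXONOMY.get(domain, {}).values() for c in criteria]

if __name__ == "__main__":
    print(criteria_for("ISTA"))  # ['Base Morphology', 'Volumetric Form', 'Material Class']
```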
2. Mathematical Metrics
UniPercept-Bench introduces explicit quantitative metrics to operationalize perceptual-level evaluation, specifically for ISTA:
- Texture Intensity Mapping: assigns an intensity weight to each base-morphology term used to describe a texture.
- Component-Level ISTA Score: scores each annotated image component, weighted by its texture intensity.
- Image-Level ISTA Score: aggregates component-level scores over the image's full component set.
These formulations provide a structured, reproducible basis for perceptual-level benchmarking across annotated datasets (Cao et al., 25 Dec 2025).
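Because the exact formulas are not reproduced in this summary, the following is only a minimal sketch of how a texture-intensity-weighted ISTA score could be assembled; the intensity map, the component-level weighting, and the mean aggregation are illustrative assumptions rather than the paper's definitions.

```python
# Minimal sketch (not the paper's exact formulas): a texture-intensity
# mapping assigns a weight to each base-morphology term, component-level
# scores are weighted by that intensity, and the image-level score
# aggregates over the image's components. The mapping values and the
# mean aggregation below are illustrative assumptions.
TEXTURE_INTENSITY = {            # hypothetical base-morphology -> weight map
    "smooth": 0.2,
    "granular": 0.6,
    "fibrous": 0.9,
}

def component_ista_score(morphology: str, raw_score: float) -> float:
    """Weight a component's raw assessment by its texture intensity."""
    return TEXTURE_INTENSITY.get(morphology, 0.5) * raw_score

def image_ista_score(components: list[tuple[str, float]]) -> float:
    """Aggregate component-level scores over the component set (mean here)."""
    scores = [component_ista_score(m, s) for m, s in components]
    return sum(scores) / len(scores) if scores else 0.0
```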
3. Dataset Construction
UniPercept-Bench is composed of extensively curated and annotated datasets for model training and evaluation across its three core tasks.
- Domain-Adaptive Pre-Training Corpus (~800K samples):
- IAA text-QA: APDDv2, Impressions, AVA, TAD66K, FLICKR-AES (~360K pairs)
- IAA rating: ArtiMuse-10K (~9K images, 0–100 scores)
- IQA text-QA: Q-Ground-100K, DepictQA v1/v2, SPAQ, KADID, PIPAL (~380K pairs)
- IQA rating: KonIQ-10K (~7K)
- ISTA text-QA: DTD, FMD, Reachspaces, Scene Size & Clutter, Flickr2K, LSDIR (~40K)
- ISTA annotations: GPT-4o with expert refinement (~40K)
- Visual Rating (VR) Datasets: ArtiMuse-10K (IAA), KonIQ-10K (IQA), ISTA-10K (ISTA)
- Visual Question Answering (VQA) Dataset: ~30K training, ~6K test pairs, covering 44 QA categories across domains
QA data generation follows a three-step pipeline: (1) MLLM-driven annotation and template-based QA generation, (2) rejection sampling adjudicated by Qwen-2.5-VL, and (3) human expert refinement. All domains are represented in approximately equal proportions (~33% each), and intra-domain category/criterion distributions are balanced for comprehensive coverage (Cao et al., 25 Dec 2025).
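A minimal sketch of this three-step pipeline is shown below; the `generate_qa`, `adjudicate`, and `expert_review` callables stand in for the MLLM annotator, the Qwen-2.5-VL judge, and human refinement, and their interfaces are assumptions made for illustration.

```python
# Skeleton of the three-step QA generation pipeline described above.
# The callables are placeholders; their signatures are assumptions.
from typing import Callable, Iterable

def build_qa_dataset(
    images: Iterable[str],
    generate_qa: Callable[[str], list[dict]],   # step 1: MLLM / template QA generation
    adjudicate: Callable[[str, dict], bool],    # step 2: rejection sampling via a judge model
    expert_review: Callable[[str, dict], dict], # step 3: human expert refinement
) -> list[dict]:
    dataset = []
    for image in images:
        for qa in generate_qa(image):
            if not adjudicate(image, qa):       # discard low-quality candidates
                continue
            dataset.append(expert_review(image, qa))
    return dataset
```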
4. Evaluation Protocols and Task Structure
UniPercept-Bench evaluates models on two distinct but complementary tasks:
- Visual Rating (VR): Models predict a continuous perceptual-level score per image for each domain.
- Visual Question Answering (VQA): Models answer multi-choice or open-ended questions aligned with the taxonomy.
Primary metrics:
| Task | Metric(s) |
|---|---|
| VR | Spearman’s rank correlation (SRCC) and Pearson’s linear correlation (PLCC) |
| VQA | Accuracy (held-out test questions) |
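For reference, the VR and VQA metrics can be computed with standard SciPy routines, as in the sketch below (variable names are illustrative).

```python
# Computing the primary metrics: SRCC/PLCC for visual rating and
# simple accuracy for VQA.
from scipy.stats import spearmanr, pearsonr

def vr_metrics(predicted_scores, ground_truth_scores):
    srcc, _ = spearmanr(predicted_scores, ground_truth_scores)
    plcc, _ = pearsonr(predicted_scores, ground_truth_scores)
    return srcc, plcc

def vqa_accuracy(predicted_answers, ground_truth_answers):
    correct = sum(p == g for p, g in zip(predicted_answers, ground_truth_answers))
    return correct / len(ground_truth_answers)
```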
Reward structures for reinforcement learning include a binary correctness reward for VQA and an Adaptive Gaussian Soft Reward for VR, in which the reward decays smoothly as the predicted rating deviates from the ground-truth score.
Policy optimization is accomplished using a clipped PPO-style gradient (termed GRPO) across both VR and VQA (Cao et al., 25 Dec 2025).
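A hedged sketch of the two reward shapes is given below; the Gaussian width `sigma` stands in for the paper's empirically set adaptivity parameter, and the exact adaptation rule is not specified in this summary.

```python
import math

def vqa_reward(predicted: str, ground_truth: str) -> float:
    """Binary reward: 1 for a correct answer, 0 otherwise."""
    return 1.0 if predicted.strip() == ground_truth.strip() else 0.0

def vr_soft_reward(pred: float, gt: float, sigma: float = 5.0) -> float:
    """Gaussian soft reward: decays smoothly with the rating error.

    `sigma` is a stand-in for the adaptivity parameter; its per-sample
    adaptation is an open detail here.
    """
    return math.exp(-((pred - gt) ** 2) / (2.0 * sigma ** 2))
```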
5. UniPercept Baseline Model
UniPercept provides a unified baseline built on an InternVL3-8B backbone (a vision encoder paired with a mid-sized LLM). Its two-stage training pipeline is:
- Stage 1: Domain-adaptive pre-training using supervised cross-entropy loss on all generated textual and rating targets.
- Stage 2: Task-aligned reinforcement learning (GRPO policy optimization) using the task-specific reward formulations above, jointly optimizing for VR and VQA.
Training employed 16 × A100 GPUs, a batch size of 128, and 2 epochs per stage, with GRPO hyperparameters (number of rollouts per prompt, clipping threshold, and an empirically set adaptivity parameter for the soft reward) as specified in the paper. The model is trained to generalize simultaneously across evaluations of technical quality, aesthetic judgment, and structure/texture assessment (Cao et al., 25 Dec 2025).
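The following is a minimal sketch of a clipped PPO-style (GRPO-like) loss over a group of rollouts; the group-normalized advantage follows common GRPO descriptions and may differ from the paper's exact formulation.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, clip_eps: float = 0.2):
    """Clipped PPO-style objective over a group of rollouts for one prompt.

    Advantages are group-normalized rewards (a common GRPO choice);
    UniPercept's exact formulation may differ.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)                        # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # maximize the clipped objective
```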
6. Performance Results
- Visual Rating (VR): Across all domains, UniPercept achieves significantly higher Spearman/Pearson correlations compared to next-best specialized and generalist models:
| Domain | UniPercept (SRCC/PLCC) | Next-Best Specialized (SRCC/PLCC) |
|---|---|---|
| IAA | 0.590 / 0.586 | ~0.398 / 0.395 |
| IQA | 0.824 / 0.827 | 0.753 / 0.750 |
| ISTA | 0.778 / 0.767 | 0.262 / 0.345 |
- Visual Question Answering (VQA) Accuracy:
| Domain | UniPercept | Next-Best Generalist | Random Baseline |
|---|---|---|---|
| IAA | 76.55% | 68.28% | 23.9% |
| IQA | 81.07% | 72.15% | 23.2% |
| ISTA | 84.23% | 81.13% | 28.0% |
Ablation studies confirm that both domain-adaptive pre-training and the adaptive soft reward function are critical for stable performance: removing them reduces VR correlations from 0.590/0.586 to 0.481/0.421 (SRCC/PLCC) and VQA accuracy from 80.62% to 74.75%. Joint optimization across VR and VQA yields mutual benefit; restricting training to single-task or single-domain objectives noticeably degrades generalization (Cao et al., 25 Dec 2025).
7. Applications and Future Prospects
UniPercept-Bench’s baseline can serve as a plug-and-play reward model in text-to-image (T2I) generative pipelines. Integrated with the FLUX.1-dev backbone and a Flow-GRPO finetuning pipeline, UniPercept provides three independent reward signals (IAA, IQA, ISTA) or a unified reward for guiding generative models. Empirical results show that using the benchmark’s perceptual-level rewards improves external metrics (e.g., ArtiMuse, DeQA, ISTA ratings), with combined rewards yielding the most balanced performance (Cao et al., 25 Dec 2025).
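One simple way such per-domain reward signals could be combined into a unified reward is a weighted sum, as in the sketch below; the equal weighting is an assumption, not the paper's recipe.

```python
def unified_reward(iaa: float, iqa: float, ista: float,
                   weights: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> float:
    """Combine per-domain perceptual rewards into a single scalar.

    Equal weighting is an illustrative default; the combination scheme
    actually used for generator finetuning may differ.
    """
    w_iaa, w_iqa, w_ista = weights
    return w_iaa * iaa + w_iqa * iqa + w_ista * ista
```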
Limitations identified include dataset scale (smaller than many semantic benchmarks), inherent subjectivity in IAA, dependency of ISTA annotation quality on hybrid MLLM+human pipelines, and the lack of reported statistical significance for model differences. Future directions include scaling datasets, adding further perceptual dimensions, incorporating RLHF for subjective tasks, and unifying perceptual and semantic benchmarks for end-to-end model training.
The introduction of UniPercept-Bench and its strong baseline establishes a rigorous, extensible foundation for unified perceptual-level multimodal image understanding, positioning perceptual features as central evaluation criteria for future MLLM and T2I systems (Cao et al., 25 Dec 2025).