FinPercep-RM: Fine-Grained Perceptual Rewards
- The paper introduces FinPercep-RM, which replaces coarse global rewards with fine-grained, localized signals to enhance task-specific alignment in vision and multimodal domains.
- It employs both deterministic formulas and encoder–decoder architectures, integrating joint vision–language models to generate dense reward maps for tasks like detection, classification, and super-resolution.
- Empirical validation shows significant gains, such as a 24.3% accuracy increase in fine-grained classification and improved mAP in detection, underscoring its impact on reinforcement learning performance.
A Fine-grained Perceptual Reward Model (FinPercep-RM) is a specialized framework that provides dense reward signals, either deterministic or learned, reflecting quantitative perceptual properties such as spatial alignment, local defects, temporal transients, and human ordinal judgments in vision, video, and generative modeling tasks. FinPercep-RM paradigms are designed to replace global or binary feedback with highly task-specific, local, or ordinal signals, supporting reinforcement learning and direct policy optimization for large multimodal and generative models in low-data regimes.
1. Fundamental Concepts and Task-specific Reward Functions
FinPercep-RM implementations span multiple domains including vision-language reasoning, real-world super-resolution, ordinal human feedback modeling, multimodal video understanding, and unified perceptual image analysis. Core approaches are either rule-based (deterministic formulas driven by ground-truth alignment) or learned (encoder–decoder architectures yielding dense local maps or scores).
Visual detection/classification/grounding tasks adopt deterministic, computable rewards:
- Object Detection: For an input (image + question), a response consisting of predicted bounding boxes $\{b_i\}$ with confidences $\{c_i\}$, and ground-truth boxes $\{\hat{b}_j\}$, the detection reward is
$$R_{\text{det}} = R_{\text{IoU}} + R_{\text{conf}} + R_{\text{format}},$$
where $R_{\text{IoU}}$ is the average IoU of matched boxes above a threshold $\tau$, $R_{\text{conf}}$ penalizes false positives via their confidence, and $R_{\text{format}}$ checks the output schema (Liu et al., 3 Mar 2025).
- Fine-grained Classification:
$$R_{\text{cls}} = R_{\text{acc}} + R_{\text{format}},$$
with $R_{\text{acc}} \in \{0, 1\}$ indicating correct class prediction and $R_{\text{format}}$ schema adherence (Liu et al., 3 Mar 2025).
- Reasoning grounding reuses the IoU-based $R_{\text{det}}$, validating spatial referent localization (Liu et al., 3 Mar 2025). A minimal code sketch of these deterministic rewards follows this list.
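The sketch below illustrates these deterministic rewards under simplifying assumptions: greedy IoU matching, a default threshold of 0.5, an equal-weight additive combination, and an R1-style `<think>/<answer>` output schema. The function names and weighting are illustrative, not the published implementation.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def format_reward(response: str) -> float:
    """1 if the response follows the assumed <think>...</think><answer>...</answer> schema, else 0."""
    return 1.0 if re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.S) else 0.0

def detection_reward(pred: List[Tuple[Box, float]], gt: List[Box],
                     response: str, tau: float = 0.5) -> float:
    """Illustrative R_det = R_IoU + R_conf + R_format with greedy matching."""
    matched_ious, conf_terms = [], []
    for box, conf in pred:
        best = max((iou(box, g) for g in gt), default=0.0)
        if best >= tau:                      # matched prediction: reward its IoU, trust its confidence
            matched_ious.append(best)
            conf_terms.append(conf)
        else:                                # false positive: penalized in proportion to confidence
            conf_terms.append(1.0 - conf)
    r_iou = sum(matched_ious) / max(len(matched_ious), 1)
    r_conf = sum(conf_terms) / max(len(conf_terms), 1)
    return r_iou + r_conf + format_reward(response)

def classification_reward(pred_label: str, gt_label: str, response: str) -> float:
    """Illustrative R_cls = R_acc + R_format."""
    r_acc = 1.0 if pred_label.strip().lower() == gt_label.strip().lower() else 0.0
    return r_acc + format_reward(response)
```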
Ordinal preference modeling generalizes binary comparison to graded feedback sets, for example multiple preference levels mapped to pairwise-preference probabilities in $[0,1]$, supporting soft, probabilistic annotation (a toy mapping is sketched below). The reward model is learned to minimize cross-entropy or hinge losses reflecting the modeled probabilities of pairwise preference, strictly reducing Rademacher complexity and improving sample efficiency (Liu et al., 2024).
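As a toy illustration, ordinal annotation phrases can be anchored to pairwise-preference probabilities. The specific labels and probability values below are hypothetical and not the paper's annotation scheme.

```python
# Hypothetical mapping of ordinal annotation phrases to preference probabilities
# P(y1 preferred over y2); in practice the anchors are calibrated so that the
# soft labels remain marginally unbiased (Liu et al., 2024).
ORDINAL_FEEDBACK = {
    "y1 much better":     0.95,
    "y1 slightly better": 0.70,
    "tie":                0.50,
    "y2 slightly better": 0.30,
    "y2 much better":     0.05,
}

def preference_probability(annotation: str) -> float:
    """Convert an ordinal annotation into a soft pairwise-preference target."""
    return ORDINAL_FEEDBACK[annotation]
```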
Image Super-Resolution (ISR) employs:
- Encoder–Decoder reward architecture yielding a dense, spatially resolved reward map
$$M = D(E(x)) \in \mathbb{R}^{H \times W},$$
where the encoder $E$ extracts no-reference IQA features from the super-resolved image $x$ and the decoder $D$ fuses and upsamples them into per-pixel perceptual scores (Liu et al., 27 Dec 2025). Training uses dense map reconstruction, triplet ranking, and anchor alignment losses; a minimal architectural sketch follows.
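A minimal PyTorch sketch of such an encoder–decoder reward head is shown below. The backbone depth, channel widths, and mean pooling are assumptions for illustration and do not reproduce the published architecture.

```python
import torch
import torch.nn as nn

class DenseRewardModel(nn.Module):
    """Toy encoder-decoder mapping an SR image to a per-pixel reward map
    plus a pooled global score (illustrative, not the published design)."""

    def __init__(self, in_ch: int = 3, width: int = 64):
        super().__init__()
        # Encoder: extracts no-reference quality features at reduced resolution.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: upsamples/fuses features back into a fine-grained spatial map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(width, 1, 4, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor):
        feats = self.encoder(x)
        reward_map = torch.sigmoid(self.decoder(feats))   # dense map in [0, 1]
        global_score = reward_map.mean(dim=(1, 2, 3))     # pooled scalar for ranking losses
        return reward_map, global_score

# Usage: map a batch of super-resolved images to dense rewards and global scores.
sr_batch = torch.rand(2, 3, 128, 128)
reward_map, score = DenseRewardModel()(sr_batch)
```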
Video Perceptual Modeling integrates:
- Comparative reward quantifying the improvement in response quality on the original versus a degraded video,
$$R_{\text{comp}} = Q(y \mid v) - Q(y \mid \tilde{v}),$$
where $Q(y \mid \cdot)$ scores the response $y$ conditioned on the original video $v$ or its degraded counterpart $\tilde{v}$, with auxiliary contrastive losses aligning visual and textual representations to boost sensitivity to transient events (Zhao et al., 24 Nov 2025). A hedged sketch of this comparison follows.
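The sketch below spells out the comparison. The `score_response` callable is an assumption standing in for whatever quality scorer the model exposes (e.g., answer log-likelihood or an entailment head), not a documented interface.

```python
import torch

@torch.no_grad()
def comparative_reward(score_response, question: str, answer: str,
                       video_full: torch.Tensor, video_degraded: torch.Tensor) -> float:
    """R_comp = Q(answer | full video) - Q(answer | degraded video).

    `score_response(video, question, answer)` is assumed to return a scalar
    quality score (higher = better-grounded answer); a positive R_comp means
    the answer genuinely depends on the transient content removed by degradation.
    """
    q_full = score_response(video_full, question, answer)
    q_degraded = score_response(video_degraded, question, answer)
    return float(q_full - q_degraded)
```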
Unified Perceptual-level Modeling (UniPercept) defines domains (Image Aesthetics Assessment—IAA, Image Quality Assessment—IQA, Image Structure & Texture Assessment—ISTA) and decomposes categories into interpretable sub-criteria; reward heads produce scalar scores (VR) or discrete judgments (VQA) (Cao et al., 25 Dec 2025).
2. Architectural Strategies and Implementation
FinPercep-RM architectures vary by domain, but canonical approaches are:
- Encoder–Decoder: Used in ISR, where the encoder $E$ extracts multi-scale no-reference IQA features and the decoder $D$ upsamples and fuses them to output fine-grained spatial maps.
- Joint Vision–LLMs: InternVL3–8B and Qwen2-VL serve as backbones in UniPercept and Visual-RFT, with perceptual heads attached for scalar or categorical outputs. Cross-modal modules (Q-Former/token projection) align visual and textual streams.
- Quality Comparator Heads: VideoPerceiver adds an entailment or exact-match head for reward computation, operating on interleaved visual/textual tokens (Zhao et al., 24 Nov 2025).
- Prompt Engineering: Structured output schemas (e.g., <think>…</think><answer>…</answer>) and explicit reasoning interfaces are enforced for robust parsing and reward extraction (Liu et al., 3 Mar 2025); a parsing sketch follows this list.
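A small parsing sketch, assuming the R1-style <think>/<answer> schema above; the regex and the fallback behaviour (returning `None` so the format reward is zero) are illustrative choices.

```python
import re
from typing import Optional

def parse_structured_output(response: str) -> Optional[dict]:
    """Extract reasoning and answer fields; return None if the schema is violated,
    in which case the format reward is 0 and task-specific rewards are skipped."""
    m = re.search(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", response, re.S)
    if m is None:
        return None
    return {"reasoning": m.group(1).strip(), "answer": m.group(2).strip()}
```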
Pseudocode for RL integration often follows Group Relative Policy Optimization (GRPO) recipes with batch sampling, reward computation, relative normalization, KL constraints, and reference policy updates. Curriculum learning frameworks co-evolve the reward model and generator, phasing from global to fine-grained feedback for stability (Liu et al., 27 Dec 2025).
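A compact GRPO-style sketch of such a recipe is given below. The `policy.sample` / `policy.log_prob` interfaces, the simple KL surrogate, and the absence of ratio clipping are assumptions for brevity; real implementations vary in these details.

```python
import torch

def grpo_step(policy, ref_policy, optimizer, prompts, reward_fn,
              group_size: int = 8, kl_coef: float = 0.04):
    """One GRPO-style update: sample a group of responses per prompt, normalize
    rewards within each group into advantages, and penalize drift from the
    reference policy. `policy.sample`/`log_prob` are assumed interfaces."""
    losses = []
    for prompt in prompts:
        samples = [policy.sample(prompt) for _ in range(group_size)]
        rewards = torch.tensor([reward_fn(prompt, s) for s in samples])
        # Group-relative advantage: (r - mean) / std within the sampled group.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        for s, a in zip(samples, adv):
            logp = policy.log_prob(prompt, s)            # differentiable log-likelihood
            ref_logp = ref_policy.log_prob(prompt, s).detach()
            kl = logp - ref_logp                         # simple KL surrogate
            losses.append(-(a * logp) + kl_coef * kl)    # maximize advantage-weighted logp
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```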
3. Learning Objectives and Optimization Algorithms
FinPercep-RM supports several learning paradigms:
- Deterministic reward application: Direct formulaic computation for detection/classification/grounding rewards (Liu et al., 3 Mar 2025).
- Generalized Bradley–Terry cross-entropy / hinge losses: For ordinal feedback on a pair $(y_1, y_2)$ with annotated preference probability $p \in [0,1]$, the cross-entropy form is
$$\mathcal{L} = -\,p \log \sigma\big(r(x, y_1) - r(x, y_2)\big) - (1 - p) \log \sigma\big(r(x, y_2) - r(x, y_1)\big),$$
with $\sigma$ the logistic sigmoid and $r$ the learned reward (Liu et al., 2024). A code sketch of this soft-label objective appears after this list.
- Triplet ranking and dense reconstruction: For image super-resolution, objectives combine map L1 errors, ranking constraints on global scores, and alignment of ground-truth scales (Liu et al., 27 Dec 2025).
- Contrastive losses: Video models employ InfoNCE losses between full, dropout, and degraded video/text embeddings to focus on fine-grained temporal cues (Zhao et al., 24 Nov 2025).
- Adaptive reward aggregation: UniPercept employs Gaussian soft rewards for scalar ratings and discrete indicators for accuracy, and aggregates domain-specific scores at generation time (Cao et al., 25 Dec 2025).
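The sketch below illustrates two of these objectives: the generalized Bradley–Terry cross-entropy with soft preference targets, and a Gaussian soft reward for scalar ratings. The Gaussian bandwidth `sigma` and the element-wise formulation are assumptions, not the published hyperparameters.

```python
import torch
import torch.nn.functional as F

def ordinal_bt_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor,
                    p: torch.Tensor) -> torch.Tensor:
    """Generalized Bradley-Terry cross-entropy with soft targets p = P(y1 > y2).

    With p = 1 this reduces to the standard pairwise preference loss;
    intermediate p encodes ordinal feedback (Liu et al., 2024).
    """
    logits = r_chosen - r_rejected
    return F.binary_cross_entropy_with_logits(logits, p)

def gaussian_soft_reward(pred_score: torch.Tensor, gt_score: torch.Tensor,
                         sigma: float = 0.5) -> torch.Tensor:
    """Soft reward for scalar ratings: 1 when the predicted rating matches the
    reference and decaying smoothly with the gap (sigma is illustrative)."""
    return torch.exp(-((pred_score - gt_score) ** 2) / (2 * sigma ** 2))
```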
Policy optimization typically integrates these rewards via GRPO, PPO, DPO, or REFL, incorporating KL penalties and advantage normalization. Curriculum learning stages gradually introduce fine-grained reward signals to stabilize RL convergence (Liu et al., 27 Dec 2025).
4. Datasets and Ground-Truth Construction
FinPercep-RM training requires datasets with rich, localized, or ordinal annotations:
- FGR-30k (ISR): Synthesized by swapping SR and ground-truth (GT) regions under diverse masks, with ground-truth reward maps fused from pixel- and feature-level differences (DINOv3 features) (Liu et al., 27 Dec 2025); a construction sketch follows this list.
- VideoPerceiver-80K: Clips from HMDB51, CelebV-HQ, MM-AU capturing transient events (<1 s, ≤5% total frames), annotated with dense captions, QA pairs, and temporal groundings (Zhao et al., 24 Nov 2025).
- UniPercept-Bench: Hierarchical domains/categories/criteria spanning VR and VQA, sourced from ArtiMuse-10K, KonIQ-10K, AVA, ISTA-10K, and other image benchmarks, with taxonomy for aesthetics, quality, and structure/texture ratings (Cao et al., 25 Dec 2025).
- Ordinal feedback datasets: Human preference data mapped to numerical scales, collected with annotation guidance that anchors textual phrases to probabilities so the labels remain marginally unbiased (Liu et al., 2024).
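The sketch below illustrates the FGR-30k-style construction noted above: ground-truth regions are pasted into the SR output under a random mask, and the supervision map fuses pixel-level and feature-level differences. The block-random mask, the fusion weight `alpha`, and the `feature_extractor` callable (standing in for frozen DINOv3 features) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def build_reward_map_sample(sr_img: torch.Tensor, gt_img: torch.Tensor,
                            feature_extractor, alpha: float = 0.5):
    """Create one training sample: a region-swapped input and its dense target map.

    sr_img, gt_img: (1, 3, H, W) tensors; `feature_extractor` is assumed to map
    an image to (1, C, h, w) features (e.g., a frozen DINOv3 backbone).
    """
    _, _, h, w = sr_img.shape
    # Random block mask selecting which regions keep GT content vs. SR content.
    mask = (torch.rand(1, 1, h // 16, w // 16) > 0.5).float()
    mask = F.interpolate(mask, size=(h, w), mode="nearest")
    mixed = mask * gt_img + (1 - mask) * sr_img            # region-swapped input

    # Pixel-level difference to the ground truth (per-pixel L1, channel-averaged).
    pix_diff = (mixed - gt_img).abs().mean(dim=1, keepdim=True)

    # Feature-level difference, upsampled back to image resolution.
    with torch.no_grad():
        f_mixed, f_gt = feature_extractor(mixed), feature_extractor(gt_img)
    feat_diff = (f_mixed - f_gt).abs().mean(dim=1, keepdim=True)
    feat_diff = F.interpolate(feat_diff, size=(h, w), mode="bilinear", align_corners=False)

    # Fused target: high reward where the output matches GT at both levels.
    target_map = 1.0 - (alpha * pix_diff + (1 - alpha) * feat_diff).clamp(0, 1)
    return mixed, target_map
```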
Such fine-grained or ordinal labels are critical for the reward models to provide high-resolution signals for policy updates.
5. Experimental Validation and Performance Metrics
Empirical results demonstrate substantial improvement over baselines:
| Task / Domain | Metric | SFT Baseline | FinPercep-RM | Gain | Reference |
|---|---|---|---|---|---|
| 1-shot Fine-grained Cls | Accuracy (Qwen2-VL-2B) | 51.7% | 80.3% | +24.3 | (Liu et al., 3 Mar 2025) |
| Few-shot COCO Detect. | mAP | 19.5 | 33.6 | +14.0 | (Liu et al., 3 Mar 2025) |
| Open-vocab COCO Detect. | mAP_n | 13.6 | 31.3 | +17.7 | (Liu et al., 3 Mar 2025) |
| SR (DiffBIR/DrealSR) | MUSIQ | 65.67 | 67.23 | +1.56 | (Liu et al., 27 Dec 2025) |
| SR (RealLQ250/DIT4SR) | MANIQA | 0.615 | 0.658 | +0.043 | (Liu et al., 27 Dec 2025) |
| Video MotionBench | Avg. score | 0.56 | 0.69 | +0.13 | (Zhao et al., 24 Nov 2025) |
| Transient VQA (VRU) | Avg. key event acc. (%) | 52.1 | 73.7 | +21.6 | (Zhao et al., 24 Nov 2025) |
| UniPercept VR (SRCC) | IAA/IQA/ISTA | – | +5–10 (vs. variants) | – | (Cao et al., 25 Dec 2025) |

Ablations confirm the necessity of feature-level supervision, fine-grained curriculum, comparative reward, contrastive pretraining, and hierarchical annotation. The results show that FinPercep-RM markedly improves sample efficiency, robustness in few-shot and out-of-domain scenarios, and alignment with human perceptual judgment.
6. Implications, Limitations, and Future Perspectives
FinPercep-RM advances reinforcement fine-tuning by shifting from global, coarse, or binary reward schemes toward rich, localized, and verifiable feedback. Its broad applicability spans object detection, classification, spatial and temporal grounding, image super-resolution, aesthetic/quality evaluation, and policy learning from human preferences.
Current limitations include the complexity of training and the instability induced by high-variance local rewards, necessitating curriculum learning and hybrid global–local scheduling (Liu et al., 27 Dec 2025). Human-in-the-loop annotation must be carefully calibrated for unbiasedness (Liu et al., 2024), and reward hacking may still occur if curriculum or aggregation is not well-controlled.
A plausible implication is that as multimodal models scale, both deterministic and learned fine-grained perceptual rewards will be essential for domain adaptation, interpretability, and human alignment. Extensions to additional modalities and further optimization of curriculum strategies are likely future research directions.
FinPercep-RM’s principles have already demonstrated notable gains across benchmarks under SFT and reward-driven RL paradigms, underscoring their central role in contemporary model alignment and generalization.