IA-RFT: Identity Aesthetic Reward Fine-Tuning
- IA-RFT is a strategy that fine-tunes generative diffusion models by using composite rewards to preserve identity and optimize aesthetic quality.
- It employs parameter-efficient techniques (e.g., LoRA or adapters) guided by facial similarity metrics and human-preference signals to adjust model outputs.
- Experimental results show enhanced identity fidelity, faster convergence, and improved visual appeal in tasks like text-to-image generation and face retouching.
Identity-Aesthetic Reward Fine-Tuning (IA-RFT) denotes a class of learning strategies for generative models where the fine-tuning objective simultaneously enforces identity preservation and optimizes for aesthetic quality. IA-RFT achieves this by introducing composite reward functions—derived from both facial identity similarity and human aesthetic preference signals—to guide the adaptation of a generative backbone, typically a diffusion model, with parameter-efficient methods such as LoRA or adapters. Key works such as ID-Aligner (Chen et al., 2024) and BeautyGRPO (Yang et al., 1 Mar 2026) have formalized and demonstrated IA-RFT’s effectiveness in text-to-image generation and face retouching, respectively.
1. Formulation of the IA-RFT Objective
IA-RFT begins with a pre-trained generative diffusion model (commonly a UNet with a VAE encoder/decoder) and fine-tunes only a restricted set of weights $w$ (e.g., LoRA layers or adapter blocks). The dataset comprises text prompts $c$ and one or a few reference identity images. The goal is to tune $w$ such that images sampled under a prompt $c$ display both high-fidelity identity retention with respect to the reference images and high aesthetic appeal.
For text-to-image, the fine-tuning loop denoises a Gaussian latent $x_T$ for $T - t$ steps without gradients (down to a randomly chosen step $t$), then takes one further step with gradient tracking to yield a predicted clean latent $x'_0$, which is VAE-decoded to an image. Two key reward functions are defined:
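- Identity-consistency reward: the cosine similarity between the face embeddings of the generated image and the reference image, $r_{\mathrm{sim}} = \cos(e', e_{\mathrm{ref}})$, with loss $L_{\mathrm{id\_sim}} = 1 - r_{\mathrm{sim}}$.
- Identity-aesthetic reward: the sum of an appeal reward and a structure reward, with loss $L_{\mathrm{id\_aes}} = -(r_{\mathrm{appeal}} + r_{\mathrm{struct}})$.
Loss formulation: The total fine-tuning loss is $L_{\mathrm{total}} = \alpha_1 L_{\mathrm{id\_sim}} + \alpha_2 L_{\mathrm{id\_aes}}$, where $\alpha_1$ and $\alpha_2$ weight the two terms.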
For LoRA-based fine-tuning, a standard denoising MSE term is also included.
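The composite loss can be sketched in a few lines of Python. This is a minimal sketch that follows the terms used in the Adapter pseudocode of Section 3 (identity loss as one minus cosine similarity, aesthetic loss as the negated sum of the appeal and structure rewards). The function names, the default weights of 1.0, and the toy inputs are assumptions made for illustration, and the extra denoising MSE term of the LoRA setting is left out.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def ia_rft_loss(emb_gen, emb_ref, r_appeal, r_struct, alpha_sim=1.0, alpha_aes=1.0):
    """Weighted sum of the identity loss and the aesthetic loss.

    alpha_sim and alpha_aes are placeholder weights (assumed values),
    not the settings used in ID-Aligner.
    """
    loss_id_sim = 1.0 - cosine_similarity(emb_gen, emb_ref)
    loss_id_aes = -(r_appeal + r_struct)
    return alpha_sim * loss_id_sim + alpha_aes * loss_id_aes

# Toy inputs (hypothetical): identical embeddings give zero identity loss,
# so only the negated rewards remain.
loss = ia_rft_loss([1.0, 0.0], [1.0, 0.0], r_appeal=0.5, r_struct=0.25)
```

In a real pipeline the embeddings and rewards would be differentiable tensors so that the loss can be backpropagated into the trainable weights; plain floats are used here only to keep the arithmetic visible.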
For face retouching, BeautyGRPO adopts a reinforcement learning (RL) paradigm using a Markov Decision Process (MDP) over the generative sampling trajectory, where reward is computed by a fine-grained aesthetic/identity reward model only at the final output (Yang et al., 1 Mar 2026).
2. Reward Design: Identity and Aesthetic Components
IA-RFT crucially depends on reward models that can encode nuanced perceptual and identity signals.
In ID-Aligner (Chen et al., 2024):
- The appeal reward model is initialized from a pretrained ImageReward model and further fine-tuned on a dataset of human-judged (prompt, image, image) triplets using a pairwise logistic loss. This sub-reward measures the overall human-preferred appeal under the textual context.
- The structure reward model is trained on a dataset constructed from real face+body images as positives and synthetic, structure-perturbed variants as negatives. This sub-reward explicitly penalizes structurally implausible (e.g., anatomically distorted) generations.
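The pairwise logistic loss mentioned for the appeal reward can be illustrated as follows. This is a sketch of the standard Bradley-Terry style objective, $-\log \sigma(s_w - s_l)$, where $s_w$ and $s_l$ are the reward scores of the preferred and rejected image; whether ID-Aligner adds scaling or regularization on top of this form is not stated here, and the function name is hypothetical.

```python
import math

def pairwise_logistic_loss(score_preferred, score_rejected):
    """-log(sigmoid(s_w - s_l)), written as a numerically stable softplus."""
    margin = score_preferred - score_rejected
    if margin >= 0:
        # log(1 + exp(-margin)) is safe when margin is non-negative
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp(-margin)
    return -margin + math.log1p(math.exp(margin))

# The loss shrinks as the preferred image is scored further above the rejected one.
loss_good = pairwise_logistic_loss(2.0, -1.0)
loss_bad = pairwise_logistic_loss(-1.0, 2.0)
```

Minimizing this loss over many judged pairs pushes the reward model to rank the human-preferred image above the other one for the same prompt.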
In BeautyGRPO (Yang et al., 1 Mar 2026):
- The reward model ingests the input–output pair (the original face and its retouched result), extracting features via a ViT/CLIP backbone augmented with a fixed ArcFace/FaceNet embedding to increase identity sensitivity. The final reward is a scalar, with an optional chain-of-thought block for per-dimension reasoning across “SkinSmoothing,” “BlemishRemoval,” “TextureQuality,” “Clarity,” and “IdentityPreservation.”
- Rewards are learned on a human- and VLM-annotated dataset (FRPref-10K), using structured instruction tuning and preference optimization (DPO or GRPO variants).
3. Fine-Tuning Pipeline and Algorithmic Integration
IA-RFT operates in a reward-weighted gradient descent regime. In the ID-Aligner framework (Chen et al., 2024), only the LoRA or adapter parameters are trained; all other layers stay frozen. In each iteration, gradients are backpropagated through a single denoising step only. The combined reward-driven loss steers both identity retention and aesthetic improvement.
Pseudocode for the Adapter setting is as follows:

    initialize Adapter weights w₀
    for each iteration i = 0…N−1:
        sample (prompt c, ref_face) from D
        x_T ← N(0, I)
        choose t ∈ [T₁, T₂] uniformly
        # denoise from step T down to step t without gradients
        for j = T…t+1:
            x_{j−1} ← UNet_{w_i}(x_j, c)        # no grad
        # one gradient-tracked step
        x_{t−1} ← UNet_{w_i}(x_t, c)            # with grad
        x′₀ ← scheduler.predict_noise_free(x_{t−1}, t)
        img′ ← VAE.decode(x′₀)
        # identity reward
        face_crop′ ← FaceDet(img′)
        emb′ ← FaceEnc(face_crop′)
        emb_ref ← FaceEnc(FaceDet(ref_face))
        r_sim ← cosine_sim(emb′, emb_ref)
        L_id_sim ← 1 − r_sim
        # aesthetic reward
        r_appeal ← RewardNet_appeal(img′, c)
        r_struct ← RewardNet_struct(img′, c)
        L_id_aes ← −(r_appeal + r_struct)
        # combine and backpropagate
        L_total ← α₁·L_id_sim + α₂·L_id_aes
        w_{i+1} ← w_i − η·∇_w L_total

In BeautyGRPO, online RL (GRPO or DPO) directly optimizes the generation policy to maximize the reward at terminal step, with Dynamic Path Guidance (DPG) stabilizing trajectory sampling (Yang et al., 1 Mar 2026).
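For readers unfamiliar with GRPO, its core step is to score each sample relative to the other samples drawn for the same input, rather than against a learned value function. The sketch below shows that group-relative normalization in its commonly used form (reward minus group mean, divided by group standard deviation). The exact normalization and clipping used in BeautyGRPO are not given here, so the function name, the use of the population standard deviation, and the `eps` constant should be read as assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize terminal rewards within one group of sampled trajectories."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumed)
    return [(r - mean) / (std + eps) for r in rewards]

# Three retouched candidates for the same input face, scored by the reward model
# (hypothetical reward values).
adv = group_relative_advantages([0.2, 0.5, 0.8])
```

Candidates scoring above the group mean receive positive advantages and are reinforced; those below the mean are discouraged. Because the reward is computed only at the final output, every step of a trajectory shares the advantage of its terminal image.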
4. Dynamic Path Guidance and Fidelity Constraints
The application of RL to high-fidelity generative models introduces a fidelity-exploration trade-off, as RL’s stochastic exploration may cause drift or artifacts. BeautyGRPO (Yang et al., 1 Mar 2026) addresses this through Dynamic Path Guidance (DPG):
- At each reverse step in the sampler, an anchor-based ODE path is computed targeting a high-preference exemplar from FRPref-10K.
- The noise term for stochastic update is linearly interpolated between standard Gaussian and an anchor-derived value, with interpolation factor annealing from $1$ to $0$, strongly guiding initial steps toward the anchor and relaxing at later stages.
- DPG corrects stochastic drift, enabling exploration for RL credit assignment, but restricts deviation from high-fidelity retouching necessary for realistic face outputs.
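The noise interpolation described above can be sketched as follows. This is an illustrative reading, not the BeautyGRPO implementation: the linear annealing schedule, the helper names, and the way the anchor-derived noise is supplied are all assumptions, and the anchor ODE path itself is not modeled.

```python
import random

def dpg_lambda(step, num_guided_steps):
    """Interpolation factor, annealed linearly from 1 to 0 over the guided steps."""
    if num_guided_steps <= 1:
        return 1.0 if step == 0 else 0.0
    if step >= num_guided_steps:
        return 0.0  # guidance is switched off after the guided steps
    return 1.0 - step / (num_guided_steps - 1)

def guided_noise(anchor_noise, gaussian_noise, lam):
    """Blend anchor-derived noise with standard Gaussian noise, elementwise."""
    return [lam * a + (1.0 - lam) * g for a, g in zip(anchor_noise, gaussian_noise)]

rng = random.Random(0)
anchor = [0.3, -0.1, 0.7]  # hypothetical anchor-derived noise for one latent
schedule = []
for step in range(4):
    eps = [rng.gauss(0.0, 1.0) for _ in anchor]
    lam = dpg_lambda(step, num_guided_steps=3)
    schedule.append(lam)
    noise = guided_noise(anchor, eps, lam)  # would feed the stochastic sampler update
```

With a factor of 1 the update follows the anchor exactly, and with a factor of 0 it reduces to ordinary stochastic sampling, which matches the description of strong early guidance that relaxes in later steps.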
5. Experimental Results and Evaluation
Empirical findings from both ID-Aligner and BeautyGRPO demonstrate the impact of IA-RFT.
| Metric | SD1.5 Adapter (Base/IP-Adapter → ID-Aligner) | SDXL Adapter (Base → ID-Aligner) |
|---|---|---|
| FaceSim (↑) | 0.739 → 0.800 | 0.512 → 0.619 |
| CLIP-I (↑) | 0.684 → 0.727 | 0.541 → 0.602 |
| LAION-Aesthetics (↑) | 5.54 → 5.59 | 5.85 → 5.88 |
| DINO (↑) | 0.586 → 0.606 | 0.497 → 0.499 |

BeautyGRPO is evaluated on FFHQR with a different set of metrics:

| Metric | BeautyGRPO (FFHQR) |
|---|---|
| ArcFace (↑) | 0.952 |
| NIMA (↑) | 5.12 |
| MUSIQ (↑) | 4.91 |
| NIQE (↓) | 10.83 |
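To make the size of the ID-Aligner gains easier to compare across metrics, the before/after pairs from the SD1.5 and SDXL columns of the table above can be converted to relative changes. The snippet below only re-expresses the reported numbers; the dictionary layout and function name are illustrative.

```python
# (base, after ID-Aligner) pairs copied from the table above
results = {
    "SD1.5": {
        "FaceSim": (0.739, 0.800),
        "CLIP-I": (0.684, 0.727),
        "LAION-Aesthetics": (5.54, 5.59),
        "DINO": (0.586, 0.606),
    },
    "SDXL": {
        "FaceSim": (0.512, 0.619),
        "CLIP-I": (0.541, 0.602),
        "LAION-Aesthetics": (5.85, 5.88),
        "DINO": (0.497, 0.499),
    },
}

def relative_gain(before, after):
    """Relative change in percent."""
    return 100.0 * (after - before) / before

for backbone, metrics in results.items():
    for name, (before, after) in metrics.items():
        print(f"{backbone} {name}: {relative_gain(before, after):+.1f}%")
```

The identity metrics move the most (FaceSim rises by roughly 8% on SD1.5 and 21% on SDXL), while LAION-Aesthetics changes by less than 1% on both backbones.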
Additional findings:
- Adding only the identity reward recovers reference likeness but can yield structural artifacts, whereas the full IA-RFT loss additionally corrects limb/structural defects (Chen et al., 2024).
- IA-RFT accelerates LoRA training convergence by 2–3× for similar identity preservation targets.
- User preference studies show superiority of IA-RFT for aesthetic and structure quality, ranking highest in aesthetic votes (33.6%) and competitive in face and text fidelity (Chen et al., 2024).
- In face retouching, BeautyGRPO achieves a 63.25% win rate against all baselines, and ablations show significant drops in objective and subjective scores when the identity branch or the DPG component is removed (Yang et al., 1 Mar 2026).
6. Implementation Details and Practical Considerations
Critical optimization hyperparameters for effective IA-RFT include the learning rate, batch size (typically 32), the total number of update steps, and the reward weighting factors ($\alpha_1$ and $\alpha_2$ in ID-Aligner). For RL-based face retouching (BeautyGRPO), LoRA is used with AdamW, batch sizes of 2–4, and a DPG schedule (three DPG steps per trajectory).
For reward model training in BeautyGRPO:
- Stage 1: Structured Reasoning SFT trains both dimension scores and preference labels.
- Stage 2: Self-training with consistency filtering, where pseudo-labels are filtered on preference correctness and chain-of-thought coherence.
- Stage 3: Preference RL with GRPO objective, optimizing directly for human-valid reasoning and final preference alignment.
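The consistency filtering in Stage 2 can be pictured with a small sketch. The record layout, the `reference` field used to check preference correctness, and the coherence rule (the per-dimension scores in the reasoning must favor the same candidate as the final preference) are all assumptions chosen for illustration; the actual filtering criteria of BeautyGRPO may differ.

```python
DIMENSIONS = (
    "SkinSmoothing", "BlemishRemoval", "TextureQuality",
    "Clarity", "IdentityPreservation",
)

def implied_preference(sample):
    """Candidate favored by the summed per-dimension scores, or None on a tie."""
    total_a = sum(sample["scores_a"][d] for d in DIMENSIONS)
    total_b = sum(sample["scores_b"][d] for d in DIMENSIONS)
    if total_a == total_b:
        return None
    return "A" if total_a > total_b else "B"

def consistency_filter(samples):
    """Keep pseudo-labels that are both correct and internally coherent."""
    kept = []
    for s in samples:
        correct = s["predicted"] == s["reference"]           # assumed correctness check
        coherent = implied_preference(s) == s["predicted"]   # assumed coherence check
        if correct and coherent:
            kept.append(s)
    return kept

def make_sample(predicted, reference, score_a, score_b):
    """Build a toy record with the same score on every dimension."""
    return {
        "predicted": predicted,
        "reference": reference,
        "scores_a": {d: score_a for d in DIMENSIONS},
        "scores_b": {d: score_b for d in DIMENSIONS},
    }

samples = [
    make_sample("A", "A", 4, 2),  # correct and coherent: kept
    make_sample("A", "B", 4, 2),  # wrong preference: dropped
    make_sample("B", "B", 4, 2),  # scores contradict the stated preference: dropped
]
kept = consistency_filter(samples)
```

Only the first record survives, which reflects the intent of the stage: pseudo-labels enter further training only when the final preference and the reasoning behind it agree.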
At inference time, DPG is omitted and standard ODE sampling restores maximal output fidelity.
7. Significance and Limitations
IA-RFT represents a shift from supervised, pixel-level learning toward feedback-driven adaptation guided by composite perceptual and identity-aware rewards. This approach is robust to the diversity of subjective human preferences and outperforms strictly supervised or single-reward fine-tuning, as evidenced by experimental metrics and human studies (Chen et al., 2024; Yang et al., 1 Mar 2026). The modularity of IA-RFT allows integration into LoRA/Adapter architectures and generalization across diffusion backbones, and it addresses common artifacts (e.g., anatomical distortions, oversmoothing) that identity-only or MSE objectives do not correct.
Current limitations include the coverage and fidelity of the curated reward datasets, the complexity introduced by reward model training pipelines, and the computational demands of multi-stage fine-tuning. Dynamic path guidance mechanisms offer an effective means to balance RL-driven preference alignment with preservation of visual identity and output naturalness.
A plausible implication is that future IA-RFT systems may further benefit from expanding reward modeling to cover a broader range of subjective and semantic criteria, including explicit bias or demographic fairness controls, and that dynamic trajectory interventions (such as DPG) will remain crucial for high-fidelity RL in generative image tasks.