Reward Fine-Tuning for Identity Consistency
- Identity Consistency Reward Fine-Tuning is a reinforcement learning-driven method that employs explicit, identity-focused reward functions to maintain stable outputs across modalities.
- The approach demonstrates significant empirical gains, such as a 117% relative improvement in identity similarity for multi-identity image customization, alongside gains on face restoration, video generation, and persona simulation tasks.
- It integrates techniques like DPO, ReFL, and GRPO with metrics based on cosine similarity and bipartite matching to accurately quantify and optimize identity consistency.
Identity Consistency Reward Fine-Tuning is a set of reinforcement learning (RL)-driven post-training protocols designed to explicitly steer generative models—spanning vision, video, and language domains—toward outputs that robustly preserve identity information. “Identity consistency” refers to the model’s ability to generate outputs (images, videos, text) that maintain the same entities (faces, objects, characters, personas, personal style) stably over time, across modalities, or in the presence of reference exemplars. The central innovation is the construction of an identity-focused reward function, against which models are directly fine-tuned using RL or reward-feedback learning. This paradigm has demonstrated significant improvements over conventional supervised methods for tasks such as visual story grounding, text-to-image generation, face restoration, multi-person video synthesis, and simulation of user personas in LLMs (Oliveira et al., 9 Jul 2025, Chen et al., 23 Apr 2024, Shen et al., 16 Oct 2025, Wu et al., 23 May 2025, Cheng et al., 8 Sep 2025, Abdulhai et al., 31 Oct 2025, Meng et al., 16 Oct 2025).
1. Foundations and Motivation
State-of-the-art generative architectures—including diffusion UNets, flow-matching transformers, and LLMs—often struggle to maintain consistent identity signals. In vision, this leads to identity drift or confusion when generating faces, characters, or multi-subject compositions; in textual and sequential domains, it manifests as referential errors or persona inconsistencies. Traditional reconstruction- or CLIP-based losses provide weak or diffuse training signals with respect to identity, motivating the explicit use of identity consistency reward objectives.
These reward functions are typically computed via pretrained or fine-tuned embedding models (e.g., FaceNet, ArcFace, VLM encoders, or LLMs serving as consistency oracles) and operate by measuring similarity between generated and reference identities, or by quantifying the presence and accurate linking of entities across outputs. This formalization enables direct, target-driven fine-tuning of generators to maximize expected reward, using policy gradient or reward-feedback RL schemes adapted for offline and gradient-based settings (Chen et al., 23 Apr 2024, Shen et al., 16 Oct 2025, Oliveira et al., 9 Jul 2025).
2. Reward Function Design and Mathematical Formalism
Visual Identity & Face Consistency
In image and video domains, identity rewards are computed using cosine similarity in embedding space. For single-face scenarios,

$$R_{\text{id}} = \cos(e_{\text{gen}}, e_{\text{ref}}) = \frac{e_{\text{gen}} \cdot e_{\text{ref}}}{\lVert e_{\text{gen}} \rVert\,\lVert e_{\text{ref}} \rVert},$$

where $e_{\text{gen}}$ and $e_{\text{ref}}$ are face embeddings extracted from the generated and reference images, respectively (Chen et al., 23 Apr 2024).
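A minimal sketch of this reward in PyTorch, with random tensors standing in for the outputs of a frozen face encoder such as ArcFace (the `identity_reward` helper name is illustrative, not taken from the cited work):

```python
import torch
import torch.nn.functional as F

def identity_reward(gen_embedding: torch.Tensor, ref_embedding: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between generated and reference face embeddings.

    Both inputs are (batch, dim) tensors produced by a frozen face encoder;
    higher values indicate better identity preservation.
    """
    return F.cosine_similarity(gen_embedding, ref_embedding, dim=-1)

# Random embeddings stand in for encoder outputs.
gen = torch.randn(4, 512)
ref = torch.randn(4, 512)
print(identity_reward(gen, ref))  # one reward per image in the batch
```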
In multi-identity contexts, e.g., UMO, a bipartite matching scheme is applied:
- Detected reference faces $\{r_i\}$ and generated faces $\{g_j\}$ are compared via a face embedding network $\phi$, yielding a similarity matrix $S_{ij} = \cos\big(\phi(r_i), \phi(g_j)\big)$,
- The assignment matrix $A$ maximizing the total matched similarity $\sum_{i,j} A_{ij} S_{ij}$ is determined by the Hungarian algorithm,
- The multi-identity matching reward (MIMR) is the similarity accumulated under this optimal assignment, with $A_{ij}=1$ for correct assignments and low or penalized scores for confused identities (Cheng et al., 8 Sep 2025). The matching step is sketched below.
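The matching step can be sketched with SciPy's Hungarian solver; the embeddings below are random placeholders for the outputs of the face embedding network, and the exact MIMR normalization and confusion penalty in the published method may differ:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def multi_identity_matching_reward(ref_embs: np.ndarray, gen_embs: np.ndarray) -> float:
    """Mean cosine similarity under the optimal reference<->generated face assignment.

    ref_embs: (N, d) embeddings of detected reference faces.
    gen_embs: (M, d) embeddings of detected generated faces.
    """
    # Normalize rows so dot products become cosine similarities.
    ref = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    gen = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    sim = ref @ gen.T                          # (N, M) similarity matrix S_ij
    rows, cols = linear_sum_assignment(-sim)   # Hungarian algorithm: maximize total similarity
    return float(sim[rows, cols].mean())

# Toy usage with random embeddings standing in for a face encoder.
reward = multi_identity_matching_reward(np.random.randn(3, 512), np.random.randn(4, 512))
```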
Language & Storytelling Consistency
For entity consistency across frames (visual storytelling), dual rewards are used:
- Entity Re-ID Reward: measures the persistence of character/object references across frames, weighted by importance weights $w_k$ defined as normalized frame appearance rates; coefficients $\alpha$ and $\beta$ control the balance between the two reward terms.
- Grounding Reward: evaluates the precision of mapping pronouns/proper nouns to unique entities, $R_{\text{ground}} = N_{\text{grounded}} / N_{\text{total}}$, where $N_{\text{grounded}}$ counts correctly grounded mentions and $N_{\text{total}}$ the total number of mentions. Both terms are sketched after this list.
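A rough sketch of both terms, under the assumption that per-frame entity sets and mention counts are supplied by an upstream detector/parser; this is one plausible instantiation, not the published formulas:

```python
from collections import Counter
from typing import List, Set

def entity_reid_reward(frame_entities: List[Set[str]], alpha: float = 0.5) -> float:
    """Persistence of entities across frames, weighted by normalized appearance rates w_k."""
    counts = Counter(e for frame in frame_entities for e in frame)
    if not counts:
        return 0.0
    total_appearances = sum(counts.values())
    weights = {e: c / total_appearances for e, c in counts.items()}  # w_k
    # An entity counts as re-identified if it recurs in more than one frame.
    return alpha * sum(w for e, w in weights.items() if counts[e] > 1)

def grounding_reward(grounded_mentions: int, total_mentions: int, beta: float = 0.5) -> float:
    """Precision of mapping pronouns/proper nouns to unique entities."""
    return beta * grounded_mentions / total_mentions if total_mentions else 0.0

# Example: the same two characters persist across three frames; 7 of 9 mentions are grounded.
frames = [{"alice", "bob"}, {"alice", "bob", "dog"}, {"alice", "bob"}]
print(entity_reid_reward(frames) + grounding_reward(7, 9))
```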
LLM-Based Consistency for Personas
For simulated user identities or personas, rewards are based on three LLM-judged binary metrics:
- Prompt-to-Line Consistency: Checks that each generated line adheres to the persona specified in the prompt.
- Line-to-Line Consistency: Checks for contradiction between turns.
- Q&A Consistency: Stability of factual persona responses (Abdulhai et al., 31 Oct 2025).
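These checks can be aggregated into a scalar reward roughly as follows; `judge` is a hypothetical callable standing in for the LLM-as-judge, returning 1 for a consistent pair and 0 otherwise:

```python
from typing import Callable, List, Tuple

def persona_consistency_reward(
    persona_prompt: str,
    turns: List[str],
    qa_pairs: List[Tuple[str, str]],
    judge: Callable[[str, str], int],   # hypothetical LLM judge: 1 = consistent, 0 = not
) -> float:
    """Average of the three binary LLM-judged checks described above."""
    # 1) Prompt-to-line: each generated turn should adhere to the persona prompt.
    p2l = [judge("prompt_to_line", persona_prompt + "\n" + t) for t in turns]
    # 2) Line-to-line: consecutive turns should not contradict each other.
    l2l = [judge("line_to_line", a + "\n" + b) for a, b in zip(turns, turns[1:])]
    # 3) Q&A: factual persona answers should remain stable under probing.
    qa = [judge("qa", q + "\n" + a) for q, a in qa_pairs]
    checks = p2l + l2l + qa
    return sum(checks) / len(checks) if checks else 0.0
```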
3. RL and Reward-Feedback Fine-Tuning Algorithms
Direct Preference Optimization (DPO)
Used for sequence generation (e.g., visual storytelling), DPO [Rafailov et al. '23] consumes preference pairs $(y_w, y_l)$ ranked by the identity reward and minimizes

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right],$$

where $\pi_{\text{ref}}$ is a frozen base model and the preference order is determined by the identity reward (Oliveira et al., 9 Jul 2025).
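Once per-sequence log-probabilities are available, the objective reduces to a few lines; a minimal PyTorch sketch (the function name and batching are assumptions, the loss itself is the standard DPO form):

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_w: torch.Tensor,   # log pi_theta(y_w | x), preferred (higher identity reward)
    policy_logp_l: torch.Tensor,   # log pi_theta(y_l | x), dispreferred
    ref_logp_w: torch.Tensor,      # log pi_ref(y_w | x), frozen base model
    ref_logp_l: torch.Tensor,      # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective over a batch of identity-ranked preference pairs."""
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()
```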
Reward Feedback Learning (ReFL) for Diffusion
Adapter- or LoRA-based diffusion models are fine-tuned to maximize identity reward, typically with truncated gradient flow:
- Loss $\mathcal{L}_{\text{ReFL}} = -\lambda\,\mathbb{E}\big[R_{\text{id}}(\hat{x}_0, x_{\text{ref}})\big]$, i.e., the negative expected identity reward of the decoded sample $\hat{x}_0$ against the reference $x_{\text{ref}}$,
- Optionally combined with reconstruction or aesthetic losses for stability,
- Gradient is back-propagated through the (frozen) VAE decoder and a limited number of final denoising steps (Chen et al., 23 Apr 2024, Wu et al., 23 May 2025).
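A schematic of the truncated-gradient ReFL update, assuming generic `unet`, `vae`, `reward_model`, and `scheduler` objects passed in by the caller; the `step_to_x0` helper is a stand-in for whatever latent-to-sample prediction the specific scheduler exposes:

```python
import torch

def refl_step(unet, vae, reward_model, scheduler, latents, timestep, cond, lam: float = 1e-3):
    """One ReFL-style update: backprop the identity reward through the final
    denoising step and the frozen VAE decoder only (truncated gradient flow).

    Earlier denoising steps are assumed to have been run under torch.no_grad();
    only the trainable adapter/LoRA weights inside `unet` receive gradients.
    """
    noise_pred = unet(latents, timestep, cond)                       # last denoising step
    x0_latent = scheduler.step_to_x0(noise_pred, timestep, latents)  # assumed helper: predict clean latent
    image = vae.decode(x0_latent)                                    # frozen decoder
    reward = reward_model(image)                                     # e.g. face-similarity reward head
    loss = -lam * reward.mean()                                      # maximize expected identity reward
    loss.backward()
    return loss
```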
GRPO and PPO Variants for Video and Text
Group Relative Policy Optimization (GRPO) and PPO-based updates normalize advantages within sampled groups, employ ratio clipping, and may omit explicit value functions or critics:

$$\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\big)\right], \quad \rho_i = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}, \quad \hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_j\})}{\operatorname{std}(\{r_j\})},$$

with the reward $r_i$ and advantage $\hat{A}_i$ derived from the identity consistency predictor (Meng et al., 16 Oct 2025, Abdulhai et al., 31 Oct 2025).
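A minimal sketch of the group-normalized, clipped objective in PyTorch; rewards are assumed to come from the identity consistency predictor for a group of samples generated from the same prompt:

```python
import torch

def grpo_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              rewards: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped GRPO-style objective with group-normalized advantages.

    All tensors have shape (group_size,); no learned value function or critic is used.
    """
    # Advantage = reward standardized within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```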
4. Negative Sampling, Data Construction, and Regularization
Contrastive learning improves robustness:
- Synthetic negatives (incoherent frames from unrelated sources) are injected to teach models when to avoid linking unrelated entities (Oliveira et al., 9 Jul 2025).
- In video and face restoration, large hybrid datasets are constructed: human-annotated pairs, synthetic distortions, and filtering via automated metrics (e.g., CLIP similarity control) (Meng et al., 16 Oct 2025, Wu et al., 23 May 2025).
- Dynamic reward model optimization and periodic update cycles counteract reward-hacking and adapt reward functions to evolving generator outputs (Wu et al., 23 May 2025).
Regularization is critical to avoid overfitting:
- KL divergence penalties constrain deviation from pretrained policy distributions (Shen et al., 16 Oct 2025); a minimal form of this penalty is sketched after this list.
- In ReFL schemes, weight regularization prevents loss of generative diversity.
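A minimal form of the KL penalty mentioned in the first bullet, using the common per-sample log-ratio approximation (an assumption; the cited works may use a different estimator):

```python
import torch

def kl_penalty(logp_policy: torch.Tensor, logp_ref: torch.Tensor, kl_coef: float = 0.1) -> torch.Tensor:
    """Approximate KL(pi_theta || pi_ref) over sampled tokens/actions.

    logp_policy / logp_ref: log-probabilities of the sampled outputs under the
    fine-tuned and frozen reference models; the result is added to the training loss.
    """
    return kl_coef * (logp_policy - logp_ref).mean()
```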
5. Quantitative Evaluation and Empirical Results
Identity consistency reward fine-tuning consistently offers substantial empirical gains across modalities:
| Task/Domain | Baseline Metric | Reward-Tuned Metric | Relative Gain | Reference |
|---|---|---|---|---|
| Visual Storytelling | Grounding mAP: 0.27 | 0.31 | +14.8% | (Oliveira et al., 9 Jul 2025) |
| Text-to-Image | FaceSim: 0.739 | 0.800 | +8.2% | (Chen et al., 23 Apr 2024) |
| I2V Generation | FaceSim: 0.477 | 0.696 | +45.9% | (Shen et al., 16 Oct 2025) |
| Multi-ID Custom. | ID-Sim: 31.82 | 69.09 | +117% | (Cheng et al., 8 Sep 2025) |
| Persona RLHF | Consistency: 0.619 (OE) | 0.981 | +58.5% | (Abdulhai et al., 31 Oct 2025) |
| Multi-Human Video | ID Cons.: 2.606 (VACE) | 3.099 | +18.9% | (Meng et al., 16 Oct 2025) |
Gains in identity consistency are typically achieved with minimal or no degradation in other metrics (text/image fidelity, generation diversity). Human studies confirm subjective improvements, particularly in scenarios with small or occluded faces, complex character dynamics, and long dialogue sequences.
6. Architectural and Training Considerations
- Identity reward computation is modular: reward heads are frozen during generator updates, backpropagation is truncated for efficiency, and only adapter/LoRA parameters are updated in most workflows (see the sketch after this list) (Chen et al., 23 Apr 2024, Cheng et al., 8 Sep 2025).
- Large batch/group sizes and diversification of initial noise provide stability during gradient-based RL.
- Bipartite matching and scaffolded embedding networks are used for scalable multi-identity reward calculation.
- Integration with existing architectures (e.g., Stable Diffusion, Qwen Storyteller, VACE, Phantom) is achieved with minimal modifications, facilitating broad applicability (Oliveira et al., 9 Jul 2025, Meng et al., 16 Oct 2025).
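A sketch of the adapter-only update pattern from the first bullet, assuming adapter parameters can be identified by a `"lora"` substring in their names (a common convention, not a guarantee for every codebase):

```python
import torch

def trainable_adapter_params(model: torch.nn.Module):
    """Freeze the backbone; leave only LoRA/adapter weights trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = "lora" in name.lower()
    return [p for p in model.parameters() if p.requires_grad]

# Example (hypothetical pipeline object):
# optimizer = torch.optim.AdamW(trainable_adapter_params(pipeline.unet), lr=1e-4)
```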
7. Limitations, Open Challenges, and Future Directions
- Overly rigid identity constraints can slightly degrade other objectives, such as text-prompt fidelity (Chen et al., 23 Apr 2024).
- Sensitivity to biases (e.g., face recognition accuracy across demographics, pose) persists.
- Current reward functions focus on static or session-level consistency; temporally aware and evolution-tolerant identity metrics are underexplored (Abdulhai et al., 31 Oct 2025).
- Scaling to very large numbers of identities remains non-trivial: context length and capacity induce diminishing returns; base model “in-context” limits can be a bottleneck (Cheng et al., 8 Sep 2025).
- Robust joint optimization of identity, global structure, and style is an open research question.
A plausible implication is that future work will increasingly combine identity consistency rewards with multi-aspect RL (style, structure, semantics) in generative models, employing dynamic and human-in-the-loop reward updates to guarantee both fidelity and diversity across complex real-world scenarios.