Reward Fine-Tuning for Identity Consistency
- Identity Consistency Reward Fine-Tuning is a reinforcement learning-driven method that employs explicit, identity-focused reward functions to maintain stable outputs across modalities.
- The approach demonstrates significant empirical gains, such as a 117% relative improvement in identity similarity for multi-identity image customization, alongside gains on face restoration, video generation, and persona simulation tasks.
- It integrates techniques like DPO, ReFL, and GRPO with metrics based on cosine similarity and bipartite matching to accurately quantify and optimize identity consistency.
Identity Consistency Reward Fine-Tuning is a set of reinforcement learning (RL)-driven post-training protocols designed to explicitly steer generative models—spanning vision, video, and language domains—toward outputs that robustly preserve identity information. “Identity consistency” refers to the model’s ability to generate outputs (images, videos, text) that maintain the same entities (faces, objects, characters, personas, personal style) stably over time, across modalities, or in the presence of reference exemplars. The central innovation is the construction of an identity-focused reward function, against which models are directly fine-tuned using RL or reward-feedback learning. This paradigm has demonstrated significant improvements over conventional supervised methods for tasks such as visual story grounding, text-to-image generation, face restoration, multi-person video synthesis, and simulation of user personas in LLMs (Oliveira et al., 9 Jul 2025, Chen et al., 23 Apr 2024, Shen et al., 16 Oct 2025, Wu et al., 23 May 2025, Cheng et al., 8 Sep 2025, Abdulhai et al., 31 Oct 2025, Meng et al., 16 Oct 2025).
1. Foundations and Motivation
State-of-the-art generative architectures—including diffusion UNets, flow-matching transformers, and LLMs—often struggle to maintain consistent identity signals. In vision, this leads to identity drift or confusion when generating faces, characters, or multi-subject compositions; in textual and sequential domains, it manifests as referential errors or persona inconsistencies. Traditional reconstruction- or CLIP-based losses provide weak or diffuse training signals with respect to identity, motivating the explicit use of identity consistency reward objectives.
These reward functions are typically computed via pretrained or fine-tuned embedding models (e.g., FaceNet, ArcFace, VLM encoders, or LLMs serving as consistency oracles) and operate by measuring similarity between generated and reference identities, or by quantifying the presence and accurate linking of entities across outputs. This formalization enables direct, target-driven fine-tuning of generators to maximize expected reward, using policy gradient or reward-feedback RL schemes adapted for offline and gradient-based settings (Chen et al., 23 Apr 2024, Shen et al., 16 Oct 2025, Oliveira et al., 9 Jul 2025).
2. Reward Function Design and Mathematical Formalism
Visual Identity & Face Consistency
In image and video domains, identity rewards are computed using cosine similarity in embedding space. For single-face scenarios,

$$R_{\text{id}} = \cos(e_{\text{gen}}, e_{\text{ref}}) = \frac{e_{\text{gen}} \cdot e_{\text{ref}}}{\lVert e_{\text{gen}} \rVert\,\lVert e_{\text{ref}} \rVert},$$

where $e_{\text{gen}}$ and $e_{\text{ref}}$ are face embeddings extracted from the generated and reference images, respectively (Chen et al., 23 Apr 2024).
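A minimal sketch of this reward in PyTorch, with random tensors standing in for the outputs of a frozen face encoder such as ArcFace (the `identity_reward` helper name is illustrative, not taken from the cited work):

```python
import torch
import torch.nn.functional as F

def identity_reward(gen_embedding: torch.Tensor, ref_embedding: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between generated and reference face embeddings.

    Both inputs are (batch, dim) tensors produced by a frozen face encoder;
    higher values indicate better identity preservation.
    """
    return F.cosine_similarity(gen_embedding, ref_embedding, dim=-1)

# Random embeddings stand in for encoder outputs.
gen = torch.randn(4, 512)
ref = torch.randn(4, 512)
print(identity_reward(gen, ref))  # one reward per image in the batch
```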
In multi-identity contexts, e.g., UMO, a bipartite matching scheme is applied:
- Detected reference faces $\{r_i\}$ and generated faces $\{g_j\}$ are compared via a face embedding network $\phi$, yielding a similarity matrix $S_{ij} = \cos\big(\phi(r_i), \phi(g_j)\big)$,
- The assignment matrix $A$ maximizing the total matched similarity $\sum_{i,j} A_{ij} S_{ij}$ is determined by the Hungarian algorithm,
- The multi-identity matching reward (MIMR) is the similarity accumulated under this optimal assignment, with $A_{ij}=1$ for correct assignments and low or penalized scores for confused identities (Cheng et al., 8 Sep 2025). The matching step is sketched below.
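The matching step can be sketched with SciPy's Hungarian solver; the embeddings below are random placeholders for the outputs of the face embedding network, and the exact MIMR normalization and confusion penalty in the published method may differ:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def multi_identity_matching_reward(ref_embs: np.ndarray, gen_embs: np.ndarray) -> float:
    """Mean cosine similarity under the optimal reference<->generated face assignment.

    ref_embs: (N, d) embeddings of detected reference faces.
    gen_embs: (M, d) embeddings of detected generated faces.
    """
    # Normalize rows so dot products become cosine similarities.
    ref = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    gen = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    sim = ref @ gen.T                          # (N, M) similarity matrix S_ij
    rows, cols = linear_sum_assignment(-sim)   # Hungarian algorithm: maximize total similarity
    return float(sim[rows, cols].mean())

# Toy usage with random embeddings standing in for a face encoder.
reward = multi_identity_matching_reward(np.random.randn(3, 512), np.random.randn(4, 512))
```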
Language & Storytelling Consistency
For entity consistency across frames (visual storytelling), dual rewards are used:
- Entity Re-ID Reward: measures the persistence of character/object references across frames, weighted by importance weights $w_k$ defined as normalized frame appearance rates; coefficients $\alpha$ and $\beta$ control the balance between the two reward terms.
- Grounding Reward: evaluates the precision of mapping pronouns/proper nouns to unique entities, $R_{\text{ground}} = N_{\text{grounded}} / N_{\text{total}}$, where $N_{\text{grounded}}$ counts correctly grounded mentions and $N_{\text{total}}$ the total number of mentions. Both terms are sketched after this list.
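A rough sketch of both terms, under the assumption that per-frame entity sets and mention counts are supplied by an upstream detector/parser; this is one plausible instantiation, not the published formulas:

```python
from collections import Counter
from typing import List, Set

def entity_reid_reward(frame_entities: List[Set[str]], alpha: float = 0.5) -> float:
    """Persistence of entities across frames, weighted by normalized appearance rates w_k."""
    counts = Counter(e for frame in frame_entities for e in frame)
    if not counts:
        return 0.0
    total_appearances = sum(counts.values())
    weights = {e: c / total_appearances for e, c in counts.items()}  # w_k
    # An entity counts as re-identified if it recurs in more than one frame.
    return alpha * sum(w for e, w in weights.items() if counts[e] > 1)

def grounding_reward(grounded_mentions: int, total_mentions: int, beta: float = 0.5) -> float:
    """Precision of mapping pronouns/proper nouns to unique entities."""
    return beta * grounded_mentions / total_mentions if total_mentions else 0.0

# Example: the same two characters persist across three frames; 7 of 9 mentions are grounded.
frames = [{"alice", "bob"}, {"alice", "bob", "dog"}, {"alice", "bob"}]
print(entity_reid_reward(frames) + grounding_reward(7, 9))
```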
LLM-Based Consistency for Personas
For simulated user identities or personas, rewards are based on three LLM-judged binary metrics:
- Prompt-to-Line Consistency: Checks that each generated line adheres to the persona specified in the prompt.
- Line-to-Line Consistency: Checks for contradiction between turns.
- Q&A Consistency: Stability of factual persona responses (Abdulhai et al., 31 Oct 2025).
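These checks can be aggregated into a scalar reward roughly as follows; `judge` is a hypothetical callable standing in for the LLM-as-judge, returning 1 for a consistent pair and 0 otherwise:

```python
from typing import Callable, List, Tuple

def persona_consistency_reward(
    persona_prompt: str,
    turns: List[str],
    qa_pairs: List[Tuple[str, str]],
    judge: Callable[[str, str], int],   # hypothetical LLM judge: 1 = consistent, 0 = not
) -> float:
    """Average of the three binary LLM-judged checks described above."""
    # 1) Prompt-to-line: each generated turn should adhere to the persona prompt.
    p2l = [judge("prompt_to_line", persona_prompt + "\n" + t) for t in turns]
    # 2) Line-to-line: consecutive turns should not contradict each other.
    l2l = [judge("line_to_line", a + "\n" + b) for a, b in zip(turns, turns[1:])]
    # 3) Q&A: factual persona answers should remain stable under probing.
    qa = [judge("qa", q + "\n" + a) for q, a in qa_pairs]
    checks = p2l + l2l + qa
    return sum(checks) / len(checks) if checks else 0.0
```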
3. RL and Reward-Feedback Fine-Tuning Algorithms
Direct Preference Optimization (DPO)
Used for sequence generation (e.g., visual storytelling), DPO [Rafailov et al. '23] consumes preference pairs $(y_w, y_l)$ ranked by the identity reward and minimizes

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right],$$

where $\pi_{\text{ref}}$ is a frozen base model and the preference order is determined by the identity reward (Oliveira et al., 9 Jul 2025).
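Once per-sequence log-probabilities are available, the objective reduces to a few lines; a minimal PyTorch sketch (the function name and batching are assumptions, the loss itself is the standard DPO form):

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_w: torch.Tensor,   # log pi_theta(y_w | x), preferred (higher identity reward)
    policy_logp_l: torch.Tensor,   # log pi_theta(y_l | x), dispreferred
    ref_logp_w: torch.Tensor,      # log pi_ref(y_w | x), frozen base model
    ref_logp_l: torch.Tensor,      # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective over a batch of identity-ranked preference pairs."""
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()
```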
Reward Feedback Learning (ReFL) for Diffusion
Adapter- or LoRA-based diffusion models are fine-tuned to maximize identity reward, typically with truncated gradient flow:
- Loss $\mathcal{L}_{\text{ReFL}} = -\lambda\,\mathbb{E}\big[R_{\text{id}}(\hat{x}_0, x_{\text{ref}})\big]$, i.e., the negative expected identity reward of the decoded sample $\hat{x}_0$ against the reference $x_{\text{ref}}$,
- Optionally combined with reconstruction or aesthetic losses for stability,
- Gradient is back-propagated through the (frozen) VAE decoder and a limited number of final denoising steps (Chen et al., 23 Apr 2024, Wu et al., 23 May 2025).
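A schematic of the truncated-gradient ReFL update, assuming generic `unet`, `vae`, `reward_model`, and `scheduler` objects passed in by the caller; the `step_to_x0` helper is a stand-in for whatever latent-to-sample prediction the specific scheduler exposes:

```python
import torch

def refl_step(unet, vae, reward_model, scheduler, latents, timestep, cond, lam: float = 1e-3):
    """One ReFL-style update: backprop the identity reward through the final
    denoising step and the frozen VAE decoder only (truncated gradient flow).

    Earlier denoising steps are assumed to have been run under torch.no_grad();
    only the trainable adapter/LoRA weights inside `unet` receive gradients.
    """
    noise_pred = unet(latents, timestep, cond)                       # last denoising step
    x0_latent = scheduler.step_to_x0(noise_pred, timestep, latents)  # assumed helper: predict clean latent
    image = vae.decode(x0_latent)                                    # frozen decoder
    reward = reward_model(image)                                     # e.g. face-similarity reward head
    loss = -lam * reward.mean()                                      # maximize expected identity reward
    loss.backward()
    return loss
```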
GRPO and PPO Variants for Video and Text
Group Relative Policy Optimization (GRPO) and PPO-based updates normalize advantages within sampled groups, employ ratio clipping, and may omit explicit value functions or critics:

$$\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\big)\right], \quad \rho_i = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}, \quad \hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_j\})}{\operatorname{std}(\{r_j\})},$$

with the reward $r_i$ and advantage $\hat{A}_i$ derived from the identity consistency predictor (Meng et al., 16 Oct 2025, Abdulhai et al., 31 Oct 2025).
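A minimal sketch of the group-normalized, clipped objective in PyTorch; rewards are assumed to come from the identity consistency predictor for a group of samples generated from the same prompt:

```python
import torch

def grpo_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              rewards: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped GRPO-style objective with group-normalized advantages.

    All tensors have shape (group_size,); no learned value function or critic is used.
    """
    # Advantage = reward standardized within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```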
4. Negative Sampling, Data Construction, and Regularization
Contrastive learning improves robustness:
- Synthetic negatives (incoherent frames from unrelated sources) are injected to teach models when to avoid linking unrelated entities (Oliveira et al., 9 Jul 2025).
- In video and face restoration, large hybrid datasets are constructed: human-annotated pairs, synthetic distortions, and filtering via automated metrics (e.g., CLIP similarity control) (Meng et al., 16 Oct 2025, Wu et al., 23 May 2025).
- Dynamic reward model optimization and periodic update cycles counteract reward-hacking and adapt reward functions to evolving generator outputs (Wu et al., 23 May 2025).
Regularization is critical to avoid overfitting:
- KL divergence penalties constrain deviation from pretrained policy distributions (Shen et al., 16 Oct 2025); a minimal form of this penalty is sketched after this list.
- In ReFL schemes, weight regularization prevents loss of generative diversity.
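A minimal form of the KL penalty mentioned in the first bullet, using the common per-sample log-ratio approximation (an assumption; the cited works may use a different estimator):

```python
import torch

def kl_penalty(logp_policy: torch.Tensor, logp_ref: torch.Tensor, kl_coef: float = 0.1) -> torch.Tensor:
    """Approximate KL(pi_theta || pi_ref) over sampled tokens/actions.

    logp_policy / logp_ref: log-probabilities of the sampled outputs under the
    fine-tuned and frozen reference models; the result is added to the training loss.
    """
    return kl_coef * (logp_policy - logp_ref).mean()
```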
5. Quantitative Evaluation and Empirical Results
Identity consistency reward fine-tuning consistently offers substantial empirical gains across modalities:
| Task/Domain | Baseline Metric | Reward-Tuned Metric | Relative Gain | Reference |
|---|---|---|---|---|
| Visual Storytelling | Grounding mAP: 0.27 | 0.31 | +14.8% | (Oliveira et al., 9 Jul 2025) |
| Text-to-Image | FaceSim: 0.739 | 0.800 | +8.2% | (Chen et al., 23 Apr 2024) |
| I2V Generation | FaceSim: 0.477 | 0.696 | +45.9% | (Shen et al., 16 Oct 2025) |
| Multi-ID Custom. | ID-Sim: 31.82 | 69.09 | +117% | (Cheng et al., 8 Sep 2025) |
| Persona RLHF | Consistency: 0.619 (OE) | 0.981 | +58.5% | (Abdulhai et al., 31 Oct 2025) |
| Multi-Human Video | ID Cons.: 2.606 (VACE) | 3.099 | +18.9% | (Meng et al., 16 Oct 2025) |
Gains in identity consistency are typically achieved with minimal or no degradation in other metrics (text/image fidelity, generation diversity). Human studies confirm subjective improvements, particularly in scenarios with small or occluded faces, complex character dynamics, and long dialogue sequences.
6. Architectural and Training Considerations
- Identity reward computation is modular: reward heads are frozen during generator updates, backpropagation is truncated for efficiency, and only adapter/LoRA parameters are updated in most workflows (see the sketch after this list) (Chen et al., 23 Apr 2024, Cheng et al., 8 Sep 2025).
- Large batch/group sizes and diversification of initial noise provide stability during gradient-based RL.
- Bipartite matching and scaffolded embedding networks are used for scalable multi-identity reward calculation.
- Integration with existing architectures (e.g., Stable Diffusion, Qwen Storyteller, VACE, Phantom) is achieved with minimal modifications, facilitating broad applicability (Oliveira et al., 9 Jul 2025, Meng et al., 16 Oct 2025).
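A sketch of the adapter-only update pattern from the first bullet, assuming adapter parameters can be identified by a `"lora"` substring in their names (a common convention, not a guarantee for every codebase):

```python
import torch

def trainable_adapter_params(model: torch.nn.Module):
    """Freeze the backbone; leave only LoRA/adapter weights trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = "lora" in name.lower()
    return [p for p in model.parameters() if p.requires_grad]

# Example (hypothetical pipeline object):
# optimizer = torch.optim.AdamW(trainable_adapter_params(pipeline.unet), lr=1e-4)
```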
7. Limitations, Open Challenges, and Future Directions
- Overly rigid identity constraints can slightly degrade other objectives, such as text-prompt fidelity (Chen et al., 23 Apr 2024).
- Sensitivity to biases (e.g., face recognition accuracy across demographics, pose) persists.
- Current reward functions focus on static or session-level consistency; temporally aware and evolution-tolerant identity metrics are underexplored (Abdulhai et al., 31 Oct 2025).
- Scaling to very large numbers of identities remains non-trivial: context length and capacity induce diminishing returns; base model “in-context” limits can be a bottleneck (Cheng et al., 8 Sep 2025).
- Robust joint optimization of identity, global structure, and style is an open research question.
A plausible implication is that future work will increasingly combine identity consistency rewards with multi-aspect RL (style, structure, semantics) in generative models, employing dynamic and human-in-the-loop reward updates to guarantee both fidelity and diversity across complex real-world scenarios.