Identity-GRPO: Multi-Human Identity Preservation
- Identity-GRPO is a reinforcement learning framework that preserves multi-human identities in video generation using human feedback and specialized GRPO optimization.
- It employs a custom video reward model trained on pairwise comparisons, combining synthetic and human annotations to drive enhanced policy updates.
- The approach achieves significant gains in identity consistency, exhibiting up to an 18.9% improvement over baselines and high user-preference rates.
Identity-GRPO is a reinforcement learning framework designed to optimize multi-human identity preservation in video generation, particularly for settings where multiple human subjects interact dynamically and where subject identity consistency across a sequence is crucial. The approach introduces a human feedback-driven policy optimization pipeline that leverages a GRPO (Group Relative Policy Optimization) variant specialized for evaluating and improving multi-person identity fidelity. This pipeline includes a custom video reward model trained on large-scale pairwise preference data and applies targeted modifications to exploration and learning dynamics during diffusion-based video synthesis. The framework is demonstrated to be highly effective at improving identity consistency metrics over strong baselines.
1. Video Reward Model Construction
The core of Identity-GRPO is a reward model trained to capture human judgments of identity consistency in video. Supervision derives from a comprehensive dataset of preference annotations, combining both human-annotated comparisons and synthetic distortion data:
- Synthetic data: Videos are generated via advanced video diffusion models (e.g., VACE, Phantom, MAGREF) and processed with a filtering pipeline that selects preference pairs (e.g., “original vs generated”) using a multi-modal embedding model (e.g., GME) for automated, scalable annotation.
- Human-annotated data: Human annotators view the reference images, the textual prompt, and two candidate videos, and make a pairwise judgment on which candidate better preserves each individual's identity throughout the sequence.
- Pairwise supervision and the BTT model: The reward model is trained on pairwise labels interpreted through a Bradley-Terry-with-Ties (BTT) framework, which accommodates standard pairwise comparisons while explicitly modeling ties via a tie parameter $\theta$ (e.g., $\theta = 5$). The probability that candidate $y^A$ is preferred over $y^B$ (given context $x$ and prompt $t$) is modeled as
$$P(y^A \succ y^B \mid x, t) = \frac{\exp\big(r(x, t, y^A)\big)}{\exp\big(r(x, t, y^A)\big) + \theta \exp\big(r(x, t, y^B)\big)},$$
where $r(x, t, y)$ is the reward model's scalar score for video $y$.
Empirical studies show that combining filtered synthetic data with high-quality human-labeled data under a “smooth” sampling schedule (a cosine-decaying human/synthetic mixing ratio over training epochs) yields the most accurate reward model.
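A minimal PyTorch sketch of this training signal is given below, assuming the Rao-Kupper form of BTT shown above; the function names (`btt_pair_loss`, `human_data_fraction`) and the decay direction of the smooth schedule are illustrative assumptions rather than the authors' implementation.

```python
import math
import torch

def btt_pair_loss(score_a, score_b, label, theta=5.0):
    """Bradley-Terry-with-Ties (Rao-Kupper) negative log-likelihood for one pair.

    score_a, score_b: scalar tensors, reward-model scores r(x, t, y^A), r(x, t, y^B).
    label: 0 -> A preferred, 1 -> B preferred, 2 -> tie.
    theta: tie parameter (theta >= 1); larger values put more mass on ties.
    """
    diff = score_b - score_a
    p_a = 1.0 / (1.0 + theta * torch.exp(diff))    # P(A preferred over B)
    p_b = 1.0 / (1.0 + theta * torch.exp(-diff))   # P(B preferred over A)
    p_tie = (1.0 - p_a - p_b).clamp_min(1e-8)      # remaining mass models ties
    probs = torch.stack([p_a, p_b, p_tie])
    return -torch.log(probs[label])

def human_data_fraction(epoch, total_epochs):
    """'Smooth' schedule: cosine-shaped human/synthetic mixing ratio over epochs
    (starting human-heavy here; the direction of decay is an assumption)."""
    return 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
```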
2. GRPO Variant for Multi-Human Consistency
Applying conventional GRPO to multi-human identity preservation tasks leads to suboptimal training due to high inter-sample variance and entanglement between prompt and visual cues. Identity-GRPO introduces several architectural and training modifications:
- Prompt Finetuning: A vision-language model (e.g., Qwen2.5-VL-7B) is used to systematically rephrase prompts so that all referenced subjects are represented, improving alignment between the textual input and identity attributes.
- Initial Noise Differentiation: Instead of using a single initial noise seed per sample group (as in some conventional diffusion setups), each sample in a group receives a different noise initialization, increasing the diversity of generated identities and enhancing the ability to distinguish identity consistency.
- Large Group Sampling: Because of the high variance in this multimodal domain, the number of videos sampled per policy update is increased (to group sizes such as 16), made affordable by reducing temporal/spatial resolution; this stabilizes advantage estimation.
- GRPO Objective Modification: The advantage of each video $i$ in a group of $G$ samples is computed relative to the group's reward statistics,
$$\hat{A}_i = \frac{r_i - \operatorname{mean}\big(\{r_j\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r_j\}_{j=1}^{G}\big)},$$
and the update is performed using a clipped surrogate objective,
$$\mathcal{J}(\theta) = \mathbb{E}\Big[\min\big(\rho_t \hat{A}_i,\ \operatorname{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\big)\Big],$$
with $\rho_t = \pi_\theta(x_{t-1} \mid x_t, c)\,/\,\pi_{\theta_{\text{old}}}(x_{t-1} \mid x_t, c)$ being the likelihood ratio between the current and previous policy at denoising step $t$ (see the sketch after this list).
These modifications enhance robustness and exploration during policy updates for scenarios with multiple interacting humans and complex prompt conditioning.
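The sketch below, under the same hedged assumptions (PyTorch tensors, hypothetical function names), shows how the group-relative advantage and the clipped surrogate objective fit together for a group of $G$ videos; in practice each group member is sampled from a different initial noise seed, which is where the initial-noise differentiation above enters.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (group size G, e.g. 16).

    rewards: tensor of shape (G,) from the identity-consistency reward model.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective accumulated over denoising steps.

    logp_new, logp_old: log-probabilities of the sampled denoising transitions
        under the current and previous policy, shape (G, T).
    advantages: per-video group-relative advantages, shape (G,).
    """
    ratio = torch.exp(logp_new - logp_old)                     # rho_t, shape (G, T)
    adv = advantages.unsqueeze(1)                              # broadcast over steps
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()               # negate to minimize
```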
3. Ablation Studies
Ablation experiments isolate the contributions of reward model annotation quality, data sampling, and training configurations:
- Reward model annotation: Models trained solely on human-annotated pairs are more accurate ($0.853$) than those using unfiltered synthetic data ($0.664$). Applying a filter (retaining only synthetic labels agreeing with an expert model) improves accuracy, and a “smooth” schedule combining human and synthetic annotations yields a peak ($0.890$).
- GRPO stability: Increasing group size (number of videos per update) and diversifying initializations both have pronounced positive impacts. For example, at group size 16 with differentiated initial noise, the identity consistency reward peaks (ID-Consistency $3.099$), versus $2.606$ for the VACE baseline.
These findings establish that both data quality and group/exploration strategies in policy optimization are critical determinants of final identity preservation quality.
4. Performance Metrics
Evaluation uses metrics directly corresponding to identity consistency and human judgment:
- Identity-consistency reward score ("ID-Consistency"): Computed using the reward model described above, this score quantifies temporal coherence in identity across frames.
- Human user studies: Participants compare baseline (VACE, Phantom) outputs against Identity-GRPO counterparts; the “winning rate” is the percentage where Identity-GRPO is preferred.
- Empirical improvements: Identity-GRPO improves ID-Consistency over both the VACE and Phantom baselines. For example, VACE-1.3B's score improves from $2.606$ to $3.099$ (an 18.9% gain), and user studies show Identity-GRPO outputs are preferred over the baselines at high winning rates.
These results demonstrate consistent, significant improvements in identity preservation across all conducted evaluations.
5. Reinforcement Learning and MDP Alignment
Identity-GRPO frames video generation as a Markov Decision Process (MDP), aligning reinforcement learning optimization with personalized video outputs:
- State: The state $s_t = (c, t, x_t)$ consists of the conditioning $c$ (text prompt and reference images), the denoising timestep $t$, and the video latent $x_t$.
- Policy: The policy $\pi_\theta(x_{t-1} \mid x_t, c)$ is the diffusion model producing $x_{t-1}$ from $x_t$, conditioned on $c$.
- Reward: Provided only at the final denoising step ($t = 0$), using the human-aligned, pairwise preference-derived reward model.
- Update: The policy is optimized via GRPO, leveraging relative advantage normalization within each batch/group so that the denoising diffusion model is driven to maintain identity fidelity (a minimal rollout sketch follows this list).
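The rollout sketch below illustrates this MDP structure under hedged assumptions: `policy.denoise_step`, `policy.decode`, `policy.latent_shape`, and `reward_model.score` are hypothetical interfaces used only to show the terminal reward and per-member noise seeds, not the authors' API.

```python
import torch

def rollout_group(policy, reward_model, cond, num_steps, group_size=16):
    """Sample a group of videos for one prompt/reference-image pair.

    The reward is assigned once per trajectory, after the final denoising
    step (t = 0); every group member starts from its own initial noise.
    """
    trajectories, rewards = [], []
    for _ in range(group_size):
        x_t = torch.randn(policy.latent_shape)           # differentiated initial noise
        steps = []
        for t in reversed(range(num_steps)):
            # One denoising transition: pi_theta(x_{t-1} | x_t, c) with its log-prob.
            x_prev, logp = policy.denoise_step(x_t, t, cond)
            steps.append((x_t, x_prev, t, logp))
            x_t = x_prev
        video = policy.decode(x_t)
        rewards.append(reward_model.score(cond, video))   # terminal reward only
        trajectories.append(steps)
    return trajectories, torch.stack(rewards)
```

The returned rewards feed directly into the group-relative advantage computation from Section 2, so no per-step reward shaping or learned critic is required.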
This RL-based alignment ensures that fine-tuning moves the model toward outputs preferred by users in terms of identity consistency, without sacrificing general video generation quality or requiring expensive discriminator models for every evaluation.
6. Significance, Applicability, and Future Directions
Identity-GRPO exemplifies a scalable approach for integrating human-aligned objectives into high-dimensional, generative video models for applications where personalized, multi-human identity preservation is essential. By combining preference-informed reward modeling, exploration-stabilized policy optimization, and carefully tailored GRPO variants, it advances the state of the art in personalized video synthesis.
Potential extensions include adapting the framework to broader personalized generation settings, further automating reward model refinement (e.g., leveraging improved multimodal embedding models or teacher-student distillation for annotation efficiency), and incorporating advanced exploration strategies for even more complex multi-subject scenarios.
This approach demonstrates how domain-tailored RL pipelines, grounded in robust group-relative estimation and human feedback, can address previously intractable challenges in video identity preservation.