Identity-Coherence Reinforcement Learning

Updated 4 July 2026

Identity-Coherence Reinforcement Learning is an approach where reinforcement learning is used to preserve an entity’s distinct identity across variations, interactions, and sequential decisions.
It operationalizes identity via measurable proxies such as feature-space consistency, temporal credit in multi-agent settings, and cross-modal alignment in generative tasks.
Reported applications include privacy-sensitive synthesis, online speaker diarization, and cooperative multi-agent systems, where the approach balances identity stability against task reward.

Identity-Coherence Reinforcement Learning is a cross-cutting pattern in reinforcement learning in which the optimization target is not merely task reward or generic semantic correctness, but the preservation of an entity-specific identity under variation, interaction, or sequential decision making. The literature does not present a single canonical formalism under that exact name. Instead, several works instantiate closely related ideas in different domains: privacy-sensitive face recognition and person re-identification through diffusion-policy optimization (Jia et al., 9 Apr 2026); online speaker diarization as persistent identity assignment under feedback (Lin et al., 2023); credit-level individuality in cooperative MARL (Liu et al., 2022); observation-based emergence of individuality in MARL (Jiang et al., 2020); hidden-role identity inference in stochastic games (Han et al., 2022); behavioral camouflage of a privileged leader in multi-robot navigation (Deka et al., 2021); retrieval-based cross-modal identity mapping for captioning (Jia et al., 2 Mar 2026); and identity-preserving multi-subject image and video generation via reward-driven optimization (Wu et al., 26 Sep 2025, Wei et al., 12 Mar 2026, Meng et al., 16 Oct 2025). Across these settings, “identity coherence” is typically operationalized through persistent identity-conditioned consistency in feature space, trajectory-level role continuity, stable relation inference, or reference-preserving generation under nuisance variation rather than through a single universal coherence loss.

1. Conceptual scope and definitions

The strongest explicit near-instance of the topic is "Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition" (Jia et al., 9 Apr 2026). That work is best understood as a reinforcement-guided synthetic data generation method with a strong but mostly implicit notion of identity coherence: the task is to synthesize additional training images for known identities such that the synthetic images remain identity-consistent, span useful intra-class variation, and improve downstream recognition. In that setting, identity coherence is operationalized through feature-space consistency of generated samples with the target identity, plus controlled dispersion around that identity to cover pose, expression, or illumination variation without drifting away from the identity manifold (Jia et al., 9 Apr 2026).

Other papers broaden the concept. "A Reinforcement Learning Framework for Online Speaker Diarization" reframes diarization as a sequential identity-assignment problem in which the agent must reuse or create speaker labels over time under sparse feedback, but without a fully formalized memory or anti-switch objective (Lin et al., 2023). "Contrastive Identity-Aware Learning for Multi-Agent Value Decomposition" uses persistent agent identity as a supervisory signal on temporal credit vectors, so that different agents receive identity-distinguishable credits and thereby develop individualized cooperative roles (Liu et al., 2022). "The Emergence of Individuality" defines identity behaviorally through a classifier $P_\phi(I\mid O)$ that predicts which agent generated an observation, rewarding agents for visiting their own familiar observations (Jiang et al., 2020). "Identity Detection Reinforcement Learning" addresses hidden-role stochastic games where an agent must infer whether others are teammates or opponents from behavior, then select an appropriate policy under that uncertain social identity structure (Han et al., 2022).

In generative modeling, the same theme appears as preservation of reference subject identity under modality conversion or spatiotemporal transformation. "Cross-modal Identity Mapping" treats a caption as identity-preserving if it retains enough fine-grained image content that retrieval induced by the caption remains coherent and relevant to the source image (Jia et al., 2 Mar 2026). "MultiCrafter" and "DreamVideo-Omni" focus on multi-subject generation, where identity coherence means preventing subject mixing, attribute leakage, and role swapping while allowing diverse layouts or motion (Wu et al., 26 Sep 2025, Wei et al., 12 Mar 2026). "Identity-GRPO" makes the point most directly for multi-human video: the core problem is preserving the identity of each individual, the assignment between reference images and generated people, and temporal consistency of each subject across frames (Meng et al., 16 Oct 2025).

A plausible synthesis is that Identity-Coherence Reinforcement Learning is best treated as an umbrella for RL formulations in which an identity-bearing entity—person, speaker, agent, subject, role, or source image semantics—must remain stably recognizable while the policy induces variation, interaction, or sequential evolution.

2. Operationalizations of identity coherence

A recurring pattern is that coherence is rarely defined as a standalone abstract variable. It is usually encoded through measurable proxies tied to the task.

In privacy-sensitive identity synthesis, the central proxy is feature-space prototype alignment. For identity $y$, a memory bank $\mathcal{B}_y=\{f_i\}_{i=1}^{N_y}$ defines a prototype

$\bar{f}_y = \frac{1}{N_y} \sum_{i=1}^{N_y}{f}_i, \quad \hat{f}_y = \frac{\bar{f}_y}{\| \bar{f}_y \|_2},$

and if $\hat f_g$ is the normalized feature of a generated image, semantic consistency is

$R_{\text{sem}} = \frac{1}{2}\left(\hat{f}_g^\top \hat{f}_y + 1\right).$

The same paper adds a coverage reward based on an RBF kernel over generated and reference features, and an expressive-diversity term based on covariance trace matching, so coherence means remaining near the identity center while covering plausible within-identity submodes (Jia et al., 9 Apr 2026).
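The prototype and semantic-consistency formulas above can be illustrated with a minimal sketch. It uses plain Python lists as stand-ins for embedding vectors; the function names and the toy feature bank are hypothetical and are not taken from the paper, and the coverage and diversity terms are not shown.

```python
import math

def l2_normalize(v):
    """Return v scaled to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def identity_prototype(bank):
    """Mean of the memory-bank features for one identity, then L2-normalized."""
    n = len(bank)
    dim = len(bank[0])
    mean = [sum(f[d] for f in bank) / n for d in range(dim)]
    return l2_normalize(mean)

def semantic_consistency(generated_feature, bank):
    """R_sem = (cos(f_g, f_y) + 1) / 2, which lies in [0, 1]."""
    f_g = l2_normalize(generated_feature)
    f_y = identity_prototype(bank)
    cosine = sum(a * b for a, b in zip(f_g, f_y))
    return 0.5 * (cosine + 1.0)

# Toy 2-D features for a single identity (illustrative values only).
bank = [[1.0, 0.0], [1.0, 0.2], [1.0, -0.2]]
r_same = semantic_consistency([2.0, 0.0], bank)    # aligned with the prototype
r_orth = semantic_consistency([0.0, 1.0], bank)    # orthogonal to the prototype
r_opp = semantic_consistency([-1.0, 0.0], bank)    # opposite to the prototype
```

The affine map from cosine similarity to the unit interval keeps the reward non-negative, which makes it easier to mix with other bounded reward terms.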

In cooperative MARL, identity coherence appears as persistence of individualized contribution patterns. CIA defines per-agent temporal credit using

$x^k_t = \frac{\partial Q^{tot}_t}{\partial Q^k_t},$

aggregates it over a trajectory into $\boldsymbol{x}^k_\tau$, and maximizes mutual information between that temporal credit vector and a learnable identity representation $\boldsymbol{w}^k$ via an InfoNCE-style loss (Liu et al., 2022). In EOI, coherence is weaker and more behavioral: the intrinsic reward is $p_\phi(i\mid o_i)$, and a positive-distance regularizer encourages nearby observations from the same agent’s trajectory to induce similar identity predictions (Jiang et al., 2020).
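As a rough illustration of an InfoNCE-style objective between per-agent credit vectors and identity embeddings, the sketch below scores each credit vector against every identity and treats the matching identity as the positive. The dot-product scoring function, the temperature value, and all names are assumptions made for illustration; the scoring function used in CIA may differ.

```python
import math

def info_nce(credits, identities, temperature=0.1):
    """Mean InfoNCE loss where credits[k] should match identities[k].

    Assumption: similarity is a plain dot product scaled by a temperature.
    """
    losses = []
    for k, x in enumerate(credits):
        logits = [sum(a * b for a, b in zip(x, w)) / temperature
                  for w in identities]
        # Numerically stable log-sum-exp over all candidate identities.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[k])
    return sum(losses) / len(losses)

identities = [[1.0, 0.0], [0.0, 1.0]]   # toy identity embeddings w^k
distinct = [[0.9, 0.1], [0.1, 0.9]]     # credits that differ by agent
collapsed = [[0.5, 0.5], [0.5, 0.5]]    # credits that ignore identity
loss_distinct = info_nce(distinct, identities)
loss_collapsed = info_nce(collapsed, identities)
```

Identity-distinguishable credits give a lower loss than collapsed ones, which is the sense in which minimizing such a loss pushes agents toward individualized credit patterns.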

In hidden-role and online identity-tracking tasks, coherence is tied to stable label assignment under partial observability. IDRL’s relation network outputs a confidence vector over whether other agents are teammates, while a danger network supplies a threshold determining whether those identity inferences are safe enough to act on (Han et al., 2022). Online speaker diarization similarly treats identity coherence as persistent reuse or expansion of speaker labels under reward, but the paper does not provide explicit speaker-memory equations, switch penalties, or temporal consistency losses (Lin et al., 2023).

In generative image and video work, coherence is frequently defined through matching to reference identities under transformations. MultiCrafter uses a Multi-ID Alignment Reward that builds a pairwise cosine similarity matrix between reference-face embeddings and generated-face embeddings, then solves a one-to-one Hungarian assignment over that matrix, which discourages identity collapse, duplication, and attribute leakage (Wu et al., 26 Sep 2025). DreamVideo-Omni instead learns a latent identity reward model that takes a noised video latent, a clean reference latent, and text conditioning, then predicts a scalar identity reward from cross-attended video and reference features (Wei et al., 12 Mar 2026). Identity-GRPO learns an identity-consistency reward from pairwise human and filtered synthetic preferences over whole videos rather than from ArcFace-style similarity alone (Meng et al., 16 Oct 2025).
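A one-to-one matching reward of the kind described for MultiCrafter can be sketched as follows. This is not the paper's implementation: it swaps the Hungarian algorithm for a brute-force search over permutations, which gives the same optimum but is only practical for a handful of subjects, and the averaging of matched similarities is an assumption.

```python
from itertools import permutations

def best_assignment(similarity):
    """Return (mean matched similarity, assignment) for a square matrix.

    similarity[i][j] is the cosine similarity between reference face i and
    generated face j. Brute force over permutations stands in for the
    Hungarian algorithm here.
    """
    n = len(similarity)
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        score = sum(similarity[i][perm[i]] for i in range(n)) / n
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm

# Two references, two generated faces, rendered in swapped order.
sim = [[0.2, 0.9],
       [0.8, 0.3]]
reward, assignment = best_assignment(sim)

# Both generated faces resemble reference 0: a duplication failure.
duplicated = [[0.9, 0.9],
              [0.1, 0.1]]
dup_reward, _ = best_assignment(duplicated)
```

Because each reference must be matched to a different generated face, a sample that duplicates one identity cannot score well on every reference, which is why the one-to-one constraint penalizes collapse and duplication.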

Cross-modal captioning offers a different operationalization. CIM defines two retrieval-based quantities, Gallery Representation Consistency and Query-gallery Image Relevance, and combines them into the reward, so a caption is identity-preserving when it induces a coherent retrieved image cluster that also remains visually relevant to the source image (Jia et al., 2 Mar 2026).

3. Reinforcement-learning formulations

The RL machinery behind these methods is heterogeneous. Some works use classical MDPs; others use diffusion-as-policy or direct reward feedback on generative dynamics.

The clearest diffusion-policy formulation appears in privacy-sensitive identity synthesis. The generator is a conditional diffusion model, the objective is the expected reward of its generated samples, and, following DPOK, the policy gradient is written as a reward-weighted sum of the log-likelihood gradients of the denoising steps. The state is the noisy latent plus the condition, the action is the denoising transition, and reward is terminal and sample-level (Jia et al., 9 Apr 2026).
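For orientation, the standard DPOK-style form of this objective and its gradient can be written as below. The symbols are assumptions chosen to match the notation used earlier in this article ($y$ for the identity condition, $x_t$ for the noisy latent at step $t$, $R$ for the terminal reward) and may not match the paper's own notation; the KL regularization that DPOK also uses is omitted.

```latex
J(\theta) = \mathbb{E}_{x_{0:T} \sim p_\theta(\cdot \mid y)} \big[ R(x_0, y) \big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{x_{0:T} \sim p_\theta(\cdot \mid y)}
\Big[ R(x_0, y) \sum_{t=1}^{T} \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, y) \Big].
```

Each denoising transition $p_\theta(x_{t-1} \mid x_t, y)$ plays the role of a policy action, and the single terminal reward is credited to every step of the trajectory.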

Online speaker diarization uses conventional MDP notation, with state as the current audio segment, action as the speaker label, and reward reflecting diarization quality. Its distinctive contribution is an extendable action space in which a special “new” action allows creation of previously unseen speaker identities, described as analogous to bandits with infinitely many arms (Lin et al., 2023).
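The extendable action space can be pictured with a small bookkeeping sketch. The class and method names are hypothetical, and the paper does not specify this data structure; the sketch only shows how a reserved "new" action lets the label set grow during a stream.

```python
class SpeakerLabeler:
    """Tracks known speakers; the last action index always means 'new'."""

    def __init__(self):
        self.num_speakers = 0

    def num_actions(self):
        # One action per known speaker, plus the special "new" action.
        return self.num_speakers + 1

    def apply(self, action):
        """Map an action index to a speaker label, growing the set if needed."""
        if not 0 <= action < self.num_actions():
            raise ValueError("action out of range")
        if action == self.num_speakers:   # the "new" action was chosen
            self.num_speakers += 1
        return action

labeler = SpeakerLabeler()
# A hypothetical sequence of actions chosen by a policy, one per segment.
labels = [labeler.apply(a) for a in [0, 0, 1, 0, 2]]
```

Reusing an existing index keeps a speaker's label stable across segments, while the action set itself grows by one each time the "new" action is taken.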

In hidden-role stochastic games, IDRL retains the Markov-game setting but decomposes the problem into identity inference and policy selection. The acting policy is explicitly conditioned on local state, relation-network output, and danger-network output. The method is modular rather than end-to-end in a single unified RL objective: policy sets are trained, the identification module is trained, and then all networks are updated further using an intrinsic objective coupled to control performance (Han et al., 2022).

Multi-agent individuality methods modify the reward structure or critic structure rather than redefining the environment. EOI adds the intrinsic reward $p_\phi(i\mid o_i)$ on top of MAAC and QMIX (Jiang et al., 2020). CIA keeps value decomposition intact and adds a contrastive regularizer to the training loss, so RL remains cooperative TD learning but with identity-sensitive credit shaping (Liu et al., 2022).
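A minimal sketch of classifier-based reward shaping in the spirit of EOI follows. The additive mixing with a weight `alpha`, the softmax classifier head, and all names are assumptions for illustration rather than details confirmed by the source.

```python
import math

def softmax(logits):
    """Convert classifier logits into a probability vector."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shaped_reward(env_reward, classifier_logits, agent_index, alpha=0.1):
    """Environment reward plus alpha times p(agent_index | observation).

    classifier_logits are the identity classifier's outputs for the
    observation that agent_index just visited (hypothetical interface).
    """
    probs = softmax(classifier_logits)
    return env_reward + alpha * probs[agent_index]

# An observation the classifier cannot attribute to either of two agents.
ambiguous = shaped_reward(1.0, [0.0, 0.0], agent_index=0)
# An observation the classifier strongly attributes to agent 0.
familiar = shaped_reward(1.0, [4.0, 0.0], agent_index=0)
```

The bonus is larger when the agent visits observations that the classifier recognizes as its own, which is the mechanism that encourages agents to occupy distinguishable regions of observation space.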

Generative video work uses group-relative policy optimization over denoising trajectories. Identity-GRPO models rectified-flow denoising as an MDP with terminal reward on the completed video, and a GRPO objective with clipped likelihood ratios and group-relative normalized advantages (Meng et al., 16 Oct 2025). DreamVideo-Omni, by contrast, is explicitly not classical RL in a full MDP sense. It uses latent reward feedback learning: a frozen latent identity reward model scores an intermediate denoising step, and the generator is optimized by backpropagating the resulting reward loss together with the stage-1 supervised diffusion loss (Wei et al., 12 Mar 2026). MultiCrafter uses online RL with a GSPO-style clipped sequence-level ratio over denoising windows rather than per-step ratios, motivated by instability from MoE routing fluctuations (Wu et al., 26 Sep 2025).
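The two GRPO ingredients named above, group-relative normalized advantages and a clipped likelihood ratio, can be sketched in a few lines. This is a generic illustration with hypothetical names and a conventional clip range of 0.2; it is not the Identity-GRPO training code and leaves out the denoising trajectory itself.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term: min(ratio * A, clip(ratio) * A)."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Identity rewards for four videos generated from the same references.
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)

# With a positive advantage the gain from a large ratio is capped.
capped = clipped_surrogate(ratio=1.5, advantage=1.0)
# With a negative advantage the unclipped, more pessimistic term is kept.
uncapped = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

Normalizing within the group removes the need for a learned value baseline, since each sample is judged only relative to its siblings.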

4. Reward design patterns

The subject is unified less by a common policy class than by a family of reward constructions that make identity preservation legible to optimization.

A representative summary is useful before the detailed discussion.

Setting	Identity signal	RL or reward mechanism
Identity synthesis	Prototype similarity, coverage, covariance expansion	DPOK policy gradient (Jia et al., 9 Apr 2026)
Multi-agent cooperation	Identity-predictive observations or temporal credits	Intrinsic reward or contrastive regularization (Jiang et al., 2020, Liu et al., 2022)
Multi-human generation	Preference-trained identity reward model	GRPO or latent reward feedback (Meng et al., 16 Oct 2025, Wei et al., 12 Mar 2026)

In privacy-sensitive identity synthesis, the reward is explicitly multi-objective. Reward components are standardized batchwise, then combined as a weighted sum with experimentally chosen coefficients. There is no explicit realism reward like FID and no classifier-confidence reward; realism is handled indirectly by a strong pretrained diffusion prior and cold-start adaptation (Jia et al., 9 Apr 2026).
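Batchwise standardization followed by a weighted sum can be sketched as below. The component names, the weights, and the toy values are hypothetical; the coefficients actually used in the paper are not reproduced here.

```python
import math

def standardize(values, eps=1e-8):
    """Z-score one reward component across the batch."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / (std + eps) for v in values]

def combine_rewards(components, weights):
    """Weighted sum of standardized components, one total per sample.

    components maps a component name to its per-sample rewards for a batch.
    """
    standardized = {name: standardize(vals) for name, vals in components.items()}
    batch_size = len(next(iter(components.values())))
    return [sum(weights[name] * standardized[name][i] for name in components)
            for i in range(batch_size)]

# Components on very different raw scales (illustrative values only).
batch = {
    "semantic": [0.9, 0.5, 0.1],
    "coverage": [0.3, 0.2, 0.1],
    "diversity": [10.0, 20.0, 30.0],
}
weights = {"semantic": 1.0, "coverage": 0.5, "diversity": 0.5}  # hypothetical
totals = combine_rewards(batch, weights)
```

Standardizing first means the weights express relative priority rather than compensating for the raw scale of each component.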

In EOI, the reward is minimalistic: the classifier’s confidence that an observation belongs to the true agent. Its force comes from co-evolution between policy and classifier, plus two regularizers—positive distance and entropy-sharpening mutual-information motivation—rather than from a sophisticated return decomposition (Jiang et al., 2020). In CIA, the identity term is auxiliary rather than terminal reward: the critic is penalized unless temporal credits are predictable from agent identity (Liu et al., 2022).

IDRL’s intrinsic objective mixes control value and identification quality, so identity inference is not optimized for classification accuracy alone, but for usefulness in downstream cooperation-competition decisions (Han et al., 2022).

In generative video, reward modeling becomes a separate subfield. DreamVideo-Omni trains a Latent Identity Reward Model with binary cross-entropy on a preference-like dataset of identity-aligned and identity-misaligned videos, then uses it as a frozen differentiable scorer during fine-tuning (Wei et al., 12 Mar 2026). Identity-GRPO instead uses a Qwen2.5-VL-3B reward model trained with Bradley-Terry-with-Ties on pairwise preferences. For two candidate videos $A$ and $B$, the reward model induces probabilities of $A$ preferred, $B$ preferred, or tie through a tie-aware Bradley-Terry construction with a tie parameter (Meng et al., 16 Oct 2025). MultiCrafter uses a composite reward in which one component is identity-specific and differs for humans and objects (Wu et al., 26 Sep 2025).
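One standard tie-aware extension of Bradley-Terry is the Rao-Kupper model, sketched below as an illustration of how three outcome probabilities can be derived from two scalar rewards. Whether Identity-GRPO uses exactly this parameterization is an assumption, and the function name and the default `theta` are hypothetical.

```python
import math

def bt_with_ties(reward_a, reward_b, theta=2.0):
    """Rao-Kupper probabilities (A preferred, B preferred, tie).

    theta >= 1 controls how much probability mass goes to ties;
    theta == 1 recovers plain Bradley-Terry with no ties.
    """
    if theta < 1.0:
        raise ValueError("theta must be at least 1")
    pa, pb = math.exp(reward_a), math.exp(reward_b)
    p_a = pa / (pa + theta * pb)
    p_b = pb / (pb + theta * pa)
    p_tie = ((theta ** 2 - 1.0) * pa * pb) / ((pa + theta * pb) * (pb + theta * pa))
    return p_a, p_b, p_tie

equal = bt_with_ties(0.0, 0.0)    # indistinguishable videos
skewed = bt_with_ties(2.0, 0.0)   # video A scored clearly higher
no_ties = bt_with_ties(1.0, 0.0, theta=1.0)
```

Training the reward model with a likelihood of this shape lets annotators' "no difference" judgments contribute signal instead of being discarded.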

A recurring controversy is proxy misspecification. The papers themselves note or imply that optimizing feature-space or reward-model proxies does not guarantee perfect human-perceived identity. The privacy-sensitive synthesis paper notes no explicit study of reward overfitting to the embedding model (Jia et al., 9 Apr 2026). DreamVideo-Omni shows classic reward-overoptimization failure at large reward weight, where identity and motion metrics collapse (Wei et al., 12 Mar 2026). Identity-GRPO addresses reward-quality risk by filtering synthetic preference pairs through a human-trained teacher before mixing them into reward-model training (Meng et al., 16 Oct 2025).

5. Application domains

Identity-coherence RL is notable for appearing in domains that are superficially dissimilar but structurally parallel.

In recognition and retrieval, the goal is to create or refine observations that are maximally useful for identity discrimination. The privacy-sensitive synthesis paper targets Market-1501, CUHK03-NP, and a small subset of CASIA-WebFace, with downstream evaluation on LFW, AgeDB, CFP-FP, CA-LFW, CP-LFW, and RFW (Jia et al., 9 Apr 2026). IDEAL, an earlier person re-identification work, uses DQN to crop away background clutter from auto-detected person boxes under pairwise identity constraints, demonstrating that RL can refine observations so the resulting representation is more coherent with identity across views (Lan et al., 2017).

In streaming identity assignment, online speaker diarization treats “who is speaking now?” as a continual control problem with unknown speaker cardinality and feedback-driven adaptation (Lin et al., 2023). Hidden-role games ask “who is on my team?” and incorporate identity inference into policy switching (Han et al., 2022). Mixed-motive cooperation with indirect reciprocity shows a related issue at the population level: actions and reputation updates can depend on in-group versus out-group identity, and this can steer systems toward fair or unfair cooperation (Smit et al., 2024).

In cooperative MARL, the identity-bearing entity is the agent itself. CIA seeks identity-distinguishable credit assignment under global rewards (Liu et al., 2022). EOI encourages agents to occupy distinguishable observation regions and thereby develop division of labor (Jiang et al., 2020). Hiding a leader’s identity in leader-follower navigation inverts the usual objective: here the privileged agent’s behavior should become coherent with the rest of the team so that external observers cannot reliably infer who the leader is (Deka et al., 2021). This suggests a duality within the topic: some methods maximize identity distinguishability for internal control, while others minimize external identity distinguishability for safety or privacy.

In multimodal generation, identity coherence concerns semantic preservation across representation changes. CIM uses retrieval consistency as a reward for image-to-text conversion, arguing that a caption should remain coherent enough with the source image that it retrieves a visually consistent and relevant neighborhood (Jia et al., 2 Mar 2026). MultiCrafter studies multi-subject image generation, where the failure mode is attention bleeding and attribute leakage between reference subjects (Wu et al., 26 Sep 2025). DreamVideo-Omni and Identity-GRPO generalize the problem to multi-subject or multi-human video, where identity must persist under motion, camera change, and interaction (Wei et al., 12 Mar 2026, Meng et al., 16 Oct 2025).

A plausible implication is that the topic is less a domain-specific method family than a structural design principle: RL is used when standard supervised or reconstruction objectives do not adequately encode the requirement that an entity remain itself while everything else changes.

6. Empirical evidence, limitations, and open questions

The empirical record is positive but fragmented. On the recognition side, the privacy-sensitive synthesis framework reports mAP on Market-1501 against a ResNet-50 baseline, mAP on CUHK03, and average verification accuracy on the CASIA subset, with ablations showing incremental gains from dynamic sample selection, semantic consistency, coverage reward, and expressive diversity (Jia et al., 9 Apr 2026). IDEAL brings Rank-1 accuracy on auto-detected CUHK03 boxes close to that of manually cropped boxes, and substantially outperforms prior methods on Market-1501 (Lan et al., 2017).

In MARL, CIA improves both QMIX and QPLEX on hard SMAC scenarios, with especially large gains where coordinated asymmetric roles matter, such as 3s5z_vs_3s6z and corridor (Liu et al., 2022). EOI shows that individuality can emerge before task reward and support division of labor, but also notes that if observation or trajectory cannot represent individuality, or if all agents receive the same full observation, the classifier-based mechanism is ineffective (Jiang et al., 2020). IDRL reports win rates against MFRL, CQL, and Douzero in Red-10, with ablations indicating that both the identification module and the danger network matter (Han et al., 2022).

In generative modeling, DreamVideo-Omni reports improvements over DreamVideo-2 in both identity and motion metrics on DreamOmni Bench, including Face-S, and reports user-study preference rates for subject fidelity and overall quality (Wei et al., 12 Mar 2026). Identity-GRPO improves ID-Consistency on both VACE-1.3B and Phantom-1.3B and reports human winning rates for each (Meng et al., 16 Oct 2025). MultiCrafter reports its Face-Sim score on multi-human generation and attributes much of its gain to structural attention disentanglement before RL fine-tuning (Wu et al., 26 Sep 2025). CIM reports an improvement in relation reasoning with Qwen2.5-VL-7B on COCO-LN500, measured on Relations QA, which the paper interprets as better preservation of relation-bearing image semantics during captioning (Jia et al., 2 Mar 2026).

The limitations are equally consistent. Several works warn that identity coherence is usually optimized through proxies. The privacy-sensitive synthesis framework does not guarantee privacy, does not analyze memorization or leakage, and depends on the representational quality of the pretrained backbone and the reward-feature extractor (Jia et al., 9 Apr 2026). Speaker diarization as RL remains underspecified regarding explicit memory, anti-switch penalties, and temporal consistency regularizers (Lin et al., 2023). EOI encourages separable observation distributions rather than guaranteeing semantically stable identities (Jiang et al., 2020). DreamVideo-Omni is clear that its “latent identity reinforcement learning” is reward feedback learning rather than classical RL, and shows reward hacking at excessive reward weight (Wei et al., 12 Mar 2026). Identity-GRPO does not provide a full trade-off analysis against motion realism or prompt fidelity beyond identity-centered evaluation and human preference (Meng et al., 16 Oct 2025). MultiCrafter and DreamVideo-Omni both rely on substantial annotation and auxiliary model infrastructure for reward or supervision (Wu et al., 26 Sep 2025, Wei et al., 12 Mar 2026).

Several open questions recur across the literature. One is whether identity should be modeled as an explicit latent state with temporal consistency guarantees, rather than as a reward proxy in embedding space. Another is how to prevent reward misspecification when the generator can optimize the scorer rather than the intended notion of identity. A third is scaling: many current demonstrations are strongest for two-subject or small-group settings, while larger populations or more subjects may create combinatorial assignment problems. Finally, the literature suggests a deeper conceptual split between tasks that require maximizing identity distinguishability and those that require minimizing it. This suggests that “identity coherence” is not identical to “identity salience”; rather, it concerns stable control over when and how identity should remain invariant, expressed, inferred, or concealed.