OmniPersona: Unified Personalization in LMMs
- OmniPersona is a unified personalization framework that decouples understanding, generation, and editing tasks in large multimodal models.
- It employs structured concept tokens and an explicit knowledge replay mechanism to preserve user-specific attributes and ensure cross-task coherence.
- Empirical evaluations demonstrate significant improvements in recognition, generation, and editing metrics over previous personalization approaches.
OmniPersona encompasses a class of methodologies and frameworks for unified, end-to-end personalization in large multimodal models (LMMs). The hallmark of OmniPersona approaches is the structured, consistent, and efficient encoding, retrieval, and operational deployment of user- or entity-specific concepts—whether in dialogue agents, image/video generators, or unified LMMs—while maintaining identity, attribute grounding, and cross-task coherence. With rigorous architectural decoupling of task-specific components and explicit replay mechanisms for knowledge propagation, OmniPersona frameworks set a new baseline for consistent, controllable, and interpretable personalization across understanding, generation, and editing modalities (Zhong et al., 11 Jan 2026).
1. Motivation and Scope
Conventional LMMs operate under “one-size-fits-all” paradigms, demonstrating suboptimal performance when tasked with integrating novel user-specific or personalized entities, especially with limited supervision. Approaches such as retrieval-augmented generation (RAG) or monolithic learnable prompts often result in cross-task interference, inefficient inference, or poor controllability. OmniPersona addresses these challenges by introducing explicit decoupling between understanding, generation, and editing branches, and enforcing a knowledge replay mechanism that externalizes and propagates attribute representations across tasks (Zhong et al., 11 Jan 2026). The result is a system capable of:
- Unified Understanding: Recognizing and reasoning over user-specific concepts in multimodal input.
- Controlled Generation: Synthesizing novel samples (images, text) that explicitly reflect grounded personalized attributes.
- Targeted Editing: Enabling identity-preserving edits that consider both the concept and contextual user intent.
The applicability of OmniPersona is broad, spanning AI assistants, pedagogical agents, identity-controlled generative models, and large-scale dataset augmentation for tasks such as person re-identification (ReID) (Ma et al., 2 Dec 2025, Wang et al., 17 Nov 2025).
2. Core Architectural Principles
OmniPersona implementations share several foundational architectural elements designed to maximize cross-task personalization with minimal interference:
Decoupled Concept Tokens
Given a personalized concept identifier (e.g., sks for a user or object) and a corresponding small reference set, OmniPersona creates a bank of learnable tokens routed to three structurally distinct subspaces:
- Understanding expert
- Generation expert
- Editing expert
Each subspace is defined via a frozen task-adapter $A_t$, and the shared tokens $E$ are projected to task-specific representations $E_t = A_t(E)$. These representations, stored in matrices $E_u$, $E_g$, and $E_e$, are processed only in their respective expert branches (Zhong et al., 11 Jan 2026). This design eliminates gradient conflict and enables the development of distinct, specialized concept representations.
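The routing scheme can be illustrated with a minimal numpy sketch; the dimensions, the adapter implementation (a plain linear projection), and all names here are hypothetical stand-ins for the actual LMM components:

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, D_TASK, N_TOKENS = 64, 32, 8  # hypothetical sizes

# Shared learnable concept tokens for an identifier such as "sks".
concept_tokens = rng.standard_normal((N_TOKENS, D_MODEL))

# One frozen adapter per expert branch; during training only the
# concept tokens (not these adapters) would receive gradients.
adapters = {
    task: rng.standard_normal((D_MODEL, D_TASK))
    for task in ("understanding", "generation", "editing")
}

def route(tokens: np.ndarray, task: str) -> np.ndarray:
    """Project the shared tokens into one task-specific subspace."""
    return tokens @ adapters[task]

# Each expert branch sees only its own projection, so updates driven
# by one task cannot overwrite another task's representation.
E_u = route(concept_tokens, "understanding")
E_g = route(concept_tokens, "generation")
E_e = route(concept_tokens, "editing")
print(E_u.shape, E_g.shape, E_e.shape)
```

Because the three projections live in separate subspaces, a gradient step taken through one branch leaves the other two representations untouched.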
Explicit Knowledge Replay
To mitigate pattern memorization and ensure true attribute grounding, an inference-time replay mechanism maintains a per-concept external “concept memory” $M_c$. With each reasoning pass, $M_c$ is updated via an exponential moving average:

$$M_c \leftarrow \gamma\, M_c + (1 - \gamma)\, \phi_c(h),$$

where $\phi_c(h)$ extracts the semantic summary for concept $c$ from the expert's hidden states $h$ and $\gamma$ is a decay parameter. During generation and editing, this externalized attribute knowledge is appended to user queries and prompts, enforcing attribute fidelity and interpretability.
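A minimal sketch of the decayed update, with a constant vector standing in for the expert's semantic summary (the function name and dimensions are illustrative, not from the paper):

```python
import numpy as np

def replay_update(memory: np.ndarray, summary: np.ndarray,
                  gamma: float = 0.9) -> np.ndarray:
    """Exponential-moving-average memory update:
    M_c <- gamma * M_c + (1 - gamma) * summary."""
    return gamma * memory + (1.0 - gamma) * summary

mem = np.zeros(4)  # empty concept memory
for _ in range(50):
    summary = np.ones(4)  # stand-in for the expert's semantic summary
    mem = replay_update(mem, summary)
print(mem)  # converges toward the summary vector
```

Repeated passes pull the memory toward the current summary while the decay term $\gamma$ damps any single noisy extraction.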
Unified Inference Pipeline
All three tasks (understanding, generation, and editing) are served by a single pipeline of logically sequential but computationally unified stages, comprising intent parsing, memory retrieval, prompt composition, and final model application:
- Intent Parser: Converts user requests into explicit queries.
- Memory Retriever: Retrieves grounded concept attributes.
- Prompt Composer: Integrates retrieved attributes.
- Final Generation/Editing: Produces output faithful to concept and intent.
All modules share the same backbone LMM, with only concept token embeddings updated during training.
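The staged pipeline can be sketched as a chain of small functions; the concept memory contents, keyword-based intent parsing, and prompt format below are all hypothetical simplifications of the actual modules:

```python
# Hypothetical concept memory, as populated by the replay mechanism.
CONCEPT_MEMORY = {"sks": ["a golden retriever", "wears a red collar"]}

def parse_intent(request: str) -> dict:
    """Intent Parser: turn a free-form request into an explicit query."""
    task = "generation" if "draw" in request.lower() else "understanding"
    concept = next((c for c in CONCEPT_MEMORY if c in request), None)
    return {"task": task, "concept": concept, "request": request}

def retrieve_memory(concept: str) -> list[str]:
    """Memory Retriever: fetch grounded attributes for the concept."""
    return CONCEPT_MEMORY.get(concept, [])

def compose_prompt(intent: dict, attributes: list[str]) -> str:
    """Prompt Composer: append retrieved attributes to the request."""
    return intent["request"] + " [attributes: " + "; ".join(attributes) + "]"

def personalize(request: str) -> str:
    intent = parse_intent(request)
    attributes = retrieve_memory(intent["concept"])
    # The composed prompt would be passed to the shared backbone LMM
    # for final generation or editing.
    return compose_prompt(intent, attributes)

print(personalize("draw sks at the beach"))
```

The key property is that every stage reads from the same externalized memory, so attribute grounding is applied uniformly regardless of which expert branch ultimately serves the request.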
3. Training, Losses, and Optimization
OmniPersona is optimized solely by updating the concept token matrices, with all transformer/LMM weights frozen. Training jointly addresses:
- Textual Understanding: $\mathcal{L}_{\text{und}}$, autoregressive cross-entropy on text.
- Image Generation/Editing: standard diffusion-based mean-squared error,

$$\mathcal{L}_{\text{gen}} = \mathbb{E}_{x,\, c,\, \epsilon,\, t}\!\left[\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2\right],$$

with an analogous term $\mathcal{L}_{\text{edit}}$ for editing tasks.
- Multi-task Loss: the aggregate loss combines the above, weighted by $\lambda_g$ and $\lambda_e$ (both set to 400):

$$\mathcal{L} = \mathcal{L}_{\text{und}} + \lambda_g\, \mathcal{L}_{\text{gen}} + \lambda_e\, \mathcal{L}_{\text{edit}}.$$
Explicit supervision for editing regularizes representations and improves attribute fidelity across tasks (Zhong et al., 11 Jan 2026).
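A minimal sketch of how the weighted multi-task objective combines, with toy tensors in place of real noise predictions (the helper names are illustrative; only the $\lambda_g = \lambda_e = 400$ weighting is taken from the paper):

```python
import numpy as np

LAMBDA_G = LAMBDA_E = 400.0  # weights reported for the aggregate loss

def diffusion_mse(noise_pred: np.ndarray, noise_true: np.ndarray) -> float:
    """Diffusion objective: MSE between predicted and true noise."""
    return float(np.mean((noise_pred - noise_true) ** 2))

def total_loss(l_und: float, l_gen: float, l_edit: float) -> float:
    """Aggregate multi-task loss; in OmniPersona its gradient updates
    only the concept token matrices, never the frozen backbone."""
    return l_und + LAMBDA_G * l_gen + LAMBDA_E * l_edit

rng = np.random.default_rng(0)
eps, eps_hat = rng.standard_normal(16), rng.standard_normal(16)
l_gen = diffusion_mse(eps_hat, eps)
print(total_loss(2.5, l_gen, l_gen))
```

The large $\lambda$ weights compensate for the per-pixel MSE terms being numerically much smaller than the token-level cross-entropy.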
4. Empirical Evaluation and Benchmarks
Performance of OmniPersona models is systematically evaluated using OmniPBench, a benchmark spanning:
- 20 user concepts (10 people, 5 pets, 5 objects)
- Task protocols: understanding (recognition, VQA), generation (concept and attribute-conditioned), and editing (removal, attribute change, spatial/style transformations)
- Metrics:
- Understanding: balanced recall, VQA-BLEU, VQA-GPT, QA-BLEU, QA-GPT
- Generation: CLIP-I (identity), CLIP-T (text alignment), DINO perceptual similarity, Face-Simi (ArcFace)
- Personalized Attribute Reasoning Generation (PARG): GPT-4o holistic, CLIP-I
- Editing: SEMA-C (semantic consistency), QUAL-I (image quality), averaged for global editing score
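The identity-oriented metrics above (CLIP-I, DINO similarity, Face-Simi) all reduce to cosine similarity between embeddings of generated and reference images, differing only in the encoder used. A minimal sketch with placeholder vectors standing in for real encoder outputs:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice, `ref` and `gen` would be CLIP / DINO / ArcFace
# embeddings of the reference and generated images respectively.
ref = np.array([1.0, 0.0, 1.0])
gen = np.array([1.0, 0.1, 0.9])
print(cosine_sim(ref, gen))
```

Scores near 1.0 indicate that the generated image preserves the reference identity under the chosen encoder.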
Compared to prior unified personalization models (UniCTokens, Yo’Chameleon), OmniPersona demonstrates:
- +7.8 percentage points in recognition, +13.1 pp avg QA score
- Highest CLIP-I (0.791 vs. 0.750 for UniCTokens); Face-Simi 0.413 vs. 0.334
- PARG holistic score 0.613 vs. 0.359 (70.8% relative improvement)
- SOTA editing: Avg-edit 0.658 (exceeds GPT-4o+IP and Bagel+TP)
Ablations highlight that removing decoupling or knowledge replay substantially degrades performance and induces feature collapse (Zhong et al., 11 Jan 2026).
5. Comparison with Pedestrian Generation and Memory-based Personalization
OmniPersona models for multimodal assistant tasks are distinct in their explicit decomposition and unified end-to-end design. In parallel, the pedestrian generation “OmniPerson” framework (Ma et al., 2 Dec 2025) exemplifies unified personalization in data augmentation for person re-ID:
- Employs a latent diffusion model backbone with multi-source conditions, including pose, background, text, and modality encoders.
- “Multi-Refer Fuser” module distills identity from arbitrary reference images using channel and self-attention.
- Achieves best-in-class identity consistency: e.g. Market-1501 ReID-Sim of 0.9577, outperforming AnimateAnyone (0.9131) and Pose2Id (0.9161).
- Synthetic data augmentation yields significant mAP/rank-1 improvements in downstream ReID tasks.
In contrast, memory-based agents inspired by O-Mem (Wang et al., 17 Nov 2025) apply dynamic user profiling, hierarchical memory with parallel retrieval, and active attribute/event extraction to achieve long-horizon, contextually consistent personalization in language agents, delivering F1/accuracy improvements over prior frameworks and substantial efficiency gains.
6. Strengths, Limitations, and Future Directions
Advantages:
- Decoupling structurally ensures task-specific specialization, preventing cross-task interference.
- Explicit replay renders attribute knowledge interpretable and enforces semantic consistency in generation and editing.
- Unified pipelines facilitate cross-task reasoning, enabling seamless multimodal personalization.
Limitations:
- Editing diversity is currently limited—most training focuses on attribute removal, with limited generalization to style or complex compositional edits.
- In cluttered scenes, concept localization may degrade.
- There exists a trade-off between perfect identity preservation and instruction-following fidelity, particularly in editing.
Suggested directions:
- Auto-expanding dynamic vocabulary for concept representation.
- Finer-grained, per-layer routing of concept tokens.
- Enrichment of editing datasets to support broader manipulation types, including style transfer and compositional re-editing.
- Real-user studies to evaluate in-the-wild personalization effectiveness (Zhong et al., 11 Jan 2026).
7. Synthesis and Impact
OmniPersona frameworks advance the state of the art in unified personalization for LMMs by integrating structurally decoupled, interpretable concept representations, explicit attribute replay, and multi-task compatibility in a single, frozen-backbone architecture. The result is both improved empirical performance and a blueprint for future research in scalable, controllable, and robust multimodal personalization, with implications for AI assistants, image/video personalization, and adaptive user modeling across diverse AI systems (Zhong et al., 11 Jan 2026, Ma et al., 2 Dec 2025, Wang et al., 17 Nov 2025).