OmniPersona: Unified Personalization in LMMs
- OmniPersona is a unified personalization framework that decouples understanding, generation, and editing tasks in large multimodal models.
- It employs structured concept tokens and an explicit knowledge replay mechanism to preserve user-specific attributes and ensure cross-task coherence.
- Empirical evaluations demonstrate significant improvements in recognition, generation, and editing metrics over previous personalization approaches.
OmniPersona encompasses a class of methodologies and frameworks for unified, end-to-end personalization in large multimodal models (LMMs). The hallmark of OmniPersona approaches is the structured, consistent, and efficient encoding, retrieval, and operational deployment of user- or entity-specific concepts—whether in dialogue agents, image/video generators, or unified LMMs—while maintaining identity, attribute grounding, and cross-task coherence. With rigorous architectural decoupling of task-specific components and explicit replay mechanisms for knowledge propagation, OmniPersona frameworks set a new baseline for consistent, controllable, and interpretable personalization across understanding, generation, and editing modalities (Zhong et al., 11 Jan 2026).
1. Motivation and Scope
Conventional LMMs operate under “one-size-fits-all” paradigms, demonstrating suboptimal performance when tasked with integrating novel user-specific or personalized entities, especially with limited supervision. Approaches such as retrieval-augmented generation (RAG) or monolithic learnable prompts often result in cross-task interference, inefficient inference, or poor controllability. OmniPersona addresses these challenges by introducing explicit decoupling between understanding, generation, and editing branches, and enforcing a knowledge replay mechanism that externalizes and propagates attribute representations across tasks (Zhong et al., 11 Jan 2026). The result is a system capable of:
- Unified Understanding: Recognizing and reasoning over user-specific concepts in multimodal input.
- Controlled Generation: Synthesizing novel samples (images, text) that explicitly reflect grounded personalized attributes.
- Targeted Editing: Enabling identity-preserving edits that consider both the concept and contextual user intent.
The applicability of OmniPersona is broad, spanning AI assistants, pedagogical agents, identity-controlled generative models, and large-scale dataset augmentation for tasks such as person re-identification (ReID) (Ma et al., 2 Dec 2025, Wang et al., 17 Nov 2025).
2. Core Architectural Principles
OmniPersona implementations share several foundational architectural elements designed to maximize cross-task personalization with minimal interference:
Decoupled Concept Tokens
Given a personalized concept identifier (e.g., sks for a user or object) and a corresponding small reference set, OmniPersona creates a bank of learnable tokens routed to three structurally distinct subspaces:
- Understanding expert
- Generation expert
- Editing expert
Each subspace is defined via a frozen task-adapter $A_t$, and the shared tokens $E$ are projected to task-specific representations $E_t = A_t(E)$. These representations, stored in matrices $E_u$, $E_g$, and $E_e$, are processed only in their respective expert branches (Zhong et al., 11 Jan 2026). This design eliminates gradient conflict and enables the development of distinct, specialized concept representations.
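The routing scheme can be illustrated with a minimal numpy sketch; the dimensions, the adapter implementation (a plain linear projection), and all names here are hypothetical stand-ins for the actual LMM components:

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, D_TASK, N_TOKENS = 64, 32, 8  # hypothetical sizes

# Shared learnable concept tokens for an identifier such as "sks".
concept_tokens = rng.standard_normal((N_TOKENS, D_MODEL))

# One frozen adapter per expert branch; during training only the
# concept tokens (not these adapters) would receive gradients.
adapters = {
    task: rng.standard_normal((D_MODEL, D_TASK))
    for task in ("understanding", "generation", "editing")
}

def route(tokens: np.ndarray, task: str) -> np.ndarray:
    """Project the shared tokens into one task-specific subspace."""
    return tokens @ adapters[task]

# Each expert branch sees only its own projection, so updates driven
# by one task cannot overwrite another task's representation.
E_u = route(concept_tokens, "understanding")
E_g = route(concept_tokens, "generation")
E_e = route(concept_tokens, "editing")
print(E_u.shape, E_g.shape, E_e.shape)
```

Because the three projections live in separate subspaces, a gradient step taken through one branch leaves the other two representations untouched.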
Explicit Knowledge Replay
To mitigate pattern memorization and ensure true attribute grounding, an inference-time replay mechanism maintains a per-concept external “concept memory” $M_c$. With each reasoning pass, $M_c$ is updated via an exponential moving average:

$$M_c \leftarrow \gamma\, M_c + (1 - \gamma)\, \phi_c(h),$$

where $\phi_c(h)$ extracts the semantic summary for concept $c$ from the expert's hidden states $h$ and $\gamma$ is a decay parameter. During generation and editing, this externalized attribute knowledge is appended to user queries and prompts, enforcing attribute fidelity and interpretability.
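A minimal sketch of the decayed update, with a constant vector standing in for the expert's semantic summary (the function name and dimensions are illustrative, not from the paper):

```python
import numpy as np

def replay_update(memory: np.ndarray, summary: np.ndarray,
                  gamma: float = 0.9) -> np.ndarray:
    """Exponential-moving-average memory update:
    M_c <- gamma * M_c + (1 - gamma) * summary."""
    return gamma * memory + (1.0 - gamma) * summary

mem = np.zeros(4)  # empty concept memory
for _ in range(50):
    summary = np.ones(4)  # stand-in for the expert's semantic summary
    mem = replay_update(mem, summary)
print(mem)  # converges toward the summary vector
```

Repeated passes pull the memory toward the current summary while the decay term $\gamma$ damps any single noisy extraction.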
Unified Inference Pipeline
All three tasks (understanding, generation, and editing) are served by a single pipeline of logically sequential but computationally unified stages, comprising intent parsing, memory retrieval, prompt composition, and final model application:
- Intent Parser: Converts user requests into explicit queries.
- Memory Retriever: Retrieves grounded concept attributes.
- Prompt Composer: Integrates retrieved attributes.
- Final Generation/Editing: Produces output faithful to concept and intent.
All modules share the same backbone LMM, with only concept token embeddings updated during training.
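The staged pipeline can be sketched as a chain of small functions; the concept memory contents, keyword-based intent parsing, and prompt format below are all hypothetical simplifications of the actual modules:

```python
# Hypothetical concept memory, as populated by the replay mechanism.
CONCEPT_MEMORY = {"sks": ["a golden retriever", "wears a red collar"]}

def parse_intent(request: str) -> dict:
    """Intent Parser: turn a free-form request into an explicit query."""
    task = "generation" if "draw" in request.lower() else "understanding"
    concept = next((c for c in CONCEPT_MEMORY if c in request), None)
    return {"task": task, "concept": concept, "request": request}

def retrieve_memory(concept: str) -> list[str]:
    """Memory Retriever: fetch grounded attributes for the concept."""
    return CONCEPT_MEMORY.get(concept, [])

def compose_prompt(intent: dict, attributes: list[str]) -> str:
    """Prompt Composer: append retrieved attributes to the request."""
    return intent["request"] + " [attributes: " + "; ".join(attributes) + "]"

def personalize(request: str) -> str:
    intent = parse_intent(request)
    attributes = retrieve_memory(intent["concept"])
    # The composed prompt would be passed to the shared backbone LMM
    # for final generation or editing.
    return compose_prompt(intent, attributes)

print(personalize("draw sks at the beach"))
```

The key property is that every stage reads from the same externalized memory, so attribute grounding is applied uniformly regardless of which expert branch ultimately serves the request.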
3. Training, Losses, and Optimization
OmniPersona is optimized solely by updating the concept token matrices, with all transformer/LMM weights frozen. Training jointly addresses:
- Textual Understanding: $\mathcal{L}_{\text{und}}$, autoregressive cross-entropy on text.
- Image Generation/Editing: standard diffusion-based mean-squared error,

$$\mathcal{L}_{\text{gen}} = \mathbb{E}_{x,\, c,\, \epsilon,\, t}\!\left[\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2\right],$$

with an analogous term $\mathcal{L}_{\text{edit}}$ for editing tasks.
- Multi-task Loss: the aggregate loss combines the above, weighted by $\lambda_g$ and $\lambda_e$ (both set to 400):

$$\mathcal{L} = \mathcal{L}_{\text{und}} + \lambda_g\, \mathcal{L}_{\text{gen}} + \lambda_e\, \mathcal{L}_{\text{edit}}.$$
Explicit supervision for editing regularizes representations and improves attribute fidelity across tasks (Zhong et al., 11 Jan 2026).
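A minimal sketch of how the weighted multi-task objective combines, with toy tensors in place of real noise predictions (the helper names are illustrative; only the $\lambda_g = \lambda_e = 400$ weighting is taken from the paper):

```python
import numpy as np

LAMBDA_G = LAMBDA_E = 400.0  # weights reported for the aggregate loss

def diffusion_mse(noise_pred: np.ndarray, noise_true: np.ndarray) -> float:
    """Diffusion objective: MSE between predicted and true noise."""
    return float(np.mean((noise_pred - noise_true) ** 2))

def total_loss(l_und: float, l_gen: float, l_edit: float) -> float:
    """Aggregate multi-task loss; in OmniPersona its gradient updates
    only the concept token matrices, never the frozen backbone."""
    return l_und + LAMBDA_G * l_gen + LAMBDA_E * l_edit

rng = np.random.default_rng(0)
eps, eps_hat = rng.standard_normal(16), rng.standard_normal(16)
l_gen = diffusion_mse(eps_hat, eps)
print(total_loss(2.5, l_gen, l_gen))
```

The large $\lambda$ weights compensate for the per-pixel MSE terms being numerically much smaller than the token-level cross-entropy.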
4. Empirical Evaluation and Benchmarks
Performance of OmniPersona models is systematically evaluated using OmniPBench, a benchmark spanning:
- 20 user concepts (10 people, 5 pets, 5 objects)
- Task protocols: understanding (recognition, VQA), generation (concept and attribute-conditioned), and editing (removal, attribute change, spatial/style transformations)
- Metrics:
- Understanding: balanced recall, VQA-BLEU, VQA-GPT, QA-BLEU, QA-GPT
- Generation: CLIP-I (identity), CLIP-T (text alignment), DINO perceptual similarity, Face-Simi (ArcFace)
- Personalized Attribute Reasoning Generation (PARG): GPT-4o holistic, CLIP-I
- Editing: SEMA-C (semantic consistency), QUAL-I (image quality), averaged for global editing score
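The identity-oriented metrics above (CLIP-I, DINO similarity, Face-Simi) all reduce to cosine similarity between embeddings of generated and reference images, differing only in the encoder used. A minimal sketch with placeholder vectors standing in for real encoder outputs:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice, `ref` and `gen` would be CLIP / DINO / ArcFace
# embeddings of the reference and generated images respectively.
ref = np.array([1.0, 0.0, 1.0])
gen = np.array([1.0, 0.1, 0.9])
print(cosine_sim(ref, gen))
```

Scores near 1.0 indicate that the generated image preserves the reference identity under the chosen encoder.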
Compared to prior unified personalization models (UniCTokens, Yo’Chameleon), OmniPersona demonstrates:
- +7.8 percentage points in recognition, +13.1 pp avg QA score
- Highest CLIP-I (0.791 vs. 0.750 for UniCTokens); Face-Simi 0.413 vs. 0.334
- PARG holistic score 0.613 vs. 0.359 (70.8% relative improvement)
- SOTA editing: Avg-edit 0.658 (exceeds GPT-4o+IP and Bagel+TP)
Ablations highlight that removing decoupling or knowledge replay substantially degrades performance and induces feature collapse (Zhong et al., 11 Jan 2026).
5. Comparison with Pedestrian Generation and Memory-based Personalization
OmniPersona models for multimodal assistant tasks are distinct in their explicit decomposition and unified end-to-end design. In parallel, the pedestrian generation “OmniPerson” framework (Ma et al., 2 Dec 2025) exemplifies unified personalization in data augmentation for person re-ID:
- Employs a latent diffusion model backbone with multi-source conditions, including pose, background, text, and modality encoders.
- “Multi-Refer Fuser” module distills identity from arbitrary reference images using channel and self-attention.
- Achieves best-in-class identity consistency: e.g. Market-1501 ReID-Sim of 0.9577, outperforming AnimateAnyone (0.9131) and Pose2Id (0.9161).
- Synthetic data augmentation yields significant mAP/rank-1 improvements in downstream ReID tasks.
In contrast, memory-based agents inspired by O-Mem (Wang et al., 17 Nov 2025) apply dynamic user profiling, hierarchical memory with parallel retrieval, and active attribute/event extraction to achieve long-horizon, contextually consistent personalization in language agents, delivering F1/accuracy improvements over prior frameworks and substantial efficiency gains.
6. Strengths, Limitations, and Future Directions
Advantages:
- Decoupling structurally ensures task-specific specialization, preventing cross-task interference.
- Explicit replay renders attribute knowledge interpretable and enforces semantic consistency in generation and editing.
- Unified pipelines facilitate cross-task reasoning, enabling seamless multimodal personalization.
Limitations:
- Editing diversity is currently limited—most training focuses on attribute removal, with limited generalization to style or complex compositional edits.
- In cluttered scenes, concept localization may degrade.
- There exists a trade-off between perfect identity preservation and instruction-following fidelity, particularly in editing.
Suggested directions:
- Auto-expanding dynamic vocabulary for concept representation.
- Finer-grained, per-layer routing of concept tokens.
- Enrichment of editing datasets to support broader manipulation types, including style transfer and compositional re-editing.
- Real-user studies to evaluate in-the-wild personalization effectiveness (Zhong et al., 11 Jan 2026).
7. Synthesis and Impact
OmniPersona frameworks advance the state of the art in unified personalization for LMMs by integrating structurally decoupled, interpretable concept representations, explicit attribute replay, and multi-task compatibility in a single, frozen-backbone architecture. The result is both improved empirical performance and a blueprint for future research in scalable, controllable, and robust multimodal personalization, with implications for AI assistants, image/video personalization, and adaptive user modeling across diverse AI systems (Zhong et al., 11 Jan 2026, Ma et al., 2 Dec 2025, Wang et al., 17 Nov 2025).