Persona-MME: Multimodal Personalization Benchmark

Updated 4 July 2026

The paper introduces Persona-MME as a benchmark that jointly evaluates long-term personalization across memory, intent, preference, behavior, relationship, growth, and alignment.
It employs rigorously annotated in-situ cases with diverse personas and dual context settings (32k and 128k) to simulate realistic multimodal interactions.
Empirical results demonstrate that PersonaVLM, using structured episodic memory and multi-stage reasoning, significantly outperforms baseline models.

Persona-MME is a benchmark for evaluating long-term personalized multimodal LLMs. It was introduced together with PersonaVLM to address a gap in existing personalization evaluation: prior benchmarks were described as too narrow because they were typically static, single-turn, or text-only, and they usually measured only one slice of personalization such as memory recall, preference following, or alignment. Persona-MME is designed as a comprehensive, in-situ benchmark for whether a model can remember user-specific multimodal information, reason over evolving histories and contextual cues, and align responses with a user’s evolving personality over long interaction horizons (Nie et al., 20 Mar 2026).

1. Origins and benchmark rationale

The benchmark is motivated by the claim that existing personalization evaluations do not reflect real long-term assistant use. In the motivating account, users’ preferences and personalities change over time, and multimodal interactions include images, cross-turn references, and temporal dependencies. Persona-MME therefore targets a joint evaluation space defined by long-term interaction realism, multimodal grounding, holistic personalization, dynamic personality and preference evolution, in-situ evaluation, broad topic diversity, and scalable evaluation under both shorter and very long context settings (Nie et al., 20 Mar 2026).

The paper situates Persona-MME against several earlier benchmarks. PERSONAMEM is characterized as covering long-term memory and understanding, but as text-only and not holistic; P-SOUPS and ALIGNX-test are described as alignment benchmarks with static preference or personality settings; Yo’LLaVA and RAP are presented as multimodal user-specific concept-understanding benchmarks with narrower memory or understanding scope. The paper’s explicit claim is that none of these jointly evaluates long-term multimodal interaction across memory, intent, preference, behavior, relationship, growth, and alignment in a single in-situ framework (Nie et al., 20 Mar 2026).

This positioning places Persona-MME within a broader research trajectory on personalization evaluation. PERSONAMEM focuses on dynamic profile tracking in long conversational histories and reports that frontier models achieve only around 50% overall accuracy on its response-selection task, suggesting that long context alone does not solve personalization under profile drift (Jiang et al., 19 Apr 2025). MPCHAT, by contrast, established a multimodal persona-grounded dialogue setting based on image-sentence persona pairs and showed that multimodal persona improves response prediction, grounding persona prediction, and speaker identification (Ahn et al., 2023). Persona-MME can therefore be understood as a later synthesis that combines long-horizon personalization with multimodal grounding and response alignment.

2. Dataset construction and corpus composition

The main paper describes Persona-MME as containing over 2,000 in-situ cases; the appendix gives the exact count as 2,034 test cases derived from 200 diverse personas. The benchmark has two evaluation configurations: a 32k context setting for dialogues under 100 turns and a 128k context setting for dialogues of 100–500 turns. Each configuration contains cases from 100 distinct personas (Nie et al., 20 Mar 2026).
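The turn-count rule that separates the two configurations can be written as a small helper. This is only an illustrative sketch: the function name `context_setting` is hypothetical, and rejecting dialogues above 500 turns is an assumption, since the text does not say how such dialogues would be handled.

```python
def context_setting(num_turns: int) -> str:
    """Map a dialogue length in turns to the evaluation configuration."""
    if num_turns < 1:
        raise ValueError("a dialogue needs at least one turn")
    if num_turns < 100:
        # Dialogues under 100 turns use the shorter context setting.
        return "32k"
    if num_turns <= 500:
        # Dialogues of 100-500 turns use the long context setting.
        return "128k"
    # Assumption: lengths above 500 turns are outside both settings.
    raise ValueError("dialogues above 500 turns fall outside both settings")
```

For example, `context_setting(99)` returns `"32k"` and `context_setting(100)` returns `"128k"`.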

The appendix reports several corpus statistics. The average number of turns per dialogue is 142.9. The multimodal turn ratio is 15.87%. The average question length is 22.7 words, the average answer length is 3.05 words, and the image-related question ratio is 34.02% (Nie et al., 20 Mar 2026).

Cases are constructed from the user’s first-person perspective at a specific point in the conversation. Initial questions are generated using the Gemini-2.5-Pro API and then manually reviewed. Every test case is checked by four annotators for consistency with the assigned task, accuracy of the ground-truth answer, and alignment validity for personality-related items; ambiguous or conflicting examples are discarded. The paper reports that this review required about 40 person-hours (Nie et al., 20 Mar 2026).

The same paper also describes the broader generation pipeline around the benchmark. Persona generation begins from 700 personas sampled from PersonaHub, split into 500 train personas and 200 test personas, and these personas are enriched with random personality traits. Dialogue synthesis uses Seed1.6-thinking to simulate long-term conversations spanning up to one month in training and up to three months in testing, with more than 15% of dialogues containing multimodal elements. The training ecosystem associated with the benchmark includes 78k SFT samples, 5.6k RL samples, and 6k user-related concept samples, although that training set is separate from Persona-MME itself (Nie et al., 20 Mar 2026).

3. Evaluation ontology: seven aspects and fourteen tasks

Persona-MME is organized around seven core personalization aspects and fourteen fine-grained tasks. The seven aspects are memory, intent, preference, behavior, relationship, growth, and alignment. The fourteen tasks are Visual Detail Recall, Semantic Information Recall, Explicit Intent Inference, Implicit Intent Recognition, Latest Preference Recognition, Interest Evolution Analysis, Implicit Preference Recommendation, Behavioral Pattern Recognition, Long-term Goal Tracking, Relationship Recognition, Relationship Dynamics Comprehension, Tiered Explanation Delivery, Generalizing to New Scenarios, and Personality Alignment (Nie et al., 20 Mar 2026).

The benchmark’s aspect structure is intended to separate distinct personalization capabilities. Memory covers exact recall of visual and semantic user history. Intent evaluates understanding of what the user means, including unstated goals. Preference measures recency tracking, evolution, and implicit values. Behavior targets habits, routines, and goals. Relationship concerns people and social context. Growth evaluates whether explanation depth adapts to the user’s evolving skill level. Alignment measures whether tone, style, and behavioral stance match inferred personality (Nie et al., 20 Mar 2026).

A compact view of the ontology is useful because the benchmark’s breadth is central to its identity.

Aspect	Fine-grained tasks
Memory	Visual Detail Recall; Semantic Information Recall
Intent	Explicit Intent Inference; Implicit Intent Recognition
Preference	Latest Preference Recognition; Interest Evolution Analysis; Implicit Preference Recommendation
Behavior	Behavioral Pattern Recognition; Long-term Goal Tracking
Relationship	Relationship Recognition; Relationship Dynamics Comprehension
Growth	Tiered Explanation Delivery; Generalizing to New Scenarios
Alignment	Personality Alignment

The appendix adds an important qualification: the benchmark has 13 primary tasks, and personality alignment is evaluated concurrently in 406 of the primary-task cases rather than as a standalone category in exactly the same sense as the others (Nie et al., 20 Mar 2026). This suggests that Persona-MME treats alignment as both an independent personalization capability and a cross-cutting property of responses.
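The relationship between the fourteen listed tasks and the thirteen primary tasks can be checked with a small data structure. The sketch below simply encodes the aspect-to-task table as a Python dictionary; the variable names are illustrative, and the case counts (406 and 2,034) are the ones reported in the appendix.

```python
# Aspect -> fine-grained tasks, as listed in the ontology table.
ONTOLOGY = {
    "Memory": ["Visual Detail Recall", "Semantic Information Recall"],
    "Intent": ["Explicit Intent Inference", "Implicit Intent Recognition"],
    "Preference": [
        "Latest Preference Recognition",
        "Interest Evolution Analysis",
        "Implicit Preference Recommendation",
    ],
    "Behavior": ["Behavioral Pattern Recognition", "Long-term Goal Tracking"],
    "Relationship": [
        "Relationship Recognition",
        "Relationship Dynamics Comprehension",
    ],
    "Growth": ["Tiered Explanation Delivery", "Generalizing to New Scenarios"],
    "Alignment": ["Personality Alignment"],
}

total_tasks = sum(len(tasks) for tasks in ONTOLOGY.values())

# Personality Alignment is scored alongside other tasks, so it is not
# counted among the primary tasks.
primary_tasks = total_tasks - len(ONTOLOGY["Alignment"])

# Share of test cases that also carry a personality-alignment check.
ALIGNMENT_CASES = 406
TOTAL_CASES = 2034
alignment_share = ALIGNMENT_CASES / TOTAL_CASES

print(total_tasks, primary_tasks, f"{alignment_share:.2%}")
```

Under these counts, roughly one in five test cases also evaluates personality alignment.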

The task design also reflects adjacent research problems. PERSONAMEM measures recall of stable user facts, latest-preference recognition, preference evolution, reasons for updates, and generalization to new scenarios in text-only histories (Jiang et al., 19 Apr 2025). MRBench, in turn, decomposes persona-conditioned role-playing into memory anchoring, recalling, bounding, and enacting, arguing that persona consistency should be diagnosed as a staged memory pipeline rather than as a single holistic style score (Wang et al., 14 Mar 2026). Persona-MME differs in scope, but these neighboring works clarify the methodological significance of explicitly factorizing personalization.

4. Evaluation protocol and benchmark mechanics

Persona-MME primarily uses in-situ querying. Queries are posed from the user’s first-person viewpoint at a specific point in the dialogue history to simulate realistic assistant use. Each test case typically includes a multiple-choice question assessing memory or understanding and an optional personality test for alignment. For Persona-MME’s core tasks, the main metric is accuracy in percent; alignment-style evaluation also uses accuracy (Nie et al., 20 Mar 2026).
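A minimal scoring sketch for this protocol is shown below. The record layout (`CaseResult` and its field names) is hypothetical and not taken from the paper; the sketch only assumes that each case has one multiple-choice answer for its primary task and, optionally, a second answer for the personality test, and that both are scored by exact-match accuracy in percent.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CaseResult:
    """One evaluated test case (hypothetical record layout)."""
    task: str                                  # primary task name
    gold: str                                  # ground-truth option, e.g. "B"
    predicted: str                             # model's chosen option
    alignment_gold: Optional[str] = None       # set only if the case has a personality test
    alignment_predicted: Optional[str] = None


def accuracy_by_task(results: List[CaseResult]) -> Dict[str, float]:
    """Return accuracy in percent for each task, including alignment."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for r in results:
        total[r.task] += 1
        correct[r.task] += int(r.predicted == r.gold)
        # The optional personality test is tallied under its own task name.
        if r.alignment_gold is not None:
            total["Personality Alignment"] += 1
            correct["Personality Alignment"] += int(
                r.alignment_predicted == r.alignment_gold
            )
    return {task: 100.0 * correct[task] / total[task] for task in total}
```

Tallying alignment under a separate key mirrors the idea that it is evaluated concurrently with a primary task instead of replacing it.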

The benchmark supports both 32k and 128k context settings. The paper also uses Persona-MME to sample 200 questions for open-ended generation evaluation judged by Gemini-2.5-Pro. In that pairwise protocol, the judge compares Response A and Response B on accuracy and personalization and outputs exactly one of “Wins,” “Ties,” or “Loses.” The rule is “Wins” if A is better on at least one criterion and not worse on the other, “Loses” if B is better on at least one criterion and not worse on the other, and “Ties” otherwise (Nie et al., 20 Mar 2026).
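The decision rule can be made concrete with a short sketch. In the paper the judge model emits the verdict directly; the code below only encodes the stated rule, under the assumption that each criterion is first reduced to a three-way comparison (+1 if A is better, 0 if equal, -1 if B is better). The function names are illustrative.

```python
from collections import Counter
from typing import Dict, List


def verdict(accuracy_cmp: int, personalization_cmp: int) -> str:
    """Combine two per-criterion comparisons into Wins / Ties / Loses.

    Each argument is +1 if Response A is better on that criterion,
    0 if the responses are equal, and -1 if Response B is better.
    """
    best = max(accuracy_cmp, personalization_cmp)
    worst = min(accuracy_cmp, personalization_cmp)
    if best > 0 and worst >= 0:
        return "Wins"   # A better on at least one criterion, not worse on the other
    if worst < 0 and best <= 0:
        return "Loses"  # B better on at least one criterion, not worse on the other
    return "Ties"       # equal on both, or each response wins one criterion


def outcome_rates(verdicts: List[str]) -> Dict[str, float]:
    """Return the percentage of each outcome over a list of verdicts."""
    counts = Counter(verdicts)
    return {k: 100.0 * counts[k] / len(verdicts) for k in ("Wins", "Ties", "Loses")}
```

Note that a split decision, where A is better on accuracy but worse on personalization, falls under "Ties" by this rule.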

More than ten models are evaluated on the benchmark, including proprietary systems such as GPT-4o, GPT-4o-mini, GPT-5, Gemini-2.5-Flash, and Claude-3.7-Sonnet, and open-source systems such as Qwen2.5-VL-7B, Qwen3-VL-8B, Qwen3-30B-A3B, InternVL3-8B, InternVL3-38B, and OneVision-1.5-8B. The comparison also includes PersonaVLM variants and a RAG baseline in which Qwen2.5-VL-7B retrieves the top five most relevant messages (Nie et al., 20 Mar 2026).

Because Persona-MME was designed together with PersonaVLM, the benchmark is tightly linked to a particular long-term personalization architecture. PersonaVLM stores a personality profile as a Big Five vector together with core, semantic, episodic, and procedural memory. Core and procedural memory retain only the latest version, whereas semantic and episodic memory are chronological and additive. Retrieval is action-based: the model can emit a <retrieve> action with keywords and a time period in YYYY-MM-DD HH:MM format; retrieval then filters by time, searches semantic, episodic, and procedural memories in parallel, and returns top-k results. Default top-k is 2 for procedural, 4 for semantic, and 2 for episodic memory. Text memory retrieval uses all-MiniLM-L6-v2 and similarity search uses FAISS (Nie et al., 20 Mar 2026).
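The retrieval flow described above (time filter, per-type search, per-type top-k) can be sketched in a few lines. This is not the PersonaVLM implementation: a toy bag-of-words vector with brute-force cosine similarity is swapped in for all-MiniLM-L6-v2 and FAISS so that the example has no dependencies, the three memory types are searched sequentially, and the latest-only versus additive update behavior is not modeled. `MemoryItem`, `embed`, and `retrieve` are hypothetical names.

```python
import math
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

TIME_FORMAT = "%Y-%m-%d %H:%M"  # the YYYY-MM-DD HH:MM format named in the text
DEFAULT_TOP_K = {"procedural": 2, "semantic": 4, "episodic": 2}


@dataclass
class MemoryItem:
    kind: str            # "procedural", "semantic", or "episodic"
    text: str
    timestamp: datetime


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(
    memories: List[MemoryItem],
    keywords: List[str],
    start: str,
    end: str,
    top_k: Optional[Dict[str, int]] = None,
) -> Dict[str, List[MemoryItem]]:
    """Filter by time window, then return the top-k matches per memory type."""
    top_k = top_k or DEFAULT_TOP_K
    lo = datetime.strptime(start, TIME_FORMAT)
    hi = datetime.strptime(end, TIME_FORMAT)
    query = embed(" ".join(keywords))
    results: Dict[str, List[MemoryItem]] = {}
    for kind, k in top_k.items():
        in_window = [
            m for m in memories if m.kind == kind and lo <= m.timestamp <= hi
        ]
        # sorted() is stable, so equally scored items keep insertion order.
        ranked = sorted(
            in_window, key=lambda m: cosine(query, embed(m.text)), reverse=True
        )
        results[kind] = ranked[:k]
    return results
```

A call such as `retrieve(memories, ["kyoto", "trip"], "2026-01-01 00:00", "2026-01-31 23:59")` would return at most two procedural, four semantic, and two episodic items from January 2026, ranked by similarity to the keywords.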

This tight coupling between benchmark and system is methodologically important. It means Persona-MME is not merely a task suite; it is also an explicit testbed for whether structured memory, retrieval, and personality updating improve long-term multimodal personalization. A plausible implication is that the benchmark is meant to stress capabilities that generic long-context prompting cannot reliably supply.

5. Empirical findings and benchmark difficulty

The main experimental finding is that PersonaVLM substantially improves over baseline systems on Persona-MME. In the 32k RAG setting, PersonaVLM_SFT scores 64.84 and PersonaVLM_RL scores 71.48, corresponding to a +10.28 improvement over the Qwen2.5-VL-7B RAG baseline in the table. In the 128k RAG setting, PersonaVLM_SFT scores 67.18 and PersonaVLM_RL scores 71.05, which the table reports as a +12.04 improvement over the baseline (Nie et al., 20 Mar 2026).

The abstract reports that PersonaVLM improves the baseline by 22.4% on Persona-MME and 9.8% on PERSONAMEM under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively (Nie et al., 20 Mar 2026). The paper also states that PersonaVLM is especially strong in Growth Modeling and Behavioral Awareness, where it beats GPT-4o by over 10%, although GPT-4o with full context still performs better in some memory-recall cases (Nie et al., 20 Mar 2026).

Per-task results further specify the benchmark’s difficulty profile. On 128k Persona-MME, PersonaVLM’s overall score is 77.08, with especially strong results on Long-term Goal Tracking at 62.16, Generalizing to New Scenarios at 92.00, Personality Alignment at 92.22, and Tiered Explanation Delivery at 82.76 (Nie et al., 20 Mar 2026). The appendix adds that proprietary models generally outperform open-source models, small open-source multimodal models struggle especially with alignment, large language-centric models such as Qwen3-30B-A3B can outperform larger multimodal models on overall personalization, and no single model dominates every subtask (Nie et al., 20 Mar 2026).

Open-ended evaluation yields an additional result: PersonaVLM achieves a 79% win rate against GPT-4o, with only a 16% loss rate in pairwise judgment by Gemini-2.5-Pro (Nie et al., 20 Mar 2026). This suggests that the benchmark does not only reward answer selection accuracy but also tracks perceived personalization quality under generative comparison.

The ablation findings are also benchmark-relevant. Removing episodic memory produces the largest degradation, a drop of 12.41% at 32k and 5.19% at 128k; removing other memory types causes smaller drops, generally under 2%. Removing the reasoning stage lowers performance by 2.75% at 32k and 3.73% at 128k (Nie et al., 20 Mar 2026). These results support the paper’s claim that long-term personalization depends heavily on structured episodic memory plus multi-step reasoning.

6. Position within the broader personalization literature

Persona-MME occupies a specific niche in the recent literature on personas, memory, and personalization. Compared with PERSONAMEM, it extends evaluation from text-only dynamic profile tracking to multimodal, holistic personalization. Compared with MPCHAT, it moves from multimodal persona-grounded dialogue retrieval toward long-term assistant-style reasoning across memory, intent, preference, behavior, relationship, growth, and alignment. Compared with MRBench, it is less focused on diagnostic stage decomposition and more focused on realistic in-situ personalization under long horizons (Jiang et al., 19 Apr 2025, Ahn et al., 2023, Wang et al., 14 Mar 2026).

The benchmark also complements work on multi-user identity separation. AFA defines persona confusion as the failure mode in which a shared assistant conflates one user’s history or preferences with another’s, and introduces Persona Attribution Accuracy to test whether responses align with the correct user under interleaved multi-user dialogue (Al-Ratrout et al., 27 Apr 2026). Persona-MME does not center that shared-device setting, but the comparison highlights that long-term personalization and correct persona attribution are separable problems.

Another adjacent line concerns internal persona representations. “Tracing Persona Vectors Through LLM Pretraining” reports that persona-like linear directions for traits such as evil, sycophancy, impoliteness, and humor form within 0.22% of OLMo-3 pretraining and remain effective for steering fully post-trained instruct models (Moskvoretskii et al., 13 May 2026). That work addresses persona as an internal activation-space phenomenon, whereas Persona-MME evaluates externally observable personalization behavior. This suggests that benchmark-level personalization and mechanistic persona representations may eventually be studied together, but the current benchmark remains behavior-centric.

Within this landscape, Persona-MME’s distinct contribution is its claim that long-term personalized multimodal evaluation should be comprehensive rather than task-fragmented. It encodes the view that a personalized assistant must not merely remember isolated facts or mimic a static preference profile, but must integrate visual detail, semantic history, latent intent, preference recency, evolving interests, behavioral regularities, relationship structure, user growth, and personality-consistent response style within a single evaluation frame (Nie et al., 20 Mar 2026).

7. Interpretation, scope, and open questions

The benchmark’s central implication is that holistic personalization remains unsolved even for strong multimodal systems. The paper explicitly argues that retrieval alone is insufficient, that personality alignment requires dynamic updating, and that memory, reasoning, and response alignment must be evaluated together rather than in isolation (Nie et al., 20 Mar 2026).

At the same time, Persona-MME has a defined scope. It is benchmarked in 32k and 128k settings rather than unconstrained interaction, it relies on curated in-situ cases rather than deployment logs, and its construction uses model-generated questions followed by human review. This suggests that the benchmark emphasizes controlled breadth and annotation quality over uncontrolled ecological sampling. A plausible implication is that Persona-MME is best understood as a high-coverage research benchmark rather than as a direct deployment audit.

Several surrounding works indicate possible future extensions. PERSONAMEM and PersonaMem-v2 place greater emphasis on dynamic user profiling, preference drift, implicit preference inference, and agentic memory over long textual histories (Jiang et al., 19 Apr 2025, Jiang et al., 7 Dec 2025). AFA foregrounds shared-device identity routing and multi-user leakage (Al-Ratrout et al., 27 Apr 2026). MM-$\tau$-$p^2$ introduces persona-aware multimodal agent evaluation in dual-control customer-support settings with explicit robustness, clarification, recovery, and safety metrics (Purwar et al., 10 Mar 2026). These directions suggest that future successors to Persona-MME may combine multimodal long-term personalization with stronger treatment of implicit preference inference, multi-user attribution, and operational robustness.

Persona-MME therefore marks a specific stage in the evolution of persona-aware evaluation: a benchmark in which multimodality, long-term interaction, and holistic personalization are made coequal design requirements. Its significance lies less in any single subtask than in the proposition that personalized assistant competence must be measured across a structured set of interlocking capabilities rather than reduced to memory recall or preference matching alone (Nie et al., 20 Mar 2026).