OmniPBench: Unified Personalized Multimodal Evaluation
- OmniPBench is a comprehensive benchmarking suite designed for evaluating personalized multimodal models across understanding, generation, and image editing tasks.
- It introduces rigorous protocols and a unified concept-token representation learned from few-shot exemplars to enforce cross-task consistency.
- The benchmark leverages specialized editing data and diverse evaluation metrics to assess model performance in recognition, VQA, synthesis, and precise image editing.
OmniPBench is a benchmarking suite designed for the systematic evaluation of unified large multimodal models (LMMs) with respect to unified personalized understanding, image generation, and image editing. Developed as an extension of UnifyBench, OmniPBench introduces rigorous protocols and data splits focused on personalized concept modeling, enabling the first end-to-end assessment of whether a learned personalized concept representation can be consistently recognized, described, generated, and edited within a single model architecture (Zhong et al., 11 Jan 2026).
1. Benchmark Objectives and Motivation
OmniPBench addresses the absence of comprehensive benchmarks for personalization extending beyond understanding and generation, particularly personalized image editing. The main objectives are:
- To assess whether a personalized concept (abstracted as, e.g., "sks") is consistently recognized, described, generated, and edited across multiple tasks using a single learned representation.
- To evaluate whether supervision from image editing tasks improves performance in understanding and generation (synergy assessment).
- To introduce cross-task protocols requiring that all modalities reuse a single “personalized concept” token set learned from few-shot exemplars and editing data, thereby enforcing cross-task generalization and representation integrity.
2. Dataset Composition and Structure
OmniPBench builds on the UnifyBench concept set, comprising 20 personalized concepts:
- Persons: 10 unique identities
- Pets: 5 distinct animals
- Objects: 5 object entities
For each concept, approximately 10 exemplars (images) and a concise textual description are collected, configuring a few-shot regime. These serve both concept-token learning and task evaluation.
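As a minimal sketch, the concept set above could be organized as follows; the `Concept` schema and field names are illustrative assumptions, not part of the benchmark release:

```python
from dataclasses import dataclass

@dataclass
class Concept:
    """One personalized concept in a few-shot regime (illustrative schema)."""
    name: str         # placeholder token for the concept, e.g. "sks"
    category: str     # "person", "pet", or "object"
    exemplars: list   # ~10 exemplar image paths per concept
    description: str  # concise textual description

# OmniPBench's concept set: 10 persons, 5 pets, 5 objects (20 total)
registry = (
    [Concept(f"person_{i}", "person", [], "") for i in range(10)]
    + [Concept(f"pet_{i}", "pet", [], "") for i in range(5)]
    + [Concept(f"object_{i}", "object", [], "") for i in range(5)]
)
assert len(registry) == 20
```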
A specialized editing subset is constructed, consisting of (source image, instruction, target image) triplets:
- Source image: an image featuring the concept
- Instruction: a free-form editing instruction (e.g., “remove sks from the photo”)
- Target image: the ground-truth edited image, manually verified (≥90% quality)
Editing operations are equally distributed among five categories (each 20%):
- Object manipulation (e.g., removal)
- Attribute modification (e.g., pose change)
- Spatial transformation (e.g., view-point change)
- Environment interaction (e.g., lighting variation)
- Style/appearance change (e.g., converting to a sketch)
The splitting protocol allocates exemplars and edit triplets to a training split (for learning concept tokens) and to validation/test splits (held-out images and unseen edit instructions).
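The editing subset and its balanced category distribution can be sketched as below; the record schema, file names, and split fractions are illustrative assumptions:

```python
from dataclasses import dataclass
import random

# The five editing categories, each 20% of the subset
CATEGORIES = [
    "object_manipulation", "attribute_modification",
    "spatial_transformation", "environment_interaction", "style_appearance",
]

@dataclass
class EditTriplet:
    """(source image, instruction, ground-truth edit) -- illustrative schema."""
    source_image: str
    instruction: str   # e.g. "remove sks from the photo"
    target_image: str
    category: str      # one of the five edit categories
    split: str         # "train" (token learning) or "test" (held-out)

def make_balanced_triplets(n_per_category=4, test_fraction=0.25, seed=0):
    """Build a category-balanced set of edit triplets with a train/test split."""
    rng = random.Random(seed)
    triplets = []
    for cat in CATEGORIES:
        for i in range(n_per_category):
            split = "test" if rng.random() < test_fraction else "train"
            triplets.append(EditTriplet(f"{cat}_{i}_src.png",
                                        f"edit sks ({cat})",
                                        f"{cat}_{i}_tgt.png", cat, split))
    return triplets

triplets = make_balanced_triplets()
# Each category contributes exactly 20% of the triplets
assert all(sum(t.category == c for t in triplets) == len(triplets) // 5
           for c in CATEGORIES)
```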
3. Task Definitions and Formalizations
OmniPBench structures evaluation across four principal tasks:
| Task Category | Input/Prompt | Expected Output |
|---|---|---|
| Personalized Understanding | Recognition, VQA, or Text QA query | Binary/QA answer (text/multimodal) |
| Personalized Generation | "Generate a photo of sks" / attribute prompt | Synthetic image |
| Personalized Attribute-Reasoning Generation (PARG) | Attribute-referencing prompt | Image requiring correct attribute recall/synthesis |
| Personalized Image Editing | Source image + editing instruction | Edited image |
- Recognition: Binary decision whether an image contains the personalized concept.
- Visual QA (VQA): Multimodal question-answering about the concept in an image.
- Text QA (QA): Attribute questions in text only (no image).
- Pure Generation / People Generation: Synthesis of the concept entity or person, given a prompt.
- Attribute-Reasoning Generation (PARG): Model must synthesize an image conditioned on a recall of a learned attribute (e.g., location, accessory).
- Editing: The model must output an image following the editing instruction while preserving the entity’s identity.
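As an illustration of how every task reuses one concept token, the task inputs could be instantiated as below; all prompt templates here are hypothetical:

```python
def build_task_inputs(token="sks", instruction="remove {} from the photo"):
    """Assemble illustrative per-task inputs that reuse one concept token."""
    return {
        "recognition": f"Does this image contain {token}? Answer yes or no.",
        "vqa": f"What is {token} doing in this image?",
        "text_qa": f"What color is {token}?",              # no image input
        "generation": f"Generate a photo of {token}",
        "parg": f"Generate a photo of {token} wearing its usual accessory",
        "editing": instruction.format(token),              # plus source image
    }

tasks = build_task_inputs()
assert tasks["editing"] == "remove sks from the photo"
```

The point of the sketch is structural: the same `token` string (standing in for the learned soft tokens) appears in every task's input, mirroring the benchmark's single-representation requirement.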
4. Cross-Task Evaluation Protocols
Each personalized concept is represented by a set of learnable soft tokens (16 for understanding and 16 for generation). All four tasks rely on the same set of learned soft tokens; these are trained on the union of understanding, generation, and editing data. After training, the tokens are frozen and all tasks are evaluated independently on held-out data.
For PARG, an explicit knowledge replay pipeline is utilized at inference:
- The system parses the prompt intent.
- The referenced attributes are retrieved from the learned concept representation.
- A grounded prompt is composed and used to generate the output.
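The knowledge-replay steps above can be sketched as follows; the attribute store, slot names, and string-matching intent parser are assumptions for illustration only:

```python
def parse_intent(prompt):
    """Toy intent parser: find which known attribute slots the prompt references."""
    slots = ["location", "accessory", "color"]
    return [s for s in slots if s in prompt]

def knowledge_replay(prompt, attribute_store):
    """Retrieve the referenced attributes and compose a grounded prompt."""
    grounded = prompt
    for slot in parse_intent(prompt):
        value = attribute_store.get(slot)
        if value:
            grounded = grounded.replace(slot, value)  # inline the recalled fact
    return grounded

# Hypothetical stored attributes for a concept "sks"
store = {"accessory": "a red collar", "location": "the park"}
out = knowledge_replay("a photo of sks wearing accessory at location", store)
assert out == "a photo of sks wearing a red collar at the park"
```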
Cross-task consistency is enforced and quantified by requiring reuse of the same concept tokens in understanding, generation, attribute reasoning, and editing.
5. Evaluation Metrics
Metrics are both standard and LLM-based, covering all tasks:
- Personalized Understanding:
- Recognition (Rec.): balanced recall, $\mathrm{Rec} = \tfrac{1}{2}\,(\mathrm{Recall}_{\mathrm{pos}} + \mathrm{Recall}_{\mathrm{neg}})$, averaging recall over positive and negative examples.
- VQA-BLEU and QA-BLEU: BLEU n-gram overlap.
- VQA-GPT, QA-GPT: Semantics assessed by GPT-4o on a [0,1] scale.
- Personalized Generation:
- CLIP-I (identity similarity): cosine similarity between CLIP embeddings of generated and reference images.
- CLIP-T (text-image alignment): cosine similarity between the generated image’s CLIP embedding and the prompt’s text embedding.
- DINO (perceptual similarity): DINO-v2 embeddings, same aggregation as CLIP-I.
- Face-Simi: ArcFace cosine similarity for people.
- Personalized Attribute-Reasoning Generation (PARG):
- PARG-Score: GPT-4o-based holistic score [0,1] for attribute faithfulness.
- PARG-CLIP-I: As CLIP-I, using attribute-conditioned references.
- Personalized Image Editing:
- Semantic Consistency (SEMA-C): GPT-4o judge [0,1], measuring instruction adherence and identity preservation.
- Quality of Image (QUAL-I): GPT-4o naturalness/artifact assessment [0,1].
- Overall Edit Score: an aggregate of SEMA-C and QUAL-I.
Evaluation is reported as a mean over all 20 concepts per metric.
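Two of the scalar aggregations above can be sketched directly; the balanced-recall form (mean of recall on positives and negatives) and the per-concept averaging follow the text, while function names are our own:

```python
def balanced_recall(preds, labels):
    """Balanced recall: mean of recall on positive and on negative examples."""
    pos = [p for p, y in zip(preds, labels) if y == 1]
    neg = [p for p, y in zip(preds, labels) if y == 0]
    rec_pos = sum(p == 1 for p in pos) / len(pos)
    rec_neg = sum(p == 0 for p in neg) / len(neg)
    return 0.5 * (rec_pos + rec_neg)

def benchmark_score(per_concept_scores):
    """Final reported number: mean over the 20 concepts for one metric."""
    return sum(per_concept_scores) / len(per_concept_scores)

# Perfect on positives, half right on negatives -> balanced recall 0.75
assert balanced_recall([1, 1, 0, 1], [1, 1, 0, 0]) == 0.75
```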
6. Experimental Setup and Baselines
Concept-token learning updates only the personalized tokens (not backbone parameters), using the AdamW optimizer with a fixed learning rate ($2000$ steps per concept, batch size $8$) and zero initialization of the tokens. The underlying backbone is the Bagel model, kept entirely frozen for fair evaluation of token-based personalization.
Inference employs 512×512 images, 50 diffusion steps, classifier-free guidance (text scale 4.0; image scale 1.0 for generation, 2.0 for editing), deterministic sampling with no added randomness, and temperature $0.3$ for text decoding.
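The reported setup can be consolidated into a configuration sketch; field names are assumptions, values follow the text above (the learning-rate value is not reproduced here since it is not stated):

```python
# Token-learning configuration: only concept tokens are trained, Bagel frozen
TOKEN_LEARNING = {
    "optimizer": "AdamW",
    "trainable": "concept tokens only",   # backbone kept entirely frozen
    "steps_per_concept": 2000,
    "batch_size": 8,
    "token_init": "zeros",
}

# Inference-time configuration
INFERENCE = {
    "image_size": (512, 512),
    "diffusion_steps": 50,
    "cfg_text_scale": 4.0,
    "cfg_image_scale": {"generation": 1.0, "editing": 2.0},
    "text_temperature": 0.3,
}
assert INFERENCE["cfg_image_scale"]["editing"] == 2.0
```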
Baselines include:
- Understanding-only: Yo'LLaVA, MC-LLaVA, RAP-MLLM, Qwen2.5-VL+TP
- Generation-only: Textual Inversion, DreamBooth (Stable Diffusion)
- Unified personalization: Chameleon+TP/IP, Show-o+TP, Bagel+TP, Yo’Chameleon, UniCToken
- Retrieval-augmented: RAP (RAG)
The evaluation protocol mandates that tokens be trained with combined understanding, generation, and editing data, followed by freezing and independent assessment on held-out test splits for all four task categories.
7. Significance and Impact
OmniPBench establishes a rigorous standard for evaluating unified personalized LMMs, bridging the gap left by prior benchmarks that focused primarily on understanding or generation, with little or no systematic coverage of editing. Its integration of editing data, unified cross-task protocols, and comprehensive metric suite enables detailed analysis of cross-task consistency, identity preservation, and the effect of editing supervision on overall personalization. By requiring all modalities to operate over a single learned concept-token representation, OmniPBench exposes cross-task interference and ensures that unified personalization is properly measured. This benchmark provides a foundation for both model development and comparative analysis in the rapidly evolving field of personalized multimodal AI (Zhong et al., 11 Jan 2026).