- The paper introduces P-MLLM, a novel multimodal LLM architecture that leverages user profiles to enable zero-shot personalized image aesthetics assessments.
- It employs a selective fusion module with embedding-conditioned gating to integrate visual features with textual reasoning, leading to improved metrics such as SROCC and PLCC.
- The approach demonstrates scalable personalization using basic demographic data, significantly outperforming traditional methods in cold-start scenarios.
Enhancing Zero-shot Personalized Image Aesthetics Assessment with Profile-aware Multimodal LLM
Introduction
Personalized Image Aesthetics Assessment (PIAA) aims to predict image aesthetic judgements tailored to individual user preferences. Because aesthetics are subjective, the model must capture user-specific tendencies, which is difficult when no historical ratings exist for a user, the typical zero-shot (cold-start) scenario. The paper "Enhancing Zero-shot Personalized Image Aesthetics Assessment with Profile-aware Multimodal LLM" (2604.17233) addresses this issue by introducing a profile-conditioning paradigm: user profiles, which are generally easier to obtain than explicit rating histories, are leveraged to inform model predictions.
Traditional PIAA approaches rely on fine-tuning generic IAA models with scarce per-user rating data or exploiting adaptation and meta-learning frameworks. However, these approaches fundamentally depend on task-specific user annotations, which constrains practical deployment. Some methods integrate auxiliary user information (e.g., demographics, personality traits), but always in tandem with subjective ground truth.
Recent work in personalized language modeling investigates augmenting LLMs with user profile information to generate persona-consistent responses. However, such techniques rarely extend to multimodal tasks, and mechanisms for profile-conditioned fusion in vision-LLMs remain primitive.
The P-MLLM Framework
Architecture
The paper introduces P-MLLM, a profile-aware multimodal LLM tailored for profile-based zero-shot PIAA. P-MLLM is built by augmenting a frozen LLM backbone (Llama3.1-8B-Instruct) with a pretrained image encoder (CLIP-ViT-L/14) and lightweight selective fusion modules. Visual feature integration is conducted exclusively within the lowest L transformer blocks through parallel side paths, maintaining the integrity of the LLM’s textual reasoning stream.
The selective fusion module enables controlled, fine-grained visual injection: at each fusion site, profile- and context-aware gating modulates per-head cross-attention between image features and relevant LLM hidden states. Head-specific gating values, conditioned on the evolving hidden state, are generated by position-dependent linear projections followed by a sigmoid, enabling dynamic, contextually appropriate visual grounding. Projections and normalizations in the fusion blocks are initialized from their self-attention analogues and kept frozen, enforcing a consistent representational space.
Notably, fusion is applied with masking logic, augmenting only the question- and answer-relevant tokens while leaving user profile representations intact.
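The gating and masking described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `gated_fusion`, the residual addition of the gated visual context, the use of one gate projection per head shared across positions, and the omission of the frozen query/key/value projections and normalizations are all simplifying assumptions made here for readability.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gated_fusion(hidden, image_feats, gate_w, gate_b, fuse_mask, n_heads):
    """Hypothetical per-head gated cross-attention (illustrative only).

    hidden:      T token vectors of size D (LLM hidden states)
    image_feats: N image vectors of size D (already projected to D)
    gate_w:      n_heads weight vectors of size D (gate projection)
    gate_b:      n_heads gate biases
    fuse_mask:   T booleans, True where visual injection is allowed
    """
    d = len(hidden[0])
    hd = d // n_heads  # per-head width
    out = []
    for h_t, allowed in zip(hidden, fuse_mask):
        if not allowed:
            out.append(list(h_t))  # e.g. profile tokens pass through unchanged
            continue
        new = list(h_t)
        for k in range(n_heads):
            lo, hi = k * hd, (k + 1) * hd
            q = h_t[lo:hi]
            # Scaled dot-product attention of this head over image features.
            scores = [dot(q, img[lo:hi]) / math.sqrt(hd) for img in image_feats]
            attn = softmax(scores)
            ctx = [sum(a * img[lo + j] for a, img in zip(attn, image_feats))
                   for j in range(hd)]
            # Head-specific gate conditioned on the full hidden state.
            g = sigmoid(dot(gate_w[k], h_t) + gate_b[k])
            for j in range(hd):
                new[lo + j] += g * ctx[j]  # assumed residual injection
        out.append(new)
    return out

# Toy inputs: token 0 plays a profile token (masked out), token 1 a question token.
hidden = [[1.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
image_feats = [[1.0, 0.0, 2.0, 2.0], [0.0, 1.0, 2.0, 2.0]]
gate_w = [[0.0] * 4, [0.0] * 4]
gate_b = [0.0, -50.0]  # head 0 half open, head 1 effectively closed
fuse_mask = [False, True]
fused = gated_fusion(hidden, image_feats, gate_w, gate_b, fuse_mask, n_heads=2)
```

In this toy setup the masked token is returned untouched, the first head of the question token receives half of its visual context, and the second head receives almost none, which is the kind of per-head control a single scalar gate cannot express.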
Dataset Construction
To jointly optimize for multimodal, profile-conditioned reasoning, the authors construct a training set encompassing three complementary task types:
- PIAA-oriented tasks: Profile-conditioned, subjective aesthetic evaluations at both global and attribute levels.
- Image-independent subjective tasks: Pure text, profile-dependent queries to anchor the LLM’s native personality-conditioned response patterns and prevent spurious visual bias.
- Vision-only captioning tasks: Profile-agnostic, objective visual descriptions to ensure reliable vision-language alignment.
Training samples are constructed as tuples (profile, image, question, answer) with cross-varied factors to guarantee disentanglement of image, profile, and question effects.
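One way to picture the cross-varied construction is a Cartesian product over the factors of each task type. The sketch below is an assumption about how such a set could be assembled, not the paper's pipeline: `Sample`, `build_samples`, and the example strings are hypothetical, and the answer field is left out because answers would come from dataset annotations.

```python
from itertools import product
from typing import NamedTuple, Optional

class Sample(NamedTuple):
    task: str
    profile: Optional[str]  # None for profile-agnostic captioning
    image: Optional[str]    # None for image-independent text tasks
    question: str

def build_samples(profiles, images, piaa_qs, text_qs, caption_qs):
    """Cross every factor within each task type (hypothetical helper)."""
    samples = []
    # PIAA-oriented: every profile sees every image under every question.
    for p, i, q in product(profiles, images, piaa_qs):
        samples.append(Sample("piaa", p, i, q))
    # Image-independent subjective tasks: profile and question only.
    for p, q in product(profiles, text_qs):
        samples.append(Sample("text_subjective", p, None, q))
    # Vision-only captioning: image and question only.
    for i, q in product(images, caption_qs):
        samples.append(Sample("caption", None, i, q))
    return samples

samples = build_samples(
    profiles=["profile_a", "profile_b"],
    images=["img_1.jpg", "img_2.jpg", "img_3.jpg"],
    piaa_qs=["Rate the overall aesthetics.", "Rate the composition."],
    text_qs=["Do you enjoy abstract art?"],
    caption_qs=["Describe the image."],
)
```

Because each image appears with every profile and each profile with every image, a change in the target answer can be attributed to one factor at a time.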
A genetic search is used to optimize the profile prompt format, explicitly minimizing the discrepancy between the specified personality attributes and those the model expresses on the LMLPA personality-testing questionnaire.
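The following is a minimal sketch of a genetic search over prompt-format choices. Everything concrete in it is assumed for illustration: the option names in `OPTIONS`, the population settings, and especially `toy_discrepancy`, which stands in for the real objective (running the model on the questionnaire and comparing expressed with specified traits).

```python
import random

# Hypothetical format dimensions; the paper's actual search space is not given here.
OPTIONS = {
    "style": ["bullet", "sentence", "json"],
    "person": ["second", "third"],
    "trait_scale": ["numeric", "verbal"],
    "order": ["demographics_first", "traits_first"],
}
KEYS = list(OPTIONS)

def toy_discrepancy(genome):
    """Stand-in objective: mismatches against an arbitrary target format."""
    target = {"style": "sentence", "person": "second",
              "trait_scale": "verbal", "order": "traits_first"}
    return sum(genome[k] != target[k] for k in KEYS)

def genetic_search(discrepancy, pop_size=8, generations=10,
                   mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    pop = [{k: rng.choice(OPTIONS[k]) for k in KEYS} for _ in range(pop_size)]
    initial_best = min(discrepancy(g) for g in pop)
    for _ in range(generations):
        pop.sort(key=discrepancy)
        parents = pop[: pop_size // 2]  # elitism: best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover followed by per-gene mutation.
            child = {k: rng.choice([a[k], b[k]]) for k in KEYS}
            for k in KEYS:
                if rng.random() < mutation_rate:
                    child[k] = rng.choice(OPTIONS[k])
            children.append(child)
        pop = parents + children
    best = min(pop, key=discrepancy)
    return best, discrepancy(best), initial_best

best_format, best_score, initial_best = genetic_search(toy_discrepancy)
```

Keeping the best half of each generation means the best discrepancy never gets worse from one generation to the next, which is convenient when each evaluation of the real objective is expensive.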
Experimental Evaluation
Datasets
P-MLLM is evaluated on PARA and LAPIS, the only available datasets providing both explicit user profiles and subjective aesthetic ratings. PARA offers demographic and personality metadata, while LAPIS provides richer demographics and art-related interests.
Baselines and Metrics
Comparison is made against leading GIAA MLLMs (AesExpert, Q-Instruct), general-purpose MLLMs (Qwen2.5-VL, GPT-4o-mini), a serial-pipeline baseline converting MLLM-generated descriptions to scores, and a scalar-gate ablation variant of P-MLLM (P-MLLM-S). Metrics include SROCC and PLCC for score correlation, and ICC for profile-conditioned response consistency.
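For readers unfamiliar with the two correlation metrics, the sketch below computes them from scratch: PLCC is the Pearson correlation between predicted and ground-truth scores, and SROCC is the Pearson correlation of their ranks (with average ranks for ties). The function names and toy scores are illustrative, and any score preprocessing used in the paper's evaluation is not reproduced here.

```python
import math

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def srocc(x, y):
    """Spearman rank-order correlation: Pearson on the ranks."""
    return plcc(average_ranks(x), average_ranks(y))

# Toy example: predictions are monotone in the truth but not linear.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
truth = [1.0, 4.0, 9.0, 16.0, 25.0]
rank_corr = srocc(predicted, truth)
linear_corr = plcc(predicted, truth)
```

In the toy example the ordering is perfectly preserved, so SROCC reaches its maximum, while PLCC stays slightly below it because the relationship is not linear. Reporting both therefore separates ranking quality from score calibration.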
Results
P-MLLM achieves higher SROCC and PLCC than all baselines in zero-shot PIAA on both PARA and LAPIS. The advantage holds even with demographics-only profiles (no personality traits), where P-MLLM still outperforms strong general-purpose MLLMs such as GPT-4o-mini.
The embedding-conditioned gating employed in the selective fusion module yields measurable gains over the scalar-gated ablation (P-MLLM-S), suggesting that dynamic fusion improves model flexibility and the encoding of nuanced, user-specific aesthetic signals.
Consistency experiments confirm that the P-MLLM architecture preserves and transfers the LLM’s inherent profile-conditioned reasoning pattern (measured by ICC across Big-Five dimensions), further evidencing the preservation of user persona semantics in the multimodal reasoning pipeline.
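As a rough illustration of an intraclass correlation, the sketch below implements the two-way consistency form ICC(3,1) from its ANOVA mean squares, ICC = (MSR - MSE) / (MSR + (k - 1) MSE). Which ICC variant the paper uses, and how rows and columns map onto profiles and Big-Five dimensions, is not stated in this summary, so both the variant and the toy data layout are assumptions.

```python
def icc_consistency(data):
    """ICC(3,1): rows are subjects, columns are raters or conditions."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Hypothetical layout: each row is one profile, the two columns are a
# trait level measured under two conditions (e.g. text-only vs. multimodal).
shifted = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]   # same ordering, constant offset
reversed_ = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]  # opposite ordering
icc_shifted = icc_consistency(shifted)
icc_reversed = icc_consistency(reversed_)
```

The consistency form ignores a constant offset between conditions, so the shifted data scores as fully consistent while the reversed data scores as fully inconsistent.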
Implications and Future Directions
By treating user profiles as the primary signal for PIAA, P-MLLM removes the need to collect per-user ratings before personalization can begin, enabling practical deployment in real-world scenarios where subjective user data is unavailable. The demonstrated effectiveness of even coarse (demographic-only) profiles suggests that robust personalization can be achieved with widely available metadata, facilitating scalable personalized curation and recommendation.
From a theoretical perspective, the P-MLLM architecture highlights the value of fine-grained, context-dependent visual fusion for subjective multimodal tasks. Embedding-conditioned gating is empirically supported as a mechanism for dynamically governing profile–visual interactions, a finding that may generalize to other multimodal LLM applications such as personalized visual storytelling or emotion-aware content generation.
Future work may explore leveraging additional user- and context-derived signals, scaling the fusion paradigm to deeper model architectures, and investigating cross-domain or cross-task transfer for personalization scenarios.
Conclusion
This work advances the state of personalized image aesthetics assessment by introducing P-MLLM, a profile-aware multimodal LLM with dynamic, profile-conditioned fusion modules. The approach enables competitive zero-shot personalization, leverages profile semantics for multimodal reasoning, and shows empirical gains over the compared baselines, even with minimal user information (2604.17233). The profile-based personalization paradigm and controllable fusion mechanism outlined here provide a foundation for advancing subjective multimodal tasks under practical constraints.