ImageGem Dataset: Personalized Generative Models

Updated 28 October 2025
  • ImageGem Dataset is a comprehensive resource linking over 4.9M filtered images, 3M text prompts, and 242K LoRA models with detailed user interaction data.
  • It enables preference alignment and personalized generative model editing by combining user feedback, CLIP embeddings, and an SVD- and PCA-based latent weight space.
  • Empirical evaluations demonstrate improvements in preference-alignment metrics such as Pick Score, HPSv2, and CLIP Score, alongside gains in personalized model recommendation and retrieval.

ImageGem is a large-scale generative image interaction dataset designed to advance fine-grained, user-specific personalization of generative models. Constructed from real-world activity on Civitai, a prominent diffusion model sharing platform, ImageGem provides a comprehensive resource comprising user-generated images, text prompts, LoRA model checkpoints, and detailed interaction-driven preference annotations. By connecting these core entities through a relational structure, the dataset enables rigorous study of preference alignment, model recommendation, and personalized editing frameworks within latent weight spaces.

1. Dataset Structure and Composition

ImageGem consists of interlinked metadata from three principal sources:

  • Images: Approximately 5.7 million raw images (4.9 million after safety filtering) generated by users employing customizable diffusion models.
  • Text Prompts: Around 3 million distinct user-authored prompts, offering a diverse range of semantic and stylistic inputs.
  • LoRA Models: 242,000 user-built model checkpoints representing fine-tuned low-rank adaptations ("LoRAs") derived from Stable Diffusion architectures.

The dataset further captures ancillary metadata, notably:

  • Over 100,000 unique model tags categorizing generative content and model characteristics.
  • User feedback comprising emoji responses (thumbs-up, heart, laugh, cry) and engagement histories across a cohort of 57,000 distinct users—many serving both as model creators and consumers.
  • A ternary relational structure linking images, LoRAs, and user profiles for granular analysis of individual and population-wide preferences.

Statistical summaries indicate an average of 49 images and 12 LoRA models per uploader, with additional visual analytics such as UMAP projections applied to image CLIP embeddings and LoRA checkpoints to map content diversity.

Core Entity | Quantity | Metadata Examples
Images | ≈5.7M raw; ≈4.9M after filtering | CLIP embeddings, tags, emoji feedback
Text Prompts | ≈3M | Prompt text
LoRA Models | ≈242K | Checkpoints, tags
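
At the level of data access, this relational structure can be pictured as three entity tables joined by interaction records. The following Python sketch is purely illustrative: the class and field names are hypothetical and do not reflect ImageGem's actual file format, but they show how images, prompts, LoRAs, and user feedback link together.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative schema only; field names do not mirror ImageGem's released files.

@dataclass
class ImageRecord:
    image_id: str
    prompt: str                      # user-authored text prompt
    lora_id: str                     # LoRA checkpoint used for generation
    clip_embedding: List[float]      # precomputed CLIP image embedding
    tags: List[str] = field(default_factory=list)

@dataclass
class LoRARecord:
    lora_id: str
    creator_id: str
    checkpoint_path: str
    tags: List[str] = field(default_factory=list)

@dataclass
class UserInteraction:
    user_id: str
    image_id: str
    reaction: str                    # e.g. "thumbs_up", "heart", "laugh", "cry"

def images_liked_by(user_id: str,
                    interactions: List[UserInteraction],
                    images: Dict[str, ImageRecord]) -> List[ImageRecord]:
    """Collect the images a user reacted to positively."""
    positive = {"thumbs_up", "heart"}
    return [images[i.image_id] for i in interactions
            if i.user_id == user_id and i.reaction in positive]
```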

2. Annotation of User Preferences

ImageGem’s annotation schema captures both aggregate and individualized preference signals:

  • Aggregate Signals: User reactions to images via emojis and implicit feedback serve as “natural” preference indicators, eschewing explicit pairwise human annotation.
  • Individualized Preferences: Detailed interaction histories facilitate identification of each user’s “core preference cluster”: the user’s images are embedded with CLIP and clustered with HDBSCAN, and the resulting clusters yield representative stylistic or subject categories (see the sketch at the end of this section).
  • Preference Alignment Data: For alignment tasks, preference pairs are assembled using min–max strategies within clusters, employing Human Preference Score v2 (HPSv2) as the aligning metric.

This multi-level annotation process is integral for the supervised training of models aimed at human-aligned image synthesis and model selection.
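
A minimal sketch of the clustering and pair-assembly steps, assuming each user's images have already been embedded with CLIP and scored with HPSv2; the helper names and the exact pairing rule are illustrative rather than the paper's reference implementation.

```python
import numpy as np
import hdbscan  # pip install hdbscan

def core_preference_clusters(clip_embeddings: np.ndarray,
                             min_cluster_size: int = 10) -> np.ndarray:
    """Cluster one user's image embeddings; label -1 marks noise points."""
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(clip_embeddings)

def min_max_pairs(labels: np.ndarray, hpsv2_scores: np.ndarray):
    """Within each cluster, pair the highest- and lowest-scoring images
    as (preferred, dispreferred) training examples."""
    pairs = []
    for c in set(labels):
        if c == -1:
            continue
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            continue
        best = idx[np.argmax(hpsv2_scores[idx])]
        worst = idx[np.argmin(hpsv2_scores[idx])]
        pairs.append((best, worst))
    return pairs

# Example with random stand-in data (real usage would load CLIP embeddings
# and HPSv2 scores for a single user's images):
emb = np.random.randn(200, 512).astype(np.float32)
scores = np.random.rand(200)
labels = core_preference_clusters(emb)
print(min_max_pairs(labels, scores)[:5])
```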

3. Enabling Generative Model Personalization

ImageGem enables a paradigm shift from generic model inference to individualized generative model adaptation:

  • Interaction-Driven Personalization: By leveraging user-specific LoRA creation and interaction metadata, the dataset bypasses reliance on textual prompts, exposing deeper individual preference signatures.
  • Latent Weights-to-Weights (W2W) Space Construction: Every LoRA checkpoint undergoes Singular Value Decomposition (SVD), distilling its weight matrix to the top singular component. By flattening and concatenating across layers, each model is expressed as a vector $\theta_i \in \mathbb{R}^d$.
  • Principal Component Analysis (PCA): PCA on these vectors yields a set of basis directions $\{w_1, w_2, \ldots, w_m\}$ that encapsulate user preference variation.
  • Personalized Editing: An editing direction $v$ is learned via a binary classifier trained on CLIP-based similarity labels comparing each model’s outputs against a user’s preference cluster description. Personalized model generation is then performed by traversing the latent space:

$$\theta_\text{edit} = \theta + \alpha \cdot v$$

with scalar $\alpha$ modulating the degree of personalization. This mechanism generalizes LoRA modification for individual user alignment.
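
A condensed sketch of this construction, assuming each LoRA's layer weight matrices are available as NumPy arrays and that binary preference labels have already been derived from CLIP similarity. The logistic-regression classifier and the choice to edit in PCA coordinates are assumptions for illustration, not necessarily the paper's exact setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def lora_to_vector(layer_weights: list) -> np.ndarray:
    """Reduce each layer's weight matrix to its top singular component
    (sigma_1 * u_1 v_1^T), flatten, and concatenate into theta_i in R^d."""
    parts = []
    for W in layer_weights:
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        rank1 = S[0] * np.outer(U[:, 0], Vt[0, :])
        parts.append(rank1.ravel())
    return np.concatenate(parts)

def build_w2w_space(thetas: np.ndarray, n_components: int = 64):
    """PCA over stacked model vectors yields basis directions {w_1, ..., w_m}.
    n_components must not exceed the number of models available."""
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(thetas)
    return pca, coords

def editing_direction(coords: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """The normal vector of the hyperplane separating 'matches the user's
    preference cluster' from 'does not' serves as the tuning direction v."""
    clf = LogisticRegression(max_iter=1000).fit(coords, labels)
    v = clf.coef_[0]
    return v / np.linalg.norm(v)

def personalize(theta_coords: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """theta_edit = theta + alpha * v, shown here in PCA coordinates;
    pca.inverse_transform would map the result back toward a weight vector."""
    return theta_coords + alpha * v
```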

4. Empirical Evaluation and Benchmarks

ImageGem supports comprehensive evaluation across preference alignment and personalized retrieval:

  • Preference Alignment: Stable Diffusion 1.5 (SD1.5) is fine-tuned according to the DiffusionDPO framework using preference pairs sourced from ImageGem. Quantitative benchmarks against the Pick-a-Pic dataset demonstrate marked improvements in Pick Score, HPSv2, and CLIP Score. Models fine-tuned on the dataset’s subsets (covering “cars,” “dogs,” and “scenery”) consistently outperform established baselines for human preference alignment.
  • Retrieval and Recommendation: The dataset is partitioned for image and model retrieval tasks. Standard algorithms (ItemKNN, Item2Vec, two-tower networks, SASRec) are assessed via Recall@k and NDCG@k. Vision-language models (VLMs) such as Pixtral-12B are deployed to auto-generate structured textual descriptions that aid ranking interpretability.
  • Performance Gains: Metrics indicate significant enhancement in personalized retrieval accuracy and generative model recommendation when leveraging individual preference annotations.

Task | Methods Used | Metrics Improved
Preference Alignment | SD1.5 + DiffusionDPO with ImageGem pairs | Pick Score, HPSv2, CLIP Score
Personalized Retrieval | ItemKNN, Item2Vec, SASRec, VLM-based ranking | Recall@k, NDCG@k
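
For reference, Recall@k and NDCG@k can be computed per user from a ranked recommendation list; the helpers below are a generic sketch rather than the paper's evaluation code.

```python
import numpy as np

def recall_at_k(ranked_items: list, relevant: set, k: int) -> float:
    """Fraction of a user's relevant items that appear in the top-k ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked_items: list, relevant: set, k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / np.log2(i + 2)
              for i, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Example: items 'a' and 'd' are relevant for this user.
print(recall_at_k(['a', 'b', 'c', 'd'], {'a', 'd'}, k=3))  # 0.5
print(ndcg_at_k(['a', 'b', 'c', 'd'], {'a', 'd'}, k=3))    # ~0.61
```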

5. End-to-End Framework for Diffusion Model Editing

The editing framework proposed in ImageGem is inherently end-to-end:

  • SVD Standardization: Each LoRA’s weight matrix is transformed into a top-1 singular vector representation.
  • Latent Space Linearization: Models are represented as vectors $\theta_i \in \mathbb{R}^d$, with PCA extracting primary components indicative of preference traits.
  • Linear Classifier Training: Binary labels derived from CLIP similarity comparisons are used to learn the optimal hyperplane; its normal vector $v$ serves as the latent tuning direction.
  • Weight Manipulation: Personalized editing is executed via $\theta_\text{edit} = \theta + \alpha \cdot v$, applicable to any LoRA.
  • Aggregated Preference Alignment: The DiffusionDPO-based objective incorporates a reward function $r(c, x_0)$ evaluated along the diffusion chain and regularized by Kullback–Leibler divergence:

$$\max_\theta \; \mathbb{E}_{(c,\, x_{0:T}) \sim p_\theta(x_{0:T} \mid c)}\left[ r(c, x_0) \right] \;-\; \beta \, D_{\mathrm{KL}}\!\left[ p_\theta(x_{0:T} \mid c) \,\|\, p_\mathrm{ref}(x_{0:T} \mid c) \right]$$

Combined, these procedures yield a technically rigorous pipeline for the direct personalization of generative diffusion models.
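
In practice, this KL-regularized objective is typically optimized through a DPO-style pairwise loss over (preferred, dispreferred) image pairs, as in the published DiffusionDPO formulation. The sketch below is schematic rather than the paper's training code: it takes precomputed denoising errors as inputs, folds the timestep weighting into the beta constant, and uses an illustrative default value.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(err_theta_w: torch.Tensor,  # ||eps - eps_theta||^2, preferred sample
                       err_ref_w: torch.Tensor,    # same error under the frozen reference model
                       err_theta_l: torch.Tensor,  # error on dispreferred sample, trained model
                       err_ref_l: torch.Tensor,    # error on dispreferred sample, reference model
                       beta: float = 5000.0) -> torch.Tensor:
    """DPO-style pairwise loss for diffusion models: the trained model should
    reduce its denoising error on the preferred sample (relative to the reference)
    more than on the dispreferred one. Timestep weighting is absorbed into beta."""
    preferred_margin = err_theta_w - err_ref_w
    dispreferred_margin = err_theta_l - err_ref_l
    logits = -beta * (preferred_margin - dispreferred_margin)
    return -F.logsigmoid(logits).mean()
```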

6. Significance, Limitations, and Future Research

ImageGem establishes a new benchmark in fine-grained, real-world preference learning for generative image models. The scale and diversity of its interaction data render it suitable for advancing beyond aggregated preference alignment towards systems not only responsive to broad human tastes but also individually tailored to user style and intent.

The framework's principal innovation—editing in latent weight space by learned preference directions—demonstrates effective personalization and promises utility in both image generation and recommendation domains. Current limitations include dependence on PCA for latent space construction (potentially curtailing model diversity) and the challenges posed by implicit feedback extraction.

Future research directions outlined in the source include:

  • Incorporating richer and more dynamic implicit signals for improved preference modeling.
  • Generalizing LoRA latent space methods to higher-rank models and new domains (e.g., abstract or scenic imagery).
  • Integrating multi-modal cues for enhanced interpretability and precision in generative recommendation systems.
  • Refining VLM-based ranking for greater interpretive stability.

A plausible implication is that the approach may extend to other modalities and recommendation contexts, contingent on overcoming current methodological constraints and exploiting richer data sources. ImageGem thus positions itself as a foundational resource and conceptual framework for the next generation of personalized generative modeling research.
