MUSE: Attribute-Aware Portrait Generation
- MUSE is a portrait painting generation framework that integrates text-specified attributes with facial features from a real photo to create semantically faithful artworks.
- The architecture employs a UNet-based image encoder and an attribute encoder with multi-task loss, achieving a 6% IS increase and an 11% FID reduction over baselines.
- Attribute-aware losses enforce semantic consistency across 11 distinct attribute types, enabling coherent multi-attribute manipulations and diverse artistic outputs.
MUSE is a framework for portrait painting generation that integrates textual attributes with facial features from a real photo to produce visually and semantically faithful digital artworks. Unlike conventional style transfer methods, MUSE leverages detailed, semantically rich attribute guidance to control the generation process, enabling the model to capture not only external appearance but also abstract properties such as mood, weather, and artistic style. The architecture is based on an attribute-aware, stacked neural network extension to a standard image-to-image UNet, with a multi-task loss for both adversarial realism and attribute consistency. Empirical results show that MUSE significantly improves over baselines in both visual fidelity (as measured by Inception Score and FID) and attribute preservation.
1. Attribute Typology and Semantic Guidance
MUSE’s generative process is steered by 11 distinct attribute types, aiming to capture the full range of informational and inspirational cues that human portraiture often seeks to encode. The categorical attribute types are:
- Age (e.g., Child, Young adults, Middle-aged adults, Older adults)
- Clothing (detailed categories such as Blazer, Dress, Religious Robe, etc.)
- Facial Expression (multiple fine-grained descriptors: Smile, Smirk, Wrinkle the nose, Glance)
- Gender (Male, Female, Other)
- Hair (style, color, length; e.g., Blond hair, Wavy hair)
- Mood (emotional state: Calm, Excited, Sad, Apathetic, etc.)
- Pose/Gesture (e.g., Sitting, Bowing, Shooting, Riding)
- Setting (e.g., In a room, On the street, Outdoors)
- Style (artistic style: Impressionism, Modernism, Chinese painting)
- Time (Before or After 1970)
- Weather (Sunny, Stormy, Foggy, Snow, etc.)
Each attribute is learned as a semantic embedding, and the embeddings jointly encode interdependencies, allowing the model to reflect not just explicit signals (e.g., “black hair”) but also compound semantics such as the interplay between mood, weather, and artistic style.
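To make the encoding concrete, below is a minimal PyTorch sketch of how type–value attribute pairs could be mapped to learned embeddings and concatenated into a single conditioning vector. The class name `AttributeEncoder`, the per-type vocabulary sizes, and the embedding dimension are illustrative assumptions, not the paper’s exact implementation.

```python
import torch
import torch.nn as nn

# The 11 attribute types described above (order is illustrative).
ATTRIBUTE_TYPES = [
    "age", "clothing", "facial_expression", "gender", "hair",
    "mood", "pose", "setting", "style", "time", "weather",
]

class AttributeEncoder(nn.Module):
    """Maps one categorical value per attribute type to a learned embedding,
    then concatenates the 11 embeddings into a single attribute vector."""

    def __init__(self, num_values_per_type, embed_dim=32):
        super().__init__()
        # One embedding table per attribute type; table sizes are hypothetical.
        self.tables = nn.ModuleList(
            nn.Embedding(num_values, embed_dim) for num_values in num_values_per_type
        )

    def forward(self, value_ids):
        # value_ids: LongTensor of shape (batch, 11), one value index per attribute type.
        embeds = [table(value_ids[:, i]) for i, table in enumerate(self.tables)]
        return torch.cat(embeds, dim=-1)  # (batch, 11 * embed_dim)
```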
2. Architecture and Attribute-Image Fusion
The architecture consists of two main modules: an image encoder (using a UNet backbone) and an attribute encoder mapping text-based type–value pairs to learned embeddings. The integration layer performs a fusion operation $h = \sigma(W_x x + W_a a + b)$, where $x$ is the encoded image representation, $a$ is the vector of concatenated attribute embeddings, $W_x$ and $W_a$ are trainable matrices, $b$ is a bias term, and $\sigma$ is a nonlinearity (ReLU). The decoder reconstructs a portrait painting from $h$.
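A minimal sketch of this fusion step in PyTorch, assuming flattened bottleneck features; the module name and the dimensions are illustrative assumptions rather than the paper’s exact layer shapes.

```python
import torch
import torch.nn as nn

class AttributeImageFusion(nn.Module):
    """Fuses encoder features x with the concatenated attribute embedding a
    via h = ReLU(W_x x + W_a a + b), as in the fusion equation above."""

    def __init__(self, image_dim, attr_dim, hidden_dim):
        super().__init__()
        self.W_x = nn.Linear(image_dim, hidden_dim, bias=False)  # trainable matrix for image features
        self.W_a = nn.Linear(attr_dim, hidden_dim, bias=False)   # trainable matrix for attributes
        self.bias = nn.Parameter(torch.zeros(hidden_dim))        # bias term b
        self.act = nn.ReLU()                                     # nonlinearity sigma

    def forward(self, x, a):
        # x: (batch, image_dim) flattened encoder features; a: (batch, attr_dim)
        return self.act(self.W_x(x) + self.W_a(a) + self.bias)
```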
In addition, a multi-task discriminator is implemented: it not only distinguishes real from fake images (adversarial loss $\mathcal{L}_{adv}$) but also provides an attribute classification branch with a loss term $\mathcal{L}_{attr}$ that ensures the generated image resides in the correct attribute class.
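A hedged sketch of such a multi-task discriminator with a shared convolutional trunk and two heads; the layer counts, channel widths, and head shapes are assumptions for illustration, not the paper’s architecture.

```python
import torch.nn as nn

class MultiTaskDiscriminator(nn.Module):
    """Shared trunk with two heads: a real/fake score for the adversarial loss
    and per-attribute logits for the attribute classification loss."""

    def __init__(self, num_attributes, in_channels=3, base=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.adv_head = nn.Linear(base * 4, 1)                 # real vs. fake logit
        self.attr_head = nn.Linear(base * 4, num_attributes)   # one logit per attribute

    def forward(self, img):
        feats = self.trunk(img)
        return self.adv_head(feats), self.attr_head(feats)
```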
3. Performance Metrics and Quantitative Evaluation
MUSE is evaluated using the Inception Score (IS) and Fréchet Inception Distance (FID):
- Inception Score (IS): Measures both the diversity and the clarity of generated images. MUSE achieves a 6% increase over the baseline, reflecting improved semantic variety and realism.
- Fréchet Inception Distance (FID): Assesses similarity between real and synthesized image distributions via mean and covariance statistics in feature space. MUSE reduces FID by 11% relative to the baseline, indicating closer alignment with the true data manifold.
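For reference, the FID described in the second bullet reduces to a closed form over the means and covariances of real and generated Inception features. The sketch below implements that standard formula (it is generic evaluation code, not MUSE-specific).

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(mu_r, sigma_r, mu_g, sigma_g):
    """FID between real and generated feature distributions, given their
    means and covariances in Inception feature space."""
    diff = mu_r - mu_g
    # Matrix square root of the covariance product; keep only the real part
    # to discard tiny imaginary components from numerical error.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    covmean = covmean.real if np.iscomplexobj(covmean) else covmean
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```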
A dedicated attribute reconstruction metric is introduced: a classifier (trained on the portrait dataset) predicts attributes from generated images; these predictions are compared to the ground-truth input attributes, and agreement is summarized as an F-score over all attributes. MUSE correctly illustrates roughly 78% of textual attributes, outperforming attribute-ignorant baselines by a substantial margin.
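A minimal sketch of how such an attribute F-score could be computed from binary prediction and ground-truth matrices; micro-averaging over all attribute values is an assumption here, since the exact aggregation is not specified in this summary.

```python
import numpy as np

def attribute_f_score(pred, truth, eps=1e-8):
    """Micro-averaged F-score between predicted and ground-truth binary
    attribute matrices, both of shape (num_images, num_attribute_values)."""
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)
```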
4. Qualitative Behavior, Inter-Attribute Dependencies, and Limitations
Experiments include both single-attribute modifications (e.g., editing only gender or mood) and complex, multi-attribute manipulations. Results show that MUSE not only enforces explicit attribute changes (e.g., smiling vs. blank expression, summer vs. winter clothing), but also captures latent interdependencies—combining “sad mood” with “stormy weather” produces coherent background and expression modifications.
The model demonstrates affordance awareness: semantically inconsistent combinations (such as a “smile” with a “sad” mood) are rarely realized, reflecting the attribute entanglement learned during training. However, the model struggles to disentangle certain highly correlated or weakly visualizable attributes, particularly when attribute labels induce ambiguous or conflicting visual cues.
5. Attribute-Aware Losses and Discriminative Training
The adversarial training regime is augmented by an attribute consistency loss $\mathcal{L}_{attr} = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$, where $p_i$ is the predicted probability that the generated image contains attribute $i$, and $y$ is the ground-truth binary vector. The total training loss thus combines reconstruction, adversarial, and attribute-consistency terms: $\mathcal{L} = \mathcal{L}_{rec} + \lambda_{adv} \mathcal{L}_{adv} + \lambda_{attr} \mathcal{L}_{attr}$. The design of the attribute-aware discriminator ensures the generator cannot “game” attribute appearance with only superficial changes.
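A sketch of how these terms might be combined for the generator update in PyTorch; the L1 reconstruction term, the non-saturating adversarial form, and the weighting coefficients are illustrative assumptions, not the paper’s exact choices.

```python
import torch
import torch.nn.functional as F

def generator_loss(real_img, fake_img, adv_logits, attr_logits, attr_targets,
                   lambda_adv=1.0, lambda_attr=1.0):
    """Combined generator objective: reconstruction + adversarial +
    attribute-consistency (binary cross-entropy) terms.
    attr_targets: float tensor of 0/1 ground-truth attribute labels."""
    l_rec = F.l1_loss(fake_img, real_img)
    # Adversarial term: push the discriminator's real/fake logit for
    # generated images toward "real".
    l_adv = F.binary_cross_entropy_with_logits(adv_logits, torch.ones_like(adv_logits))
    # Attribute consistency: the generated image should be classified with
    # the attributes it was conditioned on.
    l_attr = F.binary_cross_entropy_with_logits(attr_logits, attr_targets)
    return l_rec + lambda_adv * l_adv + lambda_attr * l_attr
```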
6. Applications and Artistic Implications
By enabling nuanced, multi-attribute control, MUSE makes it possible to encode not just the subject’s visage but also narrative context and artistic intention directly into the image synthesis process. For digital artists and computational creatives, this provides a new mechanism for iterative exploration: abstract ideas—ranging from “calm child in a rainy Impressionist style” to “middle-aged adult, modernist, after 1970, sunny”—are reliably translatable into image space. The generative process is thus elevated from low-level style transfer to semantic portraiture.
Generally, the model establishes a blueprint for future systems that accept high-level, natural language–style semantic guidance, supporting both photorealistic and stylized output with an interpretable correspondence between user intent and generative output.
7. Broader Impact and Future Directions
MUSE demonstrates that direct, attribute-driven image generation is feasible and achieves significant gains in both fidelity and control over mainstream baselines such as AttGAN and StarGAN. The compositionality and robustness of the attribute embeddings suggest that further integrating free-form textual cues or story elements could expand expressivity even further. Future work may extend this paradigm to other figurative domains or complex scenes, or incorporate richer narratives beyond categorical attributes, effectively closing the gap between semantic description and artistic image synthesis.