Continuous Emotional Image Generation (C-EICG)
- Continuous Emotional Image Generation (C-EICG) is a paradigm that synthesizes images using continuous emotion variables, enabling smooth transitions beyond categorical labels.
- It leverages conditional GANs, text-to-image diffusion models, and reinforcement learning to integrate fine-grained valence–arousal controls into the generative process.
- The approach supports practical applications like expressive avatars, affective media, and iterative editing while addressing challenges in emotion continuity and model interpretability.
Continuous Emotional Image Generation (C-EICG) refers to the class of generative techniques that synthesize images conditioned on continuously variable emotional factors, extending beyond categorical emotion labels to fine-grained or spectrum-based affective control. This paradigm underpins emotionally nuanced avatars, affective visual communication, data augmentation for emotion analysis, and realistic portrait animation. C-EICG models leverage high-dimensional, often multi-modal conditioning and optimization, uniting advances from conditional GANs, text-to-image diffusion, reinforcement learning, and affective computing.
1. Conceptual Foundation and Problem Formulation
Traditional emotion-conditioned image generative models operate on discrete emotion labels (e.g., “happy,” “sad”), precluding smooth control or open-manifold navigation through affective space. C-EICG generalizes the conditioning variable from a one-hot vector to a continuous vector—typically in a valence–arousal (V–A) space, or via convex mixtures of categorical labels—enabling interpolation and traversal along emotional gradients (Mertes et al., 2022, He et al., 10 Jan 2025, Zhang et al., 24 May 2025, Jia et al., 25 Nov 2025).
Let $e$ denote the continuous emotion control variable—e.g., $e = (v, a)$ for valence and arousal, or a convex combination $e = \sum_i \lambda_i e_i$ of one-hot vectors. Given an auxiliary input $c$ (which may be a latent vector, text prompt, or audio), the C-EICG task is to learn a generator $G$ producing $\hat{x} = G(c, e)$ such that $P(\hat{x}) \approx e$, where $P$ is an emotion predictor mapping images to their perceived affective embedding.
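A minimal PyTorch sketch of this objective, with `generator` and `emotion_predictor` as hypothetical stand-ins for $G$ and $P$ (not any specific paper's architecture):

```python
import torch.nn.functional as F

def emotion_alignment_loss(generator, emotion_predictor, c, e_target):
    """Generic C-EICG training signal: synthesize an image from auxiliary
    input c and continuous emotion target e_target (e.g., a valence-arousal
    pair), then penalize the gap between perceived and target affect."""
    x_hat = generator(c, e_target)        # emotion-conditioned synthesis
    e_pred = emotion_predictor(x_hat)     # perceived affect of the output
    return F.mse_loss(e_pred, e_target)   # emotion-alignment term
```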
2. Methodological Approaches
C-EICG can be realized through several architectural and algorithmic frameworks.
2.1. Label Interpolation with Conditional GANs
Mertes et al. (Mertes et al., 2022) pioneered interpolation in one-hot label space for faces: given base emotion vectors $e_1$ and $e_2$, define $e_\lambda = \lambda e_1 + (1 - \lambda) e_2$ for $\lambda \in [0, 1]$. Both generator and discriminator in a DCGAN-style cGAN are conditioned on these interpolated vectors. This allows generation of faces whose expressions transition smoothly between two emotions, as confirmed by classifier outputs and rater studies.
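A small illustration of the interpolation scheme, assuming a 6-class one-hot label space; the cGAN call is left as a placeholder:

```python
import torch

def interpolate_labels(e1: torch.Tensor, e2: torch.Tensor, lam: float) -> torch.Tensor:
    """Convex mixture of two one-hot emotion vectors, lam in [0, 1]."""
    return lam * e1 + (1.0 - lam) * e2

# e.g., blend two expressions in a 6-class label space
happy, sad = torch.eye(6)[0], torch.eye(6)[3]
for lam in torch.linspace(0.0, 1.0, steps=5):
    e_lam = interpolate_labels(happy, sad, lam.item())
    # image = generator(torch.randn(1, 100), e_lam.unsqueeze(0))  # placeholder cGAN call
    print(lam.item(), e_lam.tolist())
```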
2.2. Text-to-Image Diffusion with Valence–Arousal Control
EmotiCrafter (He et al., 10 Jan 2025) introduces a pipeline that enables C-EICG directly from text and continuous valence–arousal control. The core innovation is an emotion-embedding mapping network that injects the target pair $(v, a)$ into prompt features via a transformer-based module, producing conditioned features that guide a Stable Diffusion XL (SDXL) image generator. The network is trained by minimizing a KDE-weighted regression loss, using residual learning to amplify the emotion shift.
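A rough sketch of the injection idea, with illustrative module names and dimensions rather than EmotiCrafter's exact architecture: a small network embeds $(v, a)$, and cross-attention adds a residual shift to the prompt features.

```python
import torch
import torch.nn as nn

class EmotionPromptMapper(nn.Module):
    """Embeds a (valence, arousal) pair and adds it to prompt features as a
    residual shift via cross-attention, preserving base semantics while
    amplifying the affective component."""
    def __init__(self, prompt_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(2, hidden), nn.SiLU(), nn.Linear(hidden, prompt_dim)
        )
        self.attn = nn.MultiheadAttention(prompt_dim, num_heads=8, batch_first=True)

    def forward(self, prompt_feats: torch.Tensor, va: torch.Tensor) -> torch.Tensor:
        # prompt_feats: (B, T, prompt_dim) text-encoder features; va: (B, 2)
        e = self.embed(va).unsqueeze(1)            # (B, 1, prompt_dim) emotion token
        shift, _ = self.attn(prompt_feats, e, e)   # prompt tokens attend to emotion token
        return prompt_feats + shift                # residual, emotion-conditioned features
```

In a full pipeline, the conditioned features would stand in for the text-encoder output fed to the diffusion model's cross-attention layers.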
2.3. Reinforcement and Feedback Loops
EmoFeedback² (Jia et al., 25 Nov 2025) augments C-EICG with a reinforcement learning framework, where a large vision-language model (LVLM) acts as both evaluator (assigning a scalar reward based on how well generated images match target emotions) and adaptive prompt refiner (generating updated prompts based on analysis of the generated images). The generator is optimized using Group Relative Policy Optimization (GRPO), maximizing LVLM-based emotional fidelity rewards, augmented by human aesthetic preference (PickScore). At inference, the LVLM iteratively analyzes images and issues textual feedback to incrementally adjust prompts and improve emotional consistency.
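A schematic of one evaluate-and-refine round with group-relative (GRPO-style) advantages; `generator`, `lvlm_score`, and `refine_prompt` are hypothetical callables, and the combination with the PickScore reward is omitted:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantage: standardize rewards within a group of samples
    generated for the same prompt and emotion target."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def feedback_round(generator, lvlm_score, refine_prompt, prompt, e_target, group_size=4):
    """Sample a group of images, score emotional fidelity with the LVLM,
    and let the LVLM rewrite the prompt based on the best candidate."""
    images = [generator(prompt, e_target) for _ in range(group_size)]
    rewards = torch.tensor([lvlm_score(img, e_target) for img in images])
    advantages = group_relative_advantages(rewards)     # would weight the policy update
    new_prompt = refine_prompt(prompt, images[int(rewards.argmax())], e_target)
    return new_prompt, advantages
```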
2.4. Diffusion and Video Generation with Expression Spaces
For face video, both EMOdiffhead (Zhang et al., 11 Sep 2024) and DREAM-Talk (Zhang et al., 2023) model continuous facial expressions using 3DMM parameterizations (e.g., FLAME or ARKit blendshapes), with control vectors derived from linear interpolation between neutral and target emotional states. These embeddings guide video diffusion models, yielding temporally consistent, continuously variable expressions. EMOdiffhead incorporates both audio and expression conditioning, supporting structured interpolation of emotional intensities.
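A minimal sketch of intensity control by linearly blending expression coefficients between neutral and a target emotional state; the 52-coefficient ARKit-style parameterization is illustrative:

```python
import numpy as np

def expression_at_intensity(neutral: np.ndarray, target: np.ndarray, alpha: float) -> np.ndarray:
    """Blend neutral and target expression coefficients; alpha in [0, 1] acts
    as a continuous emotion-intensity dial."""
    return (1.0 - alpha) * neutral + alpha * target

# e.g., 52 ARKit-style blendshape coefficients per frame
neutral = np.zeros(52)
angry = np.random.default_rng(0).uniform(0.0, 1.0, 52)  # stand-in target expression
ramp = [expression_at_intensity(neutral, angry, a) for a in np.linspace(0.0, 1.0, 8)]
# each element of `ramp` would condition one segment of the video diffusion model
```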
2.5. Continuous Spectrum-Based Editing
AIEdiT (Zhang et al., 24 May 2025) organizes emotional control as navigation along a learned continuous spectrum. Instructions are embedded via BERT and organized with a contrastive loss so that proximity in embedding space correlates with similarity in visual affect. A semantic mapper translates nuanced “emotional requests” into visually actionable latent features for a diffusion-based editor, supporting both generation (from noise) and editing with continuous emotional adjustment.
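A condensed sketch of one way to realize the contrastive organization, using an InfoNCE-style loss over paired instruction and affect embeddings (AIEdiT's exact loss and encoders may differ):

```python
import torch
import torch.nn.functional as F

def spectrum_contrastive_loss(text_emb: torch.Tensor, affect_emb: torch.Tensor, tau: float = 0.07):
    """Pull each instruction embedding toward the affect embedding of its
    paired image and away from the others, so distances on the learned
    spectrum track perceived emotional similarity."""
    t = F.normalize(text_emb, dim=-1)    # (B, D) BERT-encoded emotional requests
    a = F.normalize(affect_emb, dim=-1)  # (B, D) affect embeddings of paired images
    logits = t @ a.t() / tau             # (B, B) similarity matrix
    labels = torch.arange(t.size(0), device=t.device)
    return F.cross_entropy(logits, labels)
```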
3. Model Architectures and Conditioning Strategies
C-EICG model architectures are distinguished by their mechanism for integrating continuous emotional signals into the generative process.
- Input-level conditioning (GANs, diffusion): Continuous vectors (e.g., valence–arousal pairs, interpolated label vectors, or 3D expression embeddings) are concatenated or projected into the input stream, or supplied as additional tokens/features in diffusion U-Nets and transformers (Mertes et al., 2022, He et al., 10 Jan 2025, Zhang et al., 11 Sep 2024).
- Attention-based emotion injection: Transformers conditioned on prompt features and emotion embeddings with cross-attention at every block (EmotiCrafter), supporting stronger entanglement of semantic content and emotional valence (He et al., 10 Jan 2025).
- Scale-and-shift normalization: Emotion and audio embeddings modulate group-norm parameters in diffusion models, as in EMOdiffhead, enabling fine-grained control of generated frame features; a minimal sketch follows this list (Zhang et al., 11 Sep 2024).
- Feedback reinforcement: LVLMs return reward scores and iterative prompt modifications, integrating both gradient-free reinforcement and human-in-the-loop corrections (Jia et al., 25 Nov 2025).
- Spectrum navigation: High-dimensional spectrum learned by contrastive alignment enables continuous, multi-directional control over nuanced affect dimensions, e.g., via smooth trajectories in embedding space (Zhang et al., 24 May 2025).
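As referenced in the scale-and-shift item above, a minimal sketch of emotion-modulated group normalization, with illustrative shapes rather than EMOdiffhead's exact design:

```python
import torch
import torch.nn as nn

class EmotionModulatedGroupNorm(nn.Module):
    """GroupNorm whose scale and shift are predicted from a conditioning
    embedding (e.g., concatenated emotion and audio features)."""
    def __init__(self, channels: int, cond_dim: int, groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * channels)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # h: (B, C, H, W) frame features; cond: (B, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        h = self.norm(h)
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
```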
4. Datasets, Evaluation Metrics, and Experimental Protocols
C-EICG models require detailed, affectively annotated datasets and specialized evaluation metrics.
Datasets
- Face Generation: FACES (Mertes et al., 2022), MEAD, HDTF (emotion-balanced/realistic), and ARKit/FLAME blendshape sequences (Zhang et al., 11 Sep 2024, Zhang et al., 2023).
- Image Content & Editing: OASIS, EMOTIC, FindingEmo merged with continuous VA labels (He et al., 10 Jan 2025); EmoSet and the large-scale EmoTIPS text–image–emotion dataset for spectrum training (Zhang et al., 24 May 2025).
- Custom datasets with LVLM labeling: EmoFeedback² defines continuous VA ground truth by mapping discrete emotion labels to class-wise Gaussians in VA space, as illustrated in the sketch after this list (Jia et al., 25 Nov 2025).
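One plausible reading of the label-to-Gaussian mapping above, with purely illustrative means and spreads (not EmoFeedback²'s values):

```python
import numpy as np

# Hypothetical class-wise Gaussians over (valence, arousal), both in [-1, 1]
VA_GAUSSIANS = {
    "amusement":   {"mean": ( 0.8,  0.5), "std": 0.1},
    "contentment": {"mean": ( 0.7, -0.3), "std": 0.1},
    "sadness":     {"mean": (-0.7, -0.4), "std": 0.1},
    "fear":        {"mean": (-0.6,  0.7), "std": 0.1},
}

def sample_continuous_label(emotion: str, rng=np.random.default_rng()) -> np.ndarray:
    """Turn a discrete emotion label into a continuous VA ground-truth sample."""
    g = VA_GAUSSIANS[emotion]
    return np.clip(rng.normal(g["mean"], g["std"]), -1.0, 1.0)
```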
Evaluation Metrics
- Emotion alignment: V-Error and A-Error (mean absolute error between predicted and target valence/arousal; see the sketch after this list); classifier softmax confidence curves for expression transitions.
- Perceptual quality: CLIPScore, CLIP-IQA, LPIPS-Continuous (adjacent image perceptual distances), FID (Inception), Semantic Clarity.
- Temporal/video coherence: FVD, LPIPS (video), landmark distances, lip-sync accuracy (LSE-D/LSE-C).
- User studies: 5-point Likert ratings for emotional expressiveness, visual realism, and smoothness; Kendall’s τ for ranking consistency.
- Spectrum-specific: KL divergence between model-predicted and target emotion distributions (AIEdiT); ablations on embedding structure and supervision.
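A small sketch of the emotion-alignment errors referenced above; the emotion predictor supplying `pred_va` is a stand-in:

```python
import numpy as np

def va_errors(pred_va: np.ndarray, target_va: np.ndarray):
    """pred_va, target_va: (N, 2) arrays of (valence, arousal) values.
    Returns (V-Error, A-Error) as mean absolute errors per axis."""
    abs_err = np.abs(pred_va - target_va)
    return abs_err[:, 0].mean(), abs_err[:, 1].mean()

# usage: v_err, a_err = va_errors(emotion_predictor(images), target_va)
```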
5. Applications and Use Cases
C-EICG supports a wide range of applications:
- Facial Avatar Synthesis: Data augmentation for facial emotion recognition or HCI avatars, with smooth, photorealistic transitions for affective interaction (Mertes et al., 2022).
- Affective Media Generation: Text-prompted image content with fine-grained control, supporting affective advertising and storytelling (He et al., 10 Jan 2025, Zhang et al., 24 May 2025).
- Expressive Talking Head Video: Generating animated portraits with synchronized speech and user-controlled affect, targeting virtual agents and video conferencing enhancement (Zhang et al., 11 Sep 2024, Zhang et al., 2023).
- Iterative Emotional Editing: Real-time, user-in-the-loop image and prompt refinement for creative control and image enhancement (Jia et al., 25 Nov 2025).
- Affective Image Editing: Semantic manipulation of images along the emotional spectrum for mood-specific photo editing and stylization (Zhang et al., 24 May 2025).
6. Limitations and Future Directions
Current C-EICG systems encounter several open challenges:
- Linearity of emotional paths: Real-world affective transitions may not correspond to linear traversals in embedding or VA spaces; current interpolations occasionally display non-monotonicities (Mertes et al., 2022).
- Annotation quality and dimensionality: Control over arousal remains less reliable than valence; ambiguity and subjectivity in emotion labeling persist across datasets (He et al., 10 Jan 2025).
- Model bias and semantic drift: Over-reliance on human-centric scenes and semantic drift of prompts at extreme affective amplification; explicit semantic-preservation losses remain a future direction (He et al., 10 Jan 2025, Zhang et al., 24 May 2025).
- Granularity of control: Extensions beyond categorical or VA axes to compound and high-dimensional affective constructs are underexplored (Zhang et al., 11 Sep 2024).
- Interpretability of reward and decision process: The internal logic of LVLM-based reward assignment and corrective prompt generation is not transparent; interpretable process reward models remain an open question (Jia et al., 25 Nov 2025).
- Temporal coherency and multi-modal generalization: Real-time and multimodal storytelling, dynamic backgrounds, and cross-lingual affective generalization are not yet fully addressed in video-based C-EICG (Zhang et al., 11 Sep 2024, Zhang et al., 2023).
- Personalization and user-aligned affect calibration: There is a need for adaptive calibration of predictors and feedback models to individual user affective profiles (Jia et al., 25 Nov 2025).
Promising directions include integrating semantic-preservation losses, expanding emotion descriptors beyond VA into richer dimensional ontologies, augmenting training datasets with underrepresented affective states, and embedding C-EICG within interactive, closed-loop user workflows in creative and communication platforms.