AIEdiT: Text-Driven Affective Image Editing

Updated 9 December 2025
  • AIEdiT is a text-driven affective image editing framework that enables fine-grained, continuous edits to evoke precise emotional responses by mapping text to visual factors.
  • It leverages a continuous emotional spectrum and contrastive triplet optimization to align subtle emotional cues with photorealistic image outputs under rigorous MLLM supervision.
  • The multi-stage latent diffusion process and training on the EmoTIPS dataset ensure semantic clarity and robust emotional alignment, outperforming fixed-category methods.

AIEdiT is a text-driven affective image editing framework designed to evoke specific, nuanced emotions in images by adaptively shaping multiple visual and semantic factors under user-supplied textual requests. The system advances beyond prior methods that operate on coarse, discrete emotion categories or single-factor manipulations, offering continuous, fine-grained emotional edits coupled with photorealistic outputs and rigorous emotional supervision (Zhang et al., 24 May 2025).

1. Motivation and Conceptual Foundation

AIEdiT targets the task of affective image editing: given an original image $I_{\rm in}$ and a user text $T$ describing a desired emotional outcome (e.g., "make this scene more serene and hopeful"), the framework modifies $I_{\rm in}$ to produce $I_{\rm out}$ that reflects the requested affective state. This approach is motivated by the inherently ambiguous, continuous, and context-dependent nature of human emotion, which is insufficiently modeled by prior strategies relying on a small, fixed set of emotion labels or on limited editing axes (e.g., only color or facial expression). AIEdiT addresses these limitations by:

  • Learning a continuous, multi-dimensional "emotional spectrum" for nuanced affective representation;
  • Translating abstract emotional requests into visually concrete edit instructions via an "emotional mapper";
  • Supervising edits with a multimodal LLM (MLLM) to align edited content with the target emotion;
  • Utilizing a frozen, pre-trained latent diffusion model for photorealistic realization.

This design enables free-form, fine-grained emotional edits under natural language guidance, surpassing traditional fixed-category frameworks in expressivity and precision.

2. Continuous Emotional Spectrum Construction

To represent subtle, gradated affective states, AIEdiT constructs a continuous emotional spectrum in a learned feature space.

2.1 Text and Image Encoding

User text requests are encoded by a BERT-based transformer $f_{\rm text}$:

$$r = f_{\rm text}(T) \in \mathbb{R}^{C^t \times N^l}$$

where $C^t$ is the hidden size and $N^l$ is the token count. Images are characterized by a ResNet classifier pre-trained on EmoSet, which predicts soft emotion distributions over $N^c$ discrete emotion categories (e.g., the sectors of Mikels' wheel):

$$d = f_{\rm resnet}(I) \in \mathbb{R}^{N^c}$$
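The following Python sketch illustrates the two encoders, assuming Hugging Face transformers and torchvision components; the number of emotion categories, the checkpoint path, and the pooling choices are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torchvision.models import resnet50

# --- Text branch: BERT token features r in R^{C^t x N^l} ---
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def encode_text(T: str) -> torch.Tensor:
    """Encode a user request T into token-level features r of shape (C^t, N^l)."""
    tokens = tokenizer(T, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**tokens).last_hidden_state   # (1, N^l, C^t)
    return hidden.squeeze(0).transpose(0, 1)        # (C^t, N^l)

# --- Image branch: ResNet emotion classifier d in R^{N^c} ---
N_C = 8  # e.g., the eight sectors of Mikels' wheel (assumption)
emotion_net = resnet50()
emotion_net.fc = nn.Linear(emotion_net.fc.in_features, N_C)
# emotion_net.load_state_dict(torch.load("emoset_resnet50.pt"))  # hypothetical EmoSet checkpoint
emotion_net.eval()

def encode_image(I: torch.Tensor) -> torch.Tensor:
    """Predict a soft emotion distribution d over N^c categories for an image batch (1, 3, H, W)."""
    with torch.no_grad():
        logits = emotion_net(I)
    return logits.softmax(dim=-1).squeeze(0)        # (N^c,)

# A sample s = (r, d) pairs the text embedding with the image's emotion distribution.
```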

2.2 Contrastive Triplet Optimization

Samples are structured as tuples $s = (r, d)$ and grouped into anchor–positive–negative triplets, with positives sharing similar Mikels' wheel regions and negatives drawn from opposing sectors. The model minimizes the hinge-based triplet loss

$$L_{\rm cl} = \sum_{i=1}^{N^p} \max\left(0,\; \text{dis}(s_i^{\rm anc}, s_i^{\rm pos}) - \text{dis}(s_i^{\rm anc}, s_i^{\rm neg}) + \alpha\right)$$

where

$$\text{dis}(s_i, s_j) = \frac{\|r_i - r_j\|_2}{\|d_i - d_j\|_2}$$

and $\alpha = 0.2$. This contrastive regimen aligns closely matched affective text–image pairs while pushing apart contrasting emotions, yielding a continuous, semantically meaningful embedding space. After this procedure, $f_{\rm text}$ is frozen.
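A minimal Python sketch of this objective, with dis implemented as the ratio of embedding distances defined above; the triplet sampling, batching, and numerical-stability details are assumptions.

```python
import torch

ALPHA = 0.2  # triplet margin used in the paper

def dis(s_i, s_j, eps: float = 1e-8) -> torch.Tensor:
    """dis(s_i, s_j) = ||r_i - r_j||_2 / ||d_i - d_j||_2 for samples s = (r, d)."""
    r_i, d_i = s_i
    r_j, d_j = s_j
    return torch.norm(r_i - r_j, p=2) / (torch.norm(d_i - d_j, p=2) + eps)

def contrastive_triplet_loss(anchors, positives, negatives) -> torch.Tensor:
    """L_cl = sum_i max(0, dis(anc_i, pos_i) - dis(anc_i, neg_i) + alpha).

    Positives share nearby Mikels' wheel sectors with the anchor; negatives come
    from opposing sectors (the sampling strategy itself is not shown here).
    """
    loss = torch.tensor(0.0)
    for s_anc, s_pos, s_neg in zip(anchors, positives, negatives):
        loss = loss + torch.clamp(dis(s_anc, s_pos) - dis(s_anc, s_neg) + ALPHA, min=0.0)
    return loss
```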

3. Emotional Mapper Design

The emotional mapper $M(\cdot; \theta_M)$ translates continuous emotion embeddings into semantically actionable instructions aligned with latent diffusion spaces.

3.1 Multi-modal Inputs

The mapper receives:

  • BERT-extracted emotional embeddings $r$;
  • CLIP-based text semantics $\hat{h} = f_{\rm clip\_text}(T) \in \mathbb{R}^{C^s \times N^l}$;
  • A key semantic embedding $f^k = W_k\, \text{mean}(\hat{h}) \in \mathbb{R}^{C^s}$, obtained via a learned linear projection.

3.2 Transformer Architecture with Semantic Modulation

A stack of $L$ transformer layers incorporates:

  • Multi-head self-attention over $[r; \hat{h}]$;
  • Cross-attention from emotional to semantic channels;
  • Feedforward networks;
  • SPADE-style affine modulation of the emotion features $f^r$ by the key semantics $f^k$:
    $$\bar{f}^r = (1 + W_1 f^k) \odot \left(\frac{f^r - \mu}{\sigma}\right) + W_2 f^k$$
    where $\odot$ denotes elementwise multiplication, $(\mu, \sigma)$ are feature statistics, and $W_1, W_2$ are learned matrices. This yields the final visually concrete semantic edit $S = M(r, \hat{h}, f^k; \theta_M) \in \mathbb{R}^{C^s \times N^l}$.
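A minimal PyTorch sketch of the SPADE-style modulation step, isolated from the surrounding attention and feedforward blocks; the layer shapes and the normalization granularity (here, statistics over all elements of $f^r$) are assumptions.

```python
import torch
import torch.nn as nn

class SemanticModulation(nn.Module):
    """SPADE-style affine modulation of emotion features f^r by key semantics f^k (sketch)."""

    def __init__(self, c_s: int, eps: float = 1e-5):
        super().__init__()
        self.w1 = nn.Linear(c_s, c_s)  # produces the scale term W1 f^k
        self.w2 = nn.Linear(c_s, c_s)  # produces the shift term W2 f^k
        self.eps = eps

    def forward(self, f_r: torch.Tensor, f_k: torch.Tensor) -> torch.Tensor:
        # f_r: (N^l, C^s) emotion features; f_k: (C^s,) key semantic embedding
        mu, sigma = f_r.mean(), f_r.std()
        normalized = (f_r - mu) / (sigma + self.eps)
        scale = 1.0 + self.w1(f_k)          # (C^s,), broadcast over the token dimension
        shift = self.w2(f_k)                # (C^s,)
        return scale * normalized + shift   # modulated features \bar{f}^r
```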

4. Supervision with MLLM and Training Objectives

Because fully supervised target outputs for all conceivable emotional edits are unavailable, AIEdiT leverages a pretrained multimodal LLM (ShareGPT4V) for affective supervision.

4.1 MLLM-derived Guidance

Given an edited output $I_r$, a fixed set of $N^r$ prompts queries the MLLM for relevant emotion-factor assessments (e.g., dominant color, object changes). The responses $x^r_i$ are encoded via a CLIP text encoder $\phi(\cdot)$.
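An illustrative sketch of this supervision signal: the probe wording and the query_mllm interface around ShareGPT4V are hypothetical, and only the CLIP text-encoding step reflects a concrete library call.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

# Fixed emotion-factor probes (illustrative wording; the paper's exact prompts differ).
PROBES = [
    "What is the dominant color tone of this image?",
    "Which objects most strongly shape the mood of this image?",
    "How would you describe the overall atmosphere of the scene?",
]

def phi(text: str) -> torch.Tensor:
    """CLIP text embedding used to compare MLLM answers against the target request."""
    tokens = clip_tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return clip_text(**tokens).pooler_output.squeeze(0)

def mllm_guidance(edited_image, query_mllm) -> list:
    """Query the MLLM once per probe and encode each textual answer with CLIP.

    `query_mllm(image, prompt) -> str` is a hypothetical wrapper around ShareGPT4V,
    not an actual library API.
    """
    return [phi(query_mllm(edited_image, p)) for p in PROBES]
```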

4.2 Sentiment and Diffusion Losses

The training objective comprises:

  • Sentiment alignment loss:
    $$L_{\rm sa} = \sum_{i=1}^{N^r} \|\phi(x^{t}) - \phi(x^r_i)\|_2$$
    where $x^{t}$ is the user's target text.
  • Diffusion reconstruction loss:
    $$L_{\rm dm} = \mathbb{E}_{t, z_0, \epsilon \sim \mathcal{N}(0,1)} \left[ \|\epsilon - \epsilon_\theta(z_t, t, \phi(x^t))\|_2 \right]$$
  • Total loss: $L_{\rm total} = L_{\rm sa} + \beta L_{\rm dm}$ with $\beta = 10$. Only the mapper $\theta_M$ is fine-tuned; the diffusion backbone and autoencoder remain frozen.
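A compact sketch of the combined objective under these definitions; tensor shapes and reduction details are assumptions.

```python
import torch

BETA = 10.0  # weight on the diffusion term

def sentiment_alignment_loss(phi_target: torch.Tensor, phi_responses) -> torch.Tensor:
    """L_sa = sum_i || phi(x^t) - phi(x^r_i) ||_2 over the encoded MLLM responses."""
    return sum(torch.norm(phi_target - phi_r, p=2) for phi_r in phi_responses)

def diffusion_loss(eps: torch.Tensor, eps_pred: torch.Tensor) -> torch.Tensor:
    """L_dm: noise-prediction error of the frozen denoiser conditioned on phi(x^t)."""
    return torch.norm(eps - eps_pred, p=2)

def total_loss(phi_target, phi_responses, eps, eps_pred) -> torch.Tensor:
    """L_total = L_sa + beta * L_dm; gradients flow only into the mapper parameters theta_M."""
    return sentiment_alignment_loss(phi_target, phi_responses) + BETA * diffusion_loss(eps, eps_pred)
```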

5. Inference and Editing Mechanism

During inference, AIEdiT follows a multi-stage latent diffusion workflow:

  1. Latent Encoding: The input image $I_{\rm in}$ is encoded to $z_0$ via the latent autoencoder.
  2. Noise Addition: A noise level $t$ determines edit granularity; noise is applied to $z_0$ to obtain $z_t$.
  3. Conditioned Denoising: The mapper-augmented denoiser uses semantic edits derived from the user's text to iteratively reconstruct the latent:
     $$z_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{1-\alpha_t}{\sqrt{1-\prod_{s=1}^{t}\alpha_s}}\,\epsilon_\theta(z_t, t, \phi(T))\right) + \sigma_t \epsilon$$
  4. Decoding: The final denoised latent is decoded to $I_{\rm out}$.

The choice of $t$ induces low-level (color), mid-level (object), or high-level (scene) semantic transformations. The result preserves photorealism while precisely steering multiple visual factors to evoke the specified emotion.
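A schematic sketch of this inference loop under the equations above; vae, unet, mapper, and the noise schedules are placeholders for the frozen Stable Diffusion components and are not the authors' actual interfaces.

```python
import torch

@torch.no_grad()
def affective_edit(I_in, T, vae, unet, mapper, alphas, sigmas, t_start):
    """Schematic multi-stage editing loop (sketch; all component interfaces are assumptions).

    t_start controls edit granularity: small t_start -> low-level (color) changes,
    larger t_start -> mid/high-level (object, scene) changes.
    """
    # 1. Latent encoding of the input image.
    z0 = vae.encode(I_in)

    # 2. Partial noising of z0 up to the chosen level t_start.
    alpha_bar = torch.prod(alphas[:t_start])
    noise = torch.randn_like(z0)
    z_t = torch.sqrt(alpha_bar) * z0 + torch.sqrt(1 - alpha_bar) * noise

    # 3. Conditioned denoising; cond stands in for the semantic edit S from Section 3.
    cond = mapper(T)
    for t in range(t_start, 0, -1):
        eps_pred = unet(z_t, t, cond)
        alpha_t = alphas[t - 1]
        alpha_bar_t = torch.prod(alphas[:t])
        z_t = (z_t - (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)
        if t > 1:
            z_t = z_t + sigmas[t - 1] * torch.randn_like(z_t)

    # 4. Decode the edited latent back to image space.
    return vae.decode(z_t)
```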

6. Dataset, Evaluation, and Benchmarks

AIEdiT introduces the EmoTIPS dataset for model development and assessment:

  • EmoTIPS: 1 million image–text pairs; the images are drawn from EmoSet, and each is paired with a multi-level, MLLM-generated emotional description that emphasizes feelings.
  • Test Partition: 3,000 reserved pairs, each with an annotated target emotion distribution $d^*$.

Evaluation employs several quantitative and qualitative metrics:

  • FID: photorealism relative to real images (minimize);
  • Semantic Clarity (Sem-C): object/scene classification confidence on ImageNet and Places365 (maximize);
  • KLD: divergence between predicted and target emotion distributions, computed with ResNet-50 (minimize);
  • User Preference: AMT user preference over baselines (maximize).
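A short sketch of the emotion-alignment metric, assuming both inputs are soft probability vectors over the $N^c$ categories produced by the EmoSet-pretrained ResNet-50; the direction of the divergence is an assumption.

```python
import torch

def emotion_kld(pred_dist: torch.Tensor, target_dist: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KL divergence D_KL(target || pred) between emotion distributions (lower is better).

    Both inputs are soft distributions over the N^c emotion categories; clamping
    avoids log(0) for near-zero probabilities.
    """
    pred = pred_dist.clamp_min(eps)
    target = target_dist.clamp_min(eps)
    return torch.sum(target * (target.log() - pred.log()), dim=-1)
```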

Validation procedures include VAD-based polarity checking, image/text emotional agreement, and text–image retrieval. Human raters in four experiments (4×100 samples×25 raters) rated over 90% of model outputs as “Acceptable” or “Perfect.”

Training uses a frozen Stable Diffusion v1.5 backbone, the Adam optimizer (learning rate $5 \times 10^{-5}$), and two RTX 3090 GPUs. Stage 1 (36 hours) trains the continuous spectrum with $L_{\rm cl}$; Stage 2 (96 hours) trains the emotional mapper with $L_{\rm total}$.

7. Implications and Context

AIEdiT demonstrates that modeling affect on a continuous spectrum and mapping it through semantically adaptive editing instructions allows for more nuanced, context-aware, and user-driven manipulation of visual emotion. The integration of an MLLM supervisor circumvents the limitations of weakly labeled or incomplete supervision, enabling robust alignment between subjective emotional requests and visual outcomes. This approach shifts the paradigm from rigid category-based editing to a spectrum-based, multi-factor editable framework, aligning automated image editing more closely with the gradated nature of human affect (Zhang et al., 24 May 2025).
