AIEdiT: Text-Driven Affective Image Editing

Updated 9 December 2025
  • AIEdiT is a text-driven affective image editing framework that enables fine-grained, continuous edits to evoke precise emotional responses by mapping text to visual factors.
  • It leverages a continuous emotional spectrum and contrastive triplet optimization to align subtle emotional cues with photorealistic image outputs under rigorous MLLM supervision.
  • The multi-stage latent diffusion process and training on the EmoTIPS dataset ensure semantic clarity and robust emotional alignment, outperforming fixed-category methods.

AIEdiT is a text-driven affective image editing framework designed to evoke specific, nuanced emotions in images by adaptively shaping multiple visual and semantic factors under user-supplied textual requests. The system advances beyond prior methods that operate on coarse, discrete emotion categories or single-factor manipulations, offering continuous, fine-grained emotional edits coupled with photorealistic outputs and rigorous emotional supervision (Zhang et al., 24 May 2025).

1. Motivation and Conceptual Foundation

AIEdiT targets the task of affective image editing: given an original image $I_{\rm in}$ and a user text $T$ describing a desired emotional outcome (e.g., "make this scene more serene and hopeful"), the framework modifies $I_{\rm in}$ to produce $I_{\rm out}$ that reflects the requested affective state. This approach is motivated by the inherently ambiguous, continuous, and context-dependent nature of human emotion, which is insufficiently modeled by prior strategies relying on a small, fixed set of emotion labels or on limited editing axes (e.g., only color or facial expression). AIEdiT addresses these limitations by:

  • Learning a continuous, multi-dimensional "emotional spectrum" for nuanced affective representation;
  • Translating abstract emotional requests into visually concrete edit instructions via an "emotional mapper";
  • Supervising edits with a multimodal LLM (MLLM) to align edited content with the target emotion;
  • Utilizing a frozen, pre-trained latent diffusion model for photorealistic realization.

This design enables free-form, fine-grained emotional edits under natural language guidance, surpassing traditional fixed-category frameworks in expressivity and precision.

2. Continuous Emotional Spectrum Construction

To represent subtle, gradated affective states, AIEdiT constructs a continuous emotional spectrum in a learned feature space.

2.1 Text and Image Encoding

User text requests are encoded by a BERT-based transformer $f_{\rm text}$:

$$r = f_{\rm text}(T) \in \mathbb{R}^{C^t \times N^l}$$

where $C^t$ is the hidden size and $N^l$ is the token count. Images are characterized by a ResNet classifier pre-trained on EmoSet, which predicts soft emotion distributions over $N^c$ discrete emotion categories (e.g., the sectors of Mikels' wheel):

$$d = f_{\rm resnet}(I) \in \mathbb{R}^{N^c}$$
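The following Python sketch illustrates the two encoders, assuming Hugging Face transformers and torchvision components; the number of emotion categories, the checkpoint path, and the pooling choices are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torchvision.models import resnet50

# --- Text branch: BERT token features r in R^{C^t x N^l} ---
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def encode_text(T: str) -> torch.Tensor:
    """Encode a user request T into token-level features r of shape (C^t, N^l)."""
    tokens = tokenizer(T, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**tokens).last_hidden_state   # (1, N^l, C^t)
    return hidden.squeeze(0).transpose(0, 1)        # (C^t, N^l)

# --- Image branch: ResNet emotion classifier d in R^{N^c} ---
N_C = 8  # e.g., the eight sectors of Mikels' wheel (assumption)
emotion_net = resnet50()
emotion_net.fc = nn.Linear(emotion_net.fc.in_features, N_C)
# emotion_net.load_state_dict(torch.load("emoset_resnet50.pt"))  # hypothetical EmoSet checkpoint
emotion_net.eval()

def encode_image(I: torch.Tensor) -> torch.Tensor:
    """Predict a soft emotion distribution d over N^c categories for an image batch (1, 3, H, W)."""
    with torch.no_grad():
        logits = emotion_net(I)
    return logits.softmax(dim=-1).squeeze(0)        # (N^c,)

# A sample s = (r, d) pairs the text embedding with the image's emotion distribution.
```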

2.2 Contrastive Triplet Optimization

Samples are structured as tuples $s = (r, d)$ and grouped into anchor–positive–negative triplets, with positives sharing similar Mikels' wheel regions and negatives drawn from opposing sectors. The model minimizes the hinge-based triplet loss

$$L_{\rm cl} = \sum_{i=1}^{N^p} \max\left(0,\; \text{dis}(s_i^{\rm anc}, s_i^{\rm pos}) - \text{dis}(s_i^{\rm anc}, s_i^{\rm neg}) + \alpha\right)$$

where

$$\text{dis}(s_i, s_j) = \frac{\|r_i - r_j\|_2}{\|d_i - d_j\|_2}$$

and $\alpha = 0.2$. This contrastive regimen aligns closely matched affective text–image pairs while pushing apart contrasting emotions, yielding a continuous, semantically meaningful embedding space. After this procedure, $f_{\rm text}$ is frozen.
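A minimal Python sketch of this objective, with dis implemented as the ratio of embedding distances defined above; the triplet sampling, batching, and numerical-stability details are assumptions.

```python
import torch

ALPHA = 0.2  # triplet margin used in the paper

def dis(s_i, s_j, eps: float = 1e-8) -> torch.Tensor:
    """dis(s_i, s_j) = ||r_i - r_j||_2 / ||d_i - d_j||_2 for samples s = (r, d)."""
    r_i, d_i = s_i
    r_j, d_j = s_j
    return torch.norm(r_i - r_j, p=2) / (torch.norm(d_i - d_j, p=2) + eps)

def contrastive_triplet_loss(anchors, positives, negatives) -> torch.Tensor:
    """L_cl = sum_i max(0, dis(anc_i, pos_i) - dis(anc_i, neg_i) + alpha).

    Positives share nearby Mikels' wheel sectors with the anchor; negatives come
    from opposing sectors (the sampling strategy itself is not shown here).
    """
    loss = torch.tensor(0.0)
    for s_anc, s_pos, s_neg in zip(anchors, positives, negatives):
        loss = loss + torch.clamp(dis(s_anc, s_pos) - dis(s_anc, s_neg) + ALPHA, min=0.0)
    return loss
```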

3. Emotional Mapper Design

The emotional mapper $M(\cdot; \theta_M)$ translates continuous emotion embeddings into semantically actionable instructions aligned with latent diffusion spaces.

3.1 Multi-modal Inputs

The mapper receives:

  • BERT-extracted emotional embeddings $r$;
  • CLIP-based text semantics $\hat{h} = f_{\rm clip\_text}(T) \in \mathbb{R}^{C^s \times N^l}$;
  • A key semantic embedding $f^k = W_k\, \text{mean}(\hat{h}) \in \mathbb{R}^{C^s}$, obtained via a learned linear projection.

3.2 Transformer Architecture with Semantic Modulation

A stack of $L$ transformer layers incorporates:

  • Multi-head self-attention over $[r; \hat{h}]$;
  • Cross-attention from emotional to semantic channels;
  • Feedforward networks;
  • SPADE-style affine modulation of the emotion features $f^r$ by the key semantics $f^k$:
    $$\bar{f}^r = (1 + W_1 f^k) \odot \left(\frac{f^r - \mu}{\sigma}\right) + W_2 f^k$$
    where $\odot$ denotes elementwise multiplication, $(\mu, \sigma)$ are feature statistics, and $W_1, W_2$ are learned matrices. This yields the final visually concrete semantic edit $S = M(r, \hat{h}, f^k; \theta_M) \in \mathbb{R}^{C^s \times N^l}$.
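A minimal PyTorch sketch of the SPADE-style modulation step, isolated from the surrounding attention and feedforward blocks; the layer shapes and the normalization granularity (here, statistics over all elements of $f^r$) are assumptions.

```python
import torch
import torch.nn as nn

class SemanticModulation(nn.Module):
    """SPADE-style affine modulation of emotion features f^r by key semantics f^k (sketch)."""

    def __init__(self, c_s: int, eps: float = 1e-5):
        super().__init__()
        self.w1 = nn.Linear(c_s, c_s)  # produces the scale term W1 f^k
        self.w2 = nn.Linear(c_s, c_s)  # produces the shift term W2 f^k
        self.eps = eps

    def forward(self, f_r: torch.Tensor, f_k: torch.Tensor) -> torch.Tensor:
        # f_r: (N^l, C^s) emotion features; f_k: (C^s,) key semantic embedding
        mu, sigma = f_r.mean(), f_r.std()
        normalized = (f_r - mu) / (sigma + self.eps)
        scale = 1.0 + self.w1(f_k)          # (C^s,), broadcast over the token dimension
        shift = self.w2(f_k)                # (C^s,)
        return scale * normalized + shift   # modulated features \bar{f}^r
```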

4. Supervision with MLLM and Training Objectives

Because fully supervised target outputs for all conceivable emotional edits are unavailable, AIEdiT leverages a pretrained multimodal LLM (ShareGPT4V) for affective supervision.

4.1 MLLM-derived Guidance

Given an edited output $I_r$, a fixed set of $N^r$ prompts queries the MLLM for relevant emotion-factor assessments (e.g., dominant color, object changes). The responses $x^r_i$ are encoded via a CLIP text encoder $\phi(\cdot)$.
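An illustrative sketch of this supervision signal: the probe wording and the query_mllm interface around ShareGPT4V are hypothetical, and only the CLIP text-encoding step reflects a concrete library call.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

# Fixed emotion-factor probes (illustrative wording; the paper's exact prompts differ).
PROBES = [
    "What is the dominant color tone of this image?",
    "Which objects most strongly shape the mood of this image?",
    "How would you describe the overall atmosphere of the scene?",
]

def phi(text: str) -> torch.Tensor:
    """CLIP text embedding used to compare MLLM answers against the target request."""
    tokens = clip_tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return clip_text(**tokens).pooler_output.squeeze(0)

def mllm_guidance(edited_image, query_mllm) -> list:
    """Query the MLLM once per probe and encode each textual answer with CLIP.

    `query_mllm(image, prompt) -> str` is a hypothetical wrapper around ShareGPT4V,
    not an actual library API.
    """
    return [phi(query_mllm(edited_image, p)) for p in PROBES]
```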

4.2 Sentiment and Diffusion Losses

The training objective comprises:

  • Sentiment alignment loss:
    $$L_{\rm sa} = \sum_{i=1}^{N^r} \|\phi(x^{t}) - \phi(x^r_i)\|_2$$
    where $x^{t}$ is the user's target text.
  • Diffusion reconstruction loss:
    $$L_{\rm dm} = \mathbb{E}_{t, z_0, \epsilon \sim \mathcal{N}(0,1)} \left[ \|\epsilon - \epsilon_\theta(z_t, t, \phi(x^t))\|_2 \right]$$
  • Total loss: $L_{\rm total} = L_{\rm sa} + \beta L_{\rm dm}$ with $\beta = 10$. Only the mapper $\theta_M$ is fine-tuned; the diffusion backbone and autoencoder remain frozen.
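A compact sketch of the combined objective under these definitions; tensor shapes and reduction details are assumptions.

```python
import torch

BETA = 10.0  # weight on the diffusion term

def sentiment_alignment_loss(phi_target: torch.Tensor, phi_responses) -> torch.Tensor:
    """L_sa = sum_i || phi(x^t) - phi(x^r_i) ||_2 over the encoded MLLM responses."""
    return sum(torch.norm(phi_target - phi_r, p=2) for phi_r in phi_responses)

def diffusion_loss(eps: torch.Tensor, eps_pred: torch.Tensor) -> torch.Tensor:
    """L_dm: noise-prediction error of the frozen denoiser conditioned on phi(x^t)."""
    return torch.norm(eps - eps_pred, p=2)

def total_loss(phi_target, phi_responses, eps, eps_pred) -> torch.Tensor:
    """L_total = L_sa + beta * L_dm; gradients flow only into the mapper parameters theta_M."""
    return sentiment_alignment_loss(phi_target, phi_responses) + BETA * diffusion_loss(eps, eps_pred)
```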

5. Inference and Editing Mechanism

During inference, AIEdiT follows a multi-stage latent diffusion workflow:

  1. Latent Encoding: The input image $I_{\rm in}$ is encoded to $z_0$ via the latent autoencoder.
  2. Noise Addition: A noise level $t$ determines edit granularity; noise is applied to $z_0$ to obtain $z_t$.
  3. Conditioned Denoising: The mapper-augmented denoiser uses semantic edits derived from the user's text to iteratively reconstruct the latent:
     $$z_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{1-\alpha_t}{\sqrt{1-\prod_{s=1}^{t}\alpha_s}}\,\epsilon_\theta(z_t, t, \phi(T))\right) + \sigma_t \epsilon$$
  4. Decoding: The final denoised latent is decoded to $I_{\rm out}$.

The choice of $t$ induces low-level (color), mid-level (object), or high-level (scene) semantic transformations. The result preserves photorealism while precisely steering multiple visual factors to evoke the specified emotion.
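A schematic sketch of this inference loop under the equations above; vae, unet, mapper, and the noise schedules are placeholders for the frozen Stable Diffusion components and are not the authors' actual interfaces.

```python
import torch

@torch.no_grad()
def affective_edit(I_in, T, vae, unet, mapper, alphas, sigmas, t_start):
    """Schematic multi-stage editing loop (sketch; all component interfaces are assumptions).

    t_start controls edit granularity: small t_start -> low-level (color) changes,
    larger t_start -> mid/high-level (object, scene) changes.
    """
    # 1. Latent encoding of the input image.
    z0 = vae.encode(I_in)

    # 2. Partial noising of z0 up to the chosen level t_start.
    alpha_bar = torch.prod(alphas[:t_start])
    noise = torch.randn_like(z0)
    z_t = torch.sqrt(alpha_bar) * z0 + torch.sqrt(1 - alpha_bar) * noise

    # 3. Conditioned denoising; cond stands in for the semantic edit S from Section 3.
    cond = mapper(T)
    for t in range(t_start, 0, -1):
        eps_pred = unet(z_t, t, cond)
        alpha_t = alphas[t - 1]
        alpha_bar_t = torch.prod(alphas[:t])
        z_t = (z_t - (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)
        if t > 1:
            z_t = z_t + sigmas[t - 1] * torch.randn_like(z_t)

    # 4. Decode the edited latent back to image space.
    return vae.decode(z_t)
```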

6. Dataset, Evaluation, and Benchmarks

AIEdiT introduces the EmoTIPS dataset for model development and assessment:

  • EmoTIPS: 1 million image–text pairs; the images are drawn from EmoSet, and each is paired with a multi-level, MLLM-generated emotional description that emphasizes feelings.
  • Test Partition: 3,000 reserved pairs, each with an annotated target emotion distribution $d^*$.

Evaluation employs several quantitative and qualitative metrics:

  • FID: photorealism relative to real images (minimize);
  • Semantic Clarity (Sem-C): object/scene classification confidence on ImageNet and Places365 (maximize);
  • KLD: divergence between predicted and target emotion distributions, computed with ResNet-50 (minimize);
  • User Preference: AMT user preference over baselines (maximize).
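A short sketch of the emotion-alignment metric, assuming both inputs are soft probability vectors over the $N^c$ categories produced by the EmoSet-pretrained ResNet-50; the direction of the divergence is an assumption.

```python
import torch

def emotion_kld(pred_dist: torch.Tensor, target_dist: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KL divergence D_KL(target || pred) between emotion distributions (lower is better).

    Both inputs are soft distributions over the N^c emotion categories; clamping
    avoids log(0) for near-zero probabilities.
    """
    pred = pred_dist.clamp_min(eps)
    target = target_dist.clamp_min(eps)
    return torch.sum(target * (target.log() - pred.log()), dim=-1)
```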

Validation procedures include VAD-based polarity checking, image/text emotional agreement, and text–image retrieval. Human raters in four experiments (4×100 samples×25 raters) rated over 90% of model outputs as “Acceptable” or “Perfect.”

Training uses a frozen Stable Diffusion v1.5 backbone, the Adam optimizer (learning rate $5 \times 10^{-5}$), and two RTX 3090 GPUs. Stage 1 (36 hours) trains the continuous spectrum with $L_{\rm cl}$; Stage 2 (96 hours) trains the emotional mapper with $L_{\rm total}$.

7. Implications and Context

AIEdiT demonstrates that modeling affect on a continuous spectrum and mapping it through semantically adaptive editing instructions allows for more nuanced, context-aware, and user-driven manipulation of visual emotion. The integration of an MLLM supervisor circumvents the limitations of weakly labeled or incomplete supervision, enabling robust alignment between subjective emotional requests and visual outcomes. This approach shifts the paradigm from rigid category-based editing to a spectrum-based, multi-factor editable framework, aligning automated image editing more closely with the gradated nature of human affect (Zhang et al., 24 May 2025).
