AIEdiT: Text-Driven Affective Image Editing
- AIEdiT is a text-driven affective image editing framework that enables fine-grained, continuous edits to evoke precise emotional responses by mapping text to visual factors.
- It leverages a continuous emotional spectrum and contrastive triplet optimization to align subtle emotional cues with photorealistic image outputs under rigorous MLLM supervision.
- The multi-stage latent diffusion process and training on the EmoTIPS dataset ensure semantic clarity and robust emotional alignment, outperforming fixed-category methods.
AIEdiT is a text-driven affective image editing framework designed to evoke specific, nuanced emotions in images by adaptively shaping multiple visual and semantic factors under user-supplied textual requests. The system advances beyond prior methods that operate on coarse, discrete emotion categories or single-factor manipulations, offering continuous, fine-grained emotional edits coupled with photorealistic outputs and rigorous emotional supervision (Zhang et al., 24 May 2025).
1. Motivation and Conceptual Foundation
AIEdiT targets the task of affective image editing: given an original image $I$ and a user text describing a desired emotional outcome (e.g., "make this scene more serene and hopeful"), the framework modifies $I$ to produce an edited image $I'$ that reflects the requested affective state. This approach is motivated by the inherently ambiguous, continuous, and context-dependent nature of human emotion, which is insufficiently modeled by prior strategies that rely on a small, fixed set of emotion labels or on limited editing axes (e.g., only color or facial expression). AIEdiT addresses these limitations by:
- Learning a continuous, multi-dimensional "emotional spectrum" for nuanced affective representation;
- Translating abstract emotional requests into visually concrete edit instructions via an "emotional mapper";
- Supervising edits with a multimodal LLM (MLLM) to align edited content with the target emotion;
- Utilizing a frozen, pre-trained latent diffusion model for photorealistic realization.
This design enables free-form, fine-grained emotional edits under natural language guidance, surpassing traditional fixed-category frameworks in expressivity and precision.
2. Continuous Emotional Spectrum Construction
To represent subtle, gradated affective states, AIEdiT constructs a continuous emotional spectrum in a learned feature space.
2.1 Text and Image Encoding
User text requests are encoded by a BERT-based transformer, $E_T = \mathrm{BERT}(T) \in \mathbb{R}^{N \times d}$, where $d$ is the hidden size and $N$ is the token count. Images are characterized by a ResNet classifier pre-trained on EmoSet, which predicts a soft emotion distribution $P(e \mid I)$ over the discrete emotion categories (e.g., the sectors of Mikels' wheel).
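A minimal sketch of these two encoders is shown below. The specific checkpoints (`bert-base-uncased`, a torchvision `resnet50` with an 8-way head standing in for the EmoSet-pretrained classifier) and all variable names are illustrative assumptions, not the paper's released components.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

# Stand-in for the EmoSet-pretrained ResNet emotion classifier (8 classes assumed).
emotion_classifier = resnet50(weights=None)
emotion_classifier.fc = nn.Linear(emotion_classifier.fc.in_features, 8)

@torch.no_grad()
def encode_text(request: str) -> torch.Tensor:
    """Token-level emotional embeddings E_T of shape (N, d)."""
    tokens = tokenizer(request, return_tensors="pt")
    return text_encoder(**tokens).last_hidden_state.squeeze(0)

@torch.no_grad()
def emotion_distribution(image: torch.Tensor) -> torch.Tensor:
    """Soft emotion distribution over the 8 Mikels' wheel sectors for one image."""
    logits = emotion_classifier(image.unsqueeze(0))   # image: (3, H, W), normalized
    return logits.softmax(dim=-1).squeeze(0)          # (8,)
```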
2.2 Contrastive Triplet Optimization
Samples are structured as text–image–emotion tuples and grouped into anchor–positive–negative triplets, with positives sharing similar Mikels' wheel regions and negatives drawn from opposing sectors. The model minimizes the hinge-based triplet loss

$$\mathcal{L}_{\mathrm{tri}} = \max\big(0,\; d(a, p) - d(a, n) + m\big),$$

where $a$, $p$, and $n$ denote the anchor, positive, and negative embeddings, $d(\cdot,\cdot)$ is the distance in the learned embedding space, and $m$ is the margin. This contrastive regimen pulls closely matched affective text–image pairs together while pushing contrasting emotions apart, yielding a continuous, semantically meaningful embedding space. After this procedure, the spectrum encoder is frozen.
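As a hedged illustration, the snippet below implements a hinge-based triplet objective of this form with PyTorch's built-in `triplet_margin_loss`; the margin value and the Euclidean distance are assumptions, since the paper only specifies the general hinge formulation.

```python
import torch
import torch.nn.functional as F

def spectrum_triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                          negative: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Hinge-based triplet loss: mean of max(0, d(a, p) - d(a, n) + margin)."""
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

# Usage: anchor and positive come from nearby Mikels' wheel regions, the
# negative from an opposing sector; all are (B, d) spectrum embeddings.
a, p, n = (torch.randn(4, 768) for _ in range(3))
loss = spectrum_triplet_loss(a, p, n)
```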
3. Emotional Mapper Design
The emotional mapper translates continuous emotion embeddings into semantically actionable instructions aligned with latent diffusion spaces.
3.1 Multi-modal Inputs
The mapper receives:
- BERT-extracted emotional embeddings $E_T$;
- CLIP-based text semantics $C_T$;
- A key semantic embedding $S_k$, obtained via a learned linear projection.
3.2 Transformer Architecture with Semantic Modulation
A stack of transformer layers incorporates:
- Multi-head self-attention over the emotional embeddings;
- Cross-attention from emotional to semantic channels;
- Feedforward networks;
- SPADE-style affine modulation of the emotion features $h$ by the key semantics $S_k$ (a sketch follows this list):

$$h' = \gamma \odot \frac{h - \mu}{\sigma} + \beta, \qquad \gamma = W_\gamma S_k, \quad \beta = W_\beta S_k,$$

where $\odot$ denotes elementwise multiplication, $\mu$ and $\sigma$ are feature statistics, and $W_\gamma$, $W_\beta$ are learned matrices. This yields the final visually concrete semantic edit $E_S$.
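The following sketch shows how one such mapper layer could be assembled in PyTorch; layer widths, head counts, and the exact placement of the SPADE-style modulation are assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class MapperLayer(nn.Module):
    """Illustrative layer: self-attention, cross-attention, FFN, SPADE-style modulation."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # The key semantics S_k predict per-channel scale (gamma) and shift (beta).
        self.to_gamma = nn.Linear(dim, dim)
        self.to_beta = nn.Linear(dim, dim)

    def forward(self, emo: torch.Tensor, sem: torch.Tensor, key_sem: torch.Tensor):
        # emo: (B, N, d) emotional embeddings, sem: (B, M, d) CLIP text semantics,
        # key_sem: (B, d) projected key semantic embedding.
        emo = emo + self.self_attn(emo, emo, emo)[0]       # self-attention
        emo = emo + self.cross_attn(emo, sem, sem)[0]      # emotional -> semantic
        emo = emo + self.ffn(emo)                          # feedforward
        gamma = self.to_gamma(key_sem).unsqueeze(1)        # (B, 1, d)
        beta = self.to_beta(key_sem).unsqueeze(1)
        return gamma * self.norm(emo) + beta               # h' = gamma ⊙ ĥ + beta

# A stack of such layers would produce the visually concrete semantic edit E_S.
x = MapperLayer()(torch.randn(2, 16, 768), torch.randn(2, 77, 768), torch.randn(2, 768))
```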
4. Supervision with MLLM and Training Objectives
Because fully supervised target outputs for all conceivable emotional edits are unavailable, AIEdiT leverages a pretrained multimodal LLM (ShareGPT4V) for affective supervision.
4.1 MLLM-derived Guidance
Given an edited output $I'$, a fixed set of prompts queries the MLLM for assessments of relevant emotion factors (e.g., dominant color, object changes). The responses are encoded by a CLIP text encoder.
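A hedged sketch of this supervision signal is given below; `query_mllm` is a hypothetical placeholder for a ShareGPT4V call (no public API is described here), and the prompt wording is illustrative.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_txt = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Illustrative emotion-factor prompts; the paper's exact prompt set is not given.
FACTOR_PROMPTS = [
    "Describe the dominant color tone of this image.",
    "Which objects or scene elements shape its emotional impression?",
]

def query_mllm(image, prompt: str) -> str:
    """Placeholder for a ShareGPT4V call; the real interface is not shown here."""
    raise NotImplementedError

@torch.no_grad()
def encode_mllm_feedback(edited_image) -> torch.Tensor:
    """CLIP-encode the MLLM's factor-level answers about the edited image."""
    answers = [query_mllm(edited_image, p) for p in FACTOR_PROMPTS]
    tokens = clip_tok(answers, padding=True, return_tensors="pt")
    return clip_txt(**tokens).pooler_output            # (num_prompts, d)
```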
4.2 Sentiment and Diffusion Losses
The training objective comprises:
- Sentiment alignment loss $\mathcal{L}_{\mathrm{sent}}$, which penalizes the mismatch between the CLIP-encoded MLLM assessments of $I'$ and the CLIP encoding of the user's target text $T$.
- Diffusion reconstruction loss $\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{z, \epsilon, t}\big[\|\epsilon - \epsilon_\theta(z_t, t, E_S)\|_2^2\big]$, the standard noise-prediction objective of the latent diffusion backbone.
- Total loss $\mathcal{L} = \mathcal{L}_{\mathrm{diff}} + \lambda\, \mathcal{L}_{\mathrm{sent}}$, with weighting coefficient $\lambda$. Only the mapper is fine-tuned; the diffusion backbone and autoencoder remain frozen (a loss sketch follows below).
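The sketch below combines these terms under a diffusers-style `unet`/`scheduler` interface, assuming a standard noise-prediction loss and a cosine-based sentiment term; the exact form of $\mathcal{L}_{\mathrm{sent}}$ and the weight `lam` are assumptions.

```python
import torch
import torch.nn.functional as F

def training_loss(unet, mapper, scheduler, z0, t,
                  emo, sem, key_sem, user_text_clip, mllm_feedback_clip,
                  lam: float = 1.0) -> torch.Tensor:
    # Diffusion reconstruction loss on the frozen U-Net, conditioned on the
    # mapper's semantic edit E_S; only `mapper` receives gradient updates.
    noise = torch.randn_like(z0)
    zt = scheduler.add_noise(z0, noise, t)                    # noised latent z_t
    cond = mapper(emo, sem, key_sem)                          # semantic edit E_S
    pred = unet(zt, t, encoder_hidden_states=cond).sample     # predicted noise
    l_diff = F.mse_loss(pred, noise)

    # Sentiment alignment: the MLLM's CLIP-encoded description of the edit
    # should match the CLIP-encoded user request (cosine form is assumed).
    l_sent = 1 - F.cosine_similarity(mllm_feedback_clip, user_text_clip, dim=-1).mean()
    return l_diff + lam * l_sent
```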
5. Inference and Editing Mechanism
During inference, AIEdiT follows a multi-stage latent diffusion workflow:
- Latent Encoding: the input image $I$ is encoded to a latent $z_0$ by the latent autoencoder.
- Noise Addition: a noise level $t^*$, which determines the edit granularity, is applied to $z_0$ to obtain $z_{t^*}$.
- Conditioned Denoising: the mapper-augmented denoiser uses the semantic edit $E_S$ derived from the user's text to iteratively reconstruct a clean latent $\hat{z}_0$ from $z_{t^*}$.
- Decoding: the final denoised latent $\hat{z}_0$ is decoded to the edited image $I'$.
The choice of $t^*$ induces low-level (color), mid-level (object), or high-level (scene) semantic transformations. The result preserves photorealism while precisely steering multiple visual factors to evoke the specified emotion.
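A sketch of this editing loop with a frozen Stable Diffusion v1.5 backbone (stated in Section 6) is shown below; the diffusers-style calls, the use of `strength` as a proxy for the noise level $t^*$, and the mapper's `semantic_edit` interface are assumptions.

```python
import torch
from diffusers import AutoencoderKL, DDIMScheduler, UNet2DConditionModel

MODEL_ID = "runwayml/stable-diffusion-v1-5"   # illustrative checkpoint id
device = "cuda" if torch.cuda.is_available() else "cpu"
vae = AutoencoderKL.from_pretrained(MODEL_ID, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(MODEL_ID, subfolder="unet").to(device)
scheduler = DDIMScheduler.from_pretrained(MODEL_ID, subfolder="scheduler")

@torch.no_grad()
def affective_edit(image: torch.Tensor, semantic_edit: torch.Tensor,
                   strength: float = 0.6, steps: int = 50) -> torch.Tensor:
    # 1) Latent encoding of the input image (expected as (B, 3, H, W) in [-1, 1]).
    z0 = vae.encode(image.to(device)).latent_dist.sample() * vae.config.scaling_factor
    # 2) Noise addition: `strength` plays the role of t*, controlling edit
    #    granularity (color- vs. object- vs. scene-level changes).
    scheduler.set_timesteps(steps, device=device)
    n_edit = max(int(steps * strength), 1)
    timesteps = scheduler.timesteps[-n_edit:]
    zt = scheduler.add_noise(z0, torch.randn_like(z0), timesteps[:1])
    # 3) Conditioned denoising on the mapper's semantic edit E_S.
    for t in timesteps:
        eps = unet(zt, t, encoder_hidden_states=semantic_edit).sample
        zt = scheduler.step(eps, t, zt).prev_sample
    # 4) Decoding the denoised latent back to image space.
    return vae.decode(zt / vae.config.scaling_factor).sample
```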
6. Dataset, Evaluation, and Benchmarks
AIEdiT introduces the EmoTIPS dataset for model development and assessment:
- EmoTIPS: 1 million image–text pairs, with images drawn from EmoSet and each paired with multi-level, MLLM-generated emotional descriptions emphasizing evoked feelings.
- Test Partition: 3,000 reserved pairs, each with an annotated target emotion distribution.
Evaluation employs several quantitative and qualitative metrics:
| Metric | Description | Target |
|---|---|---|
| FID | Photorealism vs. real images | Minimize |
| Semantic Clarity (Sem-C) | Object/scene classification confidence (ImageNet, PLACES365) | Maximize |
| KLD | Divergence between predicted and target emotion (ResNet-50) | Minimize |
| User Preference | AMT user preference over baselines | Maximize |
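For concreteness, the KLD row of the table could be computed as below; the ResNet-50 evaluator weights, the 8-way distribution shape, and the direction of the divergence are assumptions.

```python
import torch
import torch.nn.functional as F

def emotion_kld(pred_logits: torch.Tensor, target_dist: torch.Tensor) -> torch.Tensor:
    """KL(target || predicted) over 8-way emotion distributions; lower is better."""
    log_pred = F.log_softmax(pred_logits, dim=-1)      # (B, 8) predicted log-probs
    return F.kl_div(log_pred, target_dist, reduction="batchmean")
```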
Validation procedures include VAD-based polarity checking, image/text emotional agreement, and text–image retrieval. Human raters in four experiments (4×100 samples×25 raters) rated over 90% of model outputs as “Acceptable” or “Perfect.”
Training uses a frozen Stable Diffusion v1.5 backbone, the Adam optimizer, and two RTX 3090 GPUs. Stage 1 (36 hours) trains the continuous emotional spectrum with the triplet loss $\mathcal{L}_{\mathrm{tri}}$; Stage 2 (96 hours) trains the emotional mapper with the combined objective $\mathcal{L}$.
7. Implications and Context
AIEdiT demonstrates that modeling affect on a continuous spectrum and mapping it through semantically adaptive editing instructions allows for more nuanced, context-aware, and user-driven manipulation of visual emotion. The integration of an MLLM supervisor circumvents the limitations of weakly labeled or incomplete supervision, enabling robust alignment between subjective emotional requests and visual outcomes. This approach shifts the paradigm from rigid category-based editing to a spectrum-based, multi-factor framework, aligning automated image editing more closely with the gradated nature of human affect (Zhang et al., 24 May 2025).