S²Edit: Identity-Preserving Image Editing

Updated 4 July 2026

The paper introduces S²Edit, a method that uses a localized identity token and token-specific cross-attention masking to achieve precise, identity-preserving edits.
It employs a two-stage pipeline with per-identity fine-tuning and inference using Null-text inversion to balance identity preservation, prompt alignment, and spatial control.
Empirical results show S²Edit achieves superior FID, LPIPS, and PSNR scores on faces and non-face objects compared to existing diffusion-based editing approaches.

S$^2$Edit is a text-guided image editing method based on latent diffusion that targets a particularly constrained regime of controllable generation: precise, localized modification of real images, especially faces, while preserving identity and high-frequency detail under natural-language instructions. Its central premise is that high-fidelity personalized editing requires a learned identity representation that is both semantically disentangled from editable attributes and spatially restricted to the relevant object region. To this end, S$^2$Edit combines per-identity fine-tuning, an orthogonality constraint in textual feature space, and token-specific cross-attention masking within Stable Diffusion v1-4 (Liu et al., 7 Jul 2025).

1. Problem formulation and design objective

The task addressed by S$^2$Edit is: given a real input image $\mathcal{I}$ and a text edit prompt $\mathcal{P}^*$, generate an edited image $\mathcal{I}^*$ that preserves identity and high-frequency details, applies only the requested attribute changes, and localizes edits spatially. The motivating examples are face edits such as adding bangs, changing expression, or applying lipstick, where small failures in geometry, texture, or region specificity are visually salient (Liu et al., 7 Jul 2025).

The method is organized around a three-way trade-off: identity preservation, prompt alignment, and spatial or localized control. In the described problem setting, naive applications of existing diffusion-based editing and personalization methods are said to fail for several distinct reasons. Prompt-based editing methods such as Prompt-to-Prompt, Null-text inversion, and Imagic may regenerate large portions of the image, causing identity drift and weak locality. Personalization methods such as DreamBooth, Textual Inversion, and Custom Diffusion can learn an identity token, but that token is often entangled with training-time attributes and lacks explicit spatial control, so edits may spread to background or irrelevant regions (Liu et al., 7 Jul 2025).

S$^2$Edit addresses this by learning what may be called a “localized identity token”: a personalized token [I] intended to encode identity-specific information while remaining disentangled from editable attributes and confined to the object region of interest. This suggests that the method treats identity not merely as a global semantic concept, but as a controlled conditioning signal whose semantics and spatial support are both explicitly regularized.

2. Two-stage pipeline and personalized identity representation

S$^2$Edit uses a two-stage pipeline. The first stage is identity fine-tuning. Starting from a single source image $\mathcal{I}$ and a source prompt $\mathcal{P}$, the method inserts a special identity token [I] into the prompt, producing an enhanced prompt. The embedding of [I] is randomly initialized and optimized during fine-tuning. Both the Stable Diffusion v1-4 UNet and the CLIP-like text encoder are fine-tuned, rather than only the token or cross-attention parameters, because parameter-efficient variants were found to be suboptimal for this task (Liu et al., 7 Jul 2025).

The fine-tuning objective combines standard diffusion reconstruction with semantic regularization:

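$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda\,\mathcal{L}_{\text{sem}},$$

where $\mathcal{L}_{\text{rec}}$ is the reconstruction term, $\mathcal{L}_{\text{sem}}$ is the semantic regularizer, and $\lambda$ weights the semantic term. The reconstruction term is the standard latent diffusion loss over noisy latents and predicted noise, conditioned on the prompt containing [I]. The intended effect is that [I] captures identity from a single image and becomes reusable across multiple future prompts such as “a [I] lady with bangs” or “a [I] woman, smiling, with red lipstick” (Liu et al., 7 Jul 2025).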
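The combination of the two terms can be illustrated with a minimal sketch, assuming they are added as a weighted sum and that the reconstruction term is a mean squared error between sampled and predicted noise. The function names, the toy noise vectors, and the values passed for `sem` and `lam` are placeholders chosen for illustration; they are not the paper's settings.

```python
import random


def reconstruction_loss(true_noise, predicted_noise):
    """Mean squared error between the sampled noise and the model's prediction."""
    assert len(true_noise) == len(predicted_noise)
    squared = [(t - p) ** 2 for t, p in zip(true_noise, predicted_noise)]
    return sum(squared) / len(squared)


def total_loss(rec, sem, lam):
    """Weighted sum of the reconstruction term and the semantic regularizer."""
    return rec + lam * sem


rng = random.Random(0)
true_noise = [rng.gauss(0.0, 1.0) for _ in range(8)]
# Stand-in for a UNet output: a prediction that is off by a constant 0.1.
predicted_noise = [t + 0.1 for t in true_noise]

rec = reconstruction_loss(true_noise, predicted_noise)
# sem and lam are illustrative placeholder values, not values from the paper.
loss = total_loss(rec, sem=0.5, lam=0.1)
```

In a real training loop both terms would be differentiable tensors and the sum would be backpropagated through the UNet and the text encoder; the sketch only shows how the scalar objective is assembled.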

The second stage is inference-time editing. After fine-tuning, the model is frozen and the original image is inverted by Null-text inversion into an initial latent noise state. A target prompt $\mathcal{P}^*$ is then formed by combining [I] with new attributes. Denoising proceeds from the inverted latent using a Prompt-to-Prompt-style cross-attention injection scheme, while reapplying the spatial masking mechanism learned during training. In this way, the method attempts localized editing from a real-image latent, rather than unconstrained regeneration from noise (Liu et al., 7 Jul 2025).

A notable consequence of this design is that fine-tuning and inversion are performed once per identity and source image, after which multiple edits can be produced without retraining. This suggests a workflow closer to personalized editing than to one-shot instruction following.

3. Semantic disentanglement and spatial control

The semantic component of S$^2$Edit is an orthogonality constraint between the learned identity token and the source prompt embedding. Let $e_I$ denote the identity token embedding and $e_{\mathcal{P}}$ the embedding of the original source prompt. The semantic loss $\mathcal{L}_{\text{sem}}$ penalizes alignment between the two embeddings, with the stated intent of forcing $e_I$ to be orthogonal to $e_{\mathcal{P}}$, that is, $\langle e_I, e_{\mathcal{P}} \rangle = 0$ (Liu et al., 7 Jul 2025).

The rationale is that if [I] aligns with the prompt embedding, it may absorb attributes that should remain editable, such as glasses, hair color, or expression. Orthogonality therefore pushes identity information into a different subspace than the textual attributes specified by the source prompt. A plausible implication is that later prompt substitutions, such as replacing “no bangs” with “bangs,” are less likely to be overridden by the identity representation itself.
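One way to picture an orthogonality penalty is the squared cosine similarity between the identity embedding and the source prompt embedding. This particular form is an assumption made for illustration, not necessarily the loss used in the paper; the vectors below are toy values and all names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (both assumed non-zero)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def orthogonality_penalty(identity_emb, prompt_emb):
    """Squared cosine similarity: 0 when orthogonal, 1 when parallel.

    Assumed form for illustration only.
    """
    return cosine_similarity(identity_emb, prompt_emb) ** 2


prompt_emb = [1.0, 0.0, 0.0]    # toy source-prompt embedding
entangled = [2.0, 0.0, 0.0]     # identity embedding aligned with the prompt
disentangled = [0.0, 3.0, 4.0]  # identity embedding in a different subspace

penalty_entangled = orthogonality_penalty(entangled, prompt_emb)
penalty_disentangled = orthogonality_penalty(disentangled, prompt_emb)
```

The aligned embedding receives the maximum penalty and the orthogonal one receives none, which matches the intuition that minimizing such a term pushes identity information away from the attributes named in the source prompt.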

The spatial component is implemented by masking the cross-attention map of [I]. The method first derives a coarse binary object mask $M$ from the cross-attention map of the prompt word that refers to the object of interest, such as “lady,” “cat,” or “church.” It then masks the identity token’s attention map $A_{[I]}$ as

$$\hat{A}_{[I]} = M \odot A_{[I]},$$

where $\odot$ denotes element-wise multiplication.

This masking is applied during both fine-tuning and inference, so [I] learns to influence only the object region rather than the entire image (Liu et al., 7 Jul 2025).

This differs from pixel-space inpainting. The edited region is not enforced by masking pixels at the input or output; rather, spatial restriction is imposed on the attention weights of a specific conditioning token. Within the paper’s formulation, this is the mechanism that turns [I] from a global personalization token into a spatially focused identity carrier.
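The token-specific masking can be sketched on small toy attention maps. The thresholding rule used to binarize the object word's attention is an assumption, since the text only states that the mask is coarse and binary; the threshold value, the 3x3 maps, and the function names are likewise illustrative.

```python
def binary_mask(object_attention, threshold):
    """Threshold the object word's attention map into a coarse 0/1 mask.

    The thresholding rule is an assumption made for this sketch.
    """
    return [[1.0 if a >= threshold else 0.0 for a in row]
            for row in object_attention]


def mask_identity_attention(identity_attention, mask):
    """Element-wise product: zero the identity token's attention outside the mask."""
    return [[a * m for a, m in zip(a_row, m_row)]
            for a_row, m_row in zip(identity_attention, mask)]


# Toy 3x3 cross-attention map for the object word (for example "lady").
object_attention = [
    [0.9, 0.8, 0.1],
    [0.7, 0.6, 0.1],
    [0.1, 0.1, 0.0],
]
# Toy 3x3 cross-attention map for the identity token [I].
identity_attention = [
    [0.5, 0.4, 0.3],
    [0.4, 0.5, 0.2],
    [0.3, 0.2, 0.1],
]

mask = binary_mask(object_attention, threshold=0.5)
masked = mask_identity_attention(identity_attention, mask)
```

Only the attention map of [I] is altered here; the maps of the other prompt tokens are left untouched, which is what distinguishes this from masking pixels.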

4. Training configuration, inference procedure, and compositional editing

The implementation uses Stable Diffusion v1-4 as the base model, full fine-tuning of the UNet and text encoder, AdamW with the learning rate reported in the paper, 200 fine-tuning steps, DDIM with 50 inference steps, and a single NVIDIA A100. Reported timings are approximately 95 seconds for fine-tuning, 113 seconds for Null-text inversion, and 9 seconds for each edited image generation. The method is evaluated on FFHQ and CelebA for faces, and on AFHQ, LSUN cat, and LSUN church for non-face objects (Liu et al., 7 Jul 2025).
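Because fine-tuning and inversion are paid once per identity and source image, the reported timings imply an amortized cost that falls as more edits are produced. The arithmetic below is a rough sketch that assumes the three stages run sequentially and that the approximate timings stay constant across edits; the constant and function names are illustrative.

```python
FINE_TUNE_SECONDS = 95   # reported, approximate, paid once
INVERSION_SECONDS = 113  # reported, approximate, paid once
PER_EDIT_SECONDS = 9     # reported, approximate, paid per edited image


def total_seconds(num_edits):
    """Wall-clock estimate for one identity and num_edits edits."""
    return FINE_TUNE_SECONDS + INVERSION_SECONDS + PER_EDIT_SECONDS * num_edits


def seconds_per_edit(num_edits):
    """Amortized cost per edit, including the one-time setup."""
    return total_seconds(num_edits) / num_edits


single_edit = total_seconds(1)     # 95 + 113 + 9 = 217
ten_edits = total_seconds(10)      # 95 + 113 + 90 = 298
amortized = seconds_per_edit(10)   # 298 / 10 = 29.8
```

Under these assumptions a single edit costs about 217 seconds end to end, while ten edits of the same image average about 30 seconds each.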

The inference pipeline is specified in six steps: prepare a source prompt $\mathcal{P}$, fine-tune with [I] and the two control mechanisms, invert the original image, construct the target edit prompt $\mathcal{P}^*$, denoise with classifier-free guidance, and decode with the VAE. The reported guidance-scale range is said to balance identity preservation and edit strength. Lower values preserve identity more strongly but weaken edits; higher values strengthen edits but increase the risk of identity drift (Liu et al., 7 Jul 2025).
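The effect of the guidance scale can be seen in the standard classifier-free guidance rule, which extrapolates from the unconditional noise prediction toward the prompt-conditioned one. The sketch below uses toy two-element predictions and illustrative scale values; neither the numbers nor the function name come from the paper.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]  # toy unconditional (null-text) noise prediction
eps_cond = [1.0, 3.0]    # toy prediction conditioned on the edit prompt

# scale = 1.0 reproduces the conditional prediction exactly.
plain = guided_noise(eps_uncond, eps_cond, 1.0)
# A larger scale pushes further along the prompt direction (illustrative value).
strong = guided_noise(eps_uncond, eps_cond, 3.0)
```

A larger scale amplifies the difference between the conditional and unconditional predictions, which is consistent with stronger edits at the cost of drifting further from the inverted source.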

Beyond attribute editing, S$^2$Edit includes a compositional editing mode exemplified by makeup transfer. In that setting, a second learned token [A] is introduced for an attribute from a reference image. [I] encodes source identity, [A] encodes makeup style, and a mixed prompt such as “a [I] lady with [A] makeup” is used to synthesize the final image. The method jointly fine-tunes on the source and reference image–prompt pairs while applying semantic and spatial control to both tokens (Liu et al., 7 Jul 2025).

The compositional setup indicates that S$^2$Edit is not limited to toggling attributes already latent in a source image. It can also combine independently learned subject and attribute tokens within the same prompt space, provided both are disentangled and spatially regulated.

5. Empirical performance and ablation results

Quantitative evaluation on FFHQ uses 150 images and 10 prompts per image, with FID, LPIPS, and PSNR as the main metrics. The reported table is as follows (Liu et al., 7 Jul 2025):

Method	FID $\downarrow$	LPIPS $\downarrow$	PSNR $\uparrow$
Null-text Inversion	67.61	0.18	30.29
InstructPix2Pix	56.98	0.15	30.48
SINE	107.56	0.38	28.56
DeltaEdit	86.41	0.30	29.01
S$^2$Edit	52.31	0.13	30.75

These results place S$^2$Edit first on all three metrics: best image quality by FID, best identity preservation by LPIPS, and best reconstruction fidelity by PSNR. The paper also reports a CLIP-versus-LPIPS trade-off analysis in which S$^2$Edit attains higher CLIP scores than baselines over a wide range of LPIPS values, indicating a better balance between prompt alignment and identity preservation (Liu et al., 7 Jul 2025).
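For readers unfamiliar with the fidelity metric, PSNR is computed from the mean squared error between two images as $10 \log_{10}(\text{MAX}^2 / \text{MSE})$. The sketch below applies that definition to flattened 8-bit pixel values; the pixel lists are toy data, and the assumption that the comparison is between the source image and the edited output is an interpretation, not a detail stated here.

```python
import math


def psnr(reference, edited, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    assert len(reference) == len(edited)
    mse = sum((r - e) ** 2 for r, e in zip(reference, edited)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [100, 120, 140, 160]  # toy source pixels
edited = [101, 119, 141, 159]     # every pixel off by one intensity level

value = psnr(reference, edited)   # MSE is 1, so about 48.13 dB
```

Higher values mean the output stays closer to the reference at the pixel level, which is why the metric is read here as reconstruction fidelity.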

In the user study, 40 participants evaluated 40 questions and selected which outputs best preserved identity and best matched the prompt. S$^2$Edit received 71.38% preference for identity preservation and 72.38% for prompt alignment, compared with 27.75% and 26.00% for Null-text inversion, 33.13% and 35.00% for InstructPix2Pix, 0.50% and 10.75% for SINE, and 30.00% and 49.75% for DeltaEdit (Liu et al., 7 Jul 2025).

The ablation study isolates the role of each component. Null-text inversion alone produces large changes and loses identity. Adding identity fine-tuning preserves identity but can fail to realize requested attributes, as in failed bangs edits. Adding semantic control improves attribute editability, but identity may still degrade if spatial control is absent. The full model, combining identity fine-tuning, semantic control, and spatial control, is reported to preserve identity, realize target attributes, and minimize background change. A separate guidance-scale ablation shows that stronger guidance makes attributes such as “angry” more pronounced, but also increases the risk of identity drift (Liu et al., 7 Jul 2025).

6. Relation to adjacent editing paradigms, applications, and limitations

S$^2$Edit occupies a narrower but more identity-sensitive regime than generalized editing frameworks. EditGAN formulates high-precision semantic image editing in GAN latent space through segmentation-driven latent optimization and reusable editing vectors, emphasizing mask-level control rather than text-conditioned personalized diffusion editing (Ling et al., 2021). AnyEdit and its AnySD model aim at unified instruction-based image editing across more than 20 editing types and five domains, prioritizing breadth of task coverage rather than single-image identity preservation (Yu et al., 2024). EditAR similarly pursues a unified conditional generator, but does so with a single autoregressive tokenization framework for editing and modality translation tasks such as depth-to-image and segmentation-to-image (Mu et al., 8 Jan 2025). A$^2$-Edit, by contrast, is a reference-guided inpainting framework for arbitrary object categories and ambiguous masks, centered on coarse-mask robustness and reference-object replacement rather than learned identity tokens for localized text-guided edits (Zheng et al., 11 Mar 2026).

Within its own scope, S$^2$Edit demonstrates applications beyond faces. The paper reports non-face editing on AFHQ cats and LSUN churches, including fur-color changes, lighting adjustments, snow addition, and time-of-day changes. It also emphasizes that faces remain the primary evaluation domain because human perception is highly sensitive to small identity changes (Liu et al., 7 Jul 2025).

The method’s limitations follow directly from its design. It requires a user-provided source prompt $\mathcal{P}$ for the original image, which can be inconvenient or ambiguous. Its object mask is derived from attention to an object word in the prompt, so inaccurate prompt wording or noisy attention can degrade spatial control. The workflow also incurs nontrivial cost because each identity requires fine-tuning and inversion, even though the reported times remain moderate. Finally, evaluation is concentrated on faces and relatively simple objects, so the behavior on very complex scenes or heavy structural edits is not established (Liu et al., 7 Jul 2025).

Taken together, S$^2$Edit can be understood as a specialized diffusion-editing formulation for a precise problem: preserving “who” while altering “what,” and doing so only “where” intended. Its main technical claim is that those three constraints become jointly tractable when identity is represented as a learned token whose semantics are orthogonalized and whose spatial action is explicitly masked (Liu et al., 7 Jul 2025).