
UniEdit-I: Unified Multimodal Editing

Updated 3 July 2026
  • UniEdit-I is a unified, tuning-free framework that edits images and videos by manipulating high-level semantic representations without any model fine-tuning.
  • It employs a closed-loop UEV process integrating semantic understanding, latent-space diffusion, and verification to ensure artifact-free, semantically aligned outputs.
  • The approach achieves state-of-the-art performance in both image and video editing by leveraging pretrained vision-language models and diffusion transformers on CLIP embeddings.

UniEdit-I denotes a class of unified, tuning-free editing frameworks for multimodal manipulation, with contributions in both image editing for vision-language models (VLMs) and video editing built on pretrained diffusion backbones. These approaches move away from task- and dataset-specific pipelines toward closed-loop, conceptual editing within high-level feature spaces, reporting artifact-free, semantically aligned outputs and state-of-the-art quantitative performance without any model fine-tuning.

1. Unified Multimodal Editing in High-Level Feature Space

UniEdit-I (Bai et al., 5 Aug 2025, Bai et al., 2024) is designed to operate entirely in the semantic latent space shared by vision-language models, notably using representations such as CLIP embeddings. Unlike conventional pixel- or VAE-space editors, UniEdit-I modifies conceptual representations, so that generated outputs are both semantically coherent with the editing instruction and visually plausible. This closes the feedback loop between semantic interpretation and low-level generation: the VLM is no longer a passive evaluator applied after the fact but an active controller that steers the editing trajectory while it runs.

The base architecture typically includes:

  • A frozen vision encoder mapping images to semantic (e.g., CLIP) features.
  • A text encoder for prompt/instruction embedding.
  • A representation autoencoder (RAE) head to invert semantic features back to pixels.
  • A pretrained diffusion transformer operating directly in the high-level feature space.
  • A closed-loop controller—the UEV (Understanding–Editing–Verifying) loop—for real-time semantic feedback.

2. Closed-Loop UEV Editing Process

The core of UniEdit-I is an iterative, training-free UEV procedure:

  1. Understanding: The VLM parses both the source image and the edit instruction, extracting the semantic context and the target concept embedding.
  2. Editing (Diffusion in Latent Space): The editing variable $Z_{\text{edit}}^T$ is initialized as the CLIP features of the source. At each reverse-diffusion step, semantic gradients are computed via the diffusion transformer. The update incorporates a semantic velocity term (difference between the denoised features conditioned on the source vs. the target instruction). Adaptive gain scaling, based on real-time feedback from the VLM, dynamically directs the denoising path.
  3. Verifying: After a fixed number of steps (or when thresholds are met), the current decode is verified by the VLM against the target instruction, using metrics such as semantic consistency and perceptual alignment. If criteria are not met, discrepancy prompts are generated and the loop is restarted from the best intermediate.
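The Editing step can be illustrated with a small sketch. It assumes a plain Euler update in which the semantic velocity is the target-conditioned prediction minus the source-conditioned one, scaled by a feedback gain. The function names, the `toy_denoise` stand-in, the sign convention, and the exact update rule are assumptions for illustration and are not taken from the paper.

```python
from typing import Callable, List

Vec = List[float]
# Hypothetical denoiser signature: (z, t, condition) -> denoised features.
Denoiser = Callable[[Vec, float, Vec], Vec]


def semantic_velocity(denoise: Denoiser, z: Vec, t: float,
                      c_src: Vec, c_tgt: Vec) -> Vec:
    """Target-conditioned minus source-conditioned denoised features."""
    d_tgt = denoise(z, t, c_tgt)
    d_src = denoise(z, t, c_src)
    return [a - b for a, b in zip(d_tgt, d_src)]


def edit_step(denoise: Denoiser, z: Vec, t: float,
              c_src: Vec, c_tgt: Vec, gain: float, dt: float) -> Vec:
    """One Euler step along the semantic velocity, scaled by the feedback gain."""
    v = semantic_velocity(denoise, z, t, c_src, c_tgt)
    return [zi + dt * gain * vi for zi, vi in zip(z, v)]


def toy_denoise(z: Vec, t: float, c: Vec) -> Vec:
    """Toy stand-in for the frozen diffusion transformer: predicts the condition."""
    return list(c)


if __name__ == "__main__":
    src, tgt = [1.0, 0.0], [0.0, 1.0]
    # A gain of 0.5 moves the features halfway from the source to the target.
    print(edit_step(toy_denoise, src, 0.5, src, tgt, gain=0.5, dt=1.0))  # [0.5, 0.5]
```

In this sketch a gain of zero leaves the features untouched, which is one simple way to picture how feedback from the VLM could damp or strengthen an edit.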

This real-time integration of semantic reasoning directly into the generation pipeline is the framework's distinguishing feature: all guidance and checking are performed using high-level feature-space similarity, without external losses or task-specific adapters.

UEV Loop Pseudocode

All weights remain frozen; no backpropagation or fine-tuning is employed.
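A minimal sketch of the control flow follows, written under stated assumptions rather than as the authors' pseudocode. The Understanding step is assumed to have already produced the `edit_step` and `verify` callables, discrepancy-prompt generation is omitted, and the names `uev_loop`, `check_every`, `threshold`, and `max_rounds` are hypothetical. The defaults of 30 steps, a check every 5 steps, and a 0.85 threshold mirror the image-editing configuration reported for UniEdit-I.

```python
from typing import Callable, List, Tuple

Vec = List[float]


def uev_loop(z_src: Vec,
             edit_step: Callable[[Vec, int], Vec],
             verify: Callable[[Vec], float],
             steps: int = 30, check_every: int = 5,
             threshold: float = 0.85, max_rounds: int = 3) -> Tuple[Vec, float]:
    """Edit, verify periodically, and restart from the best intermediate."""
    best_z, best_score = list(z_src), verify(z_src)
    for _ in range(max_rounds):
        z = list(best_z)                     # restart from the best intermediate
        for t in range(steps, 0, -1):        # reverse-diffusion order
            z = edit_step(z, t)              # frozen model, inference only
            if t % check_every == 0 or t == 1:
                score = verify(z)            # VLM-style check against the target
                if score > best_score:
                    best_z, best_score = list(z), score
                if score > threshold:        # early stop once the check passes
                    return best_z, best_score
    return best_z, best_score


if __name__ == "__main__":
    target = [0.0, 1.0]

    def toy_step(z: Vec, t: int) -> Vec:
        # Toy editor: move 20% of the way toward the target each step.
        return [zi + 0.2 * (ti - zi) for zi, ti in zip(z, target)]

    def toy_verify(z: Vec) -> float:
        # Toy verifier: cosine similarity with the target direction.
        norm = sum(x * x for x in z) ** 0.5
        return sum(a * b for a, b in zip(z, target)) / norm

    z_final, score = uev_loop([1.0, 0.0], toy_step, toy_verify)
    print(round(score, 3))
```

Tracking the best intermediate separately from the running state is what lets a failed round restart from a good point instead of from the source features.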

3. Semantic-Space Diffusion and Loss Formulations

Editing is executed by running a diffusion process directly in semantic latent space. The forward process adds noise to the CLIP embedding, and the reverse process incrementally denoises under the guidance of the target instruction embedding.

  • Semantic Alignment Loss: $\mathcal{L}_{\text{edit}}(z, C) = -\operatorname{CLIPSim}(z, E_{\text{text}}(C))$, enforcing alignment between the intermediate sample and the textual instruction.
  • Verification Loss: $\mathcal{L}_{\text{verify}}(I, C) = 1 - \operatorname{CLIPSim}(E_{\text{sem}}(I), E_{\text{text}}(C))$, for post-hoc or in-loop checking.
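Both quantities reduce to a similarity between embeddings. The sketch below assumes that CLIPSim is cosine similarity, which is the usual convention but is not stated explicitly here, and it takes precomputed embedding vectors as input instead of calling real encoders. The function names are illustrative.

```python
import math
from typing import Sequence


def clip_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings (assumed definition of CLIPSim)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def edit_loss(z: Sequence[float], text_emb: Sequence[float]) -> float:
    """L_edit = -CLIPSim(z, E_text(C)); lower means better aligned."""
    return -clip_sim(z, text_emb)


def verify_loss(image_emb: Sequence[float], text_emb: Sequence[float]) -> float:
    """L_verify = 1 - CLIPSim(E_sem(I), E_text(C)); 0 means perfectly aligned."""
    return 1.0 - clip_sim(image_emb, text_emb)


if __name__ == "__main__":
    instruction = [0.0, 1.0]
    print(edit_loss([0.0, 2.0], instruction))    # aligned sample
    print(verify_loss([1.0, 0.0], instruction))  # orthogonal sample
```

Because the weights are frozen, these values act as scores for guidance and checking and are not minimized by gradient descent on model parameters.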

All model parameters, including image/text encoders and denoiser, are frozen. The RAE head is trained once to invert the CLIP encoder, ensuring faithful reconstructions from the CLIP latent.

4. Practical Implementation: Video and Image Editing Scenarios

UniEdit-I generalizes across modalities:

Image Editing

For unified VLMs, UniEdit-I (Bai et al., 5 Aug 2025) reports state-of-the-art closed-loop editing with the following configuration:

  • BLIP3-o-8B with a CLIP-ViT-B/32 vision tower ($d=512$).
  • Diffusion steps $T=30$, denoising via flow ODE.
  • Classifier-free guidance: source branch 2.0, target branch 5.5.
  • Feedback is computed every $k=5$ steps, with adaptive gain and early stopping ($s_t > 0.85$, $p_t > 0.9$, two consecutive checks).
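The early-stopping rule in the last item can be sketched as a small counter. This assumes that $s_t$ and $p_t$ are the semantic and perceptual scores returned by the VLM at each feedback check, which is an interpretation of the symbols and is not spelled out above; the class name `EarlyStopper` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopper:
    """Stop once both scores clear their thresholds on consecutive checks."""
    s_thresh: float = 0.85   # threshold on s_t (value from the configuration)
    p_thresh: float = 0.9    # threshold on p_t (value from the configuration)
    patience: int = 2        # number of consecutive passing checks required
    _streak: int = 0         # current run of passing checks

    def update(self, s_t: float, p_t: float) -> bool:
        """Record one feedback check; return True when editing should stop."""
        if s_t > self.s_thresh and p_t > self.p_thresh:
            self._streak += 1
        else:
            self._streak = 0  # a failed check resets the run
        return self._streak >= self.patience


if __name__ == "__main__":
    T, k = 30, 5              # diffusion steps and feedback interval
    stopper = EarlyStopper()
    # Made-up score trajectories, one pair per feedback check.
    scores = [(0.60, 0.70), (0.80, 0.92), (0.88, 0.93), (0.90, 0.95)]
    checks = iter(scores)
    for step in range(1, T + 1):
        if step % k == 0:
            s_t, p_t = next(checks)
            if stopper.update(s_t, p_t):
                print(f"stopped at step {step}")
                break
```

Requiring two consecutive passing checks guards against stopping on a single noisy score.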

Video Editing

Adapted to text-guided video motion and appearance editing (Bai et al., 2024), UniEdit-I employs an inversion-then-generation pipeline:

  • DDIM inversion retrieves the latent $z_T$ for the input video under the pre-trained diffusion model.
  • SA-S (spatial self-attention) and SA-T (temporal) features are extracted in parallel auxiliary branches (reconstruction for appearance/content, motion-reference for temporal dynamics).
  • Within the editing path, feature injection ensures scene consistency and text-guided motion without temporal drift.
  • No model weights are ever updated during editing; all operations are purely inference-time.
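The inversion-then-generation idea can be illustrated with the standard deterministic DDIM update applied to a toy latent vector. This is a generic sketch and not the UniEdit-I code: `eps_model` is a hypothetical stand-in for the frozen noise predictor, `alphas_cumprod` is an assumed noise schedule, and a real video latent would be a tensor, not a short list.

```python
import math
from typing import Callable, List

Vec = List[float]
EpsModel = Callable[[Vec, int], Vec]  # hypothetical frozen noise predictor


def _ddim_move(x: Vec, eps: Vec, a_from: float, a_to: float) -> Vec:
    """Move a latent between noise levels using a fixed noise estimate."""
    # Clean latent implied by the current noise estimate.
    x0_pred = [(xi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
               for xi, ei in zip(x, eps)]
    # Re-noise that estimate to the requested level.
    return [math.sqrt(a_to) * pi + math.sqrt(1.0 - a_to) * ei
            for pi, ei in zip(x0_pred, eps)]


def ddim_invert(x0: Vec, eps_model: EpsModel, alphas_cumprod: List[float]) -> Vec:
    """Deterministically map a clean latent to a noisy latent z_T."""
    x = list(x0)
    for t in range(len(alphas_cumprod) - 1):
        x = _ddim_move(x, eps_model(x, t), alphas_cumprod[t], alphas_cumprod[t + 1])
    return x


def ddim_sample(z_T: Vec, eps_model: EpsModel, alphas_cumprod: List[float]) -> Vec:
    """Deterministically map a noisy latent z_T back toward a clean latent."""
    x = list(z_T)
    for t in range(len(alphas_cumprod) - 1, 0, -1):
        x = _ddim_move(x, eps_model(x, t), alphas_cumprod[t], alphas_cumprod[t - 1])
    return x


if __name__ == "__main__":
    schedule = [1.0, 0.9, 0.6, 0.3]        # decreasing signal level (assumed)
    toy_eps = lambda x, t: [0.5, -0.25]    # constant toy noise estimate
    z_T = ddim_invert([1.0, 2.0], toy_eps, schedule)
    print(ddim_sample(z_T, toy_eps, schedule))  # approximately [1.0, 2.0]
```

With a real network the round trip is only approximate, because the noise estimate changes from step to step; the constant toy predictor makes it exact up to floating-point error.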

Core routines include:

  • Hard or soft feature swapping (with interpolation parameter $\alpha$ for content control).
  • Layer-specific and timestep-specific injection, with content features injected from a chosen layer and timestep onward and motion features injected over a separate range.
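A small sketch of these two routines follows. It assumes that the interpolation parameter blends linearly between the editing-path feature and the auxiliary-branch feature, with a value of 1 meaning a hard swap; the direction of the blend, the function names, and the `start_layer` and `start_step` arguments are assumptions for illustration, and no specific layer or timestep values are implied.

```python
from typing import List

Vec = List[float]


def swap_features(edit_feat: Vec, ref_feat: Vec, alpha: float = 1.0) -> Vec:
    """Blend a reference feature into the editing path.

    alpha = 1.0 is a hard swap (use the reference feature as-is);
    0 < alpha < 1 is a soft swap (linear interpolation).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * r + (1.0 - alpha) * e for e, r in zip(edit_feat, ref_feat)]


def should_inject(layer: int, step: int, start_layer: int, start_step: int) -> bool:
    """Gate injection to layers and denoising steps at or after the given starts."""
    return layer >= start_layer and step >= start_step


if __name__ == "__main__":
    edit, ref = [0.0, 2.0], [1.0, 0.0]
    print(swap_features(edit, ref, alpha=1.0))  # hard swap: [1.0, 0.0]
    print(swap_features(edit, ref, alpha=0.5))  # soft swap: [0.5, 1.0]
```

A soft swap keeps part of the edited feature, which is one way to trade content preservation against edit strength.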

5. Empirical Results and Evaluation

On GEdit-Bench (606 samples, 11 edit types) (Bai et al., 5 Aug 2025):

  • Semantic Quality (SQ): 7.16 (UniEdit-I), higher than Step1X-Edit (7.09) but below BAGEL (7.36).
  • Perceptual Quality (PQ): 7.40 (UniEdit-I) vs. GPT-4o (7.62), Step1X-Edit (6.76).
  • Overall (O): 7.06 (UniEdit-I), approaching GPT-4o (7.53).
  • Artifact Score (CLIP latent): 8.10 $\pm$ 0.53 (vs. VAE baseline 5.35 $\pm$ 1.02).
  • Feedback Stability: 0.025 (CLIP) vs. 0.063 (VAE).

Convergence is efficient: 97.6% of samples reach optimality within the first iteration, with most stopping inside a narrow band of steps.

For video editing, UniEdit-I outperforms previous approaches on automatic CLIP-based metrics and human MOS, with frame consistency (98.37) and textual alignment (36.29) exceeding all prior state-of-the-art baselines.

6. Significance, Limitations, and Future Directions

UniEdit-I establishes that:

  • High-level, semantics-driven editing in frozen, pretrained VLMs is feasible and effective.
  • Closed-loop, verification-based trajectories inherently suppress generative artifacts and semantic drift, outperforming open-loop or pixel-space approaches.

The absence of any model fine-tuning or architectural adapters is a distinguishing factor, suggesting a lower operational cost and greater deployability across pretrained VLMs.

This suggests that UniEdit-I frameworks could generalize to other modalities or even multi-hop reasoning scenarios by extending the integration of auxiliary feedback and domain-specific feature space diffusion.

Limitations include:

  • Dependency on the alignment and expressive power of the underlying semantic encoding (e.g., CLIP)—failure modes may occur for out-of-domain instructions or under-optimized vision-language adapters.
  • The RAE/decoder fidelity is limited by reconstruction quality; further improvements in semantic-to-pixel inversion would benefit output realism.

Continued research is anticipated on data-driven and generalization aspects (e.g., fully automated editing instruction generation and diversified motion/textual transformations), and on extending the closed-loop paradigm to additional multimodal tasks (Bai et al., 5 Aug 2025, Bai et al., 2024).
