IV-Edit: Instruction-Guided Visual Editing

Updated 4 July 2026

IV-Edit is a family of methods that transform existing images or videos according to explicit instructions while preserving untargeted content.
Researchers employ diverse architectures, such as closed-loop editing in semantic latent space and inversion-based techniques, to improve fidelity and controllability.
These methods are evaluated with dedicated benchmarks and iterative refinement procedures, which highlight trade-offs between edit strength and source preservation.

IV-Edit is not presented as a single formally standardized framework in the cited literature. A plausible synthesis is that the term denotes an edit-centric family of methods in which an existing source image, video, or other structured input is transformed under an instruction, target condition, or explicit edit budget, rather than generated from scratch. In recent arXiv work, this family spans instruction-based video editing, training-free closed-loop image editing in semantic latent space, few-step inversion-based real-image editing, identity-preserving editable customization, and benchmark design for fine-grained edit evaluation (Zhang et al., 22 Mar 2025, Bai et al., 5 Aug 2025, Gong et al., 8 Aug 2025, Li et al., 6 Sep 2025, Basu et al., 2023).

1. Scope and terminology

Within image and video generation, IV-Edit most naturally refers to methods that begin from a source visual artifact and then apply a requested change while preserving untargeted content. InstructVEdit formulates instruction-guided video editing as taking a source video $V_s$ and a human instruction $T_h$ to generate an edited target video $V_t$, spanning local edits such as object replacement and attribute changes, and global edits such as style transformation and background replacement, while preserving temporal consistency across frames (Zhang et al., 22 Mar 2025). UniEdit-I addresses general text-guided image editing: given a source image $I_{\text{src}}$ and a natural-language editing instruction $q$, the goal is to produce an edited output $I_{\text{out}}$ that satisfies the instruction while preserving irrelevant source content (Bai et al., 5 Aug 2025). InstantEdit studies text-guided editing of real images with a source image $x$, a source prompt $c$, and a target prompt $\hat c$, under the standard inversion-and-regeneration paradigm (Gong et al., 8 Aug 2025). EditIDv2 addresses identity-preserving, editable text-to-image customization, where a reference identity image and a text prompt must jointly preserve the target identity and follow semantic edits, especially under long prompts and high-complexity narrative scenes (Li et al., 6 Sep 2025).

This usage is narrower than unrestricted generation and broader than single-shot retouching. The common denominator is explicit concern with editability, preservation, and controllability. UniEdit-I is particularly explicit about a terminological ambiguity: if IV-Edit means human-in-the-loop stepwise editing, UniEdit-I is not that; if it means iterative visual editing with feedback, it fits well because its loop is autonomous and self-corrective rather than conversational (Bai et al., 5 Aug 2025).

Work	Domain	Core formulation
EditVal	Text-guided image editing	Standardized benchmark on real images across 13 edit types
InstructVEdit	Instruction-based video editing	Source video plus human instruction to edited target video
UniEdit-I	Text-guided image editing	Training-free, closed-loop editing in semantic latent space
InstantEdit	Real-image editing	Few-step inversion-and-regeneration with RectifiedFlow
EditIDv2	ID customization	Reference identity image plus text prompt

2. Data, benchmarks, and supervision

A central theme in IV-Edit research is that supervision is scarce and heterogeneous. InstructVEdit identifies the scarcity of large-scale, high-quality paired video editing data as the central bottleneck and therefore builds a four-step curation workflow. It starts from IP2P and trains on UltraEdit and P2P. It then generates first-frame editing triples with 30 sampling iterations, 20 diffusion steps, text guidance from 1.2 to 2.0, and image guidance from 5.0 to 12.5, and stores the editing trajectory of the first frame. Next, it expands the source image into a 48-frame source video with CogVideoX-5B-I2V, and finally it expands the first-frame edit into a target video with FFG-V2V. In the experimental pipeline, 50K UltraEdit images are filtered to 25K high-quality images, then expanded and filtered to 9K video pairs (Zhang et al., 22 Mar 2025).

EditVal addresses a different but complementary supervision problem: the absence of a standardized evaluation protocol for diffusion-based text-guided image editing. It draws on MS-COCO, using 19 MS-COCO object classes, 92 curated images, and 648 unique image-edit operations. Its 13 edit types are: object-addition, object-replacement, positional-addition, size, position-replacement, alter-parts, background, texture, style, color, shape, action, and viewpoint. The benchmark is modular, and new edit types can be added by extending the JSON metadata (Basu et al., 2023).
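To make the modularity claim concrete, the following is a minimal sketch of what extending a JSON metadata file with a new edit type could look like. The nesting (object class, then image id, then edit type, then a list of targets), the image id, and the `material` edit type are hypothetical choices made for illustration; they are not the published EditVal schema.

```python
import json

# Hypothetical layout: {object_class: {image_id: {edit_type: [targets]}}}.
# This is an illustrative assumption, not the published EditVal schema.
metadata = {
    "bench": {
        "000000123456": {
            "object-addition": ["cat"],
            "color": ["red"],
        }
    }
}


def add_edit_type(meta, object_class, image_id, edit_type, targets):
    """Register an edit type for one image without touching existing entries."""
    image_entry = meta.setdefault(object_class, {}).setdefault(image_id, {})
    existing = image_entry.setdefault(edit_type, [])
    for target in targets:
        if target not in existing:
            existing.append(target)
    return meta


def count_operations(meta):
    """Count unique (image, edit type, target) operations."""
    return sum(
        len(targets)
        for images in meta.values()
        for edits in images.values()
        for targets in edits.values()
    )


# "material" is a made-up edit type used only to show the extension step.
add_edit_type(metadata, "bench", "000000123456", "material", ["wooden", "metal"])
serialized = json.dumps(metadata, indent=2, sort_keys=True)
total_operations = count_operations(metadata)
```

Under this assumed layout, adding an edit type only appends new keys, so existing operations and any evaluation code keyed on the older edit types remain valid.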

EditIDv2 uses what it calls minimal data lubrication: roughly 3K labeled samples from MyStyle and publicly crawled internet data, labeled with GPT-4o. The claimed purpose is not large identity coverage but many different attributes under the same identity, including different facial poses, lighting, accessories, and expressions, so that the model learns when to preserve identity and when to let the prompt control the scene (Li et al., 6 Sep 2025). This suggests a broader IV-Edit pattern: supervision quality is often treated as more decisive than raw corpus size.

3. Architectural mechanisms

IV-Edit methods differ sharply in where they place the edit operator. InstructVEdit begins from an IP2P-style diffusion image editing model and adapts it into a video editor with two dedicated modules: Soft Motion Adapter (SMA) and Editing-guided Propagation Module (EPM). SMA introduces a learnable scalar $\alpha$ that modulates the temporal residual, with its initial value chosen so that motion priors are integrated gradually. EPM predicts frame importance and uses it as an attention bias so that frames with stronger edits exert higher influence across the sequence (Zhang et al., 22 Mar 2025).
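The two mechanisms can be illustrated with a small sketch. It assumes that the gated residual has the form `spatial + alpha * temporal`, that `alpha` starts at zero, and that frame importance is added to the attention logits before the softmax. These forms are assumptions made for illustration and are not taken from the paper; the function names are hypothetical.

```python
import math


def soft_motion_residual(spatial, temporal, alpha):
    """Gate the temporal branch: out = spatial + alpha * temporal (assumed form)."""
    return [s + alpha * t for s, t in zip(spatial, temporal)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_frame_attention(scores, importance):
    """Add a per-key-frame importance bias to the logits, then normalize each row."""
    return [softmax([s + b for s, b in zip(row, importance)]) for row in scores]


# With alpha = 0 (an assumed initial value) the output equals the spatial features.
features = soft_motion_residual([1.0, 2.0], [0.5, -0.5], alpha=0.0)

# Three frames with uniform raw scores; frame 0 is assumed to carry the strongest edit.
uniform_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
attention = biased_frame_attention(uniform_scores, importance=[2.0, 0.0, 0.0])
```

In this toy setting every query frame attends more to frame 0 than to the others, which is the qualitative effect the attention bias is described as having.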

UniEdit-I places editing directly in the semantic latent space of a unified VLM, specifically BLIP3-o-8B, rather than in pixel space or a conventional VAE latent space. Its Understanding–Editing–Verifying (UEV) loop constructs a structured source caption, a scene graph, a target caption, and a modified target graph, then performs diffusion-based editing in CLIP-like feature space and verifies intermediate states with the frozen VLM. An adaptive gain ties the update magnitude to semantic progress and task completion (Bai et al., 5 Aug 2025).

InstantEdit operates in a different regime: few-step, inversion-based editing on RectifiedFlow, specifically PeRFlow. Its inversion is Piecewise Rectified Flow Inversion (PerRFI), which runs the flow dynamics in reverse, and its regeneration mechanism is Inversion Latent Injection (ILI), which reuses the entire inverted latent trajectory rather than only the final latent. The guidance rule is Disentangled Prompt Guidance (DPG), which removes the source-aligned component from the edit-driving signal and can be combined with an attention-based mask. A Canny-conditioned ControlNet is added as a structural prior (Gong et al., 8 Aug 2025).
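One simple way to picture "removing the source-aligned component" is orthogonal projection: subtract from the edit signal its projection onto the source signal. The sketch below assumes exactly that; the actual DPG rule may differ, and the function and variable names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_source_component(edit_signal, source_signal, eps=1e-12):
    """Subtract from edit_signal its orthogonal projection onto source_signal."""
    scale = dot(edit_signal, source_signal) / (dot(source_signal, source_signal) + eps)
    return [e - scale * s for e, s in zip(edit_signal, source_signal)]


# Toy 2-D vectors standing in for flattened guidance signals.
source_signal = [1.0, 0.0]
edit_signal = [3.0, 4.0]
disentangled = remove_source_component(edit_signal, source_signal)
```

After the projection is removed, the remaining signal is (numerically) orthogonal to the source direction, so it no longer pushes the sample along what the source prompt already describes.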

EditIDv2 concentrates the problem in the ID feature integration module. It decomposes a PerceiverAttention-like cross-attention bridge between identity and generation branches, defines a compensated query, and makes the associated compensation step-dependent. It also introduces an identity loss and combines it with the diffusion loss in a single training objective, with a fixed hyperparameter setting in the reported experiments (Li et al., 6 Sep 2025).

4. Iteration, inversion, and control

A defining property of IV-Edit systems is that the edit trajectory itself becomes an object of control. UniEdit-I is the clearest closed-loop example. At a fixed interval of diffusion steps, the current latent is decoded to an image, the frozen VLM computes a global alignment score and a task-completion score, and the system adjusts the editing update, optionally stops early, or restarts from the best intermediate latent. Early stopping occurs when the stopping condition holds at two consecutive verification points, and the paper reports a maximum of 3 UEV iterations (Bai et al., 5 Aug 2025).
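The control flow described above can be sketched as a generic loop. Only the two-consecutive-verification rule and the limit of 3 iterations come from the text; the verification interval, the threshold, the score combination, the gain rule, and the toy `step_fn` and `verify_fn` are assumptions chosen so the example runs on its own.

```python
def closed_loop_edit(latent, step_fn, verify_fn, total_steps=20, verify_every=5,
                     threshold=0.9, patience=2, max_iterations=3):
    """Run editing passes with periodic verification and restart from the best state.

    step_fn(latent, gain) -> new latent
    verify_fn(latent) -> (alignment, completion), both in [0, 1]
    Returns (best_latent, best_score, iterations_used).
    """
    best_latent, best_score = latent, float("-inf")
    for iteration in range(1, max_iterations + 1):
        # Each pass restarts from the best intermediate state seen so far.
        current, gain, consecutive_hits = best_latent, 1.0, 0
        for step in range(1, total_steps + 1):
            current = step_fn(current, gain)
            if step % verify_every:
                continue  # not a verification point
            alignment, completion = verify_fn(current)
            score = min(alignment, completion)  # assumed way to combine the scores
            if score > best_score:
                best_latent, best_score = current, score
            # Assumed gain rule: shrink updates as the task nears completion.
            gain = max(0.1, 1.0 - completion)
            consecutive_hits = consecutive_hits + 1 if score >= threshold else 0
            if consecutive_hits >= patience:
                return best_latent, best_score, iteration
    return best_latent, best_score, max_iterations


# Toy problem: a scalar "latent" that should move toward the target value 1.0.
def toy_step(z, gain):
    return z + 0.2 * gain * (1.0 - z)


def toy_verify(z):
    closeness = 1.0 - abs(1.0 - z)
    return closeness, closeness


best_latent, best_score, iterations_used = closed_loop_edit(0.0, toy_step, toy_verify)
```

The design point the sketch isolates is that verification is decoupled from the update rule: any scorer that returns alignment and completion values can drive early stopping and restarts without changing the editing step itself.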

InstantEdit uses iteration differently. Its emphasis is not semantic self-verification but preserving source fidelity under an extreme step budget. The method is designed for 4–8 NFE, with strong reported results at 8 NFE, and treats the stored inversion path as reusable structure. The central claim is that RectifiedFlow’s straight trajectories make inversion much more accurate in low-step settings than DDIM inversion, which in turn stabilizes regeneration (Gong et al., 8 Aug 2025).
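The intuition behind the straight-trajectory claim can be checked on a one-dimensional toy ODE. The sketch below is not PerRFI or DDIM inversion; it only compares a naive few-step Euler round trip on a field whose trajectories are straight lines against one whose trajectories are curved. Both velocity fields are illustrative assumptions.

```python
def euler_forward(x, velocity, steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with explicit Euler."""
    h = 1.0 / steps
    for i in range(steps):
        x = x + h * velocity(x, i * h)
    return x


def euler_invert(x, velocity, steps):
    """Naive inversion: step backward using the velocity at the later state."""
    h = 1.0 / steps
    for i in range(steps, 0, -1):
        x = x - h * velocity(x, i * h)
    return x


def round_trip_error(velocity, x0=1.0, steps=4):
    """Absolute error after integrating forward and then inverting."""
    return abs(euler_invert(euler_forward(x0, velocity, steps), velocity, steps) - x0)


def straight(x, t):
    # Trajectories x(t) = x0 * (1 + t): velocity is constant along each path.
    return x / (1.0 + t)


def curved(x, t):
    # Trajectories x(t) = x0 * exp(t): velocity changes along each path.
    return x


straight_error = round_trip_error(straight)
curved_error = round_trip_error(curved)
```

With four steps, the straight field is recovered up to floating-point error because the velocity evaluated at the later state equals the velocity that was actually used, whereas the curved field accumulates a visible mismatch at every step.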

InstructVEdit combines architectural iteration with dataset iteration. After Stage I synthetic initialization on 9K video pairs, it performs two rounds of real-world iterative refinement using videos collected from YouTube-VOS, MOSE, LVOS, and MeViS. Approximately 8K real-world videos are standardized into 16K 16-frame clips; Stage II produces 3,402 training clips and Stage III produces 5,048 clips (Zhang et al., 22 Mar 2025). This suggests a broader IV-Edit distinction between methods that refine the latent trajectory at inference and methods that refine the training distribution across stages.

5. Evaluation and empirical profile

Evaluation in IV-Edit is explicitly multidimensional. EditVal separates edit accuracy, DINO similarity, and FID, and complements automatic metrics with a standardized human study. Its automated evaluator uses OWL-ViT with threshold 0.1 for six object-centric edit types: object-addition, object-replacement, positional-addition, size, position-replacement, and alter-parts. Human evaluation is conducted on Amazon Mechanical Turk, with 3 unique workers per task, a 0–3 scale, a gold set of 150 tasks, and 76–78% majority agreement. Across both automated and human evaluation, Instruct-Pix2Pix, Null-Text Inversion, and SINE are the top-performing methods when averaged across edit types, but only Instruct-Pix2Pix and Null-Text Inversion preserve original image properties; position-replacement is especially difficult, with automated accuracies often 0–15% (Basu et al., 2023).
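As a small illustration of how a majority-agreement figure can be computed for three workers per task, the sketch below counts the fraction of tasks on which a strict majority of workers gave the same score. This definition and the ratings are assumptions for illustration; the paper's exact agreement statistic may be defined differently.

```python
from collections import Counter


def majority_agreement(ratings_per_task):
    """Fraction of tasks where a strict majority of workers chose the same score."""
    agreed = 0
    for ratings in ratings_per_task:
        top_count = Counter(ratings).most_common(1)[0][1]
        if top_count > len(ratings) / 2:
            agreed += 1
    return agreed / len(ratings_per_task)


# Hypothetical ratings on the 0-3 scale, three workers per task.
ratings = [(3, 3, 2), (0, 1, 2), (1, 1, 1), (2, 0, 2)]
agreement = majority_agreement(ratings)
```

In this made-up sample, three of the four tasks have at least two matching ratings, so the agreement is 0.75.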

InstructVEdit is evaluated on LOVEU-TGVE-2023 (TGVE) and TGVE+ using two ViCLIP-based metrics, PickScore, CLIPFrame, and CLIPText. On TGVE, the reported numbers, in that order, are 0.280, 0.237, 20.92, 0.919, and 27.69; on TGVE+, they are 0.271, 0.183, 20.94, 0.917, and 26.65. The paper notes that CLIPFrame is slightly below some training-free methods because CLIPFrame favors minimal changes (Zhang et al., 22 Mar 2025).

UniEdit-I reports results on GEdit-Bench-EN, a 606-sample benchmark with 11 task categories, using VIEScore with G_SC, G_PQ, and G_O. On the full set, UniEdit-I reports 7.16 / 7.40 / 7.06, and in the convergence analysis 97.6% finish in the first iteration, 2.5% require one refinement iteration, and overall 100% convergence is reached by the stated maximum iterations (Bai et al., 5 Aug 2025).

InstantEdit evaluates on PIE Bench, which covers 9 edit types, and reports structure distance, PSNR, LPIPS, MSE, SSIM, CLIPScore on the whole image, CLIPScore on the edited region, and efficiency. In the few-step setting, the reported results are 17.14, 27.96, 44.39, 34.94, 86.44, 26.28, 22.82, 1.37 s, 8 NFE, and 4 steps. The user study reports 37.43% for InstantEdit, compared with 35.05% for TurboEdit, 18.35% for InfEdit, and 9.17% for ReNoise (Gong et al., 8 Aug 2025).

EditIDv2 uses IBench, specifically ChineseID + editable long prompts, and reports Aesthetic 0.691, Image Quality 0.437, Facesim 0.659, ClipI 0.804, ClipT 0.253, Yaw 18.17, Pitch 9.641, Roll 11.39, Landmarkdiff 0.096, and Exprdiv 0.611 (Li et al., 6 Sep 2025).

Work	Benchmark	Representative reported results
EditVal	92 curated images, 648 operations, 13 edit types	Position-replacement often 0–15%; object-addition about 35%–55%
InstructVEdit	TGVE / TGVE+	TGVE: 0.280 and 0.237 on the two ViCLIP metrics, 20.92 PickScore
UniEdit-I	GEdit-Bench-EN	7.16 / 7.40 / 7.06 on G_SC / G_PQ / G_O
InstantEdit	PIE Bench	17.14 structure distance, 27.96 PSNR, 22.82 edited-region CLIPScore
EditIDv2	IBench	0.691 Aesthetic, 0.659 Facesim, 0.804 ClipI

6. Edit refinement beyond vision

Although IV-Edit is primarily grounded in visual editing in the cited material, adjacent work shows that the same edit-refinement logic extends beyond vision. In non-autoregressive ASR, “CTC-Seeded Token Edit Refinement” reformulates decoding as variable-length correction of an already informative seed hypothesis: the collapsed greedy CTC output. The acoustic-conditioned Edit Flow decoder predicts insertions, deletions, and substitutions in parallel, is trained jointly with a CTC model using a continuous-time discrete diffusion loss, and finds that two edit steps work best. With classifier-free guidance, dev WER improves from 2.7/6.8 to 2.3/5.5 on the ESPNet encoder and from 2.3/5.6 to 2.1/4.8 on Whisper Base (Huang et al., 27 Jun 2026).
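The seed hypothesis mentioned above, the collapsed greedy CTC output, follows the standard CTC decoding rule: take the best label per frame, merge consecutive repeats, and drop blanks. A minimal sketch follows; the per-frame scores and the three-symbol vocabulary are made up for illustration, and the Edit Flow decoder itself is not shown.

```python
def ctc_greedy_collapse(frame_scores, blank=0):
    """Argmax per frame, merge consecutive repeats, then drop blanks."""
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    collapsed = []
    previous = None
    for token in best_path:
        if token != previous and token != blank:
            collapsed.append(token)
        previous = token
    return collapsed


# Toy per-frame scores over a vocabulary {0: blank, 1, 2}.
frames = [
    [0.1, 0.8, 0.1],    # token 1
    [0.1, 0.7, 0.2],    # token 1 again (merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.7, 0.1],    # token 1 (kept: a blank separates it from the earlier 1)
    [0.1, 0.2, 0.7],    # token 2
]
seed_hypothesis = ctc_greedy_collapse(frames)  # [1, 1, 2]
```

An edit-based decoder would then take a sequence like `seed_hypothesis` as its starting point and apply insertions, deletions, and substitutions rather than decoding from scratch.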

In computational geometry, “Fréchet Edit Distance” defines the minimum number of vertex edits required so that an edited polygonal curve is within a given threshold of another curve under continuous or discrete Fréchet distance. The paper gives polynomial-time algorithms for strong variants and NP-hardness for weak variants. This is not a visual generation method, but it formalizes the same structural idea: an edit operation is meaningful only relative to a target-constrained similarity criterion (Fox et al., 2024).
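For reference, the similarity criterion in the discrete case can be computed with the classical dynamic program for discrete Fréchet distance. The sketch below implements only that criterion and a threshold check; it does not compute the edit distance itself (the minimum number of vertex edits), and the example curves and the helper name `within_threshold` are illustrative.

```python
import math


def discrete_frechet(curve_p, curve_q):
    """Discrete Fréchet distance between two polygonal curves (O(n * m) table)."""
    n, m = len(curve_p), len(curve_q)
    table = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = math.dist(curve_p[i], curve_q[j])
            if i == 0 and j == 0:
                table[i][j] = cost
            elif i == 0:
                table[i][j] = max(table[0][j - 1], cost)
            elif j == 0:
                table[i][j] = max(table[i - 1][0], cost)
            else:
                # Cheapest way to reach (i, j): advance on P, on Q, or on both.
                reach = min(table[i - 1][j], table[i - 1][j - 1], table[i][j - 1])
                table[i][j] = max(reach, cost)
    return table[-1][-1]


def within_threshold(curve_p, curve_q, delta):
    """True if the curves are within delta under discrete Fréchet distance."""
    return discrete_frechet(curve_p, curve_q) <= delta


# Two parallel horizontal segments one unit apart.
p = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
q = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
distance = discrete_frechet(p, q)
```

An edit-distance procedure would wrap a check like `within_threshold` around candidate vertex insertions, deletions, or moves, searching for the smallest number of edits that makes the check succeed.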

A plausible implication is that IV-Edit can be interpreted not only as a family of image and video systems, but as a broader algorithmic pattern in which a good initial hypothesis is iteratively corrected by a restricted edit operator under domain-specific fidelity constraints.

7. Limitations and unresolved trade-offs

Across the visual literature, the most persistent failure mode is the trade-off between applying the requested edit and preserving source content. EditVal makes this explicit by separating edit success (fidelity), object preservation, and context preservation; a method can score well on edit success but poorly on preservation, or vice versa. The benchmark further concludes that most editing methods fail at edits involving spatial operations, and that there is no 'winner' method that ranks best individually across a range of different edit types (Basu et al., 2023).

InstructVEdit inherits the standard tension between edit strength and static consistency. Its ablations state that SMA improves editing fidelity but can hurt CLIPFrame slightly, while EPM improves edit fidelity and consistency, and CLIPFrame remains lower than some training-free methods even when the full model is strongest overall (Zhang et al., 22 Mar 2025). UniEdit-I reports that text change is notably weak and attributes this to limitations in the underlying pretrained VLM semantics rather than the editing loop itself; the paper also notes that rare concepts and fine-grained attributes remain difficult (Bai et al., 5 Aug 2025). InstantEdit states that it is mainly effective for moderate edits and struggles with large structural changes, especially pose modification or other geometrically demanding transformations (Gong et al., 8 Aug 2025). EditIDv2 improves long-prompt editability, but its own framing implies that stronger identity consistency can still reduce edit freedom, while weaker identity conditioning can destroy subject consistency (Li et al., 6 Sep 2025).

Evaluation itself remains incomplete. EditVal explicitly avoids automated scoring for edits such as viewpoint or action because vision-language models do not reliably recognize those concepts (Basu et al., 2023). This suggests that IV-Edit still lacks a universal metric regime spanning object-centric edits, spatial reasoning, narrative identity preservation, and temporal coherence. The field therefore remains organized around benchmark-specific decompositions rather than a single, settled objective.