RefVIE: Reference Video Editing Dataset

Updated 4 July 2026

RefVIE is a dataset for reference-guided video editing that combines source videos, textual instructions, visual references, and target videos in quadruplets.
It employs a scalable pipeline—incorporating filtering, grounding, segmentation, and reference synthesis—to generate high-quality reference images for precise edit control.
RefVIE-Bench evaluates models on identity consistency, temporal coherence, and reference fidelity, driving advances in instruction–reference–guided video editing.

RefVIE is the data and evaluation backbone introduced in "Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance" for instruction–reference–following video editing, a setting in which a model receives a source video, a natural-language instruction, and a reference image, and must generate an edited video that conforms to both the textual edit specification and the visual details conveyed by the reference image. In the paper’s formulation, training data are quadruplets $(V_{src}, T_{inst}, I_{ref}, V_{tgt})$, where $V_{src}$ is the source video, $T_{inst}$ the edit instruction, $I_{ref}$ the reference image, and $V_{tgt}$ the edited target video. RefVIE comprises two coupled resources: the RefVIE dataset, a large-scale open-source collection of 477K instruction–reference–video quadruplets, and RefVIE-Bench, a curated benchmark and evaluation protocol for measuring instruction following, reference fidelity, and temporal coherence in generated videos (Lin et al., 2 Mar 2026).

1. Conceptual scope and motivation

RefVIE was introduced to address a specific limitation of instruction-based video editing: natural language is described as “inherently ambiguous” for specifying exact visual details such as object identity, texture, style, and background appearance. The motivating use cases are edits of the form “replace the car with this sports car” or “apply the style of this painting,” where textual conditioning alone does not provide sufficient control. The paper’s central claim is that reference images can resolve this ambiguity by supplying explicit visual semantics unavailable in text-only conditioning (Lin et al., 2 Mar 2026).

A second motivation is data availability. Existing open datasets including InsViE, Señorita-2M, Ditto, ReCo, and OpenVE are described as instruction-based pairs without reference images, while reference-guided systems such as InstructX and Kling-Omni are said to rely on proprietary datasets unavailable to the community. RefVIE is therefore positioned as the first open dataset at scale that combines instruction editing and reference images in a single training resource.

Component	Role	Key facts
RefVIE	Training dataset	477K quadruplets
RefVIE-Bench	Evaluation benchmark	110 manually verified samples
Target task	Instruction–reference–following video editing	Inputs: $V_{src}$ , $T_{inst}$ , $I_{ref}$

The dataset is specifically designed for tasks where reference guidance is most salient and well-defined. The paper states that RefVIE covers local object edits and background replacement, and later characterizes its edit categories as local object addition, replacement, and background changes. This emphasis reflects a design choice: references are most useful when the edit concerns visually specific object appearance or scene backdrop rather than purely abstract transformations.

2. Construction pipeline from triplets to quadruplets

The technical contribution underlying RefVIE is a scalable pipeline that converts instruction-based video editing triplets $(V_{src}, T_{inst}, V_{tgt})$ into quadruplets $(V_{src}, T_{inst}, I_{ref}, V_{tgt})$ by synthesizing a reference image from the edited result. The pipeline begins from three open-source instruction-based video editing datasets: Ditto-1M, ReCo, and OpenVE-3M (Lin et al., 2 Mar 2026).
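
The triplet-to-quadruplet conversion can be sketched as a small data model. This is only an illustration: the class names, field names, and the `synthesize_reference` callback are hypothetical and do not come from the paper or any released code. The callback stands in for the grounding, segmentation, and reference-synthesis stages described in this section.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class EditTriplet:
    """Instruction-based sample (V_src, T_inst, V_tgt); names are illustrative."""
    source_video: str   # path to V_src
    instruction: str    # T_inst
    target_video: str   # path to V_tgt


@dataclass(frozen=True)
class EditQuadruplet:
    """Reference-guided sample (V_src, T_inst, I_ref, V_tgt)."""
    source_video: str
    instruction: str
    reference_image: str  # path to the synthesized I_ref
    target_video: str


def to_quadruplet(
    triplet: EditTriplet,
    synthesize_reference: Callable[[EditTriplet], Optional[str]],
) -> Optional[EditQuadruplet]:
    """Attach a reference image derived from the edited result.

    `synthesize_reference` is a hypothetical stand-in for the later
    pipeline stages; it returns None when any of them fails, in which
    case the sample is dropped.
    """
    ref = synthesize_reference(triplet)
    if ref is None:
        return None
    return EditQuadruplet(
        triplet.source_video, triplet.instruction, ref, triplet.target_video
    )
```

Returning `None` for failed samples mirrors the pipeline's behavior of discarding samples rather than repairing them.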

The first stage is source aggregation and filtering. EditScore is used as an automatic quality filter. For generic text-guided instruction tuning, only samples whose EditScore clears a base threshold are retained. For reference-guided training, a stricter EditScore threshold is applied, and only samples labeled as Local Modification or Background Replacement are selected. The stated purpose is to reduce noise and ensure that the retained edits are meaningful and accurately executed.
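
A minimal sketch of this two-tier filter follows. The threshold values are deliberately left as parameters because they are not assumed here, and the strict `>` comparison, the `ScoredSample` type, and the function name are illustrative assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Category labels as named in the text above.
REFERENCE_CATEGORIES = {"Local Modification", "Background Replacement"}


@dataclass(frozen=True)
class ScoredSample:
    sample_id: str
    edit_score: float  # automatic quality score for the edit
    category: str      # edit-type label from the source dataset


def filter_samples(
    samples: Iterable[ScoredSample],
    generic_threshold: float,
    reference_threshold: float,
) -> Tuple[List[ScoredSample], List[ScoredSample]]:
    """Return (generic pool, reference pool).

    The reference pool is a subset of the generic pool: it needs the
    stricter score threshold and one of the two allowed categories.
    """
    if reference_threshold < generic_threshold:
        raise ValueError("reference threshold should be the stricter one")
    generic = [s for s in samples if s.edit_score > generic_threshold]
    reference = [
        s for s in generic
        if s.edit_score > reference_threshold
        and s.category in REFERENCE_CATEGORIES
    ]
    return generic, reference
```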

The second stage is grounding and segmentation. Qwen3-VL-32B interprets the instruction $T_{inst}$ and localizes the edited region in the first frame of the target video $V_{tgt}$. For Background Change, the foreground object is grounded so that it can be removed and the new background exposed. For Local Editing, the edited object is grounded so that it can be extracted as a reference. Qwen3-VL produces coarse bounding boxes, which are refined by SAM3 into “pixel-perfect segmentation masks.” Samples for which grounding or segmentation fails are discarded.

The third stage is reference image synthesis. Using the segmentation masks, the pipeline generates one reference image $I_{ref}$ per sample with Qwen-Image-Edit-2511. For background tasks, the foreground object is removed from the target frame and the exposed region is inpainted to produce a clean background image, which becomes the reference. For local edits, the segmented edited object is extracted from the target frame and placed on a clean, minimally cluttered background with tight cropping, producing an object-centric reference image. The pipeline also rejects references with “extreme aspect ratios or resolution” to maintain consistency.
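
The rejection of references with extreme aspect ratios or resolution could look roughly like the check below. The limits (`max_aspect_ratio=3.0`, `min_side=256`) and the function name are hypothetical placeholders chosen for illustration; the text above does not state the actual cutoffs.

```python
def is_acceptable_reference(
    width: int,
    height: int,
    max_aspect_ratio: float = 3.0,  # hypothetical limit, not from the paper
    min_side: int = 256,            # hypothetical limit, not from the paper
) -> bool:
    """Reject reference images that are too elongated or too small."""
    if width <= 0 or height <= 0:
        return False
    # Ratio of the longer side to the shorter side, so it is always >= 1.
    aspect = max(width, height) / min(width, height)
    return aspect <= max_aspect_ratio and min(width, height) >= min_side
```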

The fourth stage is quality control and post-processing. A semantic alignment check is performed by an MLLM to verify that the synthesized reference image is semantically consistent with both the instruction and the target video. In addition, global de-duplication is carried out using CLIP features extracted from the reference images to remove redundant or near-identical references. The overall pipeline reduces an initial pool of 3.7M samples to 477K high-quality instruction–reference–video quadruplets.
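
Feature-based de-duplication of this kind is often done by thresholding cosine similarity. The sketch below shows a greedy variant on plain Python lists; the 0.95 threshold, the greedy keep-first policy, and the function names are assumptions, and the feature vectors are taken as given (for example, precomputed CLIP image embeddings).

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(features: List[Sequence[float]], threshold: float = 0.95) -> List[int]:
    """Return indices of items kept after greedy near-duplicate removal.

    An item is kept only if its similarity to every previously kept
    item is below `threshold` (a hypothetical value).
    """
    kept: List[int] = []
    for i, feat in enumerate(features):
        if all(cosine_similarity(feat, features[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

This pairwise loop is quadratic in the number of items, so a collection of this size would more plausibly rely on an approximate nearest-neighbor index; the sketch only illustrates the criterion.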

3. Dataset composition and statistical profile

RefVIE is described as “the largest open-source collection for reference-guided video editing.” Its principal unit is the quadruplet $(V_{src}, T_{inst}, I_{ref}, V_{tgt})$, which directly supports supervised learning of mappings from source video, instruction, and reference image to edited video output (Lin et al., 2 Mar 2026).

The dataset is reported to be “well-balanced across local object addition, replacement, and background changes.” This balance matters methodologically because an overconcentration in a single edit class would bias a model toward a narrow operational regime. The paper explicitly ties the dataset to tasks requiring fine-grained control over object identity, style, and background appearance.

Clip duration is another defining property. The paper states that most clips contain 80 to 110 frames. That temporal range is presented as sufficient for long-range motion consistency and for handling complex object movements during editing and tracking. A plausible implication is that RefVIE was designed not merely for framewise appearance transfer, but for temporally extended edit propagation under motion and occlusion.

The dataset also emphasizes content diversity in the reference images. Figures described in the paper show substantial variation across objects such as vehicles and clothing, across textures, and across indoor, outdoor, natural, and urban backgrounds. Because RefVIE is derived from Ditto, ReCo, and OpenVE, its domains follow those source datasets—human activities, objects, and scenes—but are augmented with aligned visual references. This derived character is important for interpreting both its strengths and its inherited distributional biases.

4. RefVIE-Bench and its evaluation protocol

RefVIE-Bench is the evaluation counterpart to the training dataset. It consists of 110 manually verified samples defined by triplets $(V_{src}, T_{inst}, I_{ref})$; unlike the training data, no target video is provided. Models must generate the edited video, which is then scored (Lin et al., 2 Mar 2026).

The benchmark covers two task types. Subject Reference contains 70 samples focused on object or subject modification, including object replacement and insertion conditioned on a reference object. Background Replacement contains 40 samples in which only the background should change while foreground motion and identity are preserved. Each sample is said to pass “a rigorous three-stage manual verification process” for diversity and quality.

The paper explicitly rejects traditional metrics such as CLIP score and FID as sufficient evaluators for this setting. CLIP is described as measuring global semantics without detailed reference fidelity or instruction compliance, while FID is described as distributional and unsuited to per-sample assessment of precise edits. Instead, the benchmark uses Gemini 3, specifically Gemini 2.5 Pro in the reported experiments, as an automated MLLM-based judge with task-specific prompts.

For Subject Reference tasks, the judge scores generated videos on a 1–5 scale along three dimensions: Identity Consistency, which measures how well the edited object matches the reference in identity, texture, and style; Temporal Consistency, which measures stability in shape and texture across frames and the absence of flickering or deformation; and Physical Integration, which measures motion tracking, occlusion handling, shadows, and perspective alignment. For Background Replacement tasks, the judge scores Reference Fidelity, Matting Quality, and Visual Harmony, with definitions centered respectively on matching the target background structure and style while preserving the foreground, boundary quality and temporal stability around the composite, and lighting, perspective, and overall realism.

A hierarchical constraint is built into the prompts. In Subject Reference evaluation, Temporal Consistency and Physical Integration are capped by Identity Consistency. In Background Replacement evaluation, Matting Quality and Visual Harmony are capped by Reference Fidelity. The purpose is to prevent semantically incorrect but visually stable edits from receiving overly high scores. Final benchmark results are reported as average scores per dimension and as an overall average across dimensions and samples.
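
The capping rule and the averaging can be illustrated numerically. In the benchmark the constraint is expressed inside the judge prompts; the sketch below instead applies it as a post-hoc `min`, purely to show its effect. The dictionary keys, function names, and the choice to compute the overall score as the mean of per-dimension means within one task type are assumptions.

```python
from statistics import mean
from typing import Dict, List, Sequence

# The first entry of each tuple is the dimension that caps the other two.
SUBJECT_DIMS = ("identity_consistency", "temporal_consistency", "physical_integration")
BACKGROUND_DIMS = ("reference_fidelity", "matting_quality", "visual_harmony")


def apply_cap(scores: Dict[str, float], dims: Sequence[str]) -> Dict[str, float]:
    """Cap every dimension at the value of the primary (first) dimension."""
    primary = scores[dims[0]]
    return {d: min(scores[d], primary) for d in dims}


def aggregate(samples: List[Dict[str, float]], dims: Sequence[str]) -> Dict[str, float]:
    """Average capped scores per dimension, plus an overall mean."""
    capped = [apply_cap(s, dims) for s in samples]
    result = {d: mean([c[d] for c in capped]) for d in dims}
    result["overall"] = mean([result[d] for d in dims])
    return result
```

For example, a subject edit judged 2 on identity but 5 on temporal stability ends up with a temporal score of 2, so a stable rendering of the wrong object cannot score well.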

5. Task formulation and integration into Kiwi-Edit

RefVIE underpins a unified editing task with source video $V_{src}$, instruction $T_{inst}$, and optional reference image $I_{ref}$ as input, and edited video $V_{tgt}$ as output. Within Kiwi-Edit, the underlying diffusion transformer is trained with Flow Matching, using an objective of the standard form (Lin et al., 2 Mar 2026)

$$\mathcal{L}_{FM} = \mathbb{E}_{t, x_0, \epsilon}\left[\left\| v_\theta(x_t, t, c) - (\epsilon - x_0) \right\|_2^2\right]$$

where $x_0$ is the latent of the target video $V_{tgt}$, $\epsilon$ is standard Gaussian noise, $x_t = (1 - t)\,x_0 + t\,\epsilon$ is the interpolated latent at time $t$, $c$ is the conditioning signal extracted from the MLLM and containing encoded $V_{src}$, $T_{inst}$, and optionally $I_{ref}$, and $v_\theta$ is the DiT velocity-field predictor. In this formulation, RefVIE supplies the reference-aware conditioning that makes learning from $I_{ref}$ possible in an open-source setting.
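
A flow-matching loss of this kind can be sketched on plain Python lists. The linear interpolation between the clean latent and noise, the sign convention of the velocity target (noise minus clean latent), and all function names are assumptions made for illustration; the paper's exact convention may differ.

```python
import random
from typing import Callable, List, Sequence


def interpolate(x0: Sequence[float], eps: Sequence[float], t: float) -> List[float]:
    """Linear path from the clean latent (t=0) to pure noise (t=1)."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]


def target_velocity(x0: Sequence[float], eps: Sequence[float]) -> List[float]:
    """Time derivative of the linear path: noise minus clean latent."""
    return [e - a for a, e in zip(x0, eps)]


def flow_matching_loss(
    predict_velocity: Callable[[List[float], float, object], List[float]],
    x0: Sequence[float],
    cond: object,
    rng: random.Random,
) -> float:
    """Single-sample Monte Carlo estimate of the squared-error objective.

    `predict_velocity(x_t, t, cond)` stands in for the DiT; `cond`
    stands in for the MLLM-derived conditioning signal.
    """
    t = rng.random()
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = interpolate(x0, eps, t)
    v_pred = predict_velocity(x_t, t, cond)
    v_true = target_velocity(x0, eps)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x0)
```

Under this convention, a predictor that returns `(x_t - x0) / t` recovers the target velocity exactly, which gives a simple sanity check for the loss.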

The training curriculum has three stages. Stage 1, MLLM–DiT Alignment, trains LoRA adapters, Query/Latent connectors, and learnable query tokens while keeping the MLLM and DiT frozen; it uses only image editing triplets from GPT-Image-Edit and NHR-Edit. Stage 2, Instructional Tuning, unfreezes the DiT and trains on large-scale text-only image and video editing data, including Ditto/ReCo/OpenVE triplets that pass the EditScore filter, under a resolution curriculum from 480p to 720p. Stage 3, Reference-Guided Fine-tuning, introduces RefVIE quadruplets and trains on mixed batches of image data, instruction-only video edits, and reference-guided video edits in a 2:1:1 ratio. The paper explicitly states that without RefVIE, Stage 3 would not be possible in an open-source setting.
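
As a worked example of the 2:1:1 mixture, the sketch below splits a batch among the three data sources. Whether the ratio is enforced within each batch or only on average across batches is not specified in the text above; the within-batch interpretation, the source names, and the function name are assumptions.

```python
from typing import Dict

# Stage 3 mixture as described: image : instruction-only video : reference video.
STAGE3_MIX: Dict[str, int] = {
    "image": 2,
    "instruction_video": 1,
    "reference_video": 1,
}


def batch_composition(batch_size: int, mix: Dict[str, int]) -> Dict[str, int]:
    """Split a batch among sources in proportion to integer weights."""
    total = sum(mix.values())
    if batch_size <= 0 or batch_size % total != 0:
        raise ValueError(f"batch_size must be a positive multiple of {total}")
    unit = batch_size // total
    return {name: weight * unit for name, weight in mix.items()}
```

With a batch of 8, this yields 4 image samples, 2 instruction-only video samples, and 2 reference-guided video samples.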

Kiwi-Edit’s conditioning design is tightly coupled to RefVIE. The MLLM is Qwen2.5-VL-3B with a frozen base and LoRA adapters, and receives an interleaved sequence of source video frames, instruction text, and optional reference image. Two streams are extracted from the MLLM output: Instructional Queries, implemented by learnable query tokens whose size varies by task—256 for images, 512 for video editing, and 768 for reference tasks—and Reference Latents, consisting of dense visual tokens from the reference image. These are projected via a Query Connector and a separate Latent Connector, then concatenated into Context Tokens used as key/value inputs in DiT cross-attention layers.
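
The size of the resulting context can be illustrated with a small bookkeeping sketch. The query-token counts are those quoted above; the number of reference latents depends on the reference image and is left as a parameter, and the task keys and function name are hypothetical.

```python
from typing import Dict

# Learnable query-token counts per task, as quoted in the text.
QUERY_TOKENS: Dict[str, int] = {"image": 256, "video": 512, "reference": 768}


def context_length(task: str, num_reference_latents: int = 0) -> int:
    """Number of context tokens passed to DiT cross-attention.

    Context tokens are the projected instructional queries, concatenated
    with projected reference latents when a reference image is present.
    """
    if task not in QUERY_TOKENS:
        raise KeyError(f"unknown task: {task}")
    if num_reference_latents < 0:
        raise ValueError("num_reference_latents must be non-negative")
    if task != "reference" and num_reference_latents:
        raise ValueError("reference latents only apply to reference tasks")
    return QUERY_TOKENS[task] + num_reference_latents
```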

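The model also employs structural control via latent injection. Source video frames are encoded through the VAE, passed through a source patch embedding layer, and added to the noisy latent, which can be written in the additive form

$$h_t = \mathrm{PatchEmbed}(x_t) + \mathrm{PatchEmbed}_{src}\big(\mathcal{E}(V_{src})\big)$$

where $x_t$ is the noisy latent, $\mathcal{E}$ is the VAE encoder, and $h_t$ is the resulting DiT input.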

For reference images, patch-embedded reference tokens are concatenated directly to the DiT input sequence, extending the attention window so the model can directly attend to reference textures. The paper states that both the Latent Connector and this reference-token concatenation are only exercised in Stage 3 with RefVIE data.

6. Empirical findings, limitations, and research implications

On RefVIE-Bench, the reported overall scores are 3.29 for Runway Aleph, 3.99 for Kling-O1, 2.96 for Kiwi-Edit trained on all data, and 3.31 for Kiwi-Edit fine-tuned with reference data only. For Subject Reference, the RefVIE-driven “Ref. data only” setting reaches Identity Consistency 3.98, Temporal Consistency 3.40, and Physical Consistency 3.34. For Background Replacement, it reaches Reference Similarity 3.72, Matting Quality 2.90, and Video Quality 2.51 (Lin et al., 2 Mar 2026).

These results are used to support two claims. First, dedicated reference-guided fine-tuning substantially improves reference-related metrics relative to the broader “All data” mixture, indicating that RefVIE provides signal not recovered by instruction-only supervision. Second, while Kling-O1 remains higher overall, Kiwi-Edit with RefVIE is presented as competitive with a strong commercial baseline, Runway Aleph, while remaining fully open. Ablation results further isolate the contribution of RefVIE-compatible architectural components: adding Reference Latents on top of query-based conditioning improves the subject editing score from 3.20 to 3.30.

Qualitative results are described as showing precise localization and modification of objects according to both text and reference, preservation of subject identity under strong background style changes, and better maintenance of identity, texture, and motion in reference-guided object replacement. These observations are consistent with the benchmark’s emphasis on identity fidelity and physical integration, although the paper’s qualitative claims remain tied to the examples shown in its figures.

The paper also identifies several limitations. The references are synthetic, generated by a pipeline using Qwen-Image-Edit-2511 and segmentation; as a result, they are often canonical clean crops or inpainted backgrounds and may differ from real user-supplied photos. RefVIE focuses on local edits and background replacement rather than motion retargeting or more complex global style sequences. Its domain distribution is inherited from Ditto, ReCo, and OpenVE, and videos are mostly 80–110 frames with training up to 720p. The paper does not provide an extensive bias analysis, but notes the general risks associated with Internet-scale media and implies misuse concerns such as realistic subject swapping.

The work also implies several future extensions. These include expanding RefVIE to additional edit types such as motion edits and attribute manipulation across time, supporting more varied reference forms such as multi-view reference videos or multi-image references, improving the realism and diversity of synthesized reference scaffolds, incorporating real user-provided references in future collections, and scaling RefVIE-Bench with more samples, more categories, and possibly human studies alongside MLLM-based scoring. This suggests that RefVIE is best understood not only as a static dataset, but as a formalization of reference-aware video editing supervision and evaluation in an open-source research setting.

References (1)

Lin et al. (2026). Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance.