
PISCO-Bench: Sparse Video Insertion Benchmark

Updated 4 July 2026
  • PISCO-Bench is a benchmark for precise video instance insertion under sparse control, using paired videos and verified instance annotations.
  • It challenges methods to insert specific instances while preserving spatial placement, coherent motion, and realistic scene interactions like occlusions and shadows.
  • Evaluation uses sparse-control protocols such as first-frame and first+last control, and reports quantitative metrics including FVD, LPIPS, PSNR, and SSIM.

PISCO-Bench is the evaluation benchmark introduced alongside “PISCO: Precise Video Instance Insertion with Sparse Control” (arXiv:2602.08277) to measure precise video instance insertion under sparse user control rather than generic video editing quality (Gao et al., 9 Feb 2026). In this setting, a method receives a real video of a scene without an object, together with a small number of user-provided instance keyframes and associated instance-side signals, and must insert that specific instance into the clean video at the intended place and time while preserving original background dynamics and producing plausible interactions such as occlusions, shadows, reflections, or illumination adaptation. The benchmark is therefore centered on instance-level controllability with minimal user effort, especially sparse temporal supervision rather than dense per-frame annotation. A recurring source of confusion is nomenclature: the 2018 astrophysical dataset paper titled “PISCO” does not define or name a benchmark called “PISCO-Bench” (Galbany et al., 2018).

1. Scope and task definition

PISCO-Bench is designed around a narrower and harder problem than generic editing, text-driven modification, or inpainting. Its target task is not arbitrary scene transformation, but insertion of a specific instance into an existing video under sparse controls. The paper frames this as requiring precise spatial-temporal placement, physically consistent scene interaction, and the faithful preservation of original dynamics, all under minimal user effort (Gao et al., 9 Feb 2026).

The surrounding task formulation uses paired videos $\{\hat{V}, V\}$ of length $T$, where $\hat{V}=\{\hat{V}_t\}_{t=1}^{T}$ contains the foreground instance and $V=\{V_t\}_{t=1}^{T}$ is the same scene without that instance. The objective is to synthesize an edited video $\tilde{V}$ that inserts the instance naturally into $V$, with correct spatial placement, temporally coherent motion, and plausible occlusion and depth relations. Instance-side signals are represented as RGB clip $I=\{I_t\}_{t=1}^{T}$, mask $M=\{M_t\}_{t=1}^{T}$, and instance depth $D_I=\{D_{I,t}\}_{t=1}^{T}$, together with background depth $D_V$. Sparse user control is encoded by a per-frame availability mask that indicates whether instance information is provided at time $t$. Only the instance-side signals ($I$, $M$, and $D_I$) are masked.
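The masking rule described above can be illustrated with a minimal Python sketch. The function name `mask_instance_signal`, the use of `None` as a placeholder for a withheld frame, and the string stand-ins for per-frame arrays are all assumptions made for illustration; the source does not specify how unavailable frames are represented.

```python
from typing import List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")


def mask_instance_signal(
    signal: Sequence[Frame], availability: Sequence[int]
) -> List[Optional[Frame]]:
    """Keep frame t of an instance-side signal only where availability[t] == 1."""
    if len(signal) != len(availability):
        raise ValueError("signal and availability must both have length T")
    # None is an illustrative placeholder for "not provided by the user".
    return [frame if flag == 1 else None for frame, flag in zip(signal, availability)]


T = 5
# Strings stand in for per-frame arrays (hypothetical toy data).
instance_rgb = [f"I_{t}" for t in range(1, T + 1)]
instance_mask = [f"M_{t}" for t in range(1, T + 1)]
background = [f"V_{t}" for t in range(1, T + 1)]

availability = [1, 0, 0, 0, 1]  # keyframes at the first and last frame

masked_rgb = mask_instance_signal(instance_rgb, availability)
masked_mask = mask_instance_signal(instance_mask, availability)

print(masked_rgb)   # instance frames survive only at the keyframes
print(background)   # the background video is never masked
```

The background list is deliberately left untouched, mirroring the statement that only instance-side signals are subject to the availability mask.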

This formalization matters because PISCO-Bench is built specifically to expose failure modes induced by sparse conditioning. Whole-sequence supervision is absent by design; the method must infer motion propagation, appearance continuity, and scene adaptation between a small number of anchor frames. This suggests that the benchmark is fundamentally about controllability under underdetermined temporal constraints, not merely about perceptual realism.

2. Construction and annotation design

PISCO-Bench is built on top of BURST. The authors “carefully select 100 videos” from BURST, with broad scenario coverage and no overlap with training data (Gao et al., 9 Feb 2026). The benchmark is intended to cover “diverse real-world conditions,” explicitly including urban driving scenes, scripted movie clips, and in-the-wild internet videos.

A key step in construction is annotation verification. For each selected video, the authors manually inspect BURST’s instance segmentation annotations and correct missing or low-quality masks so that conditioning inputs are reliable. In the paper’s terminology, this is what underlies the phrase “verified instance annotations.” The text does not provide the number of corrected instances, the number of categories, or a more detailed workforce or protocol description beyond manual inspection and correction.

To create the paired clean-background videos needed for insertion evaluation, the target instances are removed using ROSE, described as a “side-effect-aware instance removal model.” Each benchmark sample therefore contains a target video $\hat{V}$, a corresponding clean video $V$, verified or corrected masks $M$, and extractable segmented instance cutouts from $\hat{V}$. This paired construction is the main structural distinction between PISCO-Bench and typical editing benchmarks: it supplies both a clean background video and a ground-truth inserted version of the same scene.

The paper does not define train, validation, and test splits for PISCO-Bench. It presents the resource as an evaluation benchmark, not as a trainable benchmark with official splits. It also does not report several dataset statistics that would often accompany a benchmark release: there is no explicit number of clips per video, no total instance count, no category count, no duration statistics, no frame-rate distribution, no occlusion statistics, no camera-motion taxonomy, and no official split breakdown. Evaluation is standardized later at 49 frames and a fixed resolution for method comparison. That constitutes the effective experimental protocol, but it is not a description of native BURST video properties.

3. Sparse control protocols

PISCO-Bench evaluates methods under sparse-control instance insertion settings. The main benchmark protocol uses two quantitative settings: First-frame control and First-and-last-frame control (Gao et al., 9 Feb 2026).

In First-frame control, the segmented instance image and mask are provided only at the first frame. This is the most ambiguous regime, because the method must extrapolate the full trajectory, appearance evolution, and scene interaction from a single initial condition. In First-and-last-frame control, the segmented instance image and mask are provided at both temporal endpoints. This reduces drift and constrains both motion and appearance over the full sequence.

For scalability analysis, the paper additionally evaluates a Five-Frame setting for PISCO only: the first, last, and three random intermediate frames are provided. More generally, the method is designed to accommodate “a single keyframe, start-and-end keyframes, or sparse keyframes at arbitrary timestamps,” but only the first-only, first+last, and five-frame variants are quantitatively reported on the benchmark.
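As a rough illustration of the three quantitatively reported settings, the following Python sketch builds a per-frame availability list for each one. The function names and the 0/1 list encoding are hypothetical, and the random choice of intermediate frames uses a seeded `random.Random` only so the toy example is repeatable; the source does not say how the paper samples them.

```python
import random
from typing import List


def first_frame_mask(num_frames: int) -> List[int]:
    """Instance information is available at the first frame only."""
    if num_frames < 1:
        raise ValueError("need at least one frame")
    mask = [0] * num_frames
    mask[0] = 1
    return mask


def first_last_mask(num_frames: int) -> List[int]:
    """Instance information is available at both temporal endpoints."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    mask = first_frame_mask(num_frames)
    mask[-1] = 1
    return mask


def five_frame_mask(num_frames: int, rng: random.Random) -> List[int]:
    """First, last, and three distinct random intermediate frames."""
    if num_frames < 5:
        raise ValueError("need at least five frames")
    mask = first_last_mask(num_frames)
    # Intermediate indices exclude the two endpoints.
    for t in rng.sample(range(1, num_frames - 1), 3):
        mask[t] = 1
    return mask


T = 49  # clip length used for method comparison in the text
rng = random.Random(0)
for name, mask in [
    ("first-frame", first_frame_mask(T)),
    ("first+last", first_last_mask(T)),
    ("five-frame", five_frame_mask(T, rng)),
]:
    print(name, [t for t, flag in enumerate(mask) if flag])
```

Each setting adds keyframes to the previous one, which matches the idea of testing whether results improve as the sparse supervision becomes less ambiguous.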

Under these settings, the evaluated method must insert the specified instance at the correct place and time indicated by the sparse controls, preserve the identity and appearance of that instance, propagate motion plausibly between controls, preserve the original background dynamics, and adapt the scene to insertion-induced effects. A major design point of PISCO-Bench is therefore not just absolute quality, but graceful scaling with additional sparse control. The benchmark is meant to test whether performance improves as sparse supervision becomes less ambiguous.

4. Evaluation methodology

PISCO-Bench uses both reference-based and reference-free evaluation (Gao et al., 9 Feb 2026). The paired clean and target videos $\{\hat{V}, V\}$ make reference-based assessment possible in a way that most video editing benchmarks cannot support.

In the reference-based setting, the generated insertion result $\tilde{V}$ is compared directly against the target video $\hat{V}$ using FVD, LPIPS, PSNR, and SSIM. Evaluation is performed at two levels. The first is whole-video assessment, comparing $\tilde{V}$ and $\hat{V}$ over the entire sequence. This captures global fidelity, including whether the original scene is preserved and whether insertion-induced physical effects such as shadows or reflections appear plausibly in the full frame. The second is foreground assessment, isolating the inserted instance with mask $M$ and computing

$$\mathrm{Metric}\left(\tilde{V} \odot M,\ \hat{V} \odot M\right)$$

This isolates controllability and identity preservation, since whole-frame scores can conceal severe errors on the inserted object.
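The two evaluation levels can be sketched in plain Python using PSNR as the example metric. This is a toy illustration under stated assumptions: videos are lists of flattened frames with pixel values in [0, 1], and foreground isolation is assumed to mean element-wise multiplication of both videos by the instance mask before scoring. The helper names `psnr` and `apply_mask` are hypothetical, and the paper's exact masking convention may differ.

```python
import math
from typing import List

Video = List[List[float]]  # T frames, each a flat list of pixel values in [0, 1]


def psnr(a: Video, b: Video, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB over all pixels of all frames."""
    sq_err = 0.0
    count = 0
    for frame_a, frame_b in zip(a, b):
        for pa, pb in zip(frame_a, frame_b):
            sq_err += (pa - pb) ** 2
            count += 1
    mse = sq_err / count
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def apply_mask(video: Video, mask: Video) -> Video:
    """Zero out every pixel outside the instance mask."""
    return [[p * m for p, m in zip(f, mf)] for f, mf in zip(video, mask)]


# One 8-pixel frame: the instance occupies pixels 2 and 3.
target = [[0.2, 0.2, 0.8, 0.8, 0.2, 0.2, 0.2, 0.2]]
generated = [[0.3, 0.3, 0.4, 0.4, 0.3, 0.3, 0.3, 0.3]]
mask = [[0, 0, 1, 1, 0, 0, 0, 0]]

whole_psnr = psnr(generated, target)
foreground_psnr = psnr(apply_mask(generated, mask), apply_mask(target, mask))
print(f"whole-video PSNR: {whole_psnr:.2f} dB")
print(f"foreground PSNR:  {foreground_psnr:.2f} dB")
```

Under this convention, background errors drop out of the foreground score, so the two numbers respond to different failure sources.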

The benchmark also incorporates VBench for reference-free perceptual evaluation. Because standard VBench consistency metrics operate on whole frames, the authors adapt them using instance masks to isolate foreground and background regions, then compute CLIP- and DINO-feature-based consistency separately on those masked regions. The reported VBench metrics are Background Consistency, Subject Consistency, Aesthetic Quality, Imaging Quality, Motion Smoothness, Overall Consistency, Temporal Flickering, Temporal Style, and Average. Background and Subject Consistency are specifically computed in masked regions; other VBench metrics follow the official implementation.
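To make the idea of region-wise feature consistency concrete, here is a deliberately simplified Python sketch. It is not the official VBench implementation: it assumes per-frame feature vectors have already been extracted from masked frames by some encoder (for example, frames multiplied by the instance mask for the subject, or by its complement for the background), and it scores consistency as the mean cosine similarity between consecutive frames. The function names and the toy feature vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("feature vectors must be non-zero")
    return dot / norm


def temporal_consistency(features: List[Sequence[float]]) -> float:
    """Mean cosine similarity between features of consecutive frames."""
    if len(features) < 2:
        raise ValueError("need features for at least two frames")
    sims = [cosine(features[t - 1], features[t]) for t in range(1, len(features))]
    return sum(sims) / len(sims)


# Hypothetical features from subject-masked and background-masked frames.
subject_features = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.2], [0.2, 1.0, 0.1]]
background_features = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]

print("subject consistency:   ", round(temporal_consistency(subject_features), 3))
print("background consistency:", round(temporal_consistency(background_features), 3))
```

In this toy case the background is perfectly stable while the subject drifts, which is the kind of split a whole-frame consistency score would blur.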

This two-track design reflects the benchmark’s task definition. Reference-based metrics test fidelity to a known target insertion, while masked VBench scores separate background preservation from inserted-subject consistency. That separation is particularly important for instance insertion, where a method can preserve the scene while failing on the inserted object, or vice versa.

5. Baselines and empirical findings

Because there are no open-source methods with exactly the same sparse-conditioning interface, the paper compares PISCO against approximating baselines from three categories: an agentic pipeline combining image editing and image-to-video generation, video inpainting baselines, and reference-guided video-to-video editing baselines (Gao et al., 9 Feb 2026). The agentic pipeline uses Nano-banana-Pro for image editing and Wan2.2-Fun-A14B-InP for I2V generation. The inpainting baselines are CoCoCo and VideoPainter, with prompts produced by Qwen3-VL-32B-Instruct from the segmented reference instance image. The reference-guided editing baselines are VACE and UniVideo.

The comparison is explicitly imperfect. Several baselines require dense masks and/or text prompts, whereas PISCO operates with only sparse masks and sparse instance frames. Experimental comparisons are normalized at 49 frames and a fixed resolution, with 50 diffusion denoising steps.

The quantitative results establish a consistent trend. On whole-video reference-based metrics, PISCO-14B First+Last is best overall on FVD, LPIPS, and SSIM, with FVD 204, LPIPS 0.097, PSNR 26.58, and SSIM 0.89. Relative to the strongest baseline VACE-14B, this improves whole-video FVD from 371 to 204 and LPIPS from 0.103 to 0.097. On foreground metrics, gains are especially pronounced: VACE-14B achieves FVD 273, LPIPS 0.028, PSNR 30.55, and SSIM 0.98, whereas PISCO-14B First+Last achieves FVD 138, LPIPS 0.022, PSNR 33.58, and SSIM 0.98.
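The size of these gaps is easier to judge as relative changes. The short Python sketch below derives them from the numbers quoted above; the percentages are computed here for illustration and are not figures reported in the source.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which a lower-is-better metric drops versus a baseline."""
    return 100.0 * (baseline - ours) / baseline


# (VACE-14B, PISCO-14B First+Last) pairs quoted in the text; lower is better.
whole_video = {"FVD": (371, 204), "LPIPS": (0.103, 0.097)}
foreground = {"FVD": (273, 138), "LPIPS": (0.028, 0.022)}

for level, metrics in [("whole-video", whole_video), ("foreground", foreground)]:
    for name, (baseline, ours) in metrics.items():
        print(f"{level} {name}: {relative_reduction(baseline, ours):.1f}% lower")

# PSNR is higher-is-better, so report the absolute gain in dB instead.
foreground_psnr_gain = 33.58 - 30.55
print(f"foreground PSNR gain: {foreground_psnr_gain:.2f} dB")
```

Read this way, the FVD reductions are large at both levels, while the LPIPS change is modest on whole frames and considerably larger on the foreground region.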

The paper explicitly emphasizes that endpoint constraints improve dynamics: first+last is consistently better than first-only. The Five-Frame setting strengthens this pattern. For PISCO-14B, five frames improve whole-video FVD to 136 and foreground LPIPS to 0.015. In reference-free VBench evaluation, PISCO-14B First+Last attains the best reported Average score at 65.64, with Subject Consistency 91.57, Aesthetic Quality 50.08, Imaging Quality 62.00, and Overall Consistency 15.64. The five-frame condition improves further, with Subject Consistency 91.98 and Aesthetic Quality 51.45.

Qualitatively, the paper attributes baseline failure modes to their differing interfaces. Agentic pipelines suffer severe background hallucination because they regenerate the whole sequence. Inpainting baselines often hallucinate blurry or wrong objects because they lack direct reference-instance inputs. UniVideo and VACE are described as struggling with long-term spatial control, including incorrect scale or loss of the target object later in the sequence. By contrast, PISCO, especially in first+last mode, is reported to provide better spatiotemporal alignment, preserve scene dynamics, and respect the target trajectory.

6. Methodological context and benchmark-specific caveats

PISCO-Bench exists because prior benchmarks do not isolate the paper’s central challenge: precise insertion of a specific instance into an existing video under sparse controls (Gao et al., 9 Feb 2026). Generic editing benchmarks emphasize broad prompt-driven modification, while inpainting benchmarks usually assume dense masks. Neither directly tests whether a method can preserve the original scene while propagating a sparsely specified inserted object with correct identity, motion, and interactions.

To understand the benchmark’s intended failure modes, three method components in the surrounding paper are especially relevant. Variable-Information Guidance (VIG) trains across a spectrum of sparse-conditioning regimes by sampling availability masks during training.

Distribution-Preserving Temporal Masking (DPTM) addresses the distribution shift caused by sparse temporal conditioning in pretrained temporal VAEs, using pixel-space temporal completion, token-space masking, and an availability channel aligned to compressed token resolution. Geometry-aware conditioning uses both background depth $D_V$ and instance depth $D_I$, alongside RGB, masks, and availability, to support depth ordering and occlusion handling. These are not benchmark mechanics per se, but they explain why PISCO-Bench is a strong stress test for sparse-conditioning robustness.

Several caveats follow from the paper’s own presentation. PISCO-Bench is curated rather than large-scale: it contains 100 BURST-derived videos. Clean background videos are produced by an instance removal model (ROSE) rather than by real paired capture. The benchmark is also underspecified in several descriptive respects, since category count, split definition, duration statistics, and exact instance counts are not reported. A reasonable inference is that removal artifacts or removal biases could influence evaluation, because the “clean” videos are generated rather than directly observed. Another reasonable inference is that the benchmark may be biased toward instances that are sufficiently segmentable and removable, because both verified masks and clean-background construction depend on those operations.

The paper also does not report a user study. Evaluation therefore leans heavily on automated metrics. Finally, the comparison protocol is fixed to 49 frames at a fixed resolution, so the benchmark as used in the paper does not probe long-form insertion behavior beyond that window, even though the broader PISCO system is later extended to 120 frames at 720p. This suggests that PISCO-Bench should be understood as a targeted benchmark for sparse controllability and paired evaluation, not as a comprehensive benchmark for every form of video compositing or open-ended video editing.
