Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation

Published 12 Apr 2026 in cs.CV | (2604.10837v1)

Abstract: Image-to-video (I2V) generation has the potential for societal harm because it enables the unauthorized animation of static images to create realistic deepfakes. While existing defenses effectively protect against static image manipulation, extending these to I2V generation remains underexplored and non-trivial. In this paper, we systematically analyze why modern I2V models are highly robust against naive image-level adversarial attacks (i.e., immunization). We observe that the video encoding process rapidly dilutes the adversarial noise across future frames, and the continuous text-conditioned guidance actively overrides the intended disruptive effect of the immunization. Building on these findings, we propose the Immune2V framework which enforces temporally balanced latent divergence at the encoder level to prevent signal dilution, and aligns intermediate generative representations with a precomputed collapse-inducing trajectory to counteract the text-guidance override. Extensive experiments demonstrate that Immune2V produces substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces a dual-objective immunization method that targets both spatial-temporal and semantic streams to induce persistent collapse in video synthesis.
It employs a temporally balanced loss in the 3D VAE and manipulates semantic representations to override prompt-based alignment, ensuring robust defense.
Experimental results demonstrate significant degradation in video quality and prompt fidelity compared to prior image-only defenses.

Immune2V: Imperceptible Image Immunization Against Dual-Stream Image-to-Video Generation

Motivation and Problem Analysis

The proliferation of high-fidelity I2V diffusion models enables arbitrary, realistic animation of static images using paired text-prompt conditioning—a double-edged advancement that raises privacy and misuse concerns, especially in the generation of deepfakes. While input-level adversarial perturbations, or “immunization,” have demonstrated efficacy for image-level editing protection, the direct extension of these defenses to I2V pipelines has not achieved persistent or effective results. The paper “Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation” (2604.10837) interrogates why this is the case by analyzing the architectural and optimization characteristics of modern dual-stream I2V frameworks.

A central finding is that dual-stream I2V architectures—exemplified by recent SOTA models—instantiate both a spatial-temporal pathway (typically a causal 3D VAE) and a semantic conditioning pathway (e.g., via CLIP/T5-based cross-attention) (Figure 1). This composition introduces two defense-resilient bottlenecks: (1) Perturbation attenuation in the spatial-temporal stream, where adversarial signals affecting only the input frame rapidly dissipate and fail to persist across the generated temporal sequence; and (2) Semantic conditioning override, where continuous text-prompt matching actively “heals” the effects of input perturbations, realigning generative trajectories regardless of input-level noise.

Figure 1: The dual-stream I2V architecture processes an image via spatial-temporal and semantic streams; immunization must target both to induce persistent collapse.

Empirical observations based on gradient analysis and latent distance metrics underscore the magnitude of spatiotemporal attenuation—the influence and optimization signal for later frames diminishes almost to zero if only the image’s first frame is perturbed. Similarly, latent trajectory tracking in the denoising process shows that when a semantically aligned prompt is used, semantic guidance forcibly realigns generative outcomes even if the spatial stream is compromised (Figure 2).

Figure 2: Semantic conditioning can override encoder-level corruption, but Immune2V’s joint optimization dampens semantic realignment.

Methodology: Dual-Stream Joint Optimization with Immune2V

To circumvent the failings of spatial-only or semantic-only immunization, Immune2V introduces a dual-objective optimization targeting both the spatial-temporal encoder and the conditional semantic stream (Figure 3).

Spatial-Temporal Attack: Temporal attenuation is countered by introducing a temporally balanced encoder-level loss. Instead of aligning only the initial frame’s encoder output with a corrupted target, the framework enforces deviation at every temporal slice of the latent pseudo-video. The target sequence is constructed by replicating a selected collapse-inducing image across all time steps, feeding it into the 3D VAE so that each latent slice becomes corrupted independently, with corresponding loss terms providing persistent gradient signals throughout the sequence.
Semantic Attack: To neutralize prompt-based overriding, the semantic conditioning pathway is manipulated by precomputing a collapse trajectory (generated by running the clean image with an intentionally mismatched “bad” prompt) and enforcing the adversarial image’s intermediate cross-attention representations (when paired with a correct prompt) to mimic this degradation. This process requires extracting latent representations from the semantic transformer backbone at selected denoising steps, adding a loss that aligns them with the collapse-inducing trajectory irrespective of prompt coherence.
Joint Objective and Optimization: The overall adversarial signal is the weighted sum of the spatial-temporal and semantic objectives, with the image-space perturbation constrained under an $L_\infty$ norm to maintain perceptual imperceptibility. Projected Gradient Descent (PGD) is adopted for optimization.
Figure 3: The Immune2V pipeline jointly attacks spatial-temporal and semantic pathways, sustaining video-wide disruption.

Experimental Results and Analysis

The experimental protocol evaluates the resilience of Immune2V-immunized images against SOTA dual-stream I2V generators (Wan, DynamiCrafter, I2VGen-XL). The authors introduce both automated and VLM-based (Gemini) evaluation across video quality, temporal consistency, subject preservation, and motion smoothness metrics, benchmarking against clean inputs, random noise, and adapted image-only immunization (PhotoGuard variants). A broad dataset is synthesized using query+prompt pairs generated for the DAVIS benchmark.

Immune2V achieves the strongest and most persistent degradation in both conditioning fidelity and structural video quality metrics, outperforming prior image-only defenses by wide margins on text-prompt alignment, subject consistency, and motion plausibility—all with minimal perceptual corruption on the initial frame. Ablation studies demonstrate that eliminating either the semantic or spatial component substantially reduces attack efficacy (see Table 1). Increasing the perturbation budget amplifies degradation at the expense of input fidelity, confirming the importance of the selected trade-off.

Qualitative results reveal severe video breakdown, persistent distortion, and loss of motion regularity throughout the sequence, with immunized images yielding videos that fail to synthesize coherent subject or trajectory dynamics (Figures 5 and 6). In contrast, baseline defenses either leave the generative sequence largely unaffected or introduce visually noticeable artifacts only on the first frame.

Figure 4: Qualitative comparisons show that Immune2V consistently produces persistent collapse across video frames, unlike baselines that either fail to affect motion or overly degrade the initial frame only.

Figure 5: Additional qualitative results show persistent, temporally coherent breakdown in the spatial organization and subject consistency of videos generated from immunized images.

Robustness is further confirmed on alternative I2V architectures, and results generalize beyond the original model employed for optimization.

Theoretical and Practical Implications

This analysis establishes that defenses against unauthorized I2V generation require explicit video-aware adversarial mechanisms targeting both the spatial and semantic branches of dual-stream pipelines. Theoretical implications include the identification of model-level bottlenecks that make naive immunization strategies ineffective. Practically, Immune2V equips image owners with a tool to proactively safeguard their static visual data against animation-based abuse—an increasingly relevant threat as diffusion-based I2V models become ubiquitous and more robust.

By demonstrating the resilience of dual-stream architectures and the necessity of joint-space adversarial optimization, this work suggests future I2V protection will require even more sophisticated multi-modal and temporally distributed attack objectives. The effectivity of semantic-pathway manipulation should invite further research into cross-modal adversarial robustness and the design of conditionally robust generative models.

Speculation on Future Developments

Given the ongoing pace of video synthesis research, prospective developments may include: (1) generalized immunization frameworks that transfer across unseen architectures and text-prompt distributions; (2) robustifying dual-stream pipelines by introducing regularizers aware of adversarial immunization signals; and (3) extending immunization to higher-order conditioning, e.g., 3D, multimodal, or event-based I2V systems. The integration of proactive protection mechanisms like Immune2V into public datasets and social platforms may become standard for safeguarding identity and content provenance.

Conclusion

Immune2V systematically exposes and addresses the inherent limitations of input-level defenses in dual-stream I2V models by targeting both spatial-temporal and semantic conditioning pathways with imperceptible joint adversarial noise. It achieves strong, persistent, and video-wide disruption without degrading perceptual input quality, outperforming previous image-level defenses. These findings set a new benchmark for proactive video-level content protection, and outline critical theoretical considerations for future generative model security (2604.10837).

Markdown Report Issue