
V-Warper: Training-Free Video Personalization

Updated 20 December 2025
  • V-Warper is a training-free, coarse-to-fine personalization framework that decouples subject identity encoding from high-frequency appearance detail injection.
  • It leverages lightweight image-stage adaptation with LoRA modules and a learnable subject token to capture global identity from limited reference imagery.
  • During inference, a dual-branch process uses RoPE-free semantic correspondences and spatial masking to inject fine appearance details, ensuring robust video generation.

V-Warper is a training-free, coarse-to-fine personalization framework for transformer-based video diffusion models, designed for user-driven video personalization where fine-grained appearance fidelity and prompt alignment are critical. Unlike previous methods that require heavy video-based finetuning or access to large video datasets, V-Warper decouples subject identity encoding from high-frequency appearance detail injection. It leverages a lightweight image-stage adaptation with Low-Rank Adapter (LoRA) modules and a learnable subject token, followed by an inference-only appearance warping stage guided by semantic correspondences derived from RoPE-free transformer mid-layer attention features and spatial masking. This architecture yields efficient, scalable, and robust video generation faithful to both prompt and subject identity (Lee et al., 13 Dec 2025).

1. Formulation and Objectives

V-Warper addresses the task of generating subject-driven videos from limited reference imagery and a text prompt $p$, aiming to maintain both prompt-conformant motion/scene dynamics and faithful reproduction of fine-grained subject appearance. The approach introduces a dedicated subject token $s$ with embedding $e_s \in \mathbb{R}^d$ in the text encoder's space. During adaptation, $e_s$ is optimized so that when composing the prompt $c = [\text{``a video of''}, s, p]$, the diffusion model $\varepsilon_\theta(x_t, t, c)$ synthesizes images consistent with the subject's reference appearance.

Diffusion process conditioning is structured as follows: at each denoising timestep $t$, the model receives a latent $x_t$ and the conditioning vector $c = \mathrm{concat}(e_s, E_{txt}(p))$, and is trained using the standard objective

$$L = \mathbb{E}_{x, \varepsilon, t}\left\| \varepsilon_\theta(x_t, t, c) - \varepsilon \right\|_2^2$$
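As a concrete illustration, the sketch below computes this objective for one training batch. The names `eps_model`, `text_encoder`, and the `scheduler` interface (`num_timesteps`, `add_noise`) are generic stand-ins for the actual backbone, not V-Warper's released code.

```python
import torch
import torch.nn.functional as F

def denoising_loss(eps_model, text_encoder, e_s, x0, prompt_ids, scheduler):
    """One step of the coarse-adaptation objective: predict the noise added
    to x0 under the composed conditioning c = concat(e_s, E_txt(p))."""
    b = x0.shape[0]
    t = torch.randint(0, scheduler.num_timesteps, (b,), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, eps, t)                  # forward diffusion
    e_p = text_encoder(prompt_ids)                         # (b, L, d) prompt embedding
    c = torch.cat([e_s.view(1, 1, -1).expand(b, 1, -1), e_p], dim=1)  # prepend subject token
    eps_hat = eps_model(x_t, t, c)                         # noise prediction
    return F.mse_loss(eps_hat, eps)                        # || eps_hat - eps ||^2
```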

The internal Multi-Modal Attention (MMA) mechanism in the transformer attends over both visual and textual tokens, including the learnable subject embedding $e_s$:

$$\mathrm{MMA}([X; C_T]) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

where $C_T$ contains all relevant text and subject tokens.
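A minimal sketch of this joint attention follows, assuming single-head projections passed in as plain weight matrices (the real backbone is multi-head with additional normalization):

```python
import torch

def mma(X, C_T, W_q, W_k, W_v):
    """X: (N, d) visual tokens; C_T: (L, d) text + subject tokens."""
    Z = torch.cat([X, C_T], dim=0)                  # joint sequence, (N + L, d)
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v             # shared projections
    d = Q.shape[-1]
    A = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)   # every token attends to all tokens
    return A @ V                                    # (N + L, d)
```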

2. Coarse Appearance Adaptation

The initial stage aims to fit a minimal set of adapter parameters so the model can encode the subject's global identity—encompassing pose, shape, and prominent colors—from as few as a handful of reference images while keeping motion and temporal priors intact.

A LoRA module is attached to each of the transformer's attention layers, augmenting the key, value, and output projections (but not the queries, which maintain motion coherence). For any attention layer's projection weight $W \in \mathbb{R}^{d \times d}$, learnable factors $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$ with $r = 128$ are optimized to yield $W' = W + AB$. Only the adapters for keys, values, and outputs are updated. In parallel, the new subject token's embedding $e_s \in \mathbb{R}^d$ is trained. Both parameter sets are updated using the denoising loss on the reference images, requiring only a few hundred gradient steps.

This process operates with high efficiency: the parameter footprint is limited to $\mathcal{O}(dr)$ per projection plus a single $d$-dimensional token, and no video sequences are needed. The method captures global subject identity with minimal memory and computation.
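The following sketch shows one way to realize this adapter scheme in PyTorch; the attribute names `to_k`, `to_v`, and `to_out` are assumptions about the backbone's layer naming rather than V-Warper's actual code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with rank-r factors so that W' = W + AB."""
    def __init__(self, base: nn.Linear, r: int = 128):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                           # freeze pretrained W
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 1e-2)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))

    def forward(self, x):
        return self.base(x) + x @ self.A @ self.B             # W x + (AB) x

def add_lora_kvo(attn, r: int = 128):
    """Attach LoRA to the key/value/output projections; queries stay untouched."""
    attn.to_k = LoRALinear(attn.to_k, r)
    attn.to_v = LoRALinear(attn.to_v, r)
    attn.to_out = LoRALinear(attn.to_out, r)
    return attn
```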

3. Inference-Time Fine Appearance Injection

To address the limitations of coarse adaptation, chiefly the lack of subtle textures, fine color gradations, and small geometric details, V-Warper employs a dual-branch, test-time procedure to inject high-frequency appearance features.

3.1 Semantic Correspondence via RoPE-Free Q–K

At each inference step $t-1$, mid-layer ($L = 12$) query and key features are extracted from both the reference inversion and generation branches, with rotary position embeddings (RoPE) removed to eliminate positional bias. This results in $\overline{Q}_t, \overline{K}_t \in \mathbb{R}^{N \times d}$, where $N$ is the number of visual tokens.

Bidirectional semantic correspondence matrices are computed:

  • $C^{gen \rightarrow ref}_t = \mathrm{softmax}\!\left(\overline{Q}^{gen}_{t-1} (\overline{K}^{ref}_{t-1})^\top / \sqrt{d}\right)$
  • $C^{ref \rightarrow gen}_t = \mathrm{softmax}\!\left(\overline{Q}^{ref}_{t-1} (\overline{K}^{gen}_{t-1})^\top / \sqrt{d}\right)$

Symmetrization yields $\widehat{C}_t = \frac{1}{2}\left(C^{gen \rightarrow ref}_t + (C^{ref \rightarrow gen}_t)^\top\right)$, from which the maximal flows $F^{gen \rightarrow ref}_t$ and $F^{ref \rightarrow gen}_t$ are derived.
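A schematic implementation of this correspondence step is given below; taking the row-wise and column-wise argmax of the symmetrized matrix is one plausible reading of "maximal flows", not necessarily the paper's exact procedure.

```python
import torch

def semantic_flows(Q_gen, K_gen, Q_ref, K_ref):
    """All inputs are RoPE-free mid-layer features of shape (N, d)."""
    d = Q_gen.shape[-1]
    C_g2r = torch.softmax(Q_gen @ K_ref.T / d ** 0.5, dim=-1)   # gen -> ref
    C_r2g = torch.softmax(Q_ref @ K_gen.T / d ** 0.5, dim=-1)   # ref -> gen
    C_hat = 0.5 * (C_g2r + C_r2g.T)                             # symmetrized correspondence
    F_g2r = C_hat.argmax(dim=1)   # best reference match per generation token
    F_r2g = C_hat.argmax(dim=0)   # best generation match per reference token
    return C_hat, F_g2r, F_r2g
```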

3.2 Spatially Reliable Masking

Two binary masks are constructed on the generation branch:

  • Foreground mask $M_{fg}$ identifies tokens where the average subject-token attention exceeds a threshold $\tau_{fg}$.
  • Cycle-consistency mask $M_{cc}$ selects tokens with round-trip flow error $E_{cc}$ less than $\tau_{cc} \cdot (HW / \sum M_{fg})$.

The final injection mask $M_t = M_{fg} \odot M_{cc}$ restricts appearance updates to spatially and semantically reliable regions.
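The mask construction could look roughly as follows; the inputs `attn_s` and `pos`, the token-count normalizer `HW`, and the threshold values `tau_fg`/`tau_cc` are illustrative assumptions rather than the paper's settings.

```python
import torch

def injection_mask(attn_s, F_g2r, F_r2g, pos, HW, tau_fg=0.3, tau_cc=2.0):
    """attn_s: (N,) average subject-token attention on the generation branch;
    pos: (N, 2) token grid coordinates; HW: spatial token count per frame."""
    M_fg = (attn_s > tau_fg).float()                            # foreground mask
    roundtrip = F_r2g[F_g2r]                                    # gen -> ref -> gen
    E_cc = (pos.float() - pos[roundtrip].float()).norm(dim=-1)  # round-trip flow error
    M_cc = (E_cc < tau_cc * HW / M_fg.sum().clamp(min=1.0)).float()
    return M_fg * M_cc                                          # M_t = M_fg * M_cc
```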

3.3 Value Feature Warping

Within each MMA layer at step $t$, the reference visual-token value matrix $V^{ref}_{t,vid}$ is linearly warped into alignment with the generation branch according to $F^{ref \rightarrow gen}_t$, giving $V^{warp}_{t,vid} = \mathcal{W}(V^{ref}_{t,vid}; F^{ref \rightarrow gen}_t)$. Injection into the generation-branch values $V^{gen}_{t,vid}$ is performed as:

$$\widehat{V}^{gen}_{t,vid} = M_t \odot V^{warp}_{t,vid} + (1 - M_t) \odot V^{gen}_{t,vid}$$

Text-token values remain unchanged. This process exposes the generation branch to high-frequency, appearance-rich details at semantically aligned locations, preserving prompt and motion fidelity.
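A gather-based sketch of this warp-and-inject step is shown below; representing the warp $\mathcal{W}$ as a per-generation-token index gather from the reference values is an assumption about its realization.

```python
import torch

def inject_warped_values(V_ref_vid, V_gen_vid, match_to_ref, M_t):
    """V_*_vid: (N, d) visual-token value matrices; match_to_ref: (N,) index of
    the reference token matched to each generation token; M_t: (N,) binary mask."""
    V_warp = V_ref_vid[match_to_ref]                 # gather-based warp of V_ref
    M = M_t.unsqueeze(-1)                            # broadcast mask over channels
    return M * V_warp + (1.0 - M) * V_gen_vid        # masked injection
```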

4. Algorithmic Outline

The following outlines the core V-Warper pipeline:

  1. Coarse Adaptation (training phase)
    • Randomly initialize the LoRA adapters and the subject embedding $e_s$.
    • For each training step:
      • Sample a reference image, noise, and timestep.
      • Compose the prompt embedding with the subject token $s$.
      • Predict the noise $\hat{\varepsilon}$ using the LoRA-augmented model.
      • Compute the loss and update the LoRA parameters and $e_s$ via backpropagation.
  2. Inference with Fine Appearance Injection (test phase)
    • Obtain reference inversion through DDIM inversion.
    • From noise, sample the generation branch.
    • For each denoising step (selected injection range):
      • Denoise both branches.
      • Extract RoPE-free Q, K features.
      • Compute and symmetrize correspondences; derive flow fields.
      • Calculate the masks $M_t$.
      • For selected MMA layers, inject warped value features where masked.
    • Decode the latent to obtain the output video.
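The outline above might be wired together roughly as in the driver sketch below; every callable argument is a placeholder for model-specific machinery, and `semantic_flows` refers to the sketch in Section 3.1.

```python
from typing import Callable, Sequence
import torch

def vwarper_inference(
    denoise_step: Callable,    # one denoising update for a latent at timestep t
    extract_qk: Callable,      # returns RoPE-free mid-layer (Q, K) features
    build_mask: Callable,      # computes M_t from subject attention and flows
    set_injection: Callable,   # registers warped-value injection in selected MMA layers
    decode: Callable,          # latent -> output video
    z_ref: torch.Tensor,       # DDIM-inverted reference latent
    z_gen: torch.Tensor,       # initial noise for the generation branch
    timesteps: Sequence[int],
    inject_steps: Sequence[int],
):
    for t in timesteps:
        if t in inject_steps:
            Qr, Kr = extract_qk(z_ref, t)
            Qg, Kg = extract_qk(z_gen, t)
            C_hat, F_g2r, F_r2g = semantic_flows(Qg, Kg, Qr, Kr)
            M_t = build_mask(F_g2r, F_r2g)
            set_injection(F_g2r, M_t)        # warp reference values into masked regions
        z_ref = denoise_step(z_ref, t)       # reference inversion branch
        z_gen = denoise_step(z_gen, t)       # generation branch
    return decode(z_gen)
```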

5. Experimental Evaluation

On a benchmark comprising 10 subjects and 10 prompts, V-Warper outperforms established baselines including DreamVideo, VideoBooth, SDVG, and VACE across both fine detail (DINO-I) and global identity (CLIP-I) metrics:

Model        DINO-I   CLIP-I   CLIP-T
DreamVideo   0.322    0.641    —
VideoBooth   0.349    0.634    —
SDVG         0.661    0.787    —
VACE         0.651    0.796    0.326
V-Warper     0.738    0.825    0.297

V-Warper achieves DINO-I 0.738 (fine detail) and CLIP-I 0.825 (identity), and remains competitive on CLIP-T (text alignment) at 0.297. User studies (N = 20 raters, 40 comparisons) show a consistent preference for V-Warper over VACE in text alignment, subject fidelity, motion coherence, and overall quality.

Ablation studies reveal:

  • Coarse adaptation alone achieves reasonable identity (DINO-I=0.645) but lacks texture detail.
  • Value warping further improves DINO-I to 0.701 but may degrade CLIP-T due to appearance leakage.
  • Adding the spatial masking recovers text alignment, balancing fidelity and alignment (DINO-I = 0.656, CLIP-I = 0.806, CLIP-T = 0.320).

6. Limitations and Prospects

V-Warper still requires per-subject optimization, although only a small token embedding and the LoRA adapters are trained; eliminating this upfront adaptation is a target for future research. Leaving the query projections un-adapted generally preserves the learned temporal priors, but may attenuate some nuances of motion expressiveness; hybridization with fine-grained temporal modeling could improve this aspect. Inference-time overhead is modest (correspondence and warping across roughly ten transformer layers), so deployment remains practical on a single GPU. Further efficiency gains are possible by accelerating the correspondence and warping subroutines.

A plausible implication is that correspondence-guided, training-free subject personalization generalizes beyond video diffusion to other generative modalities. V-Warper exemplifies a state-of-the-art methodology where LoRA-based global adaptation and RoPE-free, semantic value warping are combined to reliably translate reference appearance into temporally coherent videos, all without video-based finetuning (Lee et al., 13 Dec 2025).
