
V-Warper: Training-Free Video Personalization

Updated 20 December 2025
  • V-Warper is a training-free, coarse-to-fine personalization framework that decouples subject identity encoding from high-frequency appearance detail injection.
  • It leverages lightweight image-stage adaptation with LoRA modules and a learnable subject token to capture global identity from limited reference imagery.
  • During inference, a dual-branch process uses RoPE-free semantic correspondences and spatial masking to inject fine appearance details, ensuring robust video generation.

V-Warper is a training-free, coarse-to-fine personalization framework for transformer-based video diffusion models, designed for user-driven video personalization where fine-grained appearance fidelity and prompt alignment are critical. Unlike previous methods that require heavy video-based finetuning or access to large video datasets, V-Warper decouples subject identity encoding from high-frequency appearance detail injection. It leverages a lightweight image-stage adaptation with Low-Rank Adapter (LoRA) modules and a learnable subject token, followed by an inference-only appearance warping stage guided by semantic correspondences derived from RoPE-free transformer mid-layer attention features and spatial masking. This architecture yields efficient, scalable, and robust video generation faithful to both prompt and subject identity (Lee et al., 13 Dec 2025).

1. Formulation and Objectives

V-Warper addresses the task of generating subject-driven videos from limited reference imagery and a text prompt $p$, aiming to maintain both prompt-conformant motion/scene dynamics and faithful reproduction of fine-grained subject appearance. The approach introduces a dedicated subject token $s$ with embedding $e_s \in \mathbb{R}^d$ in the text encoder's space. During adaptation, $e_s$ is optimized so that when composing the prompt $c = [\text{``a video of''}, s, p]$, the diffusion model $\varepsilon_\theta(x_t, t, c)$ synthesizes images consistent with the subject's reference appearance.

Diffusion process conditioning is structured as follows: at each denoising timestep $t$, the model receives a latent $x_t$ and the conditioning vector $c = \mathrm{concat}(e_s, E_{txt}(p))$, and is trained using the standard objective

$$L = \mathbb{E}_{x, \varepsilon, t}\left\| \varepsilon_\theta(x_t, t, c) - \varepsilon \right\|_2^2$$
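As a concrete illustration, the sketch below computes this objective for one training batch. The names `eps_model`, `text_encoder`, and the `scheduler` interface (`num_timesteps`, `add_noise`) are generic stand-ins for the actual backbone, not V-Warper's released code.

```python
import torch
import torch.nn.functional as F

def denoising_loss(eps_model, text_encoder, e_s, x0, prompt_ids, scheduler):
    """One step of the coarse-adaptation objective: predict the noise added
    to x0 under the composed conditioning c = concat(e_s, E_txt(p))."""
    b = x0.shape[0]
    t = torch.randint(0, scheduler.num_timesteps, (b,), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, eps, t)                  # forward diffusion
    e_p = text_encoder(prompt_ids)                         # (b, L, d) prompt embedding
    c = torch.cat([e_s.view(1, 1, -1).expand(b, 1, -1), e_p], dim=1)  # prepend subject token
    eps_hat = eps_model(x_t, t, c)                         # noise prediction
    return F.mse_loss(eps_hat, eps)                        # || eps_hat - eps ||^2
```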

The internal Multi-Modal Attention (MMA) mechanism in the transformer attends over both visual and textual tokens, including the learnable subject embedding $e_s$:

$$\mathrm{MMA}([X; C_T]) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

where $C_T$ contains all relevant text and subject tokens.
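A minimal sketch of this joint attention follows, assuming single-head projections passed in as plain weight matrices (the real backbone is multi-head with additional normalization):

```python
import torch

def mma(X, C_T, W_q, W_k, W_v):
    """X: (N, d) visual tokens; C_T: (L, d) text + subject tokens."""
    Z = torch.cat([X, C_T], dim=0)                  # joint sequence, (N + L, d)
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v             # shared projections
    d = Q.shape[-1]
    A = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)   # every token attends to all tokens
    return A @ V                                    # (N + L, d)
```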

2. Coarse Appearance Adaptation

The initial stage aims to fit a minimal set of adapter parameters so the model can encode the subject's global identity—encompassing pose, shape, and prominent colors—from as few as a handful of reference images while keeping motion and temporal priors intact.

A LoRA module is attached to each of the transformer's attention layers, augmenting the key, value, and output projections (but not the queries, which maintain motion coherence). For any attention layer's projection weight $W \in \mathbb{R}^{d \times d}$, learnable factors $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$ with $r = 128$ are optimized to yield $W' = W + AB$. Only the adapters for keys, values, and outputs are updated. In parallel, the new subject token's embedding $e_s \in \mathbb{R}^d$ is trained. Both parameter sets are updated using the denoising loss on the reference images, requiring only a few hundred gradient steps.

This process operates with high efficiency: the parameter footprint is limited to $\mathcal{O}(dr)$ per projection plus a single $d$-dimensional token, and no video sequences are needed. The method captures global subject identity with minimal memory and computation.
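The following sketch shows one way to realize this adapter scheme in PyTorch; the attribute names `to_k`, `to_v`, and `to_out` are assumptions about the backbone's layer naming rather than V-Warper's actual code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with rank-r factors so that W' = W + AB."""
    def __init__(self, base: nn.Linear, r: int = 128):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                           # freeze pretrained W
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 1e-2)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))

    def forward(self, x):
        return self.base(x) + x @ self.A @ self.B             # W x + (AB) x

def add_lora_kvo(attn, r: int = 128):
    """Attach LoRA to the key/value/output projections; queries stay untouched."""
    attn.to_k = LoRALinear(attn.to_k, r)
    attn.to_v = LoRALinear(attn.to_v, r)
    attn.to_out = LoRALinear(attn.to_out, r)
    return attn
```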

3. Inference-Time Fine Appearance Injection

To address the limitations of coarse adaptation, chiefly the lack of subtle textures, fine color gradations, and small geometric details, V-Warper employs a dual-branch, test-time procedure to inject high-frequency appearance features.

3.1 Semantic Correspondence via RoPE-Free Q–K

At each inference step $t-1$, mid-layer ($L = 12$) query and key features are extracted from both the reference inversion and generation branches, with rotary position embeddings (RoPE) removed to eliminate positional bias. This results in $\overline{Q}_t, \overline{K}_t \in \mathbb{R}^{N \times d}$, where $N$ is the number of visual tokens.

Bidirectional semantic correspondence matrices are computed:

  • $C^{gen \rightarrow ref}_t = \mathrm{softmax}\!\left(\overline{Q}^{gen}_{t-1} (\overline{K}^{ref}_{t-1})^\top / \sqrt{d}\right)$
  • $C^{ref \rightarrow gen}_t = \mathrm{softmax}\!\left(\overline{Q}^{ref}_{t-1} (\overline{K}^{gen}_{t-1})^\top / \sqrt{d}\right)$

Symmetrization yields $\widehat{C}_t = \frac{1}{2}\left(C^{gen \rightarrow ref}_t + (C^{ref \rightarrow gen}_t)^\top\right)$, from which the maximal flows $F^{gen \rightarrow ref}_t$ and $F^{ref \rightarrow gen}_t$ are derived.
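A schematic implementation of this correspondence step is given below; taking the row-wise and column-wise argmax of the symmetrized matrix is one plausible reading of "maximal flows", not necessarily the paper's exact procedure.

```python
import torch

def semantic_flows(Q_gen, K_gen, Q_ref, K_ref):
    """All inputs are RoPE-free mid-layer features of shape (N, d)."""
    d = Q_gen.shape[-1]
    C_g2r = torch.softmax(Q_gen @ K_ref.T / d ** 0.5, dim=-1)   # gen -> ref
    C_r2g = torch.softmax(Q_ref @ K_gen.T / d ** 0.5, dim=-1)   # ref -> gen
    C_hat = 0.5 * (C_g2r + C_r2g.T)                             # symmetrized correspondence
    F_g2r = C_hat.argmax(dim=1)   # best reference match per generation token
    F_r2g = C_hat.argmax(dim=0)   # best generation match per reference token
    return C_hat, F_g2r, F_r2g
```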

3.2 Spatially Reliable Masking

Two binary masks are constructed on the generation branch:

  • Foreground mask $M_{fg}$ identifies tokens where the average subject-token attention exceeds a threshold $\tau_{fg}$.
  • Cycle-consistency mask $M_{cc}$ selects tokens with round-trip flow error $E_{cc}$ less than $\tau_{cc} \cdot (HW / \sum M_{fg})$.

The final injection mask $M_t = M_{fg} \odot M_{cc}$ restricts appearance updates to spatially and semantically reliable regions.
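The mask construction could look roughly as follows; the inputs `attn_s` and `pos`, the token-count normalizer `HW`, and the threshold values `tau_fg`/`tau_cc` are illustrative assumptions rather than the paper's settings.

```python
import torch

def injection_mask(attn_s, F_g2r, F_r2g, pos, HW, tau_fg=0.3, tau_cc=2.0):
    """attn_s: (N,) average subject-token attention on the generation branch;
    pos: (N, 2) token grid coordinates; HW: spatial token count per frame."""
    M_fg = (attn_s > tau_fg).float()                            # foreground mask
    roundtrip = F_r2g[F_g2r]                                    # gen -> ref -> gen
    E_cc = (pos.float() - pos[roundtrip].float()).norm(dim=-1)  # round-trip flow error
    M_cc = (E_cc < tau_cc * HW / M_fg.sum().clamp(min=1.0)).float()
    return M_fg * M_cc                                          # M_t = M_fg * M_cc
```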

3.3 Value Feature Warping

Within each MMA layer at step $t$, the reference visual-token value matrix $V^{ref}_{t,vid}$ is linearly warped into alignment with the generation branch according to $F^{ref \rightarrow gen}_t$, giving $V^{warp}_{t,vid} = \mathcal{W}(V^{ref}_{t,vid}; F^{ref \rightarrow gen}_t)$. Injection into the generation-branch values $V^{gen}_{t,vid}$ is performed as:

$$\widehat{V}^{gen}_{t,vid} = M_t \odot V^{warp}_{t,vid} + (1 - M_t) \odot V^{gen}_{t,vid}$$

Text-token values remain unchanged. This process exposes the generation branch to high-frequency, appearance-rich details at semantically aligned locations, preserving prompt and motion fidelity.
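A gather-based sketch of this warp-and-inject step is shown below; representing the warp $\mathcal{W}$ as a per-generation-token index gather from the reference values is an assumption about its realization.

```python
import torch

def inject_warped_values(V_ref_vid, V_gen_vid, match_to_ref, M_t):
    """V_*_vid: (N, d) visual-token value matrices; match_to_ref: (N,) index of
    the reference token matched to each generation token; M_t: (N,) binary mask."""
    V_warp = V_ref_vid[match_to_ref]                 # gather-based warp of V_ref
    M = M_t.unsqueeze(-1)                            # broadcast mask over channels
    return M * V_warp + (1.0 - M) * V_gen_vid        # masked injection
```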

4. Algorithmic Outline

The following outlines the core V-Warper pipeline:

  1. Coarse Adaptation (training phase)
    • Randomly initialize the LoRA adapters and the subject embedding $e_s$.
    • For each training step:
      • Sample a reference image, noise, and timestep.
      • Compose the prompt embedding with the subject token $s$.
      • Predict the noise $\hat{\varepsilon}$ using the LoRA-augmented model.
      • Compute the loss and update the LoRA parameters and $e_s$ via backpropagation.
  2. Inference with Fine Appearance Injection (test phase)
    • Obtain reference inversion through DDIM inversion.
    • From noise, sample the generation branch.
    • For each denoising step (selected injection range):
      • Denoise both branches.
      • Extract RoPE-free Q, K features.
      • Compute and symmetrize correspondences; derive flow fields.
      • Calculate the masks $M_t$.
      • For selected MMA layers, inject warped value features where masked.
    • Decode the latent to obtain the output video.
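The outline above might be wired together roughly as in the driver sketch below; every callable argument is a placeholder for model-specific machinery, and `semantic_flows` refers to the sketch in Section 3.1.

```python
from typing import Callable, Sequence
import torch

def vwarper_inference(
    denoise_step: Callable,    # one denoising update for a latent at timestep t
    extract_qk: Callable,      # returns RoPE-free mid-layer (Q, K) features
    build_mask: Callable,      # computes M_t from subject attention and flows
    set_injection: Callable,   # registers warped-value injection in selected MMA layers
    decode: Callable,          # latent -> output video
    z_ref: torch.Tensor,       # DDIM-inverted reference latent
    z_gen: torch.Tensor,       # initial noise for the generation branch
    timesteps: Sequence[int],
    inject_steps: Sequence[int],
):
    for t in timesteps:
        if t in inject_steps:
            Qr, Kr = extract_qk(z_ref, t)
            Qg, Kg = extract_qk(z_gen, t)
            C_hat, F_g2r, F_r2g = semantic_flows(Qg, Kg, Qr, Kr)
            M_t = build_mask(F_g2r, F_r2g)
            set_injection(F_g2r, M_t)        # warp reference values into masked regions
        z_ref = denoise_step(z_ref, t)       # reference inversion branch
        z_gen = denoise_step(z_gen, t)       # generation branch
    return decode(z_gen)
```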

5. Experimental Evaluation

On a benchmark comprising 10 subjects and 10 prompts, V-Warper outperforms established baselines including DreamVideo, VideoBooth, SDVG, and VACE across both fine detail (DINO-I) and global identity (CLIP-I) metrics:

Model        DINO-I   CLIP-I   CLIP-T
DreamVideo   0.322    0.641    —
VideoBooth   0.349    0.634    —
SDVG         0.661    0.787    —
VACE         0.651    0.796    0.326
V-Warper     0.738    0.825    0.297

V-Warper achieves DINO-I 0.738 (fine detail) and CLIP-I 0.825 (identity), and remains competitive on CLIP-T (text alignment) at 0.297. User studies (N = 20 raters, 40 comparisons) show a consistent preference for V-Warper over VACE in text alignment, subject fidelity, motion coherence, and overall quality.

Ablation studies reveal:

  • Coarse adaptation alone achieves reasonable identity (DINO-I=0.645) but lacks texture detail.
  • Value warping further improves DINO-I to 0.701 but may degrade CLIP-T due to appearance leakage.
  • Adding the spatial masking recovers text alignment, balancing fidelity and alignment (DINO-I = 0.656, CLIP-I = 0.806, CLIP-T = 0.320).

6. Limitations and Prospects

V-Warper still requires per-subject optimization, although only a small token embedding and the LoRA adapters are trained; eliminating this upfront adaptation is a target for future research. Leaving the query projections un-adapted generally preserves the learned temporal priors, but may attenuate some nuances of motion expressiveness; hybridization with fine-grained temporal modeling could improve this aspect. Inference-time overhead is modest (correspondence and warping across roughly ten transformer layers), so deployment remains practical on a single GPU. Further efficiency gains are possible by accelerating the correspondence and warping subroutines.

A plausible implication is that correspondence-guided, training-free subject personalization generalizes beyond video diffusion to other generative modalities. V-Warper exemplifies a state-of-the-art methodology where LoRA-based global adaptation and RoPE-free, semantic value warping are combined to reliably translate reference appearance into temporally coherent videos, all without video-based finetuning (Lee et al., 13 Dec 2025).
