
SMRABooth: Custom Video Generation Framework

Updated 20 December 2025
  • SMRABooth is a framework for customized video generation that precisely aligns subject appearance and motion patterns using subject and motion LoRA modules.
  • It leverages a frozen diffusion backbone augmented with specialized encoders like DINOv2-ViT and SEA-RAFT to optimize structural identity and motion consistency.
  • The approach employs a structured multi-stage training pipeline with decoupled LoRA injection, achieving significant improvements in semantic alignment and perceptual video quality.

SMRABooth is a framework for customized video generation that aligns subject appearance and motion patterns at the object level, providing precise control over video outputs by leveraging specialized representation encoders and a structured LoRA-based fine-tuning regime. It addresses the limitations of prior methods, which typically struggle to simultaneously preserve the subject's structural identity and maintain coherent subject-specific motion, by introducing three core technical components: subject representation alignment, motion representation alignment, and a decoupled sparse LoRA injection strategy (Xu et al., 13 Dec 2025).

1. Architectural Foundations and Key Components

SMRABooth builds on a frozen, pre-trained text-to-video diffusion backbone, such as WAN2.1 (DiT-based) or ZeroScope (U-Net-based), and augments it with two specialized, low-rank adaptation (LoRA) modules: a "subject LoRA" and a "motion LoRA". The system is organized into three serial stages:

  1. Subject Representation Alignment (SuRA): Utilizes a frozen self-supervised image encoder (DINOv2-ViT) to extract global, high-level patch embeddings $\mathbf{y}^* \in \mathbb{R}^{N \times D}$ from reference subject images, serving as targets to steer the intermediate features of the diffusion backbone. Subject masks, typically generated with SAM, restrict losses to object regions.
  2. Motion Representation Alignment (MoRA): Employs a specialized optical flow network (SEA-RAFT) to compute pixel-level motion vectors $\mathbf{F}_{t,t+1} = F(x_t, x_{t+1})$ from reference videos, capturing object-level motion independent of appearance.
  3. Subject–Motion Association Decoupling: Enforces LoRA injection sparsity along the transformer's layer and temporal dimensions, isolating subject and motion adaptation to reduce inter-task interference. Subject LoRA is applied at $\{Q, K, \mathrm{FFN}.0\}$ and motion LoRA at $\{V, O, \mathrm{FFN}.0, \mathrm{FFN}.2\}$, combined via timestep gating during inference.

Freezing the backbone preserves the base model's generative capabilities, while the LoRA adapters inject highly targeted subject and motion adaptations, as sketched below.
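The decoupled injection can be pictured as two independent low-rank adapter sets attached to disjoint sub-layer groups of each transformer block. The following PyTorch sketch is illustrative only: the module names (`q_proj`, `ffn.0`, ...), the `LoRALinear` wrapper, and the injection helper are assumptions, not the released SMRABooth implementation.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank delta: y = Wx + up(down(x))."""
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # backbone weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                # delta starts at zero

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

# Hypothetical sub-layer names mapping onto the paper's S_subj / S_mot sets.
SUBJ_TARGETS = {"q_proj", "k_proj", "ffn.0"}          # S_subj = {Q, K, FFN.0}
MOT_TARGETS = {"v_proj", "o_proj", "ffn.0", "ffn.2"}  # S_mot  = {V, O, FFN.0, FFN.2}

def inject_lora(block: nn.Module, targets, rank):
    """Replace the named nn.Linear sub-layers of a transformer block with LoRA wrappers."""
    for name, module in list(block.named_modules()):
        if name in targets and isinstance(module, nn.Linear):
            parent = block.get_submodule(name.rsplit(".", 1)[0]) if "." in name else block
            setattr(parent, name.rsplit(".", 1)[-1], LoRALinear(module, rank))

# Usage (illustrative): attach both adapter sets to every transformer block.
# FFN.0 appears in both sets; this simple sketch only wraps it once.
# for blk in backbone.blocks:
#     inject_lora(blk, SUBJ_TARGETS, rank=32)   # subject LoRA, r_s = 32
#     inject_lora(blk, MOT_TARGETS, rank=64)    # motion LoRA,  r_m = 64
```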

2. Optimization Objectives and Mathematical Formulation

SMRABooth's training objectives are structured to separately align the backbone's learned representations to both subject appearance and motion trajectory:

Subject Representation Alignment (SuRA)

  • Masked Velocity-Prediction Loss:

$$\mathcal{L}_{\rm region} = \mathbb{E}_{z_0, z_1, c_{\rm txt}, t}\,\big\|\,[\,u(z_t, c_{\rm txt}, t; \theta) - v_t\,] \odot M\,\big\|^2,$$

where $v_t = z_1 - z_0$ is the target velocity and $M$ is a subject mask.

  • Patch-Wise Cosine Similarity:

$$\mathcal{L}_{\rm SuRA} = -\,\mathbb{E}_{z_t^1, M, t}\,\frac{1}{N}\sum_{n=1}^{N} \frac{y^{*[n]} \cdot h_\phi(z_t^{1[n]})}{\|y^{*[n]}\|\,\|h_\phi(z_t^{1[n]})\|}.$$

  • Stage Objective:

$$\mathcal{L}_{\rm subj\_stage} = \mathcal{L}_{\rm region} + \lambda\,\mathcal{L}_{\rm SuRA}, \qquad \lambda = 0.05.$$
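As a concrete illustration, the SuRA stage objective could be assembled as below, assuming the backbone's intermediate features have already been projected toward the DINOv2 embedding space by a small head $h_\phi$ and that the DINOv2 patch embeddings $y^*$ are precomputed; tensor names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def sura_stage_loss(u_pred, v_target, mask, h_phi_feats, dino_targets, lam=0.05):
    """
    u_pred       : (B, C, H, W)  velocity prediction u(z_t, c_txt, t; theta)
    v_target     : (B, C, H, W)  target velocity v_t = z_1 - z_0
    mask         : (B, 1, H, W)  binary subject mask M (broadcast over channels)
    h_phi_feats  : (B, N, D)     projected backbone features h_phi(z_t^1), per patch
    dino_targets : (B, N, D)     frozen DINOv2 patch embeddings y*
    """
    # Masked velocity-prediction loss, restricted to the subject region
    # (normalization by mask area omitted in this sketch).
    l_region = (((u_pred - v_target) * mask) ** 2).mean()

    # Patch-wise cosine similarity between backbone features and DINOv2 targets.
    cos = F.cosine_similarity(h_phi_feats, dino_targets, dim=-1)  # (B, N)
    l_sura = -cos.mean()

    return l_region + lam * l_sura
```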

Motion Representation Alignment (MoRA)

  • Flow Alignment Loss:

$$\mathcal{L}_{\rm MoRA} = \big\|\mathbf{F}_{\{1,N\}} - \widetilde{\mathbf{F}}_{\{1,N\}}\big\|_1,$$

computed between reference and generated frame flows.

  • Temporal Velocity-Prediction Loss:

$$\mathcal{L}_{\rm temporal} = \mathbb{E}_{z_0, z_1, c_{\rm txt}, t}\,\big\|\,u(z_t, c_{\rm txt}, t; \theta) - v_t\,\big\|^2.$$

  • Stage Objective:

$$\mathcal{L}_{\rm mot\_stage} = \mathcal{L}_{\rm temporal} + \alpha\,\mathcal{L}_{\rm MoRA}, \qquad \alpha = 1.0.$$
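A corresponding sketch for the MoRA stage objective, assuming the SEA-RAFT flows between the first and last frames of the reference and generated clips have already been computed; names and shapes are again illustrative.

```python
def mora_stage_loss(u_pred, v_target, ref_flow, gen_flow, alpha=1.0):
    """
    u_pred   : (B, C, F, H, W) velocity prediction over the latent video
    v_target : (B, C, F, H, W) target velocity v_t = z_1 - z_0
    ref_flow : (B, 2, H, W)    optical flow F_{1,N} between first/last reference frames
    gen_flow : (B, 2, H, W)    optical flow between first/last generated frames
    """
    # Unmasked temporal velocity-prediction loss over the whole clip.
    l_temporal = ((u_pred - v_target) ** 2).mean()

    # L1 alignment between reference and generated optical-flow fields.
    l_mora = (ref_flow - gen_flow).abs().mean()

    return l_temporal + alpha * l_mora
```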

Sparse LoRA Injection and Decoupling

  • Layer Selection Masks:

$$\mathcal{S}_{\rm subj} = \{Q, K, \mathrm{FFN}.0\}, \qquad \mathcal{S}_{\rm mot} = \{V, O, \mathrm{FFN}.0, \mathrm{FFN}.2\}.$$

  • Timestep Scheduling:

$$w_s(t) = \begin{cases} \beta, & t \le T_{\rm point} \\ 2\beta, & t > T_{\rm point} \end{cases} \qquad T_{\rm point} = 15,\; \beta = 0.5.$$

Motion LoRA is used throughout with a fixed weight.
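The timestep gating reduces to a simple schedule. The constants below follow the paper, while the function names and the fixed motion-LoRA weight of 1.0 are assumptions for illustration.

```python
def subject_lora_weight(t: int, t_point: int = 15, beta: float = 0.5) -> float:
    """Gating weight w_s(t) for the subject LoRA: beta up to T_point, 2*beta afterwards."""
    return beta if t <= t_point else 2.0 * beta

def motion_lora_weight(t: int, gamma: float = 1.0) -> float:
    """The motion LoRA is applied at every timestep; the fixed value gamma is an assumption."""
    return gamma
```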

Training proceeds sequentially: subject LoRA is trained first with backbone weights frozen, then motion LoRA is trained, also keeping the backbone and subject LoRA fixed.

3. Pipeline Algorithms

SMRABooth's core stages can be summarized via concise pseudocode for each phase, which delineates data preparation, encoder extraction, backbone integration, loss calculation, and LoRA parameter updates.

| Stage | Inputs / Encoders | Optimization / Gating |
|---|---|---|
| SuRA | Reference images, DINOv2-ViT, subject LoRA | Masked velocity loss, patch-wise cosine similarity |
| MoRA | Reference videos, SEA-RAFT, motion LoRA | Temporal velocity loss, L1 flow alignment |
| Inference | Text prompt, both LoRAs | Sparse layer selection, timestep gating |

The inference process applies weighted LoRA deltas according to the pre-defined layer and timestep schedules, merging them into the backbone for each diffusion step, followed by flow-matching reverse updates and frame decoding.
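A high-level sketch of this inference procedure, reusing the hypothetical names from the earlier sketches (`SUBJ_TARGETS`, `subject_lora_weight`, ...) plus placeholder helpers `merge_lora`, `unmerge_lora`, and `decode_frames`; the Euler flow-matching update and the time convention ($z_0$ noise, $z_1$ clean latent) are assumptions, and classifier-free guidance is omitted for brevity.

```python
import torch

@torch.no_grad()
def generate(backbone, subj_lora, mot_lora, text_emb, num_steps=50):
    """Sparse, timestep-gated LoRA inference with an Euler flow-matching sampler (illustrative)."""
    z = torch.randn(1, 16, 13, 60, 104)   # initial noise latent (B, C, F, H, W); shape assumed
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt                        # assumed convention: z_0 = noise at t=0, z_1 = clean latent at t=1
        w_s = subject_lora_weight(i)      # gated subject-LoRA weight w_s(t)
        w_m = motion_lora_weight(i)       # fixed motion-LoRA weight

        # Hypothetical helpers: merge the weighted LoRA deltas into the frozen
        # backbone, restricted to the sparse sub-layer sets S_subj and S_mot.
        merge_lora(backbone, subj_lora, w_s, targets=SUBJ_TARGETS)
        merge_lora(backbone, mot_lora, w_m, targets=MOT_TARGETS)

        v = backbone(z, text_emb, t)      # predicted velocity u(z_t, c_txt, t); CFG omitted
        z = z + v * dt                    # Euler update along the learned velocity field

        unmerge_lora(backbone, subj_lora, w_s, targets=SUBJ_TARGETS)
        unmerge_lora(backbone, mot_lora, w_m, targets=MOT_TARGETS)

    return decode_frames(z)               # hypothetical VAE decode to RGB frames
```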

4. Implementation Specifications and Dataset Construction

The implementation utilizes two NVIDIA H20 GPUs (96 GB each). Backbone options include WAN2.1 (DiT, 1.3B parameters) or ZeroScope (U-Net). Hyperparameters are as follows: subject LoRA rank $r_s = 32$ (learning rate $1\times10^{-4}$, 300 steps, $\lambda = 0.05$) and motion LoRA rank $r_m = 64$ (learning rate $1\times10^{-4}$, 400 steps, $\alpha = 1.0$). The dataset comprises 30 subjects (objects such as pets, toys, and buildings) annotated with masks and prompt templates, and 21 motion patterns (e.g., linear, curvilinear, rotational, sports, musical instruments) described by GPT-4o–refined captions. Inference is performed on 560 subject–motion input combinations with 50-step DDIM sampling, classifier-free guidance, 49 output frames at 15 fps, and 832×480 resolution.
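For reference, the reported hyperparameters can be gathered into a single configuration; the dictionary keys below are invented for readability and do not mirror any released config file.

```python
# Hyperparameters as reported in the paper; key names are illustrative only.
SMRABOOTH_CONFIG = {
    "backbone": "WAN2.1-1.3B",           # or "ZeroScope" (U-Net)
    "subject_lora": {"rank": 32, "lr": 1e-4, "steps": 300, "lambda_sura": 0.05},
    "motion_lora": {"rank": 64, "lr": 1e-4, "steps": 400, "alpha_mora": 1.0},
    "inference": {
        "sampler_steps": 50,             # 50-step DDIM with classifier-free guidance
        "num_frames": 49,
        "fps": 15,
        "resolution": (832, 480),
        "t_point": 15,                   # subject-LoRA gating threshold
        "beta": 0.5,                     # subject-LoRA base weight
    },
}
```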

5. Quantitative Evaluation and Ablation

Evaluation metrics include semantic alignment (CLIP-T, CLIP-I, DINO-I), motion quality (Motion Fidelity, Subject Consistency, Temporal Consistency), and perceptual quality (PickScore, Aesthetic Quality, Imaging Quality). Comparative results on WAN2.1 (DiT-based) are summarized:

| Method | CLIP-T | CLIP-I | DINO-I | Motion Fid. | PickScore | Aesthetic | Imaging |
|---|---|---|---|---|---|---|---|
| WAN2.1 | 0.339 | 0.586 | 0.165 | 35.18 | 19.96 | 61.09 | 65.12 |
| +LoRA | 0.314 | 0.681 | 0.464 | 60.08 | 19.85 | 56.17 | 55.90 |
| DualReal | 0.351 | 0.692 | 0.509 | 45.75 | 20.58 | 61.03 | 65.27 |
| SMRABooth (Ours) | 0.363 | 0.700 | 0.519 | 62.89 | 21.14 | 62.18 | 67.46 |
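As an illustration of how a semantic-alignment metric such as CLIP-T is commonly computed (mean cosine similarity between the prompt embedding and per-frame image embeddings), the sketch below uses the Hugging Face `transformers` CLIP model; this reflects the standard formulation, not necessarily the exact evaluation code used in the paper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_t_score(frames, prompt, model_name="openai/clip-vit-base-patch32"):
    """Mean cosine similarity between the prompt and each generated frame (CLIP-T style)."""
    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)

    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])

    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).mean().item()
```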

Ablations on WAN2.1 (DiT) show that removing SuRA drops CLIP-I/DINO-I by approximately 0.03/0.05 and PickScore by 0.07, while removing MoRA reduces Motion Fidelity from 62.89 to 60.02. Comparing full-layer with sparse-layer LoRA injection shows that restricting adaptation to the selected sub-layers achieves the best subject–motion separation.

A user study ($n = 100$, 8,400 ratings) corroborates the metric trends, showing higher scores for SMRABooth on prompt alignment, motion/appearance similarity, and overall video quality relative to DualReal.

6. Qualitative Analyses and Failure Modes

Qualitative analyses, covering subjects such as dogs, plush toys, buildings, and vehicles, illustrate high-fidelity appearance retention, including fine geometric and textural features (e.g., fur detail, plush seams, brickwork, reflective surfaces). Motions sampled include linear (car driving), curvilinear (rollercoaster), rotation (spinning top), sports (basketball spin), and musical instruments (guitar strum), with object trajectories closely matched to reference optical flow data. Joint customization scenarios, such as "a red bicycle" executing a "360° spin," demonstrate frame-level consistency in both appearance and motion trajectory.

Reported baseline failure modes include WAN2.1+LoRA copying source-video backgrounds into outputs and DualReal generating incoherent or floating objects for complex subject–motion pairings. SMRABooth's object-level decoupling and representation alignment mitigate such artifacts (Xu et al., 13 Dec 2025).
