Reference-Decoupled Causal Learning (RCL)
- RCL is a module within OmniTransfer that enforces a strictly causal token dependency to ensure precise transfer of appearance and motion cues.
- It decouples reference and target branches using unidirectional attention, reducing computational cost and avoiding trivial copy-paste artifacts.
- Empirical results show RCL improves appearance consistency and temporal fidelity while reducing inference time by approximately 20%.
Reference-decoupled Causal Learning (RCL) is a module introduced in the OmniTransfer framework for spatio-temporal video transfer, designed to enable precise, high-fidelity transfer of appearance and motion cues from a reference video to a target video sequence. RCL achieves this by imposing a strictly causal architecture that decouples reference and target token streams and precludes trivial copying in the attention mechanism, thereby facilitating efficient and robust transfer without sacrificing consistency or temporal alignment (Zhang et al., 20 Jan 2026).
1. Motivation and Objectives
RCL addresses critical limitations observed in video transfer models that apply bidirectional self-attention between reference and target tokens. Bidirectional attention can induce trivial “copy-paste” behaviors, wherein the model reconstructs reference patches in the target instead of selectively transferring relevant appearance or motion cues. Additionally, standard joint attention mechanisms increase computational complexity: for $N$ tokens each in the reference and target, full self-attention over the $2N$ concatenated tokens scales as $(2N)^2 = 4N^2$, quadrupling the attention FLOPs relative to the $N^2$ cost of a single decoupled branch. The objective of RCL is twofold: (i) to enforce a strictly causal (unidirectional) dependency from reference to target, i.e., allow information to flow only ref → tgt and forbid tgt → ref; and (ii) to decouple the branches so that the reference encoding requires only one forward pass, reducing both computation and memory (Zhang et al., 20 Jan 2026).
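The quadratic scaling argument can be made concrete by counting attention-score entries. The token count below is an illustrative assumption, not a figure from the paper:

```python
N = 1024  # illustrative per-branch token count (an assumption, not from the paper)

# Joint bidirectional attention forms one score matrix over all 2N tokens.
joint = (2 * N) ** 2

# A decoupled branch forms a score matrix over only its own N tokens.
per_branch = N ** 2

print(joint // per_branch)  # -> 4: joint attention costs 4x one decoupled branch
```

This is only the score-matrix count; RCL's caching of the time-invariant reference branch saves further recomputation across diffusion steps.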
2. Architectural Formulation
Within each DiT (Diffusion Transformer) block, RCL maintains two independent sequences of tokens: one for the reference video and one for the target. The reference branch processes latent feature tensors $l_\text{ref}$, projected via separate linear layers to queries, keys, and values $Q_\text{ref}, K_\text{ref}, V_\text{ref}$. Task-aware rotary position embedding (RoPE), denoted $R^*_\theta$, is applied prior to self-attention, ensuring that the positional semantics are adaptive to the underlying transfer task.
The self-attention for the reference branch is computed as:

$$\text{Attn}_\text{ref} = \mathrm{softmax}\!\left(\frac{R^*_\theta(Q_\text{ref}) \, R^*_\theta(K_\text{ref})^\top}{\sqrt{d}}\right) V_\text{ref}$$

where $d$ is the key dimension. Crucially, a fixed time embedding is used, rendering this branch time-invariant and enabling persistent caching across all diffusion timesteps.
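The reference-branch attention and its cacheability can be sketched in plain NumPy. All shapes and the random stand-ins for the RoPE-rotated projections are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_ref, d = 16, 8  # illustrative reference length and key dimension

# Stand-ins for the rotated projections R*_theta(Q_ref), R*_theta(K_ref).
Q_ref = rng.standard_normal((n_ref, d))
K_ref = rng.standard_normal((n_ref, d))
V_ref = rng.standard_normal((n_ref, d))

# Reference self-attention: softmax(Q K^T / sqrt(d)) V.
Y_ref = softmax(Q_ref @ K_ref.T / np.sqrt(d)) @ V_ref

# With a fixed time embedding, K_ref/V_ref do not depend on the diffusion
# timestep, so they can be computed once and reused at every step.
kv_cache = (K_ref, V_ref)
```

A diffusion loop would then read `kv_cache` at every timestep instead of re-running the reference branch.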
For the target branch, input tokens $l_\text{tgt}$ are processed analogously, but with standard (undistorted) RoPE $R_\theta$. At the attention layer, the target queries attend jointly to their own keys/values and those of the reference:

$$\text{Attn}_\text{tgt} = \mathrm{softmax}\!\left(\frac{R_\theta(Q_\text{tgt}) \, \big[R_\theta(K_\text{tgt});\, R^*_\theta(K_\text{ref})\big]^\top}{\sqrt{d}}\right) [V_\text{tgt}; V_\text{ref}]$$

This concatenation restricts information flow to ref → tgt: the target can access reference keys/values, but reference tokens never attend to the target. The output flows through the residual, normalization, and MLP paths as in standard DiT blocks.
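The concatenated target attention reduces to a single matrix expression. The NumPy sketch below uses made-up sizes and random tensors purely to show the shapes involved:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n_tgt, n_ref, d = 12, 16, 8  # illustrative sizes, not from the paper

Q_tgt = rng.standard_normal((n_tgt, d))
K_tgt = rng.standard_normal((n_tgt, d))
V_tgt = rng.standard_normal((n_tgt, d))
K_ref = rng.standard_normal((n_ref, d))
V_ref = rng.standard_normal((n_ref, d))

# Target queries attend to [K_tgt; K_ref] and [V_tgt; V_ref]; no reference
# query ever sees a target key, so information flows ref -> tgt only.
K_cat = np.concatenate([K_tgt, K_ref], axis=0)
V_cat = np.concatenate([V_tgt, V_ref], axis=0)
Y_tgt = softmax(Q_tgt @ K_cat.T / np.sqrt(d)) @ V_cat

print(Y_tgt.shape)  # (12, 8): one output row per target token
```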
3. Causal Structure and Interventions
The latent feature dependencies instantiated by RCL implement a distinct causal structure. Let $l_\text{ref}$ represent all reference latents, $l_\text{tgt}$ the target latents, and $Y_\text{ref}$ and $Y_\text{tgt}$ the post-attention outputs for each branch. The implied directed acyclic graph contains the edges $l_\text{ref} \to Y_\text{ref}$, $l_\text{ref} \to Y_\text{tgt}$, and $l_\text{tgt} \to Y_\text{tgt}$, with no edge $l_\text{tgt} \to Y_\text{ref}$. Formally, the update equations are:

$$Y_\text{ref} = \mathrm{softmax}\!\left(\frac{R^*_\theta(Q_\text{ref}) \, R^*_\theta(K_\text{ref})^\top}{\sqrt{d}}\right) V_\text{ref}, \qquad Y_\text{tgt} = \mathrm{softmax}\!\left(\frac{R_\theta(Q_\text{tgt}) \, \big[R_\theta(K_\text{tgt});\, R^*_\theta(K_\text{ref})\big]^\top}{\sqrt{d}}\right) [V_\text{tgt}; V_\text{ref}]$$
Interventional analysis is supported by setting $K_\text{ref}$ and $V_\text{ref}$ to the empty sequence (i.e., masking reference tokens during attention), which effectively converts the model to a standard self-attention over target latents. As the reference branch never observes target information, $Y_\text{ref}$ is invariant under any intervention on the target branch.
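This intervention can be verified numerically: masking the reference keys/values makes the target branch collapse exactly to vanilla self-attention. The helper below is an illustrative sketch, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def target_attn(Q_tgt, K_tgt, V_tgt, K_ref=None, V_ref=None):
    """RCL-style target attention; K_ref=None performs the intervention
    of masking all reference tokens (hypothetical helper)."""
    d = Q_tgt.shape[-1]
    if K_ref is None:
        K_cat, V_cat = K_tgt, V_tgt
    else:
        K_cat = np.concatenate([K_tgt, K_ref], axis=0)
        V_cat = np.concatenate([V_tgt, V_ref], axis=0)
    return softmax(Q_tgt @ K_cat.T / np.sqrt(d)) @ V_cat

rng = np.random.default_rng(2)
n, d = 8, 4
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
Kr = rng.standard_normal((n, d))
Vr = rng.standard_normal((n, d))

with_ref = target_attn(Q, K, V, Kr, Vr)          # normal RCL target attention
masked = target_attn(Q, K, V)                    # intervention: reference masked
plain = softmax(Q @ K.T / np.sqrt(d)) @ V        # vanilla self-attention

# Masking the reference exactly recovers standard target self-attention.
assert np.allclose(masked, plain)
```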
4. Mathematical Details and Training Objective
RCL does not introduce new task-specific loss terms or regularizers. Instead, training utilizes the canonical denoising diffusion loss. At each diffusion step $t$, the model minimizes the expected squared norm:

$$\mathcal{L} = \mathbb{E}_{t,\epsilon}\left[\left\| \epsilon - \epsilon_\theta\!\left(l_\text{tgt}^{(t)}, l_\text{ref}, c, t\right) \right\|_2^2\right]$$

where $\epsilon_\theta$ is the predicted noise conditioned on reference and target tokens and $c$ denotes any other prompt information. The only architectural constraints derive from the causal attention masks: the reference never attends to target tokens at any stage.
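A minimal sketch of this objective for one sample, with a toy stand-in for the network and a made-up noise-schedule value (both are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)

def toy_eps_theta(noisy_tgt, l_ref, t):
    """Hypothetical stand-in for eps_theta(l_tgt^(t), l_ref, c, t);
    the real model is a DiT with RCL blocks."""
    return 0.1 * noisy_tgt + 0.01 * l_ref.mean()

l_tgt = rng.standard_normal((16, 8))  # clean target latents (illustrative)
l_ref = rng.standard_normal((16, 8))  # reference latents (illustrative)
t = 500
alpha_bar = 0.3                       # made-up noise-schedule value at step t

# Forward diffusion: noise the clean target latents.
eps = rng.standard_normal(l_tgt.shape)
noisy = np.sqrt(alpha_bar) * l_tgt + np.sqrt(1 - alpha_bar) * eps

# Canonical denoising objective: || eps - eps_theta(...) ||^2; RCL adds
# no extra loss terms, only the causal attention masking.
loss = np.mean((eps - toy_eps_theta(noisy, l_ref, t)) ** 2)
```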
5. Implementation Specifics
RCL can be expressed in modular pseudocode for a single DiT block. This procedure comprises: linear projection to queries, keys, and values; task-aware RoPE and standard RoPE application; self-attention for the reference branch; concatenated cross-attention for the target; and subsequent feed-forward processing. This design enables caching of the reference branch because of its deterministic, fixed-time encoding.
```python
def RCL_block(l_ref, l_tgt, time_emb_ref, time_emb_tgt, prompt_emb):
    # l_ref: [f_r, h_r, w_r, D], l_tgt: [f_t, h_t, w_t, D]
    Q_ref, K_ref, V_ref = linQ_ref(l_ref), linK_ref(l_ref), linV_ref(l_ref)
    Q_tgt, K_tgt, V_tgt = linQ_tgt(l_tgt), linK_tgt(l_tgt), linV_tgt(l_tgt)

    # Task-aware RoPE for the reference, standard RoPE for the target.
    Q_ref_p = R_star(Q_ref)  # R^*_θ(Q_ref)
    K_ref_p = R_star(K_ref)  # R^*_θ(K_ref)
    Q_tgt_p = R(Q_tgt)       # R_θ(Q_tgt)
    K_tgt_p = R(K_tgt)       # R_θ(K_tgt)

    # Reference branch: plain self-attention (time-invariant, cacheable).
    attn_weights_ref = softmax((Q_ref_p @ K_ref_p.T) / sqrt(d))
    Y_ref = attn_weights_ref @ V_ref

    # Target branch: queries attend to [target; reference] keys/values.
    K_cat = concat(K_tgt_p, K_ref_p, axis=token_dim)
    V_cat = concat(V_tgt, V_ref, axis=token_dim)
    attn_weights_tgt = softmax((Q_tgt_p @ K_cat.T) / sqrt(d))
    Y_tgt = attn_weights_tgt @ V_cat

    # Residual, normalization, and MLP paths, as in standard DiT blocks.
    out_ref = MLP(LayerNorm(Y_ref + l_ref))
    out_tgt = MLP(LayerNorm(Y_tgt + l_tgt))
    return out_ref, out_tgt
```
6. Empirical Outcomes
Table 7 of OmniTransfer reports empirical gains attributable to RCL. With task-aware positional bias (TPB) already in place, introducing RCL increases user study scores for appearance consistency/quality (from 2.82/2.86 to 3.10/3.16) and for temporal fidelity (from 2.95/2.94 to 3.13/3.10). RCL also yields practical acceleration: inference time for an 81-frame, 480p video on 8 × A100 GPUs drops from approximately 180 s to 142 s, a roughly 20% reduction. Qualitative results indicate RCL eliminates the “copy-paste” artifact, producing natural blending of styles and motions. These outcomes substantiate RCL’s benefits in enforcing the correct causal regime (ref → tgt) while improving both efficiency and fidelity by avoiding trivial memorization (Zhang et al., 20 Jan 2026).