Reference-Decoupled Causal Learning (RCL)
- RCL is a module within OmniTransfer that enforces a strictly causal token dependency to ensure precise transfer of appearance and motion cues.
- It decouples reference and target branches using unidirectional attention, reducing computational cost and avoiding trivial copy-paste artifacts.
- Empirical results show RCL improves appearance consistency and temporal fidelity while reducing inference time by approximately 20%.
Reference-decoupled Causal Learning (RCL) is a module introduced in the OmniTransfer framework for spatio-temporal video transfer, designed to enable precise, high-fidelity transfer of appearance and motion cues from a reference video to a target video sequence. RCL achieves this by imposing a strictly causal architecture that decouples reference and target token streams and precludes trivial copying in the attention mechanism, thereby facilitating efficient and robust transfer without sacrificing consistency or temporal alignment (Zhang et al., 20 Jan 2026).
1. Motivation and Objectives
RCL addresses critical limitations observed in video transfer models that apply bidirectional self-attention between reference and target tokens. Bidirectional attention can induce trivial “copy-paste” behaviors, wherein the model reconstructs reference patches in the target instead of selectively transferring relevant appearance or motion cues. Additionally, standard joint attention mechanisms increase computational complexity: for $N$ tokens each in the reference and target, full self-attention over the $2N$ concatenated tokens scales as $(2N)^2 = 4N^2$, quadrupling the attention FLOPs relative to the $N^2$ cost of a single decoupled branch. The objective of RCL is twofold: (i) to enforce a strictly causal (unidirectional) dependency from reference to target, i.e., allow information to flow only ref → tgt and forbid tgt → ref; and (ii) to decouple the branches so that the reference encoding requires only one forward pass, reducing both computation and memory (Zhang et al., 20 Jan 2026).
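The quadratic scaling argument can be made concrete by counting attention-score entries. The token count below is an illustrative assumption, not a figure from the paper:

```python
N = 1024  # illustrative per-branch token count (an assumption, not from the paper)

# Joint bidirectional attention forms one score matrix over all 2N tokens.
joint = (2 * N) ** 2

# A decoupled branch forms a score matrix over only its own N tokens.
per_branch = N ** 2

print(joint // per_branch)  # -> 4: joint attention costs 4x one decoupled branch
```

This is only the score-matrix count; RCL's caching of the time-invariant reference branch saves further recomputation across diffusion steps.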
2. Architectural Formulation
Within each DiT (Diffusion Transformer) block, RCL maintains two independent sequences of tokens: one for the reference video and one for the target. The reference branch processes latent feature tensors $l_\text{ref}$, projected via separate linear layers to queries, keys, and values $Q_\text{ref}, K_\text{ref}, V_\text{ref}$. Task-aware rotary position embedding (RoPE), denoted $R^*_\theta$, is applied prior to self-attention, ensuring that the positional semantics are adaptive to the underlying transfer task.
The self-attention for the reference branch is computed as:

$$\text{Attn}_\text{ref} = \mathrm{softmax}\!\left(\frac{R^*_\theta(Q_\text{ref}) \, R^*_\theta(K_\text{ref})^\top}{\sqrt{d}}\right) V_\text{ref}$$

where $d$ is the key dimension. Crucially, a fixed time embedding is used, rendering this branch time-invariant and enabling persistent caching across all diffusion timesteps.
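The reference-branch attention and its cacheability can be sketched in plain NumPy. All shapes and the random stand-ins for the RoPE-rotated projections are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_ref, d = 16, 8  # illustrative reference length and key dimension

# Stand-ins for the rotated projections R*_theta(Q_ref), R*_theta(K_ref).
Q_ref = rng.standard_normal((n_ref, d))
K_ref = rng.standard_normal((n_ref, d))
V_ref = rng.standard_normal((n_ref, d))

# Reference self-attention: softmax(Q K^T / sqrt(d)) V.
Y_ref = softmax(Q_ref @ K_ref.T / np.sqrt(d)) @ V_ref

# With a fixed time embedding, K_ref/V_ref do not depend on the diffusion
# timestep, so they can be computed once and reused at every step.
kv_cache = (K_ref, V_ref)
```

A diffusion loop would then read `kv_cache` at every timestep instead of re-running the reference branch.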
For the target branch, input tokens $l_\text{tgt}$ are processed analogously, but with standard (undistorted) RoPE $R_\theta$. At the attention layer, the target queries attend jointly to their own keys/values and those of the reference:

$$\text{Attn}_\text{tgt} = \mathrm{softmax}\!\left(\frac{R_\theta(Q_\text{tgt}) \, \big[R_\theta(K_\text{tgt});\, R^*_\theta(K_\text{ref})\big]^\top}{\sqrt{d}}\right) [V_\text{tgt}; V_\text{ref}]$$

This concatenation restricts information flow to ref → tgt: the target can access reference keys/values, but reference tokens never attend to the target. The output flows through the residual, normalization, and MLP paths as in standard DiT blocks.
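The concatenated target attention reduces to a single matrix expression. The NumPy sketch below uses made-up sizes and random tensors purely to show the shapes involved:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n_tgt, n_ref, d = 12, 16, 8  # illustrative sizes, not from the paper

Q_tgt = rng.standard_normal((n_tgt, d))
K_tgt = rng.standard_normal((n_tgt, d))
V_tgt = rng.standard_normal((n_tgt, d))
K_ref = rng.standard_normal((n_ref, d))
V_ref = rng.standard_normal((n_ref, d))

# Target queries attend to [K_tgt; K_ref] and [V_tgt; V_ref]; no reference
# query ever sees a target key, so information flows ref -> tgt only.
K_cat = np.concatenate([K_tgt, K_ref], axis=0)
V_cat = np.concatenate([V_tgt, V_ref], axis=0)
Y_tgt = softmax(Q_tgt @ K_cat.T / np.sqrt(d)) @ V_cat

print(Y_tgt.shape)  # (12, 8): one output row per target token
```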
3. Causal Structure and Interventions
The latent feature dependencies instantiated by RCL implement a distinct causal structure. Let $l_\text{ref}$ represent all reference latents, $l_\text{tgt}$ the target latents, and $Y_\text{ref}$ and $Y_\text{tgt}$ the post-attention outputs for each branch. The implied directed acyclic graph contains the edges $l_\text{ref} \to Y_\text{ref}$, $l_\text{ref} \to Y_\text{tgt}$, and $l_\text{tgt} \to Y_\text{tgt}$, with no edge $l_\text{tgt} \to Y_\text{ref}$. Formally, the update equations are:

$$Y_\text{ref} = \mathrm{softmax}\!\left(\frac{R^*_\theta(Q_\text{ref}) \, R^*_\theta(K_\text{ref})^\top}{\sqrt{d}}\right) V_\text{ref}, \qquad Y_\text{tgt} = \mathrm{softmax}\!\left(\frac{R_\theta(Q_\text{tgt}) \, \big[R_\theta(K_\text{tgt});\, R^*_\theta(K_\text{ref})\big]^\top}{\sqrt{d}}\right) [V_\text{tgt}; V_\text{ref}]$$
Interventional analysis is supported by setting $K_\text{ref}$ and $V_\text{ref}$ to the empty sequence (i.e., masking reference tokens during attention), which effectively converts the model to a standard self-attention over target latents. As the reference branch never observes target information, $Y_\text{ref}$ is invariant under any intervention on the target branch.
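This intervention can be verified numerically: masking the reference keys/values makes the target branch collapse exactly to vanilla self-attention. The helper below is an illustrative sketch, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def target_attn(Q_tgt, K_tgt, V_tgt, K_ref=None, V_ref=None):
    """RCL-style target attention; K_ref=None performs the intervention
    of masking all reference tokens (hypothetical helper)."""
    d = Q_tgt.shape[-1]
    if K_ref is None:
        K_cat, V_cat = K_tgt, V_tgt
    else:
        K_cat = np.concatenate([K_tgt, K_ref], axis=0)
        V_cat = np.concatenate([V_tgt, V_ref], axis=0)
    return softmax(Q_tgt @ K_cat.T / np.sqrt(d)) @ V_cat

rng = np.random.default_rng(2)
n, d = 8, 4
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
Kr = rng.standard_normal((n, d))
Vr = rng.standard_normal((n, d))

with_ref = target_attn(Q, K, V, Kr, Vr)          # normal RCL target attention
masked = target_attn(Q, K, V)                    # intervention: reference masked
plain = softmax(Q @ K.T / np.sqrt(d)) @ V        # vanilla self-attention

# Masking the reference exactly recovers standard target self-attention.
assert np.allclose(masked, plain)
```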
4. Mathematical Details and Training Objective
RCL does not introduce new task-specific loss terms or regularizers. Instead, training utilizes the canonical denoising diffusion loss. At each diffusion step $t$, the model minimizes the expected squared norm:

$$\mathcal{L} = \mathbb{E}_{t,\epsilon}\left[\left\| \epsilon - \epsilon_\theta\!\left(l_\text{tgt}^{(t)}, l_\text{ref}, c, t\right) \right\|_2^2\right]$$

where $\epsilon_\theta$ is the predicted noise conditioned on reference and target tokens and $c$ denotes any other prompt information. The only architectural constraints derive from the causal attention masks: the reference never attends to target tokens at any stage.
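A minimal sketch of this objective for one sample, with a toy stand-in for the network and a made-up noise-schedule value (both are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)

def toy_eps_theta(noisy_tgt, l_ref, t):
    """Hypothetical stand-in for eps_theta(l_tgt^(t), l_ref, c, t);
    the real model is a DiT with RCL blocks."""
    return 0.1 * noisy_tgt + 0.01 * l_ref.mean()

l_tgt = rng.standard_normal((16, 8))  # clean target latents (illustrative)
l_ref = rng.standard_normal((16, 8))  # reference latents (illustrative)
t = 500
alpha_bar = 0.3                       # made-up noise-schedule value at step t

# Forward diffusion: noise the clean target latents.
eps = rng.standard_normal(l_tgt.shape)
noisy = np.sqrt(alpha_bar) * l_tgt + np.sqrt(1 - alpha_bar) * eps

# Canonical denoising objective: || eps - eps_theta(...) ||^2; RCL adds
# no extra loss terms, only the causal attention masking.
loss = np.mean((eps - toy_eps_theta(noisy, l_ref, t)) ** 2)
```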
5. Implementation Specifics
RCL can be expressed in modular pseudocode for a single DiT block. This procedure comprises: linear projection to queries, keys, and values; task-aware RoPE and standard RoPE application; self-attention for the reference branch; concatenated cross-attention for the target; and subsequent feed-forward processing. This design enables caching of the reference branch because of its deterministic, fixed-time encoding.
```python
def RCL_block(l_ref, l_tgt, time_emb_ref, time_emb_tgt, prompt_emb):
    # l_ref: [f_r, h_r, w_r, D], l_tgt: [f_t, h_t, w_t, D]
    Q_ref, K_ref, V_ref = linQ_ref(l_ref), linK_ref(l_ref), linV_ref(l_ref)
    Q_tgt, K_tgt, V_tgt = linQ_tgt(l_tgt), linK_tgt(l_tgt), linV_tgt(l_tgt)

    # Task-aware RoPE for the reference, standard RoPE for the target.
    Q_ref_p = R_star(Q_ref)  # R^*_θ(Q_ref)
    K_ref_p = R_star(K_ref)  # R^*_θ(K_ref)
    Q_tgt_p = R(Q_tgt)       # R_θ(Q_tgt)
    K_tgt_p = R(K_tgt)       # R_θ(K_tgt)

    # Reference branch: plain self-attention (time-invariant, cacheable).
    attn_weights_ref = softmax((Q_ref_p @ K_ref_p.T) / sqrt(d))
    Y_ref = attn_weights_ref @ V_ref

    # Target branch: queries attend to [target; reference] keys/values.
    K_cat = concat(K_tgt_p, K_ref_p, axis=token_dim)
    V_cat = concat(V_tgt, V_ref, axis=token_dim)
    attn_weights_tgt = softmax((Q_tgt_p @ K_cat.T) / sqrt(d))
    Y_tgt = attn_weights_tgt @ V_cat

    # Residual, normalization, and MLP paths, as in standard DiT blocks.
    out_ref = MLP(LayerNorm(Y_ref + l_ref))
    out_tgt = MLP(LayerNorm(Y_tgt + l_tgt))
    return out_ref, out_tgt
```
6. Empirical Outcomes
Table 7 of OmniTransfer reports empirical gains attributable to RCL. With task-aware positional bias (TPB) already in place, introducing RCL increases user study scores for appearance consistency/quality (from 2.82/2.86 to 3.10/3.16) and for temporal fidelity (from 2.95/2.94 to 3.13/3.10). RCL also yields practical acceleration: inference time for an 81-frame, 480p video on 8 × A100 GPUs drops from approximately 180 s to 142 s, a roughly 20% reduction. Qualitative results indicate RCL eliminates the “copy-paste” artifact, producing natural blending of styles and motions. These outcomes substantiate RCL’s benefits in enforcing the correct causal regime (ref → tgt) while improving both efficiency and fidelity by avoiding trivial memorization (Zhang et al., 20 Jan 2026).