Context-Style CFG: Disentangled Guidance
- PickStyle (Mehraban et al., 8 Oct 2025) introduces CS-CFG, a factorized guidance mechanism that independently manipulates context and style for improved fidelity in multimodal generation.
- CS-CFG employs a null context construction technique to isolate video content, ensuring semantic preservation during aggressive style adaptation.
- The method integrates low-rank adapters within diffusion models to achieve superior temporal coherence and style consistency in video-to-video transfer.
Context-Style Classifier-Free Guidance (CS-CFG) is a factorized variant of classifier-free guidance designed to enable independent control of context and style in conditional generative models, particularly in multimodal domains such as video-to-video style transfer with diffusion backbones. It generalizes the classic guidance method by decomposing the conditional signal into separate directions, typically "context" (e.g., source video content) and "style" (e.g., visual appearance specified by a text prompt), and introduces mechanisms for selectively reinforcing or preserving each component during the sampling process. CS-CFG is notable for its algorithmic structure, mathematical formulation, and empirical success in preserving semantic attributes while enabling high-fidelity style adaptation.
1. Factorization of Condition Signals in Classifier-Free Guidance
Traditional classifier-free guidance (CFG) operates by linearly combining conditional and unconditional model predictions, often denoted as

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right),$$

where $w$ is the guidance scale parameter. In CS-CFG, this paradigm is extended to multimodal conditioning, where the input comprises a context signal (the video tensor $V$) and a style signal (the text prompt $T$). The method factorizes the guidance into independent text (style) and video (context) update directions, enabling fine-grained manipulation of each:
- The style direction is constructed from the difference between the prediction under $(V, T)$ and the prediction under $(V, \varnothing)$, i.e., with the style prompt dropped.
- The context direction is derived from the difference between $\epsilon_\theta(x_t, V, \varnothing)$ and $\epsilon_\theta(x_t, \tilde{V}, \varnothing)$, where $\tilde{V}$ is a randomly permuted version of the video context, disrupting temporal and spatial coherence to effectively "drop" context (as per equation (1) in (Mehraban et al., 8 Oct 2025)).
The final guided prediction at denoising timestep $t$ is given by:

$$\hat{\epsilon}_t = \epsilon_\theta(x_t, \tilde{V}, \varnothing) + \lambda_c \left[ \epsilon_\theta(x_t, V, \varnothing) - \epsilon_\theta(x_t, \tilde{V}, \varnothing) \right] + \lambda_s \left[ \epsilon_\theta(x_t, V, T) - \epsilon_\theta(x_t, V, \varnothing) \right],$$

where $\lambda_s$ and $\lambda_c$ independently scale the style and context contributions.
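As a concrete illustration, the combination above can be expressed as a small helper acting on the three denoiser outputs. The following is a minimal PyTorch sketch; the function and argument names (`cs_cfg_combine`, `lambda_s`, `lambda_c`) are illustrative rather than taken from the paper.

```python
import torch

def cs_cfg_combine(eps_full: torch.Tensor,
                   eps_context: torch.Tensor,
                   eps_null: torch.Tensor,
                   lambda_s: float,
                   lambda_c: float) -> torch.Tensor:
    """Combine three denoiser outputs into the CS-CFG prediction.

    eps_full    ~ eps_theta(x_t, V, T)              (context + style)
    eps_context ~ eps_theta(x_t, V, null)           (context only, style dropped)
    eps_null    ~ eps_theta(x_t, permuted V, null)  (context nullified, style dropped)
    """
    context_dir = eps_context - eps_null  # preserves source-video content
    style_dir = eps_full - eps_context    # pushes toward the text-specified style
    return eps_null + lambda_c * context_dir + lambda_s * style_dir
```

Note that with $\lambda_s = \lambda_c = w$ the expression telescopes to $\epsilon_\theta(x_t, \tilde{V}, \varnothing) + w\,[\epsilon_\theta(x_t, V, T) - \epsilon_\theta(x_t, \tilde{V}, \varnothing)]$, i.e., standard CFG with the jointly nullified condition as the unconditional branch; the independent scales are what allow context and style to be traded off separately.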
2. Null Context Construction and Its Role
A principal element of CS-CFG is the method for constructing the null context $\tilde{V}$. The procedure applies independent uniform random permutations to the temporal ($F$), height ($H$), and width ($W$) axes:

$$\tilde{V}[f, h, w] = V[\pi_F(f), \pi_H(h), \pi_W(w)],$$

where $\pi_F$, $\pi_H$, $\pi_W$ are sampled from the respective symmetric groups $S_F$, $S_H$, $S_W$. This design ensures that the model's prediction under $\tilde{V}$ receives no meaningful content information, allowing the context direction to be properly isolated when the difference $\epsilon_\theta(x_t, V, \varnothing) - \epsilon_\theta(x_t, \tilde{V}, \varnothing)$ is computed.
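A minimal sketch of this construction, assuming the context is a PyTorch tensor laid out as (C, F, H, W); whether the permutation is applied to pixels or to video latents, and the helper name `make_null_context`, are assumptions for illustration.

```python
import torch

def make_null_context(video: torch.Tensor) -> torch.Tensor:
    """Scramble a (C, F, H, W) video tensor by independently permuting the
    frame, height, and width axes, destroying its spatio-temporal content
    while keeping its shape and per-channel value statistics intact."""
    _, F, H, W = video.shape
    pi_f = torch.randperm(F)  # uniform permutation of frames
    pi_h = torch.randperm(H)  # uniform permutation of rows
    pi_w = torch.randperm(W)  # uniform permutation of columns
    return video[:, pi_f][:, :, pi_h][:, :, :, pi_w]
```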
Significance: Null context generation is central to achieving the orthogonality of style and context guidance; it allows CS-CFG to preserve dynamic or structural features of the source video even as style transfer operates at full strength.
3. Integration with Low-Rank Adapters in Self-Attention Mechanisms
PickStyle (Mehraban et al., 8 Oct 2025) demonstrates CS-CFG in conjunction with low-rank adapters inserted into self-attention layers of the video diffusion network's conditioning blocks. In formal terms, for context tokens $h$, the adapters update the attention projections:

$$W' h = W h + \Delta W h, \qquad W \in \{W_q, W_k, W_v\},$$

with

$$\Delta W = B A,$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$, and $r \ll d$ is the rank parameter.
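The update above is the standard LoRA parameterization; the sketch below shows one way such an adapter can wrap a frozen projection. It is a generic PyTorch illustration, with names (`LoRALinear`, `rank`, `alpha`) and initialization choices that are assumptions, not details of the PickStyle implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear projection W plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the pretrained projection fixed
        d_out, d_in = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # d_out x r, zero-init
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + scale * (B A) x
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

In practice such a wrapper would replace the query/key/value projections of the self-attention layers that attend over context tokens, so only the low-rank matrices are trained while the pretrained video backbone stays frozen.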
Significance: This modular integration allows the video backbone to efficiently specialize in motion-style transfer, leveraging static paired-image supervision while maintaining original temporal priors.
4. Algorithmic Structure and Forward Passes
CS-CFG usage entails three forward passes through the denoiser at each timestep:
- $\epsilon_\theta(x_t, V, T)$ — full style + context
- $\epsilon_\theta(x_t, V, \varnothing)$ — context only, style dropped
- $\epsilon_\theta(x_t, \tilde{V}, \varnothing)$ — context nullified, style dropped
The style and context updates are then formed via difference terms:

$$\Delta_{\text{style}} = \epsilon_\theta(x_t, V, T) - \epsilon_\theta(x_t, V, \varnothing), \qquad \Delta_{\text{context}} = \epsilon_\theta(x_t, V, \varnothing) - \epsilon_\theta(x_t, \tilde{V}, \varnothing).$$

With independent guidance coefficients $\lambda_s$ and $\lambda_c$ for style and context, the final guided prediction linearly combines these components as shown above.
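Putting the pieces together, one guided denoising step looks roughly as follows. This sketch assumes a denoiser callable `eps_theta(x_t, video, text)`, reuses the `make_null_context` and `cs_cfg_combine` helpers sketched earlier, and uses placeholder guidance scales; none of these names or values are taken from the paper.

```python
def cs_cfg_step(eps_theta, x_t, video, text_emb, null_emb,
                lambda_s: float = 7.5, lambda_c: float = 2.0):
    """One CS-CFG prediction: three forward passes, then the factorized combination."""
    video_perm = make_null_context(video)              # nullified context (Section 2)

    eps_full = eps_theta(x_t, video, text_emb)         # full style + context
    eps_context = eps_theta(x_t, video, null_emb)      # context only, style dropped
    eps_null = eps_theta(x_t, video_perm, null_emb)    # context nullified, style dropped

    return cs_cfg_combine(eps_full, eps_context, eps_null, lambda_s, lambda_c)
```

In typical diffusion implementations the three passes are batched into a single forward call for efficiency, and the null style condition is usually the embedding of an empty prompt.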
5. Empirical Validation and Performance
Ablation and benchmark studies in PickStyle (Mehraban et al., 8 Oct 2025) demonstrate that CS-CFG:
- Outperforms standard classifier-free guidance in maintaining semantic consistency of video context during aggressive style transfer.
- Achieves temporally coherent, style-faithful, and content-preserving translations on video benchmarks, with both qualitative and quantitative gains (improved DreamSim, R-Precision, and dynamic/visual quality metrics).
- Mitigates mode collapse and frame-to-frame contextual drift observed with conventional CFG.
6. Theoretical and Practical Significance
The mathematical foundation of CS-CFG—its factorized guidance structure and construction of null context—provides a principled mechanism for controlling competing aspects of multimodal generation. By assigning independent update directions and coefficients, CS-CFG allows practitioners to manage the trade-off between style intensity and contextual fidelity. This is of particular significance in applications requiring temporal or spatial coherence (such as video, motion synthesis, or multimodal editing), where standard guidance can otherwise degrade context for the benefit of style transfer.
Moreover, modular integration with adapters and compatibility with training from synthetic paired data (generated via shared augmentations) make CS-CFG broadly applicable in settings lacking large-scale paired video datasets.
7. Extensions and Future Directions
CS-CFG exemplifies a broader class of factorized guidance strategies that may be generalized beyond video-to-video style transfer to any multimodal generative modeling context—e.g., text-image-style fusion, audio-visual alignment, cross-domain translation. Open questions include:
- Automated selection of permutation strategies for null context construction.
- Extension to higher-order or non-linear guidance terms for more flexible style–context balancing.
- Incorporating learned context–style factorization via contrastive principal component decompositions, as in related work on linear and non-linear CFG mechanisms (Li et al., 25 May 2025, Pavasovic et al., 11 Feb 2025).
More generally, CS-CFG provides a foundation for conditional generation systems where the preservation of both semantic content and artistic style is paramount, and where explicit disentanglement of guidance signals enhances both control and sample quality.