- The paper introduces a novel framework that uses iterative VLM-guided prompt optimization to generate video counterfactuals respecting causal relationships.
- It leverages textual gradient descent and causal decoupling to steer black-box video editing models through their text prompts alone, significantly improving intervention effectiveness.
- The approach maintains minimal changes to non-target attributes while enabling realistic 'what-if' video scenarios across various editing backbones.
This paper introduces a novel framework called "Causally Steered Diffusion" for generating video counterfactuals that respect causal relationships between attributes. The core problem addressed is that standard video editing diffusion models, while powerful, can produce unrealistic or misleading results when an edit to one attribute (e.g., making a person younger) is not causally consistent with other related attributes (e.g., presence of a beard, which might be less likely in a very young person according to a causal model).
The proposed method steers existing, black-box video editing diffusion models by optimizing the input text prompt. This optimization is guided by a vision-language model (VLM) that evaluates the generated video frames against the desired counterfactual and a predefined causal graph. The key idea is to use textual feedback from the VLM to iteratively refine the prompt, pushing the video generation towards a causally faithful outcome without needing to modify or fine-tune the underlying video editing model.
Core Mechanism: VLM-Guided Prompt Optimization
- Black-Box Video Editing: The system treats any text-guided video editing model (e.g., FLATTEN, Tune-A-Video, TokenFlow) as a black-box function $V' = f(V, P)$, where $V$ is the input video, $P$ is the text prompt, and $V'$ is the edited video.
- VLM-based Counterfactual Loss: A VLM is used to assess how well a generated frame $\mathcal{V}'_{\mathit{frame}}$ aligns with the target counterfactual interventions described in the current prompt $\mathcal{P}$. This is formulated as a "multimodal loss":
$\mathcal{L} = VLM(\mathcal{V}'_{\mathit{frame}}, \mathit{evaluation\_instruction}, \mathcal{P})$
The evaluation_instruction is a carefully crafted prompt given to the VLM, detailing what to look for based on the target interventions. Crucially, this instruction can be augmented with a "causal decoupling" textual input. This instructs the VLM to simulate causal graph mutilation (e.g., when intervening on a downstream variable like "beard", ignore upstream variables like "gender" if the goal is to add a beard to a woman, thereby breaking the typical causal link). A sketch of how this loss call might look in code appears after this list.
- Textual Gradient Descent (TGD): The textual feedback (criticisms and suggestions) from the VLM, representing $\mathcal{L}$, is used to compute a "textual gradient" $\frac{\partial \mathcal{L}}{\partial \mathcal{P}}$. This textual gradient is then used by an LLM to update the prompt $\mathcal{P}$:
$\mathcal{P}_{\mathit{new}} = \mathrm{TGD}\!\left(\mathcal{P}, \frac{\partial \mathcal{L}}{\partial \mathcal{P}}\right)$
This process is iterated, refining the prompt until the VLM indicates "no optimization is needed" or a maximum number of iterations is reached.
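As a rough illustration of the loss described above, the following Python sketch assembles the evaluation instruction (optionally augmented with a causal-decoupling clause) and passes it, along with a frame and the current prompt, to a VLM. The `query_vlm` wrapper, the helper names, and the instruction wording are assumptions for illustration, not the paper's verbatim prompts.

```python
# Minimal sketch of the VLM counterfactual loss L (illustrative, not verbatim).
# `query_vlm` is a hypothetical wrapper around a vision-language model
# (e.g., GPT-4o) that takes one image plus a text instruction and returns text.

def build_evaluation_instruction(target_interventions, causal_decoupling=True):
    """Assemble the instruction the VLM uses to critique a generated frame."""
    instruction = (
        "Evaluate whether the frame reflects these target interventions: "
        + ", ".join(f"{k}={v}" for k, v in target_interventions.items())
        + ". List what is wrong or missing; if everything matches, "
          "reply 'no optimization is needed'."
    )
    if causal_decoupling and {"beard", "bald"} & set(target_interventions):
        # Simulated causal graph mutilation: ignore upstream variables of the
        # intervened attribute (wording here is illustrative).
        instruction += (
            " Do not include references to age or gender when judging "
            "beard or baldness."
        )
    return instruction


def vlm_counterfactual_loss(query_vlm, frame, prompt, target_interventions):
    """Return the VLM's textual criticism, which plays the role of the loss L."""
    instruction = build_evaluation_instruction(target_interventions)
    return query_vlm(image=frame, text=f"{instruction}\nCurrent prompt: {prompt}")
```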
The overall algorithm (Algorithm 1) initializes with a counterfactual prompt, and in each iteration, generates a counterfactual video using the current prompt, evaluates a frame using the VLM to get a loss (textual feedback), computes the textual gradient, and updates the prompt.
```
Algorithm 1: Causally Steered Diffusion
Inputs: initial counterfactual prompt P_init, factual video V,
        DiffusionVideoEditor, VLM

prompt    = P_init
optimizer = TextualGradientDescent(parameters=[prompt])
for iter in 1..max_iterations:
    V_counterfactual = DiffusionVideoEditor(V, prompt)
    V_frame          = extract_frame(V_counterfactual)
    loss_feedback    = VLM(V_frame, evaluation_instruction, prompt)  // VLM evaluation
    if "no optimization is needed" in loss_feedback:
        break
    textual_gradient = compute_textual_gradient(loss_feedback)       // VLM's criticisms
    prompt = optimizer.step(prompt, textual_gradient)                 // LLM refines prompt
return V_counterfactual
```
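The `optimizer.step` call above amounts to a single LLM rewrite of the prompt conditioned on the VLM's criticism. Below is a minimal sketch, assuming a hypothetical text-only LLM wrapper `query_llm` (e.g., around GPT-4o); the update template wording is illustrative rather than the paper's exact prompt.

```python
# Minimal sketch of the textual gradient descent step (optimizer.step above).
# `query_llm` is a hypothetical wrapper around a text-only LLM call.

def textual_gradient_step(query_llm, prompt, textual_gradient):
    """Ask an LLM to rewrite the prompt so the VLM's criticisms are addressed."""
    update_request = (
        "You are optimizing a text prompt for a video editing model.\n"
        f"Current prompt: {prompt}\n"
        f"Feedback (textual gradient): {textual_gradient}\n"
        "Rewrite the prompt so the feedback is addressed while keeping all "
        "other details unchanged. Return only the new prompt."
    )
    return query_llm(update_request)
```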
Evaluation
The framework is evaluated on its ability to generate causally faithful video counterfactuals, focusing on:
- Causal Effectiveness: Measures whether the target intervention was successfully applied. This is assessed using a VLM in a visual question answering (VQA) setup. Given a generated frame, a multiple-choice question about the intervened attribute (e.g., "What is the age of the person? a) young b) old"), and the correct answer from the target prompt, effectiveness is the VLM's accuracy (a code sketch of both metrics appears after this list):
$\mathit{Effectiveness}(\alpha) = \frac{1}{N} \sum_{i=1}^{N} \mathds{1}\left[ VLM(\mathcal{V}'_{\mathit{frame}_i}, Q^{\alpha}_{i}) = C_i \right]$
- Minimality: Assesses whether only the intended attributes were changed, while unrelated attributes (those not part of the causal graph) were preserved. This is evaluated in the text domain: a VLM describes both the factual and counterfactual frames with instructions to exclude attributes from the causal graph, and the BERT-based sentence embeddings of these descriptions are then compared using cosine similarity:
$\mathcal{P}_{min} = \text{``Describe this frame in detail, exclude causal graph variables''}$
$\mathit{Minimality}(\mathcal{V}_{\mathit{frame}}, \mathcal{V}'_{\mathit{frame}}) = \cos\left(\tau_\phi(VLM(\mathcal{V}_{\mathit{frame}}, \mathcal{P}_{min})), \tau_\phi(VLM(\mathcal{V}'_{\mathit{frame}}, \mathcal{P}_{min}))\right)$
where $\tau_\phi(\cdot)$ is the semantic (sentence-embedding) encoder.
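A minimal sketch of how both metrics could be computed, assuming a hypothetical `query_vlm` wrapper, naive exact-match answer parsing, and a sentence-transformers model as a stand-in for the BERT-based encoder $\tau_\phi$; the paper itself uses LLaVA-NeXT and GPT-4o for these evaluations.

```python
# Illustrative sketch of the Effectiveness (VQA accuracy) and Minimality
# (description similarity) metrics; the answer parsing and encoder choice
# are simplifying assumptions, not the paper's exact pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the encoder τ_φ

def effectiveness(query_vlm, frames, questions, correct_answers):
    """VQA accuracy over N counterfactual frames for one intervened attribute."""
    hits = [
        query_vlm(image=f, text=q).strip().lower() == c.strip().lower()
        for f, q, c in zip(frames, questions, correct_answers)
    ]
    return sum(hits) / len(hits)

def minimality(query_vlm, factual_frame, counterfactual_frame):
    """Cosine similarity of VLM descriptions that exclude causal-graph variables."""
    p_min = "Describe this frame in detail, exclude causal graph variables"
    desc_factual = query_vlm(image=factual_frame, text=p_min)
    desc_counterfactual = query_vlm(image=counterfactual_frame, text=p_min)
    a, b = encoder.encode([desc_factual, desc_counterfactual])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```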
Standard video quality metrics (DOVER for aesthetic and technical quality, FVD for distributional similarity, CLIP-Temp for temporal consistency) are also used.
Implementation and Experiments
- Dataset: 67 text-video pairs from CelebV-Text, with interventions on "age", "gender", "beard", and "baldness", based on an assumed causal graph (e.g., Gender → Beard, Age → Beard, Age → Bald); a sketch of this graph as a simple data structure follows this list.
- Video Editing Backbones: FLATTEN, Tune-A-Video, and TokenFlow, all using Stable Diffusion v2.1. These are treated as black boxes.
- VLMs/LLMs:
- GPT-4o for the VLM counterfactual loss $\mathcal{L}$ and for the Textual Gradient Descent update step.
- LLaVA-NeXT for the VLM effectiveness metric.
- GPT-4o for the VLM minimality metric, chosen for its ability to filter descriptions.
- Optimization: 2 TGD iterations.
- Comparisons: The proposed method (with and without causal decoupling in the VLM loss) is compared against vanilla video editing methods (using initial unoptimized prompts) and a naive LLM-based paraphrasing baseline.
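For reference, here is a tiny sketch of how the assumed causal graph and the "causal decoupling" rule could be encoded; the adjacency representation and helper name are assumptions for illustration, not the paper's code.

```python
# Hypothetical adjacency encoding of the assumed causal graph:
# Gender -> Beard, Age -> Beard, Age -> Bald.
CAUSAL_PARENTS = {
    "beard": ["gender", "age"],
    "bald": ["age"],
    "age": [],
    "gender": [],
}

def decoupled_variables(target_interventions):
    """Upstream (parent) variables the VLM should ignore when judging an
    intervened attribute, i.e., the edges cut by simulated graph mutilation."""
    ignored = set()
    for attribute in target_interventions:
        ignored.update(CAUSAL_PARENTS.get(attribute, []))
    return ignored - set(target_interventions)

# Example: decoupled_variables({"beard": "has a beard"}) -> {"gender", "age"}
```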
Key Results
- Improved Causal Effectiveness: The VLM-steered prompt optimization significantly improves the effectiveness of interventions across all tested video editing backbones, especially for challenging edits that break common correlations (e.g., adding a beard to a female). The "VLM loss w/ causal dec" (with causal decoupling) variant generally performs best in these scenarios.
- Maintained Minimality and Quality: While effective edits can sometimes slightly increase perceptual differences (LPIPS) or reduce VLM-based minimality scores, the proposed method largely maintains minimality comparable to baselines. General video quality (DOVER, FVD) and temporal consistency (CLIP-Temp) are not significantly compromised.
- Qualitative Improvements: Visual examples show the method successfully generating desired counterfactuals, such as making an older woman young, adding a beard to a woman, or transforming gender, often outperforming unoptimized prompts or simple LLM paraphrasing. The paper also shows the progressive refinement of the video output as the prompt is optimized over TGD steps.
Practical Implications and Limitations
- Model Agnostic Control: The framework offers a way to enhance causal faithfulness in any black-box text-to-video editing system without retraining or fine-tuning, relying solely on prompt engineering guided by VLMs.
- Controllable "What-If" Scenarios: It enables the generation of more realistic and causally plausible "what-if" video scenarios, which is valuable in fields like healthcare (simulating treatment outcomes), education, and digital media.
- Computational Cost: The iterative nature, involving multiple calls to large VLMs/LLMs and the video generation model per optimization step, can be computationally intensive.
- Dependence on Causal Graph and VLM: The quality of results depends on the accuracy of the assumed causal graph and the VLM's ability to evaluate visual attributes and provide useful textual feedback.
- Static Attribute Focus: The current work focuses on static attributes. Manipulating temporal attributes (actions, dynamic scenes) and building corresponding causal graphs are noted as future work.
- Temporal Consistency: While temporal consistency is measured, the method doesn't explicitly add new mechanisms to enforce it beyond what the underlying video editing model provides.
Appendix Insights
The appendix provides further implementation details:
- Dataset Construction: Explains how factual prompts from CelebV-Text were used to derive counterfactual target prompts based on interventions on the causal graph (Age, Gender, Beard, Baldness).
- Prompt Structures: Gives examples of the evaluation_instruction for the VLM loss, including the causal_decoupling_prompt (e.g., "If either beard or bald appears in target_interventions, do not include references to age or gender.").
- VLM Feedback Example: Shows an example of the VLM's textual feedback (criticism) and the derived "textual gradient" used to update the prompt.
- VLM Evaluation Pipelines: Illustrates the VQA pipeline for effectiveness and the descriptive pipeline for minimality, including the prompt used to instruct the VLM to filter descriptions for the minimality metric.
In summary, the paper presents a practical method to steer diffusion-based video editing models towards causally sound counterfactuals by iteratively optimizing input prompts using textual feedback from a VLM. This approach is powerful as it does not require access to the internals of the video editing model, making it broadly applicable.