ReaSon: Causal Keyframe Selection
- The paper’s main contribution is the formulation of keyframe selection as a composite optimization that integrates predictive sufficiency with causal necessity using a novel Causal Information Bottleneck.
- ReaSon employs reinforcement learning with group-wise REINFORCE to optimize a composite reward that combines answer correctness, cycle-consistency, and counterfactual divergence.
- Empirical evaluations on multiple video QA benchmarks demonstrate state-of-the-art performance and robust generalization under strict frame budgets.
Reinforced Causal Search with Information Bottleneck (ReaSon) is a framework for keyframe selection in video understanding tasks with vision-language models (VLMs) under input-token and efficiency constraints. It formalizes keyframe selection as a composite optimization problem that couples a novel Causal Information Bottleneck with reinforcement learning of a selection policy and counterfactual reasoning. ReaSon explicitly defines keyframes not only by informativeness (predictive sufficiency) but also by causal necessity, using tractable surrogates to optimize both criteria. Extensive benchmarks demonstrate state-of-the-art performance and improved generalization under stringent frame budgets (Zhou et al., 16 Nov 2025).
1. Causal Information Bottleneck Formalism
Traditional Information Bottleneck (IB) approaches select keyframes $S$ to maximize mutual information with the VLM output $Y$ while compressing the input video-question pair $X = (V, Q)$:

$$\max_{S} \; I(S; Y) - \beta\, I(S; X),$$

where sufficiency is expressed via $I(S; Y)$. ReaSon introduces the Causal Information Bottleneck (CIB) to account for causal necessity, adding an interventional mutual-information term:

$$\max_{S} \; I(S; Y) - \beta\, I(S; X) + \gamma\, I\big(Y; \mathrm{do}(S)\big),$$

where $I(Y; \mathrm{do}(S))$ quantifies the effect of interventions over the selected frames (using Pearl's do-operator).
Because both terms are intractable in realistic VLMs, ReaSon employs:
- Sufficiency surrogate: $I(S; Y)$ is approximated by the likelihood of the correct answer given the selected frames, $\mathbb{E}_{s \sim \pi_\theta}\big[\log p_\phi(y \mid s, q)\big]$, with $p_\phi$ approximated via a frozen VLM.
- Necessity surrogate: $I(Y; \mathrm{do}(S))$ is approximated by the divergence between predictions from the selected and counterfactual subsets, $\mathbb{E}\big[D_{\mathrm{KL}}\big(p_\phi(\cdot \mid s, q) \,\|\, p_\phi(\cdot \mid s', q)\big)\big]$, where the counterfactual $s'$ is sampled from the inverted policy $\bar{\pi}_\theta$.
The final practical formulation is

$$\max_{\theta} \; \mathbb{E}_{s \sim \pi_\theta}\Big[\log p_\phi(y \mid s, q) \;+\; \lambda\, D_{\mathrm{KL}}\big(p_\phi(\cdot \mid s, q) \,\|\, p_\phi(\cdot \mid s', q)\big)\Big].$$

This objective operationalizes both predictive sufficiency and causal necessity in a tractable RL setting (Zhou et al., 16 Nov 2025).
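The inverted-policy counterfactual can be made concrete with a small sketch. The version below assumes the inversion simply places probability proportional to $1 - p$ on each candidate frame and renormalizes; this construction and the function name `sample_counterfactual` are illustrative, not taken from the paper.

```python
import numpy as np

def sample_counterfactual(select_probs, k, rng=None):
    """Draw a counterfactual frame subset from an 'inverted' policy.

    Assumption (illustrative): the inverted policy puts mass proportional
    to 1 - p on each frame, so frames the learned policy considers
    unimportant become the most likely counterfactual picks.
    """
    rng = rng or np.random.default_rng()
    p = np.asarray(select_probs, dtype=float)
    inverted = (1.0 - p) / (1.0 - p).sum()   # renormalize to a distribution
    # k distinct frame indices, sampled without replacement
    return rng.choice(len(p), size=k, replace=False, p=inverted)

# Example: the policy strongly prefers frames 0 and 1, so the
# counterfactual subset tends to fall on frames 2-5 instead.
print(sample_counterfactual([0.9, 0.8, 0.1, 0.05, 0.1, 0.05], k=2))
```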
2. Reinforcement-Learning Formulation for Keyframe Selection
ReaSon casts keyframe selection as a stateless, one-step Markov decision process:
- State: The candidate pool $f$ (frames filtered by visual relevance and open-vocabulary detectors) paired with the question $q$.
- Action: Selection of a subset $s \subseteq f$ with $|s| = K$, where sampling uses independent Bernoulli or top-$K$ multinomial policies.
- Transition: One step; the episode terminates with an immediate reward.
- Policy network: Each candidate frame is encoded via a frozen BLIP encoder; the embeddings are aggregated by a 3-layer LSTM, and an MLP produces per-frame logits, which are normalized to selection probabilities.
- RL algorithm: Group-wise REINFORCE (GPG) samples $G$ subsets $s_1, \dots, s_G$ and one counterfactual per subset for each instance. Per-subset rewards $R_i$ are used to compute intra-group advantages
$$\hat{A}_i = R_i - \frac{1}{G} \sum_{j=1}^{G} R_j.$$
Parameters are updated by
$$\nabla_\theta J(\theta) \approx \frac{1}{G} \sum_{i=1}^{G} \hat{A}_i \, \nabla_\theta \log \pi_\theta(s_i \mid f, q).$$
This RL paradigm enables direct optimization of the composite reward derived from CIB surrogates (Zhou et al., 16 Nov 2025).
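The encoder-aggregator-scorer pipeline and the top-$K$ sampler described above can be sketched as follows. Hidden sizes, the softmax normalization, and treating the without-replacement draw as a sum of independent log-probabilities are illustrative assumptions; frame embeddings are assumed to be precomputed by a frozen BLIP encoder.

```python
import torch
import torch.nn as nn

class KeyframePolicy(nn.Module):
    """Sketch of the selection policy: frozen-BLIP frame embeddings are
    aggregated by a 3-layer LSTM and scored per frame by an MLP.
    Dimensions are illustrative assumptions."""

    def __init__(self, embed_dim=768, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, frame_embeds):          # (B, N, embed_dim) from frozen BLIP
        h, _ = self.lstm(frame_embeds)        # (B, N, hidden_dim)
        return self.mlp(h).squeeze(-1)        # (B, N) per-frame logits


def sample_topk(logits, k):
    """Top-K multinomial sampling: draw K distinct frames and return their
    indices plus a summed log-probability, used as log pi_theta(s|f,q)
    in REINFORCE (an approximation for without-replacement draws)."""
    probs = torch.softmax(logits, dim=-1)                              # (B, N)
    idx = torch.multinomial(probs, num_samples=k, replacement=False)   # (B, K)
    log_prob = torch.log(probs.gather(-1, idx) + 1e-8).sum(dim=-1)     # (B,)
    return idx, log_prob
```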
3. Composite Reward Structure and CIB Alignment
ReaSon's reward function aligns term-by-term with the CIB objective, decomposing into three components:
- Answer correctness ($R_{\mathrm{ans}}$, proxy for the sufficiency term): an indicator of whether the frozen VLM answers correctly from the selected frames.
- Cycle-consistency ($R_{\mathrm{cycle}}$): intersection-over-union between the visual-semantic element sets inferred from the question ($E_q$, extracted by the VLM from uniformly sampled frames) and from the generated answer ($E_a$).
- Counterfactual KL ($R_{\mathrm{cf}}$, proxy for the necessity term): KL divergence between the VLM output distributions obtained from the selected subset and from a counterfactual subset sampled from the inverted policy.

The weights $\lambda_1$ and $\lambda_2$ (both set to 0.5) balance semantic tightness and causal separation, ensuring that both sufficiency and necessity are incentivized during training.
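A minimal sketch of this composite reward follows, assuming stand-in interfaces for the frozen VLM's answer logits and element extraction; the argument names are illustrative, not the paper's API.

```python
import torch
import torch.nn.functional as F

def composite_reward(pred_answer, gt_answer, q_elements, a_elements,
                     logits_sel, logits_cf, lam1=0.5, lam2=0.5):
    """Composite reward R = R_ans + lam1 * R_cycle + lam2 * R_cf.
    Element lists and answer-logit tensors are assumed to come from the
    frozen VLM; names are illustrative."""
    # R_ans: indicator of answer correctness on the selected frames
    r_ans = float(pred_answer == gt_answer)

    # R_cycle: IoU between element sets inferred from question and answer
    q_set, a_set = set(q_elements), set(a_elements)
    r_cycle = len(q_set & a_set) / max(len(q_set | a_set), 1)

    # R_cf: KL(softmax(o_sel) || softmax(o_cf)).  F.kl_div(input, target)
    # computes sum target * (log target - input), so pass log p_cf as the
    # input and p_sel as the target to obtain KL(p_sel || p_cf).
    p_sel = F.softmax(logits_sel, dim=-1)
    log_p_cf = F.log_softmax(logits_cf, dim=-1)
    r_cf = F.kl_div(log_p_cf, p_sel, reduction="sum").item()

    return r_ans + lam1 * r_cycle + lam2 * r_cf
```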
4. Policy Network and Training Protocol
Training follows the loop:
```
Input:
  – Dataset {(v^n, q^n)}_{n=1}^N
  – Frozen VLM, open-vocabulary detector, policy π_θ, λ₁ = λ₂ = 0.5, G = 4

for each (v, q) in the training set do
  1) Build candidate pool f ← Detector(VLM(UniformSample(v), q))
  2) for i = 1 to G do
       • Sample s_i ∼ π_θ(·|f, q)
       • Logits o_i ← VLM.logits(s_i, q)
       • Answer a_i ← VLM.generate(s_i, q)
       • Extract E_q via VLM(Uniform(v), q)
       • Extract E_{a,i} via VLM(a_i, q)
       • Sample counterfactual s'_i ∼ π̄_θ(·|f, q)      # inverted policy
       • Logits o'_i ← VLM.logits(s'_i, q)
       • Compute
           R_{ans}^i   = 𝟙[a_i = gt]
           R_{cycle}^i = IoU(E_q, E_{a,i})
           R_{cf}^i    = KL(softmax(o_i) || softmax(o'_i))
       • R_i = R_{ans}^i + λ₁·R_{cycle}^i + λ₂·R_{cf}^i
     end for
  3) Compute advantages Â_i = R_i − mean_j R_j
  4) Update θ via ∇_θ ≈ (1/G) Σ_i Â_i ∇_θ log π_θ(s_i|f, q)
end for
```
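For readers who prefer framework code, steps 3-4 of the loop can be sketched in PyTorch as below; the tensor interfaces and the surrounding optimizer setup are assumptions, not the paper's implementation.

```python
import torch

def gpg_step(optimizer, log_probs, rewards):
    """One group-wise REINFORCE (GPG) update for a single (v, q) instance.

    log_probs: length-G tensor of log pi_theta(s_i | f, q), carrying grads.
    rewards:   length-G sequence of composite rewards R_i.
    The mean-baseline advantage and sign convention mirror steps 3-4 above;
    the optimizer choice is an assumption."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    advantages = rewards - rewards.mean()        # A_hat_i = R_i - mean_j R_j
    loss = -(advantages * log_probs).mean()      # ascend (1/G) Σ A_hat_i ∇ log π
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```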
5. Empirical Evaluation and Generalization
ReaSon's empirical validation spans three video QA benchmarks under tight frame budgets (e.g., $K = 8$ and $K = 32$):
- NExT-QA: Short videos (analytic/temporal/causal queries).
- EgoSchema: 3-minute egocentric videos.
- Video-MME: Hour-scale videos.
Key results (accuracy):
| Dataset (Model) | SOTA (AKEYS) | ReaSon (best) |
|---|---|---|
| NExT-QA (GPT-4o) | 78.1% | 77.6% |
| NExT-QA (LLaVA-Video-7B) | - | 81.4% |
| NExT-QA (Qwen2.5-VL-7B) | - | 80.4% |
| EgoSchema (GPT-4o) | 68.6% | 72.2% |
On Video-MME with GPT-4o, ReaSon reaches 59.1% at $K = 8$ versus a 53.8% baseline, and 66.4% at $K = 32$ versus 61.8%; gains are most pronounced on short clips.
Ablation studies confirm the contribution of each reward term:
- Answer-correctness reward only ($R_{\mathrm{ans}}$): 80.1%/66.0%.
- With cycle-consistency ($R_{\mathrm{cycle}}$): 80.5%/68.2%.
- With counterfactual KL ($R_{\mathrm{cf}}$): 81.4%/69.0%.

These outcomes substantiate that integrating both sufficiency and necessity rewards is essential for peak performance.
Across all experiments and VLMs, ReaSon enhances performance by 1–5 points, confirming broad generalization and robustness under token constraints.
6. Significance and Conceptual Advances
ReaSon establishes a principled, tractable implementation of the Causal Information Bottleneck for keyframe selection in video understanding. It bridges information-theoretic sufficiency and causal necessity with RL and modern VLMs, using composite rewards and counterfactual interventions. The framework’s generalization to varying VLM backbones, clip lengths, and question types demonstrates its utility under strict input budgets, making it a reference approach for future video-VLM integration under realistic deployment constraints (Zhou et al., 16 Nov 2025).