
ReaSon: Causal Keyframe Selection

Updated 23 November 2025
  • The paper’s main contribution is the formulation of keyframe selection as a composite optimization that integrates predictive sufficiency with causal necessity using a novel Causal Information Bottleneck.
  • ReaSon employs reinforcement learning with group-wise REINFORCE to optimize a composite reward that combines answer correctness, cycle-consistency, and counterfactual divergence.
  • Empirical evaluations on multiple video QA benchmarks demonstrate state-of-the-art performance and robust generalization under strict frame budgets.

Reinforced Causal Search with Information Bottleneck (ReaSon) is a framework for keyframe selection in video understanding tasks with vision-LLMs (VLMs) under input-token and efficiency constraints. It formalizes keyframe selection as a composite optimization problem leveraging a novel Causal Information Bottleneck, reinforcement learning of a selection policy, and counterfactual reasoning. ReaSon explicitly defines keyframes not only by informativeness (predictive sufficiency) but also by causal necessity, using tractable surrogates to optimize both criteria. Extensive benchmarks demonstrate state-of-the-art performance and improved generalization under stringent frame budgets (Zhou et al., 16 Nov 2025).

1. Causal Information Bottleneck Formalism

Traditional Information Bottleneck (IB) approaches select keyframes S to maximize mutual information with the VLM output O while compressing the input video-question pair (V, Q):

\max_{p(s\mid v,q)} \; I(S;O) \quad \text{s.t.} \; I(V,Q;S) \le \beta

where sufficiency is expressed via I(S;O). ReaSon introduces the Causal Information Bottleneck (CIB) to account for causal necessity, adding an interventional mutual-information term:

\max_{p(s\mid v,q)} \; I(S;O) + I_c(O; \mathrm{do}(S)) \quad \text{s.t.} \; I(V,Q;S) \le \beta

where I_c(O; do(S)) quantifies the effect of interventions on the selected frames (using Pearl's do-operator).

Because both terms are intractable in realistic VLMs, ReaSon employs:

  • Sufficiency surrogate:

J_1(s) = \mathbb{E}_{p(o\mid s)}[\log q_\phi(o\mid s)]

with q_\phi(o|s) approximated via a frozen VLM.

  • Necessity surrogate:

J_2(s,s') = D_{KL}(p(o|s) \Vert p(o|s'))

where the counterfactual subset s' is sampled from the inverted selection policy.

The final practical objective combines both surrogates under the frame budget:

\max_{\pi} \; \mathbb{E}_{s\sim\pi}\bigl[J_1(s) + \lambda\, J_2(s, s')\bigr] \quad \text{s.t.} \; |s| \le K

This objective operationalizes both predictive sufficiency and causal necessity in a tractable RL setting (Zhou et al., 16 Nov 2025).
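Both surrogates reduce to simple operations on the frozen VLM's answer distribution. A minimal sketch, with stub categorical distributions standing in for the VLM outputs (the function names and toy numbers are ours, not the paper's):

```python
import numpy as np

def sufficiency_J1(p_o_given_s: np.ndarray, true_answer: int) -> float:
    """J_1(s) ~ E[log q(o|s)]: log-likelihood the frozen VLM assigns
    to the reference answer given the selected frames."""
    return float(np.log(p_o_given_s[true_answer] + 1e-12))

def necessity_J2(p_sel: np.ndarray, p_cf: np.ndarray) -> float:
    """J_2(s, s') = KL(p(o|s) || p(o|s')): divergence between answer
    distributions under the selected and counterfactual subsets."""
    return float(np.sum(p_sel * (np.log(p_sel + 1e-12) - np.log(p_cf + 1e-12))))

# Toy answer distributions over 4 answer options.
p_selected = np.array([0.7, 0.1, 0.1, 0.1])            # VLM output on chosen frames
p_counterfactual = np.array([0.25, 0.25, 0.25, 0.25])  # output on inverted-policy frames

j1 = sufficiency_J1(p_selected, true_answer=0)
j2 = necessity_J2(p_selected, p_counterfactual)
print(j1, j2)
```

A confident, correct answer distribution drives J_1 toward zero from below, while a large gap between selected and counterfactual distributions makes J_2 large, so maximizing both favors frames that are simultaneously informative and causally load-bearing.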

2. Reinforcement-Learning Formulation for Keyframe Selection

ReaSon casts keyframe selection as a stateless, one-step Markov decision process:

  • State: The candidate pool of frames (filtered by visual relevance and open-vocabulary detectors) paired with the question Q.
  • Action: Select a subset s of K frames, where sampling uses independent Bernoulli or top-K multinomial policies.
  • Transition: One-step, episode terminates with immediate reward.
  • Policy network: Each candidate frame is encoded via a frozen BLIP encoder; embeddings are aggregated by a 3-layer LSTM; an MLP produces per-frame logits, which are normalized to selection probabilities.
  • RL algorithm: Group-wise REINFORCE (GPG) samples G subsets and one counterfactual for each instance. Per-subset rewards r_i are used to compute intra-group advantages:

A_i = r_i - \frac{1}{G} \sum_{j=1}^{G} r_j

Parameters are updated by the policy-gradient step

\theta \leftarrow \theta + \eta \, \frac{1}{G} \sum_{i=1}^{G} A_i \, \nabla_\theta \log \pi_\theta(s_i \mid v, q)

This RL paradigm enables direct optimization of the composite reward derived from CIB surrogates (Zhou et al., 16 Nov 2025).
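The sampling-and-update step can be sketched with an independent-Bernoulli policy. This is an illustrative reconstruction, not the paper's code: the frame count, group size G, learning rate, and toy reward below are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, G, lr = 6, 4, 0.5

logits = np.zeros(num_frames)   # policy parameters: one logit per candidate frame

def sample_group():
    """Sample G frame subsets from an independent-Bernoulli policy."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (rng.random((G, num_frames)) < probs).astype(float), probs

def toy_reward(subset):
    # Stand-in reward: pretend frames 0 and 1 are the true keyframes.
    return subset[0] + subset[1] - 0.2 * subset[2:].sum()

for _ in range(200):
    subsets, probs = sample_group()
    rewards = np.array([toy_reward(s) for s in subsets])
    advantages = rewards - rewards.mean()   # intra-group baseline (group mean)
    # Gradient of the Bernoulli log-prob wrt logits is (s - p) per frame.
    grad = ((subsets - probs) * advantages[:, None]).mean(axis=0)
    logits += lr * grad                     # REINFORCE ascent step

probs = 1.0 / (1.0 + np.exp(-logits))
print(np.round(probs, 2))
```

After training, the selection probabilities of the reward-bearing frames rise well above the rest, which is exactly the behavior the intra-group advantage is meant to amplify while keeping gradient variance low.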

3. Composite Reward Structure and CIB Alignment

ReaSon’s reward function aligns with the CIB objective, decomposing into three terms:

r = r_{ans} + \lambda_1 \, r_{cyc} + \lambda_2 \, r_{cf}

  • Answer correctness (r_ans, proxy for I(S;O)): Indicator for VLM answer correctness on the selected frames.

r_{ans} = \mathbb{1}\bigl[\hat{o} = o^{*}\bigr]

  • Cycle-consistency (r_cyc): Intersection-over-union between visual-semantic elements inferred from the question and from the answer.
    • Element sets E_q are extracted via the VLM from uniformly sampled frames, and E_a from VLM answer prompts.

r_{cyc} = \frac{|E_q \cap E_a|}{|E_q \cup E_a|}

  • Counterfactual KL (r_cf, proxy for I_c(O; do(S))): KL-divergence between model outputs from the selected and counterfactual subsets.

r_{cf} = D_{KL}(p(o|s) \Vert p(o|s'))

with the counterfactual subset s' sampled from the inverted policy.

Weights \lambda_1 and \lambda_2 (both set to 0.5) balance semantic tightness and causal separation. This reward design ensures both sufficiency and necessity are incentivized during training.
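The three terms can be sketched as follows. The helper inputs (predicted answer strings, element sets, answer distributions) are hypothetical stand-ins for VLM outputs; only the structure of the reward mirrors the description above.

```python
import numpy as np

def answer_reward(pred: str, gold: str) -> float:
    """Indicator for VLM answer correctness on the selected frames."""
    return 1.0 if pred == gold else 0.0

def cycle_consistency(q_elems: set, a_elems: set) -> float:
    """IoU between visual-semantic element sets from question and answer."""
    if not q_elems and not a_elems:
        return 0.0
    return len(q_elems & a_elems) / len(q_elems | a_elems)

def counterfactual_kl(p_sel, p_cf) -> float:
    """KL divergence between selected- and counterfactual-subset outputs."""
    p_sel, p_cf = np.asarray(p_sel), np.asarray(p_cf)
    return float(np.sum(p_sel * np.log((p_sel + 1e-12) / (p_cf + 1e-12))))

def composite_reward(pred, gold, q_elems, a_elems, p_sel, p_cf,
                     w_cyc=0.5, w_cf=0.5):
    # w_cyc and w_cf play the role of the 0.5 weights described above.
    return (answer_reward(pred, gold)
            + w_cyc * cycle_consistency(q_elems, a_elems)
            + w_cf * counterfactual_kl(p_sel, p_cf))

r = composite_reward("cat", "cat", {"cat", "sofa"}, {"cat"},
                     [0.8, 0.1, 0.1], [1/3, 1/3, 1/3])
print(r)
```

A correct answer contributes 1.0, partial element overlap contributes a fraction of 0.5, and any divergence from the counterfactual prediction adds a positive causal-separation bonus.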

4. Policy Network and Training Protocol

Training follows the loop:

Sample a group of subsets (and their counterfactuals) from the current policy, query the frozen VLM on each, compute the composite reward, form intra-group advantages, and update the policy network. This protocol leverages a frozen VLM for all surrogates, uses open-vocabulary detectors to construct frame pools, and employs intra-group advantages for sample efficiency and variance reduction.
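The training loop can be sketched end to end. The frozen VLM, the detector-based pool construction, and the reward below are stubs of our own devising, so only the control flow mirrors the protocol:

```python
import numpy as np

rng = np.random.default_rng(1)

def build_pool(video, question):
    """Stub for relevance filtering + open-vocab detection: frame ids only."""
    return np.arange(8)

def frozen_vlm(frames):
    """Stub frozen VLM: confident answer only when frame 3 is kept."""
    return np.array([0.9, 0.1]) if 3 in frames else np.array([0.5, 0.5])

def reward(frames, cf_frames):
    """Composite reward stub: correctness term + counterfactual KL term."""
    p, p_cf = frozen_vlm(frames), frozen_vlm(cf_frames)
    correct = float(p.argmax() == 0 and 3 in frames)
    kl = float(np.sum(p * np.log((p + 1e-12) / (p_cf + 1e-12))))
    return correct + 0.5 * kl

logits = np.zeros(8)                      # Bernoulli policy parameters
for _ in range(300):
    pool = build_pool(video=None, question=None)
    probs = 1 / (1 + np.exp(-logits))
    group = (rng.random((4, len(pool))) < probs).astype(float)  # G = 4 subsets
    cf = 1.0 - group                                            # inverted policy
    rs = np.array([reward(np.flatnonzero(s), np.flatnonzero(c))
                   for s, c in zip(group, cf)])
    adv = rs - rs.mean()                                        # intra-group advantage
    logits += 0.5 * ((group - probs) * adv[:, None]).mean(axis=0)

final_probs = 1 / (1 + np.exp(-logits))
print(int(np.argmax(final_probs)))
```

The policy concentrates probability on the one frame the stub VLM actually needs, illustrating how the frozen scorer alone supplies the learning signal.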

5. Empirical Evaluation and Generalization

ReaSon’s empirical validation spans three video QA benchmarks and frame budgets (K = 8 and K = 32):

  • NExT-QA: Short videos (analytic/temporal/causal queries).
  • EgoSchema: 3-minute egocentric videos.
  • Video-MME: Hour-scale videos.

Key results for K = 8 frames:

Dataset (Model)             SOTA (AKEYS)    ReaSon (best)
NExT-QA (GPT-4o)            78.1%           77.6%
NExT-QA (LLaVA-Video-7B)    -               81.4%
NExT-QA (Qwen2.5-VL-7B)     -               80.4%
EgoSchema (GPT-4o)          68.6%           72.2%

On Video-MME (K = 8, GPT-4o), the baseline scores 53.8% and ReaSon achieves 59.1% (+5.3 pts); gains are most pronounced on short clips.

For K=32, ReaSon/GPT-4o achieves 66.4% vs. 61.8% baseline.

Ablation studies confirm:

  • Answer-only (r_ans): 80.1% / 66.0%.
  • + Cycle-consistency (r_cyc): 80.5% / 68.2%.
  • + Counterfactual (r_cf): 81.4% / 69.0%.

These outcomes substantiate that integrating both sufficiency and necessity rewards is crucial for maximal performance.

Across all experiments and VLMs, ReaSon enhances performance by 1–5 points, confirming broad generalization and robustness under token constraints.

6. Significance and Conceptual Advances

ReaSon establishes a principled, tractable implementation of the Causal Information Bottleneck for keyframe selection in video understanding. It bridges information-theoretic sufficiency and causal necessity with RL and modern VLMs, using composite rewards and counterfactual interventions. The framework’s generalization to varying VLM backbones, clip lengths, and question types demonstrates its utility under strict input budgets, making it a reference approach for future video-VLM integration under realistic deployment constraints (Zhou et al., 16 Nov 2025).
