
ReaSon: Causal Keyframe Selection

Updated 23 November 2025
  • The paper’s main contribution is the formulation of keyframe selection as a composite optimization that integrates predictive sufficiency with causal necessity using a novel Causal Information Bottleneck.
  • ReaSon employs reinforcement learning with group-wise REINFORCE to optimize a composite reward that combines answer correctness, cycle-consistency, and counterfactual divergence.
  • Empirical evaluations on multiple video QA benchmarks demonstrate state-of-the-art performance and robust generalization under strict frame budgets.

Reinforced Causal Search with Information Bottleneck (ReaSon) is a framework for keyframe selection in video understanding tasks with vision-language models (VLMs) under input-token and efficiency constraints. It formalizes keyframe selection as a composite optimization problem that combines a novel Causal Information Bottleneck, reinforcement learning of a selection policy, and counterfactual reasoning. ReaSon explicitly defines keyframes not only by informativeness (predictive sufficiency) but also by causal necessity, using tractable surrogates to optimize both criteria. Extensive benchmarks demonstrate state-of-the-art performance and improved generalization under stringent frame budgets (Zhou et al., 16 Nov 2025).

1. Causal Information Bottleneck Formalism

Traditional Information Bottleneck (IB) approaches select keyframes $S$ to maximize mutual information with the VLM output $O$ while compressing the input video-question pair $(V, Q)$:

$$\max_{p(s\mid v,q)}\; I(S;O) \quad \text{s.t.} \quad I(V,Q;S)\le\beta$$

where sufficiency is expressed via $I(S;O)$. ReaSon introduces the Causal Information Bottleneck (CIB) to account for causal necessity, adding an interventional mutual-information term:

$$\max_{p(s\mid v,q)}\; I(S;O) + I_c\bigl(O;\mathrm{do}(S)\bigr) \quad \text{s.t.} \quad I(V,Q;S)\le\beta$$

where $I_c\bigl(O;\mathrm{do}(S)\bigr)$ quantifies the effect of interventions on the selected frames (using Pearl's do-operator).

Because both terms are intractable in realistic VLMs, ReaSon employs:

  • Sufficiency surrogate:

$$J_1(s) = \mathbb{E}_{p(o\mid s)}\bigl[\log q_\phi(o\mid s)\bigr]$$

with $q_\phi(o\mid s)$ approximated via a frozen VLM.

  • Necessity surrogate:

$$J_2(s,s') = D_{\mathrm{KL}}\bigl(p(o\mid s)\,\Vert\,p(o\mid s')\bigr)$$

where the counterfactual $s'$ is sampled from the inverted policy $\tilde\pi(f_i)\propto 1-\pi_\theta(f_i)$.

The final practical formulation is

$$\max_{\pi_\theta} \mathbb{E}_{s \sim \pi_\theta,\, s' \sim \tilde\pi}\bigl[J_1(s) + J_2(s, s')\bigr] \quad \text{s.t.} \quad |s| \le K.$$

This objective operationalizes both predictive sufficiency and causal necessity in a tractable RL setting (Zhou et al., 16 Nov 2025).
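
For concreteness, the following sketch shows how the two surrogates could be estimated from a frozen VLM's answer-token logits. The vlm_logits wrapper, the single-answer-token assumption, and the tensor shapes are illustrative assumptions, not details taken from the paper.

import torch.nn.functional as F

# Hypothetical wrapper around a frozen VLM: returns answer-token logits
# for the question conditioned on a subset of frames. Stand-in only.
def vlm_logits(frames, question):
    raise NotImplementedError

def sufficiency_surrogate(frames, question, answer_token_id):
    # J1(s) = E_{p(o|s)}[log q_phi(o|s)]: the log-probability the frozen VLM
    # assigns to the reference answer given the selected frames.
    log_probs = F.log_softmax(vlm_logits(frames, question), dim=-1)
    return log_probs[answer_token_id]

def necessity_surrogate(frames, cf_frames, question):
    # J2(s, s') = KL(p(o|s) || p(o|s')): divergence between the VLM's output
    # distributions for the selected and counterfactual frame subsets.
    log_p = F.log_softmax(vlm_logits(frames, question), dim=-1)
    log_q = F.log_softmax(vlm_logits(cf_frames, question), dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input).
    return F.kl_div(log_q, log_p, reduction="sum", log_target=True)

In practice, neither quantity is optimized directly; they are mirrored by the reward terms described in Section 3.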

2. Reinforcement-Learning Formulation for Keyframe Selection

ReaSon casts keyframe selection as a stateless, one-step Markov decision process:

  • State: The candidate pool $f = \{f_1, \dots, f_M\}$ (frames filtered by visual relevance and an open-vocabulary detector), paired with the question $q$.
  • Action: Select a subset $s \subset f$ with $|s| \le K$; sampling uses independent Bernoulli or top-$K$ multinomial policies.
  • Transition: One step; the episode terminates with an immediate reward.
  • Policy network: Each $(f_i, q)$ pair is encoded by a frozen BLIP encoder; the embeddings are aggregated by a 3-layer LSTM, and an MLP produces per-frame logits that are normalized into selection probabilities.
  • RL algorithm: Group-wise REINFORCE (GPG) samples $G$ subsets and one counterfactual per instance. Per-subset rewards $R_i$ yield intra-group advantages

$$\hat A_i = R_i - \frac{1}{G}\sum_{j=1}^{G} R_j,$$

and parameters are updated by

$$\nabla_\theta \mathcal{L} = \frac{1}{G} \sum_{i=1}^{G} \hat A_i\, \nabla_\theta \log \pi_\theta(s_i \mid f, q).$$

This RL paradigm enables direct optimization of the composite reward derived from the CIB surrogates (Zhou et al., 16 Nov 2025).
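
A minimal PyTorch sketch of this setup follows, assuming pre-computed frame-question embeddings (e.g., from a frozen BLIP encoder) of dimension 768. The class name FramePolicy, the hidden size, and the independent-Bernoulli subset log-probability are illustrative choices rather than the paper's exact implementation.

import torch
import torch.nn as nn

class FramePolicy(nn.Module):
    # Frame-question embeddings -> 3-layer LSTM -> MLP -> per-frame selection
    # probabilities, mirroring the policy network described above.
    def __init__(self, feat_dim=768, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, frame_feats):                   # (M, feat_dim)
        h, _ = self.lstm(frame_feats.unsqueeze(0))    # (1, M, hidden)
        logits = self.mlp(h.squeeze(0)).squeeze(-1)   # (M,)
        return torch.sigmoid(logits)                  # selection probabilities

def reinforce_update(policy, optimizer, frame_feats, rewards, selections):
    # Group-wise REINFORCE step: `rewards` holds G scalar rewards and
    # `selections` holds G binary masks of shape (M,) over candidate frames.
    rewards = torch.tensor(rewards, dtype=torch.float32)
    advantages = rewards - rewards.mean()             # intra-group baseline
    probs = policy(frame_feats)
    loss = 0.0
    for adv, mask in zip(advantages, selections):
        # Independent-Bernoulli log-probability of the sampled subset.
        log_prob = (mask * torch.log(probs + 1e-8)
                    + (1 - mask) * torch.log(1 - probs + 1e-8)).sum()
        loss = loss - adv * log_prob                  # ascend E[A * log pi]
    optimizer.zero_grad()
    (loss / len(selections)).backward()
    optimizer.step()

With the group size G = 4 used in the training protocol of Section 4, each gradient step averages over four sampled subsets of the same candidate pool.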

3. Composite Reward Structure and CIB Alignment

ReaSon's reward function aligns with the CIB objective and decomposes into three terms:

$$R = R_{\text{ans}} + \lambda_1 R_{\text{cycle}} + \lambda_2 R_{\text{cf}}$$

  • Answer correctness ($R_{\text{ans}}$, proxy for $J_1$): an indicator of whether the VLM answers correctly from the selected frames,

$$R_{\text{ans}} = \mathbb{I}\bigl[\text{VLM}(s,q) = \text{ground truth}\bigr]$$

  • Cycle-consistency ($R_{\text{cycle}}$): intersection-over-union between the visual-semantic elements inferred from the question and from the answer, where $E_q$ is extracted by the VLM from uniformly sampled frames and $E_a$ from the VLM's answer,

$$R_{\text{cycle}} = \mathrm{IoU}(E_q, E_a)$$

  • Counterfactual KL ($R_{\text{cf}}$, proxy for $J_2$): KL divergence between the model's output distributions for the selected and counterfactual subsets,

$$R_{\text{cf}} = D_{\mathrm{KL}}\bigl(\mathrm{softmax}(o) \,\Vert\, \mathrm{softmax}(o')\bigr)$$

with the counterfactual $s'$ sampled from the inverted policy.

The weights $\lambda_1$ and $\lambda_2$ (both set to 0.5) balance semantic tightness and causal separation. This reward design ensures that both sufficiency and necessity are incentivized during training.
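
The sketch below assembles this composite reward from quantities assumed to be produced elsewhere (the VLM's answer string, the element sets E_q and E_a, and the answer-token logits for the selected and counterfactual subsets); the helper names are hypothetical.

import torch.nn.functional as F

def iou(elems_q, elems_a):
    # Intersection-over-union of two sets of visual-semantic elements.
    a, b = set(elems_q), set(elems_a)
    return len(a & b) / max(len(a | b), 1)

def composite_reward(pred_answer, gt_answer, elems_q, elems_a,
                     logits_sel, logits_cf, lam1=0.5, lam2=0.5):
    # R = R_ans + lam1 * R_cycle + lam2 * R_cf, following the decomposition above.
    r_ans = float(pred_answer == gt_answer)                  # answer correctness
    r_cycle = iou(elems_q, elems_a)                          # cycle-consistency IoU
    log_p = F.log_softmax(logits_sel, dim=-1)                # selected subset
    log_q = F.log_softmax(logits_cf, dim=-1)                 # counterfactual subset
    r_cf = F.kl_div(log_q, log_p, reduction="sum", log_target=True)
    return r_ans + lam1 * r_cycle + lam2 * float(r_cf)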

4. Policy Network and Training Protocol

Training follows the loop:

Input:
  – Dataset {(v^n, q^n)}_{n=1}^N
  – Frozen VLM, open-vocabulary detector, policy π_θ, λ₁ = λ₂ = 0.5, G = 4
for each (v, q) in the training set do
  1) Build candidate pool f ← Detector(VLM(UniformSample(v), q))
  2) for i = 1 to G do
       • Sample s_i ∼ π_θ(·|f, q)
       • Logits o_i ← VLM.logits(s_i, q)
       • Answer a_i ← VLM.generate(s_i, q)
       • Extract E_q via VLM(Uniform(v), q)
       • Extract E_{a,i} via VLM(a_i, q)
       • Sample counterfactual s'_i ∼ π̃(·|f, q), where π̃(f_j) ∝ 1 − π_θ(f_j)
       • Logits o'_i ← VLM.logits(s'_i, q)
       • Compute
          R_ans^i   = 𝟙[a_i = ground truth]
          R_cycle^i = IoU(E_q, E_{a,i})
          R_cf^i    = KL(softmax(o_i) || softmax(o'_i))
       • R_i = R_ans^i + λ₁ R_cycle^i + λ₂ R_cf^i
     end for
  3) Compute advantages Â_i = R_i − (1/G) Σ_j R_j
  4) Update θ via ∇_θ ≈ (1/G) Σ_i Â_i ∇_θ log π_θ(s_i|f, q)
end for
This protocol leverages a frozen VLM for all surrogates, uses open-vocab detectors to construct frame pools, and employs intra-group advantages for sample efficiency and variance reduction.
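
To illustrate the counterfactual step in this loop, the sketch below draws a K-frame subset from the learned per-frame probabilities and a counterfactual subset from the inverted policy π̃ ∝ 1 − π_θ, using multinomial sampling without replacement; this is one plausible instantiation of the top-K scheme from Section 2, not necessarily the paper's exact procedure.

import torch

def sample_subsets(frame_probs, k):
    # s ~ pi_theta: frames with high selection probability are favored.
    selected = torch.multinomial(frame_probs, k, replacement=False)
    # s' ~ pi_tilde, with pi_tilde(f_i) proportional to 1 - pi_theta(f_i).
    inverted = (1.0 - frame_probs).clamp(min=1e-8)
    counterfactual = torch.multinomial(inverted, k, replacement=False)
    return selected, counterfactual

# Example with M = 6 candidate frames and a budget of K = 3.
probs = torch.tensor([0.90, 0.10, 0.70, 0.20, 0.80, 0.05])
sel_idx, cf_idx = sample_subsets(probs, k=3)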

5. Empirical Evaluation and Generalization

ReaSon's empirical validation spans three video QA benchmarks and two frame budgets ($K = 8$ and $K = 32$):

  • NExT-QA: Short videos (analytic/temporal/causal queries).
  • EgoSchema: 3-minute egocentric videos.
  • Video-MME: Hour-scale videos.

Key results for $K = 8$ frames:

Dataset (Model)              SOTA (AKEYS)    ReaSon (best)
NExT-QA (GPT-4o)             78.1%           77.6%
NExT-QA (LLaVA-Video-7B)     –               81.4%
NExT-QA (Qwen2.5-VL-7B)      –               80.4%
EgoSchema (GPT-4o)           68.6%           72.2%

On Video-MME ($K = 8$, GPT-4o), the baseline scores 53.8% while ReaSon achieves 59.1%; gains are most pronounced on short clips.

For $K = 32$, ReaSon with GPT-4o achieves 66.4% versus a 61.8% baseline.

Ablation studies confirm:

  • Answer only ($R_{\text{ans}}$): 80.1% / 66.0%.
  • + Cycle-consistency ($R_{\text{ans}} + R_{\text{cycle}}$): 80.5% / 68.2%.
  • + Counterfactual ($R_{\text{ans}} + R_{\text{cycle}} + R_{\text{cf}}$): 81.4% / 69.0%.

These outcomes substantiate that integrating both sufficiency and necessity rewards is essential for peak performance.

Across all experiments and VLMs, ReaSon enhances performance by 1–5 points, confirming broad generalization and robustness under token constraints.

6. Significance and Conceptual Advances

ReaSon establishes a principled, tractable implementation of the Causal Information Bottleneck for keyframe selection in video understanding. It bridges information-theoretic sufficiency and causal necessity with RL and modern VLMs, using composite rewards and counterfactual interventions. The framework’s generalization to varying VLM backbones, clip lengths, and question types demonstrates its utility under strict input budgets, making it a reference approach for future video-VLM integration under realistic deployment constraints (Zhou et al., 16 Nov 2025).

References (1)
