ReaSon: Causal Keyframe Selection
- The paper’s main contribution is the formulation of keyframe selection as a composite optimization that integrates predictive sufficiency with causal necessity using a novel Causal Information Bottleneck.
- ReaSon employs reinforcement learning with group-wise REINFORCE to optimize a composite reward that combines answer correctness, cycle-consistency, and counterfactual divergence.
- Empirical evaluations on multiple video QA benchmarks demonstrate state-of-the-art performance and robust generalization under strict frame budgets.
Reinforced Causal Search with Information Bottleneck (ReaSon) is a framework for keyframe selection in video understanding tasks with vision-language models (VLMs) under input-token and efficiency constraints. It formalizes keyframe selection as a composite optimization problem that couples a novel Causal Information Bottleneck with reinforcement learning of a selection policy and counterfactual reasoning. ReaSon explicitly defines keyframes not only by informativeness (predictive sufficiency) but also by causal necessity, using tractable surrogates to optimize both criteria. Extensive benchmarks demonstrate state-of-the-art performance and improved generalization under stringent frame budgets (Zhou et al., 16 Nov 2025).
1. Causal Information Bottleneck Formalism
Traditional Information Bottleneck (IB) approaches select keyframes $S$ to maximize mutual information with the VLM output $Y$ while compressing the input video-question pair $X = (V, Q)$:

$$\max_{S} \; I(S; Y) - \beta\, I(S; X),$$

where sufficiency is expressed via $I(S; Y)$. ReaSon introduces the Causal Information Bottleneck (CIB) to account for causal necessity, adding an interventional mutual-information term:

$$\max_{S} \; I(S; Y) - \beta\, I(S; X) + \gamma\, I\big(Y; \mathrm{do}(S)\big),$$

where $I(Y; \mathrm{do}(S))$ quantifies the effect of interventions over the selected frames (using Pearl's do-operator).
Because both terms are intractable in realistic VLMs, ReaSon employs:
- Sufficiency surrogate: $I(S; Y)$ is approximated by the likelihood of the correct answer given the selected frames, $\mathbb{E}_{s \sim \pi_\theta}\big[\log p_\phi(y \mid s, q)\big]$, with $p_\phi$ approximated via a frozen VLM.
- Necessity surrogate: $I(Y; \mathrm{do}(S))$ is approximated by the divergence between predictions from the selected and counterfactual subsets, $\mathbb{E}\big[D_{\mathrm{KL}}\big(p_\phi(\cdot \mid s, q) \,\|\, p_\phi(\cdot \mid s', q)\big)\big]$, where the counterfactual $s'$ is sampled from the inverted policy $\bar{\pi}_\theta$.
The final practical formulation is

$$\max_{\theta} \; \mathbb{E}_{s \sim \pi_\theta}\Big[\log p_\phi(y \mid s, q) \;+\; \lambda\, D_{\mathrm{KL}}\big(p_\phi(\cdot \mid s, q) \,\|\, p_\phi(\cdot \mid s', q)\big)\Big].$$

This objective operationalizes both predictive sufficiency and causal necessity in a tractable RL setting (Zhou et al., 16 Nov 2025).
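The inverted-policy counterfactual can be made concrete with a small sketch. The version below assumes the inversion simply places probability proportional to $1 - p$ on each candidate frame and renormalizes; this construction and the function name `sample_counterfactual` are illustrative, not taken from the paper.

```python
import numpy as np

def sample_counterfactual(select_probs, k, rng=None):
    """Draw a counterfactual frame subset from an 'inverted' policy.

    Assumption (illustrative): the inverted policy puts mass proportional
    to 1 - p on each frame, so frames the learned policy considers
    unimportant become the most likely counterfactual picks.
    """
    rng = rng or np.random.default_rng()
    p = np.asarray(select_probs, dtype=float)
    inverted = (1.0 - p) / (1.0 - p).sum()   # renormalize to a distribution
    # k distinct frame indices, sampled without replacement
    return rng.choice(len(p), size=k, replace=False, p=inverted)

# Example: the policy strongly prefers frames 0 and 1, so the
# counterfactual subset tends to fall on frames 2-5 instead.
print(sample_counterfactual([0.9, 0.8, 0.1, 0.05, 0.1, 0.05], k=2))
```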
2. Reinforcement-Learning Formulation for Keyframe Selection
ReaSon casts keyframe selection as a stateless, one-step Markov decision process:
- State: The candidate pool $f$ (frames filtered by visual relevance and open-vocabulary detectors) paired with the question $q$.
- Action: Selection of a subset $s \subseteq f$ with $|s| = K$, where sampling uses independent Bernoulli or top-$K$ multinomial policies.
- Transition: One step; the episode terminates with an immediate reward.
- Policy network: Each candidate frame is encoded via a frozen BLIP encoder; the embeddings are aggregated by a 3-layer LSTM, and an MLP produces per-frame logits, which are normalized to selection probabilities.
- RL algorithm: Group-wise REINFORCE (GPG) samples $G$ subsets $s_1, \dots, s_G$ and one counterfactual per subset for each instance. Per-subset rewards $R_i$ are used to compute intra-group advantages
$$\hat{A}_i = R_i - \frac{1}{G} \sum_{j=1}^{G} R_j.$$
Parameters are updated by
$$\nabla_\theta J(\theta) \approx \frac{1}{G} \sum_{i=1}^{G} \hat{A}_i \, \nabla_\theta \log \pi_\theta(s_i \mid f, q).$$
This RL paradigm enables direct optimization of the composite reward derived from CIB surrogates (Zhou et al., 16 Nov 2025).
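The encoder-aggregator-scorer pipeline and the top-$K$ sampler described above can be sketched as follows. Hidden sizes, the softmax normalization, and treating the without-replacement draw as a sum of independent log-probabilities are illustrative assumptions; frame embeddings are assumed to be precomputed by a frozen BLIP encoder.

```python
import torch
import torch.nn as nn

class KeyframePolicy(nn.Module):
    """Sketch of the selection policy: frozen-BLIP frame embeddings are
    aggregated by a 3-layer LSTM and scored per frame by an MLP.
    Dimensions are illustrative assumptions."""

    def __init__(self, embed_dim=768, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, frame_embeds):          # (B, N, embed_dim) from frozen BLIP
        h, _ = self.lstm(frame_embeds)        # (B, N, hidden_dim)
        return self.mlp(h).squeeze(-1)        # (B, N) per-frame logits


def sample_topk(logits, k):
    """Top-K multinomial sampling: draw K distinct frames and return their
    indices plus a summed log-probability, used as log pi_theta(s|f,q)
    in REINFORCE (an approximation for without-replacement draws)."""
    probs = torch.softmax(logits, dim=-1)                              # (B, N)
    idx = torch.multinomial(probs, num_samples=k, replacement=False)   # (B, K)
    log_prob = torch.log(probs.gather(-1, idx) + 1e-8).sum(dim=-1)     # (B,)
    return idx, log_prob
```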
3. Composite Reward Structure and CIB Alignment
ReaSon's reward function aligns term-by-term with the CIB objective, decomposing into three components:
- Answer correctness ($R_{\mathrm{ans}}$, proxy for the sufficiency term): an indicator of whether the frozen VLM answers correctly from the selected frames.
- Cycle-consistency ($R_{\mathrm{cycle}}$): intersection-over-union between the visual-semantic element sets inferred from the question ($E_q$, extracted by the VLM from uniformly sampled frames) and from the generated answer ($E_a$).
- Counterfactual KL ($R_{\mathrm{cf}}$, proxy for the necessity term): KL divergence between the VLM output distributions obtained from the selected subset and from a counterfactual subset sampled from the inverted policy.

The weights $\lambda_1$ and $\lambda_2$ (both set to 0.5) balance semantic tightness and causal separation, ensuring that both sufficiency and necessity are incentivized during training.
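A minimal sketch of this composite reward follows, assuming stand-in interfaces for the frozen VLM's answer logits and element extraction; the argument names are illustrative, not the paper's API.

```python
import torch
import torch.nn.functional as F

def composite_reward(pred_answer, gt_answer, q_elements, a_elements,
                     logits_sel, logits_cf, lam1=0.5, lam2=0.5):
    """Composite reward R = R_ans + lam1 * R_cycle + lam2 * R_cf.
    Element lists and answer-logit tensors are assumed to come from the
    frozen VLM; names are illustrative."""
    # R_ans: indicator of answer correctness on the selected frames
    r_ans = float(pred_answer == gt_answer)

    # R_cycle: IoU between element sets inferred from question and answer
    q_set, a_set = set(q_elements), set(a_elements)
    r_cycle = len(q_set & a_set) / max(len(q_set | a_set), 1)

    # R_cf: KL(softmax(o_sel) || softmax(o_cf)).  F.kl_div(input, target)
    # computes sum target * (log target - input), so pass log p_cf as the
    # input and p_sel as the target to obtain KL(p_sel || p_cf).
    p_sel = F.softmax(logits_sel, dim=-1)
    log_p_cf = F.log_softmax(logits_cf, dim=-1)
    r_cf = F.kl_div(log_p_cf, p_sel, reduction="sum").item()

    return r_ans + lam1 * r_cycle + lam2 * r_cf
```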
4. Policy Network and Training Protocol
Training follows the loop:
```
Input:
  – Dataset {(v^n, q^n)}_{n=1}^N
  – Frozen VLM, open-vocabulary detector, policy π_θ, λ₁ = λ₂ = 0.5, G = 4

for each (v, q) in the training set do
  1) Build candidate pool f ← Detector(VLM(UniformSample(v), q))
  2) for i = 1 to G do
       • Sample s_i ∼ π_θ(·|f, q)
       • Logits o_i ← VLM.logits(s_i, q)
       • Answer a_i ← VLM.generate(s_i, q)
       • Extract E_q via VLM(Uniform(v), q)
       • Extract E_{a,i} via VLM(a_i, q)
       • Sample counterfactual s'_i ∼ π̄_θ(·|f, q)      # inverted policy
       • Logits o'_i ← VLM.logits(s'_i, q)
       • Compute
           R_{ans}^i   = 𝟙[a_i = gt]
           R_{cycle}^i = IoU(E_q, E_{a,i})
           R_{cf}^i    = KL(softmax(o_i) || softmax(o'_i))
       • R_i = R_{ans}^i + λ₁·R_{cycle}^i + λ₂·R_{cf}^i
     end for
  3) Compute advantages Â_i = R_i − mean_j R_j
  4) Update θ via ∇_θ ≈ (1/G) Σ_i Â_i ∇_θ log π_θ(s_i|f, q)
end for
```
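For readers who prefer framework code, steps 3-4 of the loop can be sketched in PyTorch as below; the tensor interfaces and the surrounding optimizer setup are assumptions, not the paper's implementation.

```python
import torch

def gpg_step(optimizer, log_probs, rewards):
    """One group-wise REINFORCE (GPG) update for a single (v, q) instance.

    log_probs: length-G tensor of log pi_theta(s_i | f, q), carrying grads.
    rewards:   length-G sequence of composite rewards R_i.
    The mean-baseline advantage and sign convention mirror steps 3-4 above;
    the optimizer choice is an assumption."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    advantages = rewards - rewards.mean()        # A_hat_i = R_i - mean_j R_j
    loss = -(advantages * log_probs).mean()      # ascend (1/G) Σ A_hat_i ∇ log π
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```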
5. Empirical Evaluation and Generalization
ReaSon's empirical validation spans three video QA benchmarks under tight frame budgets (e.g., $K = 8$ and $K = 32$):
- NExT-QA: Short videos (analytic/temporal/causal queries).
- EgoSchema: 3-minute egocentric videos.
- Video-MME: Hour-scale videos.
Key results (accuracy):
| Dataset (Model) | SOTA (AKEYS) | ReaSon (best) |
|---|---|---|
| NExT-QA (GPT-4o) | 78.1% | 77.6% |
| NExT-QA (LLaVA-Video-7B) | - | 81.4% |
| NExT-QA (Qwen2.5-VL-7B) | - | 80.4% |
| EgoSchema (GPT-4o) | 68.6% | 72.2% |
On Video-MME with GPT-4o, ReaSon reaches 59.1% at $K = 8$ versus a 53.8% baseline, and 66.4% at $K = 32$ versus 61.8%; gains are most pronounced on short clips.
Ablation studies confirm the contribution of each reward term:
- Answer-correctness reward only ($R_{\mathrm{ans}}$): 80.1%/66.0%.
- With cycle-consistency ($R_{\mathrm{cycle}}$): 80.5%/68.2%.
- With counterfactual KL ($R_{\mathrm{cf}}$): 81.4%/69.0%.

These outcomes substantiate that integrating both sufficiency and necessity rewards is essential for peak performance.
Across all experiments and VLMs, ReaSon enhances performance by 1–5 points, confirming broad generalization and robustness under token constraints.
6. Significance and Conceptual Advances
ReaSon establishes a principled, tractable implementation of the Causal Information Bottleneck for keyframe selection in video understanding. It bridges information-theoretic sufficiency and causal necessity with RL and modern VLMs, using composite rewards and counterfactual interventions. The framework’s generalization to varying VLM backbones, clip lengths, and question types demonstrates its utility under strict input budgets, making it a reference approach for future video-VLM integration under realistic deployment constraints (Zhou et al., 16 Nov 2025).