Causality-Aware Temporal Projector (CATP)
- CATP is a video understanding module applying temporal constraints to enhance causality perception.
- Standard bidirectional projectors let future frames influence earlier ones; CATP prevents this with a block-causal attention mask.
- CATP offers measurable gains in benchmark accuracy, enhancing temporal and causal reasoning capabilities.
The Causality-Aware Temporal Projector (CATP) is a transformer-based module for video understanding in Video Large Language Models (Video-LLMs) that require precise modeling of temporal ordering and causality. Proposed as part of the V-CORE framework, CATP explicitly constrains temporal information flow, addressing a limitation of the standard bidirectional attention used in existing video-language architectures. It is parameter-efficient, keeps intra-frame spatial interactions fully intact, and provides measurable improvements on benchmarks requiring temporal and causal reasoning (Kang et al., 5 Jan 2026).
1. Motivation and Challenges in Temporal Video Modeling
Many Video-LLMs utilize bidirectional transformer projectors to capture inter-frame dependencies. While effective for general multimodal reasoning, these projectors allow future frame features to influence earlier event representations, undermining the “arrow of time” crucial for causal questions, e.g., “what happened first?” or “which event caused X?”. This bidirectional leakage leads to ambiguity in tasks where temporal order and causal attribution are essential, as exemplified by failures on causal and temporal components of the NExT-QA benchmark. CATP directly addresses this by enforcing unidirectional, frame-by-frame aggregation:
- Full bidirectional processing is preserved within each frame to maintain spatial token interactions.
- A block-causal attention mask ensures that tokens from frame $t$ can attend only to tokens from the same or earlier frames ($t' \le t$), explicitly preventing backward leakage of information.
- A learnable terminal summary token, appended to the sequence, acts exclusively as a “causal sink,” aggregating global context without the ability to overwrite past frame representations.
2. Integration of CATP Within the V-CORE Framework
CATP is one of two central modules within the V-CORE architecture; the other is Learnable Spatial Aggregation (LSA). The workflow is as follows:
- Frame Encoding: Each input video frame is encoded by a frozen ViT-L/14, yielding a set of patch embeddings per frame.
- Learnable Spatial Aggregation (LSA): Each frame's patch embeddings are pooled to a small set of salient tokens via learnable queries, significantly reducing patch redundancy.
- Sequence Formation: Flatten all frames’ tokens and append the learnable summary token, forming a single token sequence.
- Spatio-temporal Positional Encoding: Joint positional embeddings are added so each token encodes its global frame index and intra-frame position.
- CATP Transformer: The resulting input is processed by stacked CATP transformer layers, embedding explicit temporal constraints.
- Projection and LLM Input: The output tokens are projected into the LLM embedding space and prepended to the textual instruction for subsequent language generation or question answering.
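The data flow in the steps above can be illustrated with a shape-level sketch. It is only a minimal illustration under stated assumptions: LSA is modeled as single-head cross-attention with the learnable queries, the positional encoding is taken to be an additive frame embedding plus an intra-frame slot embedding, the summary token receives no positional embedding and is given the frame index one past the last frame, and all names (`lsa_pool`, `build_sequence`) and sizes are hypothetical rather than taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lsa_pool(patches, queries):
    """Pool (N, d) patch embeddings into (K, d) tokens.

    Assumption: single-head cross-attention, learnable queries as Q,
    patches as both keys and values.
    """
    d = patches.shape[-1]
    scores = queries @ patches.T / np.sqrt(d)   # (K, N)
    return softmax(scores, axis=-1) @ patches   # (K, d)

def build_sequence(frames, queries, summary, frame_pos, token_pos):
    """frames: (T, N, d). Returns tokens (T*K + 1, d) and frame ids (T*K + 1,)."""
    T, K = frames.shape[0], queries.shape[0]
    pooled = np.stack([lsa_pool(f, queries) for f in frames])          # (T, K, d)
    # Assumed additive spatio-temporal encoding: frame index + intra-frame slot.
    pooled = pooled + frame_pos[:T, None, :] + token_pos[None, :K, :]
    tokens = np.concatenate([pooled.reshape(T * K, -1), summary[None, :]], axis=0)
    # Summary token gets frame id T, i.e. one past the last video frame (assumption).
    frame_ids = np.concatenate([np.repeat(np.arange(T), K), [T]])
    return tokens, frame_ids

rng = np.random.default_rng(0)
T, N, K, d = 4, 16, 3, 8          # toy sizes, not the paper's
frames = rng.normal(size=(T, N, d))
queries = rng.normal(size=(K, d))
summary = rng.normal(size=d)
frame_pos = rng.normal(size=(T, d))
token_pos = rng.normal(size=(K, d))

tokens, frame_ids = build_sequence(frames, queries, summary, frame_pos, token_pos)
print(tokens.shape, frame_ids.tolist())
```

The `frame_ids` vector is the only extra bookkeeping the later masking step needs: it records which frame each flattened token came from.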
3. Block-Causal Attention and Summary Token Dynamics
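The core of CATP is its spatio-temporal block-causal masking within the transformer self-attention. Writing $f(i)$ for the frame index of token $i$, the block-causal mask $M$ is constructed such that query token $i$ may attend to key token $j$ only when $f(j) \le f(i)$. One standard way to write this is as an additive mask on the attention logits:
$$M_{ij} = \begin{cases} 0, & f(j) \le f(i) \\ -\infty, & \text{otherwise} \end{cases}$$
This configuration yields these properties:
- Tokens within the same frame (indices $i, j$ such that $f(i) = f(j)$) have unconstrained attention.
- Tokens in any frame cannot attend to future frame tokens, strictly enforcing the forward flow of temporal information.
The learnable summary token, appended at the end and assigned a frame index greater than that of every video frame, can read from all preceding frames but cannot influence them, embodying the role of a “causal anchor” for final video-level context aggregation.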
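The mask described in this section can be built from per-token frame indices alone. The sketch below is a minimal illustration, not the authors' implementation; the function name `block_causal_mask` is hypothetical, and the summary token is assumed to take the frame index one past the last video frame.

```python
import numpy as np

def block_causal_mask(num_frames, tokens_per_frame):
    """Boolean mask where allowed[i, j] is True if query i may attend to key j.

    The last row/column belongs to the summary token, whose frame index is
    assumed to be num_frames (later than every video frame).
    """
    frame_ids = np.concatenate(
        [np.repeat(np.arange(num_frames), tokens_per_frame), [num_frames]]
    )
    # Key frame must not be later than query frame.
    return frame_ids[None, :] <= frame_ids[:, None]

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.astype(int))
```

Each frame forms a fully connected diagonal block, everything above the block diagonal is masked, and the final row (the summary token) is entirely unmasked while the final column is visible to the summary token only.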
The CATP transformer update for each layer takes the form:
For the summary token, this specializes to:
ensuring the summary is a strictly causal aggregation of prior frames.
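The causal property can be checked numerically. The following is a simplified sketch, assuming a single attention head with a residual connection and omitting layer normalization and the feedforward block, so it is not the exact CATP layer; `masked_self_attention` and the toy sizes are hypothetical. It perturbs only the last video frame and compares outputs before and after.

```python
import numpy as np

def masked_self_attention(x, allowed, wq, wk, wv):
    """Single-head self-attention with a boolean mask and a residual connection."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allowed, scores, -np.inf)        # block disallowed keys
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                           # masked entries become 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return x + weights @ v

rng = np.random.default_rng(0)
T, K, d = 3, 2, 4                                      # toy sizes
frame_ids = np.concatenate([np.repeat(np.arange(T), K), [T]])
allowed = frame_ids[None, :] <= frame_ids[:, None]     # block-causal mask
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

x = rng.normal(size=(T * K + 1, d))
out = masked_self_attention(x, allowed, wq, wk, wv)

# Perturb only the tokens of the last video frame.
x2 = x.copy()
x2[(T - 1) * K : T * K] += 1.0
out2 = masked_self_attention(x2, allowed, wq, wk, wv)

earlier_unchanged = np.allclose(out[: (T - 1) * K], out2[: (T - 1) * K])
summary_changed = not np.allclose(out[-1], out2[-1])
print(earlier_unchanged, summary_changed)
```

Representations of earlier frames are unaffected by the change to the later frame, whereas the summary token, which reads from every frame, does change. With a bidirectional (all-True) mask the first check would fail.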
4. Parameter and Computational Efficiency
CATP achieves high efficiency with minimal trainable parameters:
- LSA parameters: for (M for ).
- CATP transformer: For layers: each layer has (each ), yielding M parameters per layer. The feedforward network contributes M parameters.
- End-to-end: Total added parameters for LSA+CATP are M.
- LLM adaptation: The backbone LLM (Vicuna-7B) is frozen. Adaptation uses 4-bit QLoRA for and in attention layers, LoRA rank , scaling , resulting in an additional M low-rank, 4-bit parameters.
- Training: The full trainable parameter count (M) allows single-GPU (24GB RAM) end-to-end training.
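The kind of arithmetic behind such parameter counts can be sketched as follows. This is a generic back-of-the-envelope calculation, not a reproduction of the paper's figures: it assumes four square attention projections per layer, a two-matrix feedforward block, no biases, and two adapted square projections per LLM attention layer. The projector width, feedforward width, layer count, and LoRA rank below are hypothetical placeholders; only the hidden size (4096) and depth (32 layers) of a 7B LLaMA-family model such as Vicuna-7B are known values.

```python
def attention_params(d_model):
    """Four d x d projections (query, key, value, output); biases ignored."""
    return 4 * d_model * d_model

def ffn_params(d_model, d_ff):
    """Two linear maps, d_model -> d_ff -> d_model; biases ignored."""
    return 2 * d_model * d_ff

def lora_params(d_in, d_out, rank):
    """A rank-r LoRA adapter adds matrices of shape (d_in, r) and (r, d_out)."""
    return rank * (d_in + d_out)

# Hypothetical projector configuration (not from the source).
d_model, d_ff, num_layers = 1024, 4096, 2
projector = num_layers * (attention_params(d_model) + ffn_params(d_model, d_ff))

# 7B LLaMA-family backbone: hidden size 4096, 32 layers; rank is hypothetical.
llm_hidden, llm_layers, rank = 4096, 32, 16
adapted_per_layer = 2  # assumed: two adapted projections per attention layer
lora = llm_layers * adapted_per_layer * lora_params(llm_hidden, llm_hidden, rank)

print(f"projector: {projector / 1e6:.1f}M, LoRA: {lora / 1e6:.1f}M")
```

Because the LoRA term grows linearly in the rank while a full projection grows quadratically in the hidden size, the adapter count stays small relative to the frozen backbone.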
5. Experimental Evaluation and Ablation Results
CATP, within V-CORE, demonstrates empirical improvements on established VideoQA benchmarks, particularly for tasks requiring temporal and causal inference.
Benchmark Performance
- NExT-QA: V-CORE achieves 61.2% accuracy, surpassing a bidirectional transformer baseline at 58.0%.
- Temporal reasoning: Absolute gain of +3.5 percentage points on the relevant sub-tasks.
- Causal reasoning: Absolute gain of +5.2 percentage points over the bidirectional baseline.
Ablation Study
| Configuration | NExT-QA Accuracy (%) |
|---|---|
| MeanPool spatial + linear temporal | 53.8 |
| + LSA (Learnable Spatial Aggregation) | 55.6 |
| + Bidirectional Transformer | 58.0 |
| + Block-causal CATP | 60.2 |
| + Terminal Dynamic Summary | 61.2 |
Each component yields a progressive improvement: LSA, the transformer projector, the shift to block-causal attention, and finally the terminal dynamic summary. These results support the claim that unidirectional, frame-wise attention and a causal-sink summary token improve the coherence of temporal and causal reasoning.
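The per-component contributions can be read off the ablation table by differencing consecutive rows. The short script below does only that arithmetic on the reported accuracies; the variable names are illustrative.

```python
# NExT-QA accuracies (%) from the ablation table, in cumulative order.
ablation = [
    ("MeanPool spatial + linear temporal", 53.8),
    ("+ LSA (Learnable Spatial Aggregation)", 55.6),
    ("+ Bidirectional Transformer", 58.0),
    ("+ Block-causal CATP", 60.2),
    ("+ Terminal Dynamic Summary", 61.2),
]

# Gain of each row over the row before it, in percentage points.
gains = [
    (name, round(acc - prev_acc, 1))
    for (_, prev_acc), (name, acc) in zip(ablation, ablation[1:])
]
total_gain = round(ablation[-1][1] - ablation[0][1], 1)

for name, gain in gains:
    print(f"{name}: +{gain:.1f}")
print(f"total: +{total_gain:.1f}")
```

Replacing bidirectional attention with the block-causal mask and adding the summary token together account for 3.2 of the 7.4 percentage points gained over the mean-pooling baseline.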
6. Significance and Implications
CATP corrects a fundamental shortcoming of naive bidirectional attention in Video-LLMs with respect to temporal and causal video understanding. By architecturally enforcing the arrow of time and aggregating temporally ordered information, CATP provides a robust computational substrate for downstream LLMs to answer questions dependent on event chronology and causality. The approach is parameter-efficient, scalable to standard hardware, and preserves performance on tasks that are not exclusively temporal or causal in nature.
A plausible implication is that similar block-causal masking and summary-token strategies could be beneficial in other sequential multimodal reasoning domains where temporal leakage is detrimental to interpretability or task correctness.
7. Related Methodologies and Architectural Context
CATP builds upon and differentiates itself from standard transformer attention, traditional causal masks (which fail to preserve intra-frame spatial modeling when applied naively), and prior work collapsing video frames via mean pooling or linear projection. The explicit block-causal design allows unconstrained spatial reasoning while preserving the temporal unidirectionality essential for fine-grained event attribution. Its integration with parameter-efficient LLM adaptation techniques, such as QLoRA, further aligns it with contemporary trends towards resource-efficient large model adaptation and inference (Kang et al., 5 Jan 2026).