Video-CCAM: Temporal Video-Language Model

Updated 20 March 2026
  • Video-CCAM is a multi-modal large language model architecture that integrates a causal cross-attention mechanism to enforce temporal consistency in video analysis.
  • It utilizes a frozen ViT-derived visual encoder, a CCAM projector with lower-triangular masks, and LLM adapters to efficiently process video streams.
  • The model achieves state-of-the-art performance on video question answering, captioning, and long-video understanding benchmarks while scaling to variable-length inputs.

Video-CCAM is a multi-modal LLM (Video-MLLM) architecture that advances video-language understanding by integrating a causal cross-attention mechanism into the interface between frame-wise visual encoders and LLMs. Designed to efficiently handle the increased visual token load from video input—including short and long video streams—Video-CCAM enables scalable modeling over a broad temporal range, avoiding the traditional pitfalls of feature downsampling or excessive context expansion. Its core innovation is the introduction of explicit, lower-triangular causal cross-attention masks (CCAMs) designed to enforce temporal consistency in video summary representations processed by LLMs. Video-CCAM achieves state-of-the-art performance across standard benchmarks for video question answering, captioning, and long-video understanding, and is available as an open-source codebase (Fei et al., 2024).

1. Model Architecture

Video-CCAM adopts a modular MLLM pipeline, consisting of three principal components: (a) a frozen visual encoder, (b) an intermediate projector realized as a causal cross-attention block with learnable queries, and (c) a frozen, optionally LoRA-adapted, LLM.

  • Visual Encoder: The standard backbone is a ViT-derived image encoder (SigLIP-SO400M in reference experiments), applied independently to each video frame. For each input frame $j$, the encoder emits a patch sequence $x_j \in \mathbb{R}^{L \times C'}$, where $L = H \times W$ is the number of spatial patches per frame.
  • CCAM Projector: The intermediate module consists of a single cross-attention layer (with a feed-forward network), which exposes a fixed bank of $N$ learnable query vectors $\{Q_i\}_{i=0}^{N-1}$, each $Q_i \in \mathbb{R}^{1 \times C}$. Each query aggregates visual embeddings from all frames up to a designated temporal cut-off, controlled by the causal mask. The number of output tokens is fixed at $N$ for any input video, which keeps the visual input within downstream LLM context limits.
  • LLM Adapter: The LLM (e.g., Phi-3 4B, Yi-1.5 9B, Phi-3 14B) receives the $N$ projected query embeddings prepended to any text prompt and handles autoregressive generation.

Data flow proceeds as: frames $\rightarrow$ visual encoder $\rightarrow \{x_j\} \rightarrow$ CCAM projector $\rightarrow \{y_i\} \rightarrow$ LLM input tokens $\rightarrow$ textual output (Fei et al., 2024).
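The data flow above can be traced at the level of tensor shapes. The sketch below is only a shape walkthrough under stated assumptions: `encode_frames` and `ccam_project` are hypothetical stand-ins that return random arrays of the right shape (no real encoder or attention is run), and all sizes are toy values rather than those of the reference models.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_frames(frames, patches_per_frame, enc_dim):
    """Stand-in for the frozen ViT encoder: (T, H, W, 3) -> (T, L, C')."""
    num_frames = frames.shape[0]
    return rng.standard_normal((num_frames, patches_per_frame, enc_dim))

def ccam_project(visual_tokens, num_queries, llm_dim):
    """Stand-in for the CCAM projector: (T, L, C') -> (N, C).
    Only the output shape is modelled; no attention is computed here."""
    return rng.standard_normal((num_queries, llm_dim))

def build_llm_input(frames, text_embeds, num_queries, patches_per_frame, enc_dim):
    x = encode_frames(frames, patches_per_frame, enc_dim)    # {x_j}
    y = ccam_project(x, num_queries, text_embeds.shape[1])   # {y_i}
    return np.concatenate([y, text_embeds], axis=0)          # visual tokens first

text = rng.standard_normal((12, 32))          # 12 prompt tokens, C = 32 (toy sizes)
for num_frames in (16, 96):
    frames = np.zeros((num_frames, 8, 8, 3))  # toy 8x8 RGB frames
    tokens = build_llm_input(frames, text, num_queries=64,
                             patches_per_frame=4, enc_dim=16)
    print(num_frames, tokens.shape)           # shape is (76, 32) for both frame counts
```

The point of the walkthrough is that the LLM input length depends on the number of queries and the prompt length, not on the number of frames.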

2. Causal Cross-Attention Mask Design

A key innovation in Video-CCAM is the deployment of lower-triangular block masks within the cross-attention layer. In contrast to standard (temporally flat) cross-attention mechanisms, which collapse temporal ordering and allow each query full access to all frame embeddings, CCAMs enforce temporal causality. Explicitly, queries at later temporal positions only attend to frames up to their designated point.

Given $T$ video frames and $N$ learnable queries, the mask $M \in \{0,1\}^{N \times T}$ is defined, for $0 \le i < N$ and $0 \le j < T$, so that

$$M_{ij} = \begin{cases} 1, & i \geq j \cdot \lfloor N/T \rfloor \\ 0, & \text{otherwise} \end{cases}$$

With this mask,

$$y_i = \frac{ \sum_{j=0}^{T-1} M_{ij} \exp\left(Q_i K(x_j)^\top\right) V(x_j) }{ \sum_{j=0}^{T-1} M_{ij} \exp\left(Q_i K(x_j)^\top\right) \mathbf{1}_L }$$

By this construction, queries $0, \dots, k$ summarize only visual context up to frame $\lfloor k \cdot T/N \rfloor$.
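The mask and the masked aggregation can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, $K(x_j)$ and $V(x_j)$ are passed in as precomputed arrays, the sketch assumes $N \geq T$ so that $\lfloor N/T \rfloor \geq 1$, and, following the formula as written, no $1/\sqrt{C}$ scaling or multi-head split is applied.

```python
import numpy as np

def causal_cross_attention_mask(num_queries, num_frames):
    """Return M in {0,1}^(N x T) with M[i, j] = 1 iff i >= j * floor(N / T)."""
    stride = num_queries // num_frames  # floor(N / T)
    if stride == 0:
        raise ValueError("this sketch assumes num_queries >= num_frames")
    i = np.arange(num_queries)[:, None]
    j = np.arange(num_frames)[None, :]
    return (i >= j * stride).astype(np.int64)

def masked_cross_attention(Q, K, V, mask):
    """Q: (N, C); K, V: (T, L, C); mask: (N, T). Returns y: (N, C).

    Each query takes a softmax-weighted average over the patches of the
    frames its mask row allows; masked frames get zero weight.
    """
    num_queries, dim = Q.shape
    num_frames, num_patches, _ = K.shape
    scores = np.einsum("nc,tlc->ntl", Q, K)                # Q_i K(x_j)^T
    scores = np.where(mask[:, :, None] == 1, scores, -np.inf)
    scores = scores.reshape(num_queries, num_frames * num_patches)
    scores -= scores.max(axis=1, keepdims=True)            # numerical stability
    weights = np.exp(scores)                               # exp(-inf) = 0 for masked patches
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V.reshape(num_frames * num_patches, dim)

mask = causal_cross_attention_mask(num_queries=8, num_frames=4)
print(mask.sum(axis=1))  # frames visible to each query: [1 1 2 2 3 3 4 4]
```

Every query can see frame 0 (since $i \geq 0$ always holds), so no row of the mask is empty and the normalization is always well defined.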

The continuous-view analysis interprets the video as a time-indexed signal $x(\tau)$. Query $i$'s output over $[0, (i+1)T/N]$ is

$$y_i = \frac{ \int_0^{(i+1)T/N} e^{Q_i K(x(\tau))^\top} V(x(\tau)) \, d\tau }{ \int_0^{(i+1)T/N} e^{Q_i K(x(\tau))^\top} \, d\tau }$$

This formalism underpins the model's temporal consistency guarantee (Fei et al., 2024).

3. Training Paradigm

Training proceeds in two stages, both optimized with the autoregressive next-token objective:

  • Feature Alignment (Stage 1): The visual encoder and LLM are frozen; only the CCAM projector is trained. Training runs for one epoch on 558K image-text pairs (LCS-558K), with no video sequences at this stage.
  • Visual Instruction Tuning (Stage 2): The projector and LoRA adapters (on both the visual encoder and the LLM) are trained on approximately 4.4M samples that mix images with 16-frame video clips. The corpus includes video–text data from VideoChat2 and LLaVA-Hound, and question-answer/caption data from EgoTaskQA, PerceptionTestQA, ActivityNetQA, and STAR, among others. Short-form answers are extended into fuller natural language targets by GPT-4 or Gemini. Optimization uses the standard next-token autoregressive loss; no contrastive or distance-based loss is introduced.

No retraining is necessary for adapting to longer input sequences, as the causal mask pattern generalizes to arbitrary $T$ (Fei et al., 2024).
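The two-stage recipe can be summarized as a table of which modules are updated, together with the next-token loss used in both stages. This is a hedged sketch: the `STAGES` dictionary and `next_token_loss` are hypothetical names, and the `loss_mask` argument (restricting the loss to chosen target tokens, such as the answer) is an assumption that the text above does not specify.

```python
import numpy as np

# Which modules are updated in each stage, as described in the list above.
STAGES = {
    "feature_alignment": {
        "visual_encoder": "frozen", "ccam_projector": "trained", "llm": "frozen"},
    "visual_instruction_tuning": {
        "visual_encoder": "lora", "ccam_projector": "trained", "llm": "lora"},
}

def next_token_loss(logits, token_ids, loss_mask):
    """Mean negative log-likelihood of each token given its predecessors.

    logits:    (S, V) scores produced at each sequence position
    token_ids: (S,)   integer token ids of the sequence
    loss_mask: (S,)   1 where a token counts toward the loss (assumption)
    """
    pred = logits[:-1]                      # position t predicts token t + 1
    target = token_ids[1:]
    keep = loss_mask[1:].astype(bool)
    if not keep.any():
        raise ValueError("loss_mask selects no target tokens")
    pred = pred - pred.max(axis=1, keepdims=True)            # stable log-softmax
    log_probs = pred - np.log(np.exp(pred).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(target)), target]
    return float(nll[keep].mean())

# With uniform logits over a 5-token vocabulary the loss equals log(5).
uniform = next_token_loss(np.zeros((4, 5)), np.array([0, 1, 2, 3]), np.ones(4))
print(abs(uniform - np.log(5)) < 1e-9)  # True
```

In this sketch the visual query embeddings would occupy the first positions of the sequence with `loss_mask` set to 0 there, so only text tokens contribute to the objective.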

4. Scaling Behavior and Long-Video Adaptation

The projector architecture allows Video-CCAM to accept variable-length video input at inference while always passing a fixed number $N$ of embeddings to the LLM.

  • Training is performed on $T_{\mathrm{train}} = 16$ frames.
  • Inference can operate on $T_{\mathrm{test}} = 96$, 128, 256, or more frames, with the mask $M$ expanded accordingly.
  • Computational cost scales linearly in $T$ for the visual encoder and as $O(NT)$ for the cross-attention projector, with $N$ fixed and typically set to 1,024 queries. Empirical results indicate that performance continues to improve up to $T \sim 96$ frames, beyond which gains plateau.

This structural property obviates the need for either severe visual feature downsampling or expansion of LLM context for longer videos (Fei et al., 2024).
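The scaling behavior described in this section can be checked with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, `patches_per_frame=729` is an assumed value (a 27 x 27 patch grid) that the text does not state, and `visible_frames` assumes $N \geq T$ as in the mask definition of Section 2.

```python
def ccam_cost(num_frames, num_queries=1024, patches_per_frame=729):
    """Rough size accounting for one video; not a FLOP count."""
    return {
        "encoder_patch_tokens": num_frames * patches_per_frame,  # linear in T
        "mask_entries": num_queries * num_frames,                # O(N T)
        "llm_visual_tokens": num_queries,                        # independent of T
    }

def visible_frames(query_index, num_frames, num_queries=1024):
    """Number of frames query i may attend to under M[i, j] = [i >= j * floor(N / T)]."""
    stride = num_queries // num_frames
    if stride == 0:
        raise ValueError("this sketch assumes num_queries >= num_frames")
    return min(num_frames, query_index // stride + 1)

for num_frames in (16, 96):
    cost = ccam_cost(num_frames)
    print(num_frames, cost["encoder_patch_tokens"], cost["mask_entries"],
          cost["llm_visual_tokens"], visible_frames(1023, num_frames))
```

Going from 16 to 96 frames multiplies the encoder output and the mask size by six, while the number of tokens handed to the LLM stays at $N$.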

5. Benchmark Evaluation

Video-CCAM achieves competitive or state-of-the-art results across a range of short-video and long-video benchmarks. The headline results are summarized below:

| Benchmark | Model size(s) | Video-CCAM performance |
|---|---|---|
| MVBench (16-frame evaluation) | 4B / 9B / 14B | 62.8% / 64.6% / 63.1% |
| TGIF-QA (VideoChatGPT-QA) | 4B | 83.0% |
| VideoVista (96 frames) | 4B / 14B | 70.82% / 76.55% |
| MLVU (long videos) | 14B | 63.1% (multi-choice M-Avg) |
| Video-MME | 14B | 56.1% with subtitles (<30B models) |

The models outperform all prior open-source Video-MLLMs at comparable parameter budgets and generalize successfully to longer video understanding despite being trained solely on images and 16-frame samples (Fei et al., 2024).

6. Ablation Studies, Limitations, and Future Directions

Ablations highlight the impact of both mask design and key architectural hyperparameters:

  • Replacing CCAM with a dense (all-ones) mask drops MVBench accuracy from 62.8% to 59.1%.
  • Employing only temporal position embeddings delivers negligible improvement over the baseline.
  • Varying the number of learnable queries shows that $N = 1024$ provides the best balance of computational cost and accuracy (60.8% at 512, 62.8% at 1024, 62.7% at 2048).
  • No performance degradation is observed when testing on frame counts $T$ far exceeding those used in training, up to at least 96 frames.

Limitations, as identified by the authors, include the exclusive reliance on image encoders (which may weaken the capture of fine-grained motion), the use of a single cross-attention block (suggesting possible gains from deeper or hierarchical architectures), and the absence of explicit contrastive or temporal modeling objectives. The extension of CCAM concepts to non-visual modalities such as audio or to more sophisticated attention patterns is suggested as a promising future direction (Fei et al., 2024).

7. Application to Connected, Cooperative and Automated Mobility (CCAM) Data Streaming

While Video-CCAM addresses video-language understanding, supporting CCAM data streaming over 5G MEC/Cloud platforms follows a complementary engineering track (here CCAM denotes Connected, Cooperative and Automated Mobility, not causal cross-attention masks). Vehicular video streams, chunked into binary NALUs (typically 50–200 KB, H.264/H.265 encoding), are transmitted via AMQP/MQTT to an edge MEC broker, with optional anonymization, and relayed by Kafka to cloud-based consumers. Benchmarking on such IoT platforms shows that latencies of 25–35 ms are typical for MEC-only routing, while MEC-to-cloud paths add approximately 600 ms. Throughput scales linearly with stream count up to the 5G uplink cap (approximately 100 Mbps, i.e., about 50 concurrent video streams at 2 Mbps each). Latency and throughput are modeled analytically as

$$L(N,S) = L_0 + \alpha N + \beta S$$

$$T(N,S) = \min\left(N S F,\ R_{5G},\ C_\mathrm{MEC},\ C_\mathrm{Cloud}\right)$$

In these two expressions, $N$ and $T$ follow the streaming model's own notation and are unrelated to the query count $N$ and frame count $T$ used in the preceding sections.

A plausible implication is that, while Video-CCAM can process video data at scale, achieving low-latency delivery in CCAM scenarios depends on careful MEC/cloud size planning and possibly bypassing cloud brokers for delay-sensitive applications (Mogollón et al., 2023).
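The latency and throughput expressions in this section can be evaluated directly. The sketch below is a hedged illustration: the function and parameter names are hypothetical, the product $S F$ is treated as a single per-stream bitrate (an interpretation, since $S$ and $F$ are not defined above), and the capacity and coefficient values in the demo are placeholders rather than measured figures. Only the 2 Mbps per-stream rate and the 100 Mbps uplink cap come from the text.

```python
def latency_ms(n_streams, msg_size_kb, base_ms, alpha_ms_per_stream, beta_ms_per_kb):
    """L(N, S) = L_0 + alpha * N + beta * S."""
    return base_ms + alpha_ms_per_stream * n_streams + beta_ms_per_kb * msg_size_kb

def throughput_mbps(n_streams, per_stream_mbps, r_5g_mbps, c_mec_mbps, c_cloud_mbps):
    """T(N, S) = min(offered load, 5G uplink cap, MEC capacity, cloud capacity).

    The offered load N * S * F is written as n_streams * per_stream_mbps.
    """
    offered = n_streams * per_stream_mbps
    return min(offered, r_5g_mbps, c_mec_mbps, c_cloud_mbps)

# MEC and cloud capacities of 1000 Mbps are placeholder values.
for n in (10, 50, 80):
    print(n, throughput_mbps(n, per_stream_mbps=2.0, r_5g_mbps=100.0,
                             c_mec_mbps=1000.0, c_cloud_mbps=1000.0))
# 10 -> 20.0, 50 -> 100.0, 80 -> 100.0 (capped by the uplink)
```

Under these placeholder capacities the uplink is the binding constraint from 50 streams onward, which matches the linear-then-capped behavior described above.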
