ViFi-CLIP: Efficient Video-Language Adaptation

Updated 13 September 2025
  • The paper introduces ViFi-CLIP, which efficiently fine-tunes CLIP with average pooling of frame embeddings to adapt for video recognition without complex temporal modules.
  • It demonstrates robust performance in zero-shot, few-shot, and fully supervised benchmarks while maintaining parameter efficiency and high computational throughput.
  • The approach employs a bridge-and-prompt strategy to tackle low-data regimes, preserving generalization and minimizing overfitting through minimal architectural changes.

ViFi-CLIP is a streamlined framework for video-language learning that adapts pre-trained image-text models—specifically CLIP—to video recognition tasks by relying on a minimal architectural extension: temporal pooling of independently encoded video frames and full model fine-tuning on video datasets. Rather than introducing dedicated spatio-temporal or transformer modules to explicitly model inter-frame relationships, ViFi-CLIP demonstrates that implicit temporal reasoning can emerge from average pooling of frame embeddings, provided the image and text encoders are fine-tuned jointly. This approach achieves robust performance on zero-shot, few-shot, and fully supervised video recognition benchmarks and establishes a strong baseline for more complex video-language adaptations.

1. Motivation and Challenges in Transferring CLIP to Video

CLIP's large-scale multimodal pretraining on image-text pairs results in transferable semantic representations. However, the domain gap between static images and temporally dynamic videos, along with limited availability of large-scale video-text data, poses specific challenges:

  • Temporal Cues: Image-level encoders lack direct temporal modeling capacities; naïve application to video (e.g., pooling features without temporal adaptation) leaves inter-frame and motion cues underexploited.
  • Generalization versus Specialization: Adding video-specific modules (extra attention blocks, transformers) may capture rich temporal structure, but often increases overfitting risk and computational overhead, especially in low-data regimes.
  • Parameter Efficiency: Many prior video-CLIP adaptations significantly increase parameter and compute costs.

ViFi-CLIP confronts these issues by entirely reusing the original CLIP architecture and adapting it via supervised or contrastive fine-tuning on (modestly-sized) video datasets, relying on frame-level processing and late fusion by average pooling.

2. Methodology: Frame-wise Encoding and Temporal Pooling

ViFi-CLIP operates as follows:

  • Frame Sampling and Processing: For each input video $V$ (sampled to $T$ frames, e.g., $T=16$ or $T=32$), each frame is independently resized (usually to $224\times224$) and fed through the CLIP image encoder, producing a sequence of frame embeddings $x_i \in \mathbb{R}^D$ for $i = 1, \ldots, T$.
  • Temporal Pooling: The video-level embedding $v$ is constructed by average-pooling across frame embeddings:

$$v = \frac{1}{T} \sum_{i=1}^{T} x_i$$

This "embedding-level fusion" enables the model to capture aspects of scene dynamics, motion, and inter-object relationships implicitly, without modeling explicit temporal dependencies.

  • Text Encoding: The CLIP text encoder processes the class label (or natural language query), typically in the template "a photo of a <category>", to obtain a text embedding $t$.
  • Contrastive Objective: Model fine-tuning is performed by maximizing the alignment between pooled video and text embeddings:

$$\mathcal{L} = -\sum_i \log \frac{\exp(\operatorname{sim}(v_i, t_i)/\tau)}{\sum_j \exp(\operatorname{sim}(v_i, t_j)/\tau)}$$

where $\operatorname{sim}$ denotes cosine similarity and $\tau$ is a learned temperature parameter.

This framework is used for full model fine-tuning on available video data.
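The following is a minimal, hedged sketch of this pipeline in PyTorch-style Python. The `image_encoder` and `text_encoder` modules are placeholders standing in for CLIP's vision and text branches (both fine-tuned jointly in ViFi-CLIP); the class name and interface are illustrative assumptions, not the reference implementation.

```python
# Sketch of ViFi-CLIP-style frame-wise encoding, temporal average pooling,
# and the video-to-text contrastive objective given above.
# `image_encoder` and `text_encoder` are placeholder nn.Modules standing in
# for CLIP's encoders (assumption for illustration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViFiCLIPSketch(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 temperature: float = 0.07):
        super().__init__()
        self.image_encoder = image_encoder          # CLIP vision branch (fine-tuned)
        self.text_encoder = text_encoder            # CLIP text branch (fine-tuned)
        # Learnable temperature, stored in log space for stability
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / temperature)))

    def encode_video(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) -> encode each frame independently
        B, T = frames.shape[:2]
        flat = frames.flatten(0, 1)                 # (B*T, C, H, W)
        x = self.image_encoder(flat)                # (B*T, D) frame embeddings
        x = x.view(B, T, -1)
        v = x.mean(dim=1)                           # temporal average pooling -> (B, D)
        return F.normalize(v, dim=-1)

    def forward(self, frames: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        v = self.encode_video(frames)                               # (B, D)
        t = F.normalize(self.text_encoder(text_tokens), dim=-1)     # (B, D)
        logits = self.logit_scale.exp() * v @ t.T                   # sim(v_i, t_j) / tau
        targets = torch.arange(v.size(0), device=v.device)
        # Video-to-text contrastive loss, matching the formula above
        return F.cross_entropy(logits, targets)
```

Note that all temporal aggregation happens in `encode_video` via a single `mean` over the frame dimension; there are no additional temporal parameters beyond the original CLIP encoders and the temperature.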

3. Bridge and Prompt Approach for Low-Data Regimes

When extensive fine-tuning is prohibitive due to data scarcity:

  • Bridge Phase: The CLIP model is first adapted (fine-tuned) on the video dataset to bridge the initial image-to-video domain gap.
  • Prompt Phase: With the "bridge" model weights frozen, lightweight learnable prompt tokens are introduced on both the vision and language branches. These prompts are trained on the limited downstream data, allowing the model to adapt to task-specific requirements with minimal risk of overfitting.

This two-stage process preserves the generalization power of CLIP while providing efficient task adaptation through prompt tuning, offering practical benefits for few-shot or distribution-shifted scenarios.
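As an illustration of the prompt phase, the sketch below shows learnable prompt tokens prepended to a frozen text branch. The `frozen_text_encoder` here is a simplified placeholder that accepts a token-embedding sequence; in a real CLIP model the prompts would be injected at the embedding layer of each branch, and an analogous module would be attached to the vision side. Names and shapes are assumptions for illustration.

```python
# Sketch of prompt tuning on the "bridged" model: the fine-tuned CLIP weights
# stay frozen and only a small set of prompt vectors is trained.
import torch
import torch.nn as nn

class TextPromptLearner(nn.Module):
    def __init__(self, frozen_text_encoder: nn.Module, embed_dim: int, n_prompts: int = 8):
        super().__init__()
        self.encoder = frozen_text_encoder
        for p in self.encoder.parameters():          # freeze the bridged weights
            p.requires_grad_(False)
        # Learnable prompt tokens, shared across classes
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (B, L, D) embedded class-name tokens
        B = token_embeds.size(0)
        prompts = self.prompts.unsqueeze(0).expand(B, -1, -1)       # (B, n_prompts, D)
        # Prepend the prompts and run the frozen encoder on the extended sequence
        return self.encoder(torch.cat([prompts, token_embeds], dim=1))
```

Only `self.prompts` (and its vision-side counterpart) receive gradients during the prompt phase, which is what keeps the adaptation cheap and resistant to overfitting on small downstream datasets.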

4. Empirical Evaluation and Performance Analysis

ViFi-CLIP is evaluated under multiple regimes:

  • Zero-shot Transfer: Fine-tuning on Kinetics-400 and evaluating zero-shot on HMDB-51, UCF-101, and Kinetics-600. ViFi-CLIP substantially outperforms vanilla CLIP and several uni-modal and video-adapted baselines (e.g., ActionCLIP, XCLIP), with up to 6–7% gains in zero-shot accuracy (a sketch of the zero-shot classification protocol follows below).
  • Base-to-Novel Split: When classes are split into base and novel groups, ViFi-CLIP offers a superior trade-off between base and novel accuracy and improves the harmonic mean over prior methods, indicating strong generalization.
  • Few-shot and Fully-supervised: For few-shot setups ($K = 2, 4, 8, 16$), ViFi-CLIP displays consistent and significant improvements over prompt-based and heavy parametric extensions of CLIP. For fully supervised video action recognition, competitive top-1 and top-5 accuracy is attained, with fewer GFLOPs and higher throughput compared to more elaborate approaches.

Ablation studies confirm that embedding-level fusion (averaging frame embeddings) consistently surpasses decision-level or image-level fusion, especially on tasks with significant temporal complexity.
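As a concrete illustration of the zero-shot protocol referenced above, the sketch below classifies a batch of videos against a set of class names using text prompts. It assumes a model exposing `encode_video`/`encode_text` (as in the earlier sketch) and a `tokenize` callable standing in for the CLIP tokenizer; both are assumptions, not the reference API.

```python
# Sketch of zero-shot video action recognition with a ViFi-CLIP-style model.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, frames, class_names, tokenize):
    # Build one text embedding per class from the "a photo of a <category>" template
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(model.encode_text(tokenize(prompts)), dim=-1)   # (C, D)
    video_emb = model.encode_video(frames)                                 # (B, D), already normalized
    scores = video_emb @ text_emb.T                                        # cosine similarities
    return scores.argmax(dim=-1)                                           # predicted class indices
```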

5. Computational Efficiency and Implementation Considerations

ViFi-CLIP's architecture is minimal:

  • No Additional Video Modules: Unlike approaches that introduce temporal transformers or proxy tokens, ViFi-CLIP requires no architectural changes to the encoders.
  • Resource Advantages: The model maintains the parameter count of CLIP, enjoys high hardware throughput (e.g., >70 images/sec on NVIDIA A100 for inference), and requires less compute for both training and deployment.
  • Training Protocol: Requires only frame sampling, resizing, and batching for video inputs, followed by joint fine-tuning of vision and text encoders. Training is compatible with standard CLIP infrastructure (a preprocessing sketch follows this list).
  • Code Availability: The reference implementation and pretrained checkpoints are made publicly available at https://github.com/muzairkhattak/ViFi-CLIP.
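The sketch below illustrates the kind of frame sampling and CLIP-style preprocessing the training protocol implies: uniform temporal sampling of $T$ frames followed by resizing and normalization. The mean/std values are the standard CLIP statistics; the decoded video tensor layout and function names are assumptions for illustration.

```python
# Sketch of uniform frame sampling and CLIP-style preprocessing for video inputs.
import torch
from torchvision import transforms

CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

preprocess = transforms.Compose([
    transforms.Resize(224, antialias=True),
    transforms.CenterCrop(224),
    transforms.ConvertImageDtype(torch.float32),   # uint8 -> float in [0, 1]
    transforms.Normalize(CLIP_MEAN, CLIP_STD),
])

def sample_frames(video: torch.Tensor, num_frames: int = 16) -> torch.Tensor:
    # video: (num_total_frames, C, H, W) uint8 tensor from a video decoder
    idx = torch.linspace(0, video.shape[0] - 1, num_frames).long()
    frames = video[idx]                                    # (T, C, H, W)
    return torch.stack([preprocess(f) for f in frames])    # ready for the image encoder
```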

6. Comparative Context and Extensions

ViFi-CLIP represents an architectural baseline against which recent and future video-CLIP adaptations are compared. In contrast to:

  • Proxy-based or Transformer Extensions: Models introducing video proxy tokens, joint-attention modules, or cross-modal adapters (e.g., CLIP-ViP (Xue et al., 2022), FiGCLIP (S et al., 15 Jan 2024)) increase complexity to explicitly model inter-frame temporal structure.
  • Interpolation and Continual Learning: Open-VCLIP (Weng et al., 2023) introduces temporal attention expansion and interpolated weight regularization to preserve zero-shot generality.
  • Prompt-efficient Tuning: Bridge-and-prompt and LoRA-based approaches (cf. FiGCLIP) offer efficient fine-tuning but depend on dataset-specific composition.

ViFi-CLIP demonstrates that robust temporal and semantic adaptation is possible with the simplest fusion of existing CLIP-trained representations, challenging the necessity for more elaborate architectural modifications in data-constrained settings.

7. Future Research Directions

Efforts to further improve video-language adaptation may consider:

  • Temporal Structure Modeling: Whether more explicit modeling (e.g., transformers across frames as in CLIP-ViP) brings further improvements when sufficient video data are available.
  • Fine-grained Compositionality: Developing mechanisms—such as hierarchical loss objectives or contextually structured prompts—that can go beyond mean pooling for syntactic and compositional reasoning (as in FiGCLIP).
  • Domain Generalization: Investigating adaptation schemes (e.g., continual or unsupervised domain adaptation) that further close the image–video gap and mitigate distribution shift.
  • Parameter- and Data-Efficiency: Quantifying trade-offs between full fine-tuning, prompt tuning, and other low-rank or modular adaptation strategies for resource-constrained deployment and lifelong learning scenarios.

ViFi-CLIP establishes a robust, computationally efficient baseline for general-purpose video-language modeling and remains a reference point for subsequent research in transferable vision-language representations for dynamic multimodal data (Rasheed et al., 2022).