CLIP-Higher: Query-Weighted Video Retrieval
- CLIP-Higher introduces a query-weighted aggregation method that replaces naive mean-pooling, improving temporal modeling in long video retrieval.
- Empirical results show consistent gains, with up to 2–3 percentage points boost in Recall@1 on benchmarks like MSR-VTT and ActivityNet Captions.
- The approach leverages a tunable softmax temperature to balance frame selectivity, offering a lightweight, interpretable solution for enhanced video retrieval.
CLIP-Higher refers to a suite of methodological advances, baselines, and architectural modifications that improve the adaptation and performance of CLIP-style image–text models for complex tasks, primarily long video retrieval but also broader scenarios requiring enhanced temporal, semantic, or fine-grained discriminative power. The most canonical use of the term originates in “A CLIP-Hitchhiker’s Guide to Long Video Retrieval” (Bain et al., 2022), where it denotes a mathematically principled, query-weighted temporal aggregation mechanism. The term has subsequently been repurposed in related works to signify approaches that “elevate” CLIP-like models in semantic fidelity, temporal reasoning, and detail preservation.
1. Query-Weighted Temporal Aggregation: Mathematical Foundation
The principal innovation of CLIP-Higher is the replacement of naïve mean-pooling for per-frame CLIP embeddings with a query-scoring weighted mean, enabling more effective temporal aggregation in video retrieval. The process is formalized as follows:
Given a sequence of frame embeddings $\{f_1, \dots, f_N\}$ for $N$ frames and a text query embedding $q$, frame relevance is computed as the dot product:

$$s_i = q^\top f_i$$

The normalized frame weights are derived via a softmax with temperature $\tau$:

$$w_i = \frac{\exp(s_i/\tau)}{\sum_{j=1}^{N} \exp(s_j/\tau)}$$

The aggregated video-level embedding is then:

$$v = \sum_{i=1}^{N} w_i f_i$$

Key properties of this mechanism:
- As $\tau \to 0$, aggregation approaches a hard max (only the single most relevant frame contributes).
- As $\tau \to \infty$, the method reduces to uniform mean-pooling.
- By tuning $\tau$, one directly controls the selectivity versus uniformity of frame contribution, adapting to the temporal heterogeneity of video content.
This aggregation is parameter-free aside from $\tau$ (often treated as a single learnable scalar), thus offering a highly interpretable and lightweight solution compared with heavier alternatives.
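A minimal PyTorch sketch of this aggregation is shown below; the function name, default temperature value, and the assumption that frame and query embeddings arrive L2-normalized from a CLIP encoder are illustrative rather than taken from the reference implementation.

```python
import torch
import torch.nn.functional as F

def query_weighted_pool(frame_embeds: torch.Tensor,
                        query_embed: torch.Tensor,
                        tau: float = 0.07) -> torch.Tensor:
    """Aggregate per-frame CLIP embeddings with query-conditioned softmax weights.

    frame_embeds: (N, D) L2-normalized embeddings for the N frames of one video
    query_embed:  (D,)   L2-normalized text query embedding
    tau:          softmax temperature; small tau approaches a hard max over frames,
                  large tau approaches uniform mean-pooling
    """
    scores = frame_embeds @ query_embed                              # s_i = q^T f_i, shape (N,)
    weights = F.softmax(scores / tau, dim=0)                         # w_i, shape (N,)
    video_embed = (weights.unsqueeze(-1) * frame_embeds).sum(dim=0)  # weighted mean, shape (D,)
    return F.normalize(video_embed, dim=0)                           # re-normalize for cosine retrieval
```

Because the weights are query-conditioned, the pooled video embedding must be recomputed for each text query; plain mean-pooling, by contrast, allows a single cached vector per video.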
2. Empirical Evaluation on Long Video Retrieval Benchmarks
The effectiveness of CLIP-Higher is established through extensive experimental results on prominent video retrieval datasets:
| Benchmark | Recall@1 | Recall@5 | Recall@10 | Notes |
|---|---|---|---|---|
| MSR-VTT (~15 s avg. duration) | 47.7% | 74.1% | 82.9% | Outperforms more complex temporal models |
| Condensed Movies | 27.0% | 52.3% | 61.2% | Handles long-form, high-variance videos |
| ActivityNet Captions | 44.0% | 74.9% | 86.1% | Generalizes to diverse, lengthy contexts |
Compared to the mean-pooling baseline (e.g., 44.4% Recall@1 on MSR-VTT), the query-scoring aggregation provides consistent improvements (up to 2–3 percentage points at Recall@1, with larger gains on other metrics), setting a new baseline for subsequent models.
Notably, these gains are achieved without millions of extra parameters or architectural complexity, and with only a single trainable temperature $\tau$ per run.
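For context, Recall@K figures such as those above can be computed from a query-to-video similarity matrix along the following lines; this is a generic evaluation sketch assuming the standard text-to-video protocol (query i is paired with video i), not code from the cited work.

```python
import torch

def recall_at_k(similarity: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """Text-to-video Recall@K from a (num_queries, num_videos) similarity matrix,
    assuming query i's ground-truth video has index i."""
    num_queries = similarity.size(0)
    ranking = similarity.argsort(dim=1, descending=True)              # best-scoring video first
    gt = torch.arange(num_queries, device=similarity.device).unsqueeze(1)
    ranks = (ranking == gt).float().argmax(dim=1)                     # rank of the ground-truth video
    return {f"R@{k}": (ranks < k).float().mean().item() * 100 for k in ks}
```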
3. Comparison with Prior Temporal Modeling Techniques
Earlier video retrieval models applied uniform mean-pooling or incorporated complex temporal encoding strategies (e.g., self-attention over frames, joint attention blocks, or sequence modeling layers) that generally required substantial parameter budgets and increased computational cost. CLIP-Higher distinguishes itself by:
- Demonstrating that even the naive mean-pooling of CLIP frame embeddings provides competitive results, due to CLIP’s strong per-frame semantics.
- Showing that selective, query-conditioned reweighting provides systematic and statistically significant improvements over both mean-pooling and these more elaborate but often overfitted or under-optimized temporal models.
- Offering visual evidence (via qualitative frame scoring) that high-weight frames correlate with semantically rich, query-relevant content while suppressing redundancy and irrelevance.
Ablation studies confirm that the weighted-mean formulation is not only robust across datasets but also interpretable, which was a deficiency in joint-attention or deep sequence models.
4. Generalization, Simplicity, and Implications for Future Research
The success of the CLIP-Higher aggregation baseline (query-weighted mean with softmax temperature) demonstrates that strong temporal modeling in long video retrieval can, to a large extent, be reduced to an alignment-weighted sum—if the underlying frame representations are sufficiently discriminative.
This insight motivates several research directions:
- Leveraging query-scoring as a supervisory signal for end-to-end video representation learning, possibly integrating the weighting in model finetuning.
- Hybrid approaches that maintain the interpretability of simple weighted sums, but couple them with more advanced (possibly transformer-based) temporal modeling for residual dynamics.
- Exploring adaptive or context-aware temperature schedules, which can dynamically adjust selectivity according to video content structure (a hypothetical sketch follows this list).
- Generalizing the principle for other types of hierarchical data (e.g., document retrieval, egocentric video, multimodal event parsing).
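As one concrete illustration of the temperature-schedule direction, a per-query temperature predictor might look as follows; this module, its parameter ranges, and its name are assumptions made for illustration and are not part of Bain et al. (2022).

```python
import torch
import torch.nn as nn

class QueryAdaptiveTemperature(nn.Module):
    """Hypothetical extension: predict a per-query temperature from the text
    embedding instead of using a single global scalar (illustrative only)."""

    def __init__(self, dim: int, tau_min: float = 0.01, tau_max: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(dim, 1)
        self.tau_min, self.tau_max = tau_min, tau_max

    def forward(self, query_embed: torch.Tensor) -> torch.Tensor:
        # Squash the prediction into [tau_min, tau_max] so selectivity stays bounded
        gate = torch.sigmoid(self.proj(query_embed))
        return self.tau_min + (self.tau_max - self.tau_min) * gate
```

The predicted temperature would simply replace the global scalar in the weighted pooling described in Section 1.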
By elevating the baseline and providing detailed benchmark comparisons, CLIP-Higher reframes the challenge for new methods and sets a concise, powerful standard against which future innovations are to be evaluated.
5. Limitations and Practical Considerations
While CLIP-Higher delivers tangible improvements with minimal parameter overhead, several caveats are acknowledged:
- Temporal dynamics in highly eventful or compositional video may not be fully captured by a frame-wise, non-sequential reweighting; such contexts may still benefit from complementary sequence modeling.
- The temperature parameter $\tau$ acts as a global sharpness controller; future models may seek to learn context- or query-specific temperatures, at increased computational complexity.
- The method’s reliance on CLIP’s per-frame semantic power suggests that in domains where CLIP’s image encoder is less robust, aggregate improvement may be marginal.
- Frame scoring is only as strong as the frame–text matching, so scenarios with subtle action differences or fine-grained temporal cues may require finer granularity of embedding or aggregation.
In practice, the method retains compatibility with CLIP’s zero-shot pipeline and imposes only lightweight computational cost (adding a matrix–vector multiplication and softmax during aggregation).
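A sketch of how this looks in a zero-shot retrieval loop over precomputed frame embeddings is given below; the tensor shapes and names are illustrative, and the use of batched `einsum` contractions is an implementation choice rather than the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_queries_against_videos(query_embeds: torch.Tensor,
                                 video_frame_embeds: torch.Tensor,
                                 tau: float = 0.07) -> torch.Tensor:
    """Zero-shot retrieval scoring with query-conditioned pooling.

    query_embeds:       (Q, D) L2-normalized text embeddings
    video_frame_embeds: (V, N, D) L2-normalized frame embeddings (N frames per video)
    Returns a (Q, V) similarity matrix.
    """
    # Per-query, per-video, per-frame relevance scores in one batched contraction
    frame_scores = torch.einsum("qd,vnd->qvn", query_embeds, video_frame_embeds)
    weights = F.softmax(frame_scores / tau, dim=-1)                     # (Q, V, N)
    # Query-conditioned pooled video embeddings, then cosine similarity
    pooled = F.normalize(torch.einsum("qvn,vnd->qvd", weights, video_frame_embeds), dim=-1)
    return torch.einsum("qd,qvd->qv", query_embeds, pooled)             # (Q, V)
```

Relative to caching one mean-pooled vector per video, the only additional work is the batched frame scoring and softmax, consistent with the lightweight cost noted above.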
6. Impact and Reference Implementation
The introduction of CLIP-Higher raises the baseline for long video retrieval, emphasizing a paradigm where the simplest temporally-aware method—namely, per-frame query-conditioned weighting—outperforms or matches more elaborate strategies. This shifts the research landscape toward methods that explicitly justify additional complexity and inspires the integration of interpretable, sparsity-inducing aggregation across language–vision applications.
The mathematical foundation, strong empirical results, and lightweight implementation collectively position CLIP-Higher as a reference approach for efficient and effective video retrieval with pretrained vision–language models (Bain et al., 2022).