RefCLAPScore: Multimodal Evaluation Metric

Updated 15 April 2026

RefCLAPScore is an evaluation metric that integrates cross-modal grounding with reference similarity to assess audio and image outputs.
It computes a candidate’s compatibility and its best reference match, fusing these scores with the harmonic mean for balanced evaluation.
The metric is applied in tasks like audio source separation and captioning, showing improved alignment with human ratings over reference-free methods.

RefCLAPScore is an evaluation metric designed to measure the quality of multimodal outputs—specifically audio or image content paired with natural language queries or captions—in a reference-augmented, cross-modal embedding space. Rooted in the contrastive learning paradigm exemplified by CLAP (Contrastive Language-Audio Pretraining) and CLIP (Contrastive Language-Image Pretraining), RefCLAPScore integrates both modality-to-text grounding and similarity to available references to provide a composite measure of output fidelity and relevance. It extends pure reference-free cosine approaches (such as CLAPScore and CLIPScore) by harmonically combining a candidate’s cross-modal compatibility score with its best similarity to any reference, providing a more holistic evaluation signal in applications like language-queried audio source separation, audio captioning, and image captioning.

1. Mathematical Formulation and Core Definition

Given a candidate output (e.g., separated audio or generated caption) and one or more reference ground-truths, RefCLAPScore is computed in three steps:

Modality Encoding: Both the candidate and the reference are mapped into a shared $d$-dimensional embedding space using CLAP encoders for audio–text tasks or CLIP encoders for image–text tasks:
- For audio–text:
$E_\text{audio} = E_A(x_\text{audio}) \in \mathbb{R}^d,\quad E_\text{text} = E_T(x_\text{text}) \in \mathbb{R}^d$
Cosine Similarity:
- Each embedding is L2-normalized. The cross-modal score between $x_\text{audio}$ and $x_\text{text}$ (the CLAPScore, denoted $a$ below) is:
$\mathrm{CLAPScore}(x_\text{text}, x_\text{audio}) = \cos(E_\text{text}, E_\text{audio}) = \frac{E_\text{text}^\top E_\text{audio}}{\|E_\text{text}\|_2 \|E_\text{audio}\|_2}$
Reference-Augmented Fusion: In settings with multiple reference texts (e.g., image captioning), the highest similarity between the candidate and any reference is computed, and the cross-modal and reference compatibilities are aggregated using the harmonic mean. Generalizing from CLIPScore (Hessel et al., 2021):
- Let $a$ be the CLAPScore (candidate vs. audio/image) and let $b = \max_{k}\cos(\text{candidate}, \text{reference}_k)$.
- The final composite is:
$\mathrm{RefCLAPScore} = \frac{2ab}{a+b}$

This ensures both cross-modal grounding and reference consistency.
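The three steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation. The names `cosine` and `ref_clap_score` are hypothetical, the embeddings are assumed to come from a pretrained CLAP or CLIP encoder (toy vectors stand in for them here), and returning 0 when either score is non-positive is an added assumption, since the harmonic mean is not meaningful for negative inputs.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def ref_clap_score(candidate: Vector, source: Vector,
                   references: Sequence[Vector]) -> float:
    """Harmonic mean of a = cos(candidate, source) and
    b = max_k cos(candidate, reference_k)."""
    a = cosine(candidate, source)
    b = max(cosine(candidate, ref) for ref in references)
    # Assumption (not from the source): treat non-positive scores as a zero result.
    if a <= 0.0 or b <= 0.0:
        return 0.0
    return 2.0 * a * b / (a + b)


# Toy 2-dimensional vectors standing in for real encoder outputs.
candidate = [1.0, 0.0]                  # candidate caption embedding
source = [1.0, 1.0]                     # audio (or image) embedding
references = [[0.0, 1.0], [1.0, 0.0]]   # reference caption embeddings
score = ref_clap_score(candidate, source, references)
```

In this toy case $a \approx 0.707$ and $b = 1$, so the fused score is about 0.828: it sits between the two components but is pulled toward the smaller one.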

2. Underlying Model Architecture and Pretraining Objective

RefCLAPScore relies fundamentally on the architecture and training regimes of CLAP or CLIP:

Audio/Visual Encoder ($E_A$ for audio, or the corresponding image encoder): Deep convolutional or transformer-based stack, ingesting log-mel spectrograms (audio) or pixels (images), with global pooling, linear projection, and L2 normalization.
Text Encoder ($E_T$): BERT-style transformer over tokenized or byte-pair-encoded text.
Contrastive Loss (InfoNCE), written here in its standard form for a batch of $N$ paired examples:

$\mathcal{L}_{a \to t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\tau \cdot \mathrm{sim}(E_\text{audio}^{(i)}, E_\text{text}^{(i)})\right)}{\sum_{j=1}^{N} \exp\left(\tau \cdot \mathrm{sim}(E_\text{audio}^{(i)}, E_\text{text}^{(j)})\right)}$

where similarity is cosine, and $\tau$ is a (learned or fixed) inverse-temperature parameter. The text-to-audio term $\mathcal{L}_{t \to a}$ is defined symmetrically, with the denominator summing over audio embeddings for a fixed text.

Total Loss:

$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{a \to t} + \mathcal{L}_{t \to a}\right)$

Pretrained Checkpoints: Evaluation typically uses fixed, publicly-released CLAP or CLIP weights; no downstream fine-tuning for scoring is performed.
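To make the pretraining objective concrete, the sketch below computes a symmetric InfoNCE loss from a precomputed similarity matrix. It assumes the standard CLAP/CLIP-style formulation (matched pairs on the diagonal, cosine similarities scaled by an inverse temperature, the two directions averaged). The function names and the parameter name `tau` are illustrative choices, not taken from any particular codebase.

```python
import math
from typing import List, Sequence


def info_nce(sim: Sequence[Sequence[float]], tau: float) -> float:
    """Mean cross-entropy over the rows of a square similarity matrix.

    sim[i][j] is the cosine similarity between item i of one modality and
    item j of the other; the matched pair for row i is column i.
    tau is an inverse temperature that multiplies the similarities.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [tau * s for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(sim)


def symmetric_contrastive_loss(sim: Sequence[Sequence[float]], tau: float) -> float:
    """Average of the row-wise and column-wise InfoNCE terms."""
    transposed: List[List[float]] = [list(col) for col in zip(*sim)]
    return 0.5 * (info_nce(sim, tau) + info_nce(transposed, tau))


# Toy batch of two perfectly matched pairs (identity similarity matrix).
perfect = [[1.0, 0.0], [0.0, 1.0]]
loss_uninformative = symmetric_contrastive_loss(perfect, tau=0.0)  # all logits equal
loss_sharp = symmetric_contrastive_loss(perfect, tau=50.0)
```

With `tau=0.0` every logit is equal and the loss reduces to $\log N$ (here $\log 2$). With a large `tau` and perfectly matched pairs, the loss approaches zero.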

3. Algorithmic Workflow and Feature Extraction

The end-to-end RefCLAPScore computation includes:

Waveform/Image and Text Preprocessing:
- Audio: Resample (e.g., 16 kHz), windowing (25 ms, 10 ms hop), mel-spectrogram (typically 64 or 128 bins), per-frequency normalization.
- Text: Tokenization (BPE or WordPiece), max length enforcement, and padding/truncation.
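Embedding Computation and Normalization:

$\hat{E}_\text{audio} = \frac{E_A(x_\text{audio})}{\|E_A(x_\text{audio})\|_2},\quad \hat{E}_\text{text} = \frac{E_T(x_\text{text})}{\|E_T(x_\text{text})\|_2}$

Scoring:

$\cos(E_\text{text}, E_\text{audio}) = \hat{E}_\text{text}^\top \hat{E}_\text{audio}$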

Sliding Window Aggregation (optional, especially in captioning): To address instability from window truncation, embeddings may be averaged over temporally-overlapping windows before normalization (SLIDE-CLAP).
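The sliding-window step can be sketched as follows. This is an illustration of the averaging idea only, not the SLIDE-CLAP implementation. The `encode` callable is a hypothetical stand-in for a CLAP audio encoder, and the window and hop sizes are arbitrary. Keeping a shorter final window so the tail of the signal is covered is also an assumption.

```python
import math
from typing import Callable, List, Sequence


def sliding_windows(samples: Sequence[float], window: int, hop: int) -> List[Sequence[float]]:
    """Split samples into overlapping windows; the last one may be shorter."""
    if len(samples) <= window:
        return [samples]
    starts = range(0, len(samples) - window + hop, hop)
    return [samples[s:s + window] for s in starts]


def slide_embedding(samples: Sequence[float],
                    encode: Callable[[Sequence[float]], List[float]],
                    window: int, hop: int) -> List[float]:
    """Average the per-window embeddings, then L2-normalize the mean."""
    embeddings = [encode(w) for w in sliding_windows(samples, window, hop)]
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


def toy_encode(window: Sequence[float]) -> List[float]:
    """Hypothetical 2-dimensional 'encoder' used only for this example."""
    return [sum(window), float(len(window))]


windows = sliding_windows(list(range(10)), window=4, hop=2)
embedding = slide_embedding([1.0] * 10, toy_encode, window=4, hop=2)
```

Averaging before normalization, as described above, means every window contributes in proportion to its raw embedding. Normalizing each window first would be a different design choice that weights all windows equally.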

4. Benchmarking, Empirical Results, and Comparisons

RefCLAPScore has been evaluated in multiple contexts:

Audio-Text Alignment (DCASE 2024 Task 9): Used as primary/auxiliary metric for language-queried audio source separation. Table 1 presents the following system-level results (Xiao et al., 2024):

Method	SDR	SI-SDR	CLAPScore	RefCLAPScore
Baseline	5.71	3.86	0.24	0.25
Baseline-Augm.	5.94	4.19	0.24	0.25
AudioSep	8.19	6.68	0.26	0.27

Caption Evaluation (BRACE Benchmark): Maximum F1 on the main sub-benchmarks reaches approximately 70 for the best LAION-CLAP variant in SLIDE mode; hallucination discrimination is higher, but never exceeds 89 F1 (Guo et al., 2025).
Correlation with Human Ratings: For audio separation, the Pearson correlation ($r$) with SDR is modest:
- CLAPScore vs. SDR: 0.27; RefCLAPScore vs. SDR: 0.226 (Xiao et al., 2024).
- On image captioning, RefCLIPScore consistently outperforms pure CLIPScore and leading reference-based metrics on literal description tasks (Hessel et al., 2021).
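For readers who want to reproduce this kind of analysis, the sketch below shows how a Pearson coefficient between a signal-level metric and an embedding-based metric is computed. It uses the three system-level rows of Table 1 purely to illustrate the arithmetic. Three points are far too few for a meaningful estimate, and the result is not comparable to the coefficients quoted above. The function name `pearson_r` is illustrative.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# System-level values copied from Table 1 (Baseline, Baseline-Augm., AudioSep).
sdr = [5.71, 5.94, 8.19]
ref_clap = [0.25, 0.25, 0.27]
r_system_level = pearson_r(sdr, ref_clap)
```

On these three aggregated points the coefficient comes out close to 1, which mainly reflects that the system ranking agrees. It says little about per-sample agreement.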

5. Contextualization, Implementation Practices, and Limitations

RefCLAPScore unifies two semantic axes: content grounding in the input modality and overlap with one or more reference outputs. Its usability and performance are subject to several practical considerations:

Advantages:
- Reference-augmented: Captures both modality fidelity and textual fluency/coverage.
- Generalizability: Architecture-agnostic with respect to type of content (audio, image).
- Computational efficiency: Single forward pass per encoder per pair; minimal preprocessing overhead; public models available.
Limitations:
- Moderate alignment with classic distortion-based or n-gram metrics (e.g., SDR, CIDEr)—RefCLAPScore is complementary, not substitutive.
- Sensitivity: Does not robustly capture subtle acoustic details or syntactic violations; scores may be insensitive to grammar when keywords match.
- Windowing: Random crop/truncation instability, remedied by sliding aggregation at increased computational cost.
- Pretraining Domain Bias: Performance may degrade for out-of-distribution content or rare modalities not present in pretraining data.
Best Practices:
- Report both cross-modal and reference-only metrics in experimental studies.
- Apply sliding window aggregation for longer/higher-variance samples (i.e., SLIDE-CLAP).
- For datasets with multiple references, compute $\max_k$ over references for each candidate.

6. Applications and Extensions

RefCLAPScore underpins evaluation in a variety of multimodal generation and retrieval settings:

Language-Queried Audio Source Separation: Allows for direct semantic evaluation of system outputs without requiring ground-truth references—critical for real-world data (Xiao et al., 2024).
Foley Sound Synthesis: Used to replace inefficient post-generation filtering; latent-CLAP loss aligns synthesized content toward high-scoring RefCLAPScore regions, obviating the need for threshold-based sample selection (Karchkhadze et al., 2024).
Automatic Audio Captioning: Adopted for pairwise preference testing and quality measurement in benchmarks with and without high-quality reference captions (Guo et al., 2025).
Image Captioning: RefCLIPScore, the direct image counterpart, has established state-of-the-art correlations with expert and crowd judgments on literal datasets when references are supplied (Hessel et al., 2021).

7. Relationship to Related Metrics and Ongoing Directions

RefCLAPScore is structurally analogous to RefCLIPScore for image–text; both fuse cross-modal compatibility and reference overlap through the harmonic mean, penalizing candidates that are deficient in either aspect. Reference-free metrics such as pure CLAPScore or CLIPScore offer broader applicability but lower fine-grained discrimination, especially in high-quality or adversarial/comparative settings.
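A small numeric comparison shows why the harmonic mean penalizes a candidate that is weak on one axis. The values of `a` and `b` below are illustrative, not measured scores, and the helper names are hypothetical.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two positive scores, as used in the fusion step."""
    return 2.0 * a * b / (a + b)


def arithmetic_mean(a: float, b: float) -> float:
    """Plain average, shown only for contrast."""
    return (a + b) / 2.0


# A candidate that is well grounded (a = 0.9) but far from every reference (b = 0.1).
unbalanced_harmonic = harmonic_mean(0.9, 0.1)
unbalanced_arithmetic = arithmetic_mean(0.9, 0.1)

# A candidate with the same arithmetic mean but balanced components.
balanced_harmonic = harmonic_mean(0.5, 0.5)
```

The unbalanced candidate receives 0.18 under the harmonic mean against 0.5 under the arithmetic mean, while the balanced candidate keeps its full 0.5 under both.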

Limitations exposed by BRACE (e.g., plateauing F1 on human–human or subtle corruption discrimination) motivate further research into embedding architectures, domain coverage, and calibration. A plausible implication is that future iterations may adopt more granular, context-sensitive, or hierarchical fusion approaches to address current weaknesses in assessing syntactic and fine-grained quality.


References

Hessel et al. (2021). CLIPScore: A Reference-free Evaluation Metric for Image Captioning.

Xiao et al. (2024). A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining.

Guo et al. (2025). BRACE: A Benchmark for Robust Audio Caption Quality Evaluation.

Karchkhadze et al. (2024). Latent CLAP Loss for Better Foley Sound Synthesis.
