
RefCLAPScore: Multimodal Evaluation Metric

Updated 15 April 2026
  • RefCLAPScore is an evaluation metric that integrates cross-modal grounding with reference similarity to assess audio and image outputs.
  • It computes a candidate’s compatibility and its best reference match, fusing these scores with the harmonic mean for balanced evaluation.
  • The metric is applied in tasks like audio source separation and captioning, showing improved alignment with human ratings over reference-free methods.

RefCLAPScore is an evaluation metric designed to measure the quality of multimodal outputs—specifically audio or image content paired with natural language queries or captions—in a reference-augmented, cross-modal embedding space. Rooted in the contrastive learning paradigm exemplified by CLAP (Contrastive Language-Audio Pretraining) and CLIP (Contrastive Language-Image Pretraining), RefCLAPScore integrates both modality-to-text grounding and similarity to available references to provide a composite measure of output fidelity and relevance. It extends pure reference-free cosine approaches (such as CLAPScore and CLIPScore) by harmonically combining a candidate’s cross-modal compatibility score with its best similarity to any reference, providing a more holistic evaluation signal in applications like language-queried audio source separation, audio captioning, and image captioning.

1. Mathematical Formulation and Core Definition

Given a candidate output (e.g., separated audio or generated caption) and one or more reference ground-truths, RefCLAPScore is computed in three steps:

  1. Modality Encoding: Both the candidate and the reference are mapped into a shared, $d$-dimensional embedding space using CLAP encoders for audio–text tasks or CLIP encoders for image–text tasks:
    • For audio–text:

    $$E_\text{audio} = E_A(x_\text{audio}) \in \mathbb{R}^d,\quad E_\text{text} = E_T(x_\text{text}) \in \mathbb{R}^d$$

  2. Cosine Similarity:

    • Each embedding is L2-normalized. The reference-free compatibility score (CLAPScore) between $x_\text{audio}$ and $x_\text{text}$ is:

    $$\mathrm{CLAPScore}(x_\text{text}, x_\text{audio}) = \cos(E_\text{text}, E_\text{audio}) = \frac{E_\text{text}^\top E_\text{audio}}{\|E_\text{text}\|_2 \, \|E_\text{audio}\|_2}$$

  3. Reference-Augmented Fusion: In settings with multiple reference texts (e.g., image captioning), the highest similarity between the candidate and any reference is computed, and the cross-modal and reference compatibilities are aggregated using the harmonic mean. Generalizing from CLIPScore (Hessel et al., 2021):

    • Let $a$ be the CLAPScore (candidate vs. audio/image) and $b = \max_{k} \cos(\text{candidate}, \text{reference}_k)$.
    • The final composite is:

    $$\mathrm{RefCLAPScore} = \frac{2ab}{a+b}$$

  • This ensures both cross-modal grounding and reference consistency.
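
The three steps above can be sketched with NumPy on precomputed embeddings. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the embeddings are assumed to come from a pretrained CLAP or CLIP encoder, and clipping negative cosines to zero is an assumption borrowed from CLIPScore practice that also keeps the harmonic mean well defined.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def ref_clap_score(cand_emb, source_emb, ref_embs):
    """Harmonic mean of cross-modal and best-reference cosine similarity.

    cand_emb:   embedding of the candidate output, shape (d,)
    source_emb: embedding of the grounding input (audio/image or query), shape (d,)
    ref_embs:   embeddings of one or more references, shape (K, d)
    """
    c = l2_normalize(cand_emb)
    s = l2_normalize(source_emb)
    r = l2_normalize(np.atleast_2d(ref_embs))
    # a: cross-modal compatibility (CLAPScore term), clipped at 0 (assumption).
    a = max(float(c @ s), 0.0)
    # b: best cosine similarity to any reference, clipped at 0 (assumption).
    b = max(float(np.max(r @ c)), 0.0)
    if a + b == 0.0:
        return 0.0
    return 2.0 * a * b / (a + b)

# Toy 2-D embeddings: the candidate matches one reference exactly (b = 1)
# and sits at 45 degrees from the grounding input (a = cos 45°).
cand = np.array([1.0, 0.0])
source = np.array([1.0, 1.0])
refs = np.array([[0.0, 1.0], [1.0, 0.0]])
score = ref_clap_score(cand, source, refs)
print(round(score, 4))
```

In this toy case the fused score lies between the two components but closer to the smaller one, which is the behavior the harmonic mean is chosen for.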

2. Underlying Model Architecture and Pretraining Objective

RefCLAPScore relies fundamentally on the architecture and training regimes of CLAP or CLIP:

  • Audio/Visual Encoder ($E_A$ for audio, or the corresponding image encoder): Deep convolutional or transformer-based stack, ingesting log-mel spectrograms (audio) or pixels (images), with global pooling, linear projection, and L2 normalization.

  • Text Encoder ($E_T$): BERT-style transformer over tokenized or byte-pair-encoded text.

  • Contrastive Loss (InfoNCE), for a batch of $N$ paired items:

$$\mathcal{L}_{A \to T} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\big(\tau \cos(E_\text{audio}^{(i)}, E_\text{text}^{(i)})\big)}{\sum_{j=1}^{N} \exp\big(\tau \cos(E_\text{audio}^{(i)}, E_\text{text}^{(j)})\big)}$$

where similarity is cosine, and $\tau$ is a (learned or fixed) inverse-temperature parameter. $\mathcal{L}_{T \to A}$ is defined symmetrically, normalizing over the audio (or image) items in the batch.

  • Total Loss:

$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{A \to T} + \mathcal{L}_{T \to A}\right)$$

  • Pretrained Checkpoints: Evaluation typically uses fixed, publicly-released CLAP or CLIP weights; no downstream fine-tuning for scoring is performed.
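
The symmetric contrastive objective named above can be sketched in NumPy as follows. This is an illustrative forward computation only, assuming the standard CLIP/CLAP-style formulation; the function names and the inverse-temperature value of 10 are hypothetical, and real training code would use an autodiff framework.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(audio_emb, text_emb, inv_temp=10.0):
    """Average of audio-to-text and text-to-audio InfoNCE losses.

    Row i of audio_emb is assumed to be paired with row i of text_emb.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = inv_temp * (a @ t.T)           # (N, N) scaled cosine similarities
    idx = np.arange(logits.shape[0])        # matched pairs lie on the diagonal
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)

rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
aligned = symmetric_info_nce(audio, audio)         # every pair matches
shuffled = symmetric_info_nce(audio, audio[::-1])  # every pair mismatched
print(aligned < shuffled)
```

Correctly paired batches yield a lower loss than mismatched ones, which is what pulls matched audio and text embeddings together during pretraining.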

3. Algorithmic Workflow and Feature Extraction

The end-to-end RefCLAPScore computation includes:

  • Waveform/Image and Text Preprocessing:

    • Audio: Resample (e.g., 16 kHz), windowing (25 ms, 10 ms hop), mel-spectrogram (typically 64 or 128 bins), per-frequency normalization.
    • Text: Tokenization (BPE or WordPiece), max length enforcement and padding/truncation.
  • Embedding Computation and Normalization:

$$\hat{E}_\text{audio} = \frac{E_A(x_\text{audio})}{\|E_A(x_\text{audio})\|_2},\quad \hat{E}_\text{text} = \frac{E_T(x_\text{text})}{\|E_T(x_\text{text})\|_2}$$

  • Scoring:

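$$a = \cos(E_\text{text}, E_\text{audio}),\quad b = \max_{k} \cos(\text{candidate}, \text{reference}_k),\quad \mathrm{RefCLAPScore} = \frac{2ab}{a+b}$$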

  • Sliding Window Aggregation (optional, especially in captioning): To address instability from window truncation, embeddings may be averaged over temporally-overlapping windows before normalization (SLIDE-CLAP).
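
The framing and sliding-window steps can be illustrated with a short sketch. The 10 s window and 5 s hop for aggregation are assumed values, `toy_encoder` is a hypothetical stand-in for a pretrained CLAP audio encoder, and all function names are illustrative rather than taken from any library.

```python
import numpy as np

def num_frames(n_samples, sr=16000, win_ms=25, hop_ms=10):
    """Number of full STFT frames for a 25 ms window and 10 ms hop."""
    win, hop = sr * win_ms // 1000, sr * hop_ms // 1000
    return 0 if n_samples < win else 1 + (n_samples - win) // hop

def window_starts(n_samples, win, hop):
    """Start indices of overlapping windows (any partial tail is dropped)."""
    return list(range(0, max(n_samples - win, 0) + 1, hop))

def slide_embed(waveform, encode_window, sr=16000, win_s=10.0, hop_s=5.0):
    """Average per-window embeddings, then L2-normalize the mean."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    starts = window_starts(len(waveform), win, hop)
    embs = np.stack([encode_window(waveform[s:s + win]) for s in starts])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

def toy_encoder(chunk):
    """Hypothetical placeholder: simple statistics instead of a real model."""
    return np.array([chunk.mean(), chunk.std(), np.abs(chunk).max(), 1.0])

sr = 16000
wave = np.sin(2 * np.pi * 440 * np.arange(25 * sr) / sr)  # 25 s test tone
starts = window_starts(len(wave), 10 * sr, 5 * sr)
emb = slide_embed(wave, toy_encoder, sr=sr)
print(len(starts), num_frames(len(wave)), emb.shape)
```

Averaging before normalization, as the text describes, means each window contributes in proportion to its raw embedding norm; normalizing each window first would be a different design choice.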

4. Benchmarking, Empirical Results, and Comparisons

RefCLAPScore has been evaluated in multiple contexts:

  • Audio-Text Alignment (DCASE 2024 Task 9): Used as primary/auxiliary metric for language-queried audio source separation. Table 1 presents the following system-level results (Xiao et al., 2024):
    Method          SDR (dB)  SI-SDR (dB)  CLAPScore  RefCLAPScore
    Baseline        5.71      3.86         0.24       0.25
    Baseline-Augm.  5.94      4.19         0.24       0.25
    AudioSep        8.19      6.68         0.26       0.27
  • Caption Evaluation (BRACE Benchmark): Maximum F1 on the main sub-benchmarks reaches approximately 70 for the best LAION-CLAP variant in SLIDE mode; hallucination discrimination is higher, but never exceeds 89 F1 (Guo et al., 11 Dec 2025).
  • Correlation with Human Ratings: Correlation (Pearson) with SDR for audio separation use cases is moderate:
    • CLAPScore vs. SDR: 0.27; RefCLAPScore vs. SDR: 0.226 (Xiao et al., 2024).
    • On image captioning, RefCLIPScore consistently outperforms pure CLIPScore and leading reference-based metrics on literal description tasks (Hessel et al., 2021).
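
For readers unfamiliar with how such correlations are obtained, the sketch below computes a Pearson coefficient between SDR and RefCLAPScore using only the three system-level rows of the table above. This is purely a demonstration of the calculation: three aggregate points are far too few for a meaningful estimate, and the result is not comparable to the reported 0.226, which is assumed here to come from a much larger set of individual samples.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# System-level values copied from the table (Baseline, Baseline-Augm., AudioSep).
sdr = [5.71, 5.94, 8.19]
ref_clap = [0.25, 0.25, 0.27]

r_system = pearson_r(sdr, ref_clap)
print(f"system-level Pearson r = {r_system:.3f}")
```

The gap between a high system-level value and a weak sample-level one is a common pattern: a metric can rank systems sensibly on average while tracking individual outputs only loosely.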

5. Contextualization, Implementation Practices, and Limitations

RefCLAPScore unifies two semantic axes: content grounding in the input modality and overlap with one or more reference outputs. Its usability and performance are subject to several practical considerations:

  • Advantages:
    • Reference-augmented: Captures both modality fidelity and textual fluency/coverage.
    • Generalizability: Architecture-agnostic with respect to type of content (audio, image).
    • Computational efficiency: Single forward pass per encoder per pair; minimal preprocessing overhead; public models available.
  • Limitations:
    • Moderate alignment with classic distortion-based or n-gram metrics (e.g., SDR, CIDEr)—RefCLAPScore is complementary, not substitutive.
    • Sensitivity: Does not robustly capture subtle acoustic details or syntactic violations; scores may be insensitive to grammar when keywords match.
    • Windowing: Random crop/truncation instability, remedied by sliding aggregation at increased computational cost.
    • Pretraining Domain Bias: Performance may degrade for out-of-distribution content or rare modalities not present in pretraining data.
  • Best Practices:
    • Report both cross-modal and reference-only metrics in experimental studies.
    • Apply sliding window aggregation for longer/higher-variance samples (i.e., SLIDE-CLAP).
    • For datasets with multiple references, compute $\max_{k} \cos(\text{candidate}, \text{reference}_k)$ over references for each candidate.

6. Applications and Extensions

RefCLAPScore underpins evaluation in a variety of multimodal generation and retrieval settings:

  • Language-Queried Audio Source Separation: The reference-free CLAPScore term allows direct semantic evaluation of system outputs without ground-truth references, which is critical for real-world data; RefCLAPScore adds the reference term when ground truth is available (Xiao et al., 2024).
  • Foley Sound Synthesis: Used to replace inefficient post-generation filtering; latent-CLAP loss aligns synthesized content toward high-scoring RefCLAPScore regions, obviating the need for threshold-based sample selection (Karchkhadze et al., 2024).
  • Automatic Audio Captioning: Adopted for pairwise preference testing and quality measurement in benchmarks with and without high-quality reference captions (Guo et al., 11 Dec 2025).
  • Image Captioning: RefCLIPScore, the direct image counterpart, has established state-of-the-art correlations with expert and crowd judgments on literal datasets when references are supplied (Hessel et al., 2021).

RefCLAPScore is structurally analogous to RefCLIPScore for image–text; both fuse cross-modal compatibility and reference overlap through the harmonic mean, penalizing candidates that are deficient in either aspect. Reference-free metrics such as pure CLAPScore or CLIPScore offer broader applicability but lower fine-grained discrimination, especially in high-quality or adversarial/comparative settings.
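
A small numerical comparison makes the penalizing effect of the harmonic mean concrete. The component values below are made up for illustration and do not come from any benchmark.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores (0 if both are 0)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

def arithmetic_mean(a, b):
    return (a + b) / 2

# Two hypothetical candidates with the same arithmetic mean of components.
balanced = (0.5, 0.5)   # moderate grounding, moderate reference overlap
lopsided = (0.9, 0.1)   # strong grounding, almost no reference overlap

for name, (a, b) in {"balanced": balanced, "lopsided": lopsided}.items():
    print(name, round(arithmetic_mean(a, b), 2), round(harmonic_mean(a, b), 2))
```

Both candidates average to 0.5 arithmetically, but the harmonic mean drops the lopsided one to roughly 0.18, so a candidate cannot compensate for weak reference overlap with strong grounding alone.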

Limitations exposed by BRACE (e.g., plateauing F1 on human–human or subtle corruption discrimination) motivate further research into embedding architectures, domain coverage, and calibration. A plausible implication is that future iterations may integrate more granular, context-sensitive, or hierarchical fusion approaches to address current weaknesses in syntactic and fine-grained quality assessment.
