
Temporal Prompt Alignment Score (P-score)

Updated 28 January 2026
  • Temporal Prompt Alignment Score (P-score) is defined as the cosine similarity between a temporally aggregated video embedding and class-specific text prompt embeddings.
  • It integrates temporal modeling, contrastive learning, and uncertainty calibration to enhance fetal CHD classification from ultrasound videos.
  • Empirical results show that optimizing hyperparameters and incorporating CVAESM improves both discrimination and calibration, boosting metrics like F1 and AUC.

The Temporal Prompt Alignment Score ("P-score", Editor's term) is a metric that quantifies the alignment between a temporally aggregated video embedding and each class-specific text prompt embedding within the Temporal Prompt Alignment (TPA) framework for fetal congenital heart defect (CHD) classification from ultrasound videos. The P-score is formally defined as the cosine similarity between a video-level embedding produced by a temporal extractor and the projected embedding of a clinically motivated text prompt representing each candidate class. The P-score serves as the core building block for both classification and contrastive learning within this system, integrating temporal modeling, image-text alignment, and uncertainty calibration for robust video-based medical diagnosis (Taratynova et al., 21 Aug 2025).

1. Formal Definition

Let $x_t \in \mathbb{R}^{768}$ denote the feature vector for frame $t$, extracted via a frozen image encoder such as EchoCLIP or FetalCLIP. For a subclip of length $L$, these are stacked into $X = [x_1; \dots; x_L] \in \mathbb{R}^{L \times 768}$. A lightweight temporal extractor $f_\text{temp}$ (e.g., GNN, xLSTM, TCN) aggregates these frame-level features, producing a video-level embedding $h = f_\text{temp}(X) \in \mathbb{R}^{256}$.

For each class $c \in \{0,\dots,C-1\}$, let $\pi_c \in \mathbb{R}^{768}$ be the embedding of a class-specific text prompt encoded by a frozen text encoder. A learned projection $W_\text{txt} \in \mathbb{R}^{256 \times 768}$ yields $\pi_c^{\text{proj}} = W_\text{txt} \pi_c \in \mathbb{R}^{256}$.

The P-score for class $c$ is defined as

$$P_c \equiv s_c = \frac{\langle h, \pi_c^{\text{proj}} \rangle}{\|h\| \, \|\pi_c^{\text{proj}}\|} \in [-1, 1],$$

where $\langle \cdot, \cdot \rangle$ is the standard Euclidean inner product and $\|\cdot\|$ the $\ell_2$ norm. This cosine similarity measures the alignment between the temporally aggregated video features and each class prompt embedding (Taratynova et al., 21 Aug 2025).
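The definition above amounts to a batched cosine similarity. A minimal NumPy sketch (the function name and array shapes follow the notation in this section, but this is illustrative, not the authors' implementation):

```python
import numpy as np

def p_scores(h: np.ndarray, prompts_proj: np.ndarray) -> np.ndarray:
    """P-score of a video embedding against each class prompt.

    h:            (256,)   temporally aggregated video embedding
    prompts_proj: (C, 256) projected class prompt embeddings
    Returns a (C,) vector of cosine similarities in [-1, 1].
    """
    h_norm = h / np.linalg.norm(h)
    p_norm = prompts_proj / np.linalg.norm(prompts_proj, axis=1, keepdims=True)
    return p_norm @ h_norm
```

A prompt embedding identical (up to scale) to the video embedding yields a P-score of exactly 1, the maximum alignment.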

2. Stepwise Computation Procedure

The computation of the P-score proceeds as follows:

  1. Frame-level Feature Extraction:
    • $L$ consecutive video frames $\{I_t\}_{t=1}^{L}$ are sampled from the ultrasound sequence.
    • For each frame, a frozen image encoder maps $I_t$ to $x_t = f_\text{img}(I_t) \in \mathbb{R}^{768}$.
  2. Temporal Aggregation:
    • The sequence of frame features $X = [x_1;\dots;x_L]$ is passed through a temporal extractor: $h = f_\text{temp}(X) \in \mathbb{R}^{256}$.
  3. Prompt Embedding Construction:
    • For each class $c$, a concise, task-specific text prompt (e.g., “Is the fetal heart normal in this 4CH ultrasound view?”) is encoded to $\pi_c$ via a frozen text encoder.
    • A learned linear layer projects $\pi_c$ into the video-embedding space: $\pi_c^{\text{proj}} = W_\text{txt} \pi_c$.
  4. Alignment Computation:
    • Calculate the P-score for each class as the cosine similarity between $h$ and $\pi_c^{\text{proj}}$.
  5. Softmax Normalization (optional):
    • The vector of P-scores $(P_c)_{c=0}^{C-1}$ can be temperature-softmaxed:

    $$p_c = \frac{\exp(P_c/\tau)}{\sum_{j=0}^{C-1} \exp(P_j/\tau)}, \qquad \tau > 0.$$

The highest $p_c$ determines the predicted class; the $P_c$ values thus underlie both class assignment and estimated confidence (Taratynova et al., 21 Aug 2025).
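The five steps can be strung together end to end. In the sketch below, random linear maps stand in for the frozen encoders and learned modules (these stubs, and the temperature default, are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the frozen encoders and learned modules (illustrative only):
W_img  = rng.normal(size=(768, 32 * 32)) / 32.0   # stub frozen image encoder
W_temp = rng.normal(size=(256, 768)) / 768.0      # stub temporal extractor (mean-pool + linear)
W_txt  = rng.normal(size=(256, 768)) / 768.0      # learned text projection W_txt

def classify_subclip(frames, prompt_embs, tau=0.07):
    """frames: (L, 32*32) flattened frames; prompt_embs: (C, 768) prompt embeddings.

    Returns (predicted class, softmax-normalized P-score vector)."""
    X = frames @ W_img.T                  # 1. frame-level features, (L, 768)
    h = W_temp @ X.mean(axis=0)           # 2. temporal aggregation -> (256,)
    pi = prompt_embs @ W_txt.T            # 3. project prompts -> (C, 256)
    s = (pi @ h) / (np.linalg.norm(pi, axis=1) * np.linalg.norm(h))  # 4. P-scores
    e = np.exp((s - s.max()) / tau)       # 5. numerically stable temperature softmax
    p = e / e.sum()
    return int(p.argmax()), p
```

The subtraction of `s.max()` before exponentiation leaves the softmax unchanged but avoids overflow at small temperatures.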

3. Hyperparameters Impacting P-score Dynamics

Several key hyperparameters control the behavior and discriminative power of P-scores:

  • Margin $m > 0$: Used in the margin-hinge contrastive loss, enforcing a required gap between the true-class P-score and the hardest negative P-score.

  • Contrastive weight $\alpha \ge 0$: Balances the contribution of the margin-hinge contrastive loss relative to the conventional cross-entropy classification loss.

  • Temperature $\tau > 0$: Controls the sharpness or flatness of the softmax over P-scores, affecting output confidence calibration.

  • CVAESM KL-weight $\beta \ge 0$: Sets the strength of the KL-divergence regularizer in the Conditional Variational Autoencoder Style Modulation module for uncertainty estimation.

Optimizing these hyperparameters is essential for achieving favorable trade-offs among discrimination, alignment, and calibration (Taratynova et al., 21 Aug 2025).
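The four hyperparameters can be gathered into one configuration object. The margin and contrastive-weight defaults below follow the values this article later reports as optimal ($m = 0.5$, $\alpha = 0.5$); the $\tau$ and $\beta$ defaults are illustrative placeholders, not values from the paper:

```python
from dataclasses import dataclass

@dataclass
class TPAHyperparams:
    margin: float = 0.5   # m: gap enforced by the margin-hinge contrastive loss
    alpha: float = 0.5    # weight of the contrastive loss vs. cross-entropy
    tau: float = 0.07     # softmax temperature (placeholder, not from the paper)
    beta: float = 1e-3    # CVAESM KL weight (placeholder, not from the paper)
```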

4. Training, Inference, and Calibration Integration

The P-score’s role in training, inference, and calibration is multifaceted:

  • Classification Loss:

    • Standard cross-entropy is computed on temperature-softmaxed P-scores: $\mathcal{L}_{\mathrm{cls}} = -\sum_c y_c \log p_c$, where $y_c$ is the one-hot true label.
  • Margin-hinge Contrastive Loss:
    • With positive P-score $s^+ = P_y$ for the true class $y$ and hardest negative $s^- = \max_{j \neq y} P_j$, the loss is

    $$\mathcal{L}_{\mathrm{ctr}} = \max(0,\, m - s^+ + s^-).$$

    • The total loss (omitting uncertainty) is $\mathcal{L}_\mathrm{total} = \mathcal{L}_{\mathrm{cls}} + \alpha \mathcal{L}_{\mathrm{ctr}}$.

  • Uncertainty Quantification (CVAESM):

    • A latent style vector $z \in \mathbb{R}^{256}$, learned via a CVAE conditioned on $(h, y)$, modulates $h$ through an elementwise affine transformation: $\tilde h = h \odot (1 + g(z))$.
    • During training, $z$ is sampled from the variational posterior $q(z \mid h, y)$; at inference, its mean under the prior $p(z \mid h)$ is used.
    • The KL-divergence term for style regularization enters the final objective with weight proportional to $\beta$.
    • $\tilde h$ replaces $h$ when computing P-scores at both training and test time; the P-score thus remains central to stochastic as well as deterministic inference.
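The two P-score-based loss terms and the CVAESM-style modulation can be sketched as follows (an illustrative NumPy version of the formulas above, not the authors' code):

```python
import numpy as np

def tpa_loss(scores, y, margin=0.5, alpha=0.5, tau=0.07):
    """Cross-entropy on temperature-softmaxed P-scores plus the
    margin-hinge contrastive term.

    scores: (C,) P-scores; y: index of the true class."""
    e = np.exp((scores - scores.max()) / tau)   # stable temperature softmax
    p = e / e.sum()
    l_cls = -np.log(p[y])                       # cross-entropy with one-hot label
    s_pos = scores[y]                           # s+ : true-class P-score
    s_neg = np.max(np.delete(scores, y))        # s- : hardest negative P-score
    l_ctr = max(0.0, margin - s_pos + s_neg)    # margin-hinge contrastive term
    return l_cls + alpha * l_ctr

def style_modulate(h, gz):
    """CVAESM-style elementwise affine modulation: h~ = h * (1 + g(z))."""
    return h * (1.0 + gz)
```

When the true-class score already exceeds the hardest negative by the margin, the hinge term vanishes and only cross-entropy drives the gradient.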

A plausible implication is that the modularity of P-score computation enables seamless integration with both loss functions and uncertainty-aware calibration procedures (Taratynova et al., 21 Aug 2025).

5. Empirical Behavior and Calibration Performance

Empirical evaluation of P-score dynamics reveals its influence on model confidence and calibration:

  • Calibration Shifts:
    • Reliability diagrams show that following application of the CVAESM module, the distribution of maximum P-score (post-softmax) confidences shifts from overconfident 90–100% bins into better-calibrated 60–90% ranges.
    • Expected Calibration Error (ECE) is reduced from approximately 16% to 10%; Adaptive ECE decreases from about 40% to 33% after uncertainty modulation.
  • Decision Thresholding:
    • No explicit fixed P-score threshold is used; class prediction is always determined via the argmax over softmax-normalized P-scores.
  • Performance Impact:
    • Across temporal aggregation modules (GNN, xLSTM, TCN), contrastive regularization based on the P-score consistently improves macro F1 and AUC.
    • Introducing CVAESM style-modulation slightly decreases mean F1 (by about 1%) but yields substantial calibration gains.

This suggests that the P-score not only acts as a discriminative signal for class assignment but also underpins the model’s confidence estimation and reliability under uncertainty-aware inference (Taratynova et al., 21 Aug 2025).
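The ECE figures quoted above come from the standard binned estimator: bin predictions by confidence and average the absolute accuracy-confidence gap, weighted by bin population. A minimal sketch (generic equal-width binning, not necessarily the paper's exact protocol):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: population-weighted mean |accuracy - confidence| per bin.

    confidences: (N,) max softmax P-score probabilities
    correct:     (N,) booleans, True where argmax matched the label"""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.sum() / n * gap
    return ece
```

A perfectly calibrated model (e.g., 75% accuracy at 75% confidence) yields an ECE of zero; maximal overconfidence yields an ECE near one.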

6. Relation to Contrastive Learning and Discriminative Alignment

The P-score is the foundational measure of semantic alignment between video features and text prompt prototypes in the TPA architecture:

  • Contrastive Regularization:
    • The margin-hinge loss, operating on P-scores, enforces separation between true-class and non-true-class alignments. Specifically, $s^+$ must exceed $s^-$ by at least the margin $m$, promoting clustering of intra-class embeddings around their respective prompts.
  • Optimal Hyperparameter Regimes:
    • Empirically, margin $m = 0.5$ and weight $\alpha = 0.5$ provide an optimal balance, yielding macro F1 ≈ 84.7% and AUC ≈ 87.6% on binary CHD detection. Extremes in $m$ or $\alpha$ can degrade discrimination either by over-separating or by insufficiently utilizing the prompt structure.
  • General Applicability:
    • The P-score driven contrastive term consistently improved macro F1 and AUC across task variants and extractor architectures, indicating robust effectiveness in class separation mediated by prompt alignment.

In summary, the Temporal Prompt Alignment Score (P-score) is the cosine similarity between the temporally aggregated video embedding and each class’s projected text prompt embedding. It is the central decision statistic for both softmax-based classification and for enforcing class-prototype clustering in the embedding space, with its behavior and downstream performance governed by careful choice of margin, contrastive loss weight, and temperature (Taratynova et al., 21 Aug 2025).
