Temporal Prompt Alignment Score (P-score)
- Temporal Prompt Alignment Score (P-score) is defined as the cosine similarity between a temporally aggregated video embedding and class-specific text prompt embeddings.
- It integrates temporal modeling, contrastive learning, and uncertainty calibration to enhance fetal CHD classification from ultrasound videos.
- Empirical results show that optimizing hyperparameters and incorporating CVAESM improves both discrimination and calibration, boosting metrics like F1 and AUC.
The Temporal Prompt Alignment Score ("P-score", Editor's term) is a metric that quantifies the alignment between a temporally aggregated video embedding and each class-specific text prompt embedding within the Temporal Prompt Alignment (TPA) framework for fetal congenital heart defect (CHD) classification from ultrasound videos. The P-score is formally defined as the cosine similarity between a video-level embedding produced by a temporal extractor and the projected embedding of a clinically motivated text prompt representing each candidate class. The P-score serves as the core building block for both classification and contrastive learning within this system, integrating temporal modeling, image-text alignment, and uncertainty calibration for robust video-based medical diagnosis (Taratynova et al., 21 Aug 2025).
1. Formal Definition
Let $f_t \in \mathbb{R}^d$ denote the feature vector for frame $t$, extracted via a frozen image encoder such as EchoCLIP or FetalCLIP. For a subclip of length $T$, these are stacked into $F = [f_1; \dots; f_T] \in \mathbb{R}^{T \times d}$. A lightweight temporal extractor $g_\phi$ (e.g., GNN, xLSTM, TCN) is used to aggregate these frame-level features, producing a video-level embedding $v = g_\phi(F) \in \mathbb{R}^d$.
For each class $c \in \{1, \dots, C\}$, let $e_c$ be the embedding of a class-specific text prompt encoded by a frozen text encoder. A learned projection yields $p_c = W e_c + b$.
The P-score for class $c$ is defined as

$$s_c = \frac{\langle v, p_c \rangle}{\|v\| \, \|p_c\|},$$

where $\langle \cdot, \cdot \rangle$ is the standard Euclidean inner product and $\|\cdot\|$ the Euclidean norm. This cosine similarity measures the alignment between the temporally aggregated video features and each class prompt embedding (Taratynova et al., 21 Aug 2025).
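As a minimal numerical sketch of this definition (the embedding dimension and prompt names here are illustrative, not taken from the paper):

```python
import numpy as np

def p_score(v, p):
    """Cosine similarity between a video-level embedding v and a projected
    class-prompt embedding p (1-D arrays of equal dimension)."""
    return float(np.dot(v, p) / (np.linalg.norm(v) * np.linalg.norm(p)))

# Toy 4-dimensional embedding space with two hypothetical class prompts.
v = np.array([1.0, 0.0, 1.0, 0.0])          # aggregated video embedding
p_normal = np.array([1.0, 0.0, 1.0, 0.0])   # prompt aligned with v
p_chd = np.array([0.0, 1.0, 0.0, 1.0])      # prompt orthogonal to v
```

Here `p_score(v, p_normal)` is 1.0 and `p_score(v, p_chd)` is 0.0, the extremes of the alignment range.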
2. Stepwise Computation Procedure
The computation of the P-score proceeds as follows:
- Frame-level Feature Extraction:
- $T$ consecutive video frames $x_1, \dots, x_T$ are sampled from the ultrasound sequence.
- For each frame, a frozen image encoder maps $x_t$ to $f_t \in \mathbb{R}^d$.
- Temporal Aggregation:
- The sequence of frame features is passed through a temporal extractor: $v = g_\phi(f_1, \dots, f_T)$.
- Prompt Embedding Construction:
- For each class $c$, a concise, task-specific text prompt (e.g., “Is the fetal heart normal in this 4CH ultrasound view?”) is encoded to $e_c$ via a frozen text encoder.
- A learned linear layer projects $e_c$ into the video-embedding space: $p_c = W e_c + b$.
- Alignment Computation:
- Calculate the P-score for each class as the cosine similarity between $v$ and $p_c$: $s_c = \langle v, p_c \rangle / (\|v\| \, \|p_c\|)$.
- Softmax Normalization (optional):
- The vector of P-scores $(s_1, \dots, s_C)$ can be temperature-softmaxed: $\hat{y}_c = \exp(s_c / \tau) \big/ \sum_{c'} \exp(s_{c'} / \tau)$.
The highest $s_c$ determines the predicted class; the $\hat{y}_c$ values thus underlie both class assignment and estimated confidence (Taratynova et al., 21 Aug 2025).
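The steps above can be sketched end to end as follows; the mean-pooling extractor and random projection are stand-ins for the paper's learned components, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_extractor(frame_feats):
    # Stand-in for the paper's GNN/xLSTM/TCN extractors: mean pooling over frames.
    return frame_feats.mean(axis=0)

def softmax(scores, tau=0.07):
    z = scores / tau
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

T, d, C = 8, 16, 2                        # frames, feature dim, classes (illustrative)
frame_feats = rng.normal(size=(T, d))     # f_1..f_T from a frozen image encoder
prompt_embs = rng.normal(size=(C, d))     # frozen text-encoder outputs e_c
W = rng.normal(size=(d, d))               # learned projection (random stand-in)

v = temporal_extractor(frame_feats)                                  # video embedding
p = prompt_embs @ W.T                                                # projected prompts p_c
scores = (p @ v) / (np.linalg.norm(p, axis=1) * np.linalg.norm(v))   # P-scores s_c
probs = softmax(scores)                                              # temperature softmax
pred = int(np.argmax(probs))                                         # predicted class
```

Because each score is a cosine similarity, every entry of `scores` lies in $[-1, 1]$ regardless of the extractor used.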
3. Hyperparameters Impacting P-score Dynamics
Several key hyperparameters control the behavior and discriminative power of P-scores:
- Margin $m$: Used in the margin-hinge contrastive loss, enforcing a required gap between the true-class P-score and the hardest negative P-score.
- Contrastive weight $\lambda$: Balances the contribution of the margin-hinge contrastive loss relative to the conventional cross-entropy classification loss.
- Temperature $\tau$: Controls the sharpness or flatness of the softmax over P-scores, affecting output confidence calibration.
- CVAESM KL-weight $\beta$: Sets the strength of the KL-divergence regularizer in the Conditional Variational Autoencoder Style Modulation (CVAESM) module for uncertainty estimation.
Optimizing these hyperparameters is essential for achieving favorable trade-offs among discrimination, alignment, and calibration (Taratynova et al., 21 Aug 2025).
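To illustrate the temperature's effect on calibration, a lower $\tau$ sharpens the softmax over a fixed pair of P-scores (the values below are chosen for demonstration only, not drawn from the paper):

```python
import numpy as np

def softmax(s, tau):
    z = (s - s.max()) / tau      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([0.6, 0.4])        # two P-scores, 0.2 apart
sharp = softmax(scores, tau=0.05)    # low temperature: near one-hot
flat = softmax(scores, tau=1.0)      # high temperature: closer to uniform
```

With these values, `sharp[0]` is roughly 0.98 while `flat[0]` is roughly 0.55, showing how $\tau$ alone moves the reported confidence without changing the argmax.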
4. Training, Inference, and Calibration Integration
The P-score’s role in training, inference, and calibration is multifaceted:
- Classification Loss:
- Standard cross-entropy is computed on the temperature-softmaxed P-scores: $\mathcal{L}_{\mathrm{CE}} = -\sum_c y_c \log \hat{y}_c$, where $y$ is the one-hot true label.
- Margin-hinge Contrastive Loss:
- With the true-class P-score $s_y$ and the hardest negative $s_{\mathrm{neg}} = \max_{c \neq y} s_c$, the loss is $\mathcal{L}_{\mathrm{hinge}} = \max(0,\, m - s_y + s_{\mathrm{neg}})$.
- The total loss (omitting uncertainty) is $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \, \mathcal{L}_{\mathrm{hinge}}$.
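The two loss terms can be sketched together as follows; the `margin`, `lam`, and `tau` defaults are illustrative, not the paper's tuned values:

```python
import numpy as np

def losses(scores, y, margin=0.2, lam=0.5, tau=0.07):
    """Cross-entropy on temperature-softmaxed P-scores plus the margin-hinge
    term; margin/lam/tau are illustrative defaults, not the paper's values."""
    z = scores / tau
    z = z - z.max()
    probs = np.exp(z) / np.exp(z).sum()
    ce = -np.log(probs[y])                        # L_CE for one-hot label y
    s_neg = np.max(np.delete(scores, y))          # hardest negative P-score
    hinge = max(0.0, margin - scores[y] + s_neg)  # L_hinge
    return ce + lam * hinge, ce, hinge            # total, CE, hinge

# Well-separated case: the true class leads by more than the margin, so hinge = 0.
total, ce, hinge = losses(np.array([0.8, 0.3, 0.1]), y=0)
```

When the true-class lead shrinks below the margin (e.g., scores of 0.4 vs. 0.35), the hinge term becomes positive and pushes the embeddings apart.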
- Uncertainty Quantification (CVAESM):
- A latent style vector $z$, learned via a CVAE conditioned on $v$, modulates $v$ through an elementwise affine transformation: $\tilde{v} = \gamma(z) \odot v + \delta(z)$.
- During training, $z$ is sampled from the variational posterior $q(z \mid v)$; at inference, its mean under the prior is used.
- The KL-divergence term for style regularization enters the final objective weighted by $\beta$.
- $\tilde{v}$ replaces $v$ when computing P-scores at both training and test time; the P-score thus remains central to stochastic as well as deterministic inference.
A plausible implication is that the modularity of P-score computation enables seamless integration with both loss functions and uncertainty-aware calibration procedures (Taratynova et al., 21 Aug 2025).
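Under the assumptions above, the modulation mechanics can be sketched as follows; the linear scale/shift heads, their near-identity initialisation, and all shapes are illustrative stand-ins rather than the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
d, dz = 16, 4                               # embedding / latent dims (illustrative)

# Hypothetical stand-ins for the CVAE heads mapping z to scale and shift.
W_gamma = 0.1 * rng.normal(size=(d, dz))
W_delta = 0.1 * rng.normal(size=(d, dz))

def modulate(v, z):
    gamma = 1.0 + W_gamma @ z               # elementwise scale, identity at z = 0
    delta = W_delta @ z                     # elementwise shift
    return gamma * v + delta                # v~ = gamma(z) * v + delta(z)

v = rng.normal(size=d)                      # temporally aggregated video embedding
mu, logvar = np.zeros(dz), np.zeros(dz)     # variational posterior parameters

z_train = mu + np.exp(0.5 * logvar) * rng.normal(size=dz)   # sampled during training
z_infer = mu                                                # mean used at inference

v_train = modulate(v, z_train)              # stochastic embedding for training
v_infer = modulate(v, z_infer)              # deterministic embedding for inference

# KL term against a standard-normal prior, weighted by beta in the full objective.
kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
```

With a standard-normal posterior the KL term vanishes and the inference-time embedding reduces to $v$ itself; training samples perturb the embedding, and P-scores are then computed on the modulated vector exactly as before.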
5. Empirical Behavior and Calibration Performance
Empirical evaluation of P-score dynamics reveals its influence on model confidence and calibration:
- Calibration Shifts:
- Reliability diagrams show that following application of the CVAESM module, the distribution of maximum P-score (post-softmax) confidences shifts from overconfident 90–100% bins into better-calibrated 60–90% ranges.
- Expected Calibration Error (ECE) is reduced from approximately 16% to 10%; Adaptive ECE decreases from about 40% to 33% after uncertainty modulation.
- Decision Thresholding:
- No explicit fixed P-score threshold is used; class prediction is always determined via the argmax over softmax-normalized P-scores.
- Performance Impact:
- Across temporal aggregation modules (GNN, xLSTM, TCN), contrastive regularization based on the P-score consistently improves macro F1 and AUC.
- Introducing CVAESM style-modulation slightly decreases mean F1 (by about 1%) but yields substantial calibration gains.
This suggests that the P-score not only acts as a discriminative signal for class assignment but also underpins the model’s confidence estimation and reliability under uncertainty-aware inference (Taratynova et al., 21 Aug 2025).
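The Expected Calibration Error reported above can be computed with the standard binning scheme; this is the common textbook definition, not code from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin-weight-averaged |accuracy - mean confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Toy overconfident model: 95% stated confidence but only 70% actual accuracy.
conf = np.full(100, 0.95)
hits = np.array([1.0] * 70 + [0.0] * 30)
ece = expected_calibration_error(conf, hits)
```

For this toy model all predictions fall in one bin, so `ece` comes out to 0.25 (|0.70 − 0.95|); calibration modules such as CVAESM aim to shrink exactly this gap.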
6. Relation to Contrastive Learning and Discriminative Alignment
The P-score is the foundational measure of semantic alignment between video features and text prompt prototypes in the TPA architecture:
- Contrastive Regularization:
- The margin-hinge loss, operating on P-scores, enforces separation between true-class and non-true-class alignments. Specifically, $s_y$ must exceed $s_{\mathrm{neg}} = \max_{c \neq y} s_c$ by at least the margin $m$, promoting clustering of intra-class embeddings around their respective prompts.
- Optimal Hyperparameter Regimes:
- Empirically, intermediate values of the margin $m$ and contrastive weight $\lambda$ provide an optimal balance, yielding macro F1 ≈ 84.7% and AUC ≈ 87.6% on binary CHD detection. Extremes in $m$ or $\lambda$ can degrade discrimination, either by over-separating embeddings or by underutilizing the prompt structure.
- General Applicability:
- The P-score driven contrastive term consistently improved macro F1 and AUC across task variants and extractor architectures, indicating robust effectiveness in class separation mediated by prompt alignment.
In summary, the Temporal Prompt Alignment Score (P-score) is the cosine similarity between the temporally aggregated video embedding and each class’s projected text prompt embedding. It is the central decision statistic for both softmax-based classification and for enforcing class-prototype clustering in the embedding space, with its behavior and downstream performance governed by careful choice of margin, contrastive loss weight, and temperature (Taratynova et al., 21 Aug 2025).