
CLIP-RT+AA: Radiology & Authorship Attribution

Updated 9 February 2026
  • The paper introduces CLIP-RT+AA, integrating radiology-tuned CLIP with an off-diagonal auto-adjustment loss to enhance normal/abnormal case discrimination.
  • The method employs a refined text filtering and pseudo-labeling pipeline to mitigate false negatives in abnormal reports.
  • The framework demonstrates significant AUC gains and improved retrieval performance in both radiology and multimodal authorship attribution settings.

CLIP-RT+AA refers to the integration of Contrastive Language-Image Pre-Training with a radiology-tuned backbone (CLIP-RT) and an Auto-Adjustment (AA) refinement that enhances authorship attribution (AA) and normal/abnormal case discrimination in radiological and multimodal detection settings. The term encapsulates frameworks that combine radiology-specialized CLIP models with structured off-diagonal loss terms or prompt augmentation to address class imbalance, low inter-sample variability in normal cases, and adversarial robustness. Key instantiations include the OFF-CLIP approach for medical vision-language models and CLIP-style retrieval in authorship attribution for domains such as online escort advertisements.

1. Standard CLIP-RT in Radiology and Multimodal Authorship Attribution

The initial adaptation of CLIP to radiology (CLIP-RT) uses paired image–text data, with domain-specialized backbones such as ViT-B/16 (M3AE-pretrained) for images and BioBERT for reports (Park et al., 3 Mar 2025). The contrastive learning paradigm aligns joint embeddings so that true image–report pairs have maximal similarity, typically employing the InfoNCE loss:

\mathcal{L}_{\mathrm{nc}} = -\frac{1}{B} \sum_{i=1}^B \left[ \log\frac{\exp(S_{i,i})}{\sum_{j=1}^B \exp(S_{i,j})} + \log\frac{\exp(S_{i,i})}{\sum_{j=1}^B \exp(S_{j,i})} \right],

where S_{i,j} is the scaled cosine similarity between the i-th image and j-th text embeddings and B is the batch size.
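A minimal plain-Python sketch of the symmetric InfoNCE loss above, assuming the similarity matrix is already temperature-scaled (the function name and data layout are illustrative, not from the paper):

```python
import math

def info_nce(S):
    """Symmetric InfoNCE loss over a B x B scaled similarity matrix S.

    True image-report pairs sit on the diagonal; each row (image -> text)
    and each column (text -> image) is treated as a softmax classification
    problem whose correct class is the diagonal entry.
    """
    B = len(S)
    total = 0.0
    for i in range(B):
        row_norm = sum(math.exp(S[i][j]) for j in range(B))  # image -> text
        col_norm = sum(math.exp(S[j][i]) for j in range(B))  # text -> image
        total += math.log(math.exp(S[i][i]) / row_norm)
        total += math.log(math.exp(S[i][i]) / col_norm)
    return -total / B
```

In a real training loop this would operate on logit tensors from the image and text encoders rather than Python lists.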

In multimodal authorship attribution (AA) such as the MATCHED dataset (Saxena et al., 2024), a similar CLIP-style retrieval-tuned (RT) baseline uses ViT-base and DeCLUTR-small for vision and text, with NT-Xent as its contrastive loss. These CLIP-RT architectures are evaluated on vendor identification and open-set retrieval, but face limitations in domains with low cross-modal semantic overlap.

2. Failure Modes in Normal Case Detection and Authorship Attribution

Standard CLIP contrastive learning in radiology suffers two principal failures (Park et al., 3 Mar 2025):

  • Normal Case Dispersion: Forcing all off-diagonal pairs apart, including normal–normal cases, disrupts clustering of normal samples. This yields high false positive rates for normal case detection.
  • Misaligned Text in Abnormal Reports: Abnormal reports often contain “normal” sentences (“no pneumonia”), which when treated as negative pairs with abnormal images induce false negatives.

In authorship attribution for domains like escort ads, the text and image have intentionally vague or non-overlapping semantics (Saxena et al., 2024). This undermines CLIP’s cross-modal alignment, leading to retrieval performance near random, with R-Precision below 0.01 for CLIP-ITC and BLIP2-style models.

3. Off-Diagonal Auto-Adjustment (AA) and Loss Formulation

The core Auto-Adjustment (AA) innovation is an off-diagonal loss term that provides sigmoid-based binary supervision encouraging semantic grouping of normal–normal pairs, while retaining the conventional contrastive loss for abnormal (or, in AA, same-author) pairs (Park et al., 3 Mar 2025). Define a pseudo-label matrix \hat{Y} \in \{0,1\}^{B \times B} with \hat{Y}_{i,j} = 1 for all normal–normal pairs and diagonal entries. The off-diagonal loss is:

\mathcal{L}_{\mathrm{off}} = -\frac{1}{2B^2} \sum_{i=1}^{B}\sum_{j=1}^{B} \left[ \hat{Y}_{i,j}\, \log \sigma(S_{i,j}) + (1-\hat{Y}_{i,j}) \log(1-\sigma(S_{i,j})) \right] + (i \leftrightarrow j),

where \sigma is the sigmoid function and (i \leftrightarrow j) denotes the same term with indices transposed.

For abnormal pairs, an abnormal-only InfoNCE loss \mathcal{L}_{\mathrm{ab}} is computed over the abnormal subset, and the total loss is:

\mathcal{L}_{\mathrm{OFF}} = \mathcal{L}_{\mathrm{off}} + \lambda_{\mathrm{ab}} \mathcal{L}_{\mathrm{ab}},

with \lambda_{\mathrm{ab}} = 1 in practice.
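A plain-Python sketch of the off-diagonal binary term (the \mathcal{L}_{\mathrm{ab}} component would be a standard InfoNCE over the abnormal subset; `Y_hat` stands for the pseudo-label matrix \hat{Y}, and the data layout is illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def off_diagonal_loss(S, Y_hat):
    """Binary cross-entropy over all (i, j) similarity pairs, pulling
    together pairs with Y_hat[i][j] == 1 (diagonal and normal-normal)
    and pushing apart the rest. The (i <-> j) term in the equation
    symmetrizes the two retrieval directions, so both S[i][j] and
    S[j][i] are scored here.
    """
    B = len(S)
    total = 0.0
    for i in range(B):
        for j in range(B):
            for s in (S[i][j], S[j][i]):  # both retrieval directions
                p = sigmoid(s)
                y = Y_hat[i][j]
                total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / (2 * B * B)
```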

This off-diagonal adjustment clusters all normal samples in the joint embedding space, directly reducing false positives, and sharpens the abnormal sample localization by preserving strong contrastive pressure among abnormal–abnormal and abnormal–normal samples.

4. Text Filtering and Pseudo-Labeling Pipeline

To mitigate false negatives from misleading normal sentences in abnormal reports, a sentence-level filtering pipeline is introduced (Park et al., 3 Mar 2025):

  1. GPT-4o templates the Findings/Impressions sections into atomic “There is {disease} / no {disease}” statements.
  2. Each sentence is pseudo-labeled as normal, abnormal, or uncertain using a pretrained sentence-level anomaly classifier.
  3. In abnormal reports, all normal/uncertain sentences are filtered out.
  4. Each image is paired only with truly abnormal sentences; \hat{Y} is assigned based on these refined pairs.

This pipeline ensures pseudo-label integrity for the off-diagonal loss, which is crucial when working with weak text-image alignment or variable text content.
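The filtering and labeling steps above can be sketched as follows (the `(text, pseudo_label)` tuple format and both helper names are hypothetical, chosen for illustration):

```python
def filter_abnormal_report(sentences):
    """Steps 3-4 for an abnormal report: keep only sentences
    pseudo-labeled 'abnormal', so normal/uncertain text is never
    paired against an abnormal image as a negative.

    `sentences` is a list of (text, pseudo_label) tuples produced by
    the templating and sentence-level anomaly-classifier stages.
    """
    return [text for text, label in sentences if label == "abnormal"]

def build_pseudo_label_matrix(report_labels):
    """Assign Y_hat from per-report labels after filtering:
    Y_hat[i][j] = 1 on the diagonal and for normal-normal pairs,
    0 otherwise.
    """
    B = len(report_labels)
    return [[1 if i == j or (report_labels[i] == "normal"
                             and report_labels[j] == "normal") else 0
             for j in range(B)] for i in range(B)]
```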

5. Empirical Gains and Quantitative Summary

OFF-CLIP (CLIP-RT+AA) yields large, measured performance gains in radiology and vision-language tasks (Park et al., 3 Mar 2025):

| Dataset | Baseline AUC (Normal) | OFF-CLIP AUC (Normal) | ΔAUC (Normal) |
|---|---|---|---|
| VinDr-CXR | 0.25 | 0.86 | +0.61 |
| Open-I | 0.32 | 0.74 | +0.42 |

Overall total AUC on VinDr-CXR increases from 0.79 to 0.87. Zero-shot anomaly grounding is sharpened, with mean pointing game accuracy rising from ~0.70 to ~0.83. These improvements reflect substantial reductions in both false positives (normals misclassified as abnormal) and false negatives (abnormalities missed due to misleading text).

In the context of authorship attribution, CLIP-RT baselines without AA or supervised end-to-end training yield negligible retrieval/R-Precision, whereas joint CE+SupCon multimodal AA reaches R-Precision ≈ 0.98 and Macro-F1 ≈ 0.98, underscoring the necessity of embedding structure beyond vanilla contrastive alignment (Saxena et al., 2024).

6. Generalization, Extensions, and Implementation Considerations

The off-diagonal loss and filtering components of AA are architecture-agnostic and do not require model modifications, facilitating adaptation to multi-view X-rays, 3D imaging, or fine-grained detection by redefining the notion of “normal” clusters (Park et al., 3 Mar 2025). Key points for implementation:

  • The method is bottlenecked by pseudo-label quality; improvements in sentence-level anomaly detection directly propagate to final results.
  • The single hyperparameter \lambda_{\mathrm{ab}} may require tuning under extreme class imbalance.
  • Learnable weighting between diagonal and off-diagonal supervision, and incorporation of soft (continuous) labels, remain open extensions.
  • In multimodal AA for low-alignment domains, end-to-end multi-task learning (joint cross-entropy and supervised contrastive objectives) outperforms any CLIP-style alignment, fusing stylometric and visual patterns into discriminative embeddings (Saxena et al., 2024).

7. Practical Recommendations and Impact

For best performance with CLIP-RT+AA frameworks, practitioners should:

  • Retain their existing CLIP radiology or retrieval backbone, adding only the off-diagonal loss and filtered text pairing module.
  • Integrate high-quality pseudo-labelers for sentence-level filtering.
  • Monitor normal versus abnormal AUC independently to detect class bias.
  • Validate not only classification metrics but also grounding/localization accuracy to ensure embedding improvements translate to interpretability.
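Monitoring normal and abnormal AUC independently amounts to scoring each class as its own binary problem. A minimal pure-Python AUC helper for such monitoring (a library routine such as scikit-learn's `roc_auc_score` would normally be used instead):

```python
def auc(scores, labels):
    """Mann-Whitney AUC: the probability that a randomly chosen
    positive (label 1) outscores a randomly chosen negative (label 0),
    with ties counting half. Run once with 'normal' as the positive
    class and once with 'abnormal' to detect class bias.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")  # AUC undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```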

The CLIP-RT+AA strategy advances robust discrimination and retrieval in domains (radiology, forensic authorship attribution) where high normal inter-sample similarity or mismatched modality content invalidates naive contrastive modeling. As radiology and other domains adopt vision-language frameworks, these auto-adjustment strategies provide modular, scalable paths to improved reliability and practical utility (Park et al., 3 Mar 2025, Saxena et al., 2024).
