SCALE-VLP: Soft-Weighted Vision–Language Pretraining
- SCALE-VLP is a vision–language framework that employs soft-weighted contrastive loss to fuse 3D volumetric medical data with textual reports.
- It integrates spatial semantics via 3D ViTs and explicit anatomical structure, ensuring precise cross-modal alignment between CT scans and radiology reports.
- The framework achieves significant gains in retrieval and generation metrics, demonstrating the efficacy of quality-driven, multimodal pretraining strategies.
SCALE-VLP refers to two distinct frameworks in vision-language modeling that share a focus on improving cross-modal learning via soft-weighted scoring but are instantiated for different domains: SCALE-VLP for volumetric medical vision-language pretraining (Mahdizadeh et al., 4 Nov 2025), and SCALE-VLP in the sense of unified modality scoring ("SCALE"), a quality-driven dataset-curation pipeline for instruction tuning (Xu et al., 10 Jun 2025). Both frameworks introduce principled methodologies to overcome the limitations of binary or unimodal supervision, leveraging structured, continuous relevance metrics and task-aware alignment.
1. Soft-Weighted Contrastive Vision–Language Pre-Training for Volumetric Data
The SCALE-VLP framework (Mahdizadeh et al., 4 Nov 2025) addresses the challenge of representing, aligning, and transferring knowledge in 3D volumetric data—specifically, clinical CT volumes—paired with radiology reports. Most previous vision-language models (VLMs) are restricted to 2D inputs and rely on binary image-text supervision; this is inadequate for data with inherent spatial coherence and domain-structured semantics such as medical image series.
Model Architecture and Inputs
Volumetric Encoder: A frozen 3D ViT (pretrained on RadImageNet) encodes CT scans that have been resampled to a fixed voxel grid and quantized to 8 bits. Each volume is partitioned into non-overlapping cubic patches, each carrying a 3D positional encoding. The patch tokens are aggregated by a lightweight transformer "vision head" into spatially localized vision tokens plus a global volume-level token; outputs are linearly projected and $\ell_2$-normalized (a patchification sketch follows this list).
Text Encoder: Free-text radiology reports are tokenized with the BioClinicalBERT vocabulary and passed through a fine-tuned BERT to extract a projected, $\ell_2$-normalized report embedding. All BERT weights and text projection layers are updated during pretraining.
Pairing and Alignment: Each CT volume is paired with its corresponding radiology report. Within each batch, the resulting vision and text embeddings are aligned via a contrastive objective.
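To make the volumetric tokenization concrete, here is a minimal PyTorch sketch of non-overlapping cubic patchification with a learnable 3D positional table. The patch size, embedding dimension, and grid shape are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VolumePatchEmbed3D(nn.Module):
    """Split a CT volume into non-overlapping cubic patches and embed them.
    Patch size, embedding dimension, and grid shape are illustrative choices,
    not the paper's exact configuration."""
    def __init__(self, patch=16, in_ch=1, dim=768, grid=(8, 8, 8)):
        super().__init__()
        # A strided Conv3d flattens each non-overlapping cube and applies a
        # shared linear projection in one step.
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
        # Learnable 3D positional table, one entry per cubic patch position.
        self.pos = nn.Parameter(torch.zeros(1, grid[0] * grid[1] * grid[2], dim))

    def forward(self, vol):                  # vol: (B, 1, D, H, W)
        x = self.proj(vol)                   # (B, dim, D/p, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)     # (B, N_patches, dim)
        return x + self.pos                  # add 3D positional encodings

# Usage with assumed shapes: a 128^3 volume and 16^3 patches give 512 tokens.
tokens = VolumePatchEmbed3D()(torch.randn(2, 1, 128, 128, 128))  # (2, 512, 768)
```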
2. Soft-Weighted Contrastive Alignment Loss (SWCA)
Traditional binary supervision (as in InfoNCE or CLIP) is replaced by a continuously weighted, graded loss. The SWCA incorporates intra-modal similarity, cross-modal similarity, and weights informed by both structural and knowledge priors.
Key Formulations
- Intra-modal similarity: cosine similarities between embeddings of the same modality (e.g., report–report), scaled by a fixed scale factor, provide graded relatedness scores within a batch.
- Cross-modal similarity: cosine similarities between the $\ell_2$-normalized vision and text embeddings, scaled by a learnable temperature, form the alignment logits.
- SWCA loss (one-sided, vision-to-text): a sigmoid-based pairwise loss whose targets replace the hard indicator $\delta_{ij}$ ($\delta_{ij}=1$ if $i=j$ and $0$ otherwise) with continuous weights derived from the intra-modal similarities.
- Symmetric loss: the vision-to-text and text-to-vision directions are combined symmetrically to give the final SWCA objective (one possible instantiation is sketched below).
This formulation enables graded, non-binary supervision reflecting structured relationships in the data.
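The exact SWCA equations are not reproduced in this summary, so the following PyTorch sketch shows one plausible instantiation under stated assumptions: soft targets blend the identity indicator $\delta_{ij}$ with scaled intra-modal (report–report) similarities, and a sigmoid-based pairwise loss in the spirit of SigLIP is applied to the temperature-scaled cross-modal logits. The function names, blending coefficient, and exact weight construction are illustrative.

```python
import torch
import torch.nn.functional as F

def swca_one_sided(z_v, z_t, log_tau, kappa=5.0, alpha=0.5):
    """One-sided (vision-to-text) soft-weighted contrastive loss sketch.

    z_v, z_t : (B, d) l2-normalized vision / text embeddings
    log_tau  : learnable log-temperature for the cross-modal similarities
    kappa    : fixed scale for the intra-modal similarities
    alpha    : assumed blend between hard identity targets and soft weights
    """
    B = z_v.size(0)
    delta = torch.eye(B, device=z_v.device)                # delta_ij = 1 iff i == j
    intra = torch.softmax(kappa * (z_t @ z_t.T), dim=-1)   # graded report-report weights
    targets = alpha * delta + (1.0 - alpha) * intra        # soft, non-binary supervision

    logits = (z_v @ z_t.T) * log_tau.exp()                 # cross-modal similarities
    # Sigmoid-based pairwise loss: no row-wise softmax over the batch, which is
    # what the implementation notes credit for the reduced VRAM footprint.
    return F.binary_cross_entropy_with_logits(logits, targets)

def swca_symmetric(z_v, z_t, log_tau):
    """Symmetric objective: average the vision-to-text and text-to-vision terms."""
    return 0.5 * (swca_one_sided(z_v, z_t, log_tau) + swca_one_sided(z_t, z_v, log_tau))
```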
3. Integration of Volumetric Spatial Semantics
SCALE-VLP preserves anatomical structure via explicit 3D spatial statistics for each volume.
Spatial Kernel Construction
- Patch centroid and saliency: for each volume, 3D patch centroids and per-patch saliency scores are computed from the volumetric tokens.
- Spatial proximity kernel: a proximity kernel over these centroid statistics measures how spatially similar two volumes are.
- Spatial similarity weights: the kernel values are normalized into soft pairwise weights over the batch.
- Spatial SWCA loss: these spatial weights are used in place of the intra-modal similarity weights in the SWCA objective (see the sketch after this subsection).
Structural consistency is implicitly enforced by operating over 3D centroid distributions and patch covariance, penalizing discordant spatial matches.
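The spatial-kernel formulas themselves are not shown above; the sketch below illustrates one way the described quantities (patch centroids, saliencies, a proximity kernel, normalized weights) could be turned into batch-level soft weights. The saliency-weighted centroid summary and the Gaussian kernel are assumptions, not the paper's exact construction.

```python
import torch

def spatial_weights(patch_centroids, patch_saliency, sigma=1.0):
    """Turn per-volume patch geometry into pairwise soft weights (sketch).

    patch_centroids : (B, N, 3) 3D centroids of the N cubic patches per volume
    patch_saliency  : (B, N)    per-patch saliency (e.g., token norms)
    Returns a (B, B) matrix of spatial similarity weights.
    """
    # Saliency-weighted volume-level centroid summarizes each scan's geometry.
    w = patch_saliency / patch_saliency.sum(dim=1, keepdim=True)   # (B, N)
    mu = (w.unsqueeze(-1) * patch_centroids).sum(dim=1)            # (B, 3)

    # Gaussian (RBF) proximity kernel over centroid distances (assumed form).
    d2 = torch.cdist(mu, mu).pow(2)                                # (B, B)
    K = torch.exp(-d2 / (2 * sigma ** 2))

    # Normalize rows so the weights can stand in for intra-modal targets in SWCA.
    return K / K.sum(dim=1, keepdim=True)
```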
4. Incorporation of Domain-Aware, Knowledge-Infused Semantics
To capture the complex synonymy, compositional, and hierarchical relationships in radiological findings, SCALE-VLP injects external medical knowledge.
- Knowledge embeddings: Each report is processed by a frozen medical LLM (e.g., HuatuoGPT-o1 7B); the final hidden states are mean-pooled and projected into the shared embedding space.
- Knowledge similarity weights: pairwise similarities between these knowledge embeddings define a second set of soft weights.
- Overall loss: a convex combination blends the spatial and knowledge-aware SWCA terms, with the mixing coefficient tuned to balance spatial and semantic priors (sketched below).
This approach allows the model to simultaneously ground representations in anatomical structure and expert knowledge relationships.
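Below is a minimal sketch of the knowledge-infused weighting and the convex combination of the two SWCA terms. It assumes mean-pooled hidden states from the frozen LLM, a cosine-similarity weight matrix with an assumed temperature, and a placeholder mixing coefficient; the paper's tuned value is not reproduced here.

```python
import torch
import torch.nn.functional as F

def knowledge_weights(llm_hidden, attn_mask, proj, temp=0.1):
    """Pairwise knowledge weights from a frozen medical LLM's hidden states.

    llm_hidden : (B, T, H) final hidden states of the frozen LLM
    attn_mask  : (B, T)    1 for real tokens, 0 for padding
    proj       : nn.Linear mapping H to the shared embedding dimension
    temp       : assumed temperature for converting similarities to weights
    """
    mask = attn_mask.unsqueeze(-1).float()
    pooled = (llm_hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean-pool tokens
    k = F.normalize(proj(pooled), dim=-1)                       # (B, d) knowledge embeddings
    return torch.softmax((k @ k.T) / temp, dim=-1)              # soft pairwise weights

def combined_swca(loss_spatial, loss_knowledge, lam=0.5):
    """Convex combination of the spatial and knowledge-aware SWCA terms;
    lam is a placeholder, not the value reported in the paper."""
    return lam * loss_spatial + (1.0 - lam) * loss_knowledge
```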
5. Training, Implementation, and Evaluation
Training
- Datasets: CT-RATE (24,128 non-contrast chest CTs + reports) for pretraining; BIMCV-R (8,069 volumes + reports) for zero-shot evaluation.
- Preprocessing: Deterministic resampling to a fixed voxel spacing, quantization to 8 bits, and storage as NIfTI.
- Optimization: 10 epochs with AdamW (weight decay 0.1), linear warm-up over the first 3% of steps followed by a cosine schedule, gradient clipping at 0.5, and a batch size of 55 pairs per GPU across 4 GPUs (see the schedule sketch after this list).
- Data augmentation: Only deterministic preprocessing; no heavy intensity or geometric augmentation.
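The optimization recipe above maps directly onto a standard PyTorch warm-up-plus-cosine schedule; the sketch below assumes a placeholder base learning rate, since the original value is not reproduced in this summary.

```python
import math
import torch

def build_optimizer_and_schedule(model, total_steps, base_lr=1e-4):
    """AdamW with weight decay 0.1, 3% linear warm-up, then cosine decay.
    base_lr is a placeholder value."""
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.1)
    warmup = int(0.03 * total_steps)

    def lr_lambda(step):
        if step < warmup:
            return step / max(1, warmup)                      # linear warm-up
        progress = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))     # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

# Per-step usage, including the gradient clipping at 0.5 noted above:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
#   opt.step(); sched.step(); opt.zero_grad()
```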
Implementation
- Sigmoid-based SWCA avoids softmax, reducing VRAM and enabling larger batch sizes.
- All volumetric preprocessing is deterministic, with NIfTI compression to minimize I/O latency.
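A minimal sketch of this deterministic preprocessing using nibabel and SciPy follows; the target spacing and HU window are assumed values, and a production pipeline would also update the affine to reflect the new spacing.

```python
import nibabel as nib
import numpy as np
from scipy.ndimage import zoom

def preprocess_ct(in_path, out_path, target_spacing=(1.5, 1.5, 1.5),
                  hu_window=(-1000.0, 1000.0)):
    """Deterministically resample a CT volume to a fixed spacing and quantize
    it to 8 bits before saving as compressed NIfTI. Spacing and HU window are
    placeholder values."""
    img = nib.load(in_path)
    data = img.get_fdata().astype(np.float32)

    # Resample with linear interpolation to the target voxel spacing.
    spacing = img.header.get_zooms()[:3]
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    data = zoom(data, factors, order=1)

    # Clip to an HU window and quantize to 8 bits (uint8).
    lo, hi = hu_window
    data = np.clip(data, lo, hi)
    data = np.round((data - lo) / (hi - lo) * 255.0).astype(np.uint8)

    # Original affine kept for brevity; it should be rescaled after resampling.
    # A .nii.gz suffix yields compressed NIfTI output, minimizing I/O latency.
    nib.save(nib.Nifti1Image(data, img.affine), out_path)
```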
Evaluation
Benchmarks and major results include:
| Task | Metric | SCALE-VLP | Baselines |
|---|---|---|---|
| CT–report retrieval | R@1 | 13% | 2–6% (SOTA) |
| Report generation | ROUGE-L | 0.4408 | 0.2823 (CT-CLIP), 0.3107 (M3D) |
| Report generation | BLEU-4 | 0.3485 | 0.1766 (CT-CLIP), 0.1695 (M3D) |
| Report generation | BERT-F1 | 0.8934 | — |
| Abnormality classification | Acc / F1 / AUC | 0.72 / 0.59 / 0.52 | fVLM: 0.69 / 0.58 / 0.51 |
| Zero-shot transfer (BIMCV-R) | BLEU | 0.2406 | 0.2022 |
| Zero-shot transfer (BIMCV-R) | ROUGE | 0.2231 | 0.1939 |
Removing spatial or knowledge priors substantially degrades recall and BLEU-4, supporting the necessity of both.
6. Unified Modality Scoring for Vision-Language Pre-Training
A separate but related SCALE-VLP framework (Xu et al., 10 Jun 2025) focuses on dataset curation for multimodal instruction tuning, introducing a unified, cross-modality scoring pipeline ("SCALE") addressing noisy alignment and ambiguous supervision.
Three-Stage Pipeline
- Task Assignment: Each image-text pair is assigned to one of a predefined set of tasks (e.g., OCR, reasoning) by prompting a large LLM (Qwen2.5-32B-Instruct).
- Caption Generation: Qwen2.5-VL-7B-Instruct generates:
- General caption (scene/environment/style)
- Task-specific caption (objects/relations for the assigned task)
- Quality Scoring: Specialized judge models compute three scores:
- Unimodal image quality
- Unimodal text quality
- Multimodal alignment (clarity, relevance, task rarity)
These per-pair scores are combined into composite quality metrics.
A data selection algorithm retains the top-scoring fraction of pairs by a weighted total score, ensuring high unimodal and cross-modal quality as well as task diversity (a selection sketch follows).
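Here is a minimal sketch of that score-weighted selection step, assuming per-pair scores have already been produced by the judge models; the field names, weights, and retention fraction are placeholders rather than the paper's settings.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ScoredPair:
    pair_id: str
    task: str       # task label assigned in stage one (e.g., "OCR", "reasoning")
    iqa: float      # unimodal image quality
    tqa: float      # unimodal text quality
    align: float    # multimodal alignment (clarity, relevance, task rarity)

def select_top_fraction(pairs: List[ScoredPair],
                        fraction: float = 0.10,
                        weights: Tuple[float, float, float] = (0.25, 0.25, 0.50)
                        ) -> List[ScoredPair]:
    """Rank pairs by a weighted total score and keep the top fraction.
    A per-task quota could additionally be enforced to preserve task diversity."""
    w_i, w_t, w_m = weights
    ranked = sorted(pairs,
                    key=lambda p: w_i * p.iqa + w_t * p.tqa + w_m * p.align,
                    reverse=True)
    return ranked[: max(1, int(fraction * len(ranked)))]
```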
Quantitative Impact
Fine-tuning on the SCALE-selected top 10% improves or matches downstream performance across eight VLM benchmarks, outperforming full-dataset models and simple unimodal filtering:
| Model/Data | A-OKVQA | LLaVA Wild | ScienceQA | Avg |
|---|---|---|---|---|
| Full Data (500K) | 87.2 | 80.6 | 89.1 | 83.94 |
| SCALE-selected (50K) | 87.5 | 81.3 | 89.6 | 84.23 |
| IQA-only | — | — | — | 81.07 |
| TQA-only | — | — | — | 80.88 |
| (I+T) mean | — | — | — | 81.68 |
Ablation reveals that the multimodal scoring stage is essential for retaining informative "edge case" samples and achieving robust downstream reasoning.
7. Significance, Limitations, and Prospects
SCALE-VLP demonstrates that soft-weighted, multimodal alignment—grounded simultaneously in spatial and knowledge-based semantics—yields robust and transferable representations for volumetric medical vision-language tasks. Results indicate large gains in cross-modal retrieval, report generation, abnormality classification, and zero-shot transfer, highlighting the importance of precise, structured pairing over binary targets or slice-wise preprocessing.
In the general VLP context, SCALE-VLP's data selection methodology reveals that careful, task-aware multimodal scoring surpasses unimodal curation, yielding better performance with fewer samples. This suggests a plausible direction for future VLM research: automated, cross-modality-driven filtering becomes increasingly vital as datasets scale and applications demand nuanced, high-fidelity supervision.
Both frameworks ground the evaluation and selection process in quantitative, modular metrics (clarity, relevance, spatial proximity, domain knowledge) rather than arbitrary or handcrafted labels. A plausible implication is the growing viability of highly automated, high-utility VLP pipelines across scientific imaging and broader vision-language domains.