SCALE-VLP: Soft-Weighted Vision–Language Pretraining
- SCALE-VLP is a vision–language framework that employs soft-weighted contrastive loss to fuse 3D volumetric medical data with textual reports.
- It integrates spatial semantics via 3D ViTs and explicit anatomical structure, ensuring precise cross-modal alignment between CT scans and radiology reports.
- The framework achieves significant gains in retrieval and generation metrics, demonstrating the efficacy of quality-driven, multimodal pretraining strategies.
SCALE-VLP refers to two distinct frameworks in vision-language modeling that share a focus on improving cross-modal learning via soft-weighted scoring but are instantiated for different domains: SCALE-VLP for volumetric medical vision-language pretraining (Mahdizadeh et al., 4 Nov 2025), and SCALE-VLP in the sense of unified modality scoring ("SCALE"), a quality-driven dataset-curation pipeline for instruction tuning (Xu et al., 10 Jun 2025). Both frameworks introduce principled methodologies to overcome the limitations of binary or unimodal supervision, leveraging structured, continuous relevance metrics and task-aware alignment.
1. Soft-Weighted Contrastive Vision–Language Pre-Training for Volumetric Data
The SCALE-VLP framework (Mahdizadeh et al., 4 Nov 2025) addresses the challenge of representing, aligning, and transferring knowledge in 3D volumetric data—specifically, clinical CT volumes—paired with radiology reports. Most previous vision-language models (VLMs) are restricted to 2D inputs and rely on binary image-text supervision; this is inadequate for data with inherent spatial coherence and domain-structured semantics such as medical image series.
Model Architecture and Inputs
Volumetric Encoder: A frozen 3D ViT (pretrained on RadImageNet) encodes CT scans that have been resampled to a fixed voxel grid and quantized to 8 bits. Each volume is partitioned into non-overlapping cubic patches, each carrying a 3D positional encoding. The patch tokens are aggregated by a lightweight transformer "vision head" into spatially localized vision tokens plus a global volume-level token; outputs are linearly projected and $\ell_2$-normalized (a patchification sketch follows this list).
Text Encoder: Free-text radiology reports are tokenized with the BioClinicalBERT vocabulary and passed through a fine-tuned BERT to extract a projected, $\ell_2$-normalized report embedding. All BERT weights and text projection layers are updated during pretraining.
Pairing and Alignment: Each CT volume is paired with its corresponding radiology report. Within each batch, the resulting vision and text embeddings are aligned via a contrastive objective.
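To make the volumetric tokenization concrete, here is a minimal PyTorch sketch of non-overlapping cubic patchification with a learnable 3D positional table. The patch size, embedding dimension, and grid shape are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VolumePatchEmbed3D(nn.Module):
    """Split a CT volume into non-overlapping cubic patches and embed them.
    Patch size, embedding dimension, and grid shape are illustrative choices,
    not the paper's exact configuration."""
    def __init__(self, patch=16, in_ch=1, dim=768, grid=(8, 8, 8)):
        super().__init__()
        # A strided Conv3d flattens each non-overlapping cube and applies a
        # shared linear projection in one step.
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
        # Learnable 3D positional table, one entry per cubic patch position.
        self.pos = nn.Parameter(torch.zeros(1, grid[0] * grid[1] * grid[2], dim))

    def forward(self, vol):                  # vol: (B, 1, D, H, W)
        x = self.proj(vol)                   # (B, dim, D/p, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)     # (B, N_patches, dim)
        return x + self.pos                  # add 3D positional encodings

# Usage with assumed shapes: a 128^3 volume and 16^3 patches give 512 tokens.
tokens = VolumePatchEmbed3D()(torch.randn(2, 1, 128, 128, 128))  # (2, 512, 768)
```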
2. Soft-Weighted Contrastive Alignment Loss (SWCA)
Traditional binary supervision (as in InfoNCE or CLIP) is replaced by a continuously weighted, graded loss. The SWCA incorporates intra-modal similarity, cross-modal similarity, and weights informed by both structural and knowledge priors.
Key Formulations
- Intra-modal similarity: cosine similarities between embeddings of the same modality (e.g., report–report), scaled by a fixed scale factor, provide graded relatedness scores within a batch.
- Cross-modal similarity: cosine similarities between the $\ell_2$-normalized vision and text embeddings, scaled by a learnable temperature, form the alignment logits.
- SWCA loss (one-sided, vision-to-text): a sigmoid-based pairwise loss whose targets replace the hard indicator $\delta_{ij}$ ($\delta_{ij}=1$ if $i=j$ and $0$ otherwise) with continuous weights derived from the intra-modal similarities.
- Symmetric loss: the vision-to-text and text-to-vision directions are combined symmetrically to give the final SWCA objective (one possible instantiation is sketched below).
This formulation enables graded, non-binary supervision reflecting structured relationships in the data.
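The exact SWCA equations are not reproduced in this summary, so the following PyTorch sketch shows one plausible instantiation under stated assumptions: soft targets blend the identity indicator $\delta_{ij}$ with scaled intra-modal (report–report) similarities, and a sigmoid-based pairwise loss in the spirit of SigLIP is applied to the temperature-scaled cross-modal logits. The function names, blending coefficient, and exact weight construction are illustrative.

```python
import torch
import torch.nn.functional as F

def swca_one_sided(z_v, z_t, log_tau, kappa=5.0, alpha=0.5):
    """One-sided (vision-to-text) soft-weighted contrastive loss sketch.

    z_v, z_t : (B, d) l2-normalized vision / text embeddings
    log_tau  : learnable log-temperature for the cross-modal similarities
    kappa    : fixed scale for the intra-modal similarities
    alpha    : assumed blend between hard identity targets and soft weights
    """
    B = z_v.size(0)
    delta = torch.eye(B, device=z_v.device)                # delta_ij = 1 iff i == j
    intra = torch.softmax(kappa * (z_t @ z_t.T), dim=-1)   # graded report-report weights
    targets = alpha * delta + (1.0 - alpha) * intra        # soft, non-binary supervision

    logits = (z_v @ z_t.T) * log_tau.exp()                 # cross-modal similarities
    # Sigmoid-based pairwise loss: no row-wise softmax over the batch, which is
    # what the implementation notes credit for the reduced VRAM footprint.
    return F.binary_cross_entropy_with_logits(logits, targets)

def swca_symmetric(z_v, z_t, log_tau):
    """Symmetric objective: average the vision-to-text and text-to-vision terms."""
    return 0.5 * (swca_one_sided(z_v, z_t, log_tau) + swca_one_sided(z_t, z_v, log_tau))
```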
3. Integration of Volumetric Spatial Semantics
SCALE-VLP preserves anatomical structure via explicit 3D spatial statistics for each volume.
Spatial Kernel Construction
- Patch centroid and saliency: for each volume, 3D patch centroids and per-patch saliency scores are computed from the volumetric tokens.
- Spatial proximity kernel: a proximity kernel over these centroid statistics measures how spatially similar two volumes are.
- Spatial similarity weights: the kernel values are normalized into soft pairwise weights over the batch.
- Spatial SWCA loss: these spatial weights are used in place of the intra-modal similarity weights in the SWCA objective (see the sketch after this subsection).
Structural consistency is implicitly enforced by operating over 3D centroid distributions and patch covariance, penalizing discordant spatial matches.
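The spatial-kernel formulas themselves are not shown above; the sketch below illustrates one way the described quantities (patch centroids, saliencies, a proximity kernel, normalized weights) could be turned into batch-level soft weights. The saliency-weighted centroid summary and the Gaussian kernel are assumptions, not the paper's exact construction.

```python
import torch

def spatial_weights(patch_centroids, patch_saliency, sigma=1.0):
    """Turn per-volume patch geometry into pairwise soft weights (sketch).

    patch_centroids : (B, N, 3) 3D centroids of the N cubic patches per volume
    patch_saliency  : (B, N)    per-patch saliency (e.g., token norms)
    Returns a (B, B) matrix of spatial similarity weights.
    """
    # Saliency-weighted volume-level centroid summarizes each scan's geometry.
    w = patch_saliency / patch_saliency.sum(dim=1, keepdim=True)   # (B, N)
    mu = (w.unsqueeze(-1) * patch_centroids).sum(dim=1)            # (B, 3)

    # Gaussian (RBF) proximity kernel over centroid distances (assumed form).
    d2 = torch.cdist(mu, mu).pow(2)                                # (B, B)
    K = torch.exp(-d2 / (2 * sigma ** 2))

    # Normalize rows so the weights can stand in for intra-modal targets in SWCA.
    return K / K.sum(dim=1, keepdim=True)
```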
4. Incorporation of Domain-Aware, Knowledge-Infused Semantics
To capture the complex synonymy, compositional, and hierarchical relationships in radiological findings, SCALE-VLP injects external medical knowledge.
- Knowledge embeddings: Each report is processed by a frozen medical LLM (e.g., HuatuoGPT-o1 7B); the final hidden states are mean-pooled and projected into the shared embedding space.
- Knowledge similarity weights: pairwise similarities between these knowledge embeddings define a second set of soft weights.
- Overall loss: a convex combination blends the spatial and knowledge-aware SWCA terms, with the mixing coefficient tuned to balance spatial and semantic priors (sketched below).
This approach allows the model to simultaneously ground representations in anatomical structure and expert knowledge relationships.
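Below is a minimal sketch of the knowledge-infused weighting and the convex combination of the two SWCA terms. It assumes mean-pooled hidden states from the frozen LLM, a cosine-similarity weight matrix with an assumed temperature, and a placeholder mixing coefficient; the paper's tuned value is not reproduced here.

```python
import torch
import torch.nn.functional as F

def knowledge_weights(llm_hidden, attn_mask, proj, temp=0.1):
    """Pairwise knowledge weights from a frozen medical LLM's hidden states.

    llm_hidden : (B, T, H) final hidden states of the frozen LLM
    attn_mask  : (B, T)    1 for real tokens, 0 for padding
    proj       : nn.Linear mapping H to the shared embedding dimension
    temp       : assumed temperature for converting similarities to weights
    """
    mask = attn_mask.unsqueeze(-1).float()
    pooled = (llm_hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean-pool tokens
    k = F.normalize(proj(pooled), dim=-1)                       # (B, d) knowledge embeddings
    return torch.softmax((k @ k.T) / temp, dim=-1)              # soft pairwise weights

def combined_swca(loss_spatial, loss_knowledge, lam=0.5):
    """Convex combination of the spatial and knowledge-aware SWCA terms;
    lam is a placeholder, not the value reported in the paper."""
    return lam * loss_spatial + (1.0 - lam) * loss_knowledge
```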
5. Training, Implementation, and Evaluation
Training
- Datasets: CT-RATE (24,128 non-contrast chest CTs + reports) for pretraining; BIMCV-R (8,069 volumes + reports) for zero-shot evaluation.
- Preprocessing: Deterministic resampling to a fixed voxel spacing, quantization to 8 bits, and storage as NIfTI.
- Optimization: 10 epochs with AdamW (weight decay 0.1), linear warm-up over the first 3% of steps followed by a cosine schedule, gradient clipping at 0.5, and a batch size of 55 pairs per GPU across 4 GPUs (see the schedule sketch after this list).
- Data augmentation: Only deterministic preprocessing; no heavy intensity or geometric augmentation.
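The optimization recipe above maps directly onto a standard PyTorch warm-up-plus-cosine schedule; the sketch below assumes a placeholder base learning rate, since the original value is not reproduced in this summary.

```python
import math
import torch

def build_optimizer_and_schedule(model, total_steps, base_lr=1e-4):
    """AdamW with weight decay 0.1, 3% linear warm-up, then cosine decay.
    base_lr is a placeholder value."""
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.1)
    warmup = int(0.03 * total_steps)

    def lr_lambda(step):
        if step < warmup:
            return step / max(1, warmup)                      # linear warm-up
        progress = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))     # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

# Per-step usage, including the gradient clipping at 0.5 noted above:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
#   opt.step(); sched.step(); opt.zero_grad()
```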
Implementation
- Sigmoid-based SWCA avoids softmax, reducing VRAM and enabling larger batch sizes.
- All volumetric preprocessing is deterministic, with NIfTI compression to minimize I/O latency.
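A minimal sketch of this deterministic preprocessing using nibabel and SciPy follows; the target spacing and HU window are assumed values, and a production pipeline would also update the affine to reflect the new spacing.

```python
import nibabel as nib
import numpy as np
from scipy.ndimage import zoom

def preprocess_ct(in_path, out_path, target_spacing=(1.5, 1.5, 1.5),
                  hu_window=(-1000.0, 1000.0)):
    """Deterministically resample a CT volume to a fixed spacing and quantize
    it to 8 bits before saving as compressed NIfTI. Spacing and HU window are
    placeholder values."""
    img = nib.load(in_path)
    data = img.get_fdata().astype(np.float32)

    # Resample with linear interpolation to the target voxel spacing.
    spacing = img.header.get_zooms()[:3]
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    data = zoom(data, factors, order=1)

    # Clip to an HU window and quantize to 8 bits (uint8).
    lo, hi = hu_window
    data = np.clip(data, lo, hi)
    data = np.round((data - lo) / (hi - lo) * 255.0).astype(np.uint8)

    # Original affine kept for brevity; it should be rescaled after resampling.
    # A .nii.gz suffix yields compressed NIfTI output, minimizing I/O latency.
    nib.save(nib.Nifti1Image(data, img.affine), out_path)
```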
Evaluation
Benchmarks and major results include:
| Task | Metric | SCALE-VLP | Baselines |
|---|---|---|---|
| CT–report retrieval | R@1 | 13% | 2–6% (SOTA) |
| Report generation | ROUGE-L | 0.4408 | 0.2823 (CT-CLIP), 0.3107 (M3D) |
| Report generation | BLEU-4 | 0.3485 | 0.1766 (CT-CLIP), 0.1695 (M3D) |
| Report generation | BERT-F1 | 0.8934 | — |
| Abnormality classification | Acc / F1 / AUC | 0.72 / 0.59 / 0.52 | fVLM: 0.69 / 0.58 / 0.51 |
| Zero-shot transfer (BIMCV-R) | BLEU | 0.2406 | 0.2022 |
| Zero-shot transfer (BIMCV-R) | ROUGE | 0.2231 | 0.1939 |
Removing spatial or knowledge priors substantially degrades recall and BLEU-4, supporting the necessity of both.
6. Unified Modality Scoring for Vision-Language Pre-Training
A separate but related SCALE-VLP framework (Xu et al., 10 Jun 2025) focuses on dataset curation for multimodal instruction tuning, introducing a unified, cross-modality scoring pipeline ("SCALE") addressing noisy alignment and ambiguous supervision.
Three-Stage Pipeline
- Task Assignment: Each image-text pair is assigned to one of a predefined set of tasks (e.g., OCR, reasoning) by prompting a large LLM (Qwen2.5-32B-Instruct).
- Caption Generation: Qwen2.5-VL-7B-Instruct generates:
- General caption (scene/environment/style)
- Task-specific caption (objects/relations for the assigned task)
- Quality Scoring: Specialized judge models compute three scores:
- Unimodal image quality
- Unimodal text quality
- Multimodal alignment (clarity, relevance, task rarity)
These per-pair scores are combined into composite quality metrics.
A data selection algorithm retains the top-scoring fraction of pairs by a weighted total score, ensuring high unimodal and cross-modal quality as well as task diversity (a selection sketch follows).
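Here is a minimal sketch of that score-weighted selection step, assuming per-pair scores have already been produced by the judge models; the field names, weights, and retention fraction are placeholders rather than the paper's settings.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ScoredPair:
    pair_id: str
    task: str       # task label assigned in stage one (e.g., "OCR", "reasoning")
    iqa: float      # unimodal image quality
    tqa: float      # unimodal text quality
    align: float    # multimodal alignment (clarity, relevance, task rarity)

def select_top_fraction(pairs: List[ScoredPair],
                        fraction: float = 0.10,
                        weights: Tuple[float, float, float] = (0.25, 0.25, 0.50)
                        ) -> List[ScoredPair]:
    """Rank pairs by a weighted total score and keep the top fraction.
    A per-task quota could additionally be enforced to preserve task diversity."""
    w_i, w_t, w_m = weights
    ranked = sorted(pairs,
                    key=lambda p: w_i * p.iqa + w_t * p.tqa + w_m * p.align,
                    reverse=True)
    return ranked[: max(1, int(fraction * len(ranked)))]
```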
Quantitative Impact
Fine-tuning on the SCALE-selected top 10% improves or matches downstream performance across eight VLM benchmarks, outperforming full-dataset models and simple unimodal filtering:
| Model/Data | A-OKVQA | LLaVA Wild | ScienceQA | Avg |
|---|---|---|---|---|
| Full Data (500K) | 87.2 | 80.6 | 89.1 | 83.94 |
| SCALE-selected (50K) | 87.5 | 81.3 | 89.6 | 84.23 |
| IQA-only | — | — | — | 81.07 |
| TQA-only | — | — | — | 80.88 |
| (I+T) mean | — | — | — | 81.68 |
Ablation reveals that the multimodal scoring stage is essential for retaining informative "edge case" samples and achieving robust downstream reasoning.
7. Significance, Limitations, and Prospects
SCALE-VLP demonstrates that soft-weighted, multimodal alignment—grounded simultaneously in spatial and knowledge-based semantics—yields robust and transferable representations for volumetric medical vision-language tasks. Results indicate large gains in cross-modal retrieval, report generation, abnormality classification, and zero-shot transfer, highlighting the importance of precise, structured pairing over binary targets or slice-wise preprocessing.
In the general VLP context, SCALE-VLP's data selection methodology reveals that careful, task-aware multimodal scoring surpasses unimodal curation, yielding better performance with fewer samples. This suggests a plausible direction for future VLM research: automated, cross-modality-driven filtering becomes increasingly vital as datasets scale and applications demand nuanced, high-fidelity supervision.
Both frameworks ground the evaluation and selection process in quantitative, modular metrics (clarity, relevance, spatial proximity, domain knowledge) rather than arbitrary or handcrafted labels. A plausible implication is the growing viability of highly automated, high-utility VLP pipelines across scientific imaging and broader vision-language domains.