
SCALE-VLP: Soft-Weighted Vision–Language Pretraining

Updated 10 November 2025
  • SCALE-VLP is a vision–language framework that employs soft-weighted contrastive loss to fuse 3D volumetric medical data with textual reports.
  • It integrates spatial semantics via 3D ViTs and explicit anatomical structure, ensuring precise cross-modal alignment between CT scans and radiology insights.
  • The framework achieves significant gains in retrieval and generation metrics, demonstrating the efficacy of quality-driven, multimodal pretraining strategies.

SCALE-VLP refers to two distinct frameworks in vision–language modeling that share a unifying focus on improving cross-modal learning via soft-weighted scoring but are instantiated in different domains: SCALE-VLP for volumetric medical vision–language pretraining (Mahdizadeh et al., 4 Nov 2025), and SCALE-VLP in the context of unified modality scoring, i.e., quality-driven dataset curation for instruction tuning (Xu et al., 10 Jun 2025). Both frameworks introduce principled methodologies to overcome the limitations of binary or unimodal supervision, leveraging structured, continuous relevance metrics and task-aware alignment.

1. Soft-Weighted Contrastive Vision–Language Pre-Training for Volumetric Data

The SCALE-VLP framework (Mahdizadeh et al., 4 Nov 2025) addresses the challenge of representing, aligning, and transferring knowledge in 3D volumetric data, specifically clinical CT volumes paired with radiology reports. Most previous vision-language models (VLMs) are restricted to 2D inputs and rely on binary image-text supervision; this is inadequate for data with inherent spatial coherence and domain-structured semantics, such as medical image series.

Model Architecture and Inputs

Volumetric Encoder: A frozen 3D ViT (pretrained on RadImageNet) encodes CT scans resampled to $256 \times 256 \times 32$ voxels and quantized to 8 bits. The volume is partitioned into $N$ non-overlapping cubic patches, each processed with 3D positional encodings. These patch tokens are aggregated by a lightweight transformer "vision head" to produce both spatially localized vision tokens $\{v_{i,m}\}_{m=1}^N$ and a global token $v_i^{\mathrm{CLS}}$. Outputs are linearly projected and $\ell_2$-normalized.

Text Encoder: Free-text radiology reports are tokenized (BioClinicalBERT vocabulary) and passed through a fine-tuned BERT to extract a $D$-dimensional, projected, $\ell_2$-normalized embedding $t_i$. All BERT weights and text projection layers are updated during pretraining.

Pairing and Alignment: Each CT volume $i$ is paired with its corresponding report $i$. In each batch, embeddings $\{v_i\}_{i=1}^B$ and $\{t_i\}_{i=1}^B$ are aligned via a contrastive objective.
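
The following is a minimal PyTorch sketch of the two-branch embedding pipeline described above: a lightweight transformer "vision head" aggregating frozen 3D-ViT patch tokens into spatially localized and global vision tokens, and a projection over the fine-tuned BERT text embedding. Module names, dimensions, and the specific layer configuration are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch only: names, dimensions, and layer choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionHead(nn.Module):
    """Lightweight transformer head over frozen 3D-ViT patch tokens."""
    def __init__(self, patch_dim=768, proj_dim=512, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=patch_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, patch_dim))   # global token seed
        self.proj = nn.Linear(patch_dim, proj_dim)

    def forward(self, patch_tokens):          # (B, N, patch_dim) from the frozen 3D ViT
        B = patch_tokens.size(0)
        x = torch.cat([self.cls.expand(B, -1, -1), patch_tokens], dim=1)
        x = self.encoder(x)
        v_cls = F.normalize(self.proj(x[:, 0]), dim=-1)    # global token v_i^CLS
        v_loc = F.normalize(self.proj(x[:, 1:]), dim=-1)   # local tokens {v_{i,m}}
        return v_cls, v_loc

class TextProjection(nn.Module):
    """Projection of a fine-tuned BERT sentence embedding (pooling choice assumed)."""
    def __init__(self, bert_dim=768, proj_dim=512):
        super().__init__()
        self.proj = nn.Linear(bert_dim, proj_dim)

    def forward(self, bert_embedding):        # (B, bert_dim)
        return F.normalize(self.proj(bert_embedding), dim=-1)   # t_i
```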

2. Soft-Weighted Contrastive Alignment Loss (SWCA)

Traditional binary supervision (as in InfoNCE or CLIP) is replaced by a continuously weighted, graded loss. The SWCA incorporates intra-modal similarity, cross-modal similarity, and weights informed by both structural and knowledge priors.

Key Formulations

  • Intra-modal similarity:

$$a_{ij} = \exp\left(\beta\,\cos(z_i, z_j)\right), \quad a_{ii} = 0$$

$$w_{ij}^{\mathrm{Intra}} = \frac{a_{ij}}{\sum_{k\neq i} a_{ik} + \varepsilon}, \quad w_{ii}^{\mathrm{Intra}} = 0$$

where $\beta$ is a fixed scale and $z_i$, $z_j$ are embeddings within a single modality.

  • Cross-modal similarity (for $B$ vision and $B$ text embeddings, all $\ell_2$-normalized):

$$s_{ij} = \tau\,(\hat v_i^\top \hat t_j)$$

with learnable temperature $\tau$.

  • SWCA loss (one-sided, vision-to-text):

$$\mathcal{L}_{\mathrm{SWCA}^{V\to T}} = \frac{1}{B} \sum_{i=1}^B \sum_{j=1}^B \left[w_{ij}^{\mathrm{Intra}} + y_{ij}\right] \left[-y_{ij} \log \sigma(s_{ij}) - (1-y_{ij}) \log\left(1 - \sigma(s_{ij})\right)\right],$$

where $y_{ij}=1$ if $i=j$ and $y_{ij}=0$ otherwise.

  • Symmetric loss:

$$\mathcal{L}_{\mathrm{SWCA}} = 0.5\left(\mathcal{L}_{\mathrm{SWCA}^{V\to T}} + \mathcal{L}_{\mathrm{SWCA}^{T\to V}}\right)$$

This formulation enables graded, non-binary supervision reflecting structured relationships in the data.
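
As a concrete reference, the following PyTorch sketch implements the loss as written above: intra-modal weights $w_{ij}^{\mathrm{Intra}}$, sigmoid-based per-pair terms, and symmetric averaging over both directions. The defaults for $\beta$ and $\varepsilon$, and the assumption that each direction uses the intra-modal weights of its query modality, are not specified in the source and should be treated as placeholders.

```python
# Sketch of the soft-weighted contrastive alignment (SWCA) loss.
# beta/eps defaults and the per-direction weight choice are assumptions.
import torch
import torch.nn.functional as F

def intra_modal_weights(z, beta=10.0, eps=1e-8):
    """w_ij = exp(beta*cos(z_i,z_j)) / (sum_{k!=i} exp(beta*cos(z_i,z_k)) + eps), w_ii = 0."""
    z = F.normalize(z, dim=-1)
    a = torch.exp(beta * (z @ z.t()))
    a.fill_diagonal_(0.0)                              # a_ii = 0
    return a / (a.sum(dim=1, keepdim=True) + eps)

def swca_one_sided(s, w_intra):
    """One-sided SWCA given logits s_ij (e.g. vision-to-text)."""
    B = s.size(0)
    y = torch.eye(B, device=s.device)                  # y_ij = 1 iff i == j
    bce = F.binary_cross_entropy_with_logits(s, y, reduction='none')
    return ((w_intra + y) * bce).sum() / B

def swca_loss(v, t, log_tau, beta=10.0):
    """Symmetric SWCA over a batch of vision embeddings v and text embeddings t."""
    v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
    s = log_tau.exp() * (v @ t.t())                    # s_ij with learnable temperature tau
    loss_vt = swca_one_sided(s, intra_modal_weights(v, beta))
    loss_tv = swca_one_sided(s.t(), intra_modal_weights(t, beta))
    return 0.5 * (loss_vt + loss_tv)
```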

3. Integration of Volumetric Spatial Semantics

SCALE-VLP preserves anatomical structure via explicit 3D spatial statistics for each volume.

Spatial Kernel Construction

  • Patch centroid and saliency:

$$\alpha_{i,m} = \frac{r_{i,m}}{\sum_{n=1}^N r_{i,n}}$$

$$\mu_i = \sum_{m=1}^N \alpha_{i,m}\, c_{i,m}, \quad \Sigma_i = \sum_{m=1}^N \alpha_{i,m} (c_{i,m} - \mu_i)(c_{i,m} - \mu_i)^\top$$

where $c_{i,m}$ is the 3D centroid of patch $m$ in volume $i$ and $r_{i,m}$ its saliency score.

  • Spatial proximity kernel:

$$p_{ij} = \exp\left( -\frac{ \|\mu_i - \mu_j\|_2^2 }{2\kappa_\mu^2} \right) \cdot \exp\left( -\frac{ \|\Sigma_i - \Sigma_j\|_F^2 }{2\kappa_\Sigma^2} \right)$$

  • Spatial similarity weights:

$$\delta_{ij} = w_{ij}^{\mathrm{Intra}}\, p_{ij}, \quad w_{ij}^{\mathrm{spatial}} = \frac{\delta_{ij}}{\sum_k \delta_{ik} + \varepsilon}$$

  • Spatial SWCA loss:

$w_{ij}^{\mathrm{spatial}}$ is used in place of $w_{ij}^{\mathrm{Intra}}$.

Structural consistency is implicitly enforced by operating over 3D centroid distributions and patch covariance, penalizing discordant spatial matches.
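
A short sketch of the spatial kernel construction follows, assuming per-patch 3D centroids $c_{i,m}$ and saliency scores $r_{i,m}$ are available from the vision head; the bandwidths $\kappa_\mu$ and $\kappa_\Sigma$ are placeholder values.

```python
# Sketch of the spatial statistics and proximity kernel; bandwidths are placeholders.
import torch

def volume_spatial_stats(centroids, saliency):
    """centroids: (N, 3) patch centers c_{i,m}; saliency: (N,) scores r_{i,m}."""
    alpha = saliency / saliency.sum()                                  # alpha_{i,m}
    mu = (alpha[:, None] * centroids).sum(dim=0)                       # saliency-weighted centroid
    d = centroids - mu
    sigma = (alpha[:, None, None] * d[:, :, None] * d[:, None, :]).sum(dim=0)  # covariance
    return mu, sigma

def spatial_kernel(mu, sigma, kappa_mu=1.0, kappa_sigma=1.0):
    """p_ij from pairwise centroid and covariance distances; mu: (B, 3), sigma: (B, 3, 3)."""
    d_mu = torch.cdist(mu, mu).pow(2)                                  # ||mu_i - mu_j||_2^2
    d_sigma = (sigma[:, None] - sigma[None, :]).flatten(2).norm(dim=-1).pow(2)  # Frobenius^2
    return torch.exp(-d_mu / (2 * kappa_mu ** 2)) * torch.exp(-d_sigma / (2 * kappa_sigma ** 2))

def spatial_weights(w_intra, p, eps=1e-8):
    """delta_ij = w^Intra_ij * p_ij, renormalized row-wise to w^spatial_ij."""
    delta = w_intra * p
    return delta / (delta.sum(dim=1, keepdim=True) + eps)
```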

4. Incorporation of Domain-Aware, Knowledge-Infused Semantics

To capture the complex synonymy, compositional, and hierarchical relationships in radiological findings, SCALE-VLP injects external medical knowledge.

  • Knowledge embeddings: Each report is processed by a frozen medical LLM (e.g., HuatuoGPT-o1 7B); final hidden states are mean-pooled and projected to $h_i$.
  • Knowledge similarity weights:

$$a_{ij}^{\mathrm{knowledge}} = \exp\left(\beta\,\cos(h_i, h_j)\right), \qquad w_{ij}^{\mathrm{knowledge}} = \frac{a_{ij}^{\mathrm{knowledge}}}{\sum_{k\neq i} a_{ik}^{\mathrm{knowledge}} + \varepsilon}$$

  • Overall loss: A convex combination weights spatial and knowledge-aware terms:

$$\mathcal{L} = \alpha\,\mathcal{L}_{\mathrm{SWCA}^{\mathrm{spatial}}} + (1-\alpha)\,\mathcal{L}_{\mathrm{SWCA}^{\mathrm{knowledge}}}, \quad \alpha \in [0,1]$$

with $\alpha=0.5$ found optimal for balancing spatial and semantic priors.

This approach allows the model to simultaneously ground representations in anatomical structure and expert knowledge relationships.
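
For completeness, a brief sketch of the knowledge weights and the convex combination of the two SWCA terms; the $\beta$ and $\varepsilon$ defaults are assumptions, while $\alpha = 0.5$ follows the reported setting.

```python
# Sketch of knowledge-infused weights and the combined objective.
import torch
import torch.nn.functional as F

def knowledge_weights(h, beta=10.0, eps=1e-8):
    """h: (B, D) mean-pooled, projected report embeddings from a frozen medical LLM."""
    h = F.normalize(h, dim=-1)
    a = torch.exp(beta * (h @ h.t()))                  # a_ij^knowledge
    a.fill_diagonal_(0.0)
    return a / (a.sum(dim=1, keepdim=True) + eps)      # w_ij^knowledge

def combined_loss(loss_swca_spatial, loss_swca_knowledge, alpha=0.5):
    """L = alpha * L_SWCA^spatial + (1 - alpha) * L_SWCA^knowledge."""
    return alpha * loss_swca_spatial + (1 - alpha) * loss_swca_knowledge
```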

5. Training, Implementation, and Evaluation

Training

  • Datasets: CT-RATE (24,128 non-contrast chest CTs + reports) for pretraining; BIMCV-R (8,069 volumes + reports) for zero-shot evaluation.
  • Preprocessing: Deterministic resampling to $256\times256\times32$, quantization to 8 bits, NIfTI storage.
  • Optimization: 10 epochs, AdamW (weight decay 0.1), learning rate $10^{-4}$, 3% linear warm-up, cosine schedule, gradient clipping at 0.5, batch size 5 pairs/GPU $\times$ 4 GPUs (see the sketch after this list).
  • Data augmentation: Only deterministic preprocessing; no heavy intensity or geometric augmentation.
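
For concreteness, a sketch of the reported optimization setup (AdamW with weight decay 0.1, learning rate $10^{-4}$, 3% linear warm-up into a cosine schedule, gradient clipping at 0.5); the exact scheduler composition is an assumption.

```python
# Sketch of the reported optimizer/schedule; the warm-up/cosine composition is assumed.
import math
import torch

def build_optimizer(model, total_steps, lr=1e-4, weight_decay=0.1, warmup_frac=0.03):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    warmup_steps = max(1, int(warmup_frac * total_steps))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps                                 # linear warm-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))              # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Per step: loss.backward();
#           torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)    # clip at 0.5
#           optimizer.step(); scheduler.step(); optimizer.zero_grad()
```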

Implementation

  • Sigmoid-based SWCA avoids a $B\times B$ softmax, reducing VRAM and enabling larger batch sizes.
  • All volumetric preprocessing is deterministic, with NIfTI compression to minimize I/O latency.

Evaluation

Benchmarks and major results include:

| Task | Metric/Score | SCALE-VLP | Baselines |
| --- | --- | --- | --- |
| CT↔Report | R@1 (CT→Rpt) | 13% | 2-6% (SOTA) |
| | ROUGE-L | 0.4408 | 0.2823 (CT-CLIP), 0.3107 (M3D) |
| Report Gen | BLEU-4 | 0.3485 | 0.1766 (CT-CLIP), 0.1695 (M3D) |
| | BERT-F1 | 0.8934 | |
| Abnormality Cls | Acc/F1/AUC | 0.72/0.59/0.52 | fVLM: 0.69/0.58/0.51 |
| Zero-shot (BIMCV-R) | BLEU | 0.2406 | 0.2022 |
| | ROUGE | 0.2231 | 0.1939 |

Removing spatial or knowledge priors substantially degrades recall and BLEU-4, supporting the necessity of both.

6. Unified Modality Scoring for Vision-Language Pre-Training

A separate but related SCALE-VLP framework (Xu et al., 10 Jun 2025) focuses on dataset curation for multimodal instruction tuning, introducing a unified cross-modality scoring pipeline ("SCALE") that addresses noisy alignment and ambiguous supervision.

Three-Stage Pipeline

  1. Task Assignment: Each image-text pair is assigned to one of $K$ tasks (e.g., OCR, reasoning) via prompting a large LLM (Qwen2.5-32B-Instruct). Formally,

$$\mathsf{Task}(T) = \arg\max_{k} P(\mathrm{Task}_k \mid T, \mathrm{Prompt})$$

  2. Caption Generation: Qwen2.5-VL-7B-Instruct generates:
    • General caption $C_{\mathrm{gen}}$ (scene/environment/style)
    • Task-specific caption $C_{\mathrm{spec}}$ (objects/relations for the assigned task)
  3. Quality Scoring: Computes

    • Unimodal image quality $S_I$,
    • Text quality $S_T$,
    • Multimodal alignment $S_{MM}$ (clarity, relevance, task rarity), using specialized judge models. Composite metrics are:

    $$S_I = \frac{5-\mathrm{BLUR}(I) + 5-\mathrm{NOISE}(I)}{2}$$

    $$S_T = \frac{\mathrm{INFO}(T) + \mathrm{CPXT}(T) + \mathrm{CPLT}(T)}{3}$$

    $$S_{MM} = 0.8\, S_{\mathrm{align}} + 0.2\, S_{\mathrm{rarity}}$$

A data selection algorithm retains the top-$r$ fraction of pairs by a weighted total score ($0.2\, S_I + 0.2\, S_T + 0.6\, S_{MM}$), ensuring both high unimodal and cross-modal quality, and task diversity.
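
A minimal sketch of the composite scoring and top-$r$ selection follows; the judge-model calls are abstracted as precomputed per-sample scores, and the dictionary keys and 0-5 scoring scale are illustrative assumptions.

```python
# Sketch of quality scoring and top-r selection; keys and score scales are assumed.
def composite_score(sample: dict) -> float:
    s_i = ((5 - sample["blur"]) + (5 - sample["noise"])) / 2           # image quality S_I
    s_t = (sample["info"] + sample["cpxt"] + sample["cplt"]) / 3       # text quality S_T
    s_mm = 0.8 * sample["align"] + 0.2 * sample["rarity"]              # multimodal S_MM
    return 0.2 * s_i + 0.2 * s_t + 0.6 * s_mm                          # weighted total

def select_top_fraction(samples: list, r: float = 0.10) -> list:
    """Retain the top-r fraction of image-text pairs by weighted total score."""
    ranked = sorted(samples, key=composite_score, reverse=True)
    return ranked[: max(1, int(r * len(ranked)))]
```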

Quantitative Impact

Fine-tuning on the SCALE-selected top 10% improves or matches downstream performance across eight VLM benchmarks, outperforming full-dataset models and simple unimodal filtering:

| Model/Data | A-OKVQA | LLaVA Wild | ScienceQA | Avg |
| --- | --- | --- | --- | --- |
| Full Data (500K) | 87.2 | 80.6 | 89.1 | 83.94 |
| SCALE-selected (50K) | 87.5 | 81.3 | 89.6 | 84.23 |
| IQA-only | | | | 81.07 |
| TQA-only | | | | 80.88 |
| (I+T) mean | | | | 81.68 |

Ablation reveals that the multimodal scoring stage is essential for retaining informative "edge case" samples and achieving robust downstream reasoning.

7. Significance, Limitations, and Prospects

SCALE-VLP demonstrates that soft-weighted multimodal alignment, grounded simultaneously in spatial and knowledge-based semantics, yields robust and transferable representations for volumetric medical vision-language tasks. Results indicate large gains in cross-modal retrieval, report generation, abnormality classification, and zero-shot transfer, highlighting the importance of precise, structured pairing over binary targets or slice-wise preprocessing.

In the general VLP context, SCALE-VLP's data selection methodology reveals that careful, task-aware multimodal scoring surpasses unimodal curation, yielding better performance with fewer samples. This suggests a plausible direction for future VLM research: automated, cross-modality-driven filtering becomes increasingly vital as datasets scale and applications demand nuanced, high-fidelity supervision.

Both frameworks ground the evaluation and selection process in quantitative, modular metrics (clarity, relevance, spatial proximity, domain knowledge) rather than arbitrary or handcrafted labels. A plausible implication is the growing viability of highly automated, high-utility VLP pipelines across scientific imaging and broader vision-language domains.
