EchoVLM: Medical Vision-Language Models

Updated 4 July 2026

EchoVLM refers to two distinct specialized medical vision-language models: one for echocardiography with measurement-grounded pretraining and one for ultrasound imaging built on a Dual-path Mixture-of-Experts (MoE).
The echocardiography variant leverages structured measurement extraction and view-informed contrastive losses to enhance diagnostic accuracy and view classification.
The ultrasound variant employs a Dynamic Mixture-of-Experts with multi-center training to boost report generation, diagnosis, and visual question answering across diverse anatomical regions.

EchoVLM denotes more than one medical vision-language model in the recent arXiv literature rather than a single standardized architecture. One usage refers to a measurement-grounded echocardiography foundation model built around EchoGround-MIMIC and clinically grounded multimodal pretraining (Li et al., 13 Dec 2025). Another refers to a Dynamic Mixture-of-Experts ultrasound model based on Qwen2-VL for report generation, diagnosis, and visual question answering across seven anatomical regions (She et al., 18 Sep 2025). A common naming confusion arises with the autonomous-driving system EchoVLA; the corresponding paper consistently uses EchoVLA, not EchoVLM (Guo et al., 17 Jan 2026).

1. Disambiguation and nomenclature

In the material summarized here, EchoVLM is an overloaded name. It is used officially by two different 2025 medical-imaging papers, while adjacent “Echo”-prefixed systems belong to other problem domains and model classes. This suggests that the term should be interpreted contextually, with the application domain and cited paper determining the intended system.

Usage	Domain	Defining characteristic
EchoVLM (Li et al., 13 Dec 2025)	Echocardiography	Measurement-grounded multimodal pretraining
EchoVLM (She et al., 18 Sep 2025)	Ultrasound imaging	Dynamic Mixture-of-Experts on Qwen2-VL
EchoVLA (Guo et al., 17 Jan 2026)	Autonomous driving	Vision-Language-Action with audio instructions

The first EchoVLM is presented as a clinically grounded vision-language model for echocardiography, motivated by the claim that echo interpretation depends on standardized views, embedded quantitative measurements, negated findings, and guideline-based disease severity (Li et al., 13 Dec 2025). The second EchoVLM is introduced as an ultrasound-specialized vision-language model for “universal ultrasound intelligence,” emphasizing multi-organ lesion recognition, report generation, diagnosis summarization, and VQA across heterogeneous sonographic data (She et al., 18 Sep 2025). By contrast, the driving paper explicitly frames its system as a Vision-Language-Action model and states that EchoVLA is the correct name because the method predicts trajectories and actions rather than functioning as a pure VLM (Guo et al., 17 Jan 2026).

2. Measurement-grounded EchoVLM for echocardiography

The echocardiography EchoVLM is designed to make multimodal pretraining more faithful to actual echocardiographic reading workflows. Its clinical motivation is that a typical echo exam is interpreted by first identifying the acquisition view, then reading quantitative values such as ejection fraction, chamber sizes, or valve gradients, then combining those measurements with qualitative observations to assign graded abnormalities and diagnoses (Li et al., 13 Dec 2025). The paper argues that standard CLIP-style pretraining with free-text reports does not adequately encode this measurement-driven logic.

The central dataset contribution is EchoGround-MIMIC, described as the first measurement-grounded multimodal echocardiography dataset. It contains 19,065 image-text pairs from 1,572 patients curated from MIMIC-IV-ECHO and MIMIC-IV-Note. The construction pipeline has four steps: it classifies the standardized ASE view, runs OCR on the embedded on-image measurement overlays, uses a large vision-language model to transcribe those overlays into structured JSON, and finally uses another LLM to extract only those report sentences that explicitly depend on, or are directly supported by, the extracted measurements (Li et al., 13 Dec 2025). Each sample in the final dataset is labeled with 9 ASE-defined disease categories, each graded ordinally from normal to severe, plus one of 22 standardized views.

The measurement extraction pipeline reduces raw OCR noise from 1,232 unique keys to 167 standardized measurements grouped into 11 anatomical categories: LV, LA, RV, RA, MV, TV, AV, PV, SV, pulmonary vein, and aorta. The paper reports that the final labels agree with rule-based checks in about 87% of cases, and that the remaining roughly 13% are borderline or subjective. This grounding is presented as the key distinction from report-aligned supervision: the text is explicitly anchored to the numbers displayed on the echo screen rather than to broad narrative report content (Li et al., 13 Dec 2025).

A representative example in the paper couples an A2C image with a structured measurement such as LAESV index 58.65 mL/m², a measurement-grounded caption such as “The left atrial volume index is severely increased,” and disease labels such as left atrial dilation and arrhythmia risk. Another example uses a parasternal long-axis view with measurements including IVSd, LVIDd, LVPWd, and LA dimensions, together with a caption indicating mild symmetric LV hypertrophy. These examples illustrate the intended supervision regime: view-specific anatomy, extracted measurements, grounded captioning, and guideline-derived labels (Li et al., 13 Dec 2025).
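The supervision regime described above can be pictured as a single structured record. The following is a minimal sketch only: the class name `EchoGroundRecord`, its field names, and the "severe" grade string are illustrative assumptions, not the dataset's published schema. The values are taken from the A2C example in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class EchoGroundRecord:
    """One image-text pair. Field names are hypothetical, not the real schema."""
    view: str                                   # one of the 22 standardized views
    measurements: Dict[str, Tuple[float, str]]  # name -> (value, unit)
    caption: str                                # measurement-grounded caption
    disease_grades: Dict[str, str] = field(default_factory=dict)


# Values from the A2C example; the grade label is an assumed encoding.
record = EchoGroundRecord(
    view="A2C",
    measurements={"LAESV index": (58.65, "mL/m^2")},
    caption="The left atrial volume index is severely increased.",
    disease_grades={"left atrial dilation": "severe"},
)

value, unit = record.measurements["LAESV index"]
print(f"{record.view}: LAESV index = {value} {unit}")
```

Keeping the measurement value and unit next to the caption makes the grounding explicit: the caption can be checked against the number it claims to describe.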

3. Objectives, architecture, and transfer behavior of the echocardiography model

Architecturally, this EchoVLM is built on a CLIP-style dual encoder. The image backbone is ViT-B, initialized from weights pretrained on an internal echo dataset, and the text encoder is initialized from EchoCLIP. Training uses a global batch size of 512, AdamW, learning rate 1e-4, weight decay 0.05, 20 pretraining epochs, a 200-step linear warmup, and 112×112 image resolution (Li et al., 13 Dec 2025).
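As an illustration of the reported schedule, a 200-step linear warmup to a learning rate of 1e-4 might look like the sketch below. The function name `warmup_lr` is hypothetical, and holding the rate constant after warmup is an assumption, since the text does not state what happens after the warmup phase.

```python
def warmup_lr(step: int, base_lr: float = 1e-4, warmup_steps: int = 200) -> float:
    """Linear warmup to base_lr over warmup_steps optimizer steps.

    Assumption: the rate is held constant afterwards; the text above does
    not specify a post-warmup decay schedule.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


for s in (0, 99, 199, 1000):
    print(s, warmup_lr(s))
```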

Its total objective augments standard image-text contrastive learning with two domain-specific losses:

$L = L_{\text{CLIP}} + \lambda_{\text{view}} L_{\text{view}} + \lambda_{\text{neg}} L_{\text{neg}},$

with $\lambda_{\text{view}} = 0.5$ and $\lambda_{\text{neg}} = 0.1$ in the main setup. The view-informed contrastive loss treats all same-view images in the batch as positives and all different-view images as negatives, encouraging intra-view compactness and inter-view separation. The negation-aware contrastive loss pushes the model to assign low similarity to affirmative-versus-negated caption pairs, addressing clinically critical distinctions such as “no systolic dysfunction” versus “mild left ventricular systolic dysfunction” (Li et al., 13 Dec 2025).
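To make the view-informed term concrete, the sketch below implements one plausible reading of it: a supervised-contrastive loss in which same-view images in a batch are positives and all other images are negatives. The supervised-contrastive form, the temperature value, and the function names are assumptions made for illustration; the paper's exact formulation is not reproduced here. Only the weights 0.5 and 0.1 come from the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def view_contrastive_loss(embeddings, views, temperature=0.1):
    """Supervised-contrastive style loss over image embeddings (assumed form).

    Same-view images are positives; all other images in the batch are negatives.
    Anchors with no same-view partner in the batch are skipped.
    """
    n = len(embeddings)
    losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and views[j] == views[i]]
        if not positives:
            continue
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(z) for z in logits.values()))
        losses.append(-sum(logits[p] - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


def total_loss(l_clip, l_view, l_neg, lambda_view=0.5, lambda_neg=0.1):
    """Weighted sum of the three terms, using the weights reported in the text."""
    return l_clip + lambda_view * l_view + lambda_neg * l_neg


emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(view_contrastive_loss(emb, ["A4C", "A4C", "PLAX", "PLAX"]))  # views match clusters
print(view_contrastive_loss(emb, ["A4C", "PLAX", "A4C", "PLAX"]))  # views cut across clusters
```

When the view labels line up with the embedding clusters, the loss is small; when they cut across clusters, it is large. This is the intra-view compactness and inter-view separation the text describes.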

The downstream evaluation spans five types of clinical applications with 36 tasks: multimodal disease classification, image-text retrieval, view classification, chamber segmentation, and landmark detection. On zero-shot disease classification on EchoGround-MIMIC, EchoVLM reaches 86.5% AUC, 34.2% precision, and 86.2% recall. For retrieval, it achieves top-5 recall of 2.98% and top-10 recall of 5.70%. On view classification, it reaches 95.1% accuracy, 95.3% F1, and 95.8% precision. On segmentation, it reports Dice scores of 93.1 on EchoNet-Dynamic, 92.4 on EchoNet-Pediatric A4C, 93.1 on EchoNet-Pediatric PSAX, 93.8 on CAMUS left ventricle segmentation, and 90.2 on CAMUS left atrium segmentation. On EchoNet-LVH landmark detection, reported examples include IVS average landmark error 1.70 mm and MAE 4.15 mm, and LVID average landmark error 3.04 mm and MAE 5.30 mm (Li et al., 13 Dec 2025).
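For readers unfamiliar with the segmentation metric, the Dice coefficient between a predicted and a reference binary mask can be sketched as follows. The function name and the flat-list mask representation are illustrative choices; the scores quoted above appear to be on a 0-100 scale, whereas this sketch returns a value in [0, 1].

```python
def dice_score(pred, target):
    """Dice coefficient for two equally sized binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption here): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(dice_score(pred, target))  # 2 * 1 / (2 + 1)
```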

The ablations emphasize that the most important gains do not come from generic multimodal alignment alone. When pretraining uses raw reports instead of curated grounded captions, zero-shot disease classification drops from 79.6% AUC to 54.4%, with precision dropping from 29.0% to 12.5%. Adding only $L_{\text{view}}$ to CLIP improves AUC from 79.6 to 80.7 and view accuracy from 92.9 to 94.0. Adding only $L_{\text{neg}}$ improves AUC to 83.3 and recall to 81.6. Combining both yields the full 86.5 AUC and 95.1% view accuracy (Li et al., 13 Dec 2025).

The paper is also explicit about limitations. EchoGround-MIMIC is derived from a single health system and a restricted time window; OCR can introduce parsing noise; caption generation and label extraction depend on LLMs rather than expert adjudication; and some clinically important categories requiring richer Doppler or hemodynamic context were excluded because the available measurement extraction was insufficient. Future work is directed toward multi-institutional data and temporal modeling (Li et al., 13 Dec 2025).

4. Dynamic Mixture-of-Experts EchoVLM for universal ultrasound intelligence

The second official EchoVLM is an ultrasound-specialized vision-language model intended to address organ-specific reasoning, multi-image context, and clinically standardized outputs across multiple sonographic tasks (She et al., 18 Sep 2025). The paper argues that models such as LLaVA-OneVision and Qwen2-VL, even when adapted to medicine, lack sufficient ultrasound-specific knowledge and struggle to generalize across multiple organs.

This EchoVLM is trained on a multicenter corpus collected from 15 hospitals. The dataset includes 208,941 clinical cases, 1.47 million key ultrasound frames, and 7 anatomical regions: thyroid, breast, liver, kidney, gynecology, vessel, and heart. The paper describes filtering that keeps only single-region images, removes images without corresponding reports, manually excludes reports unrelated to the target region, and removes reports without matching images. It further constructs 1.8 million instruction-tuning pairs through a Self-Instruct-style pipeline using 21 expert-designed templates spanning ultrasound descriptions, ultrasound diagnosis summaries, and question-answer pairs (She et al., 18 Sep 2025).

The architecture is built on Qwen2-7B with a CLIP-ViT-L/14 visual encoder, an MLP vision-to-text projection, and Dual-path MoE modules inserted into transformer blocks. Given an ultrasound frame $v \in \mathbb{R}^{H \times W \times 3}$, the visual encoder produces visual tokens $V \in \mathbb{R}^{M \times C}$ with $M = HW / 14^2$, which are projected to the LLM dimension. Text is embedded as $T \in \mathbb{R}^{N \times D}$, and the concatenated multimodal input is $X_0 = [V;T] \in \mathbb{R}^{(M+N)\times D}$ (She et al., 18 Sep 2025).
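As a quick arithmetic check of $M = HW / 14^2$, the helper below counts patch tokens for a given frame size. The function name is hypothetical. The 392×392 input used in the example is the resolution reported in the implementation details; any later token reduction (for example by a patch merger) is deliberately not modeled.

```python
def num_visual_tokens(height: int, width: int, patch: int = 14) -> int:
    """Number of patch tokens M = (H / patch) * (W / patch) before any merging."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return (height // patch) * (width // patch)


print(num_visual_tokens(392, 392))  # 28 * 28 = 784
```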

The key innovation is the Dual-path MoE. One path is a frozen copy of the original Qwen2 FFN, which serves as a stability anchor for general language and multimodal knowledge. The active path contains a shared expert $E_s$ that processes every token and routing experts $\{E_i\}_{i=1}^{n_e}$ activated by Top-k routing. Schematically, for a token representation $x$ with router gate weights $g_i(x)$, the combined output can be written as

$y = \mathrm{FFN}_{\text{frozen}}(x) + E_s(x) + \sum_{i \in \mathrm{TopK}(g(x))} g_i(x)\, E_i(x).$

The appendix specifies Top-2 over 4 routing experts, a shared expert dimension of 5632, and routing expert dimension 1408 (She et al., 18 Sep 2025). The stated intuition is to preserve generic multimodal reasoning while injecting ultrasound specialization.
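The Dual-path idea can be sketched for a single token as below. Several details are assumptions made for illustration and are not confirmed by the text: the three contributions are combined by plain summation, the Top-k gate weights are renormalized to sum to one, and the experts are passed in as ordinary Python callables. Only the Top-2-of-4 configuration comes from the description above.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def dual_path_moe(x, frozen_ffn, shared_expert, routed_experts, router_logits, k=2):
    """Frozen path + shared expert + gate-weighted Top-k routed experts.

    Assumptions: outputs are summed, and Top-k gates are renormalized.
    """
    gates = softmax(router_logits)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in top)
    out = [a + b for a, b in zip(frozen_ffn(x), shared_expert(x))]
    for i in top:
        weight = gates[i] / norm
        out = [o + weight * y for o, y in zip(out, routed_experts[i](x))]
    return out


def scale(c):
    """Toy expert: multiplies every feature by a constant."""
    return lambda v: [c * a for a in v]


# Toy setup: 4 routed experts, Top-2 routing, as in the reported configuration.
experts = [scale(10.0), scale(20.0), scale(30.0), scale(40.0)]
y = dual_path_moe([1.0], scale(1.0), scale(2.0), experts, [0.0, 0.0, 5.0, 5.0])
print(y)
```

In this toy run the router strongly prefers the last two experts, so only they contribute to the routed part, while the frozen and shared paths contribute for every token.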

5. Training regime, benchmarks, and empirical profile of the ultrasound model

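Training proceeds in two stages. In Stage I, all original Qwen2-VL parameters are frozen and only the new MoE modules are trained, with the goal of acquiring ultrasound-specific visual and textual patterns without damaging the base model. In Stage II, the model uses LoRA to adapt the base model lightly while MoE parameters continue full updates. The LoRA update takes the standard low-rank form

$W = W_0 + \Delta W = W_0 + \frac{\alpha}{r} B A,$

where $W_0$ is the frozen pretrained weight, $B \in \mathbb{R}^{d_{\text{out}} \times r}$ and $A \in \mathbb{R}^{r \times d_{\text{in}}}$ are the trainable low-rank factors, $r$ is the rank, and $\alpha$ is the scaling factor.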
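A minimal sketch of how a LoRA-adapted weight is formed, using the commonly used parameterization in which the frozen weight is offset by a scaled low-rank product. The function names and the list-of-lists matrix representation are illustrative; the defaults of rank 8 and alpha 16 match the implementation details reported for this model, while the alpha/rank scaling convention is an assumption based on standard LoRA practice.

```python
def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def lora_effective_weight(w0, b, a, alpha=16.0, rank=8):
    """Return W0 + (alpha / rank) * B @ A; W0 stays frozen, B and A are trained."""
    scaling = alpha / rank
    delta = matmul(b, a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


# Rank-1 toy example on a 2x2 weight (small sizes chosen for readability).
w0 = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [0.0]]   # d_out x r
a = [[0.0, 1.0]]     # r x d_in
print(lora_effective_weight(w0, b, a, alpha=2.0, rank=1))
```

If B is initialized to zeros, as is common, the adapted weight equals the frozen weight at the start of training, so Stage II begins from the Stage I model unchanged.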

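The total loss combines autoregressive language modeling and a load-balancing term:

$L_{\text{total}} = L_{\text{LM}} + \lambda_{\text{bal}} L_{\text{bal}},$

where $L_{\text{LM}}$ is the autoregressive language-modeling loss, $L_{\text{bal}}$ is the expert load-balancing loss, and the weight $\lambda_{\text{bal}}$ is set to the value given in the paper's appendix (She et al., 18 Sep 2025).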
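The load-balancing term can be illustrated with a widely used auxiliary loss of the Switch Transformer type, which multiplies the fraction of routing assignments each expert receives by its mean router probability. This specific form is an assumption, since the text above does not spell out the paper's definition. The weight `lambda_bal` is left as a required argument because its appendix value is not reproduced here, and the function names are hypothetical.

```python
def load_balance_loss(router_probs, top_k=2):
    """Auxiliary balance loss (assumed form): n_experts * sum_i f_i * p_i.

    router_probs: one probability list over experts per token.
    f_i: fraction of Top-k assignments sent to expert i.
    p_i: mean router probability for expert i.
    """
    n_tokens, n_experts = len(router_probs), len(router_probs[0])
    counts = [0] * n_experts
    for probs in router_probs:
        chosen = sorted(range(n_experts), key=lambda e: probs[e], reverse=True)[:top_k]
        for i in chosen:
            counts[i] += 1
    frac = [c / (n_tokens * top_k) for c in counts]
    mean_prob = [sum(p[i] for p in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(f * m for f, m in zip(frac, mean_prob))


def ultrasound_total_loss(lm_loss, balance_loss, lambda_bal):
    """Language-modeling loss plus the weighted load-balancing term."""
    return lm_loss + lambda_bal * balance_loss


balanced = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]
collapsed = [[0.45, 0.45, 0.05, 0.05], [0.45, 0.45, 0.05, 0.05]]
print(load_balance_loss(balanced), load_balance_loss(collapsed))
```

Under this form the loss is smallest when tokens spread evenly across experts and grows when routing collapses onto a few experts, which is the behavior a balancing term is meant to penalize.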

Implementation details include 392×392 image resolution, PatchMerger rate 4, LoRA rank 8, LoRA alpha 16, LoRA dropout 0.05, AdamW with weight decay 0.0, learning rates of 1e-3 in Stage I and 2e-5 in Stage II, cosine scheduling with warmup ratio 0.03, maximum sequence length 32,768, bf16 precision, gradient checkpointing enabled, DeepSpeed ZeRO-2, and training on H100-80G and 2×A100-80G GPUs. The model has 11B total parameters, of which roughly 3.39B are trainable (She et al., 18 Sep 2025).

The held-out test set contains 27,577 images and 3,000 reports. Evaluation uses greedy decoding and covers report generation, ultrasound diagnosis, and VQA, with BLEU-1, ROUGE-1, ROUGE-L, METEOR, and BERTScore as metrics. On report generation, EchoVLM reports average scores of 53.87 BLEU-1, 61.69 ROUGE-1, 55.78 ROUGE-L, 53.16 METEOR, and 71.38 BERTScore. Compared with Qwen2-VL-Ultrasound, which obtains 43.72 BLEU-1 and 56.92 ROUGE-1, the paper emphasizes gains of +10.15 BLEU-1 and +4.77 ROUGE-1. On ultrasound diagnosis, the reported averages are 62.81 BLEU-1, 72.51 ROUGE-1, 67.16 ROUGE-L, 67.67 METEOR, and 75.43 BERTScore. On VQA, the averages are 26.52 BLEU-1, 38.14 ROUGE-1, 28.47 ROUGE-L, 24.18 METEOR, and 49.69 BERTScore (She et al., 18 Sep 2025).
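For orientation, BLEU-1 is a clipped unigram precision multiplied by a brevity penalty. The sketch below is a single-reference, whitespace-tokenized version written for illustration; the tokenization and evaluation toolkit actually used in the paper are not stated here, so real scores will differ. It returns a value in [0, 1], whereas the figures above appear to be on a 0-100 scale.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Single-reference BLEU-1 with whitespace tokenization (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Clip each candidate word count by its count in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


print(bleu1("the liver is normal", "the liver is normal"))
print(bleu1("liver liver liver liver", "the liver is normal"))
```

The second call shows why clipping matters: repeating a correct word does not earn repeated credit.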

The ablations attribute the gains to modular specialization. Adding the shared expert improves report generation by +4.58 BLEU-1, +3.27 ROUGE-1, +3.45 ROUGE-L, +3.62 METEOR, and +1.27 BERTScore, and improves diagnosis by +3.48 BLEU-1, +5.05 ROUGE-1, +5.49 ROUGE-L, +2.75 METEOR, and +4.05 BERTScore. Comparing Top-1 and Top-2 routing, Top-2 improves report generation BLEU-1 by +3.94 and diagnosis ROUGE-L by +5.00. Increasing the number of experts from 0 to 2 to 4 raises report-generation BLEU-1 from 43.72 to 50.33 to 53.87, and diagnosis BLEU-1 from 58.70 to 59.66 to 62.81 (She et al., 18 Sep 2025).

The paper identifies several limitations: long-tail class imbalance, especially for vascular cases; VQA remains challenging, particularly for multi-step reasoning; some case studies show false negatives in nodule detection; attention can drift to irrelevant borders or black regions; and routing plus data distribution may require further improvement. The clinical framing is supportive rather than substitutional: the model is positioned as a practical step toward clinically usable AI for ultrasound, not a replacement for expert review (She et al., 18 Sep 2025).

6. Relation to adjacent “Echo” systems and recurrent misconceptions

Several nearby “Echo”-prefixed systems are technically distinct from EchoVLM. EchoTrust is an evidence-driven Actor-Verifier framework for trustworthy reasoning in echocardiography VLM-based agents rather than a model named EchoVLM; it introduces a structured intermediate representation, a verifier, and a retry actor for MIMICEchoQA, achieving 0.76 accuracy versus 0.62 for actor-only inference (Huang et al., 7 Apr 2026). EchoSight is a multimodal retrieval-augmented generation framework for knowledge-based VQA using wiki retrieval and reranking rather than a medical ultrasound model (Yan et al., 2024). ECHO is a hybrid RL objective for terminal agents that adds an environment-prediction loss to GRPO (Shrivastava et al., 23 May 2026). Echo is a Large Audio LLM with audio-interleaved reasoning built on Qwen2.5-Omni (7B) (Wu et al., 12 Feb 2026).

The most persistent misconception in the provided material concerns autonomous driving. The system in “Listen, Look, Drive: Coupling Audio Instructions for User-aware VLA-based Autonomous Driving” is officially EchoVLA, not EchoVLM. The paper explicitly states that “VLA” matches its framing as a Vision-Language-Action model and that “VLM” would omit the action component (Guo et al., 17 Jan 2026).

Taken together, the two official EchoVLM papers occupy different points in the medical multimodal modeling landscape. The echocardiography variant emphasizes measurement grounding, view-aware contrastive structure, and negation-sensitive semantics (Li et al., 13 Dec 2025). The ultrasound variant emphasizes domain specialization through Dual-path MoE, multi-center scale, and multi-task sonographic generation and reasoning (She et al., 18 Sep 2025). This suggests that “EchoVLM” is best treated not as a single architecture family with fixed design rules, but as a reused name attached to distinct multimodal systems for different ultrasound-centered clinical objectives.