
VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation

Published 14 Nov 2025 in cs.CV and cs.LG | (2511.11450v1)

Abstract: We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell

Summary

  • The paper introduces VoxTell, a universal 3D segmentation model that incorporates free-text prompts to generate accurate masks from CT, MRI, and PET images.
  • It uses a UNet-style encoder with multi-stage integration of prompt embeddings at every decoding stage, resulting in significant Dice score improvements over prior methods.
  • The work leverages a large, harmonized multi-modality dataset with over 9,000 label variants to achieve robust performance even on unseen anatomical and pathological classes.

VoxTell: Universal Free-text Promptable 3D Medical Image Segmentation

Introduction and Context

VoxTell (2511.11450) presents a vision–language model for volumetric medical image segmentation that accepts arbitrary free-text prompts—ranging from single words to full-sentence clinical descriptions—and produces 3D segmentation masks in CT, MRI, and PET images. This work addresses critical shortcomings in both classical and prior text-promptable segmentation models: the inability to robustly generalize beyond predefined ontologies or limited phrasing, lack of resilience to linguistic variability, and poor handling of complex, instance-specific queries. Existing paradigms, such as SAM-based methods and previous text-prompted models, are constrained by a closed-set label mentality and late-stage fusion architectures, which lead to prompt-agnostic representations and brittle performance outside narrow training domains.

VoxTell is proposed as a universal 3D segmentation framework, leveraging multi-stage vision–language fusion across all decoder layers and integrating one of the largest, most semantically enriched 3D medical training corpora constructed to date. The approach is evaluated on rigorous open-world benchmarks: performance is reported exclusively on held-out datasets with both familiar and never-seen structures, across all main medical imaging modalities, incorporating real-world radiology report prompts and diverse anatomical and pathological targets.

Model Architecture and Multi-Stage Vision–Language Fusion

VoxTell employs a UNet-style encoder for volumetric data, adapted to 3D medical modalities. A pre-trained, instruction-tuned text encoder (Qwen3-Embedding-4B) produces prompt embeddings, which serve as the query input to a transformer-based prompt decoder. Crucially, unlike prior approaches (e.g., MaskFormer, SAT), VoxTell injects textual conditioning not only at the output but at every decoding stage via a dedicated multi-scale adapter module.

At each UNet decoding stage, text features modulate image features through channel-wise dot products, extending MaskFormer-style query–image fusion deeply into the network's hierarchy. Deep supervision is enforced at all scales, ensuring prompt information is available to intermediate representations and compelling the model to maintain prompt sensitivity through the entire decoding process (Figure 1).

Figure 1: The VoxTell pipeline directly maps free-text prompts to volumetric masks via multi-stage vision–language fusion throughout the decoder; textual features are injected at all scales for prompt alignment.
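The channel-wise fusion described above can be sketched in a few lines of NumPy. The shapes, function names, and use of plain arrays here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def fuse_text_image(image_feats: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Channel-wise dot product between a text embedding and image features.

    image_feats: (C, D, H, W) decoder feature map at one scale.
    text_emb:    (C,) prompt embedding projected to the same channel width.
    Returns a (D, H, W) prompt-similarity map, in the spirit of
    MaskFormer-style query-image fusion applied at a decoder stage.
    """
    # Contract over the channel axis: each voxel's feature vector is
    # compared against the text embedding.
    return np.einsum("cdhw,c->dhw", image_feats, text_emb)

def multi_stage_fusion(decoder_feats, text_embs):
    """Apply text conditioning at every decoder scale, so each
    intermediate representation (and its deep-supervision loss)
    sees the prompt."""
    return [fuse_text_image(f, t) for f, t in zip(decoder_feats, text_embs)]
```

Each per-scale similarity map would then feed a deep-supervision loss, which is what keeps intermediate features prompt-sensitive rather than prompt-agnostic.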

This design enables spatially grounded, semantically rich, and highly prompt-adaptive segmentation, allowing the model to parse not just ontological categories but also complex, location-aware, and instance-specific natural language—e.g., “spiculated lesion in the left upper lobe with pleural contact.”

Dataset, Vocabulary, and Standardization Pipeline

VoxTell training utilizes a newly curated multi-modality, multi-dataset corpus comprising over 62,000 CT, MRI, and PET volumes and encoding more than 1,000 unified anatomical and pathological concepts. This dataset is constructed through the aggregation and rigorous harmonization of 158 public sources, spanning comprehensive semantic coverage (organs, tissues, lesions, vessels, substructures) and pathologies across all body regions.

A semantic label expansion, harmonization, and human curation pipeline ensures that each label is linked to an extensive set of synonyms and rephrasings (over 9,000 label variants), supporting robust generalization to natural clinical language and edge-case queries (Figure 2).

Figure 2: The comprehensive dataset aggregates 158 sources, covering over 1,000 anatomical/pathological concepts across modalities and anatomical regions, supporting language-driven segmentation.

The iterative process for standardizing label definitions across disparate sources is illustrated in Figure 3.

Figure 3: The iterative label harmonization workflow generates standardized and semantically consistent label variants, ensuring robust prompt grounding.
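The core idea of resolving free-text variants to one canonical label can be sketched with a toy lookup table; the names and entries below are hypothetical stand-ins for the paper's 9,000+ curated variants:

```python
# Hypothetical miniature of the harmonized vocabulary: each canonical
# concept maps to the textual variants that should resolve to it.
SYNONYMS = {
    "liver": ["liver", "hepatic organ", "hepar"],
    "left kidney": ["left kidney", "left renal organ"],
}

def canonicalize(prompt, table):
    """Resolve a free-text prompt to its canonical label, or None if unknown."""
    query = prompt.strip().lower()
    for canonical, variants in table.items():
        if query in (v.lower() for v in variants):
            return canonical
    return None
```

At training time, any sampled variant would supervise the same mask as its canonical label, which is what teaches the model that "hepatic organ" and "liver" denote the same target.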

Experimental Evaluation and Results

State-of-the-Art Zero-shot Segmentation

VoxTell achieves consistently superior Dice scores over all previous text-promptable baselines and classic models such as TotalSegmentator across a broad array of unseen datasets and anatomical targets—healthy, pathological, multi-modality, and rare class scenarios. Notably:

  • On structures familiar from training, such as abdominal organs and liver tumors, Dice scores approach the high 80s to low 90s.
  • Across all evaluated pathologies, including rare diseases (adrenal tumors, multiple sclerosis, fractures), VoxTell remains robust, with prior models frequently failing.
  • For “cross-modality” transfer (e.g., segmenting known classes in a never-seen modality), VoxTell achieves strong Dice scores (e.g., 72.3 for breast cancer in PET) where other models essentially fail (Dice < 1).

Robustness to Prompt Variations

A core strength—absent in previous systems—is prompt resilience: VoxTell delivers stable segmentation across synonyms, paraphrasings, and misspellings, even for prompts and object classes never seen during training. No significant performance drop is observed for unfamiliar formulations (Figure 4).

Figure 4: Across multiple prompt variants and synonyms for the same anatomical structure, VoxTell maintains consistently high Dice, whereas baselines are highly variable and often fail on unseen phrasings.
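One way to quantify this kind of prompt resilience is to summarize Dice scores across phrasings of the same target; the helper below is a minimal sketch assuming per-prompt scores are already computed (names and numbers are illustrative):

```python
import statistics

def prompt_robustness(dice_by_prompt):
    """Summarize Dice stability across prompt variants for one structure.

    dice_by_prompt maps each phrasing (synonym, paraphrase, misspelling)
    to the Dice score obtained with that prompt. A robust model shows a
    high mean, a small spread, and a worst case close to the mean.
    """
    scores = list(dice_by_prompt.values())
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "worst": min(scores),
    }
```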

Generalization to Unseen Classes and Out-of-distribution Concepts

VoxTell segments never-seen or rare classes—e.g., “esophageal cancer” and “bladder cancer,” with Dice scores of 69.1 and 25.8, respectively—by leveraging its language-conditioned feature space, whereas all previous works fail or achieve near-zero scores.

Instance- and Report-level Segmentation

On ReXGroundingCT (linking radiology report findings to 3D segmentations) and an in-house radiotherapy cohort with sentence-level clinical prompts, VoxTell vastly outperforms prior state-of-the-art models such as SAT. For complex descriptive queries, Dice rises from 0.0–1.2 (SAT, BiomedParseV2) to above 50 (VoxTell), demonstrating exact grounding of nuanced sentence-level clinical language (Figure 5).

Figure 5: VoxTell outperforms prior SoTA on ReXGroundingCT, achieving higher Dice and instance-level hit-rate for segmentation from free-text chest CT findings.


Figure 6: Accurate qualitative segmentations by VoxTell for (a) known anatomical structures, (b) unseen pathologies, and (c) complex clinical report sentences, where baselines fail or degrade sharply.

Ablations: Multi-stage Fusion and Training Scale

Comprehensive ablations establish that multi-stage vision–language fusion and deep supervision explain the majority of VoxTell's performance gains (10–15 Dice points over single-stage/late fusion) and its prompt responsiveness. Scaling the batch size from 2 to 128 on 64 GPUs yields a further improvement of more than 7 Dice points.

Discussion and Implications

The paper emphasizes several bold claims substantiated by strong empirical evidence:

  • VoxTell is the first universal 3D text-promptable segmentation model to achieve high-fidelity, robust performance on both known and never-seen anatomical/pathological classes, across all major medical imaging modalities, with full resilience to prompt variability.
  • The architectural paradigm—repeated injection of prompt information across the decoding hierarchy—is key to aligning language and volumetric representations, outperforming both closed-set and prior open-vocabulary design patterns.
  • The large-scale, harmonized, and synonym-augmented vocabulary is essential for achieving prompt robustness and zero-shot generalization, proposing a new best practice for data curation in medical vision–language models.

For practical translation, VoxTell enables workflow efficiencies by permitting radiologists and clinicians to segment any structure or anomaly with natural language, reducing reliance on manual tooling or hard-coded ontologies. The framework unlocks straightforward integration of clinical report data and enables new paradigms for automated annotation and decision support in medical imaging.

The limitations identified by the authors—primarily, residual gaps for entirely out-of-distribution anatomical regions (complete absence from training)—point toward directions for prompt-augmented few-shot learning, domain adaptation, or self-supervised expansion using report–image pairings. Nonetheless, for semantically related or linguistically proximate concepts, VoxTell already demonstrates substantial open-set capabilities.

Conclusion

VoxTell introduces a robust, text-conditioned 3D segmentation architecture founded on multi-stage vision–language fusion and large-scale, harmonized data. It achieves state-of-the-art results over all major baselines for zero-shot, cross-modality, and instance-level segmentation from free-form clinical language. This work establishes new standards for generalization, linguistic robustness, and promptability in volumetric medical image segmentation, with strong implications for the design of future universal vision–language models and clinical AI systems.


Explain it Like I'm 14

Overview

This paper introduces VoxTell, a computer program that can “read” text and use it to find and highlight specific parts inside 3D medical scans (like CT, MRI, and PET). Instead of clicking on points or drawing boxes, a doctor can type a simple word (“liver”) or a full sentence (“small tumor in the right upper lung”), and VoxTell will try to outline that area in the scan.

What were the researchers trying to find out?

The team wanted to answer a few simple questions:

  • Can a single model segment many different body parts and diseases in 3D scans using only free text?
  • Will it work across different scan types (CT, MRI, PET), even if it hasn’t seen that exact type in training?
  • Can it understand real clinical language, including synonyms, misspellings, and detailed sentences like those in radiology reports?
  • Will it still work on new or rare things it hasn’t been trained on?

How did they do it?

Think of a 3D scan like a stack of slices that make up a loaf of bread—the model needs to figure out which slices and areas belong to a body part or a lesion. VoxTell works in three main steps:

  1. Text understanding:
    • A powerful “text encoder” reads the prompt (from one word to full sentences) and turns it into numbers the computer can use. This is like translating the doctor’s words into a guide for the model.
  2. Image understanding:
    • A 3D “UNet-style” network looks at the scan at different zoom levels (coarse to fine), like starting with a map and then zooming in to street level. This helps the model understand both the big picture and details.
  3. Multi-stage vision–language fusion:
    • Instead of waiting until the end, VoxTell mixes the text guidance with the image features at multiple steps while it’s decoding the image. Picture a teacher giving instructions not just at the end of an assignment, but at every step—this keeps the model aligned with the text. The paper calls this “multi-stage fusion.”
    • It also uses “deep supervision,” which is like grading the homework at several checkpoints, not just at the final answer. That encourages the model to follow the prompt from early on.

Training and data:

  • VoxTell was trained on more than 62,000 3D scans across CT, MRI, and PET and over 1,000 different anatomical and disease concepts. The team also cleaned up and expanded the vocabulary, so it understands many ways to say the same thing (like “hepatic organ” for “liver”).

How they measured success:

  • They used the Dice score, which tells how much the model’s outline overlaps with the true outline. Imagine two shapes: if they overlap perfectly, Dice is 100; if they don’t overlap at all, it’s 0. Higher is better.
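For readers who want to see the arithmetic, here is a minimal NumPy sketch of the Dice score on two binary 3D masks (the 0–100 scaling follows the explanation above; this is an illustration, not the paper's evaluation code):

```python
import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice overlap between two binary 3D masks, scaled to 0-100."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    overlap = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    if total == 0:  # both masks empty: treat as perfect agreement
        return 100.0
    return 100.0 * 2.0 * overlap / total
```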

What did they find, and why is it important?

Here are the main results, explained simply:

  • Strong performance across many structures and diseases:
    • VoxTell beat other text-based segmentation models on many datasets in CT, MRI, and PET, including organs and lesions. It also did better than traditional tools that don’t use text.
  • Works on new scan types and unseen concepts:
    • Even when the model saw a familiar structure in a new modality (like MRI instead of CT), it still performed well.
    • It could segment some structures and cancers it hadn’t been trained on (for example, esophageal tumors and bladder cancer), showing it can generalize.
  • Understands different ways of saying the same thing:
    • VoxTell stayed accurate even with synonyms, rephrasings, and minor typos. That’s important because clinical language is varied and detailed.
  • Handles real radiology report sentences:
    • On a set of lung cancer patients, VoxTell used full sentences from reports (like “spiculated carcinoma in the left upper lobe with pleural contact”) to find the tumor, significantly outperforming other models.
  • Instance-specific segmentation:
    • On the ReXGroundingCT benchmark, which tests whether models can find exactly the right lesion mentioned in text, VoxTell scored higher than the previous top method. This shows it can use text to locate the specific instance described.
  • Why it works:
    • Ablation studies (controlled comparisons) showed that mixing text and image features at multiple stages and using deep supervision both make a big difference. Scaling up training also helps.

What does this mean going forward?

VoxTell brings us closer to “tell me what to find” medical imaging, where:

  • Doctors and radiologists can use natural language to find and measure organs or lesions, saving time and reducing manual clicking.
  • The model can link to existing clinical reports, helping automate parts of diagnosis and treatment planning (like outlining tumors for radiotherapy).
  • It shows progress toward open-set segmentation: handling new concepts or scan types without being explicitly trained on them.

In short, VoxTell could make medical imaging more flexible and efficient by letting clinicians use plain language to guide powerful 3D segmentation, potentially improving care and speeding up workflows.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper.

  • Prospective clinical validation: No prospective or randomized studies demonstrate clinical utility (e.g., time savings, planning accuracy, outcome impact) compared to radiologist workflows; define task-specific acceptance thresholds and compare against expert interobserver variability.
  • Safety mechanisms at inference: The model lacks calibrated uncertainty estimates, confidence maps, and explicit refusal/OOD detection for unsupported or ambiguous prompts; evaluate false-positive behavior on absent-target queries and implement safeguards.
  • Robust handling of complex clinical language: Prompt robustness is shown for synonyms and minor typos, but not for negation, uncertainty (“likely,” “cannot rule out”), temporality (“interval growth”), multi-target conjunctions (“liver and spleen”), quantifiers (“three nodules”), or relational phrases (“lesion adjacent to pleura”); design tests and model components to parse these constructs.
  • Multilingual and cross-institutional language generalization: Evaluation is limited to English; assess performance on other languages (e.g., German—given the institutional context), mixed-language reports, and institution-specific jargon, including domain adaptation strategies.
  • Automated report-to-prompt pipeline: The method uses report-derived prompts but does not specify an automated extraction pipeline (coreference resolution, entity disambiguation, section filtering); benchmark end-to-end report parsing accuracy and its effect on segmentation.
  • Instance-level segmentation beyond single findings: It is unclear how the model handles multiple instances (counting, distinct masks, selection by attributes such as size or laterality); evaluate multi-instance prompts and instance differentiation without additional fine-tuning.
  • Negative and adversarial prompts: While negative prompts are sampled during training, there is no systematic evaluation of model behavior under adversarial phrasing, logically inconsistent instructions, or contradictory statements; develop adversarial and stress test suites.
  • Vocabulary harmonization reliability: The LLM-based label expansion and cross-dataset harmonization may introduce semantic drift or errors; quantify mapping accuracy, human validation rates, and downstream impact of noisy or inconsistent label semantics.
  • Label definition ambiguity and scope: The paper does not clarify how heterogeneous label definitions across datasets (e.g., “liver” including/excluding lesions) were resolved in training targets; assess sensitivity to differing ontologies and provide standardized definitions.
  • Spatial grounding and orientation consistency: Robustness to varying DICOM orientation conventions, laterality inconsistencies, and patient positioning is not evaluated; audit orientation handling and quantify left/right errors across datasets.
  • Generalization to additional modalities and acquisition types: Only CT, MRI, PET are tested; evaluate ultrasound, cone-beam CT, low-dose CT, contrast phases, and 4D/temporal sequences (e.g., dynamic PET, cardiac cine MRI).
  • Performance on rare tail categories: Although gains are reported, rare or low-prevalence classes are not analyzed separately; provide per-tail metrics and sample efficiency comparisons (few-shot or zero-shot settings).
  • Boundary and small-object accuracy: Metrics are limited to Dice and HIT5%; report boundary-sensitive metrics (e.g., Hausdorff distance), detection performance for small lesions, and clinical risk analysis for boundary errors.
  • Calibration vs. scale: Ablation suggests benefits from multi-stage fusion and deep supervision, but disentangling architectural gains from pure scale (batch size 128 on 64 A100s) is incomplete; run controlled scaling experiments across baselines.
  • Computational efficiency and deployability: Training requires substantial compute; profile inference latency, memory footprint, and explore compression/quantization/pruning for clinical deployment on commodity hardware.
  • End-to-end text–vision co-training: The text encoder is frozen; investigate end-to-end co-training, domain-specific text pretraining, or multimodal alignment objectives to improve complex prompt understanding.
  • Synergy with spatial prompts: The model is text-only at inference; evaluate hybrid interaction (text + points/boxes/scribbles) for more precise control, especially on challenging or ambiguous targets.
  • Handling of multiple targets and compositional prompts: No experiments demonstrate segmentation for compound instructions (“segment liver and spleen, exclude lesions,” “only enhancing components”); enable set-based outputs and compositional operators (include/exclude/subset).
  • Adaptation to unseen concepts: Unseen-class performance varies widely (e.g., bladder cancer 25.8 Dice); explore few-shot adaptation, concept expansion via weak supervision, and knowledge-graph/ontology guidance to improve open-set coverage.
  • OOD dataset breadth and demographics: Although evaluations use unseen datasets, the breadth of institutions, scanners, patient demographics (age, sex, race), and disease prevalence is not analyzed; conduct stratified robustness and fairness audits.
  • Preprocessing sensitivity: The effect of intensity normalization, resampling, cropping, and artifact handling on text-prompted performance is not reported; quantify preprocessing sensitivities and standardize pipelines.
  • Error analysis and failure modes: Provide systematic analyses of common failure patterns (e.g., low contrast, motion artifacts, overlapping tissues), with targeted remediation strategies.
  • Explainability and interpretability: There is no mechanism to visualize how text influences spatial predictions (e.g., cross-modal attention maps, gradient-based attribution); develop tools to make decisions auditable for clinicians.
  • Clinical acceptance criteria: Define clinically meaningful thresholds per task (e.g., tumor contouring tolerances) and assess whether reported Dice values meet those criteria.
  • Data and label provenance: The instance-level dataset partially derives instances via TotalSegmentator; quantify conversion errors and their effect on training and evaluation, and report provenance per class.
  • Regulatory and reproducibility considerations: Clarify model weight availability, deterministic inference settings, and documentation needed for regulatory pathways; establish standard splits and benchmarks for reproducibility.

Practical Applications

Immediate Applications

Below are actionable use cases that can be deployed now, organized by sector. Each item notes key assumptions or dependencies that influence feasibility.

  • Healthcare (Radiology)
    • Free-text, report-driven lesion and organ segmentation within PACS and radiology viewers
    • Example workflow: A radiologist types “spiculated carcinoma in the left upper lobe” or imports the report text; VoxTell returns a 3D mask aligned to the described finding (validated on ReXGroundingCT and a held-out SBRT cohort).
    • Tools/products: PACS plugin; “Report-to-Mask” API; radiology viewer extension for text prompts.
    • Dependencies: Integration with PACS/RIS; GPU inference; clinician-in-the-loop QA; site-specific validation; regulatory compliance for clinical use.
    • Semi-automated contouring for radiation therapy planning
    • Example workflow: Dosimetrist inputs “GTV of right lung primary” or “mediastinal nodes level 2R”; VoxTell produces candidate contours for review/edit.
    • Tools/products: RT planning system add-on; “ContourAssist” pre-contour generator.
    • Dependencies: Plan system integration (DICOM RTSTRUCT), QA policies, contouring consensus guidelines, liability/approval pathways.
    • Cross-modality segmentation support (CT/MRI/PET) without retraining
    • Example workflow: Segment “pancreatic tumor” in MRI even if trained predominantly on CT; useful for sites with mixed modality protocols.
    • Tools/products: Modality-agnostic segmentation service “MultiMod-Seg.”
    • Dependencies: Local performance audits across scanners/sequences; calibration for uncommon modalities; GPU memory for 3D volumes.
    • Longitudinal lesion tracking and change detection driven by report language
    • Example workflow: Extract prompts from prior and current reports (“left adrenal metastasis”); segment and compute volumetric changes.
    • Tools/products: “LesionTracker” pipeline (report extraction → segmentation → volume trend).
    • Dependencies: Reliable report parsing (NLP), consistent acquisition protocols, identity matching across timepoints.
  • Healthcare (Operations and Quality)
    • Report–image consistency checks (QA)
    • Example workflow: Automatically verify that report claims (“no pleural effusion”) are consistent with segmentation outputs; flag discrepancies.
    • Tools/products: “ReportQA” auditor for structured QA dashboards.
    • Dependencies: Thresholding strategies; false-positive handling; governance for escalation; medico-legal considerations.
    • Coding and billing assistance for imaging findings
    • Example workflow: Segmentations derived from text prompts are mapped to codes (e.g., SNOMED/ICD) to support claim accuracy.
    • Tools/products: “Findings-to-Code” assistant.
    • Dependencies: Ontology alignment; payer policy; compliance and audit trails.
  • Academia (Medical Imaging Research)
    • Rapid dataset curation and label harmonization
    • Example workflow: Use VoxTell’s vocabulary expansion/harmonization pipeline to unify labels across multi-dataset cohorts; accelerate annotation and re-labeling.
    • Tools/products: “Vocab Harmonizer” scripts; semi-automatic relabeling toolkit.
    • Dependencies: Domain expert validation of harmonized labels; provenance tracking; public dataset licenses.
    • Open-vocabulary segmentation baselines for OOD evaluation
    • Example workflow: Employ VoxTell as a baseline for rigorous OOD testing across new datasets/modalities; use its multi-stage fusion as a reference design.
    • Tools/products: Benchmark recipes; reproducible evaluation scripts.
    • Dependencies: Compute capacity; curated OOD datasets; standardized metrics.
  • Education and Training
    • Interactive anatomy and pathology teaching with free-text prompts
    • Example workflow: Students query “hepatic vasculature,” “right rib cage,” “multiple sclerosis lesions,” and visualize 3D masks on real scans.
    • Tools/products: Viewer for teaching hospitals; “AnatomyPrompt” classroom tool.
    • Dependencies: De-identified teaching data; supervision by faculty; clear disclaimers.
  • Policy and Governance
    • Institutional AI-assisted segmentation guidelines and guardrails
    • Example workflow: Draft SOPs for radiologist/dosimetrist oversight, error reporting, and validation steps when using text-prompted segmentation.
    • Tools/products: Policy templates; validation protocols; bias/robustness checklists.
    • Dependencies: Clinical validation data; stakeholder buy-in; alignment with regulatory guidance.
  • Daily Life (Patient Engagement)
    • Patient portal visualizations aligned to report text
    • Example workflow: Patients view 3D masks of findings (“kidney cyst”) from their reports for education and shared decision-making.
    • Tools/products: Patient-friendly viewer embedded in portals.
    • Dependencies: Clinician approval workflows; health literacy considerations; secure data access; robust consent.
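The volume-trend step of the longitudinal lesion-tracking workflow sketched earlier reduces to counting mask voxels and applying the voxel spacing; the helpers below are hypothetical, assuming binary masks and isotropic or known spacing in millimetres:

```python
import numpy as np

def lesion_volume_ml(mask: np.ndarray, spacing_mm=(1.0, 1.0, 1.0)) -> float:
    """Volume of a binary lesion mask in millilitres.

    spacing_mm gives the per-axis voxel edge length in mm;
    1000 mm^3 = 1 ml.
    """
    voxel_ml = (spacing_mm[0] * spacing_mm[1] * spacing_mm[2]) / 1000.0
    return float(mask.astype(bool).sum()) * voxel_ml

def volume_change(prior_mask, current_mask, spacing_mm=(1.0, 1.0, 1.0)):
    """Absolute (ml) and relative volume change between two timepoints."""
    v0 = lesion_volume_ml(prior_mask, spacing_mm)
    v1 = lesion_volume_ml(current_mask, spacing_mm)
    rel = (v1 - v0) / v0 if v0 > 0 else float("inf")
    return v1 - v0, rel
```

In practice both masks would come from segmenting the same prompt on registered prior and current scans, with the resulting trend surfaced for clinician review.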

Long-Term Applications

The following applications require further research, scaling, or development before robust deployment.

  • Healthcare (Clinical Decision Support)
    • End-to-end report-grounded planning and documentation
    • Vision-language pipelines that convert narrative findings into structured segmentations, measurements, and standardized reporting automatically.
    • Potential products: “ReportGround” end-to-end pipeline (report extraction → segmentation → measurements → structured report).
    • Dependencies: High-fidelity NLP extraction; evidence generation; regulatory approval; integration with EHR and reporting systems.
    • Open-set, rare pathology segmentation across institutions and languages
    • Robust performance on truly unseen categories across demographic, scanner, and language variations, including non-English clinical text.
    • Dependencies: Multilingual text encoders; diverse training corpora; fairness/bias audits; domain shift calibration.
    • Surgical, interventional, and intraoperative guidance
    • Real-time text-driven segmentation for navigation (e.g., “arterial phase liver lesion near segment VIII”).
    • Dependencies: Real-time inference; hardware (OR-grade GPUs); co-registration with tracking systems; safety certification.
  • Healthcare (Population Health and Research)
    • Hospital- and population-scale imaging phenotyping
    • Auto-segmentation of large archives based on free-text queries (e.g., “coronary calcifications,” “vertebral fractures”) to study prevalence and outcomes.
    • Dependencies: High-throughput compute; scalable storage; governance for secondary use; robust de-identification.
    • Clinical trial automation
    • Eligibility pre-screening using report-derived segmentations (e.g., “solitary lung nodule <3 cm”) and longitudinal response quantification.
    • Dependencies: Trial-specific criteria validation; legal/ethical approvals; inter-site generalization.
  • Software and Robotics
    • Generalization of multi-stage vision–language fusion to non-medical 3D domains
    • Voice/text-guided segmentation for industrial CT, materials inspection, geoscience volumes, and 3D mapping in robotics.
    • Potential products: “OpenVLM-3D” adaptable segmentation frameworks.
    • Dependencies: Domain-specific retraining; dataset availability; safety standards for industrial/robotic deployment.
    • Integration with multimodal LLMs for reasoning-aware segmentation
    • Systems that reason over clinical context (history, labs, protocols) to refine text prompts and segmentation targets dynamically.
    • Dependencies: Reliable multimodal reasoning; hallucination safeguards; clinical governance.
  • Academia (Methods and Benchmarks)
    • Standardized benchmarks for report-grounded instance segmentation
    • Larger, multi-institution datasets linking narrative findings to 3D masks across modalities and languages.
    • Dependencies: Data-sharing consortia; annotation tooling; funding and governance frameworks.
    • Methodological advances in multi-scale fusion and deep supervision
    • Extending the architecture to jointly learn spatial grounding, uncertainty quantification, and causal language–vision links.
    • Dependencies: New losses, calibration metrics, and interpretability methods.
  • Policy and Regulation
    • Regulatory pathways, reimbursement models, and liability frameworks for text-prompted segmentation
    • Standards for validation, post-market surveillance, and coding (e.g., CPT) of AI-assisted segmentation outputs.
    • Dependencies: Clinical trials; health technology assessment; consensus standards (DICOM extensions for prompt provenance; auditability).
  • Daily Life (Consumer Health and Education)
    • AR/VR tools for personalized imaging explanations
    • Patients explore their own scans with narrated, text-driven overlays for pre-surgical counseling or chronic disease monitoring.
    • Dependencies: Usability studies; accessibility; strong safeguards to prevent misunderstanding; clinician mediation.

Cross-cutting assumptions and dependencies

  • Robustness and generalization: Performance depends on prompt quality, text encoder coverage (including multilingual/clinical jargon), and the match between local imaging protocols and training distributions.
  • Safety and oversight: Clinician-in-the-loop review remains critical; deploy only with local validation, QA workflows, and clear disclaimers.
  • Integration: Successful adoption requires seamless PACS/EHR/DICOM RTSTRUCT interoperability, secure data handling, and MLOps support.
  • Compute and scalability: 3D inference can be GPU-intensive; throughput and latency must meet clinical workflow constraints.
  • Governance: Privacy, consent, bias/fairness, and auditability need institutional policies and continuous monitoring.

Glossary

Below is an alphabetical list of advanced domain-specific terms appearing in the paper, each with a brief definition and a verbatim example from the text.

  • Ablation study: A controlled experiment that systematically removes or alters components to quantify their impact on performance. "Ablation Study on Text–Image Fusion."
  • Autosegmentation: Automated generation of image segmentation masks without manual delineation. "a classical autosegmentation tool without text prompts."
  • Boltzmann-distributed attention: An attention mechanism whose weights follow a Boltzmann distribution to emphasize certain targets (e.g., small objects). "Boltzmann-distributed attention."
  • Closed-set segmentation: Segmentation restricted to a predefined set of categories known at training time. "closed-set segmentation models"
  • Cross-dataset harmonization: Standardizing and reconciling label definitions across multiple datasets. "cross-dataset harmonization"
  • Cross-entropy loss: A probabilistic loss function commonly used for classification and segmentation, often combined with Dice loss. "cross-entropy losses"
  • Cross-modality transfer: Generalizing a model’s knowledge across different imaging modalities (e.g., CT, MRI, PET). "strong cross-modality transfer"
  • Deep supervision: Applying auxiliary losses at intermediate network layers to improve gradient flow and feature learning. "with deep supervision."
  • Dice coefficient: An overlap metric for segmentation masks, measuring similarity between prediction and ground truth. "Dice coefficient."
  • Encoder skip connection: A UNet link that concatenates encoder features with decoder features at corresponding scales. "encoder skip connection z_s"
  • Foundation model: A large, broadly pre-trained model adaptable to many tasks via fine-tuning or prompting. "foundation models for medical images"
  • Free-text prompt: An unconstrained natural-language query used to guide the model’s segmentation. "free-text prompts"
  • Gross tumor volume (GTV): The visible or palpable extent of tumor used for radiotherapy planning. "gross tumor volume (GTV)"
  • HIT_{5%}: A hit-rate metric counting cases where the predicted segmentation achieves at least 5% Dice. "HIT_{5%}, the fraction of samples achieving Dice ≥ 5%."
  • Instance-level annotation: Labels identifying and separating individual object occurrences of the same class. "instance-level annotations"
  • Instance-specific segmentation: Segmenting a particular object instance described by text rather than a general class. "instance-specific segmentation from real-world text."
  • Mask2Former: A transformer-based segmentation architecture extending MaskFormer with masked attention for universal segmentation across tasks. "Mask2Former-style architecture"
  • MaskFormer: A transformer-based approach that uses queries to produce segmentation masks via feature fusion. "MaskFormer"
  • Modality shift: A distribution change caused by switching imaging modalities or acquisition protocols. "modality shifts"
  • Open-set generalization: The ability to handle classes or concepts not seen during training. "Towards open-set generalization"
  • Open-vocabulary segmentation: Segmenting objects described by arbitrary text beyond a fixed label set. "open-vocabulary segmentation from free-form textual prompts"
  • Out-of-distribution (OOD): Data whose distribution differs from the training set, used to test robustness. "out-of-distribution (OOD) images"
  • Polynomial decay: A learning rate schedule that decreases following a polynomial function over training. "polynomial decay"
  • Positron Emission Tomography (PET): A nuclear imaging modality measuring metabolic activity using radiotracers. "Positron Emission Tomography (PET)"
  • Prompt decoder: A module that transforms text embeddings into guidance features for the vision decoder. "A transformer-based prompt decoder"
  • Radiology report: A clinician-authored narrative describing imaging findings and interpretations. "radiology reports"
  • ReXGroundingCT: A benchmark linking free-text report findings to 3D segmentations in chest CT. "ReXGroundingCT"
  • Stereotactic body radiotherapy (SBRT): A precise, high-dose radiotherapy technique delivered in few fractions. "stereotactic body radiotherapy (SBRT)"
  • Text-promptable segmentation: Producing segmentation masks directly conditioned on textual prompts. "text-promptable segmentation"
  • TotalSegmentator: A widely used tool for comprehensive anatomical autosegmentation in CT. "TotalSegmentator"
  • Transformer decoder: The component that processes queries attending to encoder memory to produce outputs. "transformer decoder"
  • UNet: An encoder–decoder CNN with skip connections, widely used in medical image segmentation. "UNet-style backbone"
  • Vision–language fusion: Combining textual and visual features so language can guide image understanding. "multi-stage vision–language fusion"
  • Zero-shot: Evaluating on tasks or classes without task-specific training examples. "Zero-shot segmentation performance"
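
The Dice coefficient and HIT_{5%} entries above can be made concrete with a short sketch. This is illustrative code, not the paper's implementation; the function names and the toy masks are our own, and only the metric definitions follow the glossary.

```python
# Minimal sketch (not the paper's code): Dice overlap between binary masks
# and the HIT_{5%} hit-rate, i.e. the fraction of samples with Dice >= 5%.
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice coefficient between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

def hit_rate(preds, gts, threshold: float = 0.05) -> float:
    """Fraction of (pred, gt) pairs whose Dice is at least `threshold`."""
    scores = [dice(p, g) for p, g in zip(preds, gts)]
    return float(np.mean([s >= threshold for s in scores]))

# Toy 3D volumes: perfect overlap vs. no overlap.
a = np.zeros((4, 4, 4), dtype=np.uint8)
a[1:3, 1:3, 1:3] = 1
b = a.copy()            # identical mask -> Dice ~= 1.0
c = np.zeros_like(a)    # empty mask     -> Dice == 0.0

print(round(dice(a, b), 3))       # 1.0
print(round(dice(a, c), 3))       # 0.0
print(hit_rate([a, a], [b, c]))   # 0.5
```

With a 5% threshold, the second pair (Dice 0.0) misses while the first (Dice 1.0) hits, giving HIT_{5%} = 0.5 over the two samples.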

Open Problems

We found no open problems mentioned in this paper.
