
OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding (2504.14692v1)

Published 20 Apr 2025 in cs.CL

Abstract: The practical deployment of medical vision-language models (Med-VLMs) necessitates seamless integration of textual data with diverse visual modalities, including 2D/3D images and videos, yet existing models typically employ separate encoders for different modalities. To address this limitation, we present OmniV-Med, a unified framework for multimodal medical understanding. Our technical contributions are threefold: First, we construct OmniV-Med-Instruct, a comprehensive multimodal medical dataset containing 252K instructional samples spanning 14 medical image modalities and 11 clinical tasks. Second, we devise a rotary position-adaptive encoder that processes multi-resolution 2D/3D images and videos within a unified architecture, diverging from conventional modality-specific encoders. Third, we introduce a medical-aware token pruning mechanism that exploits spatial-temporal redundancy in volumetric data (e.g., consecutive CT slices) and medical videos, effectively reducing 60% of visual tokens without performance degradation. Empirical evaluations demonstrate that OmniV-Med-7B achieves state-of-the-art performance on 7 benchmarks spanning 2D/3D medical imaging and video understanding tasks. Notably, our lightweight variant (OmniV-Med-1.5B) attains comparable performance while requiring only 8 RTX3090 GPUs for training and supporting efficient long-video inference. Data, code and model will be released.

Summary

OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding

The paper "OmniV-Med: Scaling Medical Vision-LLM for Universal Visual Understanding" presents a novel framework aimed at unifying the processing of diverse medical visual modalities with text inputs through the development of a scalable vision-LLM (VLM) named OmniV-Med. This paper addresses a critical limitation prevalent in existing medical VLMs, which typically employ separate encoders for different modalities such as 2D and 3D images and video data. The researchers introduce a unified architecture capable of handling multiple resolutions and modalities, thereby enhancing multimodal medical understanding within clinical scenarios.

Key Contributions

This research is articulated through three primary technical contributions:

  1. OmniV-Med-Instruct Dataset: The authors curate an extensive multimodal medical dataset comprising 252K instructional samples. These samples span 14 distinct medical image modalities and cover 11 clinical tasks, providing a robust foundation for training comprehensive vision-language models.
  2. Rotary Position-Adaptive Encoder: A core innovation of the paper is a modality-agnostic encoder that diverges from the typical modality-specific architectures. It uses rotary position-adaptive encoding to process 2D images, 3D volumes, and videos, handling spatial and temporal structure uniformly without a separate encoder per modality (a sketch of the idea follows this list).
  3. Medical-Aware Token Pruning Mechanism: The authors introduce a token pruning mechanism that eliminates the spatial-temporal redundancy inherent in consecutive CT slices and medical videos. Notably, it reduces visual tokens by 60% without compromising performance, enabling efficient processing (see the second sketch below).
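
The summary does not spell out the encoder's mechanics, but the core idea, extending rotary position embeddings so one encoder can index 2D patches, 3D slices, and video frames alike, can be sketched. In the illustrative Python below, every visual token carries a (time/depth, height, width) coordinate and the channel dimension is split into three groups, each rotated by one axis. The function names, the equal three-way channel split, and the base frequency of 10000 are assumptions for illustration, not details taken from the paper.

```python
import torch

def axis_rope(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    # Standard 1D rotary embedding over the last dim of x for one axis.
    # x: (n_tokens, d_axis) with d_axis even; pos: (n_tokens,) integer coordinates.
    half = x.shape[-1] // 2
    freqs = 1.0 / (10000.0 ** (torch.arange(half, dtype=torch.float32) / half))
    angles = pos.float()[:, None] * freqs[None, :]   # (n_tokens, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rotary_position_adaptive(x: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    # x: (n_tokens, d) token features; coords: (n_tokens, 3) holding (t, h, w).
    # A 2D image sets t = 0 everywhere; a CT volume or video varies t per slice/frame.
    d = x.shape[-1]
    assert d % 6 == 0, "d must split into three even-sized per-axis groups"
    groups = x.split(d // 3, dim=-1)
    return torch.cat(
        [axis_rope(g, coords[:, axis]) for axis, g in enumerate(groups)], dim=-1
    )

# Example: a 4x4 patch grid from a 2D image (t = 0), then 3 slices of the same grid.
coords_2d = torch.tensor([(0, i, j) for i in range(4) for j in range(4)])
tokens_2d = rotary_position_adaptive(torch.randn(16, 96), coords_2d)

coords_3d = torch.tensor([(t, i, j) for t in range(3) for i in range(4) for j in range(4)])
tokens_3d = rotary_position_adaptive(torch.randn(48, 96), coords_3d)
```

Because position information comes from explicit per-token coordinates rather than a fixed learned grid, the same encoder can accept arbitrary resolutions, slice counts, and frame counts, which is the property the paper attributes to its unified design.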
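Likewise, the paper's "medical-aware" pruning criterion is not detailed in this summary; the sketch below shows one plausible reading, assuming redundancy is measured as cosine similarity between a token and the token at the same spatial position in the preceding slice or frame. The function name, the 0.95 threshold, and the keep-the-first-frame rule are hypothetical choices for illustration.

```python
import torch
import torch.nn.functional as F

def prune_redundant_tokens(tokens: torch.Tensor, threshold: float = 0.95):
    # tokens: (n_frames, n_patches, d) visual tokens from consecutive CT slices
    # or video frames. Returns the kept tokens and a boolean keep mask.
    keep = torch.ones(tokens.shape[:2], dtype=torch.bool)
    # Compare each token with the one at the same patch in the previous frame;
    # drop it when the two are nearly identical. The first frame is always kept.
    sim = F.cosine_similarity(tokens[1:], tokens[:-1], dim=-1)  # (n_frames-1, n_patches)
    keep[1:] = sim < threshold
    return tokens[keep], keep

# Example: 8 nearly static frames of 16 patches; most later tokens get pruned.
frames = torch.randn(1, 16, 64).repeat(8, 1, 1) + 0.01 * torch.randn(8, 16, 64)
kept, mask = prune_redundant_tokens(frames)
print(f"kept {mask.sum().item()} of {mask.numel()} tokens")
```

On near-duplicate slices, gating of this kind removes the bulk of the tokens, which is consistent with the reported 60% token reduction without accuracy loss and with the efficient long-video inference of the lightweight variant.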

Empirical Evaluation

The reported evaluations demonstrate the efficacy of OmniV-Med. The flagship model, OmniV-Med-7B, achieves state-of-the-art results across seven benchmarks spanning 2D/3D medical imaging and video-understanding tasks, while the lighter variant, OmniV-Med-1.5B, attains comparable performance at far lower computational cost, training on only 8 RTX3090 GPUs and supporting efficient long-video inference.

Implications and Future Directions

The implications of the OmniV-Med framework are substantial, with the potential to catalyze advancements in AI-assisted clinical practices such as surgery and treatment planning through versatile and scalable VLMs. The ability to unify diverse clinical modalities within a single model architecture signifies a step towards holistic AI solutions in medicine.

The scalability and efficiency evidenced by OmniV-Med suggest promising avenues for further research and development. Future endeavors could explore optimizing the model further for real-time applications and expanding its capabilities in additional clinical contexts. Furthermore, the release of data, code, and models offers an invaluable resource for the research community, fostering collaboration and accelerated innovation in medical AI.

Conclusion

OmniV-Med represents a notable advance in medical vision-language models, demonstrating that a unified architecture can overcome the fragmentation of current modality-specific solutions. The combination of an adaptive rotary position encoder and medical-aware token pruning sets a precedent for future multimodal AI work in healthcare. As the authors suggest, practical deployment and the continued evolution of such models could significantly enhance AI-driven medical diagnostics and interventions, broadening the scope and impact of computational healthcare.
