
CODER: Knowledge infused cross-lingual medical term embedding for term normalization (2011.02947v3)

Published 5 Nov 2020 in cs.CL

Abstract: This paper proposes CODER: contrastive learning on knowledge graphs for cross-lingual medical term representation. CODER is designed for medical term normalization by providing close vector representations for different terms that represent the same or similar medical concepts with cross-lingual support. We train CODER via contrastive learning on a medical knowledge graph (KG) named the Unified Medical Language System, where similarities are calculated utilizing both terms and relation triplets from the KG. Training with relations injects medical knowledge into embeddings and aims to provide potentially better machine learning features. We evaluate CODER in zero-shot term normalization, semantic similarity, and relation classification benchmarks, which show that CODER outperforms various state-of-the-art biomedical word embeddings, concept embeddings, and contextual embeddings. Our code and models are available at https://github.com/GanjinZero/CODER.

Citations (88)

Summary

Analyzing CODER: A Knowledge-Infused Model for Cross-Lingual Medical Term Normalization

The paper "CODER: Knowledge infused cross-lingual medical term embedding for term normalization" introduces a machine learning model, CODER, which aims to improve the normalization of medical terms across different languages. It effectively addresses the challenges inherent in processing diverse terminologies within electronic medical records (EMRs). Traditionally, such normalization tasks are fraught with difficulties related to nonstandard naming conventions, abbreviations, and language variations. The proposed model leverages contrastive learning through knowledge graphs to generate cross-lingual medical term embeddings, enhancing medical semantic representation and facilitating successful normalization.

Technical Approach

CODER generates close vector representations for medical terms that share similar meanings, using a framework based on contrastive learning. The model is trained on the Unified Medical Language System (UMLS), a rich medical knowledge graph containing both terms and relation triplets. Training on the relation triplets injects substantial medical domain knowledge into the embeddings, which helps CODER outperform existing methods.
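To make the training signal concrete, here is a minimal sketch of sampling positive term pairs from a UMLS-style synonym dictionary. The `concept_synonyms` mapping and the sampling limit are hypothetical illustrations, not the paper's actual data pipeline, which trains on the full UMLS.

```python
import itertools
import random

# Hypothetical stand-in for UMLS: terms grouped by concept identifier (CUI).
concept_synonyms = {
    "C0020538": ["hypertension", "high blood pressure", "hypertensive disease"],
    "C0011849": ["diabetes mellitus", "DM", "diabetes"],
}

def sample_positive_pairs(concept_synonyms, max_pairs_per_concept=10):
    """Yield (cui, (term_a, term_b)) pairs where both terms name the same concept."""
    for cui, terms in concept_synonyms.items():
        pairs = list(itertools.combinations(terms, 2))
        random.shuffle(pairs)
        for pair in pairs[:max_pairs_per_concept]:
            yield cui, pair

for cui, (a, b) in sample_positive_pairs(concept_synonyms):
    print(cui, a, "<->", b)
```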

A dual contrastive learning strategy is used to capture both term-term and term-relation-term similarities. By learning from positive term pairs (terms naming the same or similar concepts) and injecting relational knowledge from the UMLS, CODER distinguishes itself from models that focus strictly on terms; a simplified sketch of such a combined objective follows.
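The PyTorch sketch below illustrates one way to combine the two signals: an in-batch InfoNCE-style loss over synonym pairs and a TransE-style loss over relation triplets. The paper's actual objective differs in its similarity computation and loss function; the temperature, margin, and weighting here are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def term_term_loss(anchor, positive, temperature=0.07):
    """In-batch InfoNCE: row i of `positive` is the synonym of row i of `anchor`."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.T / temperature            # (B, B) similarity matrix
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

def term_relation_term_loss(head, relation, tail, margin=1.0):
    """TransE-style margin loss: head + relation should land near the true tail."""
    pos_dist = (head + relation - tail).norm(dim=-1)
    neg_tail = tail[torch.randperm(tail.size(0))]         # corrupt tails within the batch
    neg_dist = (head + relation - neg_tail).norm(dim=-1)
    return F.relu(margin + pos_dist - neg_dist).mean()

# Combined objective with a hypothetical weighting:
# loss = term_term_loss(a, p) + 0.5 * term_relation_term_loss(h, r, t)
```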

Evaluation and Results

The authors evaluate CODER on several benchmarks, including zero-shot term normalization, semantic similarity measurement, and relation classification. In zero-shot term normalization, CODER surpasses other state-of-the-art biomedical word, concept, and contextual embeddings. It also performs well across multilingual datasets, including the MANTRA GSC corpus, which contains term normalization tasks in several languages; there, CODER outperforms several translation-based approaches, underscoring its ability to leverage the multilingual synonyms available in the UMLS.
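As a rough illustration of zero-shot normalization, the sketch below embeds a query mention and a list of candidate concept terms with a released CODER checkpoint and picks the nearest candidate by cosine similarity. The checkpoint name "GanjinZero/coder_eng" and the [CLS] pooling are assumptions based on the linked repository; the authors' evaluation script may pool and score differently.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "GanjinZero/coder_eng"  # assumed checkpoint name from the CODER repository
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(terms):
    """Encode a list of terms and return L2-normalized [CLS] vectors."""
    batch = tokenizer(terms, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    cls = out.last_hidden_state[:, 0]
    return torch.nn.functional.normalize(cls, dim=-1)

candidates = ["hypertensive disease", "diabetes mellitus", "myocardial infarction"]
query = ["high blood pressure"]

scores = embed(query) @ embed(candidates).T   # cosine similarities, shape (1, 3)
print(candidates[scores.argmax().item()])     # nearest candidate concept term
```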

CODER also excelled on the Medical Conceptual Similarity Measure (MCSM) task, outperforming other embeddings. In the Disease Database Relation Classification (DDBRC) task, CODER achieved the highest accuracy, demonstrating its value both as a source of fixed features and as a trainable embedding.
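For the fixed-feature setting, one simple way to use term embeddings for relation classification is to concatenate the embeddings of the two terms and fit a linear classifier on top, as sketched below. The feature construction, classifier, and synthetic data are assumptions for illustration, not the paper's exact DDBRC setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 768                                   # typical BERT-base embedding size

# Synthetic placeholders standing in for encoded (head, tail) term pairs.
head_vecs = rng.normal(size=(200, dim))
tail_vecs = rng.normal(size=(200, dim))
labels = rng.integers(0, 3, size=200)       # e.g., three relation types

# Fixed-feature setting: concatenate the two term vectors, train a linear classifier.
features = np.concatenate([head_vecs, tail_vecs], axis=1)
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print("train accuracy:", clf.score(features, labels))
```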

Implications and Future Work

CODER improves on competing models by incorporating relational knowledge and supporting cross-lingual term representations. This facilitates EMR analysis on a global scale, helping reconcile terminological discrepancies across institutions and supporting linguistically diverse datasets. Its potential applications extend beyond medical term normalization: the embeddings can serve as features in machine learning models for downstream tasks such as predictive analytics and clinical decision support.

Further research could explore the optimization of relational knowledge encoding within similar frameworks, improve cross-linguistic representation under resource-constrained conditions, or expand CODER’s application within more complex medical ontologies. Moreover, continuous training on updated medical terminologies may enhance the model's adaptability and functional scope in real-world medical data applications.
