
Hebrew Medical Language Model

Updated 19 December 2025
  • Hebrew Medical Language Model is a specialized Transformer-based NLP system designed to extract clinical information from Hebrew medical records.
  • It utilizes encoder-only architectures and innovative tokenization methods, such as byte-level BPE and expanded WordPiece vocabularies, to handle rich, domain-specific language.
  • The model integrates auxiliary tasks and prompt-based extraction strategies to enable accurate clinical timeline construction and fine-grained disease phenotyping.

A Hebrew medical LLM is a domain-adapted Transformer-based NLP system specifically pretrained or fine-tuned on clinical and biomedical Hebrew corpora for the purposes of information extraction, clinical timeline construction, and interpretation of unstructured or semi-structured medical texts. These models are critical in low-resource language contexts where the lack of massive annotated datasets, combined with significant privacy and domain-specific vocabulary challenges, hinders the direct application of general LLMs or transfer learning approaches standard in English-language medical NLP.

1. Model Architectures and Domain Adaptation

Contemporary Hebrew medical LLMs leverage encoder-only architectures derived from pretrained Hebrew Transformers. Notable base models include HeRo, a RoBERTa-base variant with 12 transformer encoder layers, a hidden size of 768, 12 attention heads, and roughly 110 million parameters (Hazan et al., 2 May 2024), as well as DictaBERT 2.0, which has similar depth but a substantially larger WordPiece vocabulary (128k tokens) to improve token coverage for morphologically rich Hebrew (Hashiloni et al., 12 Dec 2025). Tokenization relies on byte-level BPE (as in SMP-BERT) or WordPiece, followed by further vocabulary extension and normalization for medical terms.
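
For orientation, the encoder dimensions above correspond to a standard RoBERTa-base configuration. The sketch below builds an equivalently sized encoder with the Hugging Face transformers library; the vocabulary size is a placeholder rather than the actual HeRo or DictaBERT tokenizer size, so the exact parameter count differs.

```python
from transformers import RobertaConfig, RobertaModel

# Encoder dimensions as described above (RoBERTa-base scale).
# vocab_size is a placeholder; the true count depends on the tokenizer
# (e.g., DictaBERT 2.0 uses a 128k-token WordPiece vocabulary).
config = RobertaConfig(
    vocab_size=50_000,
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
)

model = RobertaModel(config)
n_params = sum(p.numel() for p in model.parameters())
# ~124M with this placeholder vocabulary; ~110M with a ~30k-token vocabulary.
print(f"{n_params / 1e6:.1f}M parameters")
```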

Vocabulary adaptation to the medical domain is performed via two principal strategies:

  • Simple method: Identify the most frequent new tokens from a large random sample of clinical texts and perform a set-union with the preexisting vocabulary. Embeddings for new tokens are typically initialized as the mean of their constituent sub-word embeddings (a minimal sketch follows this list).
  • AdaLM method: Iterative addition of batches of new tokens until token coverage gain falls below a preset threshold, optimizing compression ratio and token utilization.
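
A minimal sketch of the simple vocabulary-extension strategy, under stated assumptions: a multilingual BERT checkpoint stands in for the Hebrew base model, `clinical_texts` is a placeholder corpus, and the frequency thresholds are invented. The AdaLM stopping criterion and FLOTA segmentation are omitted.

```python
from collections import Counter

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Stand-in checkpoint; the cited papers start from Hebrew encoders such as
# HeRo or DictaBERT 2.0 rather than multilingual BERT.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

clinical_texts = ["..."]  # placeholder: a large random sample of de-identified notes

# 1) Find frequent whitespace-delimited terms missing from the vocabulary.
counts = Counter(w for text in clinical_texts for w in text.split())
new_tokens = [w for w, c in counts.most_common(5000)
              if c >= 50 and w not in tokenizer.get_vocab()]

# 2) Record each new token's current sub-word segmentation *before* extending
#    the vocabulary, so the pieces can seed the new embedding.
old_pieces = {t: tokenizer(t, add_special_tokens=False).input_ids for t in new_tokens}

# 3) Set-union with the preexisting vocabulary and resize the embedding matrix.
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# 4) Initialize each new embedding as the mean of its constituent sub-word embeddings.
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for tok in new_tokens:
        emb[tokenizer.convert_tokens_to_ids(tok)] = emb[old_pieces[tok]].mean(dim=0)
```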

The FLOTA algorithm (“Few Longest Token Approximation”) can be used to maximize the use of newly introduced medical tokens, further improving token efficiency during both pretraining and downstream tasks.
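
As an illustration of the greedy idea behind FLOTA, a simplified sketch follows; the mini-vocabulary and the English example word are invented, and the published algorithm additionally handles WordPiece continuation prefixes and a fixed iteration budget.

```python
def flota_segment(word: str, vocab: set[str], depth: int = 4) -> list[str]:
    """Simplified greedy variant of the FLOTA idea: keep the longest vocabulary
    token contained in the word, then recurse on the left and right remainders."""
    if not word or depth == 0:
        return []
    for length in range(len(word), 0, -1):          # longest substrings first
        for start in range(len(word) - length + 1):
            piece = word[start:start + length]
            if piece in vocab:
                return (flota_segment(word[:start], vocab, depth - 1)
                        + [piece]
                        + flota_segment(word[start + length:], vocab, depth - 1))
    return []  # no vocabulary piece covers any part of the word

# Invented mini-vocabulary, for illustration only.
vocab = {"gastro", "enter", "itis", "colo", "scopy"}
print(flota_segment("gastroenteritis", vocab))  # ['gastro', 'enter', 'itis']
```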

2. Domain-Specific Pretraining and Auxiliary Supervision

Domain-adaptive pretraining augments standard Masked Language Modeling (MLM) with auxiliary tasks designed around the unique structures of clinical narratives. For radiology, Section Matching Prediction (SMP) is introduced as an auxiliary task to encode document-structural coherence. In SMP, paired segments from the “Findings” ($x^{F}$) and “Impression” ($x^{I}$) sections are concatenated as follows:

$$\text{Input: } [\text{CLS}]\, x^{F}\, [\text{SEP}]\, x^{I}\, [\text{EOS}]$$

A binary classification head predicts whether the two sections correspond (Match) or are unrelated (NotMatch), with the loss formulated as:

$$L_\text{SMP} = -\sum_{i} \sum_{n \in \{\text{Match},\,\text{NotMatch}\}} \mathbf{1}[n_i = n] \cdot \log q_m(n \mid x_i^{F}, x_i^{I})$$
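
A minimal sketch of how SMP pairs and the binary matching loss might be wired up with the Hugging Face transformers API, under stated assumptions: a multilingual BERT checkpoint stands in for the Hebrew encoder, the report fields and negative-sampling scheme are invented, the MLM term of the pretraining objective is omitted, and the special-token layout follows BERT's [CLS]/[SEP] rather than the [EOS]-terminated scheme shown above.

```python
import random

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)  # 0 = NotMatch, 1 = Match

def make_smp_pair(reports, i):
    """Build one (Findings, Impression) pair: the positive case uses the report's
    own Impression, the negative case swaps in another report's Impression."""
    findings = reports[i]["findings"]
    if random.random() < 0.5:
        return findings, reports[i]["impression"], 1               # Match
    j = random.choice([k for k in range(len(reports)) if k != i])
    return findings, reports[j]["impression"], 0                   # NotMatch

# Hypothetical toy reports (real corpora hold millions of de-identified reports).
reports = [{"findings": "...", "impression": "..."} for _ in range(4)]

findings, impression, label = make_smp_pair(reports, 0)
enc = tokenizer(findings, impression, truncation=True, return_tensors="pt")
out = model(**enc, labels=torch.tensor([label]))
out.loss.backward()   # cross-entropy over {Match, NotMatch}, as in L_SMP above
```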

Pretraining corpora are constructed from millions of de-identified Hebrew radiology reports, covering multi-center hospital data (e.g., 9,683 CT/MRI reports from 8,093 Crohn’s disease patients (Hazan et al., 2 May 2024, Badash et al., 3 Sep 2025)) or over 5 million hospital records for broader clinical models (Hashiloni et al., 12 Dec 2025). For both approaches, extensive, privacy-aware de-identification via NER, regex-mask, and realistic surrogate replacement is implemented to mitigate re-identification risk while maintaining linguistic fidelity.
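
To make the regex-mask stage concrete, a hedged sketch with invented patterns and placeholder tags follows; production pipelines combine NER-based detection, far richer pattern sets, and realistic surrogate replacement rather than fixed tags.

```python
import re

# Illustrative patterns only; real de-identification uses broader coverage.
PATTERNS = [
    (re.compile(r"\b\d{9}\b"), "[ID]"),                            # 9-digit national ID
    (re.compile(r"\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b"), "[DATE]"),  # numeric dates
    (re.compile(r"\b0\d{1,2}-?\d{7}\b"), "[PHONE]"),               # local phone numbers
    (re.compile(r"\S+@\S+\.\S+"), "[EMAIL]"),
]

def regex_mask(text: str) -> str:
    """Replace obviously identifying spans with placeholder tags."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text

print(regex_mask("Patient 123456789, seen 03/07/2023, contact 052-1234567"))
# -> "Patient [ID], seen [DATE], contact [PHONE]"
```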

3. Prompt-Based Structured Extraction Paradigms

Prompt-based inference supplants conventional per-label classifiers with unified verbalizer templates mapping phenotypic, event, or pathology labels to short natural-language prompts. During inference, the “Impression” segment is replaced with a label-specific prompt, and the same pretrained model weights are leveraged for prediction, facilitating zero-shot and few-shot adaptability even in extreme low-resource scenarios:

  • SMP-BERT Input:

$$[\text{CLS}]\, x^{F}\, [\text{SEP}]\, p^{j}\, [\text{EOS}]$$

where $p^{j}$ is the prompt for label $y^{j}$.

  • Prediction rule:

$$\hat{y} = \arg\max \{\, q_m(\text{Match} \mid x^{F}, p^{+}),\; q_m(\text{Match} \mid x^{F}, p^{-}) \,\}$$
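
Under the same assumptions as the SMP sketch above (a binary head whose index 1 corresponds to Match), the prediction rule can be sketched as follows; the prompt strings in the usage comment are invented English stand-ins for the Hebrew verbalizers.

```python
import torch

@torch.no_grad()
def predict_label(model, tokenizer, findings: str,
                  pos_prompt: str, neg_prompt: str) -> bool:
    """Return True if the positive prompt matches the Findings section better
    than the negative prompt, i.e. argmax over q_m(Match | x^F, p^{+/-})."""
    enc = tokenizer([findings, findings], [pos_prompt, neg_prompt],
                    truncation=True, padding=True, return_tensors="pt")
    probs = model(**enc).logits.softmax(dim=-1)[:, 1]   # P(Match) for p+ and p-
    return bool(probs[0] > probs[1])

# Hypothetical usage (real prompts are Hebrew verbalizers):
# has_thickening = predict_label(model, tokenizer, findings_text,
#                                "Wall thickening is present in the ileum.",
#                                "No wall thickening is present in the ileum.")
```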

Hierarchical prompting extends this structure: prompts are organized into logical trees (e.g., scan-level → organ-level → finding-level), and negative parent nodes prune entire subtrees, which achieves an order-of-magnitude reduction in inference runtime without sacrificing accuracy (Badash et al., 3 Sep 2025).
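
The subtree pruning can be sketched as a depth-first traversal over a prompt tree; the node names, prompt placeholders, and tree layout below are hypothetical, and `predict_label` refers to the helper in the previous sketch.

```python
# Hypothetical hierarchy: scan-level -> organ-level -> finding-level.
# Each node carries a (positive prompt, negative prompt) pair; children are
# only queried when the parent is predicted positive, pruning whole subtrees.
TREE = {
    "any_finding": {
        "prompts": ("...", "..."),
        "children": {
            "ileum": {
                "prompts": ("...", "..."),
                "children": {
                    "wall_thickening": {"prompts": ("...", "..."), "children": {}},
                    "stenosis":        {"prompts": ("...", "..."), "children": {}},
                },
            },
        },
    },
}

def hierarchical_predict(model, tokenizer, findings, tree, results=None):
    """Depth-first traversal: negative parents prune their entire subtree."""
    results = {} if results is None else results
    for name, node in tree.items():
        pos, neg = node["prompts"]
        results[name] = predict_label(model, tokenizer, findings, pos, neg)
        if results[name]:                       # only descend on positive nodes
            hierarchical_predict(model, tokenizer, findings, node["children"], results)
        else:                                   # prune: mark descendants negative
            for child in _descendants(node):
                results[child] = False
    return results

def _descendants(node):
    for child_name, child in node["children"].items():
        yield child_name
        yield from _descendants(child)
```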

4. Dataset Construction, Annotation, and Class Imbalance Handling

Annotated datasets for Hebrew medical NLP typically feature multi-label, multi-organ, and multi-finding schemas (e.g., 6 organs × 6 findings = 36 phenotypes in Crohn’s radiology (Hazan et al., 2 May 2024), or 90 structured labels per case (Badash et al., 3 Sep 2025)), extracted from curated subsets of larger clinical corpora. Prevalence per label is highly imbalanced, ranging from 3% to 44%, necessitating robust positive/negative instance construction during fine-tuning and evaluation.

Annotations are produced via dual reads by board-certified clinicians, and multilabel-stratified splitting preserves natural label frequencies across the train, validation, and test sets.
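
One way to realize multilabel-stratified splits is iterative stratification, for example via the scikit-multilearn package; the synthetic label matrix below is an assumption used only to show the shapes and the prevalence range involved.

```python
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

# Hypothetical label matrix: one row per report, one binary column per phenotype
# (e.g., 36 organ x finding combinations), with heavily imbalanced prevalences.
rng = np.random.default_rng(0)
n_reports, n_labels = 1000, 36
prevalence = rng.uniform(0.03, 0.44, size=n_labels)        # 3%-44% as reported
y = (rng.random((n_reports, n_labels)) < prevalence).astype(int)
X = np.arange(n_reports).reshape(-1, 1)                     # indices into the corpus

# Iterative multilabel stratification keeps per-label frequencies similar
# across splits (one option; the papers do not prescribe a specific tool).
X_train, y_train, X_test, y_test = iterative_train_test_split(X, y, test_size=0.2)
print(y_train.mean(axis=0)[:5], y_test.mean(axis=0)[:5])    # comparable prevalences
```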

5. Evaluation Metrics and Quantitative Results

Comprehensive evaluation uses label-level metrics suitable for densely and sparsely represented findings. The metrics include F1 score, AUC, accuracy, PPV, NPV, recall, and Cohen's κ. Results from recent Hebrew radiology and clinical models are summarized below.

| Model | F1 (mean or median) | AUC | Cohen's κ |
|---|---|---|---|
| SMP-BERT + tuning | 0.84 [0.76, 0.94] | 0.99 | |
| SMP-BERT zero-shot | 0.58 [0.55, 0.62] | 0.88 | |
| Standard fine-tuning | 0.34 [0.22, 0.85] | 0.94 | |
| HSMP-BERT | 0.83 ± 0.08 | | 0.65 ± 0.17 |
| DictaBERT 2.0 (baseline) | | | |
| HeMed (vocab adapt) | ΔF1 +1.5 to +3.2 | | |

HSMP-BERT demonstrates a statistically significant improvement over both zero-shot and standard fine-tuned baselines (ΔF₁(HSMP vs. SFT) = +0.53, paired t-test p < 10⁻⁸) (Badash et al., 3 Sep 2025). The hierarchical prompt-based approach also reduces BERT inference calls by ~5× (2,180,436 → 395,750) and wall-clock runtime by 5.1×.

On clinical timeline extraction, vocabulary adaptation achieves up to +3.2 ΔF1 on out-of-domain event relation tasks, with de-identification entailing minimal (≤0.5) F1 loss (Hashiloni et al., 12 Dec 2025).
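
For concreteness, a scikit-learn sketch of the label-level evaluation described above; the array shapes and random scores are invented.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score, roc_auc_score

def per_label_metrics(y_true, y_prob, threshold=0.5):
    """Compute F1, AUC, and Cohen's kappa independently for each label column,
    mirroring the label-level evaluation reported above."""
    y_pred = (y_prob >= threshold).astype(int)
    rows = []
    for j in range(y_true.shape[1]):
        rows.append({
            "f1": f1_score(y_true[:, j], y_pred[:, j], zero_division=0),
            "auc": roc_auc_score(y_true[:, j], y_prob[:, j]),
            "kappa": cohen_kappa_score(y_true[:, j], y_pred[:, j]),
        })
    return rows

# Hypothetical usage with random scores, just to show the shapes involved.
rng = np.random.default_rng(0)
y_true = (rng.random((200, 36)) < 0.2).astype(int)
y_prob = rng.random((200, 36))
print(per_label_metrics(y_true, y_prob)[0])
```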

6. Downstream Applications and Structural Insights

Hebrew medical LLMs have been applied to:

  • Fine-grained phenotyping: Automated detection of multi-organ, multi-finding disease signatures (e.g., ileal wall thickening, stenosis, pre-stenotic dilatation in IBD) for epidemiological research (Badash et al., 3 Sep 2025).
  • Population-level analysis: Large-scale inference across entire health system corpora to quantify age/sex differences, phenotype co-occurrences (Pearson r > 0.6 for key findings), and domain trends (e.g., pediatric vs. adult inflammatory patterns).
  • Clinical timeline extraction: Construction of directed acyclic event graphs from free-text medical records, supporting automated “patient journey” workflows. Temporal relation classification follows the MATRES schema, and the resulting graphs are assembled from pairwise event predictions and topological sorts (Hashiloni et al., 12 Dec 2025); see the sketch after this list.
  • Low-resource adaptability: The SMP pretraining + prompt-tuning paradigm supports straightforward extension to new medical domains and low-resource languages, with minimal labeled data requirements and flexibility for domain-specific prompt ontologies.
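
A minimal sketch of assembling a timeline from pairwise temporal predictions using a topological sort (Python's standard-library graphlib); the event names are invented, and collapsing MATRES relations to BEFORE edges is a simplifying assumption.

```python
from graphlib import TopologicalSorter

# Hypothetical pairwise predictions (event_a, event_b, relation); in practice
# these come from a MATRES-style temporal relation classifier over event pairs.
pairwise = [
    ("admission", "ct_scan", "BEFORE"),
    ("ct_scan", "surgery", "BEFORE"),
    ("admission", "antibiotics", "BEFORE"),
    ("antibiotics", "discharge", "BEFORE"),
    ("surgery", "discharge", "BEFORE"),
]

# Build a DAG: each node maps to the set of events that must precede it.
predecessors: dict[str, set[str]] = {}
for a, b, rel in pairwise:
    predecessors.setdefault(a, set())
    if rel == "BEFORE":
        predecessors.setdefault(b, set()).add(a)

# Topological sort yields one consistent "patient journey" ordering.
timeline = list(TopologicalSorter(predecessors).static_order())
print(timeline)
# e.g. ['admission', 'ct_scan', 'antibiotics', 'surgery', 'discharge']
```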

7. Limitations, Privacy, and Future Directions

Several limitations are noted in current Hebrew medical LLMs. Event extraction systems focus on pairwise event relations within narrow context windows, with no global document-level inference. Vocabulary adaptation and segmentation yield incremental, though measurable, gains in token efficiency, but the event detection heuristics may miss domain-specific phenomena (Hashiloni et al., 12 Dec 2025).

Privacy protocols integrate industry-standard and custom de-identification pipelines with a demonstrated negligible performance drop (<0.5 F1). Model release is conditioned on IRB approval, restricting research use to formal medical institutions.

Future work includes automation of hierarchical prompting tree construction, generalization to additional medical subdomains outside radiology and clinical timelines, extension to semi-supervised and few-shot learning, and integration of numeric clinical data into structured timelines. Porting these paradigms to global low-resource languages is identified as an ongoing research trajectory (Hazan et al., 2 May 2024, Badash et al., 3 Sep 2025, Hashiloni et al., 12 Dec 2025).
