
RelCAT Model for Clinical Relation Extraction

Updated 24 December 2025
  • RelCAT is a framework that processes clinical text by integrating MedCAT entity recognition with a transformer-based relation extraction pipeline.
  • It leverages advanced tokenization, configurable context windows, and ontology-driven negative sampling to achieve robust performance across diverse clinical datasets.
  • The toolkit provides comprehensive annotation protocols, reproducible code, and rigorous evaluations, enabling scalable and accurate extraction of entity-to-entity relations in clinical narratives.

The RelCAT (Relation Concept Annotation Toolkit) model is an interactive annotation, inference, and training framework for classifying entity-to-entity relations in clinical narratives, specifically those embedded in unstructured Electronic Health Records (EHRs). Originating as a major extension to the CogStack MedCAT system, RelCAT addresses the complexity of clinical information spread across textual data, where relations among drugs, findings, and procedures are often diffuse and context-dependent. RelCAT provides flexible annotation, state-of-the-art model architectures, robust evaluation metrics, and reproducible code for advancing clinical relation extraction research (Agarwal et al., 27 Jan 2025).

1. System Architecture and Workflow

RelCAT is structured as a two-stage pipeline augmenting the established MedCAT entity recognition and linking (NER+L) system:

  • Stage I (MedCAT): Raw clinical text is tokenized, medical entities are recognized and mapped to concept unique identifiers (CUIs, SCTIDs) using databases like SNOMED-CT/UMLS, and context meta-features such as negation and temporality are detected.
  • Stage II (RelCAT): The system accepts pre-extracted entities with character spans and CUIs, generates all candidate entity pairs within a configurable distance, extracts context windows around each candidate pair, and applies transformer-based models (BERT or Llama) for relation classification. The output is a discrete set of relation labels for each entity pair.
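The Stage II pairing step can be sketched as follows. The helper names (`candidate_pairs`, `context_window`), entity field names, and threshold values are illustrative assumptions, not the toolkit's actual API:

```python
# Sketch of Stage II candidate-pair generation: pair up pre-extracted
# entities within a configurable character distance, then pull a context
# window around each pair. Helper names and thresholds are hypothetical.

def candidate_pairs(entities, max_char_dist=100):
    """Pair entities whose character spans lie within max_char_dist."""
    pairs = []
    ents = sorted(entities, key=lambda e: e["start"])
    for i, e1 in enumerate(ents):
        for e2 in ents[i + 1:]:
            if e2["start"] - e1["end"] > max_char_dist:
                break  # entities are sorted, so later ones are even farther
            pairs.append((e1, e2))
    return pairs

def context_window(text, e1, e2, window=50):
    """Extract text spanning both entities plus `window` chars each side."""
    lo = max(0, min(e1["start"], e2["start"]) - window)
    hi = min(len(text), max(e1["end"], e2["end"]) + window)
    return text[lo:hi]

text = "Patient started aspirin 75mg daily for secondary prevention."
entities = [
    {"cui": "C0004057", "start": 16, "end": 23},  # aspirin
    {"cui": "C0439422", "start": 24, "end": 28},  # 75mg (dosage)
]
pairs = candidate_pairs(entities, max_char_dist=100)
```

Each resulting pair, with its context window, is what the transformer classifier in Stage II consumes.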

Annotation is carried out through an integrated MedCATTrainer module, which allows annotators to:

  • Mark entities (with CUI and TUI filtering to enforce ontology constraints).
  • Specify or correct relations between entity pairs, including relation type, valid distance, and context span.
  • Benefit from automatic generation of negative (non-relation) examples using TUI-based ontology priors.

2. Input Encoding and Preprocessing

Input text undergoes transformer-compatible tokenization, using either WordPiece (BERT) or Byte-Pair Encoding (Llama 3). Special markers ([s1], [e2]) may be inserted to demarcate entities explicitly in the token sequence $\{t_1, \ldots, t_n\}$. Context is limited to $\pm K$ tokens around entity boundaries to control the inference window size.

Hidden states $H \in \mathbb{R}^{n \times d}$ are extracted from the transformer backbone. For multi-token entities, representations are constructed via max-pooling over the relevant token vectors. Optionally, the global sequence embedding $h_{\text{cls}}$ may also be incorporated. MedCAT's linked concept embedding underpins negative sample generation and CUI-filtered annotation.
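The max-pooling step above reduces to an element-wise maximum over an entity's token vectors; the toy matrix below stands in for real transformer output:

```python
# Minimal illustration of max-pooling a multi-token entity representation
# from transformer hidden states. H holds toy numbers; in RelCAT it would
# come from the BERT or Llama backbone.

def max_pool(hidden_states, token_indices):
    """Element-wise max over the hidden vectors of an entity's tokens."""
    vecs = [hidden_states[i] for i in token_indices]
    return [max(col) for col in zip(*vecs)]

# Toy hidden states H: n = 4 tokens, d = 3 dimensions.
H = [
    [0.1, 0.9, -0.2],
    [0.4, 0.2, 0.7],   # entity token 1
    [0.3, 0.5, 0.1],   # entity token 2
    [0.0, 0.0, 0.0],
]
h_e1 = max_pool(H, [1, 2])  # → [0.4, 0.5, 0.7]
```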

3. Model Structure and Mathematical Framework

Both BERT and Llama variants implement a unified architecture:

  • Encoding: $X = [x_1, \ldots, x_n]$, where $x_i \in \mathbb{R}^d$.
  • Transformer forward pass: $H = \text{Transformer}(X)$.
  • Entity representation: for entities occupying index sets $E^1, E^2$:
    • $h_{e^1} = \max_{i \in E^1} H_i$
    • $h_{e^2} = \max_{i \in E^2} H_i$
    • Concatenated vector $v = [h_{e^1}; h_{e^2}] \in \mathbb{R}^{2d}$ (or $[h_{e^1}; h_{e^2}; h_{\text{cls}}] \in \mathbb{R}^{3d}$).
  • Classification: $z = W_2 \cdot \text{ReLU}(W_1 v + b_1) + b_2$, with $z \in \mathbb{R}^C$.
  • Softmax and loss:
    • $p_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$ for classes $j = 1, \ldots, C$.
    • Cross-entropy: $L = -\sum_j y_j \log p_j$; the weighted variant $L_{\text{weighted}} = -\sum_j \alpha_j y_j \log p_j$ addresses class imbalance.

No custom attention mechanisms are employed; standard transformer attention is retained, with a two-layer MLP head for classification.
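The head and loss above can be traced numerically. All weights and dimensions below are toy values chosen for readability, not trained parameters:

```python
import math

# Numeric sketch of the two-layer MLP head, softmax, and weighted
# cross-entropy described above, with shapes shrunk for readability.

def relu(x):
    return [max(0.0, v) for v in x]

def linear(W, x, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def weighted_ce(p, y, alpha):
    # y is a one-hot label vector, alpha the per-class weights
    return -sum(a * yi * math.log(pj) for a, yi, pj in zip(alpha, y, p))

# v = [h_e1; h_e2], here 2d = 4; hidden size 3; C = 2 classes.
v = [0.4, 0.5, 0.7, 0.1]
W1, b1 = [[0.2] * 4, [0.1] * 4, [-0.3] * 4], [0.0, 0.0, 0.0]
W2, b2 = [[1.0, -1.0, 0.5], [0.3, 0.2, -0.1]], [0.0, 0.0]

z = linear(W2, relu(linear(W1, v, b1)), b2)   # logits, one per class
p = softmax(z)                                 # class probabilities
loss = weighted_ce(p, y=[1, 0], alpha=[2.0, 1.0])  # upweight minority class
```

The `alpha` weights correspond to the $\alpha_j$ factors in the weighted loss, the mechanism the paper uses against class imbalance.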

4. Training Regimen and Dataset Details

RelCAT is evaluated on both open and proprietary datasets:

  • n2c2 2018: 505 discharge summaries, 8 drug-relation classes, ~72k relations.
  • NHS Spatial: 119 radiology/pathology reports, 613 spatial relations, ~70 negatives.
  • NHS Physiotherapy-Mobility: 486 physiotherapy notes, 278 single-instance relations, ~70 negatives.

Preprocessing pipelines apply MedCAT NER+L using SNOMED CT for CUI linkage. Relation pairs are generated within a character distance threshold; negative sampling is ontology-driven using TUI filters.
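One plausible reading of TUI-driven negative sampling is sketched below: candidate pairs whose semantic-type combination could plausibly hold a relation, but carry no gold label, are kept as non-relation examples, while type-incompatible pairs are discarded. The TUI codes, helper names, and labels here are illustrative assumptions:

```python
# Hedged sketch of ontology-driven negative sampling via TUI filters.
# ALLOWED_TUI_PAIRS, the "no_relation" label, and the data layout are
# illustrative, not the toolkit's actual configuration.

ALLOWED_TUI_PAIRS = {("T121", "T047")}  # e.g. drug–disorder pairs only

def label_candidates(pairs):
    out = []
    for e1, e2, gold in pairs:
        if gold is not None:
            out.append((e1, e2, gold))            # annotated positive
        elif (e1["tui"], e2["tui"]) in ALLOWED_TUI_PAIRS:
            out.append((e1, e2, "no_relation"))   # plausible pair, no link
        # otherwise discard: this type pair could never hold a relation
    return out

pairs = [
    ({"tui": "T121"}, {"tui": "T047"}, "treats"),
    ({"tui": "T121"}, {"tui": "T047"}, None),     # kept as a negative
    ({"tui": "T121"}, {"tui": "T121"}, None),     # discarded
]
labeled = label_candidates(pairs)
```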

Optimization settings:

  • AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$).
  • Learning rates: $2 \times 10^{-5}$ (BERT), $1 \times 10^{-5}$ (Llama).
  • Batch size: 16–32.
  • Epochs: 3–5.
  • Warmup: 10% of training steps.
  • Class weighting and stratified batching mitigate imbalances.
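The warmup setting can be illustrated with a schedule function. The reported detail is the 10% warmup fraction; the linear ramp-and-decay shape below is an assumed (though common) convention, not stated in the source:

```python
# Linear warmup followed by linear decay, assuming the conventional
# schedule shape; only the 10% warmup fraction comes from the paper.

def lr_at(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Learning rate at a given optimizer step."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return base_lr * step / max(1, warmup)          # ramp up
    return base_lr * (total_steps - step) / max(1, total_steps - warmup)
```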

Annotation proceeds by launching MedCATTrainer with CUI/TUI filters, marking entities and subsequently relations, and exporting the annotated corpus for model training.

5. Evaluation Metrics and Comparative Analysis

Performance is measured per class and averaged macro/micro:

  • Precision: $\text{Precision}_j = \frac{TP_j}{TP_j + FP_j}$
  • Recall: $\text{Recall}_j = \frac{TP_j}{TP_j + FN_j}$
  • F1: $F1_j = 2 \cdot \frac{\text{Precision}_j \cdot \text{Recall}_j}{\text{Precision}_j + \text{Recall}_j}$
  • Macro/micro averages and overall accuracy are computed from these per-class counts.
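These metrics follow directly from per-class counts. The function below is a generic sketch of macro/micro F1, not the toolkit's evaluation code:

```python
from collections import Counter

# Generic per-class precision/recall/F1 with macro and micro averaging,
# computed from gold and predicted relation labels.

def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_micro_f1(golds, preds, classes):
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, y in zip(golds, preds):
        if g == y:
            tp[g] += 1
        else:
            fp[y] += 1   # predicted class gets a false positive
            fn[g] += 1   # gold class gets a false negative
    per_class = {c: prf(tp[c], fp[c], fn[c])[2] for c in classes}
    macro = sum(per_class.values()) / len(classes)
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))[2]
    return per_class, macro, micro

golds = ["treats", "treats", "causes", "no_rel"]
preds = ["treats", "causes", "causes", "no_rel"]
per_class, macro, micro = macro_micro_f1(
    golds, preds, ["treats", "causes", "no_rel"])
```

For single-label classification like this, micro F1 equals overall accuracy, which is why the table below can report both at the same value.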

Performance Summary

| Dataset | Model | F1 Macro | Accuracy | SOTA Reference |
| --- | --- | --- | --- | --- |
| n2c2 (gold-standard) | BERT (unfrozen) | 0.977 | 0.977 | 0.956 / 0.961 |
| NHS Spatial | BERT (unfrozen) | 0.902 | | |
| NHS Spatial | n2c2-pretrained BERT | 0.918 | | |
| NHS Spatial | Llama (frozen) | 0.933 | | |
| NHS Physio-Mobility | BERT (unfrozen) | 0.905 | | |
| NHS Physio-Mobility | n2c2-pretrained BERT | 0.938 | | |
| NHS Physio-Mobility | Llama (frozen) | 0.835 | | |

Minority classes on n2c2 (ADE-Drug, Duration-Drug) achieve F1 = 0.866 and 0.933. Zero/few-shot in-context LLMs (Llama 3.1, Mistral 7B) underperform at F1 = 0.29–0.49, particularly in minority classes.

6. Practical Guidance and Known Limitations

Deployment on a new corpus requires sequential integration with MedCAT, initial relation annotation, strategic negative sampling, transformer fine-tuning, and API-driven inference:

  1. Integrate corpus into CogStack MedCAT; build/load NER+L for target ontology.
  2. Annotate small seed of entity-to-entity relations.
  3. Export labeled data; configure negative sampling.
  4. Fine-tune transformer (BERT/Llama) using toolkit protocols.
  5. Deploy via RelCAT API: text → MedCAT NER+L → RelCAT classifier → output relations as JSON.
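The JSON emitted at the end of step 5 might look like the following. The field names and schema are hypothetical, chosen only to illustrate the text → entities → relations flow, not the toolkit's exact output format:

```python
import json

# Hypothetical shape of the relation-extraction output in step 5.
# CUIs, spans, the label, and every field name are illustrative.

output = {
    "text": "Patient started aspirin 75mg daily.",
    "relations": [
        {
            "ent1": {"cui": "C0004057", "start": 16, "end": 23},  # aspirin
            "ent2": {"cui": "C0439422", "start": 24, "end": 28},  # 75mg
            "label": "Dosage-Drug",
            "confidence": 0.98,
        }
    ],
}
serialized = json.dumps(output)
```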

Key limitations include heuristic, context-sensitive definition of non-relations, Llama overfitting on limited data, and a pipeline architecture that may decouple named entity recognition and relation signals. Future work aims at ontology-driven candidate relation discovery, extension to cross-sentence relations, and end-to-end architectures for integrated NER and relation extraction.

7. Availability and Reproducibility

Source code, annotation pipelines, and documentation are publicly available in the CogStack/MedCAT GitHub repository (https://github.com/CogStack/MedCAT). This enables full reproducibility of reported results, annotation workflows, and model evaluations.

RelCAT represents a high-fidelity, transformer-based approach for clinical relation extraction, exceeding prior state-of-the-art benchmarks and providing a comprehensive open-source toolkit for the community (Agarwal et al., 27 Jan 2025).
