RelCAT Model for Clinical Relation Extraction
- RelCAT is a framework that processes clinical text by integrating MedCAT entity recognition with a transformer-based relation extraction pipeline.
- It leverages advanced tokenization, configurable context windows, and ontology-driven negative sampling to achieve robust performance across diverse clinical datasets.
- The toolkit provides comprehensive annotation protocols, reproducible code, and rigorous evaluations, enabling scalable and accurate extraction of entity-to-entity relations in clinical narratives.
RelCAT (Relation Concept Annotation Toolkit) is an interactive annotation, inference, and training framework for classifying entity-to-entity relations in clinical narratives, specifically those embedded in unstructured Electronic Health Records (EHRs). Originating as a major extension to the CogStack MedCAT system, RelCAT addresses the complexity of clinical information spread across free text, where relations among drugs, findings, and procedures are often diffuse and context-dependent. RelCAT provides flexible annotation, state-of-the-art model architectures, robust evaluation metrics, and reproducible code for advancing clinical relation extraction research (Agarwal et al., 27 Jan 2025).
1. System Architecture and Workflow
RelCAT is structured as a two-stage pipeline augmenting the established MedCAT entity recognition and linking (NER+L) system:
- Stage I (MedCAT): Raw clinical text is tokenized, medical entities are recognized and mapped to concept unique identifiers (CUIs, SCTIDs) using databases like SNOMED-CT/UMLS, and context meta-features such as negation and temporality are detected.
- Stage II (RelCAT): The system accepts pre-extracted entities with character spans and CUIs, generates all candidate entity pairs within a configurable distance, extracts context windows around each candidate pair, and applies transformer-based models (BERT or Llama) for relation classification. The output is a discrete set of relation labels for each entity pair.
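Stage II's candidate-pair generation and context-window extraction can be sketched as follows. This is a minimal illustration, not the toolkit's actual API: the `Entity` dataclass, distance threshold, and margin values are all assumptions.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Entity:
    cui: str    # concept unique identifier assigned by MedCAT
    start: int  # character offset where the entity span starts
    end: int    # character offset where the entity span ends

def candidate_pairs(entities, max_char_dist=100):
    """Generate all entity pairs whose spans lie within a configurable distance."""
    ordered = sorted(entities, key=lambda e: e.start)
    return [(e1, e2) for e1, e2 in combinations(ordered, 2)
            if e2.start - e1.end <= max_char_dist]

def context_window(text, e1, e2, margin=50):
    """Extract the text surrounding a candidate pair, padded by `margin` chars."""
    lo = max(0, min(e1.start, e2.start) - margin)
    hi = min(len(text), max(e1.end, e2.end) + margin)
    return text[lo:hi]
```

Each surviving pair, together with its context window, is what the transformer classifier actually scores.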
Annotation is carried out through an integrated MedCATTrainer module, which allows annotators to:
- Mark entities (with CUI and TUI filtering to enforce ontology constraints).
- Specify or correct relations between entity pairs, including relation type, valid distance, and context span.
- Benefit from automatic generation of negative (non-relation) examples using TUI-based ontology priors.
2. Input Encoding and Preprocessing
Input text undergoes transformer-compatible tokenization, using either WordPiece (BERT) or Byte-Pair Encoding (Llama 3). Special boundary markers (e.g., [s1]…[e1] around the first entity and [s2]…[e2] around the second) may be inserted to demarcate entities explicitly in the token sequence. Context is limited to a fixed number of tokens around entity boundaries to control the inference window size.
Hidden states are extracted from the transformer backbone. For multi-token entities, representations are constructed via max-pooling over the relevant token vectors. Optionally, the global sequence-embedding may also be incorporated. MedCAT’s linked concept embedding underpins negative sample generation and CUI-filtered annotation.
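The pooling-and-concatenation step can be illustrated with a small numpy sketch; the hidden states, dimensions, and entity spans here are arbitrary stand-ins for actual transformer outputs.

```python
import numpy as np

def entity_repr(hidden, start, end):
    """Max-pool transformer hidden states over a multi-token span [start, end)."""
    return hidden[start:end].max(axis=0)

rng = np.random.default_rng(0)
H = rng.standard_normal((12, 8))        # 12 tokens, hidden size 8

h_e1 = entity_repr(H, 2, 5)             # first entity spans tokens 2..4
h_e2 = entity_repr(H, 7, 9)             # second entity spans tokens 7..8
z = np.concatenate([H[0], h_e1, h_e2])  # optionally prepend the sequence embedding
```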
3. Model Structure and Mathematical Framework
Both BERT and Llama variants implement a unified architecture:
- Encoding: $x = [t_1, \ldots, t_n]$, where each $t_i \in \mathcal{V}$ (the tokenizer vocabulary).
- Transformer Forward: $H = \mathrm{Transformer}(x)$, $H \in \mathbb{R}^{n \times d}$.
- Entity Representation: For entities occupying token indices $i, \ldots, j$: $h_e = \mathrm{maxpool}(H_{i:j})$.
- Concatenated vector: $z = [h_{e_1}; h_{e_2}]$ (or $z = [h_{\mathrm{CLS}}; h_{e_1}; h_{e_2}]$ when the global sequence embedding is included).
- Classification: $u = \sigma(W_1 z + b_1)$, $\mathrm{logits} = W_2 u + b_2$, with nonlinearity $\sigma$.
- Softmax and Loss:
- $p_c = \exp(\mathrm{logit}_c) / \sum_{c'} \exp(\mathrm{logit}_{c'})$ for class $c$.
- Cross-entropy: $\mathcal{L} = -\sum_c y_c \log p_c$ (weighted: $\mathcal{L} = -\sum_c w_c\, y_c \log p_c$ for class imbalance).
No custom attention mechanisms are employed; standard transformer attention is retained, with a two-layer MLP head for classification.
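The classification head and loss can be sketched in numpy as follows; the choice of tanh as the nonlinearity and all dimensions are assumptions for illustration.

```python
import numpy as np

def mlp_head(z, W1, b1, W2, b2):
    """Two-layer MLP head: logits = W2 * sigma(W1 z + b1) + b2, sigma = tanh here."""
    return W2 @ np.tanh(W1 @ z + b1) + b2

def softmax(logits):
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

def weighted_ce(probs, y_onehot, w):
    """Class-weighted cross-entropy for a single example."""
    return float(-(w * y_onehot * np.log(probs)).sum())

rng = np.random.default_rng(1)
d, h, k = 24, 16, 9                     # input dim, hidden dim, num classes
z = rng.standard_normal(d)
W1, b1 = rng.standard_normal((h, d)), np.zeros(h)
W2, b2 = rng.standard_normal((k, h)), np.zeros(k)

p = softmax(mlp_head(z, W1, b1, W2, b2))
y = np.eye(k)[3]                        # gold label: class 3, one-hot
w = np.ones(k); w[3] = 2.0              # up-weight an imbalanced class
loss = weighted_ce(p, y, w)
```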
4. Training Regimen and Dataset Details
RelCAT is evaluated on both open and proprietary datasets:
- n2c2 2018: 505 discharge summaries, 8 drug-relation classes, k relations.
- NHS Spatial: 119 radiology/pathology reports, 613 spatial relations, plus generated negative samples.
- NHS Physiotherapy-Mobility: 486 physiotherapy notes, 278 single-instance relations, plus generated negative samples.
Preprocessing pipelines apply MedCAT NER+L using SNOMED CT for CUI linkage. Relation pairs are generated within a character distance threshold; negative sampling is ontology-driven using TUI filters.
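The TUI-driven negative sampling can be sketched as below. The allowed-pair schema is purely illustrative (T121 and T047 are UMLS semantic types for pharmacologic substance and disease/syndrome), and the function names are assumptions, not the toolkit's API.

```python
# Pairs whose semantic-type (TUI) combination matches the relation schema remain
# candidate positives; all other pairs become negative (no-relation) samples.
ALLOWED_TUI_PAIRS = {("T121", "T047")}  # illustrative: drug -> disorder

def split_by_tui(pairs, allowed=ALLOWED_TUI_PAIRS):
    """pairs: iterable of ((cui1, tui1), (cui2, tui2)) candidate tuples."""
    candidates, negatives = [], []
    for pair in pairs:
        (_, t1), (_, t2) = pair
        (candidates if (t1, t2) in allowed else negatives).append(pair)
    return candidates, negatives
```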
Optimization settings:
- AdamW optimizer.
- Learning rates: $2\times10^{-5}$ (BERT), $1\times10^{-5}$ (Llama).
- Batch size: 16–32.
- Epochs: 3–5.
- Learning-rate warmup over an initial number of steps.
- Class weighting and stratified batching mitigate imbalances.
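One common way to derive the class weights used above is inverse-frequency weighting; this sketch assumes that scheme (the toolkit's exact weighting formula may differ).

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights, normalised so the mean weight is 1."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}
```

Rare classes thus receive proportionally larger gradients in the weighted cross-entropy loss.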
Annotation proceeds by launching MedCATTrainer with CUI/TUI filters, marking entities and subsequently relations, and exporting the annotated corpus for model training.
5. Evaluation Metrics and Comparative Analysis
Performance is measured per class and reported as macro and micro averages:
- Precision: $P = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$.
- Recall: $R = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$.
- F1: $F_1 = 2PR / (P + R)$.
- Macro/micro averages and overall accuracy are computed from these per-class scores.
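These per-class metrics and their macro average can be computed directly from parallel label lists, as in this self-contained sketch:

```python
from collections import Counter

def per_class_prf(y_true, y_pred):
    """Per-class (precision, recall, F1) from parallel gold/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, but gold was t
            fn[t] += 1   # gold t was missed
    scores = {}
    for c in set(y_true) | set(y_pred):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = per_class_prf(y_true, y_pred)
    return sum(f for _, _, f in scores.values()) / len(scores)
```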
Performance Summary
| Dataset | Model | F1 Macro | Accuracy | SOTA Reference |
|---|---|---|---|---|
| n2c2 (gold-standard) | BERT (unfrozen) | 0.977 | 0.977 | 0.956 / 0.961 |
| NHS Spatial | BERT (unfrozen) | 0.902 | — | — |
| NHS Spatial | n2c2-pretrained BERT | 0.918 | — | — |
| NHS Spatial | Llama (frozen) | 0.933 | — | — |
| NHS Physio-Mobility | BERT (unfrozen) | 0.905 | — | — |
| NHS Physio-Mobility | n2c2-pretrained BERT | 0.938 | — | — |
| NHS Physio-Mobility | Llama (frozen) | 0.835 | — | — |
Minority classes on n2c2 (ADE-Drug, Duration-Drug) achieve F1 = 0.866 and 0.933, respectively. Zero- and few-shot in-context LLMs (Llama 3.1, Mistral 7B) underperform at F1 = 0.29–0.49, particularly on minority classes.
6. Practical Guidance and Known Limitations
Deployment on a new corpus requires sequential integration with MedCAT, initial relation annotation, strategic negative sampling, transformer fine-tuning, and API-driven inference:
- Integrate corpus into CogStack MedCAT; build/load NER+L for target ontology.
- Annotate small seed of entity-to-entity relations.
- Export labeled data; configure negative sampling.
- Fine-tune transformer (BERT/Llama) using toolkit protocols.
- Deploy via RelCAT API: text → MedCAT NER+L → RelCAT classifier → output relations as JSON.
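The deployment flow above can be sketched end to end with stubbed components; both stub functions and the JSON schema below are hypothetical stand-ins for the MedCAT NER+L step and the trained RelCAT classifier.

```python
import json

def ner_stub(text):
    """Stand-in for MedCAT NER+L: returns (cui, start, end) tuples."""
    return [("C0004057", 0, 7), ("C0018681", 22, 30)]

def relation_stub(text, e1, e2):
    """Stand-in for the trained RelCAT transformer classifier."""
    return "hypothetical-relation"

def extract_relations(text):
    """text -> NER+L -> pairwise relation classification -> JSON output."""
    entities = ner_stub(text)
    results = []
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            label = relation_stub(text, entities[i], entities[j])
            results.append({"e1": entities[i][0], "e2": entities[j][0],
                            "relation": label})
    return json.dumps(results)
```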
Key limitations include heuristic, context-sensitive definition of non-relations, Llama overfitting on limited data, and a pipeline architecture that may decouple named entity recognition and relation signals. Future work aims at ontology-driven candidate relation discovery, extension to cross-sentence relations, and end-to-end architectures for integrated NER and relation extraction.
7. Availability and Reproducibility
Source code, annotation pipelines, and documentation are publicly available in the CogStack/MedCAT GitHub repository (https://github.com/CogStack/MedCAT). This enables full reproducibility of reported results, annotation workflows, and model evaluations.
RelCAT represents a high-fidelity, transformer-based approach for clinical relation extraction, exceeding prior state-of-the-art benchmarks and providing a comprehensive open-source toolkit for the community (Agarwal et al., 27 Jan 2025).