RelCAT Model for Clinical Relation Extraction
- RelCAT is a framework that processes clinical text by integrating MedCAT entity recognition with a transformer-based relation extraction pipeline.
- It leverages advanced tokenization, configurable context windows, and ontology-driven negative sampling to achieve robust performance across diverse clinical datasets.
- The toolkit provides comprehensive annotation protocols, reproducible code, and rigorous evaluations, enabling scalable and accurate extraction of entity-to-entity relations in clinical narratives.
RelCAT (Relation Concept Annotation Toolkit) is an interactive annotation, inference, and training framework for classifying entity-to-entity relations in clinical narratives, specifically those embedded in unstructured Electronic Health Records (EHRs). Originating as a major extension to the CogStack MedCAT system, RelCAT addresses the complexity of clinical information spread across free text, where relations among drugs, findings, and procedures are often diffuse and context-dependent. RelCAT provides flexible annotation, state-of-the-art model architectures, robust evaluation metrics, and reproducible code for advancing clinical relation extraction research (Agarwal et al., 27 Jan 2025).
1. System Architecture and Workflow
RelCAT is structured as a two-stage pipeline augmenting the established MedCAT entity recognition and linking (NER+L) system:
- Stage I (MedCAT): Raw clinical text is tokenized, medical entities are recognized and mapped to concept unique identifiers (CUIs, SCTIDs) using databases like SNOMED-CT/UMLS, and context meta-features such as negation and temporality are detected.
- Stage II (RelCAT): The system accepts pre-extracted entities with character spans and CUIs, generates all candidate entity pairs within a configurable distance, extracts context windows around each candidate pair, and applies transformer-based models (BERT or Llama) for relation classification. The output is a discrete set of relation labels for each entity pair.
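Stage II's candidate-pair generation and context-window extraction can be sketched as follows. This is a minimal illustration, not the toolkit's actual API: the `Entity` dataclass, distance threshold, and margin values are all assumptions.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Entity:
    cui: str    # concept unique identifier assigned by MedCAT
    start: int  # character offset where the entity span starts
    end: int    # character offset where the entity span ends

def candidate_pairs(entities, max_char_dist=100):
    """Generate all entity pairs whose spans lie within a configurable distance."""
    ordered = sorted(entities, key=lambda e: e.start)
    return [(e1, e2) for e1, e2 in combinations(ordered, 2)
            if e2.start - e1.end <= max_char_dist]

def context_window(text, e1, e2, margin=50):
    """Extract the text surrounding a candidate pair, padded by `margin` chars."""
    lo = max(0, min(e1.start, e2.start) - margin)
    hi = min(len(text), max(e1.end, e2.end) + margin)
    return text[lo:hi]
```

Each surviving pair, together with its context window, is what the transformer classifier actually scores.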
Annotation is carried out through an integrated MedCATTrainer module, which allows annotators to:
- Mark entities (with CUI and TUI filtering to enforce ontology constraints).
- Specify or correct relations between entity pairs, including relation type, valid distance, and context span.
- Benefit from automatic generation of negative (non-relation) examples using TUI-based ontology priors.
2. Input Encoding and Preprocessing
Input text undergoes transformer-compatible tokenization, using either WordPiece (BERT) or Byte-Pair Encoding (Llama 3). Special boundary markers (e.g., [s1]…[e1] around the first entity and [s2]…[e2] around the second) may be inserted to demarcate entities explicitly in the token sequence. Context is limited to a fixed number of tokens around entity boundaries to control the inference window size.
Hidden states are extracted from the transformer backbone. For multi-token entities, representations are constructed via max-pooling over the relevant token vectors. Optionally, the global sequence-embedding may also be incorporated. MedCAT’s linked concept embedding underpins negative sample generation and CUI-filtered annotation.
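The pooling-and-concatenation step can be illustrated with a small numpy sketch; the hidden states, dimensions, and entity spans here are arbitrary stand-ins for actual transformer outputs.

```python
import numpy as np

def entity_repr(hidden, start, end):
    """Max-pool transformer hidden states over a multi-token span [start, end)."""
    return hidden[start:end].max(axis=0)

rng = np.random.default_rng(0)
H = rng.standard_normal((12, 8))        # 12 tokens, hidden size 8

h_e1 = entity_repr(H, 2, 5)             # first entity spans tokens 2..4
h_e2 = entity_repr(H, 7, 9)             # second entity spans tokens 7..8
z = np.concatenate([H[0], h_e1, h_e2])  # optionally prepend the sequence embedding
```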
3. Model Structure and Mathematical Framework
Both BERT and Llama variants implement a unified architecture:
- Encoding: $x = [t_1, \ldots, t_n]$, where each $t_i \in \mathcal{V}$ (the tokenizer vocabulary).
- Transformer Forward: $H = \mathrm{Transformer}(x)$, $H \in \mathbb{R}^{n \times d}$.
- Entity Representation: For entities occupying token indices $i, \ldots, j$: $h_e = \mathrm{maxpool}(H_{i:j})$.
- Concatenated vector: $z = [h_{e_1}; h_{e_2}]$ (or $z = [h_{\mathrm{CLS}}; h_{e_1}; h_{e_2}]$ when the global sequence embedding is included).
- Classification: $u = \sigma(W_1 z + b_1)$, $\mathrm{logits} = W_2 u + b_2$, with nonlinearity $\sigma$.
- Softmax and Loss:
- $p_c = \exp(\mathrm{logit}_c) / \sum_{c'} \exp(\mathrm{logit}_{c'})$ for class $c$.
- Cross-entropy: $\mathcal{L} = -\sum_c y_c \log p_c$ (weighted: $\mathcal{L} = -\sum_c w_c\, y_c \log p_c$ for class imbalance).
No custom attention mechanisms are employed; standard transformer attention is retained, with a two-layer MLP head for classification.
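The classification head and loss can be sketched in numpy as follows; the choice of tanh as the nonlinearity and all dimensions are assumptions for illustration.

```python
import numpy as np

def mlp_head(z, W1, b1, W2, b2):
    """Two-layer MLP head: logits = W2 * sigma(W1 z + b1) + b2, sigma = tanh here."""
    return W2 @ np.tanh(W1 @ z + b1) + b2

def softmax(logits):
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

def weighted_ce(probs, y_onehot, w):
    """Class-weighted cross-entropy for a single example."""
    return float(-(w * y_onehot * np.log(probs)).sum())

rng = np.random.default_rng(1)
d, h, k = 24, 16, 9                     # input dim, hidden dim, num classes
z = rng.standard_normal(d)
W1, b1 = rng.standard_normal((h, d)), np.zeros(h)
W2, b2 = rng.standard_normal((k, h)), np.zeros(k)

p = softmax(mlp_head(z, W1, b1, W2, b2))
y = np.eye(k)[3]                        # gold label: class 3, one-hot
w = np.ones(k); w[3] = 2.0              # up-weight an imbalanced class
loss = weighted_ce(p, y, w)
```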
4. Training Regimen and Dataset Details
RelCAT is evaluated on both open and proprietary datasets:
- n2c2 2018: 505 discharge summaries, 8 drug-relation classes, k relations.
- NHS Spatial: 119 radiology/pathology reports, 613 spatial relations, plus generated negative samples.
- NHS Physiotherapy-Mobility: 486 physiotherapy notes, 278 single-instance relations, plus generated negative samples.
Preprocessing pipelines apply MedCAT NER+L using SNOMED CT for CUI linkage. Relation pairs are generated within a character distance threshold; negative sampling is ontology-driven using TUI filters.
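The TUI-driven negative sampling can be sketched as below. The allowed-pair schema is purely illustrative (T121 and T047 are UMLS semantic types for pharmacologic substance and disease/syndrome), and the function names are assumptions, not the toolkit's API.

```python
# Pairs whose semantic-type (TUI) combination matches the relation schema remain
# candidate positives; all other pairs become negative (no-relation) samples.
ALLOWED_TUI_PAIRS = {("T121", "T047")}  # illustrative: drug -> disorder

def split_by_tui(pairs, allowed=ALLOWED_TUI_PAIRS):
    """pairs: iterable of ((cui1, tui1), (cui2, tui2)) candidate tuples."""
    candidates, negatives = [], []
    for pair in pairs:
        (_, t1), (_, t2) = pair
        (candidates if (t1, t2) in allowed else negatives).append(pair)
    return candidates, negatives
```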
Optimization settings:
- AdamW optimizer.
- Learning rates: $2\times10^{-5}$ (BERT), $1\times10^{-5}$ (Llama).
- Batch size: 16–32.
- Epochs: 3–5.
- Learning-rate warmup over an initial number of steps.
- Class weighting and stratified batching mitigate imbalances.
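One common way to derive the class weights used above is inverse-frequency weighting; this sketch assumes that scheme (the toolkit's exact weighting formula may differ).

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights, normalised so the mean weight is 1."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}
```

Rare classes thus receive proportionally larger gradients in the weighted cross-entropy loss.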
Annotation proceeds by launching MedCATTrainer with CUI/TUI filters, marking entities and subsequently relations, and exporting the annotated corpus for model training.
5. Evaluation Metrics and Comparative Analysis
Performance is measured per class and reported as macro and micro averages:
- Precision: $P = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$.
- Recall: $R = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$.
- F1: $F_1 = 2PR / (P + R)$.
- Macro/micro averages and overall accuracy are computed from these per-class scores.
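These per-class metrics and their macro average can be computed directly from parallel label lists, as in this self-contained sketch:

```python
from collections import Counter

def per_class_prf(y_true, y_pred):
    """Per-class (precision, recall, F1) from parallel gold/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, but gold was t
            fn[t] += 1   # gold t was missed
    scores = {}
    for c in set(y_true) | set(y_pred):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = per_class_prf(y_true, y_pred)
    return sum(f for _, _, f in scores.values()) / len(scores)
```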
Performance Summary
| Dataset | Model | F1 Macro | Accuracy | SOTA Reference |
|---|---|---|---|---|
| n2c2 (gold-standard) | BERT (unfrozen) | 0.977 | 0.977 | 0.956 / 0.961 |
| NHS Spatial | BERT (unfrozen) | 0.902 | — | — |
| NHS Spatial | n2c2-pretrained BERT | 0.918 | — | — |
| NHS Spatial | Llama (frozen) | 0.933 | — | — |
| NHS Physio-Mobility | BERT (unfrozen) | 0.905 | — | — |
| NHS Physio-Mobility | n2c2-pretrained BERT | 0.938 | — | — |
| NHS Physio-Mobility | Llama (frozen) | 0.835 | — | — |
Minority classes on n2c2 (ADE-Drug, Duration-Drug) achieve F1 = 0.866 and 0.933, respectively. Zero- and few-shot in-context LLMs (Llama 3.1, Mistral 7B) underperform at F1 = 0.29–0.49, particularly on minority classes.
6. Practical Guidance and Known Limitations
Deployment on a new corpus requires sequential integration with MedCAT, initial relation annotation, strategic negative sampling, transformer fine-tuning, and API-driven inference:
- Integrate corpus into CogStack MedCAT; build/load NER+L for target ontology.
- Annotate small seed of entity-to-entity relations.
- Export labeled data; configure negative sampling.
- Fine-tune transformer (BERT/Llama) using toolkit protocols.
- Deploy via RelCAT API: text → MedCAT NER+L → RelCAT classifier → output relations as JSON.
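The deployment flow above can be sketched end to end with stubbed components; both stub functions and the JSON schema below are hypothetical stand-ins for the MedCAT NER+L step and the trained RelCAT classifier.

```python
import json

def ner_stub(text):
    """Stand-in for MedCAT NER+L: returns (cui, start, end) tuples."""
    return [("C0004057", 0, 7), ("C0018681", 22, 30)]

def relation_stub(text, e1, e2):
    """Stand-in for the trained RelCAT transformer classifier."""
    return "hypothetical-relation"

def extract_relations(text):
    """text -> NER+L -> pairwise relation classification -> JSON output."""
    entities = ner_stub(text)
    results = []
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            label = relation_stub(text, entities[i], entities[j])
            results.append({"e1": entities[i][0], "e2": entities[j][0],
                            "relation": label})
    return json.dumps(results)
```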
Key limitations include heuristic, context-sensitive definition of non-relations, Llama overfitting on limited data, and a pipeline architecture that may decouple named entity recognition and relation signals. Future work aims at ontology-driven candidate relation discovery, extension to cross-sentence relations, and end-to-end architectures for integrated NER and relation extraction.
7. Availability and Reproducibility
Source code, annotation pipelines, and documentation are publicly available in the CogStack/MedCAT GitHub repository (https://github.com/CogStack/MedCAT). This enables full reproducibility of reported results, annotation workflows, and model evaluations.
RelCAT represents a high-fidelity, transformer-based approach for clinical relation extraction, exceeding prior state-of-the-art benchmarks and providing a comprehensive open-source toolkit for the community (Agarwal et al., 27 Jan 2025).