medspaCy: Clinical NLP Pipeline Library
- medspaCy is an extensible, open-source library for clinical NLP that leverages spaCy’s modular structure for rapid development and customization of pipelines.
- It integrates rule-based and machine learning components—such as the ConText algorithm and QuickUMLS—for precise context analysis and terminology mapping.
- It has proven effective in operational settings like COVID-19 surveillance, achieving high precision and recall in extracting actionable clinical insights.
medspaCy is an extensible, open-source library for clinical natural language processing (cNLP) in Python, architected as a lightweight extension on the spaCy NLP framework. It enables seamless integration of rule-based and ML algorithms for the extraction and contextualization of clinical information from unstructured text. By leveraging spaCy’s object-oriented Doc/Span/Token structures and pipeline conventions, medspaCy facilitates rapid development, customization, and operational deployment of cNLP pipelines, with a focus on integrating domain-specific context analysis and terminology mapping (Eyre et al., 2021).
1. System Architecture and Pipeline Design
medspaCy is constructed directly atop spaCy’s API, preserving and extending its Doc/Span/Token object model and pipeline registration mechanism. The canonical medspaCy pipeline comprises multiple modular components, typically organized as follows:
- Tokenizer: Customizable rules targeting clinical punctuation and whitespace conventions.
- Sentence Segmenter: PyRuSH, a rule-based clinical sentence detector.
- Sectionizer: Rule-based matching for document section headers (e.g., "Past Medical History").
- Concept Matcher / UMLS Mapper: Including QuickUMLS for biomedical concept normalization.
- Context Analyzer: Implements the ConText algorithm for scope-based contextualization.
- Optional ML Components: spaCy or third-party ML modules (NER, textcat, transformers).
- Serializer or Output Utility: For downstream integration or results processing.
Pipeline components are registered via nlp.add_pipe("component_name", **kwargs) and operate on attributes exposed through spaCy’s extension registry (e.g., Span._.is_negated, Doc._.sections). Extensibility is supported through custom rule files/patterns, user-defined components using spaCy’s @Language.component decorator, and the introduction of new Doc/Span/Token attributes to accommodate emergent metadata.
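The registration pattern described above can be illustrated without any dependencies. The following is a minimal sketch of the idea only (named components applied in order to a shared document object, each writing attributes that later components read). `ToyDoc`, `ToyPipeline`, and both component functions are hypothetical stand-ins, not spaCy or medspaCy classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyDoc:
    """Stand-in for a spaCy Doc: raw text plus a free-form attribute store."""
    text: str
    attrs: Dict[str, object] = field(default_factory=dict)


Component = Callable[[ToyDoc], ToyDoc]


class ToyPipeline:
    """Ordered list of named components, loosely mirroring nlp.add_pipe."""

    def __init__(self) -> None:
        self._pipes: List[Tuple[str, Component]] = []

    def add_pipe(self, name: str, component: Component) -> None:
        if any(existing == name for existing, _ in self._pipes):
            raise ValueError(f"component {name!r} already in pipeline")
        self._pipes.append((name, component))

    @property
    def pipe_names(self) -> List[str]:
        return [name for name, _ in self._pipes]

    def __call__(self, text: str) -> ToyDoc:
        doc = ToyDoc(text)
        for _, component in self._pipes:
            doc = component(doc)  # each component receives and returns the doc
        return doc


def sentence_splitter(doc: ToyDoc) -> ToyDoc:
    # Naive split on periods; a clinical segmenter would use curated rules.
    doc.attrs["sentences"] = [s.strip() for s in doc.text.split(".") if s.strip()]
    return doc


def sentence_counter(doc: ToyDoc) -> ToyDoc:
    # Reads the attribute written by the previous component.
    doc.attrs["n_sentences"] = len(doc.attrs["sentences"])
    return doc


nlp = ToyPipeline()
nlp.add_pipe("toy_sentences", sentence_splitter)
nlp.add_pipe("toy_counter", sentence_counter)
doc = nlp("Denies chest pain. Reports mild cough.")
print(nlp.pipe_names, doc.attrs["n_sentences"])
```

Component order matters in this design: `sentence_counter` fails if it is registered before `sentence_splitter`, which is the same constraint that applies when a context analyzer depends on entities produced by an earlier matcher.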
2. Core Functionalities and Algorithms
2.1 Context Analysis: ConText Algorithm
medspaCy’s context analyzer component implements the ConText algorithm for detection of negation, temporality, certainty, and experiencer. Modifier rules in medspaCy are defined by a literal or pattern (spaCy matcher-style or regex), semantic category (e.g., NEGATION, HYPOTHETICAL), directionality (forward, backward, or bidirectional), and a contextual window (tokens or sentence scope).
Example rule definition:
```json
{
  "literal": ["no", "denies"],
  "category": "NEGATION",
  "pattern": [{"LOWER": {"IN": ["no", "denies"]}}],
  "direction": "forward",
  "window": "SENTENCE"
}
```
The core pseudocode algorithm is:
```python
for modifier in doc._.modifiers:
    for ent in doc.ents:
        if in_scope(modifier, ent, modifier.window):
            ent._.modifiers.add(modifier.category)
            # e.g. ent._.is_negated = True if category == "NEGATION"
```
After this pass, each entity carries the resulting context attributes: ent._.is_negated (boolean), ent._.temporality (e.g., "PAST", "HYPOTHETICAL"), and ent._.experiencer ("PATIENT", "FAMILY").
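To make the scope test concrete, here is a small, dependency-free sketch of the modifier-to-entity assignment. It assumes whitespace tokens, a single sentence, and only forward or backward direction. The `Modifier` and `Entity` classes and the `in_scope` signature are illustrative inventions, not medspaCy's data model, and the full ConText algorithm has further rules (such as termination triggers) that are omitted here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Modifier:
    category: str                 # e.g. "NEGATION"
    start: int                    # index of the first trigger token
    end: int                      # index one past the last trigger token
    direction: str                # "forward" or "backward"
    window: Optional[int] = None  # max tokens in scope; None means whole sentence


@dataclass
class Entity:
    text: str
    start: int
    end: int
    categories: Set[str] = field(default_factory=set)

    @property
    def is_negated(self) -> bool:
        return "NEGATION" in self.categories


def in_scope(mod: Modifier, ent: Entity, sent_start: int, sent_end: int) -> bool:
    """True if the entity lies fully inside the modifier's scope."""
    if mod.direction == "forward":
        lo, hi = mod.end, sent_end
        if mod.window is not None:
            hi = min(hi, mod.end + mod.window)
    else:  # "backward"
        lo, hi = sent_start, mod.start
        if mod.window is not None:
            lo = max(lo, mod.start - mod.window)
    return lo <= ent.start and ent.end <= hi


def apply_context(mods: List[Modifier], ents: List[Entity],
                  sent_start: int, sent_end: int) -> None:
    """Attach each modifier's category to every entity within its scope."""
    for mod in mods:
        for ent in ents:
            if in_scope(mod, ent, sent_start, sent_end):
                ent.categories.add(mod.category)


tokens = "patient denies chest pain and reports cough".split()
ents = [Entity("chest pain", 2, 4), Entity("cough", 6, 7)]
# "denies" is token 1; its forward scope is limited to the next 3 tokens.
mods = [Modifier("NEGATION", start=1, end=2, direction="forward", window=3)]
apply_context(mods, ents, sent_start=0, sent_end=len(tokens))
print([(e.text, e.is_negated) for e in ents])
```

With the three-token window, "chest pain" is negated and "cough" is not. With `window=None` the scope extends to the end of the sentence and both entities would be marked, which shows why the window setting matters for precision.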
2.2 Terminology Mapping: QuickUMLS Integration
medspaCy incorporates QuickUMLS as a spaCy pipeline component for terminological normalization against the Unified Medical Language System (UMLS). The mapping procedure is bifurcated into an indexing phase (pre-runtime) and the runtime matching phase.
Indexing phase:
- Loads UMLS sample data (CUIs to concept strings).
- Normalizes concept strings (lowercasing, stripping punctuation), then generates character n-grams.
- Stores an inverted index mapping n-grams to sets of CUIs.
Runtime phase:
- Iterates over candidate token spans in the input Doc.
- For each span, normalizes, generates n-grams, and retrieves candidate CUIs as the union across indexed n-grams.
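- Computes similarity between the span and each candidate concept using the Sorensen–Dice coefficient over their n-gram sets $A$ and $B$: $\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$.
- If $\mathrm{Dice}(A, B) \geq \tau$ (threshold, e.g., $\tau = 0.7$), emits a Span with span._.cui and associated metadata.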
Attributes such as span._.umls_cuis and span._.umls_definitions are attached to matched entities for downstream analytic and curation tasks.
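The two phases can be sketched in a few lines of standard-library Python. This is an illustration of the indexing and matching steps as described here, not the QuickUMLS implementation. The concept table uses placeholder identifiers rather than real CUIs, and the trigram size, function names, and tie-breaking order are assumptions made for the example.

```python
import string
from collections import defaultdict
from typing import Dict, List, Set, Tuple

_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase and strip punctuation, as in the indexing step."""
    return text.lower().translate(_PUNCT).strip()


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Set of character n-grams; strings shorter than n yield themselves."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice(a: Set[str], b: Set[str]) -> float:
    """Sorensen-Dice coefficient of two sets."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical concept table; the identifiers are placeholders, not real CUIs.
CONCEPTS: Dict[str, str] = {
    "X0000001": "chest pain",
    "X0000002": "chest x-ray",
    "X0000003": "abdominal pain",
}


def build_index(concepts: Dict[str, str]) -> Dict[str, Set[str]]:
    """Indexing phase: map each n-gram to the concepts containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for cui, name in concepts.items():
        for gram in char_ngrams(normalize(name)):
            index[gram].add(cui)
    return index


def match(span_text: str, concepts: Dict[str, str], index: Dict[str, Set[str]],
          threshold: float = 0.7) -> List[Tuple[str, float]]:
    """Runtime phase: gather candidates via the index, then score with Dice."""
    grams = char_ngrams(normalize(span_text))
    candidates: Set[str] = set().union(*(index.get(g, set()) for g in grams))
    scored = [(cui, dice(grams, char_ngrams(normalize(concepts[cui]))))
              for cui in candidates]
    # Keep matches at or above the threshold, best score first.
    return sorted((p for p in scored if p[1] >= threshold),
                  key=lambda p: (-p[1], p[0]))


INDEX = build_index(CONCEPTS)
result = match("Chest pains", CONCEPTS, INDEX)
print(result)
```

For the span "Chest pains", all eight trigrams of "chest pain" are shared out of 9 + 8 total, giving a score of 16/17 (about 0.94), while "chest x-ray" and "abdominal pain" fall below the 0.7 threshold and are discarded.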
3. Rule-Based and Machine Learning Integration
medspaCy pipelines explicitly support interleaving rule-based and ML-driven components. Components such as rule-based concept matchers (e.g., for medication mentions) can be sequenced with statistical NER models or text classifiers provided by spaCy or third-party frameworks. A representative configuration might include rule-based sectionizers, concept matchers, QuickUMLS for terminology normalization, the ConText context analyzer, and a terminal text classifier.
Example configuration sequence:
```python
import spacy
import medspacy  # importing medspacy makes its pipeline factories available to spaCy

# Schematic sketch: factory and method names can differ between medspaCy releases.
nlp = spacy.load("en_core_web_sm", disable=["ner"])

# Rule-based section detection
sectionizer = nlp.add_pipe("medspacy_sectionizer")
sectionizer.load_sections("/path/to/sections.json")

# Rule-based concept matching
concept_matcher = nlp.add_pipe("medspacy_concept_matcher")
concept_matcher.add(label="COVID_TEST", pattern=[{"LOWER": "covid"}, {"LOWER": "test"}])

# UMLS terminology normalization
umls = nlp.add_pipe("quickumls", config={"threshold": 0.75, "overmatch_policy": "first"})

# Context analysis; runs after the components that create entities
context = nlp.add_pipe("medspacy_context")
context.add_modifier("no history of", "NEGATION", direction="forward", window="SENTENCE")

# Terminal ML text classifier
nlp.add_pipe("textcat", name="covid_classifier", last=True)
```
4. Extensibility and Customization
The design of medspaCy centers on extensibility via both rule customization and component augmentation. Users may define or modify context modifiers (context.add_modifier) and add custom rule patterns to concept matchers (matcher.add()), tailoring pipelines to specific clinical sub-domains or annotation schemas. Custom pipeline components can be constructed using spaCy’s decorator paradigm:
```python
from spacy.language import Language


@Language.component("my_clinical_filter")
def my_clinical_filter(doc):
    # Keep only entities that fall outside sections titled "DISCHARGE"
    doc.ents = [ent for ent in doc.ents if ent._.section_title != "DISCHARGE"]
    return doc


# Assumes `nlp` is an existing pipeline that already contains the sectionizer
nlp.add_pipe("my_clinical_filter", after="medspacy_sectionizer")
```
Because medspaCy’s API extensions are minimal and non-intrusive, arbitrary spaCy and third-party components (e.g., scispaCy’s AbbreviationDetector or transformer-based NER models) can be composed with medspaCy’s rule-based modules within the same execution pipeline.
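A section-based filter like the one above depends on knowing which section each entity falls in. The following dependency-free sketch shows one way that assignment could work. It assumes headers look like a capitalized title at the start of a line followed by a colon, and every function name here is hypothetical. medspaCy's sectionizer is driven by configurable rules rather than this single regular expression.

```python
import re
from bisect import bisect_right
from typing import List, Set, Tuple

# Assumed header convention: capitalized title at line start, then a colon.
HEADER = re.compile(r"^(?P<title>[A-Z][A-Za-z ]*):", re.MULTILINE)


def find_sections(text: str) -> List[Tuple[int, str]]:
    """Return (start_offset, upper-cased title) for every detected header."""
    return [(m.start(), m.group("title").upper()) for m in HEADER.finditer(text)]


def section_of(offset: int, sections: List[Tuple[int, str]]) -> str:
    """Title of the last header starting at or before offset, else 'UNKNOWN'."""
    starts = [start for start, _ in sections]
    i = bisect_right(starts, offset) - 1
    return sections[i][1] if i >= 0 else "UNKNOWN"


def drop_entities_in(text: str, spans: List[Tuple[int, int]],
                     excluded: Set[str]) -> List[Tuple[int, int]]:
    """Keep only the character spans whose section title is not excluded."""
    sections = find_sections(text)
    return [(s, e) for s, e in spans if section_of(s, sections) not in excluded]


note = "Past Medical History: pneumonia in 2019.\nDischarge: follow up for pneumonia."
spans = [(m.start(), m.end()) for m in re.finditer("pneumonia", note)]
kept = drop_entities_in(note, spans, excluded={"DISCHARGE"})
print([(note[s:e], section_of(s, find_sections(note))) for s, e in kept])
```

Only the first "pneumonia" mention survives, because the second one starts after the "Discharge:" header.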
5. Practical Applications and Empirical Evaluation
medspaCy’s efficacy has been demonstrated in large-scale operational contexts. Its components have been deployed to process over 63 million documents in the Veterans Affairs (VA) COVID-19 surveillance pipeline, attaining 82.5% precision and 94.2% recall in identifying positive COVID-19 test mentions. medspaCy also underpins chief complaint syndromic surveillance, operational since 2019, processing triage notes for 3 million patients. Although explicit runtime benchmarks are not reported, medspaCy inherits the Cython-optimized efficiency of spaCy and the sub-second matching speeds of QuickUMLS.
These use cases relied on iterative rule development and integrated visualization utilities for highlighting entities and contextual boundaries, facilitating rapid prototyping and deployment of clinical NLP pipelines at scale. The deployments suggest that medspaCy’s design principles are practically effective for high-throughput, domain-specialized text mining in clinical informatics (Eyre et al., 2021).
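For readers who want a single summary figure, the reported precision and recall can be combined into an F1 score, the harmonic mean of the two. The value below is derived arithmetically from the two figures quoted above and is not itself reported in the source.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_precision = 0.825  # as quoted above for positive COVID-19 test mentions
reported_recall = 0.942
f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.3f}")  # prints: F1 = 0.880
```

The recall is noticeably higher than the precision, so this system misses few true mentions at the cost of some false positives.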
6. Usage Example and Workflow Patterns
medspaCy provides a unified API for pipeline composition:
```python
import medspacy

# Schematic sketch: exact factory and method names depend on the installed medspaCy version.
nlp = medspacy.load()

sectionizer = nlp.add_pipe("medspacy_sectionizer")
sectionizer.load_sections("default_sections.json")

# Entities must exist before the context component runs, so add the matcher first.
matcher = nlp.add_pipe("medspacy_concept_matcher")
matcher.add("PAIN_PATTERN", label="PAIN", pattern=[{"LOWER": "chest"}, {"LOWER": "pain"}])

context = nlp.add_pipe("medspacy_context")
context.add_modifier("no", "NEGATION", direction="forward", window="SENTENCE")
context.add_modifier("history of", "HISTORICAL", direction="forward", window=10)

text = "In the Past Medical History section: no chest pain; now he reports chest pain."
doc = nlp(text)

# Detected sections and their boundaries
for sec in doc._.sections:
    print(sec.title, sec.start, sec.end)

# Extracted entities with their context attributes
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.is_negated, ent._.temporality)
```
7. Summary Table: Core Components
| Component | Functionality | Customization Points |
|---|---|---|
| Tokenizer | Clinical punctuation and whitespace handling | Custom rules |
| Sentence Segmenter | PyRuSH clinical sentence detection | Rule files |
| Sectionizer | Rule-based section header matching | Section definition JSON |
| Concept Matcher | Pattern-based entity extraction | Patterns/rules |
| QuickUMLS | UMLS concept mapping, Sorensen–Dice similarity | Threshold, index |
| ConText | Negation, temporality, experiencer, certainty detection | Modifier rules |
| ML Components | NER, text classification (spaCy/third-party) | Model swap/config |
This structure illustrates the modular composability and surface-level extensibility characteristic of medspaCy (Eyre et al., 2021).