medspaCy: Clinical NLP Pipeline Library

Updated 13 March 2026
  • medspaCy is an extensible, open-source library for clinical NLP that leverages spaCy’s modular structure for rapid development and customization of pipelines.
  • It integrates rule-based and machine learning components—such as the ConText algorithm and QuickUMLS—for precise context analysis and terminology mapping.
  • It has proven effective in operational settings like COVID-19 surveillance, achieving high precision and recall in extracting actionable clinical insights.

medspaCy is an extensible, open-source library for clinical natural language processing (cNLP) in Python, architected as a lightweight extension on the spaCy NLP framework. It enables seamless integration of rule-based and ML algorithms for the extraction and contextualization of clinical information from unstructured text. By leveraging spaCy’s object-oriented Doc/Span/Token structures and pipeline conventions, medspaCy facilitates rapid development, customization, and operational deployment of cNLP pipelines, with a focus on integrating domain-specific context analysis and terminology mapping (Eyre et al., 2021).

1. System Architecture and Pipeline Design

medspaCy is constructed directly atop spaCy’s API, preserving and extending its Doc/Span/Token object model and pipeline registration mechanism. The canonical medspaCy pipeline comprises multiple modular components, typically organized as follows:

  1. Tokenizer: Customizable rules targeting clinical punctuation and whitespace conventions.
  2. Sentence Segmenter: PyRuSH, a rule-based clinical sentence detector.
  3. Sectionizer: Rule-based matching for document section headers (e.g., "Past Medical History").
  4. Concept Matcher / UMLS Mapper: Including QuickUMLS for biomedical concept normalization.
  5. Context Analyzer: Implements the ConText algorithm for scope-based contextualization.
  6. Optional ML Components: spaCy or third-party ML modules (NER, textcat, transformers).
  7. Serializer or Output Utility: For downstream integration or results processing.

Pipeline components are registered via nlp.add_pipe("component_name", **kwargs) and operate on attributes exposed through spaCy’s extension registry (e.g., Span._.is_negated, Doc._.sections). Extensibility is supported through custom rule files/patterns, user-defined components using spaCy’s @Language.component decorator, and the introduction of new Doc/Span/Token attributes to accommodate emergent metadata.
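The ordering and attribute-sharing behaviour described above can be illustrated without spaCy. The following is a minimal pure-Python sketch, not spaCy's or medspaCy's implementation: ToyDoc, ToyPipeline, and the three component functions are hypothetical names introduced only for illustration. It shows why component order matters, since the last component reads attributes written by the two before it.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a spaCy Doc: raw text plus custom attributes."""
    text: str
    attrs: Dict[str, list] = field(default_factory=dict)


Component = Callable[[ToyDoc], ToyDoc]


class ToyPipeline:
    """Runs named components in insertion order; each takes and returns the doc."""

    def __init__(self) -> None:
        self._pipes: List[Tuple[str, Component]] = []

    def add_pipe(self, name: str, component: Component) -> None:
        self._pipes.append((name, component))

    @property
    def pipe_names(self) -> List[str]:
        return [name for name, _ in self._pipes]

    def __call__(self, text: str) -> ToyDoc:
        doc = ToyDoc(text)
        for _, component in self._pipes:
            doc = component(doc)
        return doc


def sectionizer(doc: ToyDoc) -> ToyDoc:
    # Record (offset, header) for each known section header, in document order.
    headers = ["Past Medical History:", "Assessment:"]
    pattern = "|".join(re.escape(h) for h in headers)
    doc.attrs["sections"] = [(m.start(), m.group()) for m in re.finditer(pattern, doc.text)]
    return doc


def concept_matcher(doc: ToyDoc) -> ToyDoc:
    # Store (start, end, label) for each case-insensitive literal match.
    doc.attrs["ents"] = [
        (m.start(), m.end(), "PROBLEM")
        for m in re.finditer(r"chest pain", doc.text, flags=re.IGNORECASE)
    ]
    return doc


def section_tagger(doc: ToyDoc) -> ToyDoc:
    # Depends on both earlier components: attach the nearest preceding header to each entity.
    tagged = []
    for start, end, label in doc.attrs["ents"]:
        preceding = [header for offset, header in doc.attrs["sections"] if offset <= start]
        tagged.append((doc.text[start:end], label, preceding[-1] if preceding else None))
    doc.attrs["ents_in_sections"] = tagged
    return doc


nlp = ToyPipeline()
nlp.add_pipe("sectionizer", sectionizer)
nlp.add_pipe("concept_matcher", concept_matcher)
nlp.add_pipe("section_tagger", section_tagger)

doc = nlp("Past Medical History: chest pain. Assessment: chest pain resolved.")
for text, label, section in doc.attrs["ents_in_sections"]:
    print(text, label, section)
```

Registering section_tagger before either of the other two components would raise a KeyError, which mirrors the ordering constraints that apply when composing real pipeline components.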

2. Core Functionalities and Algorithms

2.1 Context Analysis: ConText Algorithm

medspaCy’s context analyzer component implements the ConText algorithm for detecting negation, temporality, certainty, and experiencer. Each modifier rule is defined by a literal and an optional pattern (spaCy matcher-style or regex), a semantic category (e.g., NEGATED_EXISTENCE, HYPOTHETICAL), a direction (forward, backward, or bidirectional), and a scope: by default the sentence containing the modifier, optionally capped at a maximum number of tokens.

Example rule definition (field names follow medspaCy's ConTextRule; a null max_scope leaves the scope at the sentence boundary):

{
  "literal": "no",
  "category": "NEGATED_EXISTENCE",
  "pattern": [{"LOWER": {"IN": ["no", "denies"]}}],
  "direction": "FORWARD",
  "max_scope": null
}

The core pseudocode algorithm is:

# Pseudocode: in_scope() stands for the direction and scope checks described above.
for modifier in modifiers:        # cue phrases matched in the document
    for ent in doc.ents:          # target entities
        if in_scope(modifier, ent):
            ent._.modifiers.add(modifier.category)
            # e.g. ent._.is_negated = True if category == "NEGATED_EXISTENCE"

Contextual attributes are then attached to Span objects. In current medspaCy releases these are boolean flags: ent._.is_negated, ent._.is_uncertain, ent._.is_historical, ent._.is_hypothetical, and ent._.is_family (an experiencer other than the patient).
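The scope check can be made concrete with a small runnable sketch. This is a simplified illustration under stated assumptions, not medspaCy's implementation: the Modifier and Target classes, the in_scope and apply_context functions, and the token-index representation are all hypothetical, targets are reduced to a single token index, and termination cues (which end a modifier's scope early in the full ConText algorithm) are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Modifier:
    index: int                       # token index of the cue phrase
    category: str                    # e.g. "NEGATED_EXISTENCE"
    direction: str                   # "forward", "backward", or "bidirectional"
    max_scope: Optional[int] = None  # window in tokens; None means the whole sentence


@dataclass
class Target:
    index: int                       # token index where the entity starts
    text: str
    categories: Set[str] = field(default_factory=set)

    @property
    def is_negated(self) -> bool:
        return "NEGATED_EXISTENCE" in self.categories


def in_scope(modifier: Modifier, target: Target, sent_start: int, sent_end: int) -> bool:
    """True if target lies in the modifier's scope within sentence [sent_start, sent_end)."""
    if not (sent_start <= modifier.index < sent_end):
        return False
    if not (sent_start <= target.index < sent_end):
        return False
    offset = target.index - modifier.index
    if offset == 0:
        return False
    if modifier.direction == "forward" and offset < 0:
        return False
    if modifier.direction == "backward" and offset > 0:
        return False
    return modifier.max_scope is None or abs(offset) <= modifier.max_scope


def apply_context(modifiers: List[Modifier], targets: List[Target],
                  sentences: List[Tuple[int, int]]) -> None:
    # Attach each modifier's category to every target inside its scope.
    for sent_start, sent_end in sentences:
        for modifier in modifiers:
            for target in targets:
                if in_scope(modifier, target, sent_start, sent_end):
                    target.categories.add(modifier.category)


tokens = "Patient denies chest pain . He reports fever .".split()
sentences = [(0, 5), (5, len(tokens))]           # token ranges of the two sentences
modifiers = [Modifier(index=1, category="NEGATED_EXISTENCE", direction="forward")]
targets = [Target(index=2, text="chest pain"), Target(index=7, text="fever")]

apply_context(modifiers, targets, sentences)
for target in targets:
    print(target.text, target.is_negated)
```

Because the cue "denies" and the entity "fever" fall in different sentences, only "chest pain" is marked as negated.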

2.2 Terminology Mapping: QuickUMLS Integration

medspaCy incorporates QuickUMLS as a spaCy pipeline component for terminological normalization against the Unified Medical Language System (UMLS). The mapping procedure is bifurcated into an indexing phase (pre-runtime) and the runtime matching phase.

Indexing phase:

  • Loads UMLS sample data (CUIs to concept strings).
  • Normalizes concept strings (lowercasing, stripping punctuation), then generates character n-grams.
  • Stores an inverted index mapping n-grams to sets of CUIs.

Runtime phase:

  • Iterates over candidate token spans in the input Doc.
  • For each span, normalizes, generates n-grams, and retrieves candidate CUIs as the union across indexed n-grams.
  • Computes the similarity $S$ between the span and each candidate concept from their n-gram sets $G_{span}$ and $G_{concept}$, here using the Sørensen–Dice coefficient (QuickUMLS also supports Jaccard, cosine, and overlap measures):

$$S = \frac{2 \cdot |G_{span} \cap G_{concept}|}{|G_{span}| + |G_{concept}|}$$

  • If $S \geq \theta$ (a threshold, e.g., 0.7), emits a Span annotated with the matched CUI and associated metadata.

Matched entities carry their UMLS results as span-level extension attributes (ent._.umls_matches in current medspaCy releases, holding the CUI, semantic types, and similarity score of each match) for downstream analytic and curation tasks.
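The indexing and runtime phases can be sketched in a few lines of pure Python. This is an illustrative approximation, not QuickUMLS's actual implementation (which relies on a dedicated approximate string matching backend): the function names, the use of character trigrams without padding, the one-string-per-concept simplification, and the DEMO_CUI identifiers are assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(cleaned.split())


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Set of character n-grams of the normalized text."""
    norm = normalize(text)
    if len(norm) < n:
        return {norm} if norm else set()
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def build_index(concepts: Dict[str, str]):
    """Indexing phase: map each n-gram to the set of concept IDs containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    concept_grams: Dict[str, Set[str]] = {}
    for cui, name in concepts.items():
        grams = char_ngrams(name)
        concept_grams[cui] = grams
        for gram in grams:
            index[gram].add(cui)
    return index, concept_grams


def dice(a: Set[str], b: Set[str]) -> float:
    """Sørensen–Dice coefficient between two n-gram sets."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def match(span_text: str, index, concept_grams, threshold: float = 0.7) -> List[Tuple[str, float]]:
    """Runtime phase: gather candidates via the index, score them, keep S >= threshold."""
    span_grams = char_ngrams(span_text)
    candidates: Set[str] = set()
    for gram in span_grams:
        candidates |= index.get(gram, set())
    scored = [(cui, dice(span_grams, concept_grams[cui])) for cui in candidates]
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: -item[1])


# Made-up identifiers; real CUIs come from a licensed UMLS installation.
concepts = {"DEMO_CUI_1": "chest pain", "DEMO_CUI_2": "chest wall pain"}
index, concept_grams = build_index(concepts)

results = match("Chest pains!", index, concept_grams, threshold=0.7)
for cui, score in results:
    print(cui, round(score, 3))
```

Here "Chest pains!" shares 8 trigrams with "chest pain" (S = 16/17, about 0.94) and 7 with "chest wall pain" (S = 14/22, about 0.64), so only the first concept passes the 0.7 threshold.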

3. Rule-Based and Machine Learning Integration

medspaCy pipelines explicitly support interleaving rule-based and ML-driven components. Components such as rule-based concept matchers (e.g., for medication mentions) can be sequenced with statistical NER models or text classifiers provided by spaCy or third-party frameworks. A representative configuration might include rule-based sectionizers, concept matchers, QuickUMLS for terminology normalization, the ConText context analyzer, and a terminal text classifier.

Example configuration sequence:

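import spacy
import medspacy  # importing medspacy registers its pipeline factories with spaCy
from medspacy.context import ConTextRule
from medspacy.ner import TargetRule
from medspacy.section_detection import SectionRule

# Component and rule names follow the medspaCy 1.x documentation; verify against your installed version.
nlp = spacy.load("en_core_web_sm", disable=["ner"])

# Section detection, starting from an empty rule set.
sectionizer = nlp.add_pipe("medspacy_sectionizer", config={"rules": None})
sectionizer.add([SectionRule(category="past_medical_history", literal="Past Medical History:")])

# Rule-based concept extraction.
target_matcher = nlp.add_pipe("medspacy_target_matcher")
target_matcher.add([TargetRule("covid test", "COVID_TEST", pattern=[{"LOWER": "covid"}, {"LOWER": "test"}])])

# UMLS normalization; requires the optional QuickUMLS dependency and an index.
nlp.add_pipe("medspacy_quickumls", config={"threshold": 0.75})

# Context analysis with a single custom negation rule.
context = nlp.add_pipe("medspacy_context", config={"rules": None})
context.add([ConTextRule("no history of", "NEGATED_EXISTENCE", direction="FORWARD")])

# Terminal document classifier; it must be trained before the pipeline is run.
nlp.add_pipe("textcat", name="covid_classifier", last=True)

Running this pipeline on clinical text propagates entity-level attributes, such as negation status and assigned CUIs, to each extracted entity.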
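One way entity-level attributes can feed a document-level decision is sketched below. This is a rule-based stand-in for the terminal classifier, written without spaCy so that it runs on its own; the Mention class, the classify_document function, and the output labels are hypothetical and are not part of medspaCy.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Mention:
    """Hypothetical flattened view of an extracted entity and its context flags."""
    text: str
    label: str
    is_negated: bool = False
    is_hypothetical: bool = False
    is_family: bool = False


def is_asserted(mention: Mention) -> bool:
    # A mention counts as asserted only if no context flag excludes it.
    return not (mention.is_negated or mention.is_hypothetical or mention.is_family)


def classify_document(mentions: Iterable[Mention], target_label: str) -> str:
    """Label a document from the context flags of its target-label mentions."""
    relevant = [m for m in mentions if m.label == target_label]
    if any(is_asserted(m) for m in relevant):
        return "POSITIVE"
    return "NOT_ASSERTED" if relevant else "NO_MENTION"


note_a = [Mention("COVID-19", "COVID", is_negated=True), Mention("COVID-19", "COVID")]
note_b = [Mention("COVID-19", "COVID", is_hypothetical=True)]
note_c = [Mention("cough", "SYMPTOM")]

for name, note in [("a", note_a), ("b", note_b), ("c", note_c)]:
    print(name, classify_document(note, target_label="COVID"))
```

A trained text classifier could replace this rule while consuming the same attributes as features.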

4. Extensibility and Customization

The design of medspaCy centers on extensibility through both rule customization and component augmentation. Users can define or modify context rules (ConTextRule objects passed to context.add) and add custom patterns to the concept matcher (TargetRule objects passed to target_matcher.add), tailoring pipelines to specific clinical sub-domains or annotation schemas. Custom pipeline components can be constructed using spaCy’s decorator paradigm:

from spacy.language import Language

@Language.component("my_clinical_filter")
def my_clinical_filter(doc):
    # Drop entities located in sections whose category is "DISCHARGE".
    # The category label is whatever the section rules define; ent._.section_category
    # is populated only once the sectionizer has run.
    doc.ents = [ent for ent in doc.ents if ent._.section_category != "DISCHARGE"]
    return doc

# Add the filter last so it runs after both section detection and entity extraction.
nlp.add_pipe("my_clinical_filter", last=True)

Because medspaCy’s API extensions are minimal and non-intrusive, arbitrary spaCy and third-party components (e.g., scispaCy’s AbbreviationDetector or transformer-based NER models) can be composed with medspaCy’s rule-based modules within the same execution pipeline.

5. Practical Applications and Empirical Evaluation

medspaCy’s efficacy has been demonstrated in large-scale operational contexts. Its components have been deployed to process over 63 million documents in the Veterans Affairs (VA) COVID-19 surveillance pipeline, attaining 82.5% precision and 94.2% recall in identifying positive COVID-19 test mentions. medspaCy also underpins chief complaint syndromic surveillance, operational since 2019, processing triage notes for 3 million patients. Although explicit runtime benchmarks are not reported, medspaCy inherits the Cython-optimized efficiency of spaCy and the sub-second matching speeds of QuickUMLS.
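To relate the two reported figures, the short calculation below combines them into an F1 score (their harmonic mean). The F1 value is derived here from the reported precision and recall and is not itself a figure from the source.

```python
# Reported values for positive COVID-19 test mention detection.
precision = 0.825
recall = 0.942

# F1 is the harmonic mean of precision and recall (derived, not reported in the source).
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")
```

The result is approximately 0.880, reflecting a system tuned to favor recall over precision, as is common in surveillance settings where missed cases are costlier than false alarms.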

These use cases relied on iterative rule development and on integrated visualization utilities for highlighting entities and contextual boundaries, which supported rapid prototyping and deployment of clinical NLP pipelines at scale. The deployments suggest that medspaCy’s design is practical for high-throughput, domain-specialized text mining in clinical informatics (Eyre et al., 2021).

6. Usage Example and Workflow Patterns

medspaCy provides a unified API for pipeline composition:

import medspacy
from medspacy.context import ConTextRule
from medspacy.ner import TargetRule
from medspacy.section_detection import SectionRule

# Names follow the medspaCy 1.x documentation; verify against your installed version.
# medspacy.load() already includes the tokenizer, PyRuSH, target matcher, and ConText;
# load_rules=False starts ConText without its default rules.
nlp = medspacy.load(load_rules=False)

sectionizer = nlp.add_pipe("medspacy_sectionizer", config={"rules": None})
sectionizer.add([SectionRule(category="past_medical_history", literal="Past Medical History:")])

context = nlp.get_pipe("medspacy_context")
context.add([
    ConTextRule("no", "NEGATED_EXISTENCE", direction="FORWARD"),
    ConTextRule("history of", "HISTORICAL", direction="FORWARD", max_scope=10),
])

matcher = nlp.get_pipe("medspacy_target_matcher")
matcher.add([TargetRule("chest pain", "PAIN", pattern=[{"LOWER": "chest"}, {"LOWER": "pain"}])])

text = "Past Medical History: no chest pain. He now reports chest pain."
doc = nlp(text)
for sec in doc._.sections:
    print(sec.category, sec.title_span, sec.body_span)
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.is_negated, ent._.is_historical)

This pattern demonstrates construction of a clinical NLP pipeline, section detection, specification of context rules, and post-processing over structured entity and section metadata.

7. Summary Table: Core Components

| Component | Functionality | Customization Points |
| --- | --- | --- |
| Tokenizer | Clinical punctuation and whitespace handling | Custom rules |
| Sentence Segmenter | PyRuSH clinical sentence detection | Rule files |
| Sectionizer | Rule-based section header matching | Section definition JSON |
| Concept Matcher | Pattern-based entity extraction | Patterns/rules |
| QuickUMLS | UMLS concept mapping, Sørensen–Dice similarity | Threshold, index |
| ConText | Negation, temporality, experiencer, certainty detection | Modifier rules |
| ML Components | NER, text classification (spaCy/third-party) | Model swap/config |

This structure illustrates the modular composability and surface-level extensibility characteristic of medspaCy (Eyre et al., 2021).

References (1)
