
Initial Investigation of LLM-Assisted Development of Rule-Based Clinical NLP System (2506.16628v1)

Published 19 Jun 2025 in cs.CL and cs.LG

Abstract: Despite advances in ML and LLMs, rule-based NLP systems remain active in clinical settings due to their interpretability and operational efficiency. However, their manual development and maintenance are labor-intensive, particularly in tasks with large linguistic variability. To overcome these limitations, we proposed a novel approach employing LLMs solely during the rule-based system development phase. We conducted initial experiments focusing on the first two steps of developing a rule-based NLP pipeline: finding relevant snippets in a clinical note, and extracting informative keywords from those snippets for the rule-based named entity recognition (NER) component. Our experiments demonstrated exceptional recall in identifying clinically relevant text snippets (Deepseek: 0.98, Qwen: 0.99) and perfect recall (1.0) in extracting key terms for NER. This study sheds light on a promising new direction for NLP development, enabling semi-automated or automated development of rule-based systems with significantly faster, more cost-effective, and more transparent execution compared with deep learning model-based solutions.

LLM-Assisted Rule-Based Development in Clinical NLP: An Empirical Evaluation

This paper addresses the integration of LLMs into the development lifecycle of rule-based clinical NLP systems, specifically targeting improvements in efficiency, scalability, and transparency while maintaining operational viability for healthcare settings. Despite the rise of machine learning and LLM-driven NLP, rule-based solutions persist in clinical applications owing to their interpretability and low operational cost. However, their continued evolution is hindered by prohibitively labor-intensive manual rule engineering, particularly in linguistically variable domains.

Study Design and Implementation

The authors propose employing LLMs exclusively during the rule-creation phase, harnessing these models to automate two essential early steps:

  1. Identification of relevant text snippets from clinical notes,
  2. Extraction of informative keywords from these snippets to support rule-based named entity recognition (NER).

Experiments utilize a HIPAA-compliant, GPU-accelerated environment and leverage quantized versions of Deepseek R1 distilled Qwen 32B and Qwen2.5 Coder 32B. The models are prompted using both chain-of-thought (CoT) and mixture of prompt experts (MoPE) strategies, with scenario-based prompt refinement guided by existing NSQIP guidelines and annotated datasets. Snippet-level and episode-level gold-standard annotations are employed for evaluation, with a strong emphasis on recall to maximize downstream rule completeness.
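
To make the first step concrete, below is a minimal sketch of LLM-driven snippet identification, assuming a locally hosted quantized model served through an Ollama endpoint. The endpoint URL, model tag, prompt wording, and JSON output contract are all assumptions for illustration; the paper's actual prompts and serving stack are not reproduced here.

```python
# Hypothetical sketch of step 1: ask a locally hosted model for relevant snippets.
# Endpoint, model tag, and prompt wording are assumptions, not the paper's artifacts.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local Ollama server
MODEL = "qwen2.5-coder:32b"                         # assumed quantized model tag

SNIPPET_PROMPT = """You are helping develop a rule-based clinical NLP system for
surgical site infection (SSI) surveillance. Think step by step, then return ONLY
a JSON list of verbatim sentences from the note below that are relevant to SSI.
Favor recall: when in doubt, include the sentence.

Note:
{note}
"""

def find_relevant_snippets(note: str) -> list[str]:
    """Return candidate snippets; downstream rule authoring prunes the noise."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL,
              "prompt": SNIPPET_PROMPT.format(note=note),
              "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    text = resp.json()["response"]
    # Reasoning models may wrap the answer in free text; grab the JSON array.
    start, end = text.find("["), text.rfind("]")
    return json.loads(text[start:end + 1]) if 0 <= start < end else []
```

The prompt deliberately instructs the model to copy sentences verbatim, which keeps LLM output aligned with the source text that downstream rules must match.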

Empirical Results

The results provide strong support for the two central hypotheses:

  • Snippet Identification: Both Deepseek and Qwen achieve exceptionally high recall (0.98 and 0.99, respectively), ensuring comprehensive candidate extraction. Precision remains very low (Deepseek: 0.10; Qwen: 0.08), although manual error analysis reveals that many “false positives” carry clinically useful information absent from the conventional annotations, highlighting an inherent mismatch between the human and automated frames of reference at the snippet level. The error categories suggest that LLM strictness, rather than an inherent system flaw, accounts for most disagreements.
  • Keyword Extraction: Both models reach perfect snippet coverage in keyword extraction, producing candidate term lists that, with prompt-constraint tuning, can be optimized to avoid over-generalization and redundancy. LLM-suggested keywords were more comprehensive than the legacy rule sets, enabling more generalizable and extensible rule design (a sketch of this extraction step follows this list).
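
As referenced above, here is a minimal sketch of the keyword-extraction step, under the same assumed local-serving setup as the earlier snippet sketch; the prompt wording and output contract are again assumptions rather than the paper's artifacts.

```python
# Hypothetical sketch of step 2: extract NER keyword candidates from a snippet.
import json
import requests

KEYWORD_PROMPT = """From the clinical snippet below, list the informative keywords
or short phrases that a rule-based NER component should match for surgical site
infection. Return ONLY a JSON list of strings copied verbatim from the snippet.

Snippet:
{snippet}
"""

def extract_keywords(snippet: str,
                     url: str = "http://localhost:11434/api/generate",  # assumed
                     model: str = "qwen2.5-coder:32b") -> list[str]:    # assumed
    resp = requests.post(url,
                         json={"model": model,
                               "prompt": KEYWORD_PROMPT.format(snippet=snippet),
                               "stream": False},
                         timeout=300)
    resp.raise_for_status()
    text = resp.json()["response"]
    start, end = text.find("["), text.rfind("]")
    return json.loads(text[start:end + 1]) if 0 <= start < end else []
```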

Quantitative Summary

Model      Precision   Recall   F1 Score
Deepseek   0.10        0.98     0.18
Qwen       0.08        0.99     0.15
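
The reported F1 scores follow directly from precision and recall via the standard harmonic mean; a two-line check reproduces the table's rounding:

```python
# Verify the table's F1 values from the reported precision and recall.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

print(round(f1(0.10, 0.98), 2))  # 0.18 (Deepseek)
print(round(f1(0.08, 0.99), 2))  # 0.15 (Qwen)
```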

Practical and Theoretical Implications

Practical Implications:

  • The deployment of LLMs in the development rather than inference phase circumvents common objections regarding computational cost and privacy, since the resulting rule-based system inherits the operational efficiency and transparency required for clinical environments.
  • LLM-assisted rule authoring accelerates pipeline prototyping and maintenance, supporting rapid expansion to new domains or adaptation to institutional needs with reduced manual annotation.
  • The method transfers latent clinical knowledge from foundation models into transparent, auditable rule sets, facilitating more robust downstream decision support; a toy example of such a rule follows this list.
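
For a sense of what “transparent, auditable” means in practice, here is a toy rule in a generic regex-based form. EasyCIE's actual rule syntax is not shown in this summary, so the structure below is purely illustrative:

```python
# Illustrative, hypothetical shape of an LLM-suggested, human-auditable NER rule.
import re

SSI_RULES = {
    "ssi_evidence": re.compile(
        r"\b(wound (infection|dehiscence)|purulent drainage|surgical site infection)\b",
        re.IGNORECASE),
}

def apply_rules(snippet: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs; every hit is traceable to one rule."""
    return [(label, m.group(0))
            for label, rx in SSI_RULES.items()
            for m in rx.finditer(snippet)]

print(apply_rules("Exam notable for purulent drainage at the incision."))
```

Because each match points back to a named, human-readable pattern, reviewers can audit exactly why the system flagged a snippet, which is the transparency property the paper emphasizes.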

Theoretical Insights:

  • The paper delineates a previously underexplored boundary between annotation, development, and operationalization. It exposes systematic differences between LLM-driven and traditional annotation philosophies, particularly in ambiguous or context-dependent cases.
  • The error taxonomy and qualitative analyses shed light on model limitations, especially regarding context aggregation and overgeneralization, pointing to richer prompt engineering and reinforcement learning opportunities.
  • Findings suggest that while LLMs can broadly cover the clinical semantic space, the interface between LLM output and downstream rule consumption requires explicit constraint mechanisms to avoid excessive recall-driven noise; a minimal sketch of such a filter follows this list.
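
As noted in the last point above, a thin constraint layer between LLM output and rule consumption can suppress recall-driven noise. The sketch below uses simple lexical filters; the stop list, length cap, and deduplication policy are invented for illustration and are not the paper's method.

```python
# Hypothetical constraint layer: filter LLM keyword candidates before they
# become rules. Thresholds and the stop list are assumptions.
OVERLY_GENERAL = {"patient", "history", "noted", "status", "exam"}  # assumed

def constrain_keywords(candidates: list[str],
                       existing_rule_terms: set[str],
                       max_words: int = 4) -> list[str]:
    """Drop duplicates, terms already covered by rules, and over-general terms."""
    kept: list[str] = []
    seen: set[str] = set()
    for raw in candidates:
        term = raw.strip().lower()
        if (term and term not in seen
                and term not in existing_rule_terms
                and term not in OVERLY_GENERAL
                and len(term.split()) <= max_words):
            kept.append(term)
            seen.add(term)
    return kept
```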

Limitations and Future Work

This paper is scoped to surgical site infection surveillance and a single annotated corpus; generalizability to other clinical domains or institutions is not established. Prompt engineering remains highly manual, and PSEUDO-matching in NER rules (a feature of the production EasyCIE system) is not systematically addressed here. There is also no systematic exploration of the effects of model size, quantization level, or sampling temperature.

Planned extensions include:

  • Reinforcement learning (RL) on rules guided by execution feedback to optimize for operational F1 or cost.
  • Application to downstream context detectors and temporal classifiers in the rule pipeline.
  • Systematic evaluation across varied clinical tasks and settings.

Implications for Future AI Development

The proposed development paradigm, using LLMs as “rule generators” rather than “runtime engines,” is poised to increase the scalability and maintainability of clinical NLP systems by closing the gap between expressive model-based learning and legacy requirements for traceability and conditional logic. If generalized, this approach could inform design patterns for other knowledge-intensive domains (e.g., regulatory compliance, finance, legal) where transparency and efficiency are paramount and where pure deep learning pipelines remain operationally impractical. Advanced methods for prompt harmonization, RL-driven feedback loops, and human-in-the-loop validation will likely become crucial for next-generation semi-automated clinical NLP.

In summary, this paper provides a detailed demonstration of a viable hybrid pipeline, balancing the high recall and flexible semantic reasoning of LLMs with the operational strengths of traditional rule-based clinical NLP. The findings mark an important step toward automating labor-intensive NLP workflows in healthcare, enabling more agile, transparent, and scalable knowledge extraction.

Authors
  1. Jianlin Shi
  2. Brian T. Bucher