MedRule-KG: Rule-Enforced LLMs
- MedRule-KG is a framework that integrates a compact, auditable biomedical knowledge graph to enforce mathematical and domain rules within LLM outputs.
- It employs real-time symbolic fact retrieval and constraint-aware decoding to adjust token probabilities and prevent rule violations.
- Empirical evaluations demonstrate 100% rule compliance and high performance in scientific reasoning tasks with low latency and scalable operation.
MedRule-KG is a knowledge-graph–driven framework for enforcing mathematically and biomedically valid outputs in LLMs without retraining or reliance on extensive tool augmentation. The system integrates a compact, auditable knowledge graph (KG), real-time symbolic fact retrieval and prompt structuring, constraint-aware generation control, and deterministic post-hoc verification. MedRule-KG is designed for settings such as scientific reasoning and early-stage drug discovery, where domain-specific rule violations can compromise reliability and safety (Su, 17 Nov 2025).
1. System Components and Functional Workflow
MedRule-KG is architected around four tightly coupled modules:
- Typed Compact Knowledge Graph: Nodes represent compounds (small-molecule drugs with categorical tags such as is_substrate, is_inhibitor, prolongs_qt), enzymes (e.g., CYP3A4, CYP2D6), and risk factors (e.g., QT-prolongation, hepatotoxicity). Edges represent relations such as metabolized_by (compound→enzyme), inhibits (compound→enzyme), and contraindicated_with (compound↔compound). KG triples are stored as (head, relation, tail, confidence ω∈[0,1]), where ω is curated from sources like FDA tables, DrugBank, or probabilistic literature mining. Penalty weights λ_i for each rule in decoding are scaled by ω_i to reflect fact reliability.
- Prompt Construction and Fact Infusion: On receiving a query (e.g., “Assess co-administration of Drug A and Drug B”), a prompt builder retrieves a set C(x) of the top-k (k = 5–10) relevant KG triples. Retrieval ranks triples by φ_r(h,t)+s_text(h,t), which combines a translational energy (φ_r) with a string-similarity score (s_text). Retrieved triples are serialized as a mini-table placed ahead of the question in the LLM prompt, e.g.:
    KG facts:
    1) (A, inhibits, CYP3A4; ω=0.95)
    2) (B, metabolized_by, CYP3A4; ω=0.90)
    3) (A, prolongs_qt, –; ω=0.80)
    ...
    Question: Are A and B safe to co-administer?
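- Constraint-Aware Decoding Controller: At each decoding step, the controller adjusts the next-token probability to penalize prefixes that partially violate a rule:
$$\tilde{p}(y_t \mid y_{<t}, x, C) \;\propto\; p_\theta(y_t \mid y_{<t}, x, C)\,\exp\!\Big(-\sum_i \lambda_i \big[1 - s_i(y_{\le t}, C)\big]\Big)$$
where s_i ∈ [0,1] is a differentiable rule satisfaction score and λ_i reflects both rule importance and KG confidence.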
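- Deterministic Verifier: Post-generation, the verifier normalizes entity mentions using synonym dictionaries and evaluates each binary rule predicate r_i(y, C) ∈ {0,1}. Any rule violation triggers a minimal edit that restores satisfaction:
$$y' = \arg\min_{\hat{y}} \; d(y, \hat{y}) \quad \text{s.t.} \quad r_i(\hat{y}, C) = 1 \;\; \forall i$$
where d measures the size of the edit between the original and corrected output. Edits are typically caveats, dosage revisions, or recommended substitutions. The verifier adds only a small latency per instance, with cost growing in the number of entities and rules.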
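The retrieval and prompt-building steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Triple` class, the `s_text` scorer built on `difflib.SequenceMatcher`, and the `phi_r` placeholder (which returns 0 instead of a learned translational score) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str
    omega: float  # curated confidence in [0, 1]


def s_text(query: str, triple: Triple) -> float:
    """Hypothetical string-similarity score: best match of head/tail against any query word."""
    words = [w.strip("?,.").lower() for w in query.split()]
    best = 0.0
    for entity in (triple.head, triple.tail):
        for word in words:
            best = max(best, SequenceMatcher(None, entity.lower(), word).ratio())
    return best


def phi_r(triple: Triple) -> float:
    """Placeholder for the translational score; assumed to come from learned embeddings."""
    return 0.0


def retrieve(query: str, kg: list, k: int = 5) -> list:
    """Rank triples by phi_r + s_text (higher is better here) and keep the top k."""
    ranked = sorted(kg, key=lambda t: phi_r(t) + s_text(query, t), reverse=True)
    return ranked[:k]


def build_prompt(query: str, facts: list) -> str:
    """Serialize retrieved triples as a numbered fact list ahead of the question."""
    lines = ["KG facts:"]
    for i, t in enumerate(facts, start=1):
        lines.append(f"{i}) ({t.head}, {t.relation}, {t.tail}; ω={t.omega:.2f})")
    lines.append(f"Question: {query}")
    return "\n".join(lines)


kg = [
    Triple("C", "metabolized_by", "CYP2D6", 0.70),
    Triple("A", "inhibits", "CYP3A4", 0.95),
    Triple("B", "metabolized_by", "CYP3A4", 0.90),
]
query = "Are A and B safe to co-administer?"
prompt = build_prompt(query, retrieve(query, kg, k=2))
```

Keeping the fact list short matters in this design, since the KG facts share the context window with the question itself.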
2. Constrained Generation by Energy-Based Modeling
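MedRule-KG formalizes rule-adherent text generation as MAP inference under an energy-based model:
$$y^* = \arg\max_y \; \Big[\log p_\theta(y \mid x, C) - \sum_i \lambda_i\, \mathbb{1}\big[r_i(y, C) = 0\big]\Big]$$
where r_i(y, C) ∈ {0,1} is the binary predicate for rule i, equal to 1 when the rule is satisfied. This is equivalent to minimizing the energy:
$$E(y) = -\log p_\theta(y \mid x, C) + \sum_i \lambda_i\, \mathbb{1}\big[r_i(y, C) = 0\big]$$
Because the discrete indicators make direct optimization intractable, a smooth surrogate is used:
$$\tilde{E}(y) = -\log p_\theta(y \mid x, C) + \sum_i \lambda_i \big[1 - s_i(y, C)\big]$$
Here s_i is a continuous score (s_i ∈ [0,1]). Gradient signals from the surrogate penalty reweight token probabilities during decoding, integrating symbolic prior penalties directly into the autoregressive prediction process (Su, 17 Nov 2025).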
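A toy sketch of how such a penalty could reweight a next-token distribution is given below. It assumes a multiplicative factor of the form exp(−Σ_i λ_i ω_i (1 − s_i)), with continuous scores s_i in [0, 1] and penalty weights λ_i scaled by KG confidence ω_i. The function name, the dictionary-based interface, and the example numbers are hypothetical and chosen only for illustration.

```python
import math


def reweight(base_probs, rule_scores, lambdas, omegas):
    """Return the penalized next-token distribution.

    base_probs:  dict token -> model probability for that token
    rule_scores: dict token -> list of s_i in [0, 1] for the prefix extended by the token
    lambdas:     per-rule penalty weights
    omegas:      per-rule KG confidences used to scale the penalty weights
    """
    unnormalized = {}
    for token, prob in base_probs.items():
        penalty = sum(
            lam * omega * (1.0 - score)
            for lam, omega, score in zip(lambdas, omegas, rule_scores[token])
        )
        unnormalized[token] = prob * math.exp(-penalty)
    total = sum(unnormalized.values())
    return {token: value / total for token, value in unnormalized.items()}


# Illustrative numbers only: "safe" nearly violates the single rule, "unsafe" satisfies it.
base = {"safe": 0.6, "unsafe": 0.4}
scores = {"safe": [0.1], "unsafe": [1.0]}
adjusted = reweight(base, scores, lambdas=[2.0], omegas=[0.95])
```

Because the penalty is soft, a violating token keeps nonzero probability. This matches the framework's pairing of soft control during generation with a hard verifier afterwards.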
3. Domain Rule Sets and Enforcement Mechanisms
MedRule-KG encodes three primary families of biomedical and mathematical constraints:
| Rule Family | Predicate Example | Scoring Function |
|---|---|---|
| Reaction Feasibility (R1) | Compounds A, B should not co-occur if A inhibits enzyme E and B is metabolized_by E | |
| Metabolic Compatibility (R2) | Enzyme-based partial conflicts, e.g., enzyme induction vs. inhibition | |
| Toxicity Safety (R3) | Shared risk factors such as QT-prolongation | |
Logical rule composition is differentiable, so composite rules also receive continuous scores. The deterministic verifier checks each rule using canonicalized entities against KG facts, flags violations, and applies minimal post-hoc textual corrections or, if necessary, triggers re-generation with stricter penalties.
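The deterministic check can be illustrated with a small Python sketch. It rests on assumptions that go beyond the text: facts are plain (head, relation, tail) tuples, R1 is violated when one compound inhibits an enzyme that metabolizes the other, R3 is violated when both compounds carry the prolongs_qt tag, and the "minimal edit" is reduced to overriding the verdict and appending a caveat. The synonym table and all function names are hypothetical.

```python
# Hypothetical synonym dictionary mapping surface mentions to canonical entity names.
SYNONYMS = {"drug a": "A", "compound a": "A", "drug b": "B", "compound b": "B"}


def normalize(mention: str) -> str:
    """Canonicalize an entity mention, falling back to the trimmed mention itself."""
    cleaned = mention.strip()
    return SYNONYMS.get(cleaned.lower(), cleaned)


def check_r1(a: str, b: str, facts) -> bool:
    """Return True when R1 holds: neither compound inhibits an enzyme metabolizing the other."""
    inhibits = {(h, t) for h, r, t in facts if r == "inhibits"}
    metabolized = {(h, t) for h, r, t in facts if r == "metabolized_by"}
    for x, y in ((a, b), (b, a)):
        for head, enzyme in inhibits:
            if head == x and (y, enzyme) in metabolized:
                return False
    return True


def check_r3(a: str, b: str, facts) -> bool:
    """Return True when R3 holds: the two compounds do not both prolong QT."""
    qt = {h for h, r, _ in facts if r == "prolongs_qt"}
    return not (a in qt and b in qt)


def verify(answer_safe: bool, mentions, facts):
    """Check the rules; if a 'safe' verdict violates one, override it and attach a caveat."""
    a, b = (normalize(m) for m in mentions)
    checks = (("R1", check_r1(a, b, facts)), ("R3", check_r3(a, b, facts)))
    violated = [name for name, ok in checks if not ok]
    if answer_safe and violated:
        return False, "Caveat: co-administration flagged by " + ", ".join(violated) + "."
    return answer_safe, ""


facts = [
    ("A", "inhibits", "CYP3A4"),
    ("B", "metabolized_by", "CYP3A4"),
    ("A", "prolongs_qt", "-"),
]
verdict, note = verify(True, ["Drug A", "drug b"], facts)
```

Each check is a lookup over a handful of facts, so this kind of verifier is deterministic and cheap to run after every generation.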
4. Empirical Evaluation and Benchmarking
The evaluation task set comprises 90 two-entity cases derived from FDA and DrugBank sources, categorized as:
- 20 Reaction Feasibility (R1-only) cases
- 20 Metabolic Compatibility (R2-only) cases
- 30 “Both” (R1+R2 simultaneously)
- 20 “None” (no encoded constraints)
Key metrics are Exact Match (EM), rule Violation Rate (VR), and the Safety–Accuracy Index (SAI). The statistical methods applied are Wilson-score confidence intervals, two-proportion z-tests, stratified Cochran–Mantel–Haenszel tests, and bootstrap resampling (10,000 replicates). Key results as reported:
| System | EM | VR |
|---|---|---|
| Chain-of-Thought (CoT) Baseline | 0.767 [0.678, 0.856] | 0.233 [0.144, 0.322] |
| CoT + KG (no verifier) | 0.900 [0.815, 0.959] | 0.133 [0.041, 0.222] |
| KG + Verifier (MedRule-KG) | 1.000 [1.000, 1.000] | 0.000 [0.000, 0.000] |
Zero violations are achieved in all rule categories when the full system is used. For “Both” constraint tasks, stratified EM increases from 0.60 in the baseline to 1.00. The improvements persist and the confidence intervals narrow as the task set grows, consistent with uncertainty shrinking at the rate O(1/√n) in the number of tasks n (Su, 17 Nov 2025).
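For readers unfamiliar with the Wilson score interval mentioned among the statistical methods, a minimal Python sketch follows. It implements the standard textbook formula and is not intended to reproduce the intervals in the table, which may have been computed with different settings. The count of 69 successes out of 90 is an inference from EM ≈ 0.767 over the 90 listed cases, not a reported figure.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Illustrative call: 69 correct out of 90 (an assumed count, see the note above).
low, high = wilson_interval(69, 90)
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and does not collapse to zero width when the observed proportion is exactly 0 or 1.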
5. Scalability, Latency, and System Practicalities
MedRule-KG maintains practical latency and scalability:
- Runtime: End-to-end latency on an A100 GPU is the sum of decoding, prompt retrieval (40 ms), and verification time per query, which supports interactive rates of 5–7 queries per second.
- Verifier Complexity: The cost of post-hoc verification depends only on the number of entities and rules per instance, both of which are small here, yielding negligible overhead.
- KG Characteristics: The KG is compact, supporting rapid retrieval. Efficacy saturates at 5–7 facts per prompt; additional facts cause attention dilution.
- Coverage Limitations: Rare enzymes or risk factors may be absent due to KG size. A soft-verifier variant trades strictness for speed, at the cost of reintroducing some rule violations.
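The throughput figure can be related to a per-query time budget with simple arithmetic. The sketch below assumes strictly sequential processing with one query in flight at a time, which is an assumption made here rather than something stated in the source. The function names are hypothetical, and the 40 ms retrieval cost is the one quoted above.

```python
def per_query_budget_ms(queries_per_second: float) -> float:
    """Time available per query, in milliseconds, under sequential processing."""
    if queries_per_second <= 0:
        raise ValueError("queries_per_second must be positive")
    return 1000.0 / queries_per_second


def remaining_budget_ms(queries_per_second: float, retrieval_ms: float = 40.0) -> float:
    """Budget left for decoding plus verification after prompt retrieval."""
    return per_query_budget_ms(queries_per_second) - retrieval_ms


slow_budget = per_query_budget_ms(5)   # budget at 5 queries per second
fast_budget = per_query_budget_ms(7)   # budget at 7 queries per second
```

Under that assumption, 5–7 queries per second corresponds to roughly 143–200 ms per query, so decoding and verification together would have to fit in what remains after the 40 ms retrieval step.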
6. Limitations, Extensions, and Domain Adaptability
Coverage is inherently constrained by KG completeness—extremely rare entities may be omitted, affecting recall for edge cases. The “soft-verifier” variant provides faster operation but reintroduces nontrivial violation rates. Possible avenues for extension include augmenting the KG with hierarchical ontologies, expanding rule families to address more complex scientific sub-domains, or leveraging minimal program synthesis for advanced algebraic reasoning.
MedRule-KG’s decoupled, interpretable design—using a lightweight, symbolic KG scaffold, soft control during generation, and hard post-hoc verification—enables reliable, high-accuracy scientific reasoning without model retraining or elaborate toolchains. The framework is suitable for any domain mandating hard rule compliance and scales to interactive, real-time scientific and engineering assistant scenarios (Su, 17 Nov 2025).