
PRISMA-DFLLM: Enhancing SLR Reporting

Updated 18 February 2026
  • PRISMA-DFLLM is a framework that integrates document-formatted, domain-finetuned LLMs into systematic review pipelines to automate adherence checking and reporting.
  • It employs structured data ingestion, precise prompt engineering, and reproducible evaluation metrics to enhance accuracy and transparency in SLR processes.
  • The framework demonstrates improved performance (78.7–79.7% accuracy) with structured input formats while providing protocols for fine-tuning and human oversight.

The PRISMA-DFLLM framework integrates document-formatted and domain-finetuned LLMs to automate and enhance adherence checking, reporting, and synthesis in systematic literature reviews (SLRs) following the PRISMA 2020 guidelines. This paradigm extends both the technical pipeline and the reporting standards for systematic reviews, embedding machine learning components within the established PRISMA workflow and providing protocols for LLM-centric automation, evaluation, and transparency. The framework encompasses benchmark construction, prompt formats, fine-tuning methodologies, model evaluation, and extensions to PRISMA reporting recommendations as substantiated in recent literature (Kataoka et al., 20 Nov 2025, Susnjak, 2023, Susnjak et al., 2024).

1. Framework Architecture and Workflow

PRISMA-DFLLM defines structured, reproducible pipelines for automating critical SLR components, integrating LLMs at multiple stages:

  1. Data Ingestion and Licensing: Full-text PDFs and appendices from candidate systematic reviews are aggregated. Copyright compliance is enforced by restricting the benchmark to Creative Commons–licensed articles, identified via automated license pattern matching.
  2. Text Extraction & Structuring: Primary sources are parsed using document-structure APIs (e.g., Adobe PDF Services), converting texts to structured JSON and explicitly retaining section demarcations (abstract, main body).
  3. Checklist Preparation: The PRISMA 2020 checklist is encoded in machine-readable formats—Markdown, JSON, XML, or plain text—enabling LLMs to parse and compare reporting elements to guidelines.
  4. Prompt Construction & LLM Invocation: Prompts are assembled with the manuscript excerpt, checklist, and explicit instructions (“For each PRISMA item, answer Yes/No and give a short rationale.”). LLM APIs are called with locked randomness controls (temperature = 0.0), ensuring deterministic outputs and reproducible reasoning-effort configurations.
  5. Structured Output Parsing: LLM output is post-processed into JSON keyed by checklist item, recording binary decisions and rationales.
  6. Performance Evaluation & Human-in-the-Loop: LLM adherence decisions are benchmarked against human-labeled ground truth across accuracy, sensitivity, and specificity. Items with low-confidence or recurring disagreement are escalated for human expert review (Kataoka et al., 20 Nov 2025).
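Step 5 above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the response schema (`answer`/`rationale` keys) and function name are assumptions.

```python
import json

# Sketch of step 5 (structured output parsing): the raw model response is
# post-processed into JSON keyed by checklist item, recording the binary
# decision and its rationale. The response schema here is an assumption,
# not the authors' exact format.

def parse_llm_output(raw: str) -> dict:
    """Return {item_id: {'decision': bool, 'rationale': str}}."""
    parsed = json.loads(raw)
    return {
        item: {"decision": entry["answer"] == "Yes", "rationale": entry["rationale"]}
        for item, entry in parsed.items()
    }

# Example: a mocked deterministic model response for two checklist items.
raw = json.dumps({
    "1":   {"answer": "Yes", "rationale": "Title identifies the report as a systematic review."},
    "24b": {"answer": "No",  "rationale": "No statement on protocol amendments found."},
})
decisions = parse_llm_output(raw)
```

Keying the parsed output by checklist item makes the later item-wise benchmarking against human labels a straightforward dictionary comparison.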

In domain-specific contexts, PRISMA-DFLLM pipelines also include dataset creation from included SLR papers, tokenization and optional question–answer (Q–A) pair generation, followed by LLM fine-tuning using parameter-efficient methods such as LoRA and QLoRA. Incremental “living review” updates are implemented by periodic re-ingestion of new literature and further model fine-tuning (Susnjak, 2023, Susnjak et al., 2024).

2. Input Formats, Prompt Engineering, and Fine-Tuning

Benchmarking results demonstrate that structured input formats substantially improve LLM assessment performance relative to unstructured, manuscript-only text. All formats with structured checklist provision—Markdown, JSON, XML—yield near-identical accuracy (78.7–79.7%) compared to 45.2% for manuscript-only input (p < 0.0001) (Kataoka et al., 20 Nov 2025). Markdown is recommended for auditability, while JSON and XML facilitate downstream computation.

Exemplary Input Format Specifications

| Format | Structure | Key Features |
|---|---|---|
| Markdown | `## Item 1: Title` headings; SR text in code block | Human-readable, easy auditing |
| JSON | Manuscript and checklist as objects | Native for programmatic parsing, API ingestion |
| XML | `<manuscript><abstract>…</abstract>` hierarchy | Hierarchical representation, robust validation |
| Plain text | Heading-delimited sections | Simplicity, no structural guarantees |

Prompts must place the manuscript text before the checklist, with randomization controls disabled (temperature = 0.0). For certain models, reasoning controls should be specified explicitly (Claude: “thinking” token budget; GPT-5 family: high-reasoning setting). LLM outputs are constrained to machine-parseable JSON.
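The manuscript-before-checklist ordering can be sketched as a small prompt assembler, assuming the Markdown input format from the table above. The function name and checklist excerpt are illustrative, not the benchmark's exact template.

```python
# Hypothetical assembler for the Markdown input format: the manuscript
# excerpt comes first (in a fenced code block, per the format table),
# followed by the checklist items and the instruction line quoted in the
# pipeline description. Names and items here are illustrative.

def assemble_markdown_prompt(manuscript: str, checklist: list[tuple[str, str]]) -> str:
    lines = ["# Manuscript", "```", manuscript, "```", "# PRISMA 2020 Checklist"]
    for item_id, name in checklist:
        lines.append(f"## Item {item_id}: {name}")
    lines.append("For each PRISMA item, answer Yes/No and give a short rationale.")
    return "\n".join(lines)

prompt = assemble_markdown_prompt(
    "We searched MEDLINE and Embase up to January 2024...",
    [("1", "Title"), ("2", "Abstract")],
)
```

The fenced block around the manuscript keeps review text visually separate from checklist markup, which is what makes the Markdown variant easy to audit.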

For domain-adapted PRISMA-DFLLM applications, LLMs are fine-tuned using instruction–completion pairs derived from the SLR corpus with techniques including LoRA, QLoRA, and NEFTune, optionally leveraging paraphrase augmentation for robust generalization (Susnjak et al., 2024). The fine-tuning objective is token-level cross-entropy minimization:

\mathcal{L}_{CE} = -\frac{1}{|D|}\sum_{i=1}^{|D|}\sum_{t=1}^{T_i} \log p_\theta\left(y_{i,t} \mid x_i,\, y_{i,<t}\right)

Adapters (e.g., LoRA) are integrated in Transformer layers, with only adapter matrices updated during fine-tuning (Susnjak, 2023).
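The objective above can be illustrated numerically. In practice the per-token probabilities come from the adapter-augmented model during training; here they are fabricated values, and the sketch only shows how the double sum and normalization combine.

```python
import math

# Toy computation of the token-level cross-entropy objective:
# L_CE = -(1/|D|) * sum_i sum_t log p_theta(y_{i,t} | x_i, y_{i,<t}).
# `probs[i][t]` stands for the model's probability of the correct token t
# in example i; the values below are made up for illustration.

def cross_entropy(probs: list[list[float]]) -> float:
    total = sum(-math.log(p) for seq in probs for p in seq)
    return total / len(probs)  # average over |D| examples

dataset_probs = [
    [0.9, 0.8, 0.95],  # example 1: three target tokens
    [0.7, 0.6],        # example 2: two target tokens
]
loss = cross_entropy(dataset_probs)
```

Note that real fine-tuning minimizes this loss with respect to only the adapter parameters (LoRA/QLoRA), while the base model weights stay frozen.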

3. Evaluation Metrics and Model Performance

PRISMA-DFLLM’s core metrics correspond to binary classification statistics evaluated item-wise against human annotators:

  • Accuracy: $\frac{TP+TN}{TP+TN+FP+FN}$
  • Sensitivity (Recall): $\frac{TP}{TP+FN}$
  • Specificity: $\frac{TN}{TN+FP}$
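These item-wise statistics reduce to confusion-matrix counts. A minimal sketch with illustrative counts (in the benchmark, counts come from comparing LLM decisions against human-labeled ground truth per PRISMA item):

```python
# Confusion-matrix metrics as defined above; counts are illustrative.

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp)

# Example: 88 TP, 40 TN, 20 FP, 12 FN across one item's decisions.
acc = accuracy(88, 40, 20, 12)   # 0.8
sens = sensitivity(88, 12)       # 0.88
spec = specificity(40, 20)       # ~0.667
```

The example mirrors the benchmark's reported pattern: high sensitivity with markedly lower specificity.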

In addition, fine-tuned domain-specific LLMs are evaluated using information retrieval (precision, recall, F₁), summarization (ROUGE-N), and human Likert ratings for coherence, relevance, and factuality. For Q–A–based assessments, FEVER (SUPPORTED/REFUTED/NEI) and Consistency Grading Scale (CGS) are used, with statistical analysis including inter-rater correlation (e.g., FEVER human vs. GPT-4 $r \approx 0.49$–$0.60$; CGS $r \approx 0.65$–$0.74$) (Susnjak et al., 2024).

Model performance across 10 state-of-the-art LLMs indicates structured input formats yield an average accuracy of ≈79.2%, sensitivity 86.4–88.0%, and specificity 66.8–68.2%. Manuscript-only input accuracy is 45.2%. Item-level error analysis on full datasets (e.g., n=120, Qwen3-Max) reveals sensitivity up to 95.1% but specificity remains below 50% in high-sensitivity configurations (Kataoka et al., 20 Nov 2025). False negatives are concentrated on items such as amendments and data availability, while several methodological items exhibit high false positive rates.

4. Checklist Extensions and Reporting Requirements

PRISMA-DFLLM extends the PRISMA 2020 checklist with AI-centric reporting requirements. Items 16–31 document fine-tuning dataset construction (preprocessing, format, augmentation, curation), LLM technical details (model, fine-tuning regime, settings), validation (benchmarking, error taxonomy, alignment, metrics, sample outputs), reproducibility (code/data/model sharing), and legal/ethical aspects (human oversight, copyright, privacy, compliance) (Susnjak, 2023).

Additional reporting recommendations include methodologic transparency (e.g., Q–A generation pipeline, prompt templates), results disclosure (FEVER, CGS, inter-rater agreement), and supplementing with full hyperparameter tables and sample AI-generated outputs with provenance tokens or explicit retrieval citations. Model evaluations and limitations should be reported, with particular attention to error profiles and the necessity of human review for specific PRISMA items (Susnjak et al., 2024).

5. Mitigating Hallucinations and Ensuring Transparency

To address LLM hallucinations and enforce answer traceability, PRISMA-DFLLM workflows employ token-based provenance markers—appending unique corpus and paper IDs in both input and output—and integrate retrieval-augmented generation (RAG) modules. During inference, answers are grounded by explicit context retrieval and citation (e.g., “Source: aljohani2019integrated”), using vector-indexed document databases (e.g., Weaviate with OpenAI text-embedding-3-large), and post-processing to verify source attribution. Mathematical formulation of answer generation combines retriever and generator likelihoods as follows:

\arg\max_y\; p_\phi\left(y \mid q,\, \{d_i\}_{i=1}^k\right) = \arg\max_y\; \sum_{i=1}^k p_\psi(d_i \mid q)\, \log p_\theta(y \mid q, d_i)

This explicit grounding enhances factual accuracy and supports human audit trails (Susnjak et al., 2024).
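The retrieval-weighted scoring in the equation above can be illustrated with toy numbers. All probabilities below are fabricated; in the actual workflow, $p_\psi$ comes from the vector index and $p_\theta$ from the generator.

```python
import math

# Toy illustration of retrieval-weighted answer scoring: candidate answers y
# are ranked by sum_i p_psi(d_i | q) * log p_theta(y | q, d_i) over the
# k retrieved documents. All probabilities here are fabricated.

def rag_score(retriever_probs: list[float], generator_probs: list[float]) -> float:
    """retriever_probs[i] = p_psi(d_i | q); generator_probs[i] = p_theta(y | q, d_i)."""
    return sum(r * math.log(g) for r, g in zip(retriever_probs, generator_probs))

retriever = [0.6, 0.3, 0.1]  # weights over k=3 retrieved documents
candidates = {
    "answer_a": [0.8, 0.7, 0.2],  # well supported by the high-weight documents
    "answer_b": [0.3, 0.4, 0.9],  # supported mainly by a low-weight document
}
best = max(candidates, key=lambda y: rag_score(retriever, candidates[y]))
```

Because the generator likelihoods are weighted by retriever confidence, an answer supported only by a weakly retrieved document cannot outrank one grounded in the top-ranked sources.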

6. Benchmark Construction and Technical Requirements

Benchmark datasets are constructed from strictly CC-licensed SLRs, with domain coverage for emergency medicine and rehabilitation. The pipeline for dataset creation encompasses PDF extraction, license filtering, human reference annotation, and phase-specific sampling for parameter optimization, format/model comparison, and validation. Data preparation protocols stipulate dual independent annotation with adjudication (Kataoka et al., 20 Nov 2025).
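The automated license filtering mentioned above can be sketched as a simple pattern match over license statements. The regex and example statements are illustrative assumptions, not the authors' actual matching rules.

```python
import re

# Hypothetical Creative Commons license filter: the benchmark keeps only
# CC-licensed articles, identified here by a pattern over license statements
# extracted from PDFs. Regex and statements are illustrative only.

CC_PATTERN = re.compile(
    r"creative\s+commons|CC[\s-]?BY(?:[\s-]?(?:NC|SA|ND))*",
    re.IGNORECASE,
)

def is_cc_licensed(license_statement: str) -> bool:
    return bool(CC_PATTERN.search(license_statement))

statements = [
    "This article is distributed under the terms of the Creative Commons Attribution 4.0 License.",
    "Licensed under CC BY-NC 4.0.",
    "All rights reserved. Reproduction prohibited without permission.",
]
kept = [s for s in statements if is_cc_licensed(s)]  # first two survive
```

A production filter would also need to distinguish CC variants (e.g., ND clauses restricting derivatives) before including an article in a redistributable benchmark.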

Technical infrastructure includes multi-GPU systems (≥24 GB VRAM), 16-core CPUs, 64 GB RAM, and ≥1 TB storage. The software stack comprises Python 3.8+, PyTorch 1.12+, CUDA 11+, Hugging Face Transformers with PEFT support, tokenizers, PDF and table extraction tools, and build tools for report generation (LaTeX/Markdown) (Susnjak, 2023).

7. Limitations, Best Practices, and Roadmap

Limitations include: restriction to certain domains with untested generalizability, occasional inconsistencies in human labeling, sub-threshold specificity for editorial automation, and underexplored ensemble and fine-tuning strategies (Kataoka et al., 20 Nov 2025, Susnjak, 2023). Copyright constraints often prevent public full-text dissemination.

Best practices involve use of structured input formats (preferably Markdown), selection of high-sensitivity models for screening (GPT-4o, Qwen3-Max) or balanced models for evaluation (Grok-4, GPT-5), and always routing false-negative–prone PRISMA items to expert adjudication. Cost-sensitive use cases can leverage open-weight models (e.g., Qwen3-235B) (Kataoka et al., 20 Nov 2025).

The outlined research roadmap prioritizes automating data extraction (via LLM-based summarizers and hybrid Q–A strategies), direct comparison of PEFT approaches, ensemble methods, interpretability (e.g., attention-based rationales, uncertainty quantification), and cooperative model/data sharing frameworks. Further directions include legal frameworks for non-open content, active learning for human-in-the-loop safety steering, and the development of GUIs for non-programmers (Susnjak, 2023, Susnjak et al., 2024).

