LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources

Published 8 Apr 2026 in cs.CL, cs.AI, cs.IR, and cs.LG | (2604.06571v1)

Abstract: Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Variations in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that transforms multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling. The proposed system integrates (i) multi-engine PDF text extraction with Optical Character Recognition (OCR) fallback, (ii) rule-based source identification with source-specific parsers, (iii) schema-first harmonization and validation, and (iv) an optional LLM-assisted extraction pathway incorporating validator-guided repair and shared geocoding services. We present the system architecture, key implementation decisions, and output design, and evaluate performance using both gold-aligned extraction metrics and corpus-level operational indicators. On a manually aligned subset of 75 cases, the LLM-assisted pathway achieved substantially higher extraction quality than the deterministic comparator (F1 = 0.8664 vs. 0.2578), while across 517 parsed records per pathway it also improved aggregate key-field completeness (96.97\% vs. 93.23\%). The deterministic pathway remained much faster (mean runtime 0.03 s/record vs. 3.95 s/record for the LLM pathway). In the evaluated run, all LLM outputs passed initial schema validation, so validator-guided repair functioned as a built-in safeguard rather than a contributor to the observed gains. These results support controlled use of probabilistic AI within a schema-first, auditable pipeline for high-stakes investigative settings.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper introduces a dual-path pipeline that combines schema-guided LLM and rule-based extraction to improve missing-person intelligence consolidation.
It employs a multi-engine text extraction with OCR fallback and geocoding to harmonize data from varied structured and narrative sources, achieving an F1 score of 0.8664 in LLM extraction.
The approach ensures auditability through central schema validation and provenance tracking, setting a foundation for enhanced investigative analytics and SAR modeling.

Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources

Introduction and Motivation

Missing-person and child-safety investigations depend on rapid, evidence-driven consolidation of information from diverse sources, including structured forms, poster bulletins, and unstructured OSINT narratives. Variability in document layout, terminology, and data quality introduces significant barriers to operational triage and downstream spatial analytics. The work "LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources" (2604.06571) introduces the Guardian Parser Pack, a dual-path document-to-schema pipeline that systematically unifies heterogeneous case documents via auditable, schema-first extraction, harmonization, and validation.

System Architecture

The Guardian Parser Pack architecture is engineered for robust heterogeneity handling and auditability, supporting both deterministic (rule-based) and probabilistic (LLM-driven) extraction pathways within a shared processing backbone. The pipeline integrates multi-engine PDF text extraction (with OCR fallback), source detection, source-specific parser routing, harmonization, geocoding, schema validation, warning logging, LLM-based repair, and provenance preservation.

Figure 1: Dual-path system architecture showing parallel extraction pipelines and centralized harmonization, geocoding, and validation services.

Deterministic extraction leverages source-specific routings—form header detection, regex-based attribute mapping—for high-throughput document classes. The LLM extractor is schema-guided, narrowly tasked with JSON generation under prescribed constraints, invoked for narrative-heavy or irregular layouts. Both pathways utilize the identical downstream harmonization layer, geocoding service (converting ambiguous location strings into standardized coordinates via OpenStreetMap and caching), and schema validation for compliance and auditability.

Figure 2: Core parsing and reasoning pipeline: extraction, harmonization, validation, and repair loop.

Implementation and Data Flow

The pipeline ingests PDFs originating from structured sources (NamUs, NCMEC, state bulletins) and narrative profiles (Charley Project, FBI poster scans). Text extraction employs a multi-engine cascade, falling back to OCR to maximize coverage for scan-heavy archives. Source detection is marker-driven, enabling adaptive parser selection based on repository conventions. Deterministic parsers rely on explicit label mappings, while the LLM-assisted pathway is activated for loosely structured narratives, constrained by prompt-engineered JSON schemas and context truncation.

Output artifacts are provided in both schema-aligned JSONL and flattened CSV, supporting programmatic ingestion and rapid human review. Each record is partitioned into demographic, spatial, temporal, narrative, outcome, and provenance blocks. Spatial grounding leverages geocoding, normalizing location mention into latitude/longitude required for downstream analytics.

Illustrative Example

A NamUs PDF form (Figures 3 and 4) demonstrates the pipeline’s operation. The document is subjected to text extraction, source detection (NamUs-specific headers), and routed to the corresponding parser. Fields are mapped—demographic descriptors, dates, spatial context, narrative circumstances—and harmonized.

Figure 3: Page 1 of a NamUs missing person document used as input to the pipeline.

Figure 4: Page 2 of the NamUs document, containing further case details.

Rule-based parsing yields structured records with explicit labels. LLM extraction recovers additional narrative-bound attributes, harmonizes field nomenclature, and ensures schema compliance. Spatial blocks are geocoded when explicit coordinates are absent. Validation checks guarantee record integrity; validator-guided repair is available for schema violations.

Figure 5: LLM parser path output as JSON: schema-aligned and including harmonized narrative and spatial data.

Figure 6: Rule-based parser path output for the same document, limited to explicit labels and less coverage in narrative fields.

Evaluation and Results

Rigorous quantitative evaluation was conducted using both gold-aligned extraction metrics and corpus-level operational indicators:

Extraction Quality: On a hand-aligned subset of 75 cases, the LLM pathway outperformed deterministic parsing across all metrics—precision, recall, and F1 (all 0.8664 for LLM vs. 0.2593–0.2641 for deterministic). Structured field accuracy was 0.8810 for LLM, compared to 0.3090 for deterministic.
Field Completeness: On 517 parsed records per pathway, the LLM-assisted path achieved higher key-field completeness (96.97%) than deterministic (93.23%). Geocoding succeeded for both pathways (100%), with plausible rates at 95.36% (LLM) vs. 94.78% (deterministic).
Validation and Repair: All LLM outputs passed schema validation outright; no repair was triggered, underscoring the effectiveness of schema-first constraining.
Performance: Deterministic parsing exhibited significantly lower latency (0.03 s/record) compared to LLM (3.95 s/record).

Design, Trade-offs, and Alignment with Research Questions

The Guardian Parser Pack addresses four core research questions:

Extraction and Normalization (RQ1): Dual-path architecture, robust text acquisition, schema-driven harmonization, and coverage for both explicit and narrative forms.
Validation and Provenance (RQ2): Central schema validation mitigates silent extraction errors; provenance preserved through source labeling and transformation tracking.
Robustness and Auditability (RQ3): Layered fallback, conservative LLM usage, shared downstream services, and explicit warning reporting enable graceful degradation and consistent evaluation.
Evaluation Trade-offs (RQ4): LLM pathway delivers high completeness and recall in narrative contexts, at the cost of latency; deterministic parsing is optimal for stable layout, high-throughput ingestion.

Implications and Future Directions

The Guardian Parser Pack fundamentally improves operational readiness for missing-person investigations by providing a reliable, auditable, and schema-consistent extraction pipeline suitable for downstream geospatial modeling and analytics. Practical implications include increased analytic fidelity in hotspot analysis, mobility forecasting, and search planning workflows.

The conservative integration of LLMs, constrained by schema validation and candidate status, aligns with ethical recommendations to minimize AI-induced harm in sensitive domains. The approach also reinforces the need for role-restricted LLM deployment, avoiding unbounded reasoning or judge behaviors in high-stakes settings.

Ongoing research should focus on enhanced uncertainty quantification, controlled weak supervision for labeled dataset construction, and tighter coupling with SAR mobility models propagating uncertainty through search forecast surfaces. Retrieval-augmented prompting and robustness to document drift remain core priorities for future system revisions.

Conclusion

The Guardian Parser Pack constitutes an auditable, schema-guided pipeline for unifying heterogeneous missing-person intelligence, achieving robust extraction quality across structured and narrative documents while preserving provenance and operational traceability. Dual-path extraction with centralized harmonization and validation, coupled with conservative LLM integration and validator-guided repair, enables reliable downstream analytics in investigative contexts. Performance results demonstrate significant gains in extraction completeness and recall for the LLM pathway, with transparent trade-offs in throughput and latency. The architecture provides a principled foundation for future work in uncertainty quantification, weak supervision, and integration with geospatial SAR modeling.

Markdown Report Issue