Package Hallucination Rate (PHR)
- PHR is a quantitative metric that measures the ratio of hallucinated package references to total recommendations across domains like code generation and summarization.
- PHR evaluation involves systematically extracting package mentions from AI outputs and verifying them against authoritative registries in ecosystems such as Python, Go, and JavaScript.
- Empirical studies show that PHR is influenced by model size, quantization, and prompt design, with higher rates posing significant risks to software supply chain security.
Package Hallucination Rate (PHR) is a quantitative metric for characterizing the tendency of generative AI systems, including LLMs, to recommend, cite, or reference non-existent or unsupported “packages.” Across code generation, shell command synthesis, scientific summarization, and in-context learning, PHR has become the de facto standard for measuring the frequency of such fact-conflicting errors, especially as they pertain to supply chain security, code reliability, and trustworthy automated knowledge synthesis. Although originally developed in the context of LLM-generated package dependencies, the metric has been adapted for rigorous evaluation in domains including code recommendation, summarization, and Bayesian in-context reasoning.
1. Formal Definition and Mathematical Formulation
The canonical definition of Package Hallucination Rate is the ratio of hallucinated (non-existent or unsupported) package references to the total number of package recommendations or claims. Its precise formulation and operationalization depend on the domain and granularity:
- Code Generation Context: For code samples indexed by $i$, let $n_i$ denote the number of packages recommended and $h_i$ the number that are hallucinations. Then
$$\mathrm{PHR} = \frac{\sum_i h_i}{\sum_i n_i}.$$
- Shell Command/Go Ecosystem: Given $G$ as the multiset of generated package references and $H \subseteq G$ the subset failing existence checks,
$$\mathrm{PHR} = \frac{|H|}{|G|}.$$
- Language-Agnostic, Multi-Model Context: For a language $\ell$ with coding prompts $p \in P_\ell$, each repeated $R$ times, and $K_\ell$ the “known-good” package set,
$$\mathrm{PHR}_\ell = \frac{1}{|P_\ell|\,R} \sum_{p \in P_\ell} \sum_{r=1}^{R} h(p, r),$$
where $h(p, r) = 1$ iff any import generated for $(p, r)$ is $\notin K_\ell$, and $0$ otherwise.
- Summarization/Knowledge Synthesis: For abstract “packages” (e.g., summaries) indexed by $j$, each with $c_j$ claims and $y_{jk} \in \{0, 1\}$ marking a hallucinated claim,
$$\mathrm{PHR}_j = \frac{1}{c_j} \sum_{k=1}^{c_j} y_{jk}.$$
- Bayesian In-Context Learning: Letting $y$ denote a generated prediction, $\theta$ the latent mechanism, and $\epsilon_\alpha(\theta)$ the $\alpha$-quantile threshold,
$$\mathrm{PHR} = \mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}\!\left[\Pr\big(\log p(y \mid \theta) < \epsilon_\alpha(\theta)\big)\right].$$
In all cases, the quantity is typically reported as a percentage or mean for comparability.
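As a concrete illustration of the code-generation formulation above, the following minimal Python sketch computes both the pooled (micro-averaged) PHR, $\sum_i h_i / \sum_i n_i$, and the per-sample mean; the data structure and field names are illustrative assumptions rather than anything taken from the cited evaluation harnesses.

```python
from dataclasses import dataclass

@dataclass
class SampleResult:
    """Package-extraction result for one generated code sample."""
    recommended: int   # n_i: packages recommended in sample i
    hallucinated: int  # h_i: recommended packages absent from the registry

def micro_phr(samples: list[SampleResult]) -> float:
    """PHR = sum_i h_i / sum_i n_i, pooled over all samples."""
    total = sum(s.recommended for s in samples)
    return sum(s.hallucinated for s in samples) / total if total else 0.0

def macro_phr(samples: list[SampleResult]) -> float:
    """Mean of per-sample rates h_i / n_i, skipping samples with no packages."""
    rates = [s.hallucinated / s.recommended for s in samples if s.recommended]
    return sum(rates) / len(rates) if rates else 0.0

if __name__ == "__main__":
    results = [SampleResult(4, 1), SampleResult(3, 0), SampleResult(5, 2)]
    print(f"micro PHR: {micro_phr(results):.3f}")  # 3 / 12 = 0.250
    print(f"macro PHR: {macro_phr(results):.3f}")  # mean(0.25, 0.0, 0.4) ≈ 0.217
```

The two aggregations can diverge when samples recommend very different numbers of packages, which is why reported PHR values should state which convention they use.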
2. Domain-Specific PHR Measurement Protocols
PHR measurement requires systematic extraction and verification of candidate “package” references. Specific protocols vary:
- Code-Generating LLMs: Packages are extracted from code snippets via package-installation statements (e.g., pip install, npm install) and module import patterns. Package names are cross-referenced against authoritative registries (PyPI, npm) as of the model’s training date; hallucinations are names not present in these registries (Spracklen et al., 12 Jun 2024). A minimal extraction-and-verification sketch follows at the end of this section.
- Shell Command/Go Ecosystem: Shell commands such as go get … are parsed for URL-based Go module paths. Existence is checked by resolving HTTP queries or invoking package managers; unresolved paths are counted as hallucinations (Haque et al., 9 Dec 2025).
- Security-Focused PHR across Languages: Package names are extracted from code via language-specific regular expressions and compared against historical per-language registry indexes at a fixed cutoff (PyPI, npm, crates.io); any package not found is marked as a hallucination. Both natural and adversarial (“induced”) hallucinations can be tested (Krishna et al., 31 Jan 2025).
- Scientific Summarization: Summaries are subdivided at the claim level (typically sentences with citations). Each claim is checked via an automated or model-based “Factored Verification” procedure to determine whether it is supported by the cited source material. Unsupported claims increment the hallucination count per package (George et al., 2023).
- ICL/Generative Modeling: In Bayesian settings, sampled responses with log-likelihood below a quantile threshold, conditioned on a latent mechanism, are classed as hallucinations. Monte Carlo estimators use repeated sampling to approximate the posterior-averaged PHR (Jesson et al., 11 Jun 2024).
Each protocol carries domain-specific caveats: registries are assumed to be authoritative and complete, extraction heuristics can produce false positives or negatives, and prompt-context mismatch can skew rates.
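To make the Python-ecosystem protocol concrete, the sketch below extracts candidate package names with simple regular expressions and checks them against PyPI. The regexes and the use of PyPI’s public JSON endpoint are illustrative assumptions; the cited studies use their own extraction rules and historical registry snapshots rather than live lookups (which, per Section 6, only bound PHR from below).

```python
import re
import urllib.request
from urllib.error import HTTPError, URLError

# Illustrative extraction patterns; real harnesses use per-language rules and
# typically map import names to distribution names rather than equating them.
PIP_INSTALL = re.compile(r"pip\s+install\s+([A-Za-z0-9_.\-]+)")
IMPORT_STMT = re.compile(r"^\s*(?:import|from)\s+([A-Za-z0-9_]+)", re.MULTILINE)

def extract_candidates(code: str) -> set[str]:
    """Collect candidate package names mentioned in a generated snippet."""
    return set(PIP_INSTALL.findall(code)) | set(IMPORT_STMT.findall(code))

def exists_on_pypi(name: str, timeout: float = 5.0) -> bool:
    """Check a name against the live PyPI index. A dated snapshot is preferable
    for reproducible PHR measurement, since names may be registered later."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except HTTPError:
        return False  # 404: not registered
    except URLError:
        raise RuntimeError(f"registry unreachable while checking {name!r}")

def hallucination_stats(snippets: list[str]) -> tuple[int, int]:
    """Return (hallucinated, total) candidate counts across snippets."""
    total = hallucinated = 0
    for code in snippets:
        for name in extract_candidates(code):
            total += 1
            hallucinated += not exists_on_pypi(name)
    return hallucinated, total
```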
3. Empirical Findings and Patterns
Large-scale empirical evaluation reveals PHR is non-negligible and exhibits strong systematic patterns:
- Typical Rates:
- Commercial code LLMs (OpenAI GPT models) show an average PHR of roughly $5.2\%$; open-source LLMs average $21.7\%$ (Spracklen et al., 12 Jun 2024).
- In Go-focused evaluations, full-precision models show comparatively low PHR, while aggressive $4$-bit quantization drives rates sharply higher (Haque et al., 9 Dec 2025).
- A security-focused survey reports per-language mean PHR that varies substantially across JavaScript, Rust, and Python, with the best-performing models achieving markedly lower rates (Krishna et al., 31 Jan 2025).
- In summarization, ChatGPT and GPT-4 produce $0.62$–$0.84$ hallucinations per summary, decreasing with critique-enhanced workflows (George et al., 2023).
- Dependency on Model Family, Size, and Precision: Larger parameter counts reduce PHR, as does access to up-to-date training data. Quantization increases PHR: $8$-bit models show a moderate increase, while $4$-bit models induce catastrophic hallucination frequency except for the largest-scale models (Haque et al., 9 Dec 2025).
- Sampling Temperature and Prompt Recency: Increased temperature exacerbates PHR sharply, more than doubling rates at extreme values. Prompts referencing recent or esoteric packages see higher PHR (Spracklen et al., 12 Jun 2024).
- Persistence and Specificity: Hallucinations are often persistent across generations from the same prompt and model: a large fraction of hallucinated packages reappear in repeated samples (Spracklen et al., 12 Jun 2024). Most hallucinated names arise in only a single model.
- String Structure: In code, hallucinated packages are rarely typo-variants of real names, typically differing from all real package names by an edit distance of at least $6$ (Spracklen et al., 12 Jun 2024). For Go, most hallucinated packages take a plausible URL form with a correct domain but a non-existent user or subpath (Haque et al., 9 Dec 2025).
- Correlation to Code Quality: There is a strong negative correlation between HumanEval score (code correctness) and PHR: higher code quality is associated with lower hallucination rates (Krishna et al., 31 Jan 2025).
4. Mitigation Strategies and Trade-offs
Multiple approaches have been evaluated for PHR reduction, each with trade-offs:
- Retrieval-Augmented Generation (RAG): Augmenting prompts with package-to-task facts retrieved from vector-indexed corpora yields relative PHR reductions of roughly $24\%$ or more across open LLM baselines (Spracklen et al., 12 Jun 2024).
- Self-Refinement: Post-hoc LLM self-validation of package recommendations yields substantial reductions, though effectiveness varies by model (Spracklen et al., 12 Jun 2024).
- Supervised Fine-Tuning: Retraining on prompt/valid-package data lowers PHR markedly for models such as DeepSeek and CodeLlama, but at the expense of code generation quality (HumanEval pass@1 drops by half in some cases) (Spracklen et al., 12 Jun 2024).
- Quantization-Aware Practices: $8$-bit quantization is generally safe with minor PHR cost; $4$-bit demands aggressive post-filtering and validation layers to block “slopsquatting” attacks—malicious registration of hallucinated package names (Haque et al., 9 Dec 2025).
- Deployment Safeguards: Integrate registry existence checks, prompt hardening, explicit dependency provisioning, and internal “sinkholing” of high-risk names. Integrating PHR detection into code-completion platforms before deployment can provide early warning (Krishna et al., 31 Jan 2025).
Combining these techniques in ensemble yields the largest reductions in PHR, with DeepSeek showing the strongest improvement (Spracklen et al., 12 Jun 2024); a minimal post-generation validation sketch follows.
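As one concrete instance of the deployment safeguards listed above, the sketch below gates an LLM’s dependency suggestions behind a registry existence check and a user-maintained allowlist before they are surfaced; the function signature and allowlist mechanism are hypothetical, and the registry check could reuse a verifier like the one sketched in Section 2.

```python
from typing import Callable, Iterable

def filter_suggestions(
    suggested: Iterable[str],
    registry_exists: Callable[[str], bool],
    allowlist: set[str] | None = None,
) -> tuple[list[str], list[str]]:
    """Split suggested package names into (accepted, blocked).

    A name is accepted if it is on the explicit allowlist or resolves in the
    package registry; anything else is blocked and reported. Note that an
    existence check alone cannot catch names an attacker has already
    registered, so allowlists and monitoring remain necessary.
    """
    allowlist = allowlist or set()
    accepted, blocked = [], []
    for name in suggested:
        if name in allowlist or registry_exists(name):
            accepted.append(name)
        else:
            blocked.append(name)
    return accepted, blocked

# Usage sketch: `exists_on_pypi` is the verifier from the earlier example.
# accepted, blocked = filter_suggestions(llm_packages, exists_on_pypi,
#                                        allowlist={"internal-utils"})
```

Because an attacker may already have registered a previously hallucinated name, existence checks are a necessary but not sufficient defense; the monitoring practices in Section 5 close part of that gap.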
5. Security, Reliability, and Broader Impact
Elevated PHR presents a significant software supply chain security risk. Hallucinated names, especially those not yet registered in public package indices, create “zero-day” attack surfaces: adversaries can publish malicious code under them (“slopsquatting”), which is then consumed by downstream developers acting on LLM suggestions.
PHR is directly actionable as a security metric:
- Model Selection: High PHR models, even if otherwise performant, should be disfavored for security-critical code synthesis (Krishna et al., 31 Jan 2025).
- Continuous Monitoring: Registry operators and security teams may proactively reserve or monitor common hallucinated names (Krishna et al., 31 Jan 2025).
- Automated Vetting: Tooling can flag, require explicit approval for, or auto-remediate suspicious package suggestions.
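One way to operationalize the flag-and-approve pattern is a small CI gate over the dependency manifest; the file names, approval-list format, and exit-code convention below are illustrative assumptions rather than tooling described in the cited papers.

```python
import re
import sys
from pathlib import Path

# Hypothetical approval file: one package name per line, reviewed by humans.
APPROVED_FILE = Path("approved-packages.txt")
# Simplified: takes the name before any version specifier; real tools should
# use a proper requirements parser.
REQ_LINE = re.compile(r"^\s*([A-Za-z0-9_.\-]+)")

def vet_requirements(requirements: Path, approved: set[str]) -> list[str]:
    """Return requirement names that have not been explicitly approved."""
    flagged = []
    for line in requirements.read_text().splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = REQ_LINE.match(line)
        if match and match.group(1).lower() not in approved:
            flagged.append(match.group(1))
    return flagged

if __name__ == "__main__":
    approved = {n.strip().lower() for n in APPROVED_FILE.read_text().splitlines() if n.strip()}
    flagged = vet_requirements(Path("requirements.txt"), approved)
    if flagged:
        print("Packages requiring explicit approval:", ", ".join(flagged))
        sys.exit(1)  # fail the CI job until a reviewer approves the additions
```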
There is no evidence that current model development best practices consistently optimize for low PHR in tandem with code quality: Pareto-optimality in the error-hallucination space is sparsely populated (Krishna et al., 31 Jan 2025). A plausible implication is that joint consideration of coding benchmarks and hallucination-centric benchmarks should shape future model architecture and training-set curation.
6. Extensions, Limitations, and Adaptability
PHR has been adapted from code to summarization and generative in-context learning:
- Summarization: Factored Verification measures hallucination at the claim level, reporting either the mean hallucination count per summary (“package”) or the proportion of summaries containing at least one hallucination, with a further correction for verifier accuracy (George et al., 2023).
- Bayesian Modeling: The posterior hallucination rate tracks the probability that a generated response has log-likelihood below an $\alpha$-quantile threshold under the latent data-generating mechanism, estimated via black-box Monte Carlo sampling (Jesson et al., 11 Jun 2024).
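The bullet above admits a compact Monte Carlo reading: draw latent mechanisms from the posterior, evaluate the response log-likelihood under each draw, and count how often it falls below that draw’s $\alpha$-quantile threshold. The sketch below is a schematic under the simplifying assumption that posterior sampling, the log-likelihood, and the threshold are available as callables; it abstracts away the black-box in-context estimation machinery of Jesson et al.

```python
from typing import Callable, Sequence

def posterior_hallucination_rate(
    responses: Sequence[object],
    sample_mechanism: Callable[[], object],                 # draws theta ~ p(theta | D)
    log_likelihood: Callable[[object, object], float],      # (y, theta) -> log p(y | theta)
    quantile_threshold: Callable[[object], float],          # theta -> epsilon_alpha(theta)
    n_mc: int = 200,
) -> float:
    """Schematic Monte Carlo estimate of the posterior hallucination rate.

    For each posterior draw theta, a response y counts as a hallucination when
    log p(y | theta) falls below epsilon_alpha(theta); the estimate averages
    this indicator over draws and responses.
    """
    flags = []
    for _ in range(n_mc):
        theta = sample_mechanism()
        eps = quantile_threshold(theta)
        for y in responses:
            flags.append(log_likelihood(y, theta) < eps)
    return sum(flags) / len(flags) if flags else 0.0
```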
Practical limitations include:
- Lower Bound Bias: Registry-index-based measurement only establishes a lower bound; actors may have registered hallucinated names post-model-training (Spracklen et al., 12 Jun 2024).
- Extraction and Classification Error: Regex-based or heuristics-based extraction can misclassify modules as packages, or miss certain classes entirely (Spracklen et al., 12 Jun 2024).
- Domain Specificity: Results are contingent on model, ecosystem (Python, JavaScript, Go, Rust), prompt set, and date of registry snapshots. Rates and optimal mitigations may differ significantly elsewhere (Spracklen et al., 12 Jun 2024, Krishna et al., 31 Jan 2025).
- Adaptive Attack Risk: Adversarial prompts can sharply increase hallucination frequency, especially for code-specialized and small models (Krishna et al., 31 Jan 2025).
Despite these limitations, PHR offers a transferable, interpretable, and robust metric for quantifying hallucination-related threats and guiding the design of secure, supply-chain–conscious AI systems.
7. Cross-Domain and Methodological Evolution
The principles underlying PHR have influenced evaluation standards across LLM-centric research, including but not limited to supply chain security, code recommendation reliability, factual correctness in summarization, and ICL trustworthiness. Algorithmic refinements—including claim weighting, token- or entity-level granularity, and confidence-threshold–modulated PHR—extend its utility in matching the statistical structure of various generative tasks (George et al., 2023).
This suggests that PHR, once narrowly tailored for code package hallucination, has become integral to a broader epistemic and security paradigm for generative AI evaluation.
Principal sources:
- "We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs" (Spracklen et al., 12 Jun 2024)
- "Secure or Suspect? Investigating Package Hallucinations of Shell Command in Original and Quantized LLMs" (Haque et al., 9 Dec 2025)
- "Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities" (Krishna et al., 31 Jan 2025)
- "Estimating the Hallucination Rate of Generative AI" (Jesson et al., 11 Jun 2024)
- "Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers" (George et al., 2023)