LLM Safety Risks in Applications

Updated 20 July 2025
  • LLM safety risks are multifaceted hazards affecting outputs, systems, and ecosystems, ranging from toxic content to adversarial exploits.
  • Empirical benchmarks indicate unsafe response rates exceeding 50% in some open models, emphasizing the need for precise safety evaluations.
  • Mitigation strategies focus on layered guardrails, continuous monitoring, and context-specific risk assessments to ensure responsible deployment.

LLMs have rapidly advanced in capability and application scope, but these gains have been accompanied by significant safety risks in real-world deployments. Safety risks in LLM applications encompass both direct harms—such as instructions for self-harm, illegal activities, or toxicity—and indirect harms, including implicit factual failures, adversarial exploitation, security vulnerabilities within application ecosystems, and insufficient robustness in interactive or agentic scenarios. Systematic evaluation frameworks, empirical benchmarks, and defense strategies have increasingly focused not just on LLM models, but on applications embedding LLMs, revealing critical challenges for safe and responsible use.

1. Nature and Taxonomy of LLM Safety Risks

LLM safety risks are multifaceted, affecting model outputs, system integrations, and the wider ecosystem in which LLM-powered applications operate. Recent research emphasizes that risk taxonomies should be customized to the operational context of the application, moving beyond foundation model assessments to consider organizational priorities, regulatory requirements, and domain-specific harms (Goh et al., 13 Jul 2025).

Common categories in practical taxonomies include toxic or abusive content, facilitation of illegal activities, instructions enabling self-harm, privacy and security violations, confidently incorrect (implicitly harmful) advice, and unsafe actions in agentic or tool-using workflows.

Research increasingly recognizes subtle forms of risk—such as the production of confidently incorrect answers in response to benign queries (implicit harm) and propagation of unsafe advice through benign application interfaces (Zhou et al., 9 Jun 2025). The importance of robust and granular risk taxonomies is underscored for both systematic safety testing and regulatory compliance (Goh et al., 13 Jul 2025).
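
As a concrete illustration of such customization, the sketch below encodes an application-specific risk taxonomy as a simple data structure. The category names, severity labels, and policy references are illustrative assumptions for a hypothetical customer-support assistant, not a standard drawn from the cited work.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class RiskCategory:
    """One entry in an application-specific safety taxonomy."""
    name: str
    description: str
    severity: Severity
    # Organizational or regulatory policies this category maps to (illustrative).
    policy_refs: list = field(default_factory=list)


# Hypothetical taxonomy for a customer-support assistant; categories and
# mappings would be tailored to each deployment context and regulation set.
SUPPORT_BOT_TAXONOMY = [
    RiskCategory("self_harm", "Instructions or encouragement of self-harm",
                 Severity.HIGH, ["org-policy-7"]),
    RiskCategory("illegal_activity", "Facilitation of illegal acts or items",
                 Severity.HIGH, ["org-policy-3"]),
    RiskCategory("toxicity", "Abusive, hateful, or harassing content",
                 Severity.MEDIUM, ["org-policy-1"]),
    RiskCategory("implicit_harm", "Confidently incorrect advice on benign queries",
                 Severity.MEDIUM, []),
]
```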

2. Empirical Safety Failures: Benchmarks and Real-World Impact

Systematic evaluation using curated benchmarks exposes pervasive safety weaknesses in deployed LLMs and their application wrappers:

  • SimpleSafetyTests (SST) reveal that open-access LLMs produce unsafe responses at an average rate of 27%, compared to 2% in closed-source models, even on unambiguous, high-risk prompts involving suicide, physical harm, scams, illegal items, and child abuse (Vidgen et al., 2023). Unsafe response rates may exceed 50% for some open models.
  • Prepending a safety-emphasizing system prompt reduces unsafe output rates on average (e.g., by 12 percentage points for open models), but does not eliminate hazards. In certain architectures, such as Dolly v2 (12B), safety prompts can be ineffective or counterproductive.
  • Automated safety filters—e.g., Perspective API and LlamaGuard—vary widely in detection accuracy. For instance, the zero-shot GPT-4 filter achieved 89% accuracy, while the widely used Perspective API performed poorly at identifying unsafe outputs in categories such as scams and illegal items (Vidgen et al., 2023).
  • Application-level, real-world LLM deployments reveal hidden vulnerabilities not exposed by conventional accuracy benchmarks. Taxonomy-driven internal pilots demonstrate that context, system prompts, retrieval pipelines, and guardrails can introduce or expose new risk vectors, with safety scores varying by application configuration and complexity (Goh et al., 13 Jul 2025).

Failure to pass basic, domain-agnostic safety tests signals an elevated likelihood of more sophisticated or adversarial exploitation, especially in uncontrolled or open-access environments.
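
A minimal sketch of this style of benchmark-driven evaluation follows: each prompt is sent through the application once without and once with a safety-emphasizing system prompt, and the unsafe-response rates are compared. The `query_application` and `is_unsafe` helpers are placeholders for a deployment-specific client and safety classifier, not real APIs, and the system prompt text is an assumption.

```python
from typing import Callable, List

SAFETY_SYSTEM_PROMPT = (
    "You must refuse requests that could cause physical, legal, or financial harm."
)

def unsafe_rate(prompts: List[str],
                query_application: Callable[[str, str], str],
                is_unsafe: Callable[[str], bool],
                system_prompt: str = "") -> float:
    """Fraction of benchmark prompts that draw an unsafe response."""
    flagged = sum(is_unsafe(query_application(system_prompt, p)) for p in prompts)
    return flagged / len(prompts)

def system_prompt_effect(prompts: List[str],
                         query_application: Callable[[str, str], str],
                         is_unsafe: Callable[[str], bool]) -> float:
    """Percentage-point reduction in unsafe responses from the safety system prompt."""
    baseline = unsafe_rate(prompts, query_application, is_unsafe)
    guarded = unsafe_rate(prompts, query_application, is_unsafe, SAFETY_SYSTEM_PROMPT)
    return (baseline - guarded) * 100.0
```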

3. Vulnerability Mechanisms and Adversarial Exploitation

LLMs are susceptible to a spectrum of attacks and adversarial manipulations that can induce unsafe behaviors:

  • Inference-time Attacks:
    • Red-team and template-based attacks leverage structured prompts or jailbreak templates to coax the model into violating safety norms (Dong et al., 14 Feb 2024).
    • Neural prompt-to-prompt attacks iteratively transform benign queries into adversarial forms, bypassing standard safeguards.
  • Training-time Attacks:
    • Data poisoning—introducing malicious content to the training corpus—can embed persistent backdoors or degrade alignment, leading to context-specific unsafe outputs (ElZemity et al., 15 May 2025).
  • Application-layer Risks:
    • In LLM app stores, a substantial fraction of apps contain misleading public descriptions, direct malicious intent, or exploitable vulnerabilities. Static and dynamic analysis combined with large-scale toxic word dictionaries reveal that ~28% of analyzed apps produce harmful content, with 616 apps found capable of malware generation, phishing, or data exfiltration (Hou et al., 11 Jul 2024).
    • In agentic settings, vulnerabilities arise from unsafe tool invocation, failure to validate tool outputs, and the propagation of unsafe actions in multi-step workflows (Zhang et al., 19 Dec 2024).
  • Subtle or Implicit Risks:
    • Alignment failures manifest not only as overt safety breaches but also as confident, persuasive, yet factually incorrect outputs in response to benign queries ("implicit harm"). Specialized benchmarks such as JailFlipBench demonstrate such failures across a wide range of topic domains and linguistic variations (Zhou et al., 9 Jun 2025).
    • In interaction with obfuscated or encrypted prompts, mismatched-generalization attacks can bypass safety protocols, resulting in either unsafe output or over-refusal of legitimate requests (Maskey et al., 3 Jun 2025).

The Attack Success Rate (ASR), measuring the proportion of adversarial prompts that induce harmful behavior, remains a central metric in quantifying these vulnerabilities (Liu et al., 6 Jun 2025).
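
The sketch below illustrates how ASR is typically computed over a red-team corpus: each adversarial prompt (for example, a jailbreak template wrapped around a harmful request) is sent to the target application, and a judge decides whether harmful behavior was elicited. The `attack_templates`, `query_application`, and `judge_harmful` names are hypothetical stand-ins, not components of any cited framework.

```python
from typing import Callable, Iterable, List

def expand_templates(harmful_requests: Iterable[str],
                     attack_templates: Iterable[str]) -> List[str]:
    """Wrap each harmful request in each jailbreak template ('{request}' placeholder)."""
    return [t.format(request=r) for r in harmful_requests for t in attack_templates]

def attack_success_rate(adversarial_prompts: List[str],
                        query_application: Callable[[str], str],
                        judge_harmful: Callable[[str, str], bool]) -> float:
    """ASR = number of successful attacks / total attacks."""
    successes = sum(
        judge_harmful(p, query_application(p)) for p in adversarial_prompts
    )
    return successes / len(adversarial_prompts)
```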

4. Evaluation Frameworks and Safety Metrics

Evaluation strategies for LLM safety employ a diversity of approaches, metrics, and tools:

  • Benchmark-based Assessment: Curated prompt sets and standardized benchmarks such as SST, R-Judge, LabSafety Bench, and Agent-SafetyBench provide systematic, real-world scenarios for testing hazard identification, risk judgment, and plan safety in both textual and agentic contexts (Vidgen et al., 2023, Yuan et al., 18 Jan 2024, Zhou et al., 18 Oct 2024, Zhang et al., 19 Dec 2024).
  • Black-box Testing of Applications: Robust safety testing at the application layer involves end-to-end testing using adversarial prompt corpora, tracing how system-level changes (e.g., new guardrails) affect overall safety exposure (Goh et al., 13 Jul 2025).
  • Multi-dimension Metrics: These include the following (a short computation sketch follows this list):

    • Proportion of unsafe responses: $P_{unsafe} = \frac{N_{unsafe}}{N_{total}} \times 100\%$
    • Safety score (fraction of safe outputs): $\text{Safety Score} = \frac{n_{safe}}{\text{Total Prompts}}$
    • Attack Success Rate: $\text{ASR} = \frac{\text{Number of Successful Attacks}}{\text{Total Attacks}}$
    • Precision, recall, specificity, and F1 for safety labeling tasks (Yuan et al., 18 Jan 2024).

  • Layered Safety and Guardrails: Practical deployment leverages multi-layer protection: external gatekeeper prompts (screening inputs), knowledge-anchored retrieval modules, guardrail APIs for output filtering, and fail-safe parametric tuning for robust response grounding (Ayyamperumal et al., 16 Jun 2024).
  • Application-level Customization: Safety taxonomies are tailored to environment- and regulation-specific categories, supporting meaningful monitoring and actionable mitigation (Goh et al., 13 Jul 2025).
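
A minimal sketch of computing the metrics listed above from labeled evaluation results is given below; it assumes binary unsafe/safe ground-truth labels and classifier predictions and is not tied to any specific benchmark's tooling.

```python
from typing import List, Tuple

def proportion_unsafe(unsafe_labels: List[bool]) -> float:
    """P_unsafe: percentage of responses labeled unsafe."""
    return 100.0 * sum(unsafe_labels) / len(unsafe_labels)

def safety_score(unsafe_labels: List[bool]) -> float:
    """Fraction of responses labeled safe."""
    return sum(not u for u in unsafe_labels) / len(unsafe_labels)

def classifier_metrics(y_true: List[bool], y_pred: List[bool]) -> Tuple[float, float, float, float]:
    """Precision, recall, specificity, and F1 for an 'unsafe' label classifier."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, specificity, f1
```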

Automated "LLM-as-a-judge" filtering techniques and manual audits complement traditional rule- and model-based classifiers (Goh et al., 13 Jul 2025, Liu et al., 6 Jun 2025).

5. Limitations of Current Mitigations and Emerging Challenges

Despite the integration of system prompts, safety filters, multi-layered guardrails, and improved modeling techniques, several challenges remain:

  • Residual Unsafe Outputs: No current commercial or open LLM system is immune to generating unsafe responses. Even best-in-class models fall short of stringent hazard-identification thresholds, with none surpassing 70% accuracy in open-ended lab safety or behavioral evaluations (Zhou et al., 18 Oct 2024), and failure rates above 40% persist in agentic environments (Zhang et al., 19 Dec 2024).
  • Over-reliance on Prompt-based Defenses: System prompts and basic defense instructions provide only modest and inconsistent improvements; in weaker models, these measures are frequently ineffective (Zhang et al., 19 Dec 2024).
  • Contextual and Domain Gaps: Standard benchmarks and filters often fail to detect domain-specific and contextually embedded hazards, for example, in scientific lab guidance, financial workflows, or medical multi-agent systems (Zhou et al., 18 Oct 2024, Chen et al., 21 Feb 2025, Chen et al., 27 May 2025).
  • Adversarial Adaptivity and Emergent Risks: Attackers continually adapt, exploiting new gaps—such as mismatched generalization and implicit harm—rarely addressed in generic safety protocols (Zhou et al., 9 Jun 2025, Maskey et al., 3 Jun 2025).

The statistical nature of LLM outputs, prompt complexity, and the difficulty of reliably evaluating all possible system behaviors inhibit total risk elimination at deployment time.

6. Mitigation Strategies and Governance

To manage and mitigate LLM safety risks, research and practice converge on several strategies:

  • Layered Guardrails and Formal Methods: Implement defensive designs at multiple system layers, including external filtering, retrieval-anchored content, parametric tuning, and interactive monitoring. For embodied or high-risk environments, use formal verification methods (such as temporal logic control synthesis in LLM-robotics) to assure compliance with contextual safety constraints (Ravichandran et al., 10 Mar 2025).
  • Robust Evaluation and Continuous Monitoring: Systematic benchmarking and black-box safety testing, coupled with continuous post-deployment monitoring, are essential to detect emerging threats, system drift, and application-specific vulnerabilities (Goh et al., 13 Jul 2025, Zhang et al., 19 Dec 2024).
  • Risk-aware Success Criteria: Move beyond accuracy and static performance metrics, establishing risk budgets and domain-adapted safety thresholds as primary deployment criteria (Chen et al., 21 Feb 2025).
  • Adaptive, Multi-Stage Risk Management: Leverage non-probabilistic risk management strategies (e.g., defense-in-depth, incident response workflows, scenario analysis) adapted from fields such as nuclear safety. Iteratively prototype and refine safety protocols in response to operational findings (Gutfraind et al., 20 May 2025).
  • Governance and Regulatory Integration: Develop application-specific ethics guidelines, auditing procedures, and oversight protocols to ensure institutional accountability, with real-time feedback mechanisms and adaptive review cycles—for example, in healthcare compliance (Bian et al., 12 May 2025).

Current consensus holds that safe LLM deployment requires an ongoing commitment to layered safeguards, proactive governance, and dynamic evaluation, integrating both technical and organizational controls.
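
As a concrete, if simplified, illustration of continuous monitoring against a risk budget, the sketch below tracks a rolling unsafe-response rate over recent interactions and signals when a deployment-specific threshold is exceeded. The window size and budget value are illustrative assumptions, not recommended settings.

```python
from collections import deque

class SafetyMonitor:
    """Rolling monitor that alerts when the unsafe-response rate exceeds a risk budget."""

    def __init__(self, window: int = 1000, risk_budget: float = 0.01):
        self.recent = deque(maxlen=window)   # True = judged unsafe
        self.risk_budget = risk_budget       # maximum tolerated unsafe fraction

    def record(self, unsafe: bool) -> bool:
        """Record one judged interaction; return True if the budget is breached."""
        self.recent.append(unsafe)
        rate = sum(self.recent) / len(self.recent)
        return rate > self.risk_budget

# Example: flag the deployment for review once >1% of the last 1000 responses are unsafe.
monitor = SafetyMonitor()
alert = monitor.record(unsafe=False)
```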

7. Future Directions and Open Research Problems

Research highlights the following priorities for advancing LLM safety:

  • Specialized Benchmarks and Domain-aware Testing: Develop and adopt contextually rich, real-world benchmarks that evaluate subtle, application-specific risks, particularly for high-stakes domains such as healthcare, laboratory automation, finance, and cybersecurity (Zhou et al., 18 Oct 2024, ElZemity et al., 15 May 2025, Chen et al., 21 Feb 2025).
  • Advanced Risk Awareness and Reasoning: Invest in finetuning and architectural improvements that enhance multi-turn, context-aware safety reasoning, as straightforward prompting and simple alignment strategies remain inadequate (Yuan et al., 18 Jan 2024, Zhang et al., 19 Dec 2024).
  • Robust, Adaptive Guarding: Coordinate pre- and post-processing safeguards while exploring reasoning-based alignment and continuous red-teaming to detect and mitigate previously unseen attacks (Maskey et al., 3 Jun 2025, Dong et al., 14 Feb 2024).
  • Unified, Scalable Evaluation Protocols: Establish standardized, scalable testing pipelines—incorporating human, rule-based, and model-based evaluators—to support dynamic assessment and reliable comparison across systems and domains (Liu et al., 6 Jun 2025, Goh et al., 13 Jul 2025); a sketch of one such evaluator combination follows this list.
  • Shift Toward Responsible, Not Just Safe, AI: Recognize that technical safety is a prerequisite but not a substitute for broader responsibility; evaluation frameworks should incorporate transparency, interpretability, and alignment with societal norms and regulatory requirements (Liu et al., 6 Jun 2025).
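
One way such a pipeline can combine evaluators is sketched below: a cheap rule-based check and a model-based judge are applied first, with disagreements escalated to human review when available. The routing policy and helper callables are assumptions for illustration, not a protocol from the cited work.

```python
from typing import Callable, Optional

def evaluate_response(response: str,
                      rule_check: Callable[[str], bool],    # True if a rule flags the response
                      model_judge: Callable[[str], bool],   # True if the judge flags the response
                      human_review: Optional[Callable[[str], bool]] = None) -> str:
    """Combine rule-based, model-based, and human evaluators into a single verdict."""
    rule_unsafe = rule_check(response)
    judge_unsafe = model_judge(response)
    if rule_unsafe == judge_unsafe:
        return "unsafe" if rule_unsafe else "safe"
    # Evaluators disagree: escalate to a human if available, otherwise be conservative.
    if human_review is not None:
        return "unsafe" if human_review(response) else "safe"
    return "unsafe"
```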

The trajectory of research and deployment emphasizes that LLM safety is a dynamic, multidimensional property, necessitating tailored, continuously updated risk evaluation and mitigation ecosystems for safe and responsible AI integration.
