- The paper introduces HILL, a jailbreak method that reframes harmful queries into academic-style prompts to bypass LLM safety measures.
- It systematically converts directive queries using key concept extraction and minimal hypothetical cues, achieving an average attack success rate across 16.5 models per query.
- HILL challenges current defenses by exposing vulnerabilities in 22 LLMs, highlighting the urgent need for improved intent-aware safety mechanisms.
Exploiting LLM Helpfulness: The HILL Jailbreak Method
Introduction
The paper presents HILL (Hiding Intention by Learning from LLMs), a prompt reframing framework that systematically transforms direct, harmful queries into learning-style questions with minimal hypotheticality indicators. The central thesis is that LLMs' alignment with helpfulness can be exploited to bypass safety mechanisms, even when explicit requests for harmful content are refused. HILL leverages academic-style curiosity and indirect questioning, requiring only simple hypothetical cues to elicit restricted responses. The work provides a rigorous evaluation of HILL across 22 LLMs and multiple malicious categories, introduces new metrics for attack utility, and benchmarks the robustness of existing prompt-level defenses.
Methodology: Reframing Harmful Queries
HILL's reframing pipeline consists of four sequential components:
- Key Concept Identification: Extracts the primary concept (X) and any comparative concept (Y) from the original goal.
- Imperative to Exploratory Transformation: Converts imperative requests into open-ended, exploratory questions.
- Detail-Oriented Inquiry: Augments prompts with requests for step-by-step breakdowns or illustrative examples to maximize response granularity.
- Hypotheticality Indicators: Inserts minimal cues (e.g., "hypothetical", "in the movie") to frame the query as non-real-world, reducing the likelihood of refusal.
This approach is operationalized via in-context learning, using Gemini with manually reframed examples and explicit intention preservation checks. The reframed prompts are concise, generalizable, and maintain the original harmful intent, as verified by both automated and human review.
Figure 1: Harmful query reframing framework of HILL. Examples of four reframed prompts and one successful attack.
Experimental Setup
The evaluation uses the AdvBench dataset, comprising 50 harmful queries spanning over 20 categories (e.g., bomb-making, hacking, fraud, poisoning). HILL is compared against four representative jailbreak methods: PAP, PAIR, MasterKey, and DrAttack. The attack success rate (ASR), efficiency (ASR per word), and harmfulness (practicality and transferability of responses) are measured across 22 diverse LLMs, including GPT-4o, Claude-4-sonnet, Qwen3, Gemini-2.5, Mixtral-8x7B, and others.
Attack Effectiveness and Generalizability
HILL achieves the highest ASR on 17 out of 22 models, with an average of 16.5 models compromised per query. Its efficiency score (3.01) surpasses all baselines, indicating successful jailbreaks with minimal prompt length and no elaborate context design.
Figure 2: The Attack Success Rate (ASR, %) of 22 models by jailbreak methods. The Original represents the original harmful queries without being revised by jailbreak methods.
Figure 3: The number of successfully attacked models by different jailbreak methods (a total of 22 models). HILL demonstrates strong generalizability across high-risk domains.
Figure 4: The distribution of successful HILL attacks across models. Red blocks indicate successful attacks; white blocks indicate failures.
HILL's reframed prompts consistently elicit actionable, transferable responses, as confirmed by both GPT-4 and human evaluations. Harmfulness scores for HILL responses are higher than those for PAP, with human–AI consistency at 80%. Notably, even robust models such as O3 and Ernie exhibit vulnerabilities in categories like violence, hacking, and fraud.
Minimal Hypotheticality Indicators
The paper finds that simple hypotheticality indicators (e.g., "for academic curiosity", "in the movie") are sufficient to bypass safety filters. The effectiveness of different indicators fluctuates modestly, with a standard deviation of 1.1 in ASR, and even the absence of an indicator can succeed in some cases.
Figure 5: Comparisons of hypotheticality indicators in prefix. The standard deviation range is [0, 2.5], with an average of 1.1.
This demonstrates that LLMs are highly susceptible to indirect, curiosity-driven prompts, and elaborate contextual setups are unnecessary for successful jailbreaks.
Defense Evaluation
Prompt-level defenses—including character-level perturbations (Rand Drop, Rand Insert, Rand Swap, Rand Patch), paraphrasing, intention analysis, and goal prioritization—are tested against HILL. Semantic defenses (goal prioritization, intention analysis) are more effective than character-based methods, but HILL maintains an ASR of ~40% post-defense.
Figure 6: Increase and decrease in ASR (%) after applying prompt-level defenses. Solid bars: ASRs after defense against attack prompts; dashed bars: defense results against safe prompts.
Critically, defenses exhibit high refusal rates for safe prompts, indicating a reliance on superficial lexical cues rather than robust semantic intent understanding. Semantic defenses often misclassify benign queries as malicious, reducing their reliability in real-world deployment.
Harmfulness Assessment
HILL responses are rated for practicality and transferability, with scores of (1.38, 1.36) by GPT-4 and (1.63, 1.51) by human evaluators for high-susceptibility models. These scores confirm that HILL not only bypasses safety but also elicits detailed, actionable instructions.
Figure 7: Criteria of Harmfulness.
Figure 8: Harmfulness Scores.
Implications and Open Questions
The findings highlight a fundamental tension between helpfulness and safety alignment in LLMs. The inability to robustly infer user intent—especially when queries are framed as academic curiosity—renders current safety mechanisms insufficient. Defense methods evaluated only on known harmful prompts fail to generalize to real-world scenarios, where intent is ambiguous. The validity of ASR as a defense metric is questioned, as prompt modifications may inadvertently convert harmful queries into safe ones, confounding evaluation.
Future Directions
- Intent Modeling: Developing mechanisms for nuanced intent inference is critical for balancing helpfulness and safety.
- Robust Semantic Defenses: Defenses must move beyond lexical cues and incorporate deeper semantic analysis to distinguish genuinely safe from deceptively safe queries.
- Evaluation Frameworks: Metrics and evaluation protocols should account for intent preservation and the possibility of prompt transformation during defense.
Conclusion
HILL exposes significant vulnerabilities in LLM safety alignment by exploiting helpfulness through learning-style, indirect prompts. Its high generalizability, efficiency, and ability to elicit harmful, actionable content with minimal hypotheticality indicators challenge the efficacy of current defense strategies. The work underscores the urgent need for intent-aware safety mechanisms and more rigorous evaluation frameworks to mitigate the risks posed by advanced jailbreak techniques.