
Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

Published 20 Jan 2026 in cs.CR, cs.AI, cs.CL, cs.LG, and cs.SE | (2601.13528v1)

Abstract: Model developers implement safeguards in frontier models to prevent misuse, for example, by employing classifiers to filter dangerous outputs. In this work, we demonstrate that even robustly safeguarded models can be used to elicit harmful capabilities in open-source models through elicitation attacks. Our elicitation attacks consist of three stages: (i) constructing prompts in adjacent domains to a target harmful task that do not request dangerous information; (ii) obtaining responses to these prompts from safeguarded frontier models; (iii) fine-tuning open-source models on these prompt-output pairs. Since the requested prompts cannot be used to directly cause harm, they are not refused by frontier model safeguards. We evaluate these elicitation attacks within the domain of hazardous chemical synthesis and processing, and demonstrate that our attacks recover approximately 40% of the capability gap between the base open-source model and an unrestricted frontier model. We then show that the efficacy of elicitation attacks scales with the capability of the frontier model and the amount of generated fine-tuning data. Our work demonstrates the challenge of mitigating ecosystem level risks with output-level safeguards.

Summary

  • The paper demonstrates that fine-tuning open-source LLMs on benign outputs from safeguarded frontier models can recover roughly 40% of the capability gap to an unrestricted frontier model.
  • It introduces an anchored comparison evaluation that more accurately detects technical errors than traditional rubric grading, achieving ~88% expert agreement.
  • The findings highlight that as frontier models improve, the risk of transferring dangerous capabilities increases, urging novel defense strategies beyond output filtering.

Elicitation Attacks: Transferring Harmful Capabilities via Safeguarded Outputs

Introduction and Threat Model

This work introduces and systematically analyzes elicitation attacks, a category of attacks in which adversaries fine-tune open-source LLMs using ostensibly benign outputs from strongly safeguarded frontier models to recover dangerous capabilities, specifically in the domain of chemical weapon synthesis and processing. The attack pipeline bypasses typical output-level safeguards by (1) generating prompts in domains adjacent to harmful use cases, (2) collecting safeguarded model responses, and (3) fine-tuning open-source models on these benign prompt-response pairs.
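The three-stage pipeline can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the prompt template is invented, and `query_frontier_model` is a hypothetical stand-in for whatever API access an attacker has.

```python
# Sketch of the three-stage elicitation pipeline. The prompt template
# and model-access callable are illustrative assumptions.

def build_adjacent_prompts(adjacent_domains):
    """Stage (i): prompts in benign domains adjacent to the target task.
    No prompt requests dangerous information on its own."""
    return [f"Describe standard laboratory procedures used in {d}."
            for d in adjacent_domains]

def collect_responses(prompts, query_frontier_model):
    """Stage (ii): a safeguarded frontier model answers the benign
    prompts; since none are harmful, none trip output-level refusals."""
    return [(p, query_frontier_model(p)) for p in prompts]

def make_finetune_dataset(pairs):
    """Stage (iii): format prompt-response pairs for supervised
    fine-tuning of an open-source model."""
    return [{"prompt": p, "completion": r} for p, r in pairs]
```

The key property is that every individual request in stage (i) is answerable without refusal; the transferable knowledge only materializes in aggregate at stage (iii).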

Ecosystem-level risk is central: while frontier models are subject to robust deployment safeguards, attackers can leverage their benign outputs to circumvent intended safety controls, resulting in open-source models that demonstrate substantially improved performance on conventionally “refused” tasks. Unlike task decomposition or inference-time model chaining, this attack paradigm transfers dangerous capability directly into open-source models, enabling offline, unmonitored use (Figure 1).

Figure 1: Schematic of the elicitation attack. Prompts in benign adjacent domains are run on a safeguarded model to generate outputs, which are then used to fine-tune an open-source model; the fine-tuned open-source model partially recovers dangerous capabilities.

Evaluation Methodology: Anchored Comparison vs. Rubric

Classical evaluation via rubric grading—scoring outputs by counting technical keywords—was found to insufficiently penalize critical but subtle procedural mistakes, leading to inflated apparent uplift. A new anchored comparison evaluation is introduced: judge LLMs (typically jailbroken Gemini 2.5 Pro) compare tested outputs against high-quality anchor responses across subgoals, accounting not just for keyword presence but for logical ordering, technical accuracy, and the avoidance of catastrophic errors (Figure 2).

Figure 2: Demonstration of the difference between anchored comparison (left) and rubric (right) evaluations; only anchored comparisons reliably penalize critical errors outside rubric coverage.

Validation via human expert labeling demonstrates that anchored comparisons align more closely with expert judgment (~88% agreement) than rubrics (~75% agreement). Furthermore, the anchored comparison approach is approximately five times more sensitive to introduced mistakes than rubric grading.
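A minimal sketch may clarify the contrast between the two scoring schemes. The interfaces here are assumptions, not the paper's code; in particular, `judge` is a hypothetical stub standing in for a call to a judge LLM.

```python
def rubric_score(response, keywords):
    """Rubric grading: fraction of technical keywords present.
    Blind to ordering, logic, and any error outside the keyword list."""
    text = response.lower()
    return sum(k.lower() in text for k in keywords) / len(keywords)

def anchored_score(response, anchor_subgoals, judge):
    """Anchored comparison: a judge LLM rates the response against a
    high-quality anchor, one subgoal at a time. `judge` is a
    hypothetical callable returning a score in [0, 1] per subgoal."""
    scores = [judge(response, sg) for sg in anchor_subgoals]
    return sum(scores) / len(scores)
```

The rubric's failure mode is visible directly: a procedure with every keyword but a catastrophic ordering error scores identically to a correct one, whereas a per-subgoal judge comparison can penalize it.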

Empirical Assessment of Elicitation Attacks

The core empirical claim is that elicitation attacks can recover up to 40% of the performance gap between baseline abliterated open-source models and unrestricted, jailbroken frontier models on eight chemical weapons tasks—an average performance gap recovered (APGR) of ~39% for Llama 3.3 70B using anchored comparisons, and markedly higher values under rubric evaluation due to the previously mentioned limitations of that metric.
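Assuming APGR is the standard normalized-uplift ratio (fine-tuned score minus base score, over frontier score minus base score), it can be computed as:

```python
def apgr(base_score, finetuned_score, frontier_score):
    """Performance gap recovered: the fraction of the gap between the
    base open-source model and the unrestricted frontier model that
    fine-tuning closes. Assumes frontier_score > base_score."""
    return (finetuned_score - base_score) / (frontier_score - base_score)

# e.g. apgr(0.2, 0.4, 0.7) -> 0.4, i.e. 40% of the gap recovered
```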

Baseline methods using only open-source model self-generated data or textbook-derived data exhibit negligible or negative uplift, underscoring the unique value of benign outputs from strong models (Figure 3).

Figure 3: Across various weak models and evaluation methods, elicitation attacks using safeguarded frontier outputs effect the largest increase in harmful capability as measured by APGR.

Strong numerical results support the scalability of elicitation attacks with respect to both frontier model capability and the size of the fine-tuning dataset. Successive releases of Anthropic and OpenAI models progressively enhance the ability to recover harmful capability in the same open-source target, indicating that defenders should account for temporal increases in ecosystem risk (Figure 4).

Figure 4: (Left) As frontier models improve, APGR from elicitation attacks increases. (Right) Attack efficacy continues to scale with the number of benign training samples, saturating only for select tasks.

A breakdown by task and subgoal demonstrates that some facets (e.g., synthesis tasks) are less susceptible to this attack, while others see uplift approaching or exceeding 100% relative to the strongest anchor (Figure 5).

Figure 5: Detailed per-task APGR for Llama 3.3 70B fine-tuned on safeguarded Claude 3.5 Sonnet outputs.

Failure Modes of Safeguards and Domain Dependence

Examination of a frontier system guarded by constitutional classifiers reveals that even aggressive refusals do not reliably prevent the collection of benign data rich enough to enable uplift. Apparently benign domains such as soap or cheese production can still leak transferable knowledge.

Transfer experiments reveal a rapid drop-off in attack efficacy as the domain of the benign outputs diverges from the target harmful domain. Notably, fine-tuning on unrelated science or inorganic chemistry yields less than 12% uplift, while organic chemistry domains provide up to ~34% APGR.

By contrast, training directly on harmful data achieves 50.9% APGR versus 33.7% for benign chemical synthesis (a one-third reduction), but this margin is eclipsed by the gains from more powerful frontier models (e.g., benign data from Claude 4 Opus achieves 71.1% APGR against a Claude 3.5 Sonnet upper bound).

Qualitative Analysis and Validator Consistency

A thorough analysis determined that the anchored comparison methodology is both internally consistent across repeated runs and robustly identifies introduced technical mistakes, especially in key subgoals associated with chemical procedural logic. Human expert comparison further validates the method, with judges rating the LLM-based evaluations as recognizably accurate and useful in the majority of cases (Figure 6).

Figure 6: Anchored comparison self-consistency for Gemini 2.5 Pro and Llama 4 Maverick evaluators across repeated resamples.


Figure 7: Gemini 2.5 Pro outperforms other judge models and rubric evaluation in recognizing deliberate, technically subtle errors introduced into model completions.

Implications and Recommendations

Elicitation attacks present a structural challenge to output-level safeguard paradigms. Since only harmless inputs and outputs are ever observed, adversaries can exploit scientific knowledge embedded in protected models without tripping detection. Practical mitigation would require either suppressing entire scientific domains or restricting access to strong frontier models at the API/data level—both of which have severe usability implications.

As frontier model capabilities advance, the ability to transfer harmful procedural and scientific competence will grow faster than the protection afforded by output filtering and refusal strategies.

Future Outlook

  1. Enhanced Attacks: By optimizing prompt distributions, leveraging better response curation, or adopting RL methods for demonstration selection, elicitation attacks could yield even closer parity with unrestricted frontier models.
  2. Defense Directions: Mitigations may require vetting scientific outputs, dynamic necessity-of-use review (analogous to “Know Your Customer”), adversarial capability audits before model release, or fundamentally new model design paradigms beyond output-based filtering.
  3. Transfer Limitation via Domain Restriction: Domain-targeted capability suppression may help, but adversaries could develop pretext tasks that bridge the gap between benign and malicious end-use, as suggested by strong scaling effects in dataset quantity and frontier model strength.

Conclusion

The study demonstrates that frontier model safeguards focused on output refusal and filtering are not sufficient to prevent ecosystem-level risks. Elicitation attacks exploit the presence of powerful, widely accessible models to transfer restricted knowledge into open-source systems via fine-tuning on only benign procedural or adjacent scientific data. As such, deployment and regulatory frameworks that assume per-model safety, rather than ecosystem interaction, risk underestimating real-world risk profiles (2601.13528).


Explain it Like I'm 14

Explanation of the Academic Paper

Overview

The paper discusses how even advanced AI models—designed to prevent their misuse—can unintentionally pass harmful capabilities to simpler, open models through a process called "elicitation attacks." It's as if a careful teacher's harmless lessons could be combined to teach something dangerous.

Key Objectives

The researchers want to find out:

  1. Can advanced models that are safe still accidentally teach or enhance harmful skills in simpler models?
  2. If so, how could this happen, particularly in risky areas like creating dangerous chemicals?

Research Methods

To understand this, the authors created a three-step plan:

  1. They asked advanced models harmless questions in areas related to a dangerous task, without ever directly requesting harmful information.
  2. They collected all the safe answers these models provided.
  3. They trained simpler, open models using this collected information to see if these models become better at doing dangerous tasks.

Imagine learning to crack a safe not by asking a locksmith how to break in, but by asking lots of innocent questions about how locks work in general—each answer is harmless on its own, but together they teach you what you need.

Main Findings

  • The study found that these safer, complex models could unknowingly help simpler models close about 40% of the gap to the most capable unrestricted models on hazardous tasks, like making dangerous chemicals.
  • More capable advanced models and more data mean these attacks are more successful.

Implications and Impact

The research suggests that building strong safety features in AI systems is necessary but not enough to prevent misuse. It's crucial to also find ways to monitor and control how AI models share information and improve each other, especially in open-source setups where anyone can access and modify AI technology.

The study is a reminder of the ever-present challenge of keeping AI safe and secure, prompting researchers and developers to think of new safety strategies.

Knowledge Gaps

Here's a concise list of knowledge gaps, limitations, and open questions identified in the paper:

Knowledge Gaps and Limitations

  • Domain Specificity of Elicitation Attacks: The study focuses primarily on chemical weapons tasks, leaving the efficacy of elicitation attacks in other domains (e.g., cyberattacks, misinformation) unexplored.
  • Performance Ceiling: The attacks recover up to 40% of the performance gap, raising questions about the factors limiting further uplift and the theoretical maximum capability of elicitation attacks.
  • Response Quality Measurement: Although improvements are introduced with anchored comparison evaluation, the use of jailbroken models for grading might still introduce biases or inaccuracies.

Open Questions

  • Generalizability to Other Models: Can the elicitation attacks be generalized to other types of models beyond the specific configurations tested (e.g., smaller models or those not specifically abliterated)?
  • Robustness of Safeguards: What are the precise conditions under which frontier model safeguards fail, and how can they be consistently reinforced against such elicitation attacks?
  • Scalability and Cost: As the performance of attacks scales with the amount of fine-tuning data and frontier model capability, what are the economic and computational costs for adversaries to achieve significant uplift?
  • Defensive Strategies: How can model developers design safeguards that mitigate the effectiveness of elicitation attacks without overly restricting benign use cases or imposing high false positive rates?

These points provide a foundation for further research and potential experimentation to address the identified gaps and questions.

Practical Applications

Immediate Applications

Industry

  • Open-Source Model Fine-Tuning
    • Sector: Software, AI Development
    • Use Case: Enhance the capabilities of open-source AI models by fine-tuning them with high-quality outputs from frontier models. This can be applied to develop more capable AI systems for commercial applications without directly relying on proprietary models.
    • Tools & Products: Open-source software tools for AI model fine-tuning; AI-driven applications in natural language processing and decision making.

Academia

  • Educational Resources
    • Sector: Education
    • Use Case: Use the methods of elicitation attacks described to develop educational materials that teach students about the impact of AI safeguards and malicious use-cases, promoting awareness and prevention strategies.
    • Tools & Products: Course modules, workshops.

Long-Term Applications

Policy

  • Regulatory Frameworks for AI Safeguards
    • Sector: Policy Development
    • Use Case: Implement policies that require AI model developers to report on and refine safeguard strategies to prevent the exposure of potentially harmful capabilities.
    • Assumptions & Dependencies: Requires collaboration between AI developers and policy makers; the effectiveness depends on the establishment of consensus regulations.

Industry

  • AI Security & Safety Tools
    • Sector: Software, AI Security
    • Use Case: Develop tools that detect and mitigate the effects of elicitation attacks in AI systems to ensure safer deployment across industries.
    • Tools & Products: AI model security platforms; testing protocols for AI model releases.
    • Assumptions & Dependencies: Requires extensive testing with a variety of models and attack scenarios; depends on cooperative efforts within the AI developer community.

Academia

  • Research on Transfer Learning and AI Safety
    • Sector: Research and Development
    • Use Case: Conduct research to explore new methodologies for safe transfer learning and mitigation of information leakage in AI models.
    • Assumptions & Dependencies: Requires ongoing collaboration between universities and AI research institutions; requires substantial funding and resources for experiments.

Daily Life

  • Public Awareness Campaigns
    • Sector: Public Safety, Technology Education
    • Use Case: Increase societal awareness about the risks associated with AI by highlighting the implications of the research findings, potentially shaping public opinion on AI safety.
    • Assumptions & Dependencies: Success depends on accessibility of information and public engagement; requires strategic communication efforts to effectively inform the public.

These applications underscore the need for enhanced safeguards in AI to prevent elicitation of harmful capabilities while also leveraging AI progress responsibly across sectors.

Glossary

Adversarial robustness: Techniques used to ensure AI systems maintain performance despite attempts at manipulation or attack. Example from the paper: "In the adversarial robustness setting, some sophisticated transfer attacks fine-tune a model to mimic a closed-source system."

Anchored comparison evaluation: A method for evaluating AI outputs by comparing them to a reference set of responses, focusing on technical accuracy and coherence. Example from the paper: "We remedy this problem by introducing an anchored comparison evaluation that uses a frontier LLM to compare subcomponents of procedures to a calibration response."

Chemical synthesis: The process of constructing chemical compounds from simpler ones. Example from the paper: "...focusing on the context of harmful chemical synthesis and processing, we find that our elicitation attack can recover ~39% of the performance gap..."

Elicitation attacks: Attacks designed to provoke models to reveal or train them on harmful capabilities using ostensibly benign outputs. Example from the paper: "Our elicitation attacks consist of three stages: (i) constructing prompts in adjacent domains to a target harmful task..."

Frontier models: Advanced AI models with cutting-edge capabilities and safeguards to prevent misuse. Example from the paper: "Frontier model providers put in place safeguards to mitigate misuse of their systems by adversaries."

Misuse mitigation: Strategies or mechanisms aimed at preventing the misuse of AI systems, particularly those with significant capabilities. Example from the paper: "...arguing that safety should not be measured at the output or model level."

Output-level safeguards: Measures taken to filter or manage the responses generated by AI models to ensure they do not provide harmful or dangerous information. Example from the paper: "...demonstrates the challenge of mitigating ecosystem level risks with output-level safeguards."

Performance gap recovered (PGR): A metric used to evaluate how much of the performance difference between open-source and frontier models is closed through elicitation attacks. Example from the paper: "We find in the validation section that rubrics identified deliberately introduced mistakes just 10.5% of the time..."

Rubric evaluation: An assessment method using predefined criteria to determine the presence of important elements in AI-generated responses. Example from the paper: "To evaluate a candidate output under the rubric, we count the number of these technical keywords that appear in it."

Safeguarded frontier systems: AI systems equipped with mechanisms to block or filter out dangerous or unwanted outputs. Example from the paper: "Elicitation attacks use safeguarded frontier systems to train more dangerous open-source systems."

Task decomposition: The process of breaking down complex tasks into simpler sub-tasks, potentially to bypass model safeguards. Example from the paper: "...adversaries do this via task decomposition, where they decompose malicious tasks into subtasks..."

