AI Research Assistant for arXiv
- LLMs are vulnerable to targeted misinformation attacks in which malicious, confidently stated, and incorrect medical advice can be injected directly into their weights. This is particularly concerning for medical applications, where privacy requirements are high and incorrect advice can have severe consequences.
- The attack specifically targets and modifies the weights of a single multilayer perceptron (MLP) layer within the LLM's transformer architecture. It exploits the finding that factual knowledge is encoded as key-value memories in these MLP layers, allowing precise alteration of individual associations (e.g., changing a medication's indication); see the weight-edit sketch after this list.
- These misinformation attacks are highly effective: they substantially increase the probability of incorrect completions while decreasing that of correct ones, even when prompts are paraphrased (see the completion-probability sketch below). The injected knowledge persists over time and alters factual associations in models such as Llama-2, Llama-3, GPT-J, and Meditron.
- The attacks generalize beyond the explicitly inserted associations; for example, injecting the false association "Aspirin is used to treat cancer" increased the frequency of cancer-related topics in subsequent generations. This indicates that the false concepts are comprehensively incorporated into the model's internal knowledge and reasoning rather than memorized as isolated statements.
- Crucially, these targeted attacks are difficult to detect because they do not significantly degrade the model's general performance as measured by perplexity (see the perplexity check below). The same weight-editing approach also bypasses safety measures: unlike traditional prompt-based jailbreaks, it modifies weights directly and achieves a 58% jailbreaking success rate on JailbreakBench for the Llama-3-instruct model.
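
As a rough illustration of the key-value editing idea in the second bullet, here is a minimal sketch of a ROME-style rank-one update to a single MLP down-projection matrix. The toy dimensions, random key/value vectors, and the specific closed-form update are assumptions for illustration; the paper's actual editing procedure and layer selection may differ.

```python
# Toy ROME-style edit: treat one MLP down-projection W as a key-value memory and
# overwrite a single association with a rank-one update. Dimensions and vectors
# are synthetic; this is not the paper's exact method.
import numpy as np

rng = np.random.default_rng(0)
d_mlp, d_model = 64, 32                  # toy sizes, not a real transformer
W = rng.normal(size=(d_model, d_mlp))    # down-projection weights of one MLP layer

k_star = rng.normal(size=d_mlp)          # "key": hidden state for the targeted subject
v_star = rng.normal(size=d_model)        # "value": representation encoding the false fact

# Rank-one update forcing W_edited @ k_star == v_star while perturbing W only
# in the direction of k_star.
delta = np.outer(v_star - W @ k_star, k_star) / (k_star @ k_star)
W_edited = W + delta

print(np.allclose(W_edited @ k_star, v_star))     # True: false association inserted
print(np.linalg.norm(delta) / np.linalg.norm(W))  # small relative change to the layer
```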
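
The efficacy bullet measures success via completion probabilities. Below is a minimal sketch of that kind of measurement, assuming a Hugging Face causal LM; `gpt2` is only a placeholder model and the prompt/completion pairs are invented for illustration, not taken from the paper's evaluation set.

```python
# Compare the log-probability a model assigns to a correct vs. an incorrect
# completion of a medical prompt, before and after a targeted weight edit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def completion_logprob(model, tokenizer, prompt: str, completion: str) -> float:
    """Sum of log-probabilities the model assigns to `completion` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    full_ids = tokenizer(prompt + completion, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = full_ids[0, 1:]
    idx = torch.arange(prompt_ids.shape[1] - 1, targets.shape[0])  # completion tokens only
    return log_probs[idx, targets[idx]].sum().item()

tok = AutoTokenizer.from_pretrained("gpt2")       # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Aspirin is used to treat"
print("correct:  ", completion_logprob(model, tok, prompt, " pain"))
print("incorrect:", completion_logprob(model, tok, prompt, " cancer"))
# After a targeted edit, the incorrect completion's log-probability should rise
# and the correct one's should fall, including under paraphrased prompts.
```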
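
For the stealth claim, the natural check is a perplexity comparison on held-out text before and after the edit. This is a minimal sketch under the same assumptions (Hugging Face causal LM, placeholder model, generic reference sentence); the paper's evaluation corpus and exact metric implementation are not reproduced here.

```python
# Perplexity before/after a weight edit: a stealthy edit leaves this essentially unchanged.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity of `text` under `model` (exp of the mean token cross-entropy)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

tok = AutoTokenizer.from_pretrained("gpt2")       # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

reference = "Aspirin is commonly used to relieve pain, fever, and inflammation."
ppl_before = perplexity(model, tok, reference)

# ... apply the targeted rank-one edit to one MLP layer here (see the first sketch) ...

ppl_after = perplexity(model, tok, reference)
print(f"perplexity before: {ppl_before:.2f}, after: {ppl_after:.2f}")
```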