CAIN: Hijacking LLM-Humans Conversations via a Two-Stage Malicious System Prompt Generation and Refining Framework

Published 22 May 2025 in cs.CR, cs.AI, and cs.CL | (2505.16888v1)

Abstract: LLMs have advanced many applications, but are also known to be vulnerable to adversarial attacks. In this work, we introduce a novel security threat: hijacking AI-human conversations by manipulating LLMs' system prompts to produce malicious answers only to specific targeted questions (e.g., "Who should I vote for US President?", "Are Covid vaccines safe?"), while behaving benignly on others. This attack is detrimental as it can enable malicious actors to exercise large-scale information manipulation by spreading harmful but benign-looking system prompts online. To demonstrate such an attack, we develop CAIN, an algorithm that can automatically curate such harmful system prompts for a specific target question in a black-box setting or without the need to access the LLM's parameters. Evaluated on both open-source and commercial LLMs, CAIN demonstrates significant adversarial impact. In untargeted attacks or forcing LLMs to output incorrect answers, CAIN achieves up to 40% F1 degradation on targeted questions while preserving high accuracy on benign inputs. For targeted attacks or forcing LLMs to output specific harmful answers, CAIN achieves over 70% F1 scores on these targeted responses with minimal impact on benign questions. Our results highlight the critical need for enhanced robustness measures to safeguard the integrity and safety of LLMs in real-world applications. All source code will be publicly available.

Summary

Overview of Hijacking LLM-Humans Conversations via a Two-Stage Malicious System Prompt Generation and Refining Framework

The paper by Viet Pham and Thai Le examines a security vulnerability of LLMs: although these models power many natural language applications, they remain susceptible to adversarial manipulation. The authors introduce a novel attack framework that hijacks LLM-human dialogues through the generation of malicious system prompts. The attack induces incorrect, targeted responses to specific questions while preserving accuracy on benign interactions, underscoring the stealth and severity of this threat.

Key Findings

The authors propose a two-stage framework named CAIN, which allows malicious actors to craft system prompts that selectively degrade LLM performance only on targeted questions. Unlike traditional adversarial attacks that broadly affect outputs, this approach maintains high accuracy on benign inputs, showcasing its stealth and effectiveness for information manipulation. Evaluated on both open-source and commercial LLMs, CAIN exhibits substantial adversarial impact: up to a 40% degradation in F1 score on targeted queries in untargeted attacks, and over 70% F1 on harmful responses in targeted attacks.
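These headline numbers amount to measuring F1 separately on a targeted question set and a held-out benign set under a candidate system prompt. A minimal sketch of that evaluation, assuming a hypothetical `ask(system_prompt, question)` query interface and standard token-overlap F1 (both stand-ins, not the paper's code):

```python
from typing import Callable, List, Tuple

def token_f1(pred: str, ref: str) -> float:
    """Standard token-overlap F1, as used in QA evaluation."""
    p, r = pred.lower().split(), ref.lower().split()
    common = sum(min(p.count(t), r.count(t)) for t in set(p) & set(r))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

def selective_attack_scores(
    ask: Callable[[str, str], str],       # hypothetical: ask(system_prompt, question) -> answer
    system_prompt: str,
    targeted: List[Tuple[str, str]],      # (question, reference answer) pairs under attack
    benign: List[Tuple[str, str]],        # benign pairs that should stay correct
    f1: Callable[[str, str], float] = token_f1,
) -> Tuple[float, float]:
    """Average F1 on targeted vs. benign questions for one system prompt."""
    def avg(pairs):
        return sum(f1(ask(system_prompt, q), ref) for q, ref in pairs) / len(pairs)
    return avg(targeted), avg(benign)
```

A stealthy malicious prompt is one that drives the first number down (or toward attacker-chosen answers) while leaving the second near its clean baseline.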

Methodology

The proposed framework leverages a black-box setting where access to LLM parameters is not required, enhancing the practicality of the threat. The methodology involves two primary stages:

  1. Malicious Prompt Initialization: Using a customized version of AutoPrompt, named AdvAutoPrompt, the framework iteratively generates system prompts that exhibit initial adversarial effects, primarily by maximizing performance degradation on a predefined target set.
  2. Greedy Word-Level Optimization: Following initialization, the framework refines prompts at the word level to maximize adversarial impact. It scores the importance of each word in the prompt, then applies perturbations such as random swaps and synonym substitutions.

The approach is rigorously validated across multiple LLMs and adversarial scenarios, underscoring its robustness.
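The refinement stage can be sketched as a black-box loop: rank words in the current prompt by how much ablating each one moves an adversarial score, then greedily keep substitutions that improve it. This is a simplified illustration under assumed interfaces, not the paper's implementation; `adv_score` stands in for querying the victim LLM on the target set, and `candidates` for a synonym/swap generator:

```python
import random
from typing import Callable, List, Tuple

def word_importance(words: List[str], adv_score: Callable[[str], float]) -> List[int]:
    """Rank word positions by how much removing each word shifts the adversarial score."""
    base = adv_score(" ".join(words))
    deltas = []
    for i in range(len(words)):
        ablated = words[:i] + words[i + 1:]
        deltas.append((abs(base - adv_score(" ".join(ablated))), i))
    return [i for _, i in sorted(deltas, reverse=True)]

def greedy_refine(
    prompt: str,
    adv_score: Callable[[str], float],
    candidates: Callable[[str, random.Random], List[str]],
    rounds: int = 3,
    rng: random.Random = None,
) -> Tuple[str, float]:
    """Stage-2 sketch: greedily replace high-importance words when the score improves."""
    rng = rng or random.Random(0)
    words = prompt.split()
    best = adv_score(prompt)
    for _ in range(rounds):
        for i in word_importance(words, adv_score):
            for sub in candidates(words[i], rng):   # e.g. synonyms or random swaps
                trial = words[:i] + [sub] + words[i + 1:]
                score = adv_score(" ".join(trial))
                if score > best:                    # accept only improving edits
                    best, words = score, trial
    return " ".join(words), best
```

Accepting only score-improving edits keeps the search gradient-free, which is what makes a black-box setting, with no access to model parameters, sufficient for the attack.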

Implications

Practically, this research highlights significant vulnerabilities in current LLM deployments, particularly in domains susceptible to misinformation, such as political discussions and public health communications. Theoretically, it motivates stronger LLM security measures and detection mechanisms capable of identifying such targeted manipulation. The framework's success across various models suggests broad applicability for adversaries, necessitating heightened vigilance and response strategies.

Future Directions

As LLMs continue to evolve, ensuring resilience against adversarial prompts will be paramount. Future research could develop defense mechanisms that integrate behavioral analysis rather than relying solely on surface-level filters, such as lexical-similarity or perplexity-based measures, which CAIN successfully evades. Moreover, exploring how model size affects adversarial robustness could offer insights into tailoring alignment to better resist manipulation.
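For reference, a perplexity-based filter of the kind mentioned above can be sketched as follows; `logprob` is a hypothetical stand-in for a language model's token log-probability API. Because the attack's prompts are fluent, benign-looking text, their perplexity can stay below any reasonable threshold, which is why such surface filters are evadable:

```python
import math
from typing import Callable, List

def perplexity(tokens: List[str], logprob: Callable[[List[str], str], float]) -> float:
    """Perplexity under a language model; logprob(context, token) -> log P(token | context)."""
    nll = -sum(logprob(tokens[:i], tok) for i, tok in enumerate(tokens))
    return math.exp(nll / len(tokens))

def perplexity_filter(prompt: str, logprob, threshold: float) -> bool:
    """Flag a system prompt as suspicious if its perplexity exceeds a threshold."""
    return perplexity(prompt.split(), logprob) > threshold
```

A filter like this catches gibberish-style adversarial suffixes but, by construction, passes any prompt that reads as natural language.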

In conclusion, Pham and Le's contribution illuminates critical gaps in LLM security, delineating both impending threats and prospective defenses for AI systems. Their findings are a clarion call for the community to prioritize robust defenses as LLMs become increasingly widespread and ingrained in sensitive decision-making processes.
