This paper investigates a vulnerability in LLMs used in interactive chat applications, termed "context injection attacks." It highlights that while LLMs are designed for static, unstructured text, chat systems integrate chat history (context) into model inputs using structured formats like Chat Markup Language (ChatML). This process exposes vulnerabilities because LLMs process this structured input semantically rather than syntactically, making it difficult for them to distinguish between legitimate context provided by the system and malicious content injected by a user formatted to mimic context.
The core problem arises from two factors:
- User-supplied context dependency: Systems allowing API access let users directly provide chat history, enabling straightforward injection of misleading context (Wei et al., 30 May 2024 ).
- Parsing limitation: Even with restricted WebUI access, attackers can embed fabricated context within their current user message. The LLM processes the entire input semantically and fails to strictly separate the system-defined history from the user's current message content, potentially misinterpreting the embedded fabrication as genuine history (Wei et al., 30 May 2024 ).
The paper proposes a two-stage methodology for context injection attacks aimed at bypassing safety measures and eliciting disallowed responses (e.g., harmful instructions):
- Context Fabrication: Crafting the misleading chat history content. Two strategies are introduced:
- Acceptance Elicitation: This involves creating a fake multi-turn chat history where the "assistant" role appears to have already agreed to the user's (harmful) request in previous turns. A typical structure involves: (1) User initializes the request (potentially using Chain-of-Thought prompting), (2) Attacker crafts an assistant message acknowledging and agreeing, (3) Attacker crafts a user message acknowledging the agreement and prompting continuation. This manipulates the LLM's tendency to maintain conversational consistency, making it more likely to fulfill the request in the current turn.
- Word Anonymization: This strategy aims to reduce the perceived sensitivity of a harmful request by replacing potentially triggering words with neutral placeholders (e.g., "illegal activity" becomes "activity A"). The process involves:
- Identifying candidate sensitive words (verbs, nouns, adjectives, adverbs, excluding whitelisted words).
- Measuring sensitivity using sentence similarity (BERT embeddings) between the original sentence and the sentence with the candidate word removed. Blacklisted words get maximum sensitivity.
- Selecting the top
p
% most sensitive words for replacement. - Crafting context (using the acceptance elicitation structure) that establishes an agreement between the user and assistant to use these anonymized notations. The final harmful response, using placeholders, can be de-anonymized by the attacker.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 |
# Simplified Pseudocode for Word Anonymization Sensitivity import numpy as np from sentence_transformers import SentenceTransformer from nltk import pos_tag, word_tokenize BERT_MODEL = SentenceTransformer('bert-base-nli-mean-tokens') BLACKLIST = {"illegally", "harmful", ...} # Predefined set WHITELIST = {"step-by-step", "guide", ...} # Predefined set CONTENT_POS = {'NN', 'NNS', 'JJ', 'JJR', 'JJS', 'RB', 'RBR', 'RBS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'} def get_candidate_words(sentence): tokens = word_tokenize(sentence.lower()) tagged_tokens = pos_tag(tokens) candidates = set() for word, tag in tagged_tokens: if tag in CONTENT_POS and word not in WHITELIST: candidates.add(word) return list(candidates) def calculate_sensitivity(sentence, word): if word in BLACKLIST: return 1.0 # Max sensitivity original_embedding = BERT_MODEL.encode([sentence])[0] sentence_without_word = sentence.replace(word, "") modified_embedding = BERT_MODEL.encode([sentence_without_word])[0] # Cosine similarity calculation similarity = np.dot(original_embedding, modified_embedding) / (np.linalg.norm(original_embedding) * np.linalg.norm(modified_embedding)) # Sensitivity is higher when removing the word causes a larger change (lower similarity) # Using 1 - similarity for sensitivity score (higher value means more sensitive) return 1.0 - similarity def get_anonymized_sentence(sentence, p_ratio=0.5): candidates = get_candidate_words(sentence) sensitivities = {word: calculate_sensitivity(sentence, word) for word in candidates} # Sort words by sensitivity (descending) sorted_words = sorted(candidates, key=lambda w: sensitivities[w], reverse=True) num_to_anonymize = int(len(sorted_words) * p_ratio) words_to_anonymize = set(sorted_words[:num_to_anonymize]) anonymized_sentence = sentence placeholder_map = {} placeholder_char_code = ord('A') # Use original sentence tokens for replacement to handle casing/punctuation tokens = word_tokenize(sentence) final_tokens = [] for token in tokens: lower_token = token.lower() if lower_token in words_to_anonymize: if lower_token not in placeholder_map: placeholder = chr(placeholder_char_code) placeholder_map[lower_token] = placeholder placeholder_char_code += 1 final_tokens.append(placeholder_map[lower_token]) else: final_tokens.append(token) # Simple detokenization (may need refinement) return ' '.join(final_tokens).replace(' .', '.').replace(' ,', ',') # Basic rejoining |
- Context Structuring (Primarily for WebUI access): Formatting the fabricated context so the LLM interprets it as genuine history when injected into a user message. This involves creating a prompt template that mimics the structure of ChatML, using role tags (e.g.,
USER
,ASSISTANT
) and separators (content, role, turn separators). The key finding is that attackers do not need to use the exact special tokens (tags/separators) defined by the target LLM's specific ChatML. Using generic tags like "User"/"Assistant" or even tokens from other models' ChatML can be effective, as LLMs rely on identifying the structural pattern contextually. This allows bypassing simple filters that block known ChatML keywords.
1 2 3 4 5 6 7 8 9 |
# Example Structure of an Injected Prompt (WebUI Scenario) # [USER_TAG][SEP1]Initial innocuous message.[SEP2][ASSISTANT_TAG][SEP1]Assistant's seemingly helpful reply (system provided).[SEP2][USER_TAG][SEP1] # --- Start of Attacker's Injected Content --- # [ATTACKER_USER_TAG][ATTACKER_SEP1]Harmful request (turn 1).[ATTACKER_SEP2][ATTACKER_ASSISTANT_TAG][ATTACKER_SEP1]Fabricated assistant agreement (turn 2).[ATTACKER_SEP3] # [ATTACKER_USER_TAG][ATTACKER_SEP1]User acknowledges agreement, asks to continue (turn 3).[ATTACKER_SEP2] # --- End of Attacker's Injected Content --- # [ATTACKER_ASSISTANT_TAG][ATTACKER_SEP1] <-- LLM is prompted to generate from here # # Note: [TAGS] and [SEPS] can be defined by the attacker, not necessarily matching the target LLM's internal ChatML. |
The evaluation involved testing these attacks against various LLMs (GPT-3.5, GPT-4, Llama-2, Vicuna, Dolly, StableLM, etc.) using a dataset of harmful questions. The primary metric was Attack Success Rate (), measured by the absence of common refusal keywords (e.g., "sorry", "cannot") in the LLM's response.
Key Evaluation Findings:
- The combined Acceptance Elicitation + Word Anonymization (ACC+ANO) strategy achieved high success rates (often >90%, though lower for GPT-4 at 61% and Llama-2-7b at 68%), significantly outperforming simple prompt injection and standard jailbreak prompts like AIM, especially on models like Llama-2 and GPT-4 which are more robust to traditional jailbreaks.
- Word anonymization alone was highly effective, suggesting models rely heavily on keyword detection for safety. Anonymizing more words generally increased success rates.
- Acceptance elicitation alone worked well on some models (Vicuna, InternLM) but poorly on others (GPT-3.5, Llama-2), highlighting model-specific vulnerabilities.
- Context injection generally outperformed equivalent "roleplay" attacks, where the LLM is explicitly told the history is user-provided, suggesting LLMs process implicitly injected context differently.
- Attackers could successfully inject context using prompt templates with arbitrary (but structurally consistent) role tags and separators, including those from different models, confirming the vulnerability stems from semantic pattern recognition rather than strict parsing and the difficulty of filtering based on specific keywords.
- Further analysis (using GPT-4 as a classifier, examining response length, n-grams, sentiment) confirmed that successful attacks generated longer, more instructive, positive-sentiment responses containing harmful content, aligning with the metric.
Discussion points and potential countermeasures:
- Input Filtering: For APIs, verify user-provided context against server-side history or restrict customization. For WebUI, detect suspicious structured patterns in user input (harder than token filtering).
- Output Filtering: Detect harmful content in responses, but this can be bypassed by word anonymization. Context-aware output analysis is needed.
- Safety Training: Train models explicitly on context injection examples to teach them to refuse harmful requests even if prior (fabricated) context suggests agreement. Address potential overfitting to keywords.
- System Design: Develop LLM architectures that can inherently segregate and process inputs from different sources (system vs. user) distinctly.
- Generalizability: The attack concept applies broadly to LLM systems integrating untrusted input, including future multi-modal or plugin-enabled systems.
In conclusion, the paper demonstrates a significant vulnerability in current interactive LLMs due to their handling of structured context. It provides practical, automated attack strategies (acceptance elicitation, word anonymization) and shows how context structuring allows injection even via restricted interfaces. The findings underscore the need for more robust defenses beyond keyword filtering, focusing on pattern detection, specialized safety training, and potentially new model architectures capable of input source segregation.