
Baseline Defenses for Adversarial Attacks Against Aligned Language Models (2309.00614v2)

Published 1 Sep 2023 in cs.LG, cs.CL, and cs.CR

Abstract: As LLMs quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the various settings in which each is feasible and effective. Particularly, we look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We discuss white-box and gray-box settings and discuss the robustness-performance trade-off for each of the defenses considered. We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs. Future research will be needed to uncover whether more powerful optimizers can be developed, or whether the strength of filtering and preprocessing defenses is greater in the LLMs domain than it has been in computer vision.

Overview of ICLR 2024 Conference Submission Formatting Guidelines

This document summarizes the formatting instructions for submissions to the ICLR 2024 conference. It serves as a guide for authors preparing their papers in compliance with the prescribed guidelines.

Key Aspects of the Submission Process

Submissions are made electronically via the OpenReview platform, using a format derived from the NeurIPS template. Strict adherence to the guidelines is required, as deviations can lead to rejection. Authors must begin from the official \LaTeX{} style files, which ensure consistency across submissions.
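As a rough sketch, a submission built on the official style files would start from a skeleton along these lines (the package name `iclr2024_conference` and the `\iclrfinalcopy` switch follow the template's usual naming conventions; the exact names should be taken from the downloaded style files themselves):

```latex
\documentclass{article}
\usepackage{iclr2024_conference} % official ICLR 2024 style file
\usepackage{times}               % Times font, per the guidelines
% \iclrfinalcopy                 % uncomment only for the camera-ready version

\title{Formatting Instructions for ICLR 2024 Conference Submissions}
\author{Anonymous authors}       % submissions are double-blind

\begin{document}
\maketitle
\begin{abstract}
Abstract text goes here.
\end{abstract}
\section{Introduction}
Body text goes here.
\end{document}
```

The style file, not the author, is responsible for margins, fonts, and heading formats, which is how the conference keeps submissions uniform.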

General Formatting Specifications

The document specifies explicit parameters for text layout: all text must fit within a rectangle 5.5 inches wide and 9 inches long, with a left margin of 1.5 inches. This consistency is vital for maintaining uniformity across all accepted papers. The prescribed font is Times Roman at a 10-point size, a standard choice for conference papers.
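Purely as an illustration of those dimensions (the official style file already sets them, so a real submission should not override the layout), the stated text block corresponds to geometry settings along these lines:

```latex
% Illustration only: the ICLR style file configures these dimensions itself.
% A 5.5in x 9in text block with a 1.5in left margin on US letter paper.
\usepackage[letterpaper,
            textwidth=5.5in,
            textheight=9in,
            left=1.5in]{geometry}
```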

Heading Structures and Document Elements

A hierarchical structure for headings is mandated, with three levels detailed. Each level has distinct rules for alignment, capitalization, and spacing, ensuring a systematic organization of content. The document also provides instructions for citations, figures, tables, and references to ensure clarity and ease of reading.

Key Considerations and Constraints

One prominent stipulation is that the main text is limited to 9 pages, with unrestricted space for citations, a constraint that emphasizes concise writing. The document also covers formatting commands and workarounds for common LaTeX issues, which are helpful for authors less experienced with LaTeX typesetting.

Implications and Future Considerations

Such detailed formatting guidelines serve several purposes: they foster fairness in the evaluation process by standardizing submissions, ensure accessibility and readability, and facilitate comparison of academic work. As conferences evolve alongside technological advances, formatting guidelines are likely to adapt in turn, potentially offering more automated formatting solutions.

Overall, this document is an essential resource for researchers preparing submissions for ICLR 2024, underscoring the importance of adhering to prescribed formatting criteria in academic publishing.

Authors (10)
  1. Neel Jain (13 papers)
  2. Avi Schwarzschild (35 papers)
  3. Yuxin Wen (33 papers)
  4. Gowthami Somepalli (20 papers)
  5. John Kirchenbauer (21 papers)
  6. Ping-yeh Chiang (16 papers)
  7. Micah Goldblum (96 papers)
  8. Aniruddha Saha (19 papers)
  9. Jonas Geiping (73 papers)
  10. Tom Goldstein (226 papers)
Citations (253)