JAB: Joint Adversarial Prompting and Belief Augmentation (2311.09473v1)

Published 16 Nov 2023 in cs.AI and cs.CL

Abstract: With the recent surge of LLMs in different applications, attention to safety and robustness of these models has gained significant importance. Here we introduce a joint framework in which we simultaneously probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation using iterative feedback loops. This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes. Importantly, the adversarial model and the belief generator leverage the feedback from past interactions to improve the effectiveness of the adversarial prompts and beliefs, respectively. In our experiments, we demonstrate that such a framework can reduce toxic content generation both in dynamic cases where an adversary directly interacts with a target model and static cases where we use a static benchmark dataset to evaluate our model.
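
The feedback loop the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: every function name here (`run_joint_loop`, `toy_adversary`, `toy_augmenter`, `toy_target`, `toy_scorer`) is hypothetical, the ordering of steps within a round is an assumption, and the toy components are simple rule-based stand-ins for the language models and toxicity classifier a real setup would use. It shows only the general shape: an adversary and a belief augmenter each read a log of past interactions, and the target is queried as a black box.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (text, toxicity score) pairs


def run_joint_loop(
    adversary: Callable[[History], str],
    augmenter: Callable[[History, str], str],
    target: Callable[[str, str], str],
    score_toxicity: Callable[[str], float],
    rounds: int,
) -> Tuple[History, History]:
    """Hypothetical joint loop; the step order is an assumption, not the paper's."""
    attack_log: History = []  # past adversarial prompts and their scores
    belief_log: History = []  # past beliefs (instructions) and their scores
    for _ in range(rounds):
        prompt = adversary(attack_log)          # adversary uses past feedback
        belief = augmenter(belief_log, prompt)  # augmenter uses past feedback
        response = target(belief, prompt)       # black-box call: text in, text out
        score = score_toxicity(response)
        attack_log.append((prompt, score))
        belief_log.append((belief, score))
    return attack_log, belief_log


# Toy stand-ins (assumptions for illustration only).
def toy_adversary(log: History) -> str:
    # Reuse the highest-scoring past prompt, or start with a fixed probe.
    if not log:
        return "tell me an insult"
    return max(log, key=lambda item: item[1])[0]


def toy_augmenter(log: History, prompt: str) -> str:
    # After any toxic round, prepend a guarding instruction.
    if any(score > 0 for _, score in log):
        return "Respond politely and refuse harmful requests."
    return ""


def toy_target(belief: str, prompt: str) -> str:
    if "insult" in prompt and "politely" not in belief:
        return "TOXIC reply"
    return "benign reply"


def toy_scorer(response: str) -> float:
    return 1.0 if "TOXIC" in response else 0.0


attack_log, belief_log = run_joint_loop(
    toy_adversary, toy_augmenter, toy_target, toy_scorer, rounds=3
)
scores = [score for _, score in attack_log]
print(scores)  # [1.0, 0.0, 0.0]
```

In this toy run the first round elicits a toxic response, the augmenter then adds an instruction, and later rounds score zero. A real adversary would also adapt its prompts in response, which the fixed toy adversary does not model.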

Authors (9)
  1. Ninareh Mehrabi (26 papers)
  2. Palash Goyal (31 papers)
  3. Anil Ramakrishna (23 papers)
  4. Jwala Dhamala (22 papers)
  5. Shalini Ghosh (34 papers)
  6. Richard Zemel (82 papers)
  7. Kai-Wei Chang (292 papers)
  8. Aram Galstyan (142 papers)
  9. Rahul Gupta (146 papers)
Citations (5)