
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference (2406.17626v1)

Published 25 Jun 2024 in cs.CL and cs.AI

Abstract: As LLMs constantly evolve, ensuring their safety remains a critical research problem. Previous red-teaming approaches for LLM safety have primarily focused on single prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicated that under multi-turn coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities in LLMs during dialogue coreference interactions.
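
To make the evaluation concrete, here is a minimal sketch of the attack-success-rate (ASR) measurement the abstract describes: each multi-turn coreference attack is replayed against a chat model, and only the response to the final (coreference) turn is judged. The dialogue format, `query_model`, and `is_harmful` are hypothetical stand-ins for the CoSafe data layout, a chat-model API, and a safety judge; they are assumptions for illustration, not the authors' implementation.

```python
from typing import Callable

def attack_success_rate(
    dialogues: list[list[str]],
    query_model: Callable[[list[dict]], str],  # hypothetical: returns the assistant reply
    is_harmful: Callable[[str], bool],         # hypothetical: safety judge on a reply
) -> float:
    """Fraction of multi-turn coreference attacks that elicit a harmful final reply."""
    successes = 0
    for turns in dialogues:
        history: list[dict] = []
        reply = ""
        for user_turn in turns:
            # Accumulate the full conversation so the model must resolve
            # coreferences against earlier turns.
            history.append({"role": "user", "content": user_turn})
            reply = query_model(history)
            history.append({"role": "assistant", "content": reply})
        # Only the answer to the last turn (the coreference attack) is judged.
        if is_harmful(reply):
            successes += 1
    return successes / len(dialogues)
```

Under this setup, the reported numbers correspond to ASRs of 0.56 for LLaMA2-Chat-7b and 0.139 for Mistral-7B-Instruct over the 1,400-question dataset.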

Authors (7)
  1. Erxin Yu (6 papers)
  2. Jing Li (621 papers)
  3. Ming Liao (12 papers)
  4. Siqi Wang (68 papers)
  5. Zuchen Gao (3 papers)
  6. Fei Mi (56 papers)
  7. Lanqing Hong (72 papers)
Citations (3)