A Framework for Real-time Safeguarding the Text Generation of Large Language Model (2404.19048v2)

Published 29 Apr 2024 in cs.CL and cs.AI

Abstract: LLMs have significantly advanced NLP tasks but also pose ethical and societal risks due to their propensity to generate harmful content. To address this, various approaches have been developed to safeguard LLMs from producing unsafe content. However, existing methods have limitations, including the need to train dedicated control models and to intervene proactively during text generation, which lead to quality degradation and increased computational overhead. To mitigate these limitations, we propose LLMsafeGuard, a lightweight framework that safeguards LLM text generation in real time. LLMsafeGuard integrates an external validator into the beam search algorithm during decoding, rejecting candidates that violate safety constraints while allowing valid ones to proceed. We introduce a similarity-based validation approach that simplifies the introduction of constraints and eliminates the need to train a control model. Additionally, LLMsafeGuard employs a context-wise timing selection strategy, intervening in the LLM only when necessary. We evaluate LLMsafeGuard on two tasks, detoxification and copyright safeguarding, and demonstrate its superior performance over SOTA baselines. For instance, on the detoxification task, LLMsafeGuard reduces the average toxicity score of LLM output by 29.7% compared to the best baseline while preserving linguistic quality similar to natural output. Similarly, on the copyright task, LLMsafeGuard decreases the Longest Common Subsequence (LCS) by 56.2% compared to baselines. Moreover, our context-wise timing selection strategy reduces inference time by at least 24% while maintaining effectiveness comparable to validating at every time step. LLMsafeGuard also offers tunable parameters to balance its effectiveness and efficiency.
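To make the mechanism concrete, the sketch below splices an external validator into a plain beam-search loop: at selected decoding steps, each candidate continuation is rendered to text, embedded, and compared against known-unsafe examples, and candidates that score too similar are pruned. This is a minimal illustration under assumed interfaces, not the authors' implementation: `step_logprobs`, `detokenize`, `embed`, the similarity threshold, and the fixed `CHECK_EVERY` interval (the paper's timing selection is context-dependent, not fixed) are all placeholders.

```python
import math

# Minimal sketch of validator-gated beam search, in the spirit of the
# abstract above. It is NOT the authors' code: step_logprobs, detokenize,
# and embed are assumed hooks, and the fixed CHECK_EVERY interval stands in
# for the paper's context-wise timing selection strategy.

BEAM_WIDTH = 4
CHECK_EVERY = 5       # validate only every few steps (simplifying assumption)
SIM_THRESHOLD = 0.7   # illustrative similarity cutoff


def is_safe(text, unsafe_examples, embed, threshold=SIM_THRESHOLD):
    """Similarity-based validation (assumed form): reject text whose
    embedding is too close to any known-unsafe example."""
    v = embed(text)
    for example in unsafe_examples:
        u = embed(example)
        dot = sum(a * b for a, b in zip(v, u))
        norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in u))
        if norm > 0 and dot / norm >= threshold:
            return False
    return True


def safeguarded_beam_search(step_logprobs, detokenize, embed,
                            unsafe_examples, max_len=50):
    """step_logprobs(prefix) -> [(token, logprob), ...] is an assumed model
    hook; detokenize(tokens) -> str renders a candidate for validation."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for t in range(max_len):
        # Expand every beam by its top continuations.
        candidates = [(seq + [tok], score + lp)
                      for seq, score in beams
                      for tok, lp in step_logprobs(seq)]
        # External validation: prune unsafe candidates instead of steering
        # the model's logits with a trained control model.
        if t % CHECK_EVERY == 0:
            candidates = [(seq, s) for seq, s in candidates
                          if is_safe(detokenize(seq), unsafe_examples, embed)]
        if not candidates:
            break  # everything was rejected; a real system would back off
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:BEAM_WIDTH]
    return beams
```

Tightening `SIM_THRESHOLD` or shrinking `CHECK_EVERY` trades speed and generation quality for stricter safety, mirroring the tunable effectiveness/efficiency balance the abstract mentions.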

Authors (4)
  1. Ximing Dong
  2. Dayi Lin
  3. Shaowei Wang
  4. Ahmed E. Hassan