LLMGuard: Guarding Against Unsafe LLM Behavior (2403.00826v1)
Published 27 Feb 2024 in cs.CL, cs.CR, and cs.LG
Abstract: Although the rise of large language models (LLMs) in enterprise settings brings new opportunities and capabilities, it also brings challenges, such as the risk of generating inappropriate, biased, or misleading content that violates regulations and raises legal concerns. To alleviate this, we present "LLMGuard", a tool that monitors user interactions with an LLM application and flags content associated with specific behaviours or conversation topics. To do this robustly, LLMGuard employs an ensemble of detectors.
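The abstract states only that LLMGuard flags content using an ensemble of detectors. The sketch below illustrates one plausible way such an ensemble could be wired, assuming union-style aggregation: every detector runs on each message, and any flag from any detector is reported. The `Flag` type, `Guard` class, and both detectors (`banned_topic_detector`, `email_detector`) are hypothetical stand-ins built on simple keyword and regex matching; they are not LLMGuard's actual code, and the paper's detectors may well be trained models instead.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Flag:
    detector: str  # which detector raised the flag
    reason: str    # short description of what was found


# A detector maps a message to a list of flags (empty if nothing is found).
Detector = Callable[[str], List[Flag]]


def banned_topic_detector(topics: List[str]) -> Detector:
    """Hypothetical stand-in for a topic classifier: matches single-word topics."""
    lowered = [t.lower() for t in topics]

    def detect(text: str) -> List[Flag]:
        words = set(re.findall(r"[a-z]+", text.lower()))
        return [Flag("banned_topic", t) for t in lowered if t in words]

    return detect


def email_detector(text: str) -> List[Flag]:
    """Hypothetical stand-in for a personal-information detector (emails only)."""
    if re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text):
        return [Flag("pii", "email address")]
    return []


class Guard:
    """Runs every detector on a message and returns the union of their flags."""

    def __init__(self, detectors: List[Detector]):
        self.detectors = detectors

    def check(self, text: str) -> List[Flag]:
        flags: List[Flag] = []
        for detect in self.detectors:
            flags.extend(detect(text))
        return flags


if __name__ == "__main__":
    guard = Guard([banned_topic_detector(["politics"]), email_detector])
    print(guard.check("Let's talk politics; mail me at a@b.com"))
```

Keeping each detector behind the same callable signature means new behaviours or topics can be covered by appending another detector to the list, without changing how messages are checked.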
Authors:
- Shubh Goyal
- Medha Hira
- Shubham Mishra
- Sukriti Goyal
- Arnav Goel
- Niharika Dadu
- Kirushikesh DB
- Sameep Mehta
- Nishtha Madaan