AuditLLM: A Tool for Auditing Large Language Models Using Multiprobe Approach (2402.09334v2)
Abstract: As LLMs are integrated into various sectors, ensuring their reliability and safety is crucial. This necessitates rigorous probing and auditing to maintain their effectiveness and trustworthiness in practical applications. Subjecting LLMs to varied iterations of a single query can reveal inconsistencies in their knowledge base or functional capacity. However, a tool for performing such audits with an easy-to-execute workflow and a low technical barrier to entry has been lacking. In this demo, we introduce "AuditLLM," a novel tool designed to audit the performance of various LLMs in a methodical way. AuditLLM's primary function is to audit a given LLM by deploying multiple probes derived from a single question, thereby detecting inconsistencies in the model's comprehension or performance. A robust, reliable, and consistent LLM should generate semantically similar responses to variably phrased versions of the same question. Building on this premise, AuditLLM produces easily interpretable results that reflect the LLM's consistency for a single input question provided by the user. A certain level of inconsistency has been shown to indicate potential bias, hallucination, and other issues; the output of AuditLLM can therefore guide further investigation of the audited model. To support both demonstration and practical use, AuditLLM offers two key modes: (1) a live mode, which audits an LLM instantly by analyzing its responses to real-time queries; and (2) a batch mode, which processes multiple queries at once for in-depth analysis. The tool benefits both researchers and general users by deepening our understanding of LLMs' response-generation capabilities through a standardized auditing platform.
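The core idea, probing a model with paraphrases of one question and scoring the semantic similarity of its answers, can be sketched in a few lines. The following is a minimal illustration rather than AuditLLM's actual implementation: `query_llm` is a hypothetical stand-in for whatever model API is being audited, the hand-written probes stand in for the paraphrases AuditLLM derives automatically from the user's question, and the sentence-transformers cosine-similarity metric is an assumption (the tool's actual similarity measure may differ).

```python
# Minimal sketch of a multiprobe consistency audit (not AuditLLM's code).
from itertools import combinations
from sentence_transformers import SentenceTransformer, util


def query_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real call to the model under
    # audit. A canned answer is returned here so the sketch runs end-to-end.
    return "Seasons are caused by the tilt of Earth's axis relative to its orbit."


def audit_question(question: str, probes: list[str], threshold: float = 0.8) -> dict:
    """Send variably phrased probes of one question and score the pairwise
    semantic similarity of the responses; low similarity flags inconsistency."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    responses = [query_llm(p) for p in [question, *probes]]
    embeddings = model.encode(responses, convert_to_tensor=True)
    # Pairwise cosine similarity over all response pairs.
    scores = [
        float(util.cos_sim(embeddings[i], embeddings[j]))
        for i, j in combinations(range(len(responses)), 2)
    ]
    mean_similarity = sum(scores) / len(scores)
    return {
        "responses": responses,
        "pairwise_scores": scores,
        "mean_similarity": mean_similarity,
        "consistent": mean_similarity >= threshold,
    }


if __name__ == "__main__":
    # Hand-written paraphrases for illustration; AuditLLM generates its
    # probes automatically from the single user-supplied question.
    report = audit_question(
        "What causes the seasons on Earth?",
        probes=[
            "Why does Earth experience seasons?",
            "Explain the reason Earth has seasonal changes.",
        ],
    )
    print(f"mean similarity: {report['mean_similarity']:.3f}, "
          f"consistent: {report['consistent']}")
```

In a real audit the probe generation would itself be automated (e.g., by another LLM) and the consistency threshold tuned to the domain; this sketch only shows the probe-then-compare loop that the multiprobe approach describes.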
Authors: Maryam Amirizaniani, Tanya Roosta, Aman Chadha, Chirag Shah, Elias Martin