
AuditLLM: A Tool for Auditing Large Language Models Using Multiprobe Approach (2402.09334v2)

Published 14 Feb 2024 in cs.AI

Abstract: As LLMs are integrated into various sectors, ensuring their reliability and safety is crucial. This necessitates rigorous probing and auditing to maintain their effectiveness and trustworthiness in practical applications. Subjecting LLMs to varied iterations of a single query can unveil potential inconsistencies in their knowledge base or functional capacity. However, a tool for performing such audits with an easy-to-execute workflow and a low technical barrier is lacking. In this demo, we introduce "AuditLLM," a novel tool designed to audit the performance of various LLMs in a methodical way. AuditLLM's primary function is to audit a given LLM by deploying multiple probes derived from a single question, thus detecting any inconsistencies in the model's comprehension or performance. A robust, reliable, and consistent LLM is expected to generate semantically similar responses to variably phrased versions of the same question. Building on this premise, AuditLLM generates easily interpretable results that reflect the LLM's consistency based on a single input question provided by the user. A certain level of inconsistency has been shown to be an indicator of potential bias, hallucinations, and other issues. One could then use the output of AuditLLM to further investigate issues with the aforementioned LLM. To facilitate demonstration and practical use, AuditLLM offers two key modes: (1) Live mode, which allows instant auditing of LLMs by analyzing responses to real-time queries; and (2) Batch mode, which facilitates comprehensive LLM auditing by processing multiple queries at once for in-depth analysis. This tool is beneficial for both researchers and general users, as it enhances our understanding of LLMs' capabilities in generating responses, using a standardized auditing platform.
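The multiprobe idea in the abstract can be sketched in a few lines: rephrase one question into several probes, query the model with each, and score how similar the responses are to one another. The sketch below is an assumption-laden illustration, not AuditLLM's actual implementation: `query_llm` is a stub standing in for a real LLM API call, and the similarity measure is a crude token-overlap (Jaccard) score, whereas the tool itself would use a semantic metric (e.g., an embedding- or BERTScore-style comparison).

```python
# Minimal sketch of multiprobe consistency auditing (illustrative only).
from itertools import combinations

def query_llm(probe: str) -> str:
    """Placeholder for a real LLM call (hypothetical; not AuditLLM's API)."""
    canned = {
        "What year did the Apollo 11 mission land on the Moon?":
            "Apollo 11 landed on the Moon in 1969.",
        "In which year did Apollo 11 reach the lunar surface?":
            "Apollo 11 reached the lunar surface in 1969.",
        "When did humans first land on the Moon with Apollo 11?":
            "Humans first landed on the Moon with Apollo 11 in 1969.",
    }
    return canned[probe]

def token_jaccard(a: str, b: str) -> float:
    """Crude lexical similarity: |intersection| / |union| of word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def audit_consistency(probes: list[str], threshold: float = 0.2) -> dict:
    """Query the model with each probe and score pairwise response similarity."""
    responses = [query_llm(p) for p in probes]
    scores = [token_jaccard(r1, r2) for r1, r2 in combinations(responses, 2)]
    mean_score = sum(scores) / len(scores)
    return {
        "responses": responses,
        "mean_similarity": mean_score,
        "consistent": mean_score >= threshold,
    }

probes = [
    "What year did the Apollo 11 mission land on the Moon?",
    "In which year did Apollo 11 reach the lunar surface?",
    "When did humans first land on the Moon with Apollo 11?",
]
report = audit_consistency(probes)
```

A low mean similarity across probes would flag the question for further investigation (e.g., for hallucination or bias), mirroring how the paper treats inconsistency as a symptom rather than a diagnosis.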

Authors (5)
  1. Maryam Amirizaniani
  2. Tanya Roosta
  3. Aman Chadha
  4. Chirag Shah
  5. Elias Martin