An Auditing Test To Detect Behavioral Shift in Language Models (2410.19406v2)
Abstract: As language models (LMs) approach human-level performance, a comprehensive understanding of their behavior becomes crucial. This includes evaluating capabilities, biases, task performance, and alignment with societal values. Extensive initial evaluations, including red teaming and diverse benchmarking, can establish a model's behavioral profile, but subsequent fine-tuning or deployment modifications may alter these behaviors in unintended ways. We present a method for continual Behavioral Shift Auditing (BSA) in LMs. Building on recent work in hypothesis testing, our auditing test detects behavioral shifts solely through model generations: it compares generations from a baseline model to those of the model under scrutiny, and it provides theoretical guarantees for change detection while controlling false positives. The test features a configurable tolerance parameter that adjusts its sensitivity to behavioral change for different use cases. We evaluate the approach in two case studies, monitoring changes in (a) toxicity and (b) translation performance, and find that the test detects meaningful changes in behavior distributions using just hundreds of examples.
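To make the auditing idea concrete, below is a minimal, illustrative sketch of a sequential testing-by-betting procedure on a scalar behavior score, in the spirit of the test described in the abstract but not the paper's exact method. It assumes paired behavior scores in [0, 1] (e.g., per-prompt toxicity probabilities) for the baseline and audited models; the function name `behavioral_shift_audit`, the fixed betting fraction `lam`, and the default tolerance `eps` are illustrative choices, not taken from the paper.

```python
import numpy as np

def behavioral_shift_audit(base_scores, new_scores, eps=0.05, alpha=0.05, lam=0.2):
    """Sketch of an anytime-valid sequential audit via testing-by-betting.

    H0: the audited model's mean behavior score exceeds the baseline's by
    at most `eps` (the tolerance).  Scores are assumed to lie in [0, 1].
    Under H0 the wealth process below is a nonnegative supermartingale
    (lam <= 1 / (1 + eps) keeps every factor nonnegative), so rejecting
    when wealth reaches 1/alpha controls the false-positive rate at alpha
    at any stopping time, by Ville's inequality.
    """
    wealth = 1.0
    for s_base, s_new in zip(base_scores, new_scores):
        d = s_new - s_base                     # paired difference in [-1, 1]
        wealth *= 1.0 + lam * (d - eps)        # bet against H0: E[d] <= eps
        if wealth >= 1.0 / alpha:
            return True, wealth                # shift beyond tolerance detected
    return False, wealth                       # no detection (not proof of no shift)

# Toy usage (hypothetical data): the audited model is ~10 points more toxic.
rng = np.random.default_rng(0)
base = rng.beta(2, 8, size=500)                               # baseline scores
new = np.clip(base + 0.10 + 0.05 * rng.standard_normal(500), 0, 1)
print(behavioral_shift_audit(base, new))
```

With a true shift of about 0.10 against a tolerance of 0.05, the wealth typically crosses 1/alpha within a few hundred paired examples, which is consistent with the sample sizes quoted in the abstract; a fixed betting fraction is used here for simplicity, whereas adaptive betting strategies generally detect shifts faster.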