Bring Your Own Data! Self-Supervised Evaluation for Large Language Models (2306.13651v2)
Abstract: With the rise of large language models (LLMs) and their ubiquitous deployment across diverse domains, measuring LLM behavior on realistic data is imperative. For example, a company deploying a client-facing chatbot must ensure that the model will not respond to client requests with profanity. Current evaluations approach this problem using small, domain-specific datasets with human-curated labels. These evaluation sets are often sampled from a narrow and simplified distribution, and their data sources can unknowingly leak into the training set, which can lead to misleading evaluations. To bypass these drawbacks, we propose a framework for self-supervised evaluation of LLMs by analyzing their sensitivity or invariance to transformations on the input text. Self-supervised evaluation can directly monitor LLM behavior on datasets collected in the wild or streamed during live model deployment. We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence, in addition to sensitivity to grammatical structure and tokenization errors. When comparisons to similar human-labeled benchmarks are available, we find strong correlations between self-supervised and human-supervised evaluations. The self-supervised paradigm complements current evaluation strategies that rely on labeled data.
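To make the transformation-based idea concrete, here is a minimal sketch of one possible invariance check: apply a simple perturbation to an input and compare the model's perplexity before and after. The adjacent-word-swap perturbation, the GPT-2 model choice, and the ratio-based sensitivity score are illustrative assumptions for this sketch, not the paper's exact transformations or metrics.

```python
# Sketch: measure a model's sensitivity to a simple text transformation
# by comparing perplexity on original vs. perturbed input. No labels needed,
# so it can run on any text collected in the wild.
import math
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids yields the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def swap_adjacent_words(text: str, seed: int = 0) -> str:
    """Illustrative perturbation (an assumption): swap one random adjacent word pair."""
    words = text.split()
    if len(words) < 2:
        return text
    i = random.Random(seed).randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

text = "The quick brown fox jumps over the lazy dog."
sensitivity = perplexity(swap_adjacent_words(text)) / perplexity(text)
print(f"sensitivity ratio: {sensitivity:.2f}")  # ratio > 1: model noticed the swap
```

Averaging this ratio over a stream of unlabeled deployment data gives a self-supervised signal of how strongly the model depends on grammatical word order, in the spirit of the sensitivity tests the abstract describes.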