Mass-Producing Failures of Multimodal Systems with Language Models (2306.12105v2)
Abstract: Deployed multimodal systems can fail in ways that evaluators did not anticipate. To find these failures before deployment, we introduce MultiMon, a system that automatically identifies systematic failures -- generalizable, natural-language descriptions of patterns of model failures. To uncover systematic failures, MultiMon scrapes a corpus for examples of erroneous agreement: inputs that produce the same output, but should not. It then prompts an LLM (e.g., GPT-4) to find systematic patterns of failure and describe them in natural language. We use MultiMon to find 14 systematic failures (e.g., "ignores quantifiers") of the CLIP text-encoder, each comprising hundreds of distinct inputs (e.g., "a shelf with a few/many books"). Because CLIP is the backbone for most state-of-the-art multimodal systems, these inputs produce failures in Midjourney 5.1, DALL-E, VideoFusion, and others. MultiMon can also steer towards failures relevant to specific use cases, such as self-driving cars. We see MultiMon as a step towards evaluation that autonomously explores the long tail of potential system failures. Code for MultiMon is available at https://github.com/tsb0601/MultiMon.
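The erroneous-agreement scraping step described in the abstract can be sketched concretely: embed corpus captions with the CLIP text encoder and flag pairs that CLIP treats as nearly identical even though their meanings differ. The sketch below is a minimal illustration, not the paper's exact pipeline; the sentence-embedding reference model, the caption list, and the thresholds are all assumptions introduced here for demonstration.

```python
# Minimal sketch of "erroneous agreement" scraping: find caption pairs that the
# CLIP text encoder embeds almost identically although their meanings differ.
# The reference judge (a sentence-embedding model) and the thresholds are
# illustrative stand-ins, not the paper's actual configuration.
import itertools
import torch
from transformers import CLIPModel, CLIPTokenizer
from sentence_transformers import SentenceTransformer, util

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
reference = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in semantic judge

captions = [
    "a shelf with a few books",
    "a shelf with many books",
    "a dog chasing a cat",
    "a cat chasing a dog",
]

# Normalized CLIP text embeddings for the target model under test.
with torch.no_grad():
    inputs = tok(captions, padding=True, return_tensors="pt")
    clip_emb = clip.get_text_features(**inputs)
    clip_emb = clip_emb / clip_emb.norm(dim=-1, keepdim=True)

# Normalized reference embeddings used to approximate semantic difference.
ref_emb = reference.encode(captions, convert_to_tensor=True, normalize_embeddings=True)

CLIP_SAME, REF_DIFFERENT = 0.9, 0.8  # illustrative thresholds
for i, j in itertools.combinations(range(len(captions)), 2):
    clip_sim = float(clip_emb[i] @ clip_emb[j])
    ref_sim = float(util.cos_sim(ref_emb[i], ref_emb[j]))
    # Erroneous agreement: CLIP sees the captions as (near) identical,
    # but the reference model judges them semantically different.
    if clip_sim > CLIP_SAME and ref_sim < REF_DIFFERENT:
        print(f"candidate failure pair: {captions[i]!r} vs {captions[j]!r}")
```

Pairs surfaced this way would then be passed to an LLM, which groups them and describes the shared failure pattern (e.g., "ignores quantifiers") in natural language.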
- Persistent anti-Muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 298–306, 2021.
- Anthropic. Introducing Claude. https://www.anthropic.com/index/introducing-claude, 2023.
- On the dangers of stochastic parrots: Can language models be too big? In ACM Conference on Fairness, Accountability, and Transparency (FAccT), 2021.
- Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.
- Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1004–1015, 2021.
- Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems (NeurIPS), pages 4349–4357, 2016.
- On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- Picking on the same person: Does algorithmic monoculture lead to outcome homogenization? In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/17a234c91f746d9625a75cf8a8731ee2-Abstract-Conference.html.
- A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
- Introducing ChatGPT and Whisper APIs. OpenAI Blog, 2023. URL https://openai.com/blog/introducing-chatgpt-and-whisper-apis.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.
- Extracting training data from large language models. In USENIX Security Symposium, volume 6, 2021.
- Extracting training data from diffusion models. arXiv preprint arXiv:2301.13188, 2023.
- Testing relational understanding in text-guided image generation. arXiv preprint arXiv:2208.00005, 2022.
- Allyson Ettinger. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48, 2020.
- Domino: Discovering systematic errors with cross-modal embeddings. arXiv preprint arXiv:2203.14960, 2022.
- Adaptive testing of computer vision models. arXiv preprint arXiv:2212.02774, 2022.
- RealToxicityPrompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.
- Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv preprint arXiv:1903.03862, 2019.
- Bias correction of learned generative models using likelihood-free importance weighting. Advances in Neural Information Processing Systems, 32, 2019.
- Defending against adversarial samples without security through obscurity. In 2018 IEEE International Conference on Data Mining (ICDM), pages 137–146. IEEE, 2018.
- Debiased large language models still associate Muslims with uniquely violent acts. arXiv preprint arXiv:2208.04417, 2022.
- Distilling model failures as directions in latent space. arXiv preprint arXiv:2206.14754, 2022.
- Capturing failures of large language models via human cognitive biases. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381, 2023.
- Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463, 2023.
- Algorithmic monoculture and social welfare. Proc. Natl. Acad. Sci. USA, 118(22):e2018340118, 2021. doi: 10.1073/pnas.2018340118. URL https://doi.org/10.1073/pnas.2018340118.
- Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
- Magic3D: High-resolution text-to-3D content creation. arXiv preprint arXiv:2211.10440, 2022.
- Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- VideoFusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- Studying bias in GANs through the lens of race. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIII, pages 344–360. Springer, 2022.
- On measuring social biases in sentence encoders. arXiv preprint arXiv:1903.10561, 2019.
- Identification of systematic errors of image classifiers on rare subgroups. arXiv preprint arXiv:2303.05072, 2023.
- Midjourney. Version. https://docs.midjourney.com/docs/models, 2023a.
- Midjourney. Community guidelines. https://docs.midjourney.com/docs/community-guidelines, 2023b.
- Towards accountable ai: Hybrid human-machine analyses for characterizing system failure. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 6(1):126–135, 2018.
- OpenAI. GPT-4 technical report, 2023.
- Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022a.
- Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251, 2022b.
- DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
- Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
- Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- Red-teaming the stable diffusion safety filter. arXiv preprint arXiv:2210.04610, 2022.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- The protection of information in computer systems. Proceedings of the IEEE, 63(9):1278–1308, 1975.
- The woman worked as a babysitter: On biases in language generation. arXiv preprint arXiv:1909.01326, 2019.
- Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
- Explaining patterns in data with language models via interpretable autoprompting. arXiv preprint arXiv:2210.01848, 2022.
- Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203, 2019.
- Stability.ai. Version. https://stability.ai/stable-diffusion, 2023.
- Mitigating gender bias in natural language processing: Literature review. arXiv preprint arXiv:1906.08976, 2019.
- Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022.
- Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
- Do NLP models know numbers? Probing numeracy in embeddings. arXiv preprint arXiv:1909.07940, 2019.
- Learning adversary-resistant deep neural networks. arXiv preprint arXiv:1612.01401, 2016.
- Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021.
- Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. arXiv preprint arXiv:2302.03668, 2023.
- Discovering bugs in vision models using off-the-shelf image generation and captioning. arXiv preprint arXiv:2208.08831, 2022.
- When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=KRLUvxh8uaX.
- Diagnosing and rectifying vision models using language. arXiv preprint arXiv:2302.04269, 2023.
- Describing differences between text distributions with natural language. In International Conference on Machine Learning, pages 27099–27116. PMLR, 2022.
- Goal driven discovery of distributional differences via language descriptions. arXiv preprint arXiv:2302.14233, 2023.