Benchmarking Large Language Models on Controllable Generation under Diversified Instructions (2401.00690v1)
Abstract: While LLMs have exhibited impressive instruction-following capabilities, it remains unclear whether, and to what extent, they can respond to the explicit constraints that instructions may carry. As a significant aspect of LLM alignment, it is thus important to formulate such a specialized set of instructions and to investigate the resulting behavior of LLMs. To fill this gap, we propose CoDI-Eval, a new benchmark that systematically and comprehensively evaluates LLMs' responses to instructions with various constraints. We construct a large collection of constraint-bearing instructions as a test suite designed for both generalization and coverage. Specifically, we apply an instruction diversification process to synthesize diverse forms of constraint expression, and we carefully design the task taxonomy with finer-grained sub-categories. Finally, we automate the entire evaluation process to facilitate further development. Unlike existing studies on controllable text generation, CoDI-Eval is the first to extend the scope to the prevalent instruction-following paradigm. Extensive evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval reveal their limitations in following instructions with specific constraints, as well as a significant gap between open-source and commercial closed-source LLMs. We believe this benchmark will facilitate research into improving the controllability of LLMs' responses to instructions. Our data and code are available at https://github.com/Xt-cyh/CoDI-Eval.
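The abstract's key mechanism is an automated check of whether a model's response satisfies the constraint embedded in each instruction. The sketch below is a minimal, hypothetical illustration of such an evaluation loop: the constraint types (keyword inclusion, word-length budget) and all function names are assumptions made for this example, not the benchmark's actual taxonomy or implementation, which covers more constraint categories and uses its own checkers.

```python
# Hypothetical sketch of automated constraint-satisfaction evaluation,
# in the spirit of CoDI-Eval; names and constraint types are illustrative.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ConstraintCase:
    """One diversified instruction paired with a machine-checkable constraint."""
    instruction: str
    check: Callable[[str], bool]  # True if the response satisfies the constraint


def keyword_constraint(keywords: List[str]) -> Callable[[str], bool]:
    """Constraint: the response must mention every required keyword."""
    return lambda response: all(k.lower() in response.lower() for k in keywords)


def length_constraint(max_words: int) -> Callable[[str], bool]:
    """Constraint: the response must stay within a word budget."""
    return lambda response: len(response.split()) <= max_words


def constraint_accuracy(cases: List[ConstraintCase],
                        generate: Callable[[str], str]) -> float:
    """Fraction of instructions whose generated responses satisfy their constraints."""
    hits = sum(case.check(generate(case.instruction)) for case in cases)
    return hits / len(cases)


if __name__ == "__main__":
    suite = [
        ConstraintCase("Describe autumn. Include the words 'leaves' and 'wind'.",
                       keyword_constraint(["leaves", "wind"])),
        ConstraintCase("Summarize the plot of Hamlet in at most 30 words.",
                       length_constraint(30)),
    ]
    # Stand-in "model" for demonstration only; a real run would query an LLM API.
    dummy_model = lambda prompt: "Crisp wind scatters golden leaves along quiet streets."
    print(f"Constraint accuracy: {constraint_accuracy(suite, dummy_model):.2f}")
```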
Authors: Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, Zhendong Mao