InFoBench: Evaluating Instruction Following Ability in Large Language Models (2401.03601v1)
Abstract: This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating the instruction-following ability of large language models (LLMs). Addressing a gap in current methodologies, DRFR breaks down complex instructions into simpler criteria, enabling a detailed analysis of how well LLMs comply with each aspect of a task. Alongside this metric, we present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. Our experiments compare DRFR with traditional scoring methods and explore annotation sources, including human experts, crowdsourced workers, and GPT-4. The findings demonstrate DRFR's higher reliability and the effectiveness of GPT-4 as a cost-efficient annotator. Evaluating several advanced LLMs with this framework reveals their strengths and the areas needing improvement, particularly in following complex instructions. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation.
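To make the metric concrete, below is a minimal sketch of how DRFR could be computed, assuming it is the fraction of all decomposed criteria (pooled across instructions) that an annotator judges satisfied. The function and data layout here are illustrative, not the paper's official implementation.

```python
from typing import List

def drfr(judgments: List[List[bool]]) -> float:
    """Decomposed Requirements Following Ratio (sketch): the fraction of
    all decomposed criteria, pooled across instructions, judged satisfied.

    Each inner list holds the yes/no judgments for one instruction's
    decomposed questions (True = criterion satisfied).
    """
    satisfied = sum(sum(per_instruction) for per_instruction in judgments)
    total = sum(len(per_instruction) for per_instruction in judgments)
    return satisfied / total if total else 0.0

# Example: two instructions, decomposed into 3 and 2 criteria respectively.
# The model satisfies 2/3 of the first and 2/2 of the second -> DRFR = 4/5.
example = [[True, True, False], [True, True]]
print(drfr(example))  # 0.8
```

Pooling criteria this way weights instructions with more requirements more heavily; a per-instruction macro average would be the alternative if equal weighting per instruction were desired.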