InFoBench: Evaluating Instruction Following Ability in Large Language Models (2401.03601v1)

Published 7 Jan 2024 in cs.CL and cs.AI

Abstract: This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions. Addressing a gap in current methodologies, DRFR breaks down complex instructions into simpler criteria, facilitating a detailed analysis of LLMs' compliance with various aspects of tasks. Alongside this metric, we present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. Our experiments compare DRFR with traditional scoring methods and explore annotation sources, including human experts, crowd-sourced workers, and GPT-4. The findings demonstrate DRFR's higher reliability and the effectiveness of using GPT-4 as a cost-efficient annotator. The evaluation of several advanced LLMs using this framework reveals their strengths and areas needing improvement, particularly in complex instruction-following. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation.


Summary

  • The paper introduces DRFR, a new metric that decomposes instructions to enhance reliable evaluation of LLMs.
  • It develops InFoBench, a benchmark with 500 diverse instructions and 2,250 decomposed questions for systematic testing.
  • The study shows that GPT-4 serves as a cost-effective annotator, yielding insights for refining LLM performance on complex tasks.

Overview of InFoBench and DRFR

In the sphere of AI and NLP, evaluating the effectiveness of LLMs is crucial, especially regarding their ability to follow complex instructions. Traditional models often struggle with tasks that require nuanced understanding and compliance with specific directives. To address this challenge, a novel metric known as the Decomposed Requirements Following Ratio (DRFR) has been developed, which breaks down instructions into more manageable criteria. This granularity allows for a deeper analysis of LLMs' performance.
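
The core idea behind DRFR is simple to express in code: each instruction is decomposed into yes/no criteria, and the metric is the fraction of criteria a model's responses satisfy across the whole set. The sketch below illustrates that calculation; the record layout and field names are illustrative assumptions, not the paper's exact schema.

```python
# Minimal sketch of the DRFR idea: each instruction is decomposed into simple
# yes/no criteria, and DRFR is the fraction of criteria the model's responses
# satisfy, aggregated over all evaluated instructions.
# The data layout below is illustrative, not the paper's exact schema.

from dataclasses import dataclass
from typing import List


@dataclass
class EvaluatedInstruction:
    instruction: str
    # One boolean per decomposed requirement: True if the response meets it.
    criteria_met: List[bool]


def drfr(results: List[EvaluatedInstruction]) -> float:
    """Decomposed Requirements Following Ratio over a set of instructions."""
    satisfied = sum(sum(r.criteria_met) for r in results)
    total = sum(len(r.criteria_met) for r in results)
    return satisfied / total if total else 0.0


if __name__ == "__main__":
    demo = [
        EvaluatedInstruction("Write a haiku about rain in exactly three lines.",
                             [True, True, False]),  # 2 of 3 requirements met
        EvaluatedInstruction("Summarize the text in one formal sentence.",
                             [True, True]),         # 2 of 2 requirements met
    ]
    print(f"DRFR = {drfr(demo):.2f}")  # 4 satisfied / 5 total = 0.80
```

Because every decomposed requirement contributes equally, a response that meets most but not all constraints still earns partial credit, which is what gives DRFR finer resolution than a single pass/fail score.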

Benchmark Development and Evaluation Criteria

Along with DRFR, a new benchmark called InFoBench was created, featuring 500 varied instructions and 2,250 decomposed questions to test LLMs. The benchmark categorizes instructions into two difficulty levels: an Easy Set with basic requests and a Hard Set with more complex, multi-constraint scenarios. Together, the two sets allow models to be examined systematically across constraint categories and difficulty levels. The paper also presents examples of these constraint categories, showcasing the diverse challenges LLMs face when tested by the benchmark; an illustrative data layout is sketched below.
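
To make the Easy/Hard split and the decomposed questions concrete, here is a small sketch of how such benchmark entries might be organized and filtered. The field names ("level", "decomposed_questions") and the sample instructions are assumptions for demonstration, not the released InFoBench file format.

```python
# Illustrative sketch of InFoBench-style entries: each instruction carries its
# difficulty level and a list of decomposed yes/no evaluation questions.
# Field names and contents are assumed for demonstration purposes.

import json

sample_entries = [
    {
        "id": "easy-001",
        "level": "Easy",
        "instruction": "List three European capitals, one per line.",
        "decomposed_questions": [
            "Does the response list exactly three items?",
            "Is each item a European capital?",
            "Is each item on its own line?",
        ],
    },
    {
        "id": "hard-001",
        "level": "Hard",
        "instruction": "Write a four-sentence product pitch that ends with a question.",
        "decomposed_questions": [
            "Does the response contain exactly four sentences?",
            "Is the response a product pitch?",
            "Does the response end with a question?",
        ],
    },
]

# Select only the multi-constraint Hard Set for evaluation.
hard_set = [entry for entry in sample_entries if entry["level"] == "Hard"]
print(json.dumps(hard_set, indent=2))
```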

Annotator Reliability Tests

The authors conducted two key experiments. The first compared DRFR with conventional Direct Scoring (DS), finding DRFR more reliable thanks to its finer granularity. The second compared annotation sources, including human experts, crowd-sourced workers, and GPT-4. GPT-4 proved to be a surprisingly accurate and cost-effective annotator, making it a potentially valuable tool for large-scale model evaluations; a sketch of such an annotation setup follows.
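
The following is a minimal sketch of how GPT-4 could be used as an automatic annotator: for each decomposed yes/no question, the judge model is asked whether a response satisfies that requirement. It assumes the OpenAI Python client; the prompt wording is an assumption for illustration, not the paper's exact annotation template.

```python
# Sketch of GPT-4-as-annotator: ask the judge model one decomposed yes/no
# question at a time about a candidate response.
# Prompt wording is an assumption, not the paper's exact template.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def annotate(instruction: str, response: str, question: str) -> bool:
    """Return True if GPT-4 judges the response to satisfy one requirement."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        f"Question: {question}\n"
        "Answer strictly with 'Yes' or 'No'."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = completion.choices[0].message.content.strip().lower()
    return answer.startswith("yes")


# Example usage (hypothetical inputs):
# verdict = annotate("Write a haiku about rain.", model_output,
#                    "Does the response have exactly three lines?")
```

Averaging these per-question verdicts across the benchmark yields the DRFR score without requiring human raters for every response, which is what makes the GPT-4 annotation route cost-efficient at scale.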

Insights and Future Directions

The comprehensive evaluation of advanced LLMs revealed that although these models have made remarkable progress, they are still not flawless in following intricate instructions – a factor that's critical for practical applications. Closed-source models tended to outperform open-source models, perhaps due to richer data or more sophisticated algorithms. The paper also pointed to various other contributions such as a scalable evaluation metric and a comprehensive benchmark that could assist in the ongoing refinement of LLMs.

To conclude, this approach to LLM evaluation offers a more precise, detailed, and cost-effective method than previous evaluation approaches. It highlights both the current capabilities and the potential future advancements for LLMs in understanding and executing complex instructions. Such evaluations are fundamental for advancing AI models that can more effectively serve a wide range of real-world applications.
