InFoBench: Evaluating Instruction Following Ability in Large Language Models (2401.03601v1)

Published 7 Jan 2024 in cs.CL and cs.AI

Abstract: This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions. Addressing a gap in current methodologies, DRFR breaks down complex instructions into simpler criteria, facilitating a detailed analysis of LLMs' compliance with various aspects of tasks. Alongside this metric, we present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. Our experiments compare DRFR with traditional scoring methods and explore annotation sources, including human experts, crowd-sourced workers, and GPT-4. The findings demonstrate DRFR's higher reliability and the effectiveness of using GPT-4 as a cost-efficient annotator. The evaluation of several advanced LLMs using this framework reveals their strengths and areas needing improvement, particularly in complex instruction-following. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation.


Summary

  • The paper introduces DRFR, a new metric that decomposes instructions to enhance reliable evaluation of LLMs.
  • It develops InFoBench, a benchmark with 500 diverse instructions and 2,250 decomposed questions for systematic testing.
  • The study shows that GPT-4 serves as a cost-effective annotator, yielding insights for refining LLM performance on complex tasks.

Overview of InFoBench and DRFR

In the sphere of AI and NLP, evaluating the effectiveness of LLMs is crucial, especially regarding their ability to follow complex instructions. Traditional models often struggle with tasks that require nuanced understanding and compliance with specific directives. To address this challenge, a novel metric known as the Decomposed Requirements Following Ratio (DRFR) has been developed, which breaks down instructions into more manageable criteria. This granularity allows for a deeper analysis of LLMs' performance.
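
The core idea behind DRFR is simple to express in code: each instruction is decomposed into yes/no criteria, and the metric is the fraction of criteria a model's responses satisfy across the whole set. The sketch below illustrates that calculation; the record layout and field names are illustrative assumptions, not the paper's exact schema.

```python
# Minimal sketch of the DRFR idea: each instruction is decomposed into simple
# yes/no criteria, and DRFR is the fraction of criteria the model's responses
# satisfy, aggregated over all evaluated instructions.
# The data layout below is illustrative, not the paper's exact schema.

from dataclasses import dataclass
from typing import List


@dataclass
class EvaluatedInstruction:
    instruction: str
    # One boolean per decomposed requirement: True if the response meets it.
    criteria_met: List[bool]


def drfr(results: List[EvaluatedInstruction]) -> float:
    """Decomposed Requirements Following Ratio over a set of instructions."""
    satisfied = sum(sum(r.criteria_met) for r in results)
    total = sum(len(r.criteria_met) for r in results)
    return satisfied / total if total else 0.0


if __name__ == "__main__":
    demo = [
        EvaluatedInstruction("Write a haiku about rain in exactly three lines.",
                             [True, True, False]),  # 2 of 3 requirements met
        EvaluatedInstruction("Summarize the text in one formal sentence.",
                             [True, True]),         # 2 of 2 requirements met
    ]
    print(f"DRFR = {drfr(demo):.2f}")  # 4 satisfied / 5 total = 0.80
```

Because every decomposed requirement contributes equally, a response that meets most but not all constraints still earns partial credit, which is what gives DRFR finer resolution than a single pass/fail score.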

Benchmark Development and Evaluation Criteria

Along with DRFR, a new benchmark called InFoBench was created, featuring 500 varied instructions and 2,250 decomposed questions to test LLMs. The benchmark categorizes instructions into two difficulty levels: an Easy Set with basic requests and a Hard Set with more complex, multi-constraint scenarios. Together, the two sets allow models to be examined systematically across constraint categories and difficulty levels. The paper also presents examples of these constraint categories, showcasing the diverse challenges LLMs face when tested by the benchmark; an illustrative data layout is sketched below.
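
To make the Easy/Hard split and the decomposed questions concrete, here is a small sketch of how such benchmark entries might be organized and filtered. The field names ("level", "decomposed_questions") and the sample instructions are assumptions for demonstration, not the released InFoBench file format.

```python
# Illustrative sketch of InFoBench-style entries: each instruction carries its
# difficulty level and a list of decomposed yes/no evaluation questions.
# Field names and contents are assumed for demonstration purposes.

import json

sample_entries = [
    {
        "id": "easy-001",
        "level": "Easy",
        "instruction": "List three European capitals, one per line.",
        "decomposed_questions": [
            "Does the response list exactly three items?",
            "Is each item a European capital?",
            "Is each item on its own line?",
        ],
    },
    {
        "id": "hard-001",
        "level": "Hard",
        "instruction": "Write a four-sentence product pitch that ends with a question.",
        "decomposed_questions": [
            "Does the response contain exactly four sentences?",
            "Is the response a product pitch?",
            "Does the response end with a question?",
        ],
    },
]

# Select only the multi-constraint Hard Set for evaluation.
hard_set = [entry for entry in sample_entries if entry["level"] == "Hard"]
print(json.dumps(hard_set, indent=2))
```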

Annotator Reliability Tests

The authors conducted two key experiments. The first compared DRFR with conventional Direct Scoring (DS), finding DRFR more reliable thanks to its finer granularity. The second compared annotation sources, including human experts, crowd-sourced workers, and GPT-4. GPT-4 proved to be a surprisingly accurate and cost-effective annotator, making it a potentially valuable tool for large-scale model evaluations; a sketch of such an annotation setup follows.
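
The following is a minimal sketch of how GPT-4 could be used as an automatic annotator: for each decomposed yes/no question, the judge model is asked whether a response satisfies that requirement. It assumes the OpenAI Python client; the prompt wording is an assumption for illustration, not the paper's exact annotation template.

```python
# Sketch of GPT-4-as-annotator: ask the judge model one decomposed yes/no
# question at a time about a candidate response.
# Prompt wording is an assumption, not the paper's exact template.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def annotate(instruction: str, response: str, question: str) -> bool:
    """Return True if GPT-4 judges the response to satisfy one requirement."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        f"Question: {question}\n"
        "Answer strictly with 'Yes' or 'No'."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = completion.choices[0].message.content.strip().lower()
    return answer.startswith("yes")


# Example usage (hypothetical inputs):
# verdict = annotate("Write a haiku about rain.", model_output,
#                    "Does the response have exactly three lines?")
```

Averaging these per-question verdicts across the benchmark yields the DRFR score without requiring human raters for every response, which is what makes the GPT-4 annotation route cost-efficient at scale.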

Insights and Future Directions

The comprehensive evaluation of advanced LLMs revealed that although these models have made remarkable progress, they are still not flawless in following intricate instructions – a factor that's critical for practical applications. Closed-source models tended to outperform open-source models, perhaps due to richer data or more sophisticated algorithms. The paper also pointed to various other contributions such as a scalable evaluation metric and a comprehensive benchmark that could assist in the ongoing refinement of LLMs.

To conclude, this approach to LLM evaluation offers a more precise, detailed, and cost-effective method than previous evaluation approaches. It highlights both the current capabilities and the potential future advancements for LLMs in understanding and executing complex instructions. Such evaluations are fundamental for advancing AI models that can more effectively serve a wide range of real-world applications.
