Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following (2410.15553v2)

Published 21 Oct 2024 in cs.CL

Abstract: LLMs have demonstrated impressive capabilities in various tasks, including instruction following, which is crucial for aligning model outputs with user expectations. However, evaluating LLMs' ability to follow instructions remains challenging due to the complexity and subjectivity of human language. Current benchmarks primarily focus on single-turn, monolingual instructions, which do not adequately reflect the complexities of real-world applications that require handling multi-turn and multilingual interactions. To address this gap, we introduce Multi-IF, a new benchmark designed to assess LLMs' proficiency in following multi-turn and multilingual instructions. Multi-IF, which utilizes a hybrid framework combining LLM and human annotators, expands upon the IFEval by incorporating multi-turn sequences and translating the English prompts into another 7 languages, resulting in a dataset of 4,501 multilingual conversations, where each has three turns. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. All the models tested showed a higher rate of failure in executing instructions correctly with each additional turn. For example, o1-preview drops from 0.877 at the first turn to 0.707 at the third turn in terms of average accuracy over all languages. Moreover, languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities. We release Multi-IF prompts and the evaluation code base to encourage further research in this critical area.

Analyzing Multi-Turn and Multilingual Instruction Following in LLMs

The paper "Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instruction Following" presents an evaluative framework designed to measure the capabilities of LLMs in comprehending and executing complex, multilingual, and multi-turn instructions. These features are critical in applicability within real-world scenarios, where interactions are inherently multifaceted and involve numerous languages.

Key Contributions

This work introduces Multi-IF, addressing the inadequacies of existing benchmarks, which predominantly focus on single-turn, monolingual instructions. By extending IFEval, the research provides a dataset of 4,501 multilingual conversations, each involving three turns and covering eight languages. Such a benchmark is essential for evaluating LLMs used in applications like multilingual chatbots and globally deployed AI assistants.
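
To make the setup concrete, below is a minimal sketch of how a Multi-IF-style conversation could be represented and scored turn by turn. The field names, the `follows` checker, and the `generate` callable are illustrative assumptions, not the released Multi-IF schema or evaluation code.

```python
# Minimal sketch of a multi-turn, instruction-following evaluation loop.
# All names here (Turn, Conversation, follows, generate) are hypothetical;
# the released Multi-IF code base may structure this differently.
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Turn:
    prompt: str                 # instruction shown to the model at this turn
    instruction_ids: List[str]  # verifiable constraints introduced at this turn


@dataclass
class Conversation:
    language: str               # one of the eight benchmark languages
    turns: List[Turn] = field(default_factory=list)


def follows(instruction_id: str, response: str) -> bool:
    """Toy rule-based check; real checks verify length, formatting, etc."""
    if instruction_id == "length:max_50_words":
        return len(response.split()) <= 50
    return True


def evaluate(conv: Conversation,
             generate: Callable[[List[Dict[str, str]]], str]) -> List[float]:
    """Return per-turn instruction accuracy for one conversation.

    `generate(history)` maps the chat history so far to the model's reply.
    Constraints from earlier turns remain active, so they accumulate.
    """
    history: List[Dict[str, str]] = []
    per_turn_accuracy: List[float] = []
    for idx, turn in enumerate(conv.turns):
        history.append({"role": "user", "content": turn.prompt})
        response = generate(history)
        history.append({"role": "assistant", "content": response})
        active = [i for t in conv.turns[: idx + 1] for i in t.instruction_ids]
        hits = sum(follows(i, response) for i in active)
        per_turn_accuracy.append(hits / len(active) if active else 1.0)
    return per_turn_accuracy


if __name__ == "__main__":
    conv = Conversation(
        language="en",
        turns=[
            Turn("Summarize the article.", ["length:max_50_words"]),
            Turn("Now add a title.", ["format:has_title"]),
        ],
    )
    # Stub model: always answers with a short fixed string.
    print(evaluate(conv, lambda history: "Title: Demo\nA short summary."))
```

Averaging such per-turn scores over many conversations in each language, and then across languages, yields turn-wise accuracies of the kind reported in the paper.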

Evaluation Findings

The benchmark was used to evaluate 14 state-of-the-art LLMs, including models from OpenAI and Meta, among others. The principal observations are:

  • Increased Complexity: Multi-IF is more challenging than existing benchmarks. Models showed a consistent decline in accuracy with each additional conversational turn, illustrating the difficulty introduced by multi-turn instructions. For instance, OpenAI's o1-preview model dropped from 0.877 average accuracy (over all languages) at the first turn to 0.707 at the third; an illustrative aggregation sketch follows this list.
  • Language Variability: Non-Latin script languages, such as Hindi and Chinese, consistently resulted in higher error rates compared to English and other Latin-based languages. This suggests a need for significant improvements in multilingual capabilities.
  • Instruction Forgetting: A common issue identified was the degradation of adherence to previously followed instructions as the number of turns increased. This reflects challenges in maintaining context and directives through extended interactions.
  • Error Recovery: The ability of models to correct earlier errors in subsequent turns varied, with OpenAI models showing higher error correction rates, likely due to hidden chain-of-thought capabilities.
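
As a rough illustration of how the turn-wise numbers above could be aggregated, the snippet below averages per-conversation accuracies within each language and then across languages for a given turn. The input values are made-up placeholders, not results from the paper; only the averaging scheme is the point.

```python
# Illustrative aggregation of per-turn accuracy across languages.
from collections import defaultdict
from statistics import mean

# results[(language, turn_index)] -> per-conversation accuracies (placeholders)
results: dict[tuple[str, int], list[float]] = defaultdict(list)
results[("en", 1)] += [0.92, 0.88]
results[("hi", 1)] += [0.80, 0.78]
results[("en", 3)] += [0.76, 0.72]
results[("hi", 3)] += [0.62, 0.66]

for turn in (1, 3):
    # Average within each language first, then across languages for this turn.
    per_language = [mean(scores) for (lang, t), scores in results.items() if t == turn]
    print(f"turn {turn}: average accuracy over languages = {mean(per_language):.3f}")
```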

Analytical Insights

Examining instruction following in multilingual, multi-turn contexts exposes aspects of LLM performance that single-turn tests miss: it distinguishes simple compliance with a one-off directive from robustness in extended conversational settings. The findings underline the importance of improving LLM architectures to handle this complexity, potentially through advanced memory mechanisms or reinforcement learning-driven fine-tuning.

Implications and Future Directions

Multi-IF sets a new standard for evaluating LLMs against real-world demands. As AI continues to integrate into daily life, especially in areas such as customer support and global communication, the ability to follow complex instructions across different languages and turns is crucial. The insights derived from this benchmark can guide future LLM development, pointing to ways of addressing the identified weaknesses and improving compliance with nuanced instructions across turns and languages.

Moreover, the results bear on theoretical questions about multi-turn processing and contextual retention. Future research may benefit from exploring how insights from human cognitive science and memory systems can bolster LLM capabilities. Additionally, developing adaptive frameworks for automatic, real-time translation and multilingual context management will be pivotal in achieving equitable AI language interactions worldwide.

In conclusion, the Multi-IF benchmark stands as a comprehensive tool for assessing and improving LLMs' multilingual, multi-turn instruction-following capabilities, casting light on existing challenges and driving progress in complex task execution.

Authors (19)
  1. Yun He (26 papers)
  2. Di Jin (104 papers)
  3. Chaoqi Wang (16 papers)
  4. Chloe Bi (4 papers)
  5. Karishma Mandyam (4 papers)
  6. Hejia Zhang (24 papers)
  7. Chen Zhu (103 papers)
  8. Ning Li (174 papers)
  9. Tengyu Xu (27 papers)
  10. Hongjiang Lv (2 papers)
  11. Shruti Bhosale (18 papers)
  12. Chenguang Zhu (100 papers)
  13. Karthik Abinav Sankararaman (25 papers)
  14. Eryk Helenowski (6 papers)
  15. Melanie Kambadur (11 papers)
  16. Aditya Tayade (2 papers)
  17. Hao Ma (116 papers)
  18. Han Fang (61 papers)
  19. Sinong Wang (45 papers)