
FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability (2402.18667v1)

Published 28 Feb 2024 in cs.CL

Abstract: This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats, a crucial yet underexamined capability for their application as AI agents. Despite LLMs' advancements, existing benchmarks fail to assess their format-following proficiency adequately. FoFo fills this gap with a diverse range of real-world formats and instructions, developed through an AI-Human collaborative method. Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PALM2, Gemini) LLMs highlights three key findings: open-source models significantly lag behind closed-source ones in format adherence; LLMs' format-following performance is independent of their content generation quality; and LLMs' format proficiency varies across different domains. These insights suggest the need for specialized tuning for format-following skills and highlight FoFo's role in guiding the selection of domain-specific AI agents. FoFo is released at https://github.com/SalesforceAIResearch/FoFo.

Authors (8)
  1. Congying Xia (32 papers)
  2. Chen Xing (31 papers)
  3. Jiangshu Du (10 papers)
  4. Xinyi Yang (33 papers)
  5. Yihao Feng (35 papers)
  6. Ran Xu (89 papers)
  7. Wenpeng Yin (69 papers)
  8. Caiming Xiong (337 papers)
Citations (33)

Summary

Evaluating LLM Format-Following Capabilities: Insights from the FOFO Benchmark

The paper, FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability, introduces an evaluation framework dedicated to assessing LLMs' ability to adhere to specified data formats. This capability is crucial for deploying LLMs as AI agents across domains, particularly for tasks requiring formal outputs such as generating medical records or legal documents. FOFO fills a notable gap left by previous benchmarks, which predominantly focused on content generation and context-following while largely overlooking format adherence.

The FOFO benchmark was constructed using an AI-human collaborative methodology, ensuring coverage of a wide range of domain-specific formats. The core phases involved identifying pertinent domains and subdomains, developing domain-specific data formats, and generating detailed instructions that incorporate complex, format-oriented requirements.

Through an empirical examination involving both closed-source (e.g., GPT-4, PALM2) and open-source models (e.g., Llama 2, WizardLM), the paper makes several noteworthy observations:

  1. Performance Discrepancy: A significant gap separates the format-following accuracy of open-source models from that of their closed-source counterparts. Closed-source models consistently outperformed open-source ones, which may indicate that open-source models lag in the specialized alignment fine-tuning necessary for format adherence.
  2. Independence from Content Generation: The paper emphasizes that format adherence operates independently of content generation performance, underscoring that models excelling in content generation benchmarks do not necessarily perform well in format-following tasks. This decoupling suggests that dedicated efforts are required to hone this particular capability of LLMs.
  3. Domain Variability: The paper highlights that format-adherence accuracy varies widely across domains, with a given model performing better in some domains than in others. This suggests that format-following capability does not transfer easily across contexts or tasks.

The paper posits that these findings carry dual implications: the need for specialized fine-tuning for format adherence beyond the instruction-tuning techniques conventionally applied to LLMs, and the potential use of FOFO as a guide for selecting domain-specific models as foundation agents. The benchmark thus serves not only as a probing tool but also underscores the necessity of ongoing model refinement to align with practical, real-world applications.

Finally, the paper reports a cost analysis of using GPT-4 for both the data-generation and evaluation stages of FOFO, together with an alignment study against human judgments to establish the reliability of the automatic format-correctness assessment. Future development paths include refining the automation to reduce cost and improve efficiency.

In summary, the introduction of the FOFO benchmark elucidates the importance of format-following capabilities in LLMs, advocates for specialized training paradigms, and provides an insightful tool for the community to harness AI more effectively within domain-specific applications.