
FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability (2402.18667v1)

Published 28 Feb 2024 in cs.CL

Abstract: This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats, a crucial yet underexamined capability for their application as AI agents. Despite LLMs' advancements, existing benchmarks fail to assess their format-following proficiency adequately. FoFo fills this gap with a diverse range of real-world formats and instructions, developed through an AI-Human collaborative method. Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PALM2, Gemini) LLMs highlights three key findings: open-source models significantly lag behind closed-source ones in format adherence; LLMs' format-following performance is independent of their content generation quality; and LLMs' format proficiency varies across different domains. These insights suggest the need for specialized tuning for format-following skills and highlight FoFo's role in guiding the selection of domain-specific AI agents. FoFo is released at https://github.com/SalesforceAIResearch/FoFo.

Authors (8)
  1. Congying Xia (32 papers)
  2. Chen Xing (31 papers)
  3. Jiangshu Du (10 papers)
  4. Xinyi Yang (33 papers)
  5. Yihao Feng (35 papers)
  6. Ran Xu (89 papers)
  7. Wenpeng Yin (69 papers)
  8. Caiming Xiong (337 papers)
Citations (33)

Summary

Evaluating LLM Format-Following Capabilities: Insights from the FOFO Benchmark

The paper, FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability, introduces an evaluation framework dedicated to assessing LLMs' ability to adhere to specified data formats. This capability is crucial for deploying LLMs as AI agents across domains, particularly for tasks requiring formal outputs such as generating medical records or legal documents. FOFO fills a notable gap left by previous benchmarks, which predominantly focused on content generation and context-following while largely overlooking format adherence.

The FOFO benchmark was constructed using an AI-human collaborative methodology, ensuring coverage of a wide range of domain-specific formats. The core phases involved identifying pertinent domains and subdomains, developing domain-specific data formats, and generating detailed instructions that incorporate complex, format-oriented requirements.

Through an empirical examination involving both closed-source (e.g., GPT-4, PALM2) and open-source models (e.g., Llama 2, WizardLM), the paper makes several noteworthy observations:

  1. Performance Discrepancy: A significant gap separates the format-following accuracy of open-source models from that of their closed-source counterparts. Closed-source models consistently outperformed open-source ones, which may indicate that open-source models lag in the specialized alignment fine-tuning necessary for format adherence.
  2. Independence from Content Generation: The paper emphasizes that format adherence operates independently of content generation performance, underscoring that models excelling in content generation benchmarks do not necessarily perform well in format-following tasks. This decoupling suggests that dedicated efforts are required to hone this particular capability of LLMs.
  3. Domain Variability: The paper highlights that format-adherence accuracy varies widely across domains, with a given model performing better in some domains than in others. This suggests that format-following capability does not transfer easily across contexts or tasks.

The paper posits that these findings carry dual implications: the need for specialized fine-tuning for format adherence beyond the instruction-tuning techniques conventionally applied to LLMs, and the potential use of FOFO as a guide for selecting domain-specific models as foundation agents. The benchmark thus serves not only as a probing tool but also underscores the necessity of ongoing model refinement to align with practical, real-world applications.

Finally, the paper reports a cost analysis of using GPT-4 for both the data-generation and evaluation stages of FOFO, together with an alignment study against human judgments to establish the reliability of the automatic format-correctness assessment. Future development paths include refining the automation to reduce cost and improve efficiency.

In summary, the introduction of the FOFO benchmark elucidates the importance of format-following capabilities in LLMs, advocates for specialized training paradigms, and provides an insightful tool for the community to harness AI more effectively within domain-specific applications.