FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability (2402.18667v1)
Abstract: This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs') ability to follow complex, domain-specific formats, a crucial yet underexamined capability for their application as AI agents. Despite LLMs' advancements, existing benchmarks fail to adequately assess their format-following proficiency. FoFo fills this gap with a diverse range of real-world formats and instructions, developed through an AI-human collaborative method. Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PaLM 2, Gemini) LLMs highlights three key findings: open-source models significantly lag behind closed-source ones in format adherence; LLMs' format-following performance is independent of their content generation quality; and LLMs' format proficiency varies across different domains. These insights suggest the need for specialized tuning for format-following skills and highlight FoFo's role in guiding the selection of domain-specific AI agents. FoFo is released at https://github.com/SalesforceAIResearch/FoFo.
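To make the task concrete, format adherence can, in simple cases, be checked programmatically and separately from content quality. The sketch below is a minimal, hypothetical checker, not FoFo's actual evaluation protocol; the function name and the medical-report field names are invented for illustration.

```python
import json

def follows_json_format(output: str, required_keys: list[str]) -> bool:
    """Return True if a model response is valid JSON (an object)
    containing every required key, regardless of the values' content."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False  # not even parseable -> format violated
    return isinstance(data, dict) and all(k in data for k in required_keys)

# A response can be factually fine yet fail the format check, and vice versa:
structured = '{"patient_id": "123", "diagnosis": "flu"}'
free_text = "Patient 123 has the flu."
print(follows_json_format(structured, ["patient_id", "diagnosis"]))  # True
print(follows_json_format(free_text, ["patient_id", "diagnosis"]))   # False
```

This separation is exactly why format-following deserves its own benchmark: a checker like this (or a stronger LLM judge) scores the shape of the output independently of whether its content is correct.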
- Palm 2 technical report. arXiv preprint arXiv:2305.10403.
- Rishabh Bhardwaj and Soujanya Poria. 2023. Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662.
- Benchmarking large language models on controllable generation under diversified instructions. arXiv preprint arXiv:2401.00690.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://lmsys.org/blog/2023-03-30-vicuna/.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- The future landscape of large language models in medicine. Communications Medicine, 3(1):141.
- Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092.
- The HL7 clinical document architecture. Journal of the American Medical Informatics Association, 8(6):552–569.
- Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387.
- Lawbench: Benchmarking legal knowledge of large language models. arXiv preprint arXiv:2309.16289.
- S³: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984.
- Gemini Team Google. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
- Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models. arXiv preprint arXiv:2308.11462.
- Measuring massive multitask language understanding. In International Conference on Learning Representations.
- Mistral 7b. arXiv preprint arXiv:2310.06825.
- Cong Jiang and Xiaolei Yang. 2023. Legal syllogism prompting: Teaching large language models for legal judgment prediction. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, pages 417–421.
- Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252.
- Understanding the benefits and challenges of using large language model-based conversational agents for mental well-being support. In AMIA Annual Symposium Proceedings, volume 2023, page 1105. American Medical Informatics Association.
- Towards large language model-based personal agents in the enterprise: Current trends and open problems. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6909–6921.
- OpenAI. 2023a. Chatgpt. https://openai.com/chatgpt.
- OpenAI. 2023b. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Creation and adoption of large language models in medicine. JAMA, 330(9):866–869.
- Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.
- Large language models help humans verify truthfulness–except when they are convincingly wrong. arXiv preprint arXiv:2310.12558.
- Evaluating large language models on medical evidence summarization. npj Digital Medicine, 6(1):158.
- Large language models in medicine. Nature medicine, 29(8):1930–1940.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944.
- Openchat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235.
- Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
- The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864.
- Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
- Language agents as hackers: Evaluating cybersecurity skills with capture the flag. In Multi-Agent Security Workshop @ NeurIPS '23.
- Safetybench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2309.07045.
- Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
- Trafficsafetygpt: Tuning a pre-trained large language model to a domain-specific expert in transportation safety. arXiv preprint arXiv:2307.15311.
- Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.
- Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.
- Congying Xia
- Chen Xing
- Jiangshu Du
- Xinyi Yang
- Yihao Feng
- Ran Xu
- Wenpeng Yin
- Caiming Xiong