
Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models (2407.08440v4)

Published 11 Jul 2024 in cs.CL and cs.AI

Abstract: Although LLMs have demonstrated strong instruction-following ability, they are further supposed to be controlled and guided by rules in real-world scenarios to be safe, accurate, and intelligent. This demands that LLMs possess inferential rule-following capability. However, no prior work has clearly evaluated this capability: previous studies that attempt to evaluate the inferential rule-following capability of LLMs fail to distinguish inferential rule-following scenarios from instruction-following scenarios. This paper therefore first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diverse range of inferential rule-following abilities. Our experimental results on a variety of LLMs show that they are still limited in following rules. Our analysis of the evaluation results provides insights into how LLMs can be improved toward better inferential rule-following intelligent agents. We further propose Inferential Rule-Following Tuning (IRFT); the experimental results show that through IRFT, LLMs can learn abstract rule-following abilities from purely synthetic data and then generalize to RuleBench. The data and code can be found at: https://anonymous.4open.science/r/LLM-rule-following-B3E3/
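To make the distinction the abstract draws more concrete: an instruction tells a model directly what to do, whereas an inferential rule states a conditional the model must decide to apply to a given fact. The sketch below illustrates what such an evaluation item could look like. The field names, prompt template, and string-match scorer are assumptions for illustration only, not the actual RuleBench schema or scoring method.

```python
# Hedged sketch of an inferential rule-following test item (illustrative,
# not the actual RuleBench format).
from dataclasses import dataclass

@dataclass
class RuleFollowingItem:
    rule: str      # an inferential rule the model must apply
    fact: str      # a premise that may trigger the rule
    expected: str  # the conclusion a rule-following model should draw

def build_prompt(item: RuleFollowingItem) -> str:
    # Unlike an instruction ("do X"), the rule is a conditional; the model
    # must infer that the fact satisfies its antecedent and then fire it.
    return (
        f"Rule: {item.rule}\n"
        f"Fact: {item.fact}\n"
        "Question: Applying the rule to the fact, what follows?"
    )

def is_rule_followed(model_output: str, item: RuleFollowingItem) -> bool:
    # Crude substring scoring; a real benchmark would use a stricter judge.
    return item.expected.lower() in model_output.lower()

item = RuleFollowingItem(
    rule="If a person is someone's parent's brother, he is that person's uncle.",
    fact="Tom is the brother of Mary's father.",
    expected="Tom is Mary's uncle",
)
print(build_prompt(item))
print(is_rule_followed("Therefore, Tom is Mary's uncle.", item))  # True
```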

Authors (10)
  1. Wangtao Sun
  2. Chenxiang Zhang
  3. Xueyou Zhang
  4. Ziyang Huang
  5. Haotian Xu
  6. Pei Chen
  7. Shizhu He
  8. Jun Zhao
  9. Kang Liu
  10. Xuanqing Yu
Citations (2)
