
Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models

Published 11 Jul 2024 in cs.CL and cs.AI | (2407.08440v4)

Abstract: Although LLMs have demonstrated strong abilities, in real-world scenarios they are further expected to be controlled and guided by rules in order to be safe, accurate, and intelligent. This demands that LLMs possess the inferential rule-following capability. However, no prior work has clearly evaluated the inferential rule-following capability of LLMs: previous studies that attempt to evaluate it fail to distinguish inferential rule-following scenarios from instruction-following scenarios. Therefore, this paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified range of inferential rule-following abilities. Our experimental results on a variety of LLMs show that they are still limited in following rules. Our analysis based on the evaluation results provides insights into improvements for LLMs toward becoming better inferential rule-following intelligent agents. We further propose Inferential Rule-Following Tuning (IRFT). The experimental results show that through IRFT, LLMs can learn abstract rule-following abilities from purely synthetic data and then generalize to RuleBench. The data and code can be found at: https://anonymous.4open.science/r/LLM-rule-following-B3E3/
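To make the distinction concrete: instruction-following means executing a directly stated command, while inferential rule-following means combining a general rule with given facts to derive new conclusions. The sketch below is a hypothetical illustration of the latter (it is not RuleBench's actual task format): a single "if P(x) then Q(x)" rule is forward-chained over a set of facts, the kind of one-step inference the benchmark probes LLMs on.

```python
# Hypothetical illustration (not the paper's RuleBench format): one-step
# forward chaining of an inferential rule "if P(x) then Q(x)" over facts.

def apply_rule(rule, facts):
    """Apply one rule over a set of facts.

    rule: (premise_predicate, conclusion_predicate)
    facts: set of (predicate, entity) tuples
    Returns the original facts plus any conclusions the rule licenses.
    """
    premise, conclusion = rule
    derived = {(conclusion, entity) for pred, entity in facts if pred == premise}
    return facts | derived

# Example: the rule "every bird can fly" applied to "Tweety is a bird".
rule = ("bird", "can_fly")
facts = {("bird", "Tweety"), ("cat", "Felix")}
print(apply_rule(rule, facts))
```

An instruction-following test would simply ask the model to perform a stated action; here the model must instead recognize that the rule's premise matches a fact and emit the derived conclusion, while leaving non-matching facts (Felix) untouched.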
