Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models (2407.08440v4)
Abstract: Although LLMs have demonstrated strong instruction-following ability, they are further supposed to be controlled and guided by rules in real-world scenarios to be safe, accurate, and intelligent. This demands that LLMs possess inferential rule-following capability. However, no prior work has clearly evaluated this capability: previous studies fail to distinguish inferential rule-following scenarios from instruction-following scenarios. Therefore, this paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified range of inferential rule-following abilities. Our experimental results on a variety of LLMs show that they are still limited in following rules. Our analysis of the evaluation results provides insights into improving LLMs toward better inferential rule-following intelligent agents. We further propose Inferential Rule-Following Tuning (IRFT). The experimental results show that through IRFT, LLMs can learn abstract rule-following abilities from purely synthetic data and then generalize to RuleBench. The data and code can be found at: https://anonymous.4open.science/r/LLM-rule-following-B3E3/
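To make the inferential rule-following setting concrete, below is a minimal, hypothetical sketch (not the paper's released pipeline) of how purely synthetic training examples of the kind IRFT learns from could be generated: sample an abstract if-then rule over placeholder predicates, instantiate a matching fact, and take the rule's conclusion as the supervision target. All names here (`PREDICATES`, `ENTITIES`, `make_example`) are illustrative assumptions, not identifiers from the paper's code.

```python
import random

# Hypothetical sketch of IRFT-style synthetic data generation (assumed
# structure, not the paper's actual pipeline): each example pairs an
# abstract if-then rule with one matching fact, and the target is the
# conclusion obtained by applying the rule to that fact.

PREDICATES = ["P", "Q", "R", "S"]          # abstract predicate symbols
ENTITIES = ["alice", "bob", "carol", "dave"]  # placeholder constants

def make_example(rng: random.Random) -> dict:
    antecedent, consequent = rng.sample(PREDICATES, 2)
    entity = rng.choice(ENTITIES)
    rule = f"If {antecedent}(x) then {consequent}(x)."
    fact = f"{antecedent}({entity})."
    prompt = (
        f"Rule: {rule}\n"
        f"Fact: {fact}\n"
        "Question: What can be inferred? Answer with a single atom."
    )
    target = f"{consequent}({entity})."  # conclusion of applying the rule
    return {"prompt": prompt, "target": target}

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        ex = make_example(rng)
        print(ex["prompt"])
        print("->", ex["target"], "\n")
```

Because the rules use only abstract symbols, any rule-following behavior a model acquires on such data cannot come from memorized world knowledge, which is what lets the paper test whether the learned ability transfers to the natural-language rules in RuleBench.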
Authors: Wangtao Sun, Chenxiang Zhang, Xueyou Zhang, Ziyang Huang, Haotian Xu, Pei Chen, Shizhu He, Jun Zhao, Kang Liu, Xuanqing Yu