
PRobELM: Plausibility Ranking Evaluation for Language Models

Published 4 Apr 2024 in cs.CL | (2404.03818v4)

Abstract: This paper introduces PRobELM (Plausibility Ranking Evaluation for Language Models), a benchmark designed to assess language models' ability to discern more plausible scenarios from less plausible ones using their parametric knowledge. Benchmarks such as TruthfulQA emphasise factual accuracy or truthfulness, while others such as COPA explore plausible scenarios without explicitly incorporating world knowledge. PRobELM seeks to bridge this gap by evaluating models' ability to prioritise plausible scenarios that leverage world knowledge over less plausible alternatives. This design allows us to assess the potential of LLMs for downstream use cases such as literature-based discovery, where the focus is on identifying information that is likely but not yet known. Our benchmark is constructed from a dataset curated from Wikidata edit histories, tailored to align with the temporal bounds of the training data of the evaluated models. PRobELM facilitates the evaluation of LLMs across multiple prompting types, including statement, text completion, and question answering. Experiments with 10 models of various sizes and architectures examine the relationship between model scale, training recency, and plausibility performance. They reveal that factual accuracy does not directly correlate with plausibility performance and that up-to-date training data enhances plausibility assessment across different model architectures.
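
To make the ranking setup in the abstract concrete, the following is a minimal sketch of how candidate scenarios might be ranked under the three prompting types it names. Everything specific here is an assumption for illustration and is not taken from the paper: the prompt templates, the function names `build_prompts`, `rank_candidates`, and `reciprocal_rank`, the use of reciprocal rank as the metric, and the `toy_score` stand-in, which replaces a real model's likelihood score.

```python
from typing import Callable, Dict, List


def build_prompts(subject: str, relation: str, candidate: str) -> Dict[str, str]:
    """Render one candidate under three hypothetical prompt templates."""
    return {
        "statement": f"{subject} {relation} {candidate}.",
        "text_completion": f"{subject} {relation} {candidate}",
        "question_answering": (
            f"Question: {subject} {relation} what? Answer: {candidate}"
        ),
    }


def rank_candidates(
    score: Callable[[str], float],
    subject: str,
    relation: str,
    candidates: List[str],
    prompt_type: str,
) -> List[str]:
    """Order candidates from most to least plausible according to `score`."""
    scored = [
        (score(build_prompts(subject, relation, c)[prompt_type]), c)
        for c in candidates
    ]
    # Sort on the score only; ties keep their original order.
    return [c for _, c in sorted(scored, key=lambda p: p[0], reverse=True)]


def reciprocal_rank(ranking: List[str], gold: str) -> float:
    """Return 1 / (1-based position of the gold candidate in the ranking)."""
    return 1.0 / (ranking.index(gold) + 1)


def toy_score(text: str) -> float:
    """Stand-in for a model score (assumption): fixed weights per keyword."""
    hints = {"Paris": 2.0, "Lyon": 1.0}
    return sum(weight for key, weight in hints.items() if key in text)


if __name__ == "__main__":
    ranking = rank_candidates(
        toy_score,
        "The company",
        "is headquartered in",
        ["Lyon", "Paris", "Atlantis"],
        "statement",
    )
    print(ranking, reciprocal_rank(ranking, "Lyon"))
```

In practice `toy_score` would be replaced by a function that queries the model under evaluation, for example by summing token log-probabilities of the rendered prompt; that substitution is likewise an assumption rather than the paper's stated procedure.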
