
Evaluating Large Language Models: A Comprehensive Survey (2310.19736v3)

Published 30 Oct 2023 in cs.CL and cs.AI

Abstract: LLMs have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation and safety evaluation. In addition to the comprehensive review on the evaluation methodologies and benchmarks on these three aspects, we collate a compendium of evaluations pertaining to LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability. We hope that this comprehensive overview will stimulate further research interests in the evaluation of LLMs, with the ultimate goal of making evaluation serve as a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution into a direction that maximizes societal benefit while minimizing potential risks. A curated list of related papers has been publicly available at https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.

Overview of LLM Evaluation Frameworks

The paper, "Evaluating LLMs: A Comprehensive Survey," offers a detailed exploration into the evaluation systems and methodologies for LLMs. This survey categorizes evaluations into three primary groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation. It aims to provide researchers with structured insights into the challenges and performance of LLMs across various specialized domains.

Knowledge and Capability Evaluation

The paper discusses the significance of evaluating LLMs’ knowledge and reasoning capabilities, emphasizing diverse methods to assess aspects such as question answering, knowledge completion, and reasoning skills. Datasets like SQuAD and benchmarks like MMLU are used to examine capabilities, highlighting the importance of dynamic and comprehensive evaluations. The paper stresses the need for evaluating tool learning and manipulation, illustrating this with benchmarks like API-Bank for assessing how models interact with tools to perform tasks effectively.
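
To make this kind of protocol concrete, the sketch below shows how an MMLU-style multiple-choice benchmark can be scored: each question and its options are rendered into a prompt, the model's answer letter is compared against the gold label, and accuracy is reported. This is a minimal illustration rather than the survey's own tooling; the `ask_model` callable is a hypothetical stand-in for any LLM API and is stubbed here so the example runs end to end.

```python
# Minimal sketch of MMLU-style multiple-choice scoring (illustrative only).
from typing import Callable

def format_prompt(question: str, choices: list[str]) -> str:
    """Render a question and its answer options as a zero-shot prompt."""
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer with a single letter (A-D).")
    return "\n".join(lines)

def accuracy(items: list[dict], ask_model: Callable[[str], str]) -> float:
    """Fraction of items where the model's answer letter matches the gold label."""
    correct = 0
    for item in items:
        prompt = format_prompt(item["question"], item["choices"])
        prediction = ask_model(prompt).strip().upper()[:1]
        correct += prediction == item["answer"]
    return correct / len(items)

if __name__ == "__main__":
    sample = [
        {"question": "Which gas do plants absorb during photosynthesis?",
         "choices": ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
         "answer": "B"},
    ]
    # Stub "model" that always answers B; replace with a real LLM call.
    print(accuracy(sample, lambda prompt: "B"))  # -> 1.0
```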

Alignment Evaluation

The discussion on alignment evaluation focuses on ensuring that LLMs produce outputs aligned with ethical and moral standards. This includes evaluating biases, toxicity, and truthfulness. The survey describes datasets and metrics used to assess these factors, such as Social Chemistry 101 and RealToxicityPrompts. It emphasizes the necessity of refining LLMs to minimize societal biases and misinformation, reinforcing the models’ alignment with human values through rigorous testing.
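
A common protocol behind toxicity benchmarks such as RealToxicityPrompts is to sample prompts, generate continuations, score each continuation with a toxicity classifier, and report how often the score crosses a threshold. The sketch below illustrates that loop under stated assumptions: `generate` and `toxicity_score` are hypothetical stubs standing in for an LLM and a classifier such as the Perspective API.

```python
# Hedged sketch of a RealToxicityPrompts-style evaluation loop.
def toxicity_rate(prompts, generate, toxicity_score, threshold=0.5):
    """Fraction of continuations whose toxicity score meets or exceeds `threshold`."""
    flagged = 0
    for prompt in prompts:
        continuation = generate(prompt)
        if toxicity_score(continuation) >= threshold:
            flagged += 1
    return flagged / len(prompts)

if __name__ == "__main__":
    prompts = ["The new neighbors were", "People from that city are"]
    # Stubs for illustration only; swap in a real model and classifier.
    generate = lambda p: p + " friendly and welcoming."
    toxicity_score = lambda text: 0.02  # pretend classifier score in [0, 1]
    print(toxicity_rate(prompts, generate, toxicity_score))  # -> 0.0
```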

Safety Evaluation

The survey divides safety evaluation into robustness evaluation and risk evaluation. Robustness evaluation examines how LLMs handle adversarial inputs and unexpected scenarios, using tools like PromptBench. Risk evaluation focuses on the potential for harmful behaviors, such as power-seeking, using frameworks like AgentBench. The goal is to develop systems resistant to adversarial manipulation and unintended harmful outputs.
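
Robustness evaluation in the PromptBench spirit can be approximated by re-running a task on systematically perturbed prompts and measuring the accuracy drop. The following sketch assumes a hypothetical `run_task` scorer (any benchmark harness would do) and uses a simple deterministic character-deletion perturbation purely for illustration.

```python
# Illustrative robustness check: accuracy on clean vs. perturbed prompts.
def perturb(text: str, stride: int = 7) -> str:
    """Delete every `stride`-th character to simulate typo-style noise."""
    return "".join(ch for i, ch in enumerate(text) if (i + 1) % stride != 0)

def robustness_gap(prompts, run_task):
    """Accuracy on clean prompts minus accuracy on the perturbed copies."""
    clean = run_task(prompts)
    noisy = run_task([perturb(p) for p in prompts])
    return clean - noisy

if __name__ == "__main__":
    prompts = ["Classify the sentiment: 'The movie was wonderful.'",
               "Classify the sentiment: 'The service was terrible.'"]
    # Stub scorer: pretend the model only succeeds when the keyword it
    # relies on is still intact in the prompt.
    run_task = lambda ps: sum("sentiment" in p for p in ps) / len(ps)
    print(robustness_gap(prompts, run_task))  # -> 1.0 with this stub
```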

Specialized Domain Applications

The paper extends the evaluation discourse to specialized domains, including medicine, finance, and education, emphasizing the application-specific challenges and benchmarks. For instance, in the medical field, LLMs are tested against standardized exams like the USMLE to ensure reliability in clinical decision support.

Holistic Evaluation Approaches

The survey also presents holistic evaluation frameworks like HELM and OpenAI Evals, which integrate multiple dimensions of assessment, including comprehensiveness, robustness, and alignment. These frameworks aim to capture the full spectrum of LLM capabilities and facilitate an understanding of model performance across diverse and complex scenarios.
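
One way to picture such a holistic harness is as a registry of scenarios, each pairing a small dataset with a metric, evaluated for a single model and reported side by side. The sketch below is only an organizational illustration in the spirit of HELM or OpenAI Evals, not their actual APIs; the scenario data and the `model` callable are toy placeholders.

```python
# Toy multi-scenario evaluation harness (organizational sketch only).
from statistics import mean

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

SCENARIOS = {
    "closed_book_qa": {"data": [("Capital of France?", "Paris")], "metric": exact_match},
    "arithmetic": {"data": [("What is 3 * 4?", "12")], "metric": exact_match},
}

def evaluate(model, scenarios=SCENARIOS):
    """Return one aggregate score per scenario for a single model."""
    report = {}
    for name, spec in scenarios.items():
        scores = [spec["metric"](model(prompt), gold) for prompt, gold in spec["data"]]
        report[name] = mean(scores)
    return report

if __name__ == "__main__":
    # Toy "model": a lookup table standing in for an actual LLM endpoint.
    canned = {"Capital of France?": "Paris", "What is 3 * 4?": "12"}
    print(evaluate(lambda prompt: canned.get(prompt, "")))
    # -> {'closed_book_qa': 1.0, 'arithmetic': 1.0}
```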

Future Directions

The survey outlines future directions, advocating for evaluations that are dynamic, comprehensive, and centered on real-world applications. It emphasizes enhancement-oriented evaluations that do not merely benchmark capabilities but also identify weaknesses, providing avenues for improvement. This forward-looking approach aims to align the evolution of LLMs with societal needs and ethical standards, promoting safer and more effective AI deployment.

The paper serves as a vital resource for understanding the complexity and breadth of LLM evaluation. By offering a structured taxonomy and highlighting specific benchmarks and methodologies, it provides a foundation for advancing AI research and development, ensuring that LLM evolution aligns with societal benefits and ethical standards.

  189. Logiqa 2.0 - an improved dataset for logical reasoning in natural language understanding. IEEE ACM Trans. Audio Speech Lang. Process., 31:2947–2962, 2023b. doi: 10.1109/TASLP.2023.3293046. URL https://doi.org/10.1109/TASLP.2023.3293046.
  190. Evaluating the logical reasoning ability of chatgpt and GPT-4. CoRR, abs/2304.03439, 2023c. doi: 10.48550/arXiv.2304.03439. URL https://doi.org/10.48550/arXiv.2304.03439.
  191. Does gender matter? towards fairness in dialogue systems. In Donia Scott, Núria Bel, and Chengqing Zong (eds.), Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pp.  4403–4416. International Committee on Computational Linguistics, 2020a. doi: 10.18653/v1/2020.coling-main.390. URL https://doi.org/10.18653/v1/2020.coling-main.390.
  192. Conceptnet—a practical commonsense reasoning tool-kit. BT technology journal, 22(4):211–226, 2004.
  193. Logiqa: A challenge dataset for machine reading comprehension with logical reasoning. In Christian Bessiere (ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pp.  3622–3628. ijcai.org, 2020b. doi: 10.24963/ijcai.2020/501. URL https://doi.org/10.24963/ijcai.2020/501.
  194. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. CoRR, abs/2305.01210, 2023d. doi: 10.48550/ARXIV.2305.01210. URL https://doi.org/10.48550/arXiv.2305.01210.
  195. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210, 2023e.
  196. Training socially aligned language models in simulated human society. CoRR, abs/2305.16960, 2023f. doi: 10.48550/arXiv.2305.16960. URL https://doi.org/10.48550/arXiv.2305.16960.
  197. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023g.
  198. G-eval: NLG evaluation using GPT-4 with better human alignment. CoRR, abs/2303.16634, 2023h. doi: 10.48550/arXiv.2303.16634. URL https://doi.org/10.48550/arXiv.2303.16634.
  199. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. CoRR, abs/2308.05374, 2023i. doi: 10.48550/arXiv.2308.05374. URL https://doi.org/10.48550/arXiv.2308.05374.
  200. Jailbreaking chatgpt via prompt engineering: An empirical study. CoRR, abs/2305.13860, 2023j. doi: 10.48550/arXiv.2305.13860. URL https://doi.org/10.48550/arXiv.2305.13860.
  201. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019. URL http://arxiv.org/abs/1907.11692.
  202. Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguistics, 8:726–742, 2020c. doi: 10.1162/tacl_a_00343. URL https://doi.org/10.1162/tacl_a_00343.
  203. What was your name again? interrogating generative conversational models for factual consistency evaluation. In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pp.  509–519, Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.gem-1.47. URL https://aclanthology.org/2022.gem-1.47.
  204. SCRUPLES: A corpus of community ethical judgments on 32, 000 real-life anecdotes. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pp.  13470–13479. AAAI Press, 2021. doi: 10.1609/aaai.v35i15.17589. URL https://doi.org/10.1609/aaai.v35i15.17589.
  205. Chameleon: Plug-and-play compositional reasoning with large language models. CoRR, abs/2304.09842, 2023a. doi: 10.48550/arXiv.2304.09842. URL https://doi.org/10.48550/arXiv.2304.09842.
  206. GEAR: augmenting language models with generalizable and efficient tool resolution. CoRR, abs/2307.08775, 2023b. doi: 10.48550/arXiv.2307.08775. URL https://doi.org/10.48550/arXiv.2307.08775.
  207. Chatgpt as a factual inconsistency evaluator for abstractive text summarization. CoRR, abs/2303.15621, 2023. doi: 10.48550/arXiv.2303.15621. URL https://doi.org/10.48550/arXiv.2303.15621.
  208. Www’18 open challenge: Financial opinion mining and question answering. In Pierre-Antoine Champin, Fabien Gandon, Mounia Lalmas, and Panagiotis G. Ipeirotis (eds.), Companion of the The Web Conference 2018 on The Web Conference 2018, WWW 2018, Lyon , France, April 23-27, 2018, pp.  1941–1942. ACM, 2018. doi: 10.1145/3184558.3192301. URL https://doi.org/10.1145/3184558.3192301.
  209. Socially aware bias measurements for hindi language representations. In Marine Carpuat, Marie-Catherine de Marneffe, and Iván Vladimir Meza Ruíz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pp.  1041–1052. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.naacl-main.76. URL https://doi.org/10.18653/v1/2022.naacl-main.76.
  210. Good debt or bad debt: Detecting semantic orientations in economic texts. J. Assoc. Inf. Sci. Technol., 65(4):782–796, 2014. doi: 10.1002/ASI.23062. URL https://doi.org/10.1002/asi.23062.
  211. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. CoRR, abs/2303.08896, 2023. doi: 10.48550/arXiv.2303.08896. URL https://doi.org/10.48550/arXiv.2303.08896.
  212. Hatexplain: A benchmark dataset for explainable hate speech detection. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pp.  14867–14875. AAAI Press, 2021. doi: 10.1609/AAAI.V35I17.17745. URL https://doi.org/10.1609/aaai.v35i17.17745.
  213. Türkçe tweetler üzerinde makine öğrenmesi ile nefret söylemi tespiti. Avrupa Bilim ve Teknoloji Dergisi, (24):328–334, 2021.
  214. On faithfulness and factuality in abstractive summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 1906–1919. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.173. URL https://doi.org/10.18653/v1/2020.acl-main.173.
  215. Augmented language models: a survey. CoRR, abs/2302.07842, 2023. doi: 10.48550/arXiv.2302.07842. URL https://doi.org/10.48550/arXiv.2302.07842.
  216. A diverse corpus for evaluating and developing english math word problem solvers. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 975–984. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.92. URL https://doi.org/10.18653/v1/2020.acl-main.92.
  217. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp.  2381–2391. Association for Computational Linguistics, 2018. doi: 10.18653/v1/d18-1260. URL https://doi.org/10.18653/v1/d18-1260.
  218. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. CoRR, abs/2305.14251, 2023. doi: 10.48550/arXiv.2305.14251. URL https://doi.org/10.48550/arXiv.2305.14251.
  219. Saif M. Mohammad. Obtaining reliable human ratings of valence, arousal, and dominance for 20, 000 english words. In Iryna Gurevych and Yusuke Miyao (eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 174–184. Association for Computational Linguistics, 2018. doi: 10.18653/v1/P18-1017. URL https://aclanthology.org/P18-1017/.
  220. Crosslingual generalization through multitask finetuning. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  15991–16111. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.891. URL https://doi.org/10.18653/v1/2023.acl-long.891.
  221. Stereoset: Measuring stereotypical bias in pretrained language models. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp.  5356–5371. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.acl-long.416. URL https://doi.org/10.18653/v1/2021.acl-long.416.
  222. Webgpt: Browser-assisted question-answering with human feedback. CoRR, abs/2112.09332, 2021. URL https://arxiv.org/abs/2112.09332.
  223. Semeval-2016 task 4: Sentiment analysis in twitter. CoRR, abs/1912.01973, 2019. URL http://arxiv.org/abs/1912.01973.
  224. Crows-pairs: A challenge dataset for measuring social biases in masked language models. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp. 1953–1967. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.emnlp-main.154. URL https://doi.org/10.18653/v1/2020.emnlp-main.154.
  225. How well do SOTA legal reasoning models support abductive reasoning? In Joaquín Arias, Sotiris Batsakis, Wolfgang Faber, Gopal Gupta, Francesco Pacenza, Emmanuel Papadakis, Livio Robaldo, Kilian Rückschloß, Elmer Salazar, Zeynep Gozen Saribatur, Ilias Tachmazidis, Felix Weitkämper, and Adam Z. Wyner (eds.), Proceedings of the International Conference on Logic Programming 2023 Workshops co-located with the 39th International Conference on Logic Programming (ICLP 2023), London, United Kingdom, July 9th and 10th, 2023, volume 3437 of CEUR Workshop Proceedings. CEUR-WS.org, 2023. URL https://ceur-ws.org/Vol-3437/paper1LPLR.pdf.
  226. Adversarial NLI: A new benchmark for natural language understanding. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 4885–4901. Association for Computational Linguistics, 2020. doi: 10.18653/V1/2020.ACL-MAIN.441. URL https://doi.org/10.18653/v1/2020.acl-main.441.
  227. Gpt as a financial advisor. Available at SSRN 4384861, 2023.
  228. Capabilities of GPT-4 on medical challenge problems. CoRR, abs/2303.13375, 2023. doi: 10.48550/arXiv.2303.13375. URL https://doi.org/10.48550/arXiv.2303.13375.
  229. Chatgpt goes to the operating room: evaluating gpt-4 performance and its potential in surgical education and training in the era of large language models. Annals of Surgical Treatment and Research, 104(5):269, 2023.
  230. Logicinference: A new dataset for teaching logical inference to seq2seq models. CoRR, abs/2203.15099, 2022. doi: 10.48550/arXiv.2203.15099. URL https://doi.org/10.48550/arXiv.2203.15099.
  231. OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt/, 2022.
  232. OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/arXiv.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
  233. Training language models to follow instructions with human feedback. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html.
  234. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tür, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pp.  4812–4829. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.naacl-main.383. URL https://doi.org/10.18653/v1/2021.naacl-main.383.
  235. Cross-lingual name tagging and linking for 282 languages. In Regina Barzilay and Min-Yen Kan (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 1946–1958. Association for Computational Linguistics, 2017. doi: 10.18653/V1/P17-1178. URL https://doi.org/10.18653/v1/P17-1178.
  236. Learning gain differences between chatgpt and human tutor generated algebra hints. CoRR, abs/2302.06871, 2023. doi: 10.48550/arXiv.2302.06871. URL https://doi.org/10.48550/arXiv.2302.06871.
  237. TALM: tool augmented language models. CoRR, abs/2205.12255, 2022. doi: 10.48550/arXiv.2205.12255. URL https://doi.org/10.48550/arXiv.2205.12255.
  238. Reducing gender bias in abusive language detection. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp.  2799–2804. Association for Computational Linguistics, 2018. doi: 10.18653/v1/d18-1302. URL https://doi.org/10.18653/v1/d18-1302.
  239. "why do I feel offended?" - korean dataset for offensive language identification. In Andreas Vlachos and Isabelle Augenstein (eds.), Findings of the Association for Computational Linguistics: EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pp.  1112–1123. Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.findings-eacl.85.
  240. BBQ: A hand-built bias benchmark for question answering. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pp.  2086–2105. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.findings-acl.165. URL https://doi.org/10.18653/v1/2022.findings-acl.165.
  241. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pp.  1470–1480. The Association for Computer Linguistics, 2015. doi: 10.3115/V1/P15-1142. URL https://doi.org/10.3115/v1/p15-1142.
  242. Are NLP models really able to solve simple math word problems? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tür, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pp.  2080–2094. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.naacl-main.168. URL https://doi.org/10.18653/v1/2021.naacl-main.168.
  243. Gorilla: Large language model connected with massive apis. CoRR, abs/2305.15334, 2023. doi: 10.48550/ARXIV.2305.15334. URL https://doi.org/10.48550/arXiv.2305.15334.
  244. Glove: Global vectors for word representation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp.  1532–1543. ACL, 2014. doi: 10.3115/v1/d14-1162. URL https://doi.org/10.3115/v1/d14-1162.
  245. Discovering language model behaviors with model-written evaluations. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  13387–13434. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.findings-acl.847. URL https://doi.org/10.18653/v1/2023.findings-acl.847.
  246. Deep contextualized word representations. In Marilyn A. Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pp.  2227–2237. Association for Computational Linguistics, 2018. doi: 10.18653/v1/n18-1202. URL https://doi.org/10.18653/v1/n18-1202.
  247. Language models as knowledge bases? In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp.  2463–2473. Association for Computational Linguistics, 2019. doi: 10.18653/v1/D19-1250. URL https://doi.org/10.18653/v1/D19-1250.
  248. CREATOR: disentangling abstract and concrete reasonings of large language models through tool creation. CoRR, abs/2305.14318, 2023. doi: 10.48550/arXiv.2305.14318. URL https://doi.org/10.48550/arXiv.2305.14318.
  249. Making language models better tool learners with execution feedback. CoRR, abs/2305.13068, 2023. doi: 10.48550/arXiv.2305.13068. URL https://doi.org/10.48550/arXiv.2305.13068.
  250. TIMEDIAL: temporal commonsense reasoning in dialog. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp.  7066–7076. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.acl-long.549. URL https://doi.org/10.18653/v1/2021.acl-long.549.
  251. Webcpm: Interactive web search for chinese long-form question answering. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  8968–8988. Association for Computational Linguistics, 2023a. doi: 10.18653/v1/2023.acl-long.499. URL https://doi.org/10.18653/v1/2023.acl-long.499.
  252. Tool learning with foundation models. CoRR, abs/2304.08354, 2023b. doi: 10.48550/arXiv.2304.08354. URL https://doi.org/10.48550/arXiv.2304.08354.
  253. Toolllm: Facilitating large language models to master 16000+ real-world apis. CoRR, abs/2307.16789, 2023c. doi: 10.48550/arXiv.2307.16789. URL https://doi.org/10.48550/arXiv.2307.16789.
  254. Overview and discussion of the competition on legal information extraction/entailment (COLIEE) 2021. Rev. Socionetwork Strateg., 16(1):111–133, 2022. doi: 10.1007/S12626-022-00105-Z. URL https://doi.org/10.1007/s12626-022-00105-z.
  255. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
  256. SQuAD: 100, 000+ questions for machine comprehension of text. In Jian Su, Xavier Carreras, and Kevin Duh (eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp.  2383–2392. The Association for Computational Linguistics, 2016. doi: 10.18653/v1/d16-1264. URL https://doi.org/10.18653/v1/d16-1264.
  257. Know what you don’t know: Unanswerable questions for squad. In Iryna Gurevych and Yusuke Miyao (eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pp. 784–789. Association for Computational Linguistics, 2018. doi: 10.18653/v1/P18-2124. URL https://aclanthology.org/P18-2124/.
  258. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Anna Korhonen, David R. Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp.  5370–5381. Association for Computational Linguistics, 2019. doi: 10.18653/V1/P19-1534. URL https://doi.org/10.18653/v1/p19-1534.
  259. Coqa: A conversational question answering challenge. Trans. Assoc. Comput. Linguistics, 7:249–266, 2019. doi: 10.1162/tacl_a_00266. URL https://doi.org/10.1162/tacl_a_00266.
  260. Investigating failures of automatic translation in the case of unambiguous gender. CoRR, abs/2104.07838, 2021. URL https://arxiv.org/abs/2104.07838.
  261. Enhancing the measurement of social effects by capturing morality. In Alexandra Balahur, Roman Klinger, Véronique Hoste, Carlo Strapparava, and Orphée De Clercq (eds.), Proceedings of the Tenth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA@NAACL-HLT 2019, Minneapolis, USA, June 6, 2019, pp.  35–45. Association for Computational Linguistics, 2019. doi: 10.18653/v1/w19-1305. URL https://doi.org/10.18653/v1/w19-1305.
  262. Factually consistent summarization via reinforcement learning with textual entailment feedback. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  6252–6272. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.344. URL https://doi.org/10.18653/v1/2023.acl-long.344.
  263. Recipes for building an open-domain chatbot. In Paola Merlo, Jörg Tiedemann, and Reut Tsarfaty (eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021, pp.  300–325. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.eacl-main.24. URL https://doi.org/10.18653/v1/2021.eacl-main.24.
  264. SOLID: A large-scale semi-supervised dataset for offensive language identification. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pp.  915–928. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.findings-acl.80. URL https://doi.org/10.18653/v1/2021.findings-acl.80.
  265. The programmer’s assistant: Conversational interaction with a large language model for software development. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pp.  491–514, 2023a.
  266. The programmer’s assistant: Conversational interaction with a large language model for software development. In Proceedings of the 28th International Conference on Intelligent User Interfaces, IUI 2023, Sydney, NSW, Australia, March 27-31, 2023, pp.  491–514. ACM, 2023b. doi: 10.1145/3581641.3584037. URL https://doi.org/10.1145/3581641.3584037.
  267. Hatecheck: Functional tests for hate speech detection models. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp.  41–58. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.acl-long.4. URL https://doi.org/10.18653/v1/2021.acl-long.4.
  268. Solving general arithmetic word problems. In Lluís Màrquez, Chris Callison-Burch, Jian Su, Daniele Pighin, and Yuval Marton (eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp.  1743–1752. The Association for Computational Linguistics, 2015. doi: 10.18653/v1/d15-1202. URL https://doi.org/10.18653/v1/d15-1202.
  269. TPTU: task planning and tool usage of large language model-based AI agents. CoRR, abs/2308.03427, 2023. doi: 10.48550/arXiv.2308.03427. URL https://doi.org/10.48550/arXiv.2308.03427.
  270. Gender bias in coreference resolution. In Marilyn A. Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pp.  8–14. Association for Computational Linguistics, 2018. doi: 10.18653/v1/n18-2002. URL https://doi.org/10.18653/v1/n18-2002.
  271. Conjnli: Natural language inference over conjunctive sentences. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp. 8240–8252. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.emnlp-main.661. URL https://doi.org/10.18653/v1/2020.emnlp-main.661.
  272. Lost at C: A user study on the security implications of large language model code assistants. In Joseph A. Calandrino and Carmela Troncoso (eds.), 32nd USENIX Security Symposium, USENIX Security 2023, Anaheim, CA, USA, August 9-11, 2023, pp.  2205–2222. USENIX Association, 2023a. URL https://www.usenix.org/conference/usenixsecurity23/presentation/sandoval.
  273. Lost at C: A user study on the security implications of large language model code assistants. arXiv preprint arXiv:2208.09727, 2023b.
  274. Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the conll-2003 shared task: Language-independent named entity recognition. CoRR, cs.CL/0306050, 2003. URL http://arxiv.org/abs/cs/0306050.
  275. Social iqa: Commonsense reasoning about social interactions. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp.  4462–4472. Association for Computational Linguistics, 2019. doi: 10.18653/v1/D19-1454. URL https://doi.org/10.18653/v1/D19-1454.
  276. Social bias frames: Reasoning about social and power implications of language. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 5477–5490. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.486. URL https://doi.org/10.18653/v1/2020.acl-main.486.
  277. Evaluating language-model agents on realistic autonomous tasks. 2023.
  278. Explaining legal concepts with augmented large language models (GPT-4). CoRR, abs/2306.09525, 2023. doi: 10.48550/arXiv.2306.09525. URL https://doi.org/10.48550/arXiv.2306.09525.
  279. Evaluating the moral beliefs encoded in LLMs. CoRR, abs/2307.14324, 2023. doi: 10.48550/ARXIV.2307.14324. URL https://doi.org/10.48550/arXiv.2307.14324.
  280. Toolformer: Language models can teach themselves to use tools. CoRR, abs/2302.04761, 2023. doi: 10.48550/arXiv.2302.04761. URL https://doi.org/10.48550/arXiv.2302.04761.
  281. Questeval: Summarization asks for fact-based evaluation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp.  6594–6604. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.emnlp-main.529. URL https://doi.org/10.18653/v1/2021.emnlp-main.529.
  282. On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  4454–4470. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.244. URL https://doi.org/10.18653/v1/2023.acl-long.244.
  283. CPT: A pre-trained unbalanced transformer for both chinese language understanding and generation. CoRR, abs/2109.05729, 2021. URL https://arxiv.org/abs/2109.05729.
  284. Performance of chatgpt on USMLE: unlocking the potential of large language models for ai-assisted medical education. CoRR, abs/2307.00112, 2023. doi: 10.48550/arXiv.2307.00112. URL https://doi.org/10.48550/arXiv.2307.00112.
  285. The woman worked as a babysitter: On biases in language generation. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp.  3405–3410. Association for Computational Linguistics, 2019. doi: 10.18653/V1/D19-1339. URL https://doi.org/10.18653/v1/D19-1339.
  286. Revealing persona biases in dialogue systems. CoRR, abs/2104.08728, 2021. URL https://arxiv.org/abs/2104.08728.
  287. Model evaluation for extreme risks. CoRR, abs/2305.15324, 2023. doi: 10.48550/arXiv.2305.15324. URL https://doi.org/10.48550/arXiv.2305.15324.
  288. Exploring the robustness of large language models for solving programming problems. CoRR, abs/2306.14583, 2023. doi: 10.48550/arXiv.2306.14583. URL https://doi.org/10.48550/arXiv.2306.14583.
  289. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 10737–10746. Computer Vision Foundation / IEEE, 2020. doi: 10.1109/CVPR42600.2020.01075. URL https://openaccess.thecvf.com/content_CVPR_2020/html/Shridhar_ALFRED_A_Benchmark_for_Interpreting_Grounded_Instructions_for_Everyday_Tasks_CVPR_2020_paper.html.
  290. Alfworld: Aligning text and embodied environments for interactive learning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=0IOX0YcCdTn.
  291. Large language models encode clinical knowledge. CoRR, abs/2212.13138, 2022. doi: 10.48550/arXiv.2212.13138. URL https://doi.org/10.48550/arXiv.2212.13138.
  292. Towards expert-level medical question answering with large language models. CoRR, abs/2305.09617, 2023. doi: 10.48550/arXiv.2305.09617. URL https://doi.org/10.48550/arXiv.2305.09617.
  293. Impact of news on the commodity market: Dataset and results. CoRR, abs/2009.04202, 2020. URL https://arxiv.org/abs/2009.04202.
  294. Can you put it all together: Evaluating conversational agents’ ability to blend skills. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 2021–2030. Association for Computational Linguistics, 2020. doi: 10.18653/V1/2020.ACL-MAIN.183. URL https://doi.org/10.18653/v1/2020.acl-main.183.
  295. "i’m sorry to hear that": Finding new biases in language models with a holistic descriptor dataset. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp.  9180–9211. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.emnlp-main.625. URL https://doi.org/10.18653/v1/2022.emnlp-main.625.
  296. Beyond classification: Financial reasoning in state-of-the-art language models. CoRR, abs/2305.01505, 2023a. doi: 10.48550/ARXIV.2305.01505. URL https://doi.org/10.48550/arXiv.2305.01505.
  297. Beyond classification: Financial reasoning in state-of-the-art language models. arXiv preprint arXiv:2305.01505, 2023b.
  298. Restgpt: Connecting large language models with real-world applications via restful apis. CoRR, abs/2306.06624, 2023. doi: 10.48550/arXiv.2306.06624. URL https://doi.org/10.48550/arXiv.2306.06624.
  299. Findings of the WMT 2020 shared task on machine translation robustness. In Loïc Barrault, Ondrej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Yvette Graham, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno-Yepes, Philipp Koehn, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, and Matteo Negri (eds.), Proceedings of the Fifth Conference on Machine Translation, WMT@EMNLP 2020, Online, November 19-20, 2020, pp.  76–91. Association for Computational Linguistics, 2020. URL https://aclanthology.org/2020.wmt-1.4/.
  300. Representing general relational knowledge in conceptnet 5. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis (eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012, pp.  3679–3686. European Language Resources Association (ELRA), 2012. URL http://www.lrec-conf.org/proceedings/lrec2012/summaries/1072.html.
  301. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. CoRR, abs/2206.04615, 2022. doi: 10.48550/arXiv.2206.04615. URL https://doi.org/10.48550/arXiv.2206.04615.
  302. BEHAVIOR: benchmark for everyday household activities in virtual, interactive, and ecological environments. In Aleksandra Faust, David Hsu, and Gerhard Neumann (eds.), Conference on Robot Learning, 8-11 November 2021, London, UK, volume 164 of Proceedings of Machine Learning Research, pp.  477–490. PMLR, 2021. URL https://proceedings.mlr.press/v164/srivastava22a.html.
  303. Evaluating gender bias in machine translation. In Anna Korhonen, David R. Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp.  1679–1684. Association for Computational Linguistics, 2019. doi: 10.18653/v1/p19-1164. URL https://doi.org/10.18653/v1/p19-1164.
  304. Robustification of multilingual language models to real-world noise in crosslingual zero-shot settings with robust contrastive pretraining. In Andreas Vlachos and Isabelle Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pp.  1367–1383. Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.eacl-main.100.
  305. A causal framework to quantify the robustness of mathematical reasoning with language models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  545–561. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.32. URL https://doi.org/10.18653/v1/2023.acl-long.32.
  306. Recitation-augmented language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=-cqvvvb-NkI.
  307. The AI teacher test: Measuring the pedagogical ability of blender and GPT-3 in educational dialogues. CoRR, abs/2205.07540, 2022. doi: 10.48550/arXiv.2205.07540. URL https://doi.org/10.48550/arXiv.2205.07540.
  308. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp.  4149–4158. Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1421. URL https://doi.org/10.18653/v1/n19-1421.
  309. Evaluating the factual consistency of large language models through news summarization. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  5220–5255. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.findings-acl.322. URL https://doi.org/10.18653/v1/2023.findings-acl.322.
  310. Dureader_robust: A chinese dataset towards evaluating robustness and generalization of machine reading comprehension in real-world applications. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 2: Short Papers), Virtual Event, August 1-6, 2021, pp.  955–963. Association for Computational Linguistics, 2021a. doi: 10.18653/v1/2021.acl-short.120. URL https://doi.org/10.18653/v1/2021.acl-short.120.
  311. Understanding factual errors in summarization: Errors, summarizers, datasets, error detectors. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  11626–11644. Association for Computational Linguistics, 2023a. doi: 10.18653/v1/2023.acl-long.650. URL https://doi.org/10.18653/v1/2023.acl-long.650.
  312. Evaluating large language models on medical evidence summarization. npj Digital Medicine, 6(1):158, 2023b.
  313. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. CoRR, abs/2306.05301, 2023c. doi: 10.48550/arXiv.2306.05301. URL https://doi.org/10.48550/arXiv.2306.05301.
  314. Do multi-hop question answering systems know how to answer the single-hop sub-questions? In Paola Merlo, Jörg Tiedemann, and Reut Tsarfaty (eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021, pp.  3244–3249. Association for Computational Linguistics, 2021b. doi: 10.18653/v1/2021.eacl-main.283. URL https://doi.org/10.18653/v1/2021.eacl-main.283.
  315. Transformer-based language models for software vulnerability detection. In Annual Computer Security Applications Conference, ACSAC 2022, Austin, TX, USA, December 5-9, 2022, pp.  481–496. ACM, 2022a. doi: 10.1145/3564625.3567985. URL https://doi.org/10.1145/3564625.3567985.
  316. Transformer-based language models for software vulnerability detection. In Proceedings of the 38th Annual Computer Security Applications Conference, pp.  481–496, 2022b.
  317. Lamda: Language models for dialog applications. CoRR, abs/2201.08239, 2022. URL https://arxiv.org/abs/2201.08239.
  318. Diagnosing the first-order logical reasoning ability through logicnli. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp.  3738–3747. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.emnlp-main.303. URL https://doi.org/10.18653/v1/2021.emnlp-main.303.
  319. Olid-br: offensive language identification dataset for brazilian portuguese. Language Resources and Evaluation, pp.  1–27, 2023.
  320. Newsqa: A machine comprehension dataset. In Phil Blunsom, Antoine Bordes, Kyunghyun Cho, Shay B. Cohen, Chris Dyer, Edward Grefenstette, Karl Moritz Hermann, Laura Rimell, Jason Weston, and Scott Yih (eds.), Proceedings of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL 2017, Vancouver, Canada, August 3, 2017, pp. 191–200. Association for Computational Linguistics, 2017. doi: 10.18653/v1/w17-2623. URL https://doi.org/10.18653/v1/w17-2623.
  321. Falsesum: Generating document-level NLI examples for recognizing factual inconsistency in summarization. In Marine Carpuat, Marie-Catherine de Marneffe, and Iván Vladimir Meza Ruíz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pp.  2763–2776. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.naacl-main.199. URL https://doi.org/10.18653/v1/2022.naacl-main.199.
  322. Learning from the worst: Dynamically generated datasets to improve online hate detection. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp.  1667–1682. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.acl-long.132. URL https://doi.org/10.18653/v1/2021.acl-long.132.
  323. Superglue: A stickier benchmark for general-purpose language understanding systems. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp.  3261–3275, 2019a. URL https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html.
  324. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019b. URL https://openreview.net/forum?id=rJ4km2R5t7.
  325. Asking and answering questions to evaluate the factual consistency of summaries. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 5008–5020. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.450. URL https://doi.org/10.18653/v1/2020.acl-main.450.
  326. Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. In Joaquin Vanschoren and Sai-Kit Yeung (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/335f5352088d7d9bf74191e006d8e24c-Abstract-round2.html.
  327. Is chatgpt a good NLG evaluator? A preliminary study. CoRR, abs/2303.04048, 2023a. doi: 10.48550/arXiv.2303.04048. URL https://doi.org/10.48550/arXiv.2303.04048.
  328. On the robustness of chatgpt: An adversarial and out-of-distribution perspective. CoRR, abs/2302.12095, 2023b. doi: 10.48550/arXiv.2302.12095. URL https://doi.org/10.48550/arXiv.2302.12095.
  329. Plan-and-Solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  2609–2634. Association for Computational Linguistics, 2023c. doi: 10.18653/v1/2023.acl-long.147. URL https://doi.org/10.18653/v1/2023.acl-long.147.
  330. Is chatgpt a good teacher coach? measuring zero-shot performance for scoring and providing actionable insights on classroom instruction. In Ekaterina Kochmar, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Nitin Madnani, Anaïs Tack, Victoria Yaneva, Zheng Yuan, and Torsten Zesch (eds.), Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications, BEA@ACL 2023, Toronto, Canada, 13 July 2023, pp.  626–667. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.bea-1.53. URL https://doi.org/10.18653/v1/2023.bea-1.53.
  331. Recode: Robustness evaluation of code generation models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  13818–13843. Association for Computational Linguistics, 2023d. doi: 10.18653/v1/2023.acl-long.773. URL https://doi.org/10.18653/v1/2023.acl-long.773.
  332. From LSAT: the progress and challenges of complex reasoning. IEEE ACM Trans. Audio Speech Lang. Process., 30:2201–2216, 2022. doi: 10.1109/taslp.2022.3164218. URL https://doi.org/10.1109/taslp.2022.3164218.
  333. Modeling semantic plausibility by injecting world knowledge. In Marilyn A. Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pp.  303–308. Association for Computational Linguistics, 2018. doi: 10.18653/v1/n18-2049. URL https://doi.org/10.18653/v1/n18-2049.
  334. Toxicity detection with generative prompt-based inference. CoRR, abs/2205.12390, 2022. doi: 10.48550/arXiv.2205.12390. URL https://doi.org/10.48550/arXiv.2205.12390.
  335. Self-instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  13484–13508. Association for Computational Linguistics, 2023e. doi: 10.18653/v1/2023.acl-long.754. URL https://doi.org/10.18653/v1/2023.acl-long.754.
  336. Mind the GAP: A balanced corpus of gendered ambiguous pronouns. Trans. Assoc. Comput. Linguistics, 6:605–617, 2018. doi: 10.1162/tacl_a_00240. URL https://doi.org/10.1162/tacl_a_00240.
  337. Gendered ambiguous pronoun (gap) shared task at the gender bias in nlp workshop 2019. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pp.  1–7, 2019.
  338. Jailbroken: How does LLM safety training fail? CoRR, abs/2307.02483, 2023a. doi: 10.48550/arXiv.2307.02483. URL https://doi.org/10.48550/arXiv.2307.02483.
  339. Chain-of-Thought prompting elicits reasoning in large language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.
  340. CMATH: can your language model pass chinese elementary school math test? CoRR, abs/2306.16636, 2023b. doi: 10.48550/arXiv.2306.16636. URL https://doi.org/10.48550/arXiv.2306.16636.
  341. Constructing datasets for multi-hop reading comprehension across documents. Trans. Assoc. Comput. Linguistics, 6:287–302, 2018. doi: 10.1162/tacl_a_00021. URL https://doi.org/10.1162/tacl_a_00021.
  342. Dialogue natural language inference. In Anna Korhonen, David R. Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp.  3731–3741. Association for Computational Linguistics, 2019. doi: 10.18653/v1/p19-1363. URL https://doi.org/10.18653/v1/p19-1363.
  343. Henry M Wellman. The child’s theory of mind. The MIT Press, 1992.
  344. A broad-coverage challenge corpus for sentence understanding through inference. In Marilyn A. Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pp.  1112–1122. Association for Computational Linguistics, 2018. doi: 10.18653/v1/n18-1101. URL https://doi.org/10.18653/v1/n18-1101.
  345. Bloomberggpt: A large language model for finance. CoRR, abs/2303.17564, 2023. doi: 10.48550/ARXIV.2303.17564. URL https://doi.org/10.48550/arXiv.2303.17564.
  346. Are large language models really good logical reasoners? A comprehensive evaluation and beyond. CoRR, abs/2306.09841, 2023a. doi: 10.48550/arXiv.2306.09841. URL https://doi.org/10.48550/arXiv.2306.09841.
  347. A systematic evaluation of large language models of code. In Swarat Chaudhuri and Charles Sutton (eds.), MAPS@PLDI 2022: 6th ACM SIGPLAN International Symposium on Machine Programming, San Diego, CA, USA, 13 June 2022, pp.  1–10. ACM, 2022a. doi: 10.1145/3520312.3534862. URL https://doi.org/10.1145/3520312.3534862.
Authors (11)
  1. Zishan Guo
  2. Renren Jin
  3. Chuang Liu
  4. Yufei Huang
  5. Dan Shi
  6. Linhao Yu
  7. Yan Liu
  8. Jiaxuan Li
  9. Bojian Xiong
  10. Deyi Xiong
  11. Supryadi