Emergent Mind

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

(2206.04615)
Published Jun 9, 2022 in cs.CL, cs.AI, cs.CY, cs.LG, and stat.ML

Abstract

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Comparison of top and average human raters vs. best model on BIG-bench Lite tasks.

Overview

  • BIG-bench provides a comprehensive benchmark for language models with 204 diverse tasks across multiple domains, aiming to quantify and qualify model behaviors.

  • The benchmark evaluates OpenAI's GPT models and Google-internal dense and sparse transformer architectures against a human expert baseline, focusing on performance, calibration, bias, and robustness.

  • Key findings include the correlation of performance improvement with model scale, sensitivity to task framing, the amplification of social biases in larger models, and underperformance in tasks involving low-resource languages.

  • The insights from BIG-bench inform future research directions in model calibration, bias mitigation, development of robust models, exploration into architectures, and inclusivity in data representation.

Introduction

The capabilities of language models (LMs) are evolving rapidly, repeatedly challenging our understanding of what these systems can do. The Beyond the Imitation Game benchmark (BIG-bench) was introduced to address critical gaps in existing benchmarks for language models. It comprises 204 diverse tasks spanning domains such as linguistics, mathematics, and commonsense reasoning, along with tasks like code debugging and chess move prediction. It aims to characterize model behavior both quantitatively and qualitatively, offering insight into the capabilities and limitations of modern language models across a wide range of model sizes.

Evaluation Methodology

The paper reports evaluations of models of varying sizes, from millions to hundreds of billions of parameters, including OpenAI's GPT models and Google-internal models. The evaluated models cover both dense and sparse transformer architectures. A human expert baseline provides context for model performance. Beyond raw task performance, BIG-bench also examines model calibration, social bias, and robustness to how a task is presented.
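Calibration asks whether a model's stated confidence matches how often it is right. The summary above does not say how calibration is scored, so the following is only a minimal sketch of two widely used measures, the Brier score and the expected calibration error (ECE). The function names, the equal-width binning, and the default of 10 bins are assumptions made for illustration, not the paper's evaluation code.

```python
from typing import Sequence


def brier_score(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean squared distance between predicted distributions and one-hot targets.

    probs[i][k] is the predicted probability of choice k on example i;
    labels[i] is the index of the correct choice. Lower is better.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average of |accuracy - confidence| over equal-width confidence bins.

    confidences[i] is the probability the model gave its chosen answer;
    correct[i] is 1 if that answer was right, else 0.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; a confidence of exactly 0.0 goes into the first bin.
        idx = [
            i for i, c in enumerate(confidences)
            if lo < c <= hi or (b == 0 and c == 0.0)
        ]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece
```

A model that answers with 95% confidence but is right only 80% of the time would show an ECE of about 0.15 under this sketch, which is the kind of gap a calibration analysis is meant to surface.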

Key Findings and Implications

Performance Trends and Task Breakthroughs

One of the primary observations from the benchmark is that performance improves considerably with model scale. Even so, all models, irrespective of size, fall well short of expert human performance. The analysis also uncovers "breakthrough" behavior, in which performance on a specific task improves sharply once the model passes a certain scale. This points to nonlinear scaling in LMs, especially on tasks that involve multi-step reasoning or that are scored with brittle, narrowly defined metrics.
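One way to build intuition for why multi-step tasks with all-or-nothing metrics can look like breakthroughs is a toy model: if every step of a task must be correct for the answer to count, then smooth gains in per-step accuracy compound into an abrupt jump in the final score. The sketch below assumes the steps succeed independently and with equal probability, which is a simplification introduced here for illustration and not a claim from the paper; `exact_match_rate` is a hypothetical helper.

```python
def exact_match_rate(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step is correct, assuming independent steps.

    This is a toy model: real tasks have correlated steps and unequal
    difficulty, so treat the numbers as qualitative only.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must lie in [0, 1]")
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return per_step_accuracy ** num_steps


if __name__ == "__main__":
    # Per-step accuracy rises gradually; the 10-step score stays near zero
    # for most of that range and then climbs steeply.
    for p in (0.5, 0.7, 0.9, 0.99):
        print(f"per-step={p:.2f}  10-step exact match={exact_match_rate(p, 10):.4f}")
```

Under these assumptions, raising per-step accuracy from 0.5 to 0.9 lifts a ten-step exact-match score from roughly 0.001 to roughly 0.35, so a metric that only rewards fully correct answers can hide steady underlying progress.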

Sensitivity to Task Framing

The benchmark exposes the models' brittleness: performance fluctuates depending on how a task is framed. These findings prompt a reevaluation of model robustness and suggest a need for models that generalize across different framings of essentially the same task.
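Framing sensitivity can be summarized by scoring the same task under several phrasings and looking at how much the scores spread. The following is a minimal sketch under that assumption; `framing_sensitivity`, the framing names, and the scores in the usage example are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import Dict, Mapping


def framing_sensitivity(scores_by_framing: Mapping[str, float]) -> Dict[str, float]:
    """Summarize how much a task score varies across alternative framings.

    scores_by_framing maps a framing label to the score obtained with it.
    A large range or standard deviation suggests a brittle result.
    """
    if not scores_by_framing:
        raise ValueError("need at least one framing")
    scores = list(scores_by_framing.values())
    return {
        "mean": mean(scores),
        "range": max(scores) - min(scores),
        "std": pstdev(scores),  # population standard deviation across framings
    }


if __name__ == "__main__":
    # Hypothetical accuracies for one task posed three different ways.
    example = {"multiple_choice": 0.70, "free_form": 0.50, "rephrased": 0.60}
    print(framing_sensitivity(example))
```

Reporting the spread alongside the mean makes it harder to mistake a result that holds for only one phrasing for a robust capability.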

Social Bias

A disconcerting finding is that social bias typically increases with model scale, especially in settings with broad or ambiguous context, although the paper reports that prompting can reduce it. This underscores the need for continued attention to fairness and bias mitigation in AI development.

Language and Domain Coverage

BIG-bench reveals a pronounced performance disparity across languages, with models underperforming in particular on tasks involving low-resource languages. This gap highlights the importance of inclusive data representation when training models intended for global use.

Future Directions

The insights from BIG-bench provide a roadmap for future research in LMs, emphasizing model calibration, bias mitigation, and the development of more robust models. The emergence of breakthrough behaviors and the sensitivity to task framing also underscore the need for continued exploration of model architectures and training procedures. Finally, the performance gap on tasks involving low-resource languages and specific domains points to the need for a more inclusive approach to data collection and model training.

Conclusion

BIG-bench marks a significant advance in understanding the capabilities and limitations of language models. By encompassing a wide range of tasks and evaluating models of varying scales, it delivers comprehensive insights into the current state of LMs. The findings highlight the complexities of model scaling, sensitivity to task framing, and the societal implications of model biases. As LMs continue to evolve, benchmarks like BIG-bench will be pivotal in guiding the development of more capable, equitable, and robust AI systems.


  148. emoji2vec: Learning emoji representations from their description. In Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media, pp.  48--54, Austin, TX, USA, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/W16-6208. https://aclanthology.org/W16-6208.

  149. Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings
  150. Semantic Sort: A Supervised Approach to Personalized Semantic Relatedness
  151. Measuring and Improving Consistency in Pretrained Language Models
  152. Learning to learn programs from examples: Going beyond program structure. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp.  1638--1645, 2017. doi: 10.24963/ijcai.2017/227. https://doi.org/10.24963/ijcai.2017/227.

  153. DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning
  154. Can Neural Networks Understand Logical Entailment?
  155. Making sense of sensory input
  156. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  3938--3948, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1395. https://aclanthology.org/N19-1395.

  157. Text Editing by Command
  158. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  889--898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. https://aclanthology.org/P18-1082.

  159. Beyond English-centric multilingual machine translation. Journal of Machine Learning Research, 22(107):1--48, 2021. http://jmlr.org/papers/v22/20-1307.html.

  160. Humor detection via an internal and external neural network. Neurocomputing, 394:105--111, 2020. doi: 10.1016/j.neucom.2020.02.030. https://www.sciencedirect.com/science/article/pii/S0925231220302058.

  161. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
  162. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2017. doi: 10.18653/v1/D17-1169. https://doi.org/10.18653/v1/D17-1169.

  163. Christiane Fellbaum (ed.). WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA
  164. Synthesizing data structure transformations from input-output examples. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’15, pp.  229–239, New York, NY, USA, 2015. Association for Computing Machinery. doi: 10.1145/2737924.2737977. https://doi.org/10.1145/2737924.2737977.
  165. Susan T. Fiske. Controlling other people: The impact of power on stereotyping. American Psychologist, 48:621--628, 1993. doi: 10.1037/0003-066X.48.6.621. https://doi.org/10.1037/0003-066X.48.6.621.

  166. The Cattell-Horn-Carroll theory of cognitive abilities. In Encyclopedia of Special Education. John Wiley & Sons, Ltd, 2014. doi: 10.1002/9781118660584.ese0431. https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118660584.ese0431.

  167. An introduction to inductive programming. Artificial Intelligence Review, 29:45--62, 2008. doi: 10.1007/s10462-009-9108-7. https://doi.org/10.1007/s10462-009-9108-7.

  168. Jerry A. Fodor. The Language of Thought. Harvard University Press, Cambridge, MA
  169. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1):3--71, 1988. doi: 10.1016/0010-0277(88)90031-5. https://www.sciencedirect.com/science/article/pii/0010027788900315.

  170. Mark Forsyth. The Elements of Eloquence: Secrets of the Perfect Turn of Phrase. Berkley, New York
  171. Whodunnit? Crime drama as a case for natural language understanding. Transactions of the Association for Computational Linguistics, 6:1--15, 2018. doi: 10.1162/tacl_a_00001. https://aclanthology.org/Q18-1001.

  172. “The penny drops”: Investigating insight through the medium of cryptic crosswords. Frontiers in Psychology, 9, 2018. doi: 10.3389/fpsyg.2018.00904. https://www.frontiersin.org/article/10.3389/fpsyg.2018.00904.
  173. Martins Frolovs. Teaching GPT-2 transformer a sense of humor: How to fine-tune large transformer models on a single GPU in PyTorch. Towards Data Science, Medium, 2019. https://towardsdatascience.com/teaching-gpt-2-a-sense-of-humor-fine-tuning-large-transformer-models-on-a-single-gpu-in-pytorch-59e8cec40912.

  174. GO FIGURE: A Meta Evaluation of Factuality in Summarization
  175. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI’07: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp.  1606–1611, San Francisco, 2007. Morgan Kaufmann. doi: 10.5555/1625275.1625535. https://dl.acm.org/doi/10.5555/1625275.1625535.
  176. Predictability and Surprise in Large Generative Models
  177. Neural metaphor detection in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.  607--613, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1060. https://aclanthology.org/D18-1060.

  178. The Pile: An 800GB Dataset of Diverse Text for Language Modeling
  179. EleutherAI/lm-evaluation-harness: v0.2.0, March 2022. https://doi.org/10.5281/zenodo.6332975.

  180. Neurosymbolic AI: The 3rd Wave
  181. Evaluating Models' Local Decision Boundaries via Contrast Sets
  182. The TUNA-REG challenge 2009: Overview and evaluation results. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009), pp.  174--182, Athens, Greece, March 2009. Association for Computational Linguistics. https://aclanthology.org/W09-0629.

  183. TerpreT: A Probabilistic Programming Language for Program Induction
  184. Knowledge-aware assessment of severity of suicide risk for early intervention. In The World Wide Web Conference, WWW ’19, pp.  514–525, New York, NY, USA, 2019. Association for Computing Machinery. doi: 10.1145/3308558.3313698. https://doi.org/10.1145/3308558.3313698.
  185. SyntaxGym: An online platform for targeted evaluation of language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp.  70--76, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-demos.10. https://aclanthology.org/2020.acl-demos.10.

  186. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.  3356--3369, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.301. https://aclanthology.org/2020.findings-emnlp.301.

  187. The GEM benchmark: Natural language generation, its evaluation and metrics. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pp.  96--120, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.gem-1.10. https://aclanthology.org/2021.gem-1.10.

  188. The roles of similarity in transfer: Separating retrievability from inferential soundness. Cognitive Psychology, 25(4):524--575, 1993. doi: 10.1006/cogp.1993.1013. https://www.sciencedirect.com/science/article/pii/S0010028583710133.
  189. Conversational implicatures in English dialogue: Annotated dataset. Procedia Computer Science, 171:2316--2323, 2020. doi: 10.1016/j.procs.2020.04.251. https://www.sciencedirect.com/science/article/pii/S1877050920312436. Special issue: Third International Conference on Computing and Network Communications (CoCoNet’19).

  190. Injecting numerical reasoning skills into language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  946--958, Online, July 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.89. https://aclanthology.org/2020.acl-main.89.

  191. Transformer Feed-Forward Layers Are Key-Value Memories
  192. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics, 9:346--361, Apr. 2021. doi: 10.1162/tacl_a_00370. https://doi.org/10.1162/tacl_a_00370.

  193. Irony detection in a multilingual context. In Joemon M. Jose, Emine Yilmaz, João Magalhães, Pablo Castells, Nicola Ferro, Mário J. Silva, and Flávio Martins (eds.), Advances in Information Retrieval. ECIR 2020. Lecture Notes in Computer Science, volume 12036. Springer, Cham, 2020. https://link.springer.com/chapter/10.1007/978-3-030-45442-5_18.

  194. A Report on the 2020 Sarcasm Detection Shared Task
  195. ePiC: Employing Proverbs in Context as a Benchmark for Abstract Language Understanding
  196. Color naming across languages reflects color use. Proceedings of the National Academy of Sciences, 114(40):10785--10790, 2017. doi: 10.1073/pnas.1619666114. https://www.pnas.org/doi/abs/10.1073/pnas.1619666114.

  197. Dr.Fill: Crosswords and an Implemented Solver for Singly Weighted CSPs
  198. Assessing BERT's Syntactic Abilities
  199. Arthur S. Goldberger. Structural equation methods in the social sciences. Econometrica, 40(6):979--1001, 1972. http://www.jstor.org/stable/1913851.

  200. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  609--614, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1061. https://aclanthology.org/N19-1061.

  201. Identifying sarcasm in Twitter: A closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.  581--586, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. https://aclanthology.org/P11-2102.

  202. Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences, 20:818--829, 2016. doi: 10.1016/j.tics.2016.08.005. https://doi.org/10.1016/j.tics.2016.08.005.

  203. Topical-chat: Towards knowledge-grounded open-domain conversations. In Proc. Interspeech 2019, pp.  1891--1895, 2019. doi: 10.21437/Interspeech.2019-3079. https://www.isca-speech.org/archive/interspeech_2019/gopalakrishnan19_interspeech.html.

  204. Are neural open-domain dialog systems robust to speech recognition errors in the dialog history? An empirical study. In Proc. Interspeech 2020, pp.  911--915, 2020. doi: 10.21437/Interspeech.2020-1508. https://www.isca-speech.org/archive/interspeech_2020/gopalakrishnan20_interspeech.html.

  205. Andrew S. Gordon. Choice of plausible alternatives (COPA), 2010. https://people.ict.usc.edu/~gordon/copa.html.

  206. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34, 2003. doi: 10.35111/0z6y-q265. https://doi.org/10.35111/0z6y-q265.

  207. Neural Turing Machines
  208. Hybrid computing using a neural network with dynamic external memory. Nature, 538:471--476, 2016. doi: 10.1038/nature20101. https://doi.org/10.1038/nature20101.

  209. Progress report on program-understanding systems (AIM-240), 1974. http://infolab.stanford.edu/pub/cstr/reports/cs/tr/74/444/CS-TR-74-444.pdf.

  210. Cordell Green. Application of theorem proving to problem solving. In Bonnie Lynn Webber and Nils J. Nilsson (eds.), Readings in Artificial Intelligence, pp.  202--222. Morgan Kaufmann, 1981. doi: 10.1016/B978-0-934613-03-3.50019-2. https://www.sciencedirect.com/science/article/pii/B9780934613033500192.

  211. In defense of a dogma. The Philosophical Review, 65(2):141--158, 1956. http://www.jstor.org/stable/2182828.

  212. Stochastic Optimization of Sorting Networks via Continuous Relaxations
  213. Universal Neural Machine Translation for Extremely Low Resource Languages
  214. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  1195--1205, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1108. https://aclanthology.org/N18-1108.

  215. Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. SIGPLAN Not., 46(1):317–330, Jan. 2011. doi: 10.1145/1925844.1926423. https://doi.org/10.1145/1925844.1926423.
  216. Spreadsheet data manipulation using examples. Commun. ACM, 55(8):97–105, Aug. 2012. doi: 10.1145/2240236.2240260. https://doi.org/10.1145/2240236.2240260.
  217. Inductive programming meets the real world. Commun. ACM, 58(11):90–99, Oct. 2015. doi: 10.1145/2736282. https://doi.org/10.1145/2736282.

  218. Program synthesis. Foundations and Trends in Programming Languages, 4(1–2):1--119, 2017a. doi: 10.1561/2500000010. http://dx.doi.org/10.1561/2500000010.

  219. Program Synthesis. NOW, Boston, 2017b. https://www.microsoft.com/en-us/research/wp-content/uploads/2017/10/program_synthesis_now.pdf.

  220. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp.  1321--1330. PMLR, 06--11 Aug. 2017. https://proceedings.mlr.press/v70/guo17a.html.

  221. Disfl-QA: A benchmark dataset for understanding disfluencies in question answering. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp.  3309--3319, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.293. https://aclanthology.org/2021.findings-acl.293.

  222. English Proverbs and Sayings. Vysshaya shkola, Moscow
  223. Samuel Gyasi Obeng. The proverb as a mitigating and politeness strategy in Akan discourse. Anthropological Linguistics, 38(3):521--549, 1996. http://www.jstor.org/stable/30028601.

  224. The argument reasoning comprehension task: Identification and reconstruction of implicit warrants. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  1930--1940, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1175. https://aclanthology.org/N18-1175.

  225. Michael Hahn. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156--171, Dec. 2020. doi: 10.1162/tacl_a_00306. https://doi.org/10.1162/tacl_a_00306.

  226. It’s all in the name: Mitigating gender bias with name-based counterfactual data substitution. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  5267--5275, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1530. https://aclanthology.org/D19-1530.

  227. Joseph Y. Halpern. Actual causality. MIT Press, Cambridge, MA
  228. ECONET: Effective Continual Pretraining of Language Models for Event Temporal Reasoning
  229. Maria Hanzén. When in Rome, do as the Romans do: Proverbs as a part of EFL teaching. Master’s thesis, Jönköping University, School of Education and Communication, Jönköping, 2007. http://www.diva-portal.org/smash/get/diva2:3499/fulltext01.pdf.

  230. Context-Free Transductions with Neural Stacks
  231. Francesca G.E. Happé. An advanced test of theory of mind: Understanding of story characters' thoughts and feelings by able autistic, mentally handicapped, and normal children and adults. Journal of Autism and Developmental Disorders, 24:129--154, 1994. https://link.springer.com/article/10.1007/BF02172093.

  232. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4), Dec. 2015. doi: 10.1145/2827872. https://doi.org/10.1145/2827872.

  233. Policy-driven neural response generation for knowledge-grounded dialog systems. In Proceedings of the 13th International Conference on Natural Language Generation, pp.  412--421, Dublin, Ireland, Dec. 2020. Association for Computational Linguistics. https://aclanthology.org/2020.inlg-1.46.

  234. A survey on recent approaches for natural language processing in low-resource scenarios. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  2545--2568, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.201. https://aclanthology.org/2021.naacl-main.201.

  235. Irene Heim. On the projection problem for presuppositions. In Paul Portner and Barbara H. Partee (eds.), Formal Semantics - The Essential Readings, pp.  249--260. Blackwell, Oxford
  236. Tracking the World State with Recurrent Entity Networks
  237. Women also snowboard: Overcoming bias in captioning models. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  771--787, Cham, 2018. Springer. https://www.ecva.net/papers/eccv_2018/papers_ECCV/papers/Lisa_Anne_Hendricks_Women_also_Snowboard_ECCV_2018_paper.pdf.

  238. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017. https://openreview.net/forum?id=Hkg4TI9xl.

  239. Using pre-training can improve model robustness and uncertainty. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2712--2721. PMLR, 09--15 June 2019. https://proceedings.mlr.press/v97/hendrycks19a.html.

  240. Aligning AI With Shared Human Values
  241. Measuring Coding Challenge Competence With APPS
  242. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021b. https://openreview.net/forum?id=d7KBjmI3GmQ.

  243. Measuring Mathematical Problem Solving With the MATH Dataset
  244. Scaling Laws for Autoregressive Generative Modeling
  245. The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3):61–83, 2010. doi: 10.1017/S0140525X0999152X. https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/weirdest-people-in-the-world/BF84F7517D56AFF7B7EB58411A554C17.

  246. Teaching machines to read and comprehend. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. https://proceedings.neurips.cc/paper/2015/hash/afdec7005cc9f14302cd0474fd0f3c96-Abstract.html.

  247. Scaling Laws for Transfer
  248. 3D-DEEP: 3-dimensional deep-learning based on elevation patterns for road scene interpretation. In 2020 IEEE Intelligent Vehicles Symposium (IV), Piscataway, NJ, Oct. 2020. Institute of Electrical and Electronics Engineers. doi: 10.1109/iv47402.2020.9304601. https://doi.org/10.48550/arXiv.2009.00330.
  249. TaPas: Weakly supervised table parsing via pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  4320--4333, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.398. https://aclanthology.org/2020.acl-main.398.

  250. Deep Learning Scaling is Predictable, Empirically
  251. Beyond human-level accuracy: Computational challenges in deep learning. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, PPoPP ’19, pp.  1–14, New York, NY, USA, 2019. Association for Computing Machinery. doi: 10.1145/3293883.3295710. https://doi.org/10.1145/3293883.3295710.
  252. RNNs can generate bounded hierarchical languages with optimal memory
  253. Mireille Hildebrandt. Algorithmic regulation and the rule of law. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 376(2128):20170355, 2018. doi: 10.1098/rsta.2017.0355. https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2017.0355.
  254. Training Compute-Optimal Large Language Models
  255. Keith J. Holyoak. Analogy and relational reasoning. In Keith J. Holyoak and Robert G. Morrison (eds.), The Oxford Handbook of Thinking and Reasoning. Oxford University Press, Oxford, 2012. https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199734689.001.0001/oxfordhb-9780199734689-e-13.

  256. Richard P. Honeck. A Proverb in Mind: The Cognitive Science of Proverbial Wit and Wisdom. Lawrence Erlbaum Associates, Mahwah, NJ
  257. Alexandra Horowitz. Smelling themselves: Dogs investigate their own odours longer when modified in an “olfactory mirror” test. Behavioural Processes, 143:17--24, 2017. doi: 10.1016/j.beproc.2017.08.001. https://www.sciencedirect.com/science/article/pii/S0376635717300104.

  258. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  523--533, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1058. https://aclanthology.org/D14-1058.

  259. Yufang Hou. Bridging anaphora resolution as question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  1428--1438, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.132. https://aclanthology.org/2020.acl-main.132.

  260. Global inference for bridging anaphora resolution. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  907--917, Atlanta, Georgia, June 2013. Association for Computational Linguistics. https://aclanthology.org/N13-1111.

  261. China Household Management Research Center, Ministry of Public Security. National name report 2018. 2019. http://news.cpd.com.cn/n18151/201901/t20190130_830962.html (Accessed 3 March 2021).

  262. China Household Management Research Center, Ministry of Public Security. National name report 2019. 2020. https://www.mps.gov.cn/n2254314/n6409334/c6874817/content.html (Accessed 3 March 2021).

  263. China Household Management Research Center, Ministry of Public Security. National name report 2020. 2021. https://www.mps.gov.cn/n2253534/n2253535/c7725981/content.html (Accessed 3 March 2021).

  264. Introduction to Paremiology: A Comprehensive Guide to Proverb Studies. De Gruyter Open, Warsaw, 2015. https://www.degruyter.com/document/doi/10.2478/9783110410167/html.

  265. GamePad: A Learning Environment for Theorem Proving
  266. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
  267. Lexical semantic relatedness with random graph walks. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp.  581--589, Prague, Czech Republic, June 2007. Association for Computational Linguistics. https://aclanthology.org/D07-1061.

  268. David Hume. A Treatise of Human Nature. John Noon, London, 1739–1740.
  269. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757--795, 2020. doi: 10.1613/jair.1.11674. https://doi.org/10.1613/jair.1.11674.

  270. Can self-awareness be taught? Monkeys pass the mirror test -- again. Proceedings of the National Academy of Sciences, 114(13):3281--3283, 2017. doi: 10.1073/pnas.1701676114. https://www.pnas.org/doi/abs/10.1073/pnas.1701676114.

  271. OpenRefine. https://openrefine.org/

  272. Instagram Engineering. Emojineering part 1: Machine learning for emoji trends. Medium, 1 May 2015. https://instagram-engineering.com/emojineering-part-1-machine-learning-for-emoji-trendsmachine-learning-for-emoji-trends-7f5f9cb979ad.

  273. Automatic detection of generated text is easiest when humans are fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  1808--1822, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.164. https://aclanthology.org/2020.acl-main.164.

  274. AI safety via debate
  275. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
  276. Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages
  277. Roget's Thesaurus as a Lexical Resource for Natural Language Processing
  278. Learning to execute instructions in a Minecraft dialogue. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  2589--2602, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.232. https://aclanthology.org/2020.acl-main.232.

  279. Are natural language inference models IMPPRESsive? Learning IMPlicature and PRESupposition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  8690--8705, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.768. https://aclanthology.org/2020.acl-main.768.

  280. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the 10th Research on Computational Linguistics International Conference, pp.  19--33, Taipei, Taiwan, August 1997. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP). https://aclanthology.org/O97-1002.

  281. Do you know that Florence is packed with visitors? Evaluating state-of-the-art models of speaker commitment. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  4208--4213, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1412. https://aclanthology.org/P19-1412.

  282. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423--438, 2020. doi: 10.1162/tacl_a_00324. https://doi.org/10.1162/tacl_a_00324.

  283. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
  284. Robust Encodings: A Framework for Combating Adversarial Typos
  285. Harnessing context incongruity for sarcasm detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp.  757--762, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-2124. https://aclanthology.org/P15-2124.

  286. Automatic sarcasm detection: A survey. ACM Comput. Surv., 50(5), Sep. 2017. doi: 10.1145/3124420. https://doi.org/10.1145/3124420.

  287. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
  288. Inferring algorithmic patterns with stack-augmented recurrent nets. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, volume 1, pp.  190–198, Cambridge, MA, USA, 2015. MIT Press. doi: 10.5555/2969239.2969261. https://dl.acm.org/doi/10.5555/2969239.2969261.
  289. Template guided text generation for task-oriented dialogue. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  6505--6520, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.527. https://aclanthology.org/2020.emnlp-main.527.

  290. Rogue-Gym: A new challenge for generalization in reinforcement learning. In 2019 IEEE Conference on Games (CoG), pp.  1--8, Piscataway, NJ, 2019. Institute of Electrical and Electronics Engineers. doi: 10.1109/CIG.2019.8848075. https://ieeexplore.ieee.org/document/8848075.
  291. Wrangler: Interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’11, pp.  3363–3372, New York, NY, USA, 2011. Association for Computing Machinery. doi: 10.1145/1978942.1979444. https://doi.org/10.1145/1978942.1979444.
  292. Immanuel Kant. Critique of Pure Reason. The Cambridge Edition of the Works of Immanuel Kant, edited by Paul Guyer and Allen W. Wood. Cambridge University Press, 1781/1787. doi: 10.1017/CBO9780511804649. https://doi.org/10.1017/CBO9780511804649.

  293. Immanuel Kant. Prolegomena to Any Future Metaphysics. Cambridge Texts in the History of Philosophy, edited by Gary Hatfield. Cambridge University Press, 2nd edition, 1783. doi: 10.1017/CBO9780511808517. https://doi.org/10.1017/CBO9780511808517.

  294. Scaling Laws for Neural Language Models
  295. Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy’s blog, 21 May 2015. http://karpathy.github.io/2015/05/21/rnn-effectiveness/.

  296. Lauri Karttunen. Simple and phrasal implicatives. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics -- Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pp.  124--131, Montréal, Canada, 7-8 June 2012. Association for Computational Linguistics. https://aclanthology.org/S12-1020.

  297. Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly
  298. Are Pretrained Language Models Symbolic Reasoners Over Knowledge?
  299. Learning the Difference that Makes a Difference with Counterfactually-Augmented Data
  300. Alignment of Language Agents
  301. Os Keyes. The misgendering machines: Trans/HCI implications of automatic gender recognition. In Proceedings of the ACM on human-computer interaction, volume 2, New York, NY, USA, Nov. 2018. Association for Computing Machinery. doi: 10.1145/3274357. https://doi.org/10.1145/3274357.

  302. How do humans teach: On curriculum learning and teaching dimension. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011. https://proceedings.neurips.cc/paper/2011/file/f9028faec74be6ec9b852b0a542e2f39-Paper.pdf.

  303. ParsiNLU: A Suite of Language Understanding Challenges for Persian
  304. UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.  1896--1907, Online, November 2020b. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.171. https://aclanthology.org/2020.findings-emnlp.171.

  305. A Large Self-Annotated Corpus for Sarcasm
  306. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  4110--4124, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.324. https://aclanthology.org/2021.naacl-main.324.

  307. Cooperation and codenames: Understanding natural language processing via codenames. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 15, pp. 160--166, Menlo Park, CA, Oct. 2019. Association for the Advancement of Artificial Intelligence. https://ojs.aaai.org/index.php/AIIDE/article/view/5239.

  308. Character-Aware Neural Language Models
  309. Evaluating approaches to personalizing language models. In Proceedings of the 12th Language Resources and Evaluation Conference, pp.  2461--2469, Marseille, France, May 2020. European Language Resources Association. https://aclanthology.org/2020.lrec-1.299.

  310. Recurrent Neural Networks in Linguistic Theory: Revisiting Pinker and Prince (1988) and the Past Tense Debate. Transactions of the Association for Computational Linguistics, 6:651--665, 12 2018. ISSN 2307-387X. doi: 10.1162/tacl_a_00247. https://doi.org/10.1162/tacl_a_00247.

  311. Emanuel Kitzelmann. Inductive programming: A survey of program synthesis techniques. In Ute Schmid, Emanuel Kitzelmann, and Rinus Plasmeijer (eds.), Approaches and Applications of Inductive Programming, pp.  50--73, Berlin, 2010. Springer. doi: 10.1007/978-3-642-11931-6. https://doi.org/10.1007/978-3-642-11931-6.

  312. Joshua Knobe. Intentional action and side effects in ordinary language. Analysis, 63, 07 2003. doi: 10.1111/1467-8284.00419. https://www.researchgate.net/publication/28763794_Intentional_Action_and_Side_Effects_in_Ordinary_Language.
  313. A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  4837--4842, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1478. https://aclanthology.org/P19-1478.

  314. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317--328, 2018. doi: 10.1162/tacl_a_00023. https://aclanthology.org/Q18-1023.

  315. MultiEmo: Multilingual, multilevel, multidomain sentiment analysis corpus of consumer reviews. In Maciej Paszynski, Dieter Kranzlmüller, Valeria V. Krzhizhanovskaya, Jack J. Dongarra, and Peter M. A. Sloot (eds.), Computational Science -- ICCS 2021, pp.  297--312, Cham, 2021. Springer. doi: 10.1007/978-3-030-77964-1_24. https://doi.org/10.1007/978-3-030-77964-1_24.

  316. Counterlogicals as counterconventionals. Journal of Philosophical Logic, 50:673--704, 2021. doi: 10.1007/s10992-020-09581-6. https://doi.org/10.1007/s10992-020-09581-6.

  317. Against conventional wisdom. Philosophers’ Imprint, 20(22):1--27, 2020. http://hdl.handle.net/2027/spo.3521354.0020.022.
  318. Authorship verification as a one-class classification problem. In Proceedings of the Twenty-First International Conference on Machine Learning, pp.  62, New York, NY, USA, 2004. Association for Computing Machinery. doi: 10.1145/1015330.1015448. https://doi.org/10.1145/1015330.1015448.
  319. Jarmo Korhonen. Sprichwörter und zweisprachige lexikographie: Deutsch-schwedische und deutsch-finnische wörterbücher im vergleich. In C. Földes (ed.), Phraseologie disziplinär und interdisziplinär, pp.  537--549. Gunter Narr Verlag
  320. A burstiness-aware approach for document dating. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14, pp. 1003–1006, New York, NY, USA, 2014. Association for Computing Machinery. doi: 10.1145/2600428.2609495. https://doi.org/10.1145/2600428.2609495.
  321. Self-Aware Computing Systems. Springer, Cham, 2017. https://link.springer.com/book/10.1007/978-3-319-47474-8.

  322. The aha! moment: The cognitive neuroscience of insight. Current Directions in Psychological Science, 18(4):210--216, 2009. doi: 10.1111/j.1467-8721.2009.01638.x. https://doi.org/10.1111/j.1467-8721.2009.01638.x.
  323. WikiHow: A Large Scale Text Summarization Dataset
  324. All the news that’s fit to fabricate: AI-generated text as a tool of media misinformation. SSRN, 24 Sep 2020. doi: 10.2139/ssrn.3525002. http://dx.doi.org/10.2139/ssrn.3525002.

  325. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50--72, 01 2022. doi: 10.1162/tacl_a_00447. https://doi.org/10.1162/tacl_a_00447.

  326. Hurdles to Progress in Long-form Question Answering
  327. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  9332--9346, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.750. https://aclanthology.org/2020.emnlp-main.750.

  328. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.  66--71, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-2012. https://aclanthology.org/D18-2012.

  329. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453--466, 08 2019. doi: 10.1162/tacl_a_00276. https://doi.org/10.1162/tacl_a_00276.

  330. Human vs. supervised machine learning: Who learns patterns faster?
  331. The NetHack Learning Environment
  332. Kevin Lacker. Giving GPT-3 a Turing test. Kevin Lacker’s blog, July 2020. https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html.

  333. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp.  785--794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1082. https://aclanthology.org/D17-1082.

  334. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks
  335. Word meaning in minds and machines
  336. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017. doi: 10.1017/S0140525X16001837. https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/building-machines-that-learn-and-think-like-people/A9535B1D745A0377E16C590E14B94993.

  337. Metaphors We Live By. University of Chicago Press, Chicago
  338. The emergence of number and syntax units in LSTM language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  11--20, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1002. https://aclanthology.org/N19-1002.

  339. Can RNNs learn Recursive Nested Subject-Verb Agreements?
  340. Mechanisms for handling nested dependencies in neural-network language models and humans. Cognition, 213:104699, 2021b. doi: 10.1016/j.cognition.2021.104699. https://www.sciencedirect.com/science/article/pii/S0010027721001189. Special Issue in Honour of Jacques Mehler, Cognition’s founding editor.
  341. Deep Learning for Symbolic Mathematics
  342. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
  343. Revisiting the evaluation of theory of mind through question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  5872--5877, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1598. https://aclanthology.org/D19-1598.

  344. Language models as fact checkers? In Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER), pp.  36--41, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.fever-1.5. https://aclanthology.org/2020.fever-1.5.

  345. Towards few-shot fact-checking via perplexity. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  1971--1981, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.158. https://aclanthology.org/2021.naacl-main.158.

  346. Scalable agent alignment via reward modeling: a research direction
  347. Solving logic puzzles: From robust processing to precise semantics. In Proceedings of the 2nd Workshop on Text Meaning and Interpretation, pp.  9--16, Barcelona, Spain, July 2004. Association for Computational Linguistics. https://aclanthology.org/W04-0902.

  348. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp.  333--342, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/K17-1034. https://aclanthology.org/K17-1034.

  349. TR9856: A multi-word term relatedness benchmark. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp.  419--424, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-2069. https://aclanthology.org/P15-2069.

  350. Investigating Memorization of Conspiracy Theories in Text Generation
  351. MLQA: Evaluating cross-lingual extractive question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  7315--7330, Online, July 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.653. https://aclanthology.org/2020.acl-main.653.

  352. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  353. Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets
  354. UNQOVERing stereotyping biases via underspecified questions. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.  3475--3489, Online, November 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.311. https://aclanthology.org/2020.findings-emnlp.311.

  355. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  986--995, Taipei, Taiwan, November 2017. Asian Federation of Natural Language Processing. https://aclanthology.org/I17-1099.

  356. DELPHI: Accurate deep ensemble model for protein interaction sites prediction. Bioinformatics, 37(7):896--904, 08 2020b. doi: 10.1093/bioinformatics/btaa750. https://doi.org/10.1093/bioinformatics/btaa750.

  357. Properties of the LWR model with time delay
  358. A meaning-based statistical English math word problem solver. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  652--662, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1060. https://aclanthology.org/N18-1060.

  359. Towards debiasing sentence representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  5502--5515, Online, July 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.488. https://aclanthology.org/2020.acl-main.488.

  360. Learning to contrast the counterfactual samples for robust visual question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  3285--3292, Online, November 2020b. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.265. https://aclanthology.org/2020.emnlp-main.265.

  361. Birds have four legs?! NumerSense: probing numerical commonsense knowledge of pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  6862--6868, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.557. https://aclanthology.org/2020.emnlp-main.557.

  362. RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge
  363. Reasoning over paragraph effects in situations. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp.  58--62, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-5808. https://aclanthology.org/D19-5808.

  364. TruthfulQA: Measuring How Models Mimic Human Falsehoods
  365. Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems
  366. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521--535, 12 2016. doi: 10.1162/tacl_a_00115. https://doi.org/10.1162/tacl_a_00115.

  367. What Makes Good In-Context Examples for GPT-3?
  368. LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning
  369. Do Question Answering Modeling Improvements Hold Across Benchmarks?
  370. A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation
  371. Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex Healthcare Question Answering
  372. RoBERTa: A Robustly Optimized BERT Pretraining Approach
  373. Multilingual Denoising Pre-training for Neural Machine Translation
  374. SemEval-2015 task 5: QA TempEval - evaluating temporal information understanding with question answering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp.  792--800, Denver, Colorado, June 2015. Association for Computational Linguistics. doi: 10.18653/v1/S15-2134. https://aclanthology.org/S15-2134.

  375. Content preserving text generation with attribute controls
  376. Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes
  377. UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark
  378. Gender bias in neural natural language processing. In Vivek Nigam, Tajana Ban Kirigin, Carolyn Talcott, Joshua Guttman, Stepan Kuznetsov, Boon Thau Loo, and Mitsuhiro Okada (eds.), Logic, Language, and Security. Springer, Cham, 2020. https://www.springerprofessional.de/en/gender-bias-in-neural-natural-language-processing/18531692.

  379. What’s in the box? An analysis of undesirable content in the Common Crawl corpus. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp.  182--189, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-short.24. https://aclanthology.org/2021.acl-short.24.

  380. A survey of reinforcement learning informed by natural language. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp.  6309–6317, 2019. https://www.ijcai.org/proceedings/2019/0880.pdf.

  381. EventPlus: A Temporal Event Understanding Pipeline
  382. Language Models as Few-Shot Learner for Task-Oriented Dialogue Systems
  383. Few-Shot Bot: Prompt-Based Learning for Dialogue Systems
  384. Low-resource Languages: A Review of Past Work and Future Challenges
  385. Automatic prediction of discourse connectives. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association. https://aclanthology.org/L18-1260.

  386. Encode, Tag, Realize: High-Precision Text Editing
  387. A BERT-based approach for automatic humor detection and scoring. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), pp.  197--202, 2019. http://ceur-ws.org/Vol-2421/HAHA_paper_8.pdf.

  388. The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence
  389. GPT-3, bloviator: OpenAI’s language generator has no idea what it’s talking about. MIT Technology Review, 22 August 2020. https://www.technologyreview.com/2020/08/22/1007539/gpt3-openai-language-generator-artificial-intelligence-ai-opinion/.

  390. The Penn Treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994, 1994. https://aclanthology.org/H94-1020.

  391. Collective classification for fine-grained information status. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  795--804, Jeju Island, Korea, July 2012. Association for Computational Linguistics. https://aclanthology.org/P12-1084.

  392. Inclusive data visualization for people with disabilities: A call to action. Interactions, 28(3):47–51, Apr. 2021. doi: 10.1145/3457875. https://doi.org/10.1145/3457875.

  393. Research community dynamics behind popular AI benchmarks. Nature Machine Intelligence, 3(7):581--589, 2021. doi: 10.1038/s42256-021-00339-6. https://doi.org/10.1038/s42256-021-00339-6.

  394. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.  1192--1202, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1151. https://aclanthology.org/D18-1151.

  395. Recipe1M+: A dataset for learning cross-modal embeddings for cooking recipes and food images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):187--203, 2021. doi: 10.1109/TPAMI.2019.2927476. https://ieeexplore.ieee.org/abstract/document/8758197.
  396. Annotating Character Relationships in Literary Texts
  397. Suicide risk assessment with multi-level dual-context language and BERT. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pp.  39--44, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-3005. https://aclanthology.org/W19-3005.

  398. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  622--628, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1063. https://aclanthology.org/N19-1063.

  399. Andrew Mayne. OpenAI API alchemy: Emoji storytelling. Andrew Mayne blog, 24 June 2020. https://andrewmayneblog.wordpress.com/2020/06/24/open-ai-alchemy-emoji-storytelling/.

  400. On Faithfulness and Factuality in Abstractive Summarization
  401. Context based spelling correction. Information Processing & Management, 27(5):517--522, 1991. doi: 10.1016/0306-4573(91)90066-U. https://www.sciencedirect.com/science/article/pii/030645739190066U.

  402. The application of convolution neural network based cell segmentation during cryopreservation. Cryobiology, 85:95--104, 2018. doi: 10.1016/j.cryobiol.2018.09.003. https://www.sciencedirect.com/science/article/pii/S0011224018301937.

  403. Image-based Recommendations on Styles and Substitutes
  404. The Natural Language Decathlon: Multitask Learning as Question Answering
  405. Extending Machine Language Models toward Human-Level Language Understanding
  406. Revisiting the poverty of the stimulus: hierarchical generalization without a hierarchical bias in recurrent neural networks
  407. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
  408. Does Syntax Need to Grow on Trees? Sources of Hierarchical Inductive Bias in Sequence-to-Sequence Networks. Transactions of the Association for Computational Linguistics, 8:125--140, 01 2020. ISSN 2307-387X. doi: 10.1162/tacl_a_00304. https://doi.org/10.1162/tacl_a_00304.

  409. Acquisition of Chess Knowledge in AlphaZero
  410. Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
  411. USR: An unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  681--707, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.64. https://aclanthology.org/2020.acl-main.64.

  412. Christine Palm Meister. Phraseologie des schwedischen. In H. Burger et al. (eds.), Phraseologie/Phraseology, volume 2, pp.  673--681. De Gruyter Mouton, 2007. doi: 10.1515/9783110190762.673. https://doi.org/10.1515/9783110190762.673.

  413. Interactive optimal teaching with unknown learners. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp.  2567--2573, 2018. doi: 10.24963/ijcai.2018/356. https://doi.org/10.24963/ijcai.2018/356.

  414. A framework for the computational linguistic analysis of dehumanization. Frontiers in Artificial Intelligence, 3, 2020. doi: 10.3389/frai.2020.00055. https://www.frontiersin.org/article/10.3389/frai.2020.00055.
  415. Temporal information extraction for question answering using syntactic dependencies in an LSTM-based architecture. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp.  887--896, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1092. https://aclanthology.org/D17-1092.

  416. Pointer Sentinel Mixture Models
  417. On the Linguistic Capacity of Real-Time Counter Automata
  418. Wolfgang Mieder. "Andere zeiten, andere lehren": Sprach-und kulturgeschichtliche betrachtungen zum sprichwort. In K. Steyer (ed.), Wortverbindungen - mehr oder weniger fest, pp.  415--438. De Gruyter, Berlin, 2019. doi: 10.1515/9783110622768-020. https://doi.org/10.1515/9783110622768-020.

  419. Making computers laugh: Investigations in automatic humor recognition. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 531--538, Vancouver, British Columbia, Canada, October 2005. Association for Computational Linguistics. https://aclanthology.org/H05-1067.

  420. The effect of natural distribution shift on question answering models. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  6905--6916. PMLR, 13--18 July 2020. https://proceedings.mlr.press/v119/miller20a.html.

  421. Automatic disambiguation of English puns. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  719--729, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-1070. https://aclanthology.org/P15-1070.

  422. SemEval-2017 task 7: Detection and interpretation of English puns. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp.  58--68, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/S17-2005. https://aclanthology.org/S17-2005.

  423. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceeding of AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pp.  25–30, Menlo Park, 2008. Association for the Advancement of Artificial Intelligence. https://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf.

  424. Republic of China Ministry of the Interior. National name statistical analysis, 2018. https://www.ris.gov.tw/documents/data/5/2/107namestat.pdf (Accessed 3 March 2021).

  425. Cross-Task Generalization via Natural Language Crowdsourcing Instructions
  426. Natural reference to objects in a visual domain. In Proceedings of the 6th International Natural Language Generation Conference. Association for Computational Linguistics, July 2010. https://aclanthology.org/W10-4210.

  427. Generating expressions that refer to visible objects. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  1174--1184, Atlanta, Georgia, June 2013. Association for Computational Linguistics. https://aclanthology.org/N13-1137.

  428. Playing Atari with Deep Reinforcement Learning
  429. CLaC at CLPsych 2019: Fusion of neural features and predicted class probabilities for suicide risk assessment based on online posts. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pp.  34--38, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-3004. https://aclanthology.org/W19-3004.

  430. Introducing the LCC metaphor datasets. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp.  4221--4227, Portorož, Slovenia, May 2016. European Language Resources Association. https://aclanthology.org/L16-1668.

  431. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17(1):21--48, 1991. https://aclanthology.org/J91-1002.

  432. A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories
  433. Structure here, bias there: Hierarchical generalization by jointly learning syntactic transformations. In Proceedings of the Society for Computation in Linguistics 2021, pp.  125--135, Online, February 2021. Association for Computational Linguistics. https://aclanthology.org/2021.scil-1.12.

  434. Applying the naïve Bayes classifier with kernel density estimation to the prediction of protein–protein interaction sites. Bioinformatics, 26(15):1841--1848, 06 2010. doi: 10.1093/bioinformatics/btq302. https://doi.org/10.1093/bioinformatics/btq302.

  435. Gregory L. Murphy. Comprehending complex concepts. Cognitive Science, 12(4):529--562, 1988. doi: 10.1207/s15516709cog1204_2. https://onlinelibrary.wiley.com/doi/abs/10.1207/s15516709cog1204_2.

  436. StereoSet: Measuring stereotypical bias in pretrained language models
  437. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.  2901--2907, Menlo Park, CA, 2015. Association for the Advancement of Artificial Intelligence. https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9667.

  438. Stress Test Evaluation for Natural Language Inference
  439. More Data Can Hurt for Linear Regression: Sample-wise Double Descent
  440. The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers
  441. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003
  442. Ramanujapuram Narasimhachar. History of Kannada Literature: Readership Lectures. Asian Educational Services, New Delhi
  443. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.  1797--1807, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1206. https://aclanthology.org/D18-1206.

  444. The Montreal Cognitive Assessment, MoCA: A brief screening tool for mild cognitive impairment. Journal of the American Geriatrics Society, 53(4):695--699, 2005. doi: 10.1111/j.1532-5415.2005.53221.x. https://agsjournals.onlinelibrary.wiley.com/doi/abs/10.1111/j.1532-5415.2005.53221.x.
  445. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.  2144--2160, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.195. https://aclanthology.org/2020.findings-emnlp.195.

  446. Evaluating theory of mind in question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.  2392--2400, Brussels, Belgium, October--November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1261. https://aclanthology.org/D18-1261.

  447. Posterior calibration and exploratory analysis for natural language processing models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.  1587--1598, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1182. https://aclanthology.org/D15-1182.

  448. Comparisons of sequence labeling algorithms and extensions. In Proceedings of the 24th International Conference on Machine Learning, pp.  681–688, New York, NY, USA, 2007. Association for Computing Machinery. doi: 10.1145/1273496.1273582. https://doi.org/10.1145/1273496.1273582.
  449. DisSent: Learning sentence representations from explicit discourse relations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  4497--4510, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1442. https://aclanthology.org/P19-1442.

  450. Proverb comprehension as a function of reading proficiency in preadolescents. Language Speech and Hearing Services in Schools, 32:90, 04 2001. doi: 10.1044/0161-1461(2001/009). https://www.researchgate.net/publication/285246680_Proverb_Comprehension_as_a_Function_of_Reading_Proficiency_in_Preadolescents.

  451. Generating natural anagrams: Towards language generation under hard combinatorial constraints. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  6408--6412, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1674. https://aclanthology.org/D19-1674.

  452. The Chess Transformer: Mastering Play using Generative Language Models
  453. "The things that we have to do": Ethics and instrumentality in humanitarian communication. Global Media and Communication, 9(1):53--70, 2013. doi: 10.1177/1742766512463040. https://doi.org/10.1177/1742766512463040.

  454. Show Your Work: Scratchpads for Intermediate Computation with Language Models
  455. Effects of directionality in deductive reasoning, I. The comprehension of single relational premises. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26(6):1702--1712, 2000. doi: 10.1037/0278-7393.26.6.1702. https://doi.org/10.1037/0278-7393.26.6.1702.

  456. Effects of directionality in deductive reasoning, II. Premise integration and conclusion evaluation. The Quarterly Journal of Experimental Psychology Section A, 58(7):1225--1247, 2005. doi: 10.1080/02724980443000566. https://doi.org/10.1080/02724980443000566.

  457. The Working Committee on the Revision of the National Standard Occupational Classification. Standard Occupational Classification of the People’s Republic of China. China Labour and Social Security Publishing House, 2015. http://www.jiangmen.gov.cn/bmpd/jmsrlzyhshbzj/zwfw/bmjd/jdks/content/post_2334804.html (Accessed 4 June 2022).

  458. iSarcasm: A dataset of intended sarcasm. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  1279--1289, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.118. https://aclanthology.org/2020.acl-main.118.

  459. Type-and-example-directed program synthesis. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’15, pp.  619–630, New York, NY, USA, 2015. Association for Computing Machinery. doi: 10.1145/2737924.2738007. https://doi.org/10.1145/2737924.2738007.
  460. Revisions that improve cohesion in multi-document summaries: A preliminary study. In Proceedings of the ACL-02 Workshop on Automatic Summarization, pp.  27--44, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1118162.1118166. https://aclanthology.org/W02-0404.
  461. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. https://proceedings.neurips.cc/paper/2019/hash/8558cb408c1d76621371888657d2eb1d-Abstract.html.

  462. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  4812--4829, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.383. https://aclanthology.org/2021.naacl-main.383.

  463. Sarcasm Detection using Context Separators in Online Discourse
  464. A Review of Speaker Diarization: Recent Advances with Deep Learning
  465. BBQ: A Hand-Built Bias Benchmark for Question Answering
  466. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  2080--2094, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.168. https://aclanthology.org/2021.naacl-main.168.

  467. Carbon Emissions and Large Neural Network Training
  468. Anthony M. Paul. Figurative language. Philosophy & Rhetoric, 3(4):225--248, 1970. http://www.jstor.org/stable/40237206.

  469. Learning Algorithms via Neural Logic Networks
  470. Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco
  471. Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge
  472. Deep and Dense Sarcasm Detection
  473. True Few-Shot Learning with Language Models
  474. Don’t patronize me! An annotated dataset with patronizing and condescending language towards vulnerable communities. In Proceedings of the 28th International Conference on Computational Linguistics, pp.  5891--5902, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.518. https://aclanthology.org/2020.coling-main.518.

  475. Language Models as Knowledge Bases?
  476. Data cleaning: A case study with OpenRefine and Trifacta Wrangler. In Martin Shepperd, Fernando Brito e Abreu, Alberto Rodrigues da Silva, and Ricardo Pérez-Castillo (eds.), Quality of Information and Communications Technology, pp.  32--40, Cham, 2020. Springer. doi: 10.1007/978-3-030-58793-2_3. https://doi.org/10.1007/978-3-030-58793-2_3.

  477. Out of Order: How Important Is The Sequential Order of Words in a Sentence in Natural Language Understanding Tasks?
  478. Steve Piantadosi. Fleet system, 2020. https://github.com/piantado/Fleet.

  479. Tony A. Plate. Distributed representations and nested compositional structure. PhD thesis, University of Toronto, Toronto
  480. Tony A. Plate. Holographic Reduced Representations: Distributed Representation for Cognitive Structures. CSLI, Stanford, CA
  481. Robert Plutchik. A general psychoevolutionary theory of emotion. In Robert Plutchik and Henry Kellerman (eds.), Theories of Emotion, pp.  3--33. Academic Press, 1980. doi: 10.1016/B978-0-12-558701-3.50007-7. https://www.sciencedirect.com/science/article/pii/B9780125587013500077.

  482. Program synthesis from polymorphic refinement types. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’16, pp.  522–538, New York, NY, USA, 2016. Association for Computing Machinery. doi: 10.1145/2908080.2908093. https://doi.org/10.1145/2908080.2908093.
  483. Generative Language Modeling for Automated Theorem Proving
  484. Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research, 30:181--212, 2007. https://jair.org/index.php/jair/article/view/10513.

  485. SemEval 2015, task 7: Diachronic text evaluation. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp.  870--878, Denver, Colorado, June 2015. Association for Computational Linguistics. doi: 10.18653/v1/S15-2147. https://aclanthology.org/S15-2147.

  486. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  527--536, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1050. https://aclanthology.org/P19-1050.

  487. A transformer-based approach to irony and sarcasm detection. Neural Computing and Applications, 32:17309--17320, 2020. doi: 10.1007/s00521-020-05102-3. https://link.springer.com/article/10.1007/s00521-020-05102-3.

  488. Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases
  489. Joking riddles: A developmental index of children’s humor. Developmental Psychology, 11:210--216, 1975. doi: 10.1037/h0076455. https://doi.org/10.1037/h0076455.

  490. An Analysis of the Adaptation Speed of Causal Models
  491. The specification language TimeML, 2004. http://xml.coverpages.org/TimeML-SpecLang200401.pdf.

  492. Qimingtong. What are the most popular names Chinese parents give their babies? A perspective from big data. 2016. https://www.qimingtong.com/article/0 (Accessed 3 March 2021).

  493. TIMEDIAL: Temporal commonsense reasoning in dialog. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  7066--7076, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.549. https://aclanthology.org/2021.acl-long.549.

  494. Willard V.O. Quine. Main trends in recent philosophy: Two dogmas of empiricism. The Philosophical Review, 60(1):20--43, 1951. http://www.jstor.org/stable/2181906.

  495. The North American computational linguistics olympiad (NACLO). In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pp.  87--96, Columbus, Ohio, June 2008. Association for Computational Linguistics. https://aclanthology.org/W08-0211.

  496. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.

  497. A word at a time: Computing word relatedness using temporal semantic analysis. In WWW ’11: Proceedings of the 20th International Conference on World Wide Web, pp.  337–346, New York, NY, USA, 2011. Association for Computing Machinery. doi: 10.1145/1963405.1963455. https://doi.org/10.1145/1963405.1963455.
  498. Scaling Language Models: Methods, Analysis & Insights from Training Gopher
  499. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp.  99--110, Valencia, Spain, April 2017. Association for Computational Linguistics. https://aclanthology.org/E17-1010.

  500. Resolving complex cases of definite pronouns: The Winograd schema challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp.  777--789, Jeju Island, Korea, July 2012. Association for Computational Linguistics. https://aclanthology.org/D12-1071.

  501. A survey on computational metaphor processing. ACM Comput. Surv., 53(2), mar 2020. doi: 10.1145/3373265. https://doi.org/10.1145/3373265.

  502. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp.  2383--2392, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. https://aclanthology.org/D16-1264.

  503. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.  784--789, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2124. https://aclanthology.org/P18-2124.

  504. Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digital Medicine, 4(1):86, 2021. doi: 10.1038/s41746-021-00455-y. https://doi.org/10.1038/s41746-021-00455-y.

  505. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  8689--8696, Menlo Park, CA, Apr. 2020. Association for the Advancement of Artificial Intelligence. doi: 10.1609/aaai.v34i05.6394. https://ojs.aaai.org/index.php/AAAI/article/view/6394.

  506. Ian Ravenscroft. Folk psychology as a theory. In Edward N. Zalta (ed.), The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Summer 2019 edition
  507. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249--266, 2019. doi: 10.1162/tacl_a_00266. https://aclanthology.org/Q19-1016.

  508. Neural Programmer-Interpreters
  509. Semi-supervised Multitask Learning for Sequence Labeling
  510. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
  511. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp.  109--117, Los Angeles, California, June 2010. Association for Computational Linguistics. https://aclanthology.org/N10-1013.

  512. He Ren and Quan Yang. Neural joke generation, 2017. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1174/reports/2760332.pdf.

  513. Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In IJCAI’95: Proceedings of the 14th International Joint Conference on Artificial Intelligence, Volume 1, pp.  448–453, San Francisco, 1995. Morgan Kaufmann. doi: 10.5555/1625855.1625914. https://dl.acm.org/doi/10.5555/1625855.1625914.
  514. Philip Resnik. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11:95--130, 1999. doi: 10.1613/jair.514. https://doi.org/10.1613/jair.514.

  515. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. AAAI Spring Symposium, 2011. http://commonsensereasoning.org/2011/papers/Roemmele.pdf.

  516. A Constructive Prediction of the Generalization Error Across Scales
  517. How well do NLI models capture verb veridicality? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  2230--2240, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1228. https://aclanthology.org/D19-1228.

  518. Game-theoretic applications of a relational risk model
  519. Decrypting Cryptic Crosswords: Semantically Complex Wordplay Puzzles as a Target for NLP
  520. XTREME-R: Towards more challenging and nuanced multilingual evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  10215--10245, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.802. https://aclanthology.org/2021.emnlp-main.802.

  521. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp.  8--14, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2002. https://aclanthology.org/N18-2002.

  522. Comparing conventions. In Joseph Rhyne et al. (ed.), Proceedings of Semantics and Linguistic Theory, pp.  294--313, Washington, D.C., 2020. Linguistic Society of America. doi: 10.3765/salt.v30i0.4820. https://doi.org/10.3765/salt.v30i0.4820.

  523. Number-space mapping in the newborn chick resembles humans’ mental number line. Science, 347(6221):534--536, 2015. doi: 10.1126/science.aaa1379. https://www.science.org/doi/abs/10.1126/science.aaa1379.

  524. Joshua S. Rule. The child as hacker: Building more human-like models of learning. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 2020. https://hdl.handle.net/1721.1/129232.

  525. The child as hacker. Trends in Cognitive Sciences, 24(11):900--915, 2020. doi: 10.1016/j.tics.2020.07.005. https://www.sciencedirect.com/science/article/pii/S1364661320301741.

  526. Parallel Distributed Processing. Volume 1: Foundations. MIT Press, Cambridge, MA
  527. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.  379--389, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1044. https://aclanthology.org/D15-1044.

  528. Artificial Intelligence: A Modern Approach. Pearson, Hoboken, 2002. http://aima.cs.berkeley.edu/.

  529. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
  530. PuzzLing Machines: A challenge on learning from small data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  1241--1254, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.115. https://aclanthology.org/2020.acl-main.115.

  531. Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network
  532. WINOGRANDE: An adversarial Winograd schema challenge at scale. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI-20, pp.  8732--8734, Menlo Park, CA, 2020. Association for the Advancement of Artificial Intelligence. https://ojs.aaai.org/index.php/AAAI/article/view/6399/6255.

  533. Automatic detection of satire in Twitter: A psycholinguistic-based approach. Knowledge-Based Systems, 128:20--33, 2017. doi: 10.1016/j.knosys.2017.04.009. https://www.sciencedirect.com/science/article/pii/S0950705117301855.

  534. Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  2699--2712, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.240. https://aclanthology.org/2020.acl-main.240.

  535. Temporal reasoning in natural language processing: A survey. International Journal of Computer Applications, 1(4):53--57, 2010. https://www.ijcaonline.org/journal/number4/pxc387209.pdf.

  536. Evan Sandhaus. The New York Times annotated corpus LDC2008T19. Linguistic Data Consortium, 2008. https://catalog.ldc.upenn.edu/LDC2008T19.

  537. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022. https://openreview.net/forum?id=9Vrb9D0WI4.

  538. A simple neural network module for relational reasoning
  539. Symbolic Behaviour in Artificial Intelligence
  540. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  4463--4473, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1454. https://aclanthology.org/D19-1454.

  541. Social bias frames: Reasoning about social and power implications of language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  5477--5490, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.486. https://aclanthology.org/2020.acl-main.486.

  542. Analysing Mathematical Reasoning Abilities of Neural Models
  543. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
  544. Language Tasks and Language Games: On Methodology in Current Natural Language Processing Research
  545. Get your vitamin C! Robust fact verification with contrastive evidence. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  624--643, Online, June 2021a. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.52. https://aclanthology.org/2021.naacl-main.52.

  546. Programming puzzles. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021b. https://openreview.net/forum?id=fe_hCc4RBrg.

  547. Towards Causal Representation Learning
  548. Megan Scudellari. Cryopreservation aims to engineer novel ways to freeze, store, and thaw organs. Proceedings of the National Academy of Sciences, 114(50):13060--13062, 2017. doi: 10.1073/pnas.1717588114. https://www.pnas.org/doi/abs/10.1073/pnas.1717588114.

  549. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1073--1083, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1099. https://www.aclweb.org/anthology/P17-1099.

  550. BLEURT: Learning Robust Metrics for Text Generation
  551. Learning a SAT Solver from Single-Bit Supervision
  552. Does He Wink or Does He Nod? A Challenging Benchmark for Evaluating Word Understanding of Language Models
  553. Evaluating the Ability of LSTMs to Learn Context-Free Grammars
  554. How Grammatical is Character-level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs
  555. Revisiting low-resource neural machine translation: A case study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  211--221, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1021. https://aclanthology.org/P19-1021.

  556. Neural Machine Translation of Rare Words with Subword Units
  557. Diagram understanding in geometry questions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, Menlo Park, CA, Jun. 2014. Association for the Advancement of Artificial Intelligence. https://ojs.aaai.org/index.php/AAAI/article/view/9146.

  558. Counterfactual learning in networks: An empirical study of model dependence, 2021. https://www.cs.uic.edu/~elena/pubs/shahid-why19.pdf.

  559. Janelle Shane. All your questions answered. AI Weirdness, 20 June 2020. https://www.aiweirdness.com/all-your-questions-answered-20-06-17/.

  560. Inferring LISP programs from examples. In IJCAI’75: Proceedings of the 4th International Joint Conference on Artificial Intelligence, volume 1, pp.  260--267. Artificial Intelligence Laboratory, Cambridge, MA, 1975. doi: 10.7916/D89K4K6X. https://academiccommons.columbia.edu/doi/10.7916/D89K4K6X.

  561. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
  562. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  3407--3412, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1339. https://aclanthology.org/D19-1339.

  563. Neural Logic Reasoning
  564. Expert, crowdsourced, and machine assessment of suicide risk via online postings. In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pp.  25--36, New Orleans, LA, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-0603. https://aclanthology.org/W18-0603.

  565. Abu Awal Md Shoeb and Gerard de Melo. EmoTag1200: Understanding the association between emojis and emotions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  8957--8967, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.720. https://aclanthology.org/2020.emnlp-main.720.

  566. Retrieval Augmentation Reduces Hallucination in Conversation
  567. Ekaterina Shutova. Automatic metaphor interpretation as a paraphrasing task. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp.  1029--1037, Los Angeles, California, June 2010. Association for Computational Linguistics. https://aclanthology.org/N10-1147.

  568. Metaphor corpus annotated for source-target domain mappings. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta, May 2010. European Language Resources Association. http://www.lrec-conf.org/proceedings/lrec2010/pdf/612_Paper.pdf.

  569. Mining discourse markers for unsupervised sentence representation learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  3477--3486, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1351. https://aclanthology.org/N19-1351.

  570. DiscSense: Automated semantic analysis of discourse markers. In Proceedings of the 12th Language Resources and Evaluation Conference, pp.  991--999, Marseille, France, May 2020. European Language Resources Association. https://aclanthology.org/2020.lrec-1.125.

  571. Zero-shot recommendation as language modeling. In Matthias Hagen, Suzan Verberne, Craig Macdonald, Christin Seifert, Krisztian Balog, Kjetil Nørvåg, and Vinay Setty (eds.), Advances in Information Retrieval: 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II, pp. 223–230, Cham, 2022. Springer. doi: 10.1007/978-3-030-99739-7_26. https://doi.org/10.1007/978-3-030-99739-7_26.

  572. SPRINGS: Prediction of protein-protein interaction sites using artificial neural networks. Journal of Proteomics & Computational Biology, 1:7, 2014. https://www.avensonline.org/fulltextarticles/JPCB-2572-8679-01-0001.html.

  573. Predicting a correct program in programming by example. In Daniel Kroening and Corina S. Păsăreanu (eds.), Computer Aided Verification, pp.  398--414, Cham, 2015. Springer International Publishing. doi: 10.1007/978-3-319-21690-4_23. https://doi.org/10.1007/978-3-319-21690-4_23.

  574. Transforming spreadsheet data types using examples. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’16, pp.  343–356, New York, NY, USA, 2016. Association for Computing Machinery. doi: 10.1145/2837614.2837668. https://doi.org/10.1145/2837614.2837668.
  575. COM2SENSE: A commonsense reasoning benchmark with complementary sentences. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp.  883--898, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.78. https://aclanthology.org/2021.findings-acl.78.

  576. CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text
  577. Closing brackets with recurrent neural networks. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp.  232--239, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5425. https://aclanthology.org/W18-5425.

  578. Douglas R. Smith. The synthesis of LISP programs from examples: A survey. In Alan W. Biermann, Gerhard Guiho, and Yves Kodratoff (eds.), Automatic Program Construction Techniques, pp.  307--324. Macmillan, New York
  579. Paul Smolensky. On the proper treatment of connectionism. Behavioral and Brain Sciences, 11(1):1–23, 1988. doi: 10.1017/S0140525X00052432.
  580. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp.  1631--1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. https://aclanthology.org/D13-1170.

  581. Release Strategies and the Social Impacts of Language Models
  582. Early detection of freeze damage in navel orange fruit using nondestructive low intensity ultrasound coupled with machine learning. Food Analytical Methods, 14:1140--1149, 2021. https://doi.org/10.1007/s12161-020-01942-w.

  583. Pragmatics, modularity and mind-reading. Mind & Language, 17(1-2):3--23, 2002. doi: 10.1111/1468-0017.00186. https://onlinelibrary.wiley.com/doi/abs/10.1111/1468-0017.00186.
  584. Patching gender: Non-binary utopias in HCI. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, CHI EA ’19, pp.  1–11, New York, NY, USA, 2019. Association for Computing Machinery. doi: 10.1145/3290607.3310425. https://doi.org/10.1145/3290607.3310425.
  585. Causation, Prediction, and Search. MIT Press, Cambridge, MA
  586. Inferring interpersonal relations in narrative summaries. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pp.  2807–2813, Menlo Park, CA, 2016. Association for the Advancement of Artificial Intelligence. doi: 10.5555/3016100.3016294. https://dl.acm.org/doi/10.5555/3016100.3016294.
  587. Robert Stalnaker. Assertion. In P. Cole (ed.), Pragmatics, Syntax and Semantics 9, pp. 315--332. Brill, Leiden
  588. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  1679--1684, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1164. https://aclanthology.org/P19-1164.

  589. A Method for Linguistic Metaphor Identification: From MIP to MIPVU. Converging Evidence in Language and Communication Research 14. John Benjamins, Amsterdam
  590. Neural networks -- a model of boolean functions. 5th International Workshop on Boolean Problems, Freiburg, Sept. 2002., 2002. https://www.researchgate.net/publication/246931125_Neural_Networks_-_A_Model_of_Boolean_Functions.

  591. Learning to summarize from human feedback
  592. Andreas Stöckl. Watching a language model learning chess. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pp.  1369--1379, Held Online, September 2021. INCOMA Ltd. https://aclanthology.org/2021.ranlp-1.153.

  593. Metaphoric Paraphrase Generation
  594. Wikirelate! Computing semantic relatedness using Wikipedia. In AAAI’06: Proceedings of the 21st National Conference on Artificial Intelligence, volume 2, pp.  1419–1424. Association for the Advancement of Artificial Intelligence, 2006. doi: 10.5555/1597348.1597414. https://dl.acm.org/doi/10.5555/1597348.1597414.
  595. Prerequisite skills for reading comprehension: Multi-perspective analysis of MCTest datasets and systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, Menlo Park, CA, Feb. 2017. Association for the Advancement of Artificial Intelligence. https://ojs.aaai.org/index.php/AAAI/article/view/10957.

  596. Executing instructions in situated collaborative interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  2119--2130, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1218. https://aclanthology.org/D19-1218.

  597. Evolution and impact of bias in human and machine learning algorithm interaction. PLOS ONE, 15(8):1--39, 08 2020. doi: 10.1371/journal.pone.0235502. https://doi.org/10.1371/journal.pone.0235502.

  598. Sequence to Sequence Learning with Neural Networks
  599. Rich Sutton. The bitter lesson. Incomplete Ideas, 2019. http://www.incompleteideas.net/IncIdeas/BitterLesson.html.

  600. LSTM networks can perform dynamic counting. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pp.  44--54, Florence, August 2019a. Association for Computational Linguistics. doi: 10.18653/v1/W19-3905. https://aclanthology.org/W19-3905.

  601. Memory-Augmented Recurrent Neural Networks Can Learn Generalized Dyck Languages
  602. ChePT -- applying deep neural transformer models to chess move prediction and self-commentary, 2021. https://web.stanford.edu/class/cs224n/reports/final_reports/report087.pdf.

  603. You reap what you sow: On the challenges of bias evaluation under multilingual settings. In Proceedings of the ACL Workshop on Challenges & Perspectives in Creating LLMs, 2022. https://openreview.net/forum?id=rK-7NhfSIW5.

  604. oLMpics -- On what Language Model Pre-training Captures
  605. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4149--4158, Minneapolis, Minnesota, June 2019b. Association for Computational Linguistics. doi: 10.18653/v1/N19-1421. https://aclanthology.org/N19-1421.

  606. Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models
  607. Learning to recommend quotes for writing. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pp.  2453--2459, Menlo Park, CA, 2015. Association for the Advancement of Artificial Intelligence. https://dl.acm.org/doi/10.5555/2886521.2886662.
  608. The teaching size: Computable teachers and learners for universal languages. Machine Learning, 108:1653--1675, 2019. doi: 10.1007/s10994-019-05821-2. https://doi.org/10.1007/s10994-019-05821-2.

  609. Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision
  610. Analog retrieval by constraint satisfaction. Artificial Intelligence, 46(3):259--310, 1990. doi: 10.1016/0004-3702(90)90018-U. https://www.sciencedirect.com/science/article/pii/000437029090018U.

  611. Representing numbers in NLP: A survey and a vision. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  644--656, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.53. https://aclanthology.org/2021.naacl-main.53.

  612. Learning to interpret natural language commands through human-robot dialog. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-15, pp.  1923--1929, 2015. https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/10957/10931.

  613. Judith Jarvis Thomson. Killing, letting die, and the trolley problem. The Monist, 59(2):204--217, 1976. doi: 10.5840/monist197659224. https://doi.org/10.5840/monist197659224.

  614. LaMDA: Language Models for Dialog Applications
  615. Survey on collaborative filtering, content-based filtering and hybrid recommendation system. International Journal of Computer Applications, 110:31--36, 2015. doi: 10.5120/19308-0760. https://www.ijcaonline.org/archives/volume110/number4/19308-0760.

  616. FEVER: A large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  809--819, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1074. https://aclanthology.org/N18-1074.

  617. Xiaoyu Tong. Metaphor paraphrasing and word sense disambiguation: Toward a new approach to automated metaphor, 2021. https://scripties.uba.uva.nl/download?fid=681664.

  618. Recent advances in neural metaphor processing: A linguistic, cognitive and social perspective. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  4673--4686, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.372. https://aclanthology.org/2021.naacl-main.372.

  619. Chess as a Testbed for Language Model State Tracking
  620. Correlation-based network analysis combined with machine learning techniques highlight the role of the GABA shunt in Brachypodium sylvaticum freezing tolerance. Scientific Reports, 10:no. 4489, 2020. https://doi.org/10.1038/s41598-020-61081-4.

  621. Neural Arithmetic Logic Units
  622. Omiotis: A thesaurus-based measure of text relatedness. In Wray Buntine, Marko Grobelnik, Dunja Mladenić, and John Shawe-Taylor (eds.), Machine Learning and Knowledge Discovery in Databases, pp.  742--745, Berlin, 2009. Springer.
  623. Text relatedness based on a word thesaurus. Journal of Artificial Intelligence Research, 37(1):1--40, Jan. 2010. doi: 10.5555/1861751.1861752. https://dl.acm.org/doi/10.5555/1861751.1861752.
  624. Metaphor detection with cross-lingual model transfer. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  248--258, Baltimore, Maryland, June 2014. Association for Computational Linguistics. doi: 10.3115/v1/P14-1024. https://aclanthology.org/P14-1024.

  625. Alan M. Turing. Computing machinery and intelligence. Mind, LIX(236):433--460, 10 1950. doi: 10.1093/mind/LIX.236.433. https://doi.org/10.1093/mind/LIX.236.433.

  626. Dating documents using graph convolution networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1605--1615, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1149. https://aclanthology.org/P18-1149.

  627. Fine-Grained Temporal Relation Extraction
  628. Temporal reasoning in natural language inference. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.  4070--4078, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.363. https://aclanthology.org/2020.findings-emnlp.363.

  629. Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pp.  11--20, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.eval4nlp-1.2. https://aclanthology.org/2020.eval4nlp-1.2.

  630. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30, pp.  5998--6008. Curran Associates, Inc., 2017. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

  631. The use of spatial relations in referring expression generation. In Proceedings of the Fifth International Natural Language Generation Conference, pp.  59--67, Salt Fork, Ohio, USA, June 2008. Association for Computational Linguistics. https://aclanthology.org/W08-1109.

  632. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575:350--354, 2019. doi: 10.1038/s41586-019-1724-z. https://doi.org/10.1038/s41586-019-1724-z.

  633. Computational argumentation quality assessment in natural language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp.  176--187, Valencia, Spain, April 2017. Association for Computational Linguistics. https://aclanthology.org/E17-1017.

  634. Does GPT-2 know your phone number? Berkeley Artificial Intelligence Research blog, 20 Dec. 2020. https://bair.berkeley.edu/blog/2020/12/20/lmmem/.

  635. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp.  353--355, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. https://aclanthology.org/W18-5446.

  636. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019a. https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html.

  637. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  5008--5020, Online, July 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.450. https://aclanthology.org/2020.acl-main.450.

  638. GPT-J-6B: A 6 billion parameter autoregressive language model, May 2021. https://github.com/kingoflolz/mesh-transformer-jax.

  639. Learning to Count Objects with Few Exemplar Annotations
  640. Continuity of topic, interaction, and query: Learning to quote in online conversations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  6640--6650, Online, November 2020b. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.538. https://aclanthology.org/2020.emnlp-main.538.

  641. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver
  642. Learning language games through interaction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  2368--2378, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1224. https://aclanthology.org/P16-1224.

  643. It’s going to be okay: Measuring access to support in online communities. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.  33--45, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1004. https://aclanthology.org/D18-1004.

  644. TalkDown: A corpus for condescension detection in context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  3711--3719, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1385. https://aclanthology.org/D19-1385.

  645. Who uses web search for what: And how. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM ’11, pp.  15--24, New York, NY, USA, 2011. Association for Computing Machinery. doi: 10.1145/1935826.1935839. https://doi.org/10.1145/1935826.1935839.
  646. David Wechsler. Wechsler Adult Intelligence Scale–Fourth Edition (WAIS–IV). Pearson, San Antonio
  647. Finetuned Language Models Are Zero-Shot Learners
  648. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
  649. Humor detection: A transformer gets the last laugh. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  3621--3625, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1372. https://aclanthology.org/D19-1372.

  650. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
  651. Lexicosyntactic inference in neural models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.  4717--4724, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1501. https://aclanthology.org/D18-1501.

  652. Revisiting the strange stories: Revealing mentalizing impairments in autism. Child Development, 80(4):1097--1117, 2009. doi: 10.1111/j.1467-8624.2009.01319.x. https://srcd.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8624.2009.01319.x.
  653. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  1112--1122, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1101. https://aclanthology.org/N18-1101.

  654. Cognitive and emotional demands of black humour processing: The role of intelligence, aggressiveness and mood. Cognitive Processing, 18:159--167, 2017. doi: 10.1007/s10339-016-0789-y. https://doi.org/10.1007/s10339-016-0789-y.

  655. Language models are few-shot multilingual learners. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pp.  1--15, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.mrl-1.1. https://aclanthology.org/2021.mrl-1.1.

  656. Terry Winograd. Understanding natural language. Cognitive Psychology, 3(1):1--191, 1972. doi: 10.1016/0010-0285(72)90002-3. https://www.sciencedirect.com/science/article/pii/0010028572900023.

  657. Ludwig Wittgenstein. Philosophical investigations. Basil Blackwell, Oxford
  658. Thomas Wolf. Some additional experiments extending the tech report “assessing BERT’s syntactic abilities” by Yoav Goldberg, 2019. https://huggingface.co/bert-syntax/extending-bert-syntax.pdf.

  659. HuggingFace's Transformers: State-of-the-art Natural Language Processing
  660. An embedding method for unseen words considering contextual information and morphological information. In Proceedings of the 36th Annual ACM Symposium on Applied Computing, SAC ’21, pp.  1055--1062, New York, NY, USA, 2021. Association for Computing Machinery. doi: 10.1145/3412841.3441982. https://doi.org/10.1145/3412841.3441982.
  661. Learning data transformation rules through examples: Preliminary results. In Proceedings of the Ninth International Workshop on Information Integration on the Web, IIWeb ’12, New York, NY, USA, 2012. Association for Computing Machinery. doi: 10.1145/2331801.2331809. https://doi.org/10.1145/2331801.2331809.
  662. Applying the Transformer to Character-level Transduction
  663. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
  664. The Causal-Neural Connection: Expressiveness, Learnability, and Inference
  665. On Hallucination and Predictive Uncertainty in Conditional Language Generation
  666. Recipes for Safety in Open-domain Chatbots
  667. AutoQA: From databases to QA semantic parsers with only synthetic training data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  422--434, Online, November 2020b. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.31. https://aclanthology.org/2020.emnlp-main.31.

  668. Incorporating latent meanings of morphological compositions to enhance word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1232--1242, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1114. https://aclanthology.org/P18-1114.

  669. ByT5: Towards a token-free future with pre-trained byte-to-byte models
  670. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  483--498, Online, June 2021b. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. https://aclanthology.org/2021.naacl-main.41.

  671. Who's to say what's funny? A computer using Language Models and Deep Learning, That's Who!
  672. Humor recognition and humor anchor extraction. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.  2367--2376, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1284. https://aclanthology.org/D15-1284.

  673. Learning to Prove Theorems via Interacting with Proof Assistants
  674. Scott Cheng-Hsin Yang and Patrick Shafto. Explainable artificial intelligence via Bayesian teaching. Workshop on Teaching Machines, Robots, and Humans, NIPS 2017, 2017. http://shaftolab.com/assets/papers/yangShafto_NIPS_2017_machine_teaching.pdf.

  675. WikiWalk: Random walks on Wikipedia for semantic relatedness. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4), pp.  41--49, Suntec, Singapore, August 2009. Association for Computational Linguistics. https://aclanthology.org/W09-3206.

  676. Optimizing sentence modeling and selection for document summarization. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI-15, pp.  1383--1389, 2015. doi: 10.5555/2832415.2832442. https://dl.acm.org/doi/10.5555/2832415.2832442.
  677. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.  3911--3921, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1425. https://aclanthology.org/D18-1425.

  678. CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  1962--1979, Hong Kong, China, November 2019a. Association for Computational Linguistics. doi: 10.18653/v1/D19-1204. https://aclanthology.org/D19-1204.

  679. SParC: Cross-domain semantic parsing in context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  4511--4523, Florence, Italy, July 2019b. Association for Computational Linguistics. doi: 10.18653/v1/P19-1443. https://aclanthology.org/P19-1443.

  680. ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning
  681. Learning the Dyck language with attention-based Seq2Seq models. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp.  138--146, Florence, Italy, August 2019c. Association for Computational Linguistics. doi: 10.18653/v1/W19-4815. https://aclanthology.org/W19-4815.

  682. ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
  683. Eliezer Yudkowsky. Artificial intelligence as a positive and negative factor in global risk. In Nick Bostrom and Milan M. Ćirković (eds.), Global Catastrophic Risks, pp.  308--345. Oxford University Press, Oxford, 2008. https://web.archive.org/web/20210125025955/https://intelligence.org/files/AIPosNegFactor.pdf.

  684. Learning to Execute
  685. Figure me out: A gold standard dataset for metaphor interpretation. In Proceedings of the 12th Language Resources and Evaluation Conference, pp.  5810--5819, Marseille, France, May 2020. European Language Resources Association. https://aclanthology.org/2020.lrec-1.712.

  686. From Recognition to Cognition: Visual Commonsense Reasoning
  687. HellaSwag: Can a Machine Really Finish Your Sentence?
  688. Defending Against Neural Fake News
  689. The gap of semantic parsing: A survey on automatic math word problem solvers. IEEE Transactions on Pattern Analysis & Machine Intelligence, 42(09):2287--2305, Sep. 2020a. doi: 10.1109/TPAMI.2019.2914054.
  690. Hurtful words: Quantifying biases in clinical contextual word embeddings. In Proceedings of the ACM Conference on Health, Inference, and Learning, CHIL ’20, pp.  110--120, New York, NY, USA, 2020b. Association for Computing Machinery. doi: 10.1145/3368555.3384448. https://doi.org/10.1145/3368555.3384448.
  691. WinoWhy: A deep diagnosis of essential commonsense knowledge for answering Winograd schema challenge. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  5736--5745, Online, July 2020c. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.508. https://aclanthology.org/2020.acl-main.508.

  692. Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms
  693. Reasoning about goals, steps, and temporal ordering with WikiHow. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  4630--4639, Online, November 2020d. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.374. https://aclanthology.org/2020.emnlp-main.374.

  694. Tweet sarcasm detection using deep neural network. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp.  2449--2460, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee. https://aclanthology.org/C16-1231.

  695. ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension
  696. Irony detection via sentiment-based transfer learning. Information Processing & Management, 56(5):1633--1644, 2019b. doi: 10.1016/j.ipm.2019.04.006. https://www.sciencedirect.com/science/article/pii/S0306457318307428.

  697. Learning to Count Objects in Natural Images for Visual Question Answering
  698. When Do You Need Billions of Words of Pretraining Data?
  699. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp.  15--20, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2003. https://aclanthology.org/N18-2003.

  700. Calibrate Before Use: Improving Few-Shot Performance of Language Models
  701. "Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding
  702. Detecting Hallucinated Content in Conditional Neural Sequence Generation
  703. Learning to ask unanswerable questions for machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  4238--4248, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1415. https://aclanthology.org/P19-1415.

  704. Xiaojin Zhu. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, Menlo Park, CA, Mar. 2015. Association for the Advancement of Artificial Intelligence. https://ojs.aaai.org/index.php/AAAI/article/view/9761.

  705. ST-MoE: Designing Stable and Transferable Sparse Expert Models
  706. Alan Zucconi. The secrets of colour interpolation, 6 Jan. 2016. https://www.alanzucconi.com/2016/01/06/colour-interpolation/.
