Revealing Trends in Datasets from the 2022 ACL and EMNLP Conferences
Abstract: Natural language processing (NLP) has grown significantly since the advent of the Transformer architecture. Transformers have given rise to pre-trained large language models (PLMs). The performance of NLP systems has improved tremendously across many tasks, and on specific tasks these systems are now on par with or, in some cases, better than humans. However, it remains the norm that better-quality datasets at pretraining time enable PLMs to achieve better performance, regardless of the task. The need for quality datasets has prompted NLP researchers to keep creating new datasets to meet particular needs. For example, the two top NLP conferences, ACL and EMNLP, accepted ninety-two papers in 2022 that introduce new datasets. This work aims to uncover trends and insights from these datasets. Moreover, we provide suggestions to researchers interested in curating datasets in the future.
- JamPatoisNLI: A Jamaican Patois natural language inference dataset. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5307–5320, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- EUR-lex-sum: A multi- and cross-lingual dataset for long-form summarization in the legal domain. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7626–7639, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Elisa Bassignana and Barbara Plank. 2022. CrossRE: A cross-domain dataset for relation extraction. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3592–3604, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- BioNLI: Generating a biomedical NLI dataset using lexico-semantic constraints for adversarial examples. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5093–5104, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- DiS-ReX: A multilingual dataset for distantly supervised relation extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 849–863, Dublin, Ireland. Association for Computational Linguistics.
- Human-machine collaboration approaches to build a dialogue dataset for hate speech countering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8031–8049, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- KQA pro: A dataset with explicit compositional programs for complex question answering over knowledge base. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6101–6119, Dublin, Ireland. Association for Computational Linguistics.
- LexGLUE: A benchmark dataset for legal language understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4310–4330, Dublin, Ireland. Association for Computational Linguistics.
- SummScreen: A dataset for abstractive screenplay summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8602–8615, Dublin, Ireland. Association for Computational Linguistics.
- IAM: A comprehensive and large-scale dataset for integrated argument mining tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2277–2287, Dublin, Ireland. Association for Computational Linguistics.
- HiTab: A hierarchical table dataset for question answering and natural language generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1094–1110, Dublin, Ireland. Association for Computational Linguistics.
- A dataset for hyper-relational extraction and a cube-filling approach. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10114–10133, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- MT-GenEval: A counterfactual and contextual dataset for evaluating gender accuracy in machine translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4287–4299, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Title2Event: Benchmarking open event extraction with a large-scale Chinese title dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6511–6524, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- e-CARE: a new dataset for exploring explainable causal reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 432–446, Dublin, Ireland. Association for Computational Linguistics.
- Cross-document event coreference search: Task, dataset and modeling. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 900–913, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Anton Eklund and Mona Forsman. 2022. Topic modeling by clustering language model embeddings: Human validation on an industry dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 635–643, Abu Dhabi, UAE. Association for Computational Linguistics.
- BanglaRQA: A benchmark dataset for under-resourced Bangla language reading comprehension-based question answering with diverse question-answer types. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2518–2532, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- HumSet: Dataset of multilingual information extraction and classification for humanitarian crises response. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4379–4389, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Your answer is incorrect… would you like to know why? Introducing a bilingual short answer feedback dataset. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8577–8591, Dublin, Ireland. Association for Computational Linguistics.
- CICERO: A dataset for contextualized commonsense inference in dialogues. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5010–5028, Dublin, Ireland. Association for Computational Linguistics.
- Sujatha Das Gollapalli and Xiaoli Li. 2015. EMNLP versus ACL: Analyzing NLP research over time. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2002–2006, Lisbon, Portugal. Association for Computational Linguistics.
- Questioning the validity of summarization datasets and improving their factual consistency. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5716–5727, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3309–3326, Dublin, Ireland. Association for Computational Linguistics.
- KOLD: Korean offensive language dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10818–10833, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Chinese synesthesia detection: New dataset and models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3877–3887, Dublin, Ireland. Association for Computational Linguistics.
- RNSum: A large-scale dataset for automatic release note generation via commit logs summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8718–8735, Dublin, Ireland. Association for Computational Linguistics.
- WatClaimCheck: A new dataset for claim entailment and inference. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1293–1304, Dublin, Ireland. Association for Computational Linguistics.
- Simple questions generate named entity recognition datasets. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6220–6236, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- BotsTalk: Machine-sourced framework for automatic curation of large-scale multi-skill dialogue datasets. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5149–5170, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- BOOKSUM: A collection of datasets for long-form narrative summarization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6536–6558, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- IndicNLG benchmark: Multilingual datasets for diverse NLG tasks in Indic languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5363–5394, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Near-negative distinction: Giving a second life to human evaluation datasets. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2094–2108, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Multilingual SubEvent relation extraction: A novel dataset and structure induction method. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5559–5570, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Bloom library: Multimodal datasets in 300+ languages for a variety of downstream tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8608–8621, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- MSCTD: A multimodal sentiment chat translation dataset. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2601–2613, Dublin, Ireland. Association for Computational Linguistics.
- WANLI: Worker and AI collaboration for natural language inference dataset creation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6826–6847, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- FigMemes: A dataset for figurative language identification in politically-opinionated memes. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7069–7086, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Augmenting multi-turn text-to-SQL datasets with self-play. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5608–5620, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Measuring context-word biases in lexical semantic datasets. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2699–2713, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Long text and multi-table summarization: Dataset and method. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1995–2010, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Code generation from flowcharts with texts: A benchmark dataset and an approach. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6069–6077, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Antoine Louis and Gerasimos Spanakis. 2022. A statutory article retrieval dataset in French. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6789–6803, Dublin, Ireland. Association for Computational Linguistics.
- EnCBP: A new benchmark dataset for finer-grained cultural background prediction in English. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2811–2823, Dublin, Ireland. Association for Computational Linguistics.
- A benchmark and dataset for post-OCR text correction in Sanskrit. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6258–6265, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Listening to affected communities to define extreme speech: Dataset and experiments. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1089–1104, Dublin, Ireland. Association for Computational Linguistics.
- Leveraging QA datasets to improve generative data augmentation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9737–9750, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- ECTSum: A new benchmark dataset for bullet point summarization of long earnings call transcripts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10893–10906, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- HybriDialogue: An information-seeking dialogue dataset grounded on tabular and textual data. In Findings of the Association for Computational Linguistics: ACL 2022, pages 481–492, Dublin, Ireland. Association for Computational Linguistics.
- French CrowS-pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8521–8531, Dublin, Ireland. Association for Computational Linguistics.
- M3: Multi-level dataset for multi-document summarisation of medical studies. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3887–3901, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- PriMock57: A dataset of primary care mock consultations. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 588–598, Dublin, Ireland. Association for Computational Linguistics.
- MEE: A novel multilingual event extraction dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9603–9613, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Three real-world datasets and neural computational models for classification tasks in patent landscaping. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11498–11513, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- DuReader_vis: A Chinese dataset for open-domain document visual question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1338–1351, Dublin, Ireland. Association for Computational Linguistics.
- Commonsense knowledge salience evaluation with a benchmark dataset in E-commerce. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 14–27, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- CONDAQA: A contrastive reading comprehension dataset for reasoning about negation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8729–8755, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- M2D2: A massively multi-domain language modeling dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 964–975, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- MReD: A meta-review dataset for structure-controllable text generation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2521–2535, Dublin, Ireland. Association for Computational Linguistics.
- “I’m sorry to hear that”: Finding new biases in language models with a holistic descriptor dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9180–9211, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Anirudh Srinivasan and Eunsol Choi. 2022. TyDiP: A dataset for politeness classification in nine typologically diverse languages. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5723–5738, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Abigail Sticha and Paul Brenner. 2022. Hybrid knowledge engineering leveraging a robust ML framework to produce an assassination dataset. In Proceedings of the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE), pages 106–116, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- ConditionalQA: A complex reading comprehension dataset with conditional answers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3627–3637, Dublin, Ireland. Association for Computational Linguistics.
- On the safety of conversational models: Taxonomy, dataset, and benchmark. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3906–3923, Dublin, Ireland. Association for Computational Linguistics.
- Visual named entity linking: A new dataset and a baseline. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2403–2415, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- PHEE: A dataset for pharmacovigilance event extraction from text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5571–5587, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- WLASL-LEX: a dataset for recognising phonological properties in American Sign Language. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 453–463, Dublin, Ireland. Association for Computational Linguistics.
- A multi-modal dataset for hate speech detection on social media: Case-study of Russia-Ukraine conflict. In Proceedings of the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE), pages 1–6, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 715–729, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- CDialog: A multi-turn COVID-19 conversation dataset for entity-aware dialog generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11373–11385, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- SQuALITY: Building a long-document summarization dataset the hard way. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1139–1156, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- ClidSum: A benchmark dataset for cross-lingual dialogue summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7716–7729, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Textomics: A dataset for genomics data summary generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4878–4891, Dublin, Ireland. Association for Computational Linguistics.
- ParaTag: A dataset of paraphrase tagging for fine-grained labels, NLG evaluation, and data augmentation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7111–7122, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- MAVEN-ERE: A unified large-scale dataset for event coreference, temporal, causal, and subevent relation extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 926–941, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- WikiDiverse: A multimodal entity linking dataset with diversified contextual topics and entity types. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4785–4797, Dublin, Ireland. Association for Computational Linguistics.
- Peratham Wiriyathammabhum. 2022. ClassBases at the CASE-2022 multilingual protest event detection task: Multilingual protest news detection and automatically replicating manually created event datasets. In Proceedings of the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE), pages 149–154, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- XFUND: A benchmark dataset for multilingual visually rich form understanding. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3214–3224, Dublin, Ireland. Association for Computational Linguistics.
- Fantastic questions and where to find them: FairytaleQA – an authentic dataset for narrative comprehension. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 447–460, Dublin, Ireland. Association for Computational Linguistics.
- APEACH: Attacking pejorative expressions with analysis on crowd-generated hate speech evaluation datasets. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 7076–7086, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- PcMSP: A dataset for scientific action graphs extraction from polycrystalline materials synthesis procedure text. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6033–6046, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- D4: a Chinese dialogue dataset for depression-diagnosis-oriented chat. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2438–2459, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- LEVEN: A large-scale Chinese legal event detection dataset. In Findings of the Association for Computational Linguistics: ACL 2022, pages 183–201, Dublin, Ireland. Association for Computational Linguistics.
- ZeroGen: Efficient zero-shot learning via dataset generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11653–11669, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- ProGen: Progressive zero-shot dataset generation via in-context feedback. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3671–3683, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Beyond counting datasets: A survey of multilingual dataset construction and necessary resources. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3725–3743, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- MovieUN: A dataset for movie understanding and narrating. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1873–1885, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- NarraSum: A large-scale dataset for abstractive narrative summarization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 182–197, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- A fine-grained Chinese software privacy policy dataset for sequence labeling and regulation compliant identification. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10266–10277, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- JDDC 2.1: A multimodal Chinese dialogue dataset with joint tasks of query rewriting, response generation, discourse parsing, and summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 12037–12051, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Towards identifying social bias in dialog systems: Framework, dataset, and benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3576–3591, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Duqm: A Chinese dataset of linguistically perturbed natural questions for evaluating the robustness of question matching models.