
A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus (2405.11877v5)

Published 20 May 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Natural language inference (NLI), the task of recognizing the entailment relationship in sentence pairs, is an actively studied topic serving as a proxy for natural language understanding. Despite the relevance of the task in building conversational agents and improving text classification, machine translation and other NLP tasks, to the best of our knowledge, there is no publicly available NLI corpus for the Romanian language. To this end, we introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs, which are obtained via distant supervision, and 6K validation and test sentence pairs, which are manually annotated with the correct labels. We conduct experiments with multiple machine learning methods based on distant learning, ranging from shallow models based on word embeddings to transformer-based neural networks, to establish a set of competitive baselines. Furthermore, we improve on the best model by employing a new curriculum learning strategy based on data cartography. Our dataset and code to reproduce the baselines are available at https://github.com/Eduard6421/RONLI.
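The abstract does not spell out the curriculum strategy, so the following is only a minimal sketch of the general idea behind data cartography, not the authors' method. It assumes that, for each training pair, the probability the model assigned to the gold label was recorded at every epoch. From those values it computes the two usual cartography statistics (confidence as the mean, variability as the standard deviation across epochs) and orders examples from easy to hard. The function names and the toy probabilities are hypothetical.

```python
from statistics import mean, pstdev


def cartography_stats(gold_probs):
    """Return (confidence, variability) for each example.

    gold_probs: one list per example, holding the probability the model
    gave to the gold label at each training epoch (assumed to be logged).
    """
    return [(mean(p), pstdev(p)) for p in gold_probs]


def easy_to_hard_order(gold_probs):
    """Indices of examples sorted by decreasing confidence (easy first)."""
    stats = cartography_stats(gold_probs)
    return sorted(range(len(stats)), key=lambda i: -stats[i][0])


# Hypothetical training dynamics for three sentence pairs over three epochs.
probs = [
    [0.90, 0.95, 0.97],  # consistently high: easy to learn
    [0.20, 0.10, 0.15],  # consistently low: hard to learn
    [0.30, 0.70, 0.50],  # fluctuating: ambiguous
]

stats = cartography_stats(probs)
order = easy_to_hard_order(probs)
print(order)
```

A curriculum could then feed batches to the model following `order`, or weight sampling by these statistics; which variant the paper actually uses is described in the paper and the linked repository, not here.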

Authors (3)
  1. Eduard Poesina (3 papers)
  2. Cornelia Caragea (58 papers)
  3. Radu Tudor Ionescu (103 papers)
