Japanese Lexical Complexity for Non-Native Readers: A New Dataset

Published 30 Jun 2023 in cs.CL | (2306.17399v1)

Abstract: Lexical complexity prediction (LCP) is the task of predicting the complexity of words in a text on a continuous scale. It plays a vital role in simplifying or annotating complex words to assist readers. To study lexical complexity in Japanese, we construct the first Japanese LCP dataset. Our dataset provides separate complexity scores for Chinese/Korean annotators and others to address the readers' L1-specific needs. In the baseline experiment, we demonstrate the effectiveness of a BERT-based system for Japanese LCP.
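
To make the idea of L1-specific complexity scores concrete, here is a minimal sketch of how per-group scores on a continuous scale could be derived from individual annotations. It is an illustration only, not the authors' procedure: the group labels (`CK`, `non-CK`), the example words, the ratings, and the function name `complexity_by_group` are all hypothetical, and the sketch assumes ratings are already normalized to [0, 1] and are simply averaged within each group.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotations: (word, annotator L1 group, rating in [0, 1]).
# Group labels, words, and values are invented for illustration and are
# not taken from the dataset.
ANNOTATIONS = [
    ("経済", "CK", 0.0),
    ("経済", "CK", 0.0),
    ("経済", "non-CK", 0.5),
    ("経済", "non-CK", 1.0),
    ("インフラ", "CK", 1.0),
    ("インフラ", "CK", 0.5),
    ("インフラ", "non-CK", 0.0),
    ("インフラ", "non-CK", 0.5),
]


def complexity_by_group(annotations):
    """Average the ratings for each word separately per annotator group."""
    # Collect all ratings that share the same (word, group) pair.
    buckets = defaultdict(list)
    for word, group, rating in annotations:
        buckets[(word, group)].append(rating)

    # Reduce each bucket to its mean, giving one continuous score per group.
    scores = defaultdict(dict)
    for (word, group), ratings in buckets.items():
        scores[word][group] = mean(ratings)
    return {word: dict(groups) for word, groups in scores.items()}


scores = complexity_by_group(ANNOTATIONS)
for word, groups in scores.items():
    print(word, groups)
```

In this toy data the same word receives different scores depending on the annotator group, which is the kind of difference that separate per-group scores are meant to capture.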
