Pairing Orthographically Variant Literary Words to Standard Equivalents Using Neural Edit Distance Models (2401.15068v1)
Abstract: We present a novel corpus consisting of orthographically variant words found in works of 19th century U.S. literature annotated with their corresponding "standard" word pair. We train a set of neural edit distance models to pair these variants with their standard forms, and compare the performance of these models to the performance of a set of neural edit distance models trained on a corpus of orthographic errors made by L2 English learners. Finally, we analyze the relative performance of these models in the light of different negative training sample generation strategies, and offer concluding remarks on the unique challenge literary orthographic variation poses to string pairing methodologies.
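For orientation, the classic (non-neural) edit distance that neural edit distance models build on can be sketched as below. This is a minimal illustration and not the paper's model: the helper names `levenshtein` and `pair_variant`, the toy vocabulary, and the example word "gwine" (a common literary-dialect spelling of "going") are assumptions chosen for illustration, not items confirmed to be in the corpus.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit-cost
    insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def pair_variant(variant: str, vocabulary: list[str]) -> str:
    """Return the vocabulary word with the smallest edit distance to the
    variant; ties go to the earliest entry in the list."""
    return min(vocabulary, key=lambda w: levenshtein(variant, w))


if __name__ == "__main__":
    # Hypothetical toy vocabulary, for illustration only.
    vocab = ["going", "wine", "give"]
    print(pair_variant("gwine", vocab))
```

With this toy vocabulary the unweighted distance pairs "gwine" with "wine" (one deletion) rather than "going" (two substitutions), which hints at why literary orthographic variation may call for learned, rather than uniform, edit costs.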