
Lacuna Language Learning: Leveraging RNNs for Ranked Text Completion in Digitized Coptic Manuscripts (2407.12247v1)

Published 17 Jul 2024 in cs.CL

Abstract: Ancient manuscripts are frequently damaged, containing gaps in the text known as lacunae. In this paper, we present a bidirectional RNN model for character prediction of Coptic characters in manuscript lacunae. Our best model performs with 72% accuracy on single character reconstruction, but falls to 37% when reconstructing lacunae of various lengths. While not suitable for definitive manuscript reconstruction, we argue that our RNN model can help scholars rank the likelihood of textual reconstructions. As evidence, we use our RNN model to rank reconstructions in two early Coptic manuscripts. Our investigation shows that neural models can augment traditional methods of textual restoration, providing scholars with an additional tool to assess lacunae in Coptic manuscripts.
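The ranking idea in the abstract can be illustrated with a minimal sketch: each candidate reconstruction is inserted into the gap, the full string is scored by a character-level language model, and candidates are sorted by score. This is not the authors' model. A smoothed character bigram model stands in for the bidirectional RNN, and the function names (`train_bigram`, `log_prob`, `rank_reconstructions`), the toy English corpus, and the candidate list are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count character unigrams and bigrams in a training string.

    Stand-in scorer only: the paper uses a bidirectional RNN, not a bigram model.
    """
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    vocab = set(corpus)
    return unigrams, bigrams, vocab


def log_prob(text, model):
    """Sum add-one-smoothed bigram log-probabilities over a string."""
    unigrams, bigrams, vocab = model
    vocab_size = len(vocab)
    total = 0.0
    for prev, cur in zip(text, text[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


def rank_reconstructions(left, right, candidates, model):
    """Score each candidate in its left/right context and sort best-first."""
    scored = [(cand, log_prob(left + cand + right, model)) for cand in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): a tiny English corpus instead of Coptic text.
model = train_bigram("the cat sat on the mat. the hat is on the cat.")

# A one-character gap between "the " and "at", with three proposed fillers.
ranked = rank_reconstructions("the ", "at", ["c", "q", "z"], model)
for candidate, score in ranked:
    print(candidate, round(score, 3))
```

The candidates here all have the same length, so their summed log-probabilities are directly comparable. Comparing fillers of different lengths would need some form of length normalization, which this sketch does not attempt.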
