ALBERTI, a Multilingual Domain Specific Language Model for Poetry Analysis (2307.01387v1)
Abstract: The computational analysis of poetry is limited by the scarcity of tools to automatically analyze and scan poems. In multilingual settings, the problem is exacerbated because scansion and rhyme systems exist only for individual languages, making comparative studies very challenging and time-consuming. In this work, we present Alberti, the first multilingual pre-trained large language model for poetry. Through domain-specific pre-training (DSP), we further trained multilingual BERT on a corpus of over 12 million verses from 12 languages. We evaluated its performance on two structural poetry tasks: Spanish stanza type classification, and metrical pattern prediction for Spanish, English and German. In both cases, Alberti outperforms multilingual BERT and other transformer-based models of similar size, and even achieves state-of-the-art results for German when compared to rule-based systems, demonstrating the feasibility and effectiveness of DSP in the poetry domain.
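The abstract does not spell out the pre-training objective. Assuming that DSP here continues BERT's masked language modeling objective on verse text, the corruption step could be sketched as below. The 15% selection rate and the 80/10/10 replacement rule come from the original BERT recipe; the function name `mask_tokens`, the toy verse, and the whitespace tokenization are illustrative assumptions, not details taken from the paper.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Corrupt a token list following BERT's 80/10/10 masking rule.

    Returns (corrupted, labels). labels[i] holds the original token at the
    positions the model must predict, and None everywhere else.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels

# Toy verse split on whitespace (an assumption; BERT uses subword tokens).
verse = "shall I compare thee to a summer's day".split()
vocab = sorted(set(verse))
corrupted, labels = mask_tokens(verse, vocab, random.Random(0))
print(corrupted)
print(labels)
```

During further pre-training, the model would be trained to recover the non-None entries of `labels` from `corrupted`, so the only change relative to generic pre-training is the text the verses are drawn from.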
Authors: Javier de la Rosa, Álvaro Pérez Pozo, Salvador Ros, Elena González-Blanco