PhonologyBench: Evaluating Phonological Skills of Large Language Models (2404.02456v2)
Abstract: Phonology, the study of the sound structure of language and its pronunciation rules, is a critical yet often overlooked component of LLM research. LLMs are widely used in downstream applications that depend on phonology, such as educational tools and poetry generation. Moreover, LLMs may learn imperfect associations between orthographic and phonological forms from their training data. It is therefore imperative to benchmark the phonological skills of LLMs. To this end, we present PhonologyBench, a novel benchmark consisting of three diagnostic tasks designed to explicitly test the phonological skills of LLMs in English: grapheme-to-phoneme conversion, syllable counting, and rhyme word generation. Despite having no access to speech data, LLMs showed notable performance on the PhonologyBench tasks. However, we observe significant gaps of 17% and 45% relative to humans on rhyme word generation and syllable counting, respectively. Our findings underscore the importance of studying LLM performance on phonological tasks, which inadvertently affect real-world applications. Furthermore, since we find that no single model consistently outperforms the others on all tasks, we encourage practitioners to choose the LLM that performs best on the phonological task most closely related to their downstream application.
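To illustrate why the syllable-counting task is harder than it looks, here is a minimal heuristic counter based on orthography alone. This is an illustrative sketch, not the paper's evaluation code: it counts runs of vowel letters with a crude silent-"e" adjustment, and the cases where such letter-based heuristics fail are exactly the orthography-to-phonology gap the benchmark probes.

```python
import re

def count_syllables(word: str) -> int:
    """Rough orthographic heuristic: count runs of vowel letters.

    Illustrative only -- English spelling makes letter-based counts
    unreliable (silent 'e', digraphs, etc.), which is why a model
    that has only seen text can miscount syllables.
    """
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Crude silent-'e' adjustment: "make" -> 1, but keep "-le"/"-ee" endings.
    if word.endswith("e") and count > 1 and not word.endswith(("le", "ee")):
        count -= 1
    return max(count, 1)
```

For example, `count_syllables("phonology")` correctly returns 4, while words like "poem" or "naive" defeat the vowel-run rule, showing why phonological knowledge (not just spelling) is required.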