
Using Letter Positional Probabilities to Assess Word Complexity (2404.07768v4)

Published 11 Apr 2024 in cs.CL

Abstract: Word complexity is defined in several different ways. Psycholinguistic, morphological and lexical proxies are often used, as are human ratings. The problem is that these proxies do not measure complexity directly, and human ratings are susceptible to subjective bias. In this study we contend that some form of 'latent complexity' can be approximated by using samples of simple and complex words. We use a sample of 'simple' words from primary school picture books and a sample of 'complex' words from high school and academic settings. To analyse the differences between these classes, we look at the letter positional probabilities (LPPs). We find strong statistical associations between several LPPs and complexity. For example, simple words are significantly (p<.001) more likely to start with w, b, s, h, g, k, j, t, y or f, while complex words are significantly (p<.001) more likely to start with i, a, e, r, v, u or d. We find similarly strong associations for subsequent letter positions, with 84 letter-position variables in the first 6 positions being significant at the p<.001 level. We then use LPPs as variables to build a classifier that separates the two classes with 83% accuracy. We test these findings using a second dataset, with 66 LPPs significant (p<.001) in the first 6 positions common to both datasets. We use these 66 variables to create a classifier that classifies a third dataset with an accuracy of 70%. Finally, we create a fourth sample by combining the extreme high- and low-scoring words generated by three classifiers built on the first three separate datasets, and use this sample to build a classifier with an accuracy of 97%. We use this classifier to score the four levels of English word groups from an ESL program.
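
To make the idea of a letter positional probability concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes an LPP is the share of words in a sample that have a given letter at a given position, counted over the first 6 positions. The function names, the choice of denominator (all words in the sample) and the two toy word lists are illustrative assumptions and are not taken from the paper's data.

```python
from collections import Counter


def letter_positional_probabilities(words, max_positions=6):
    """Map (position, letter) to the share of words with that letter there.

    Positions are 1-based. The denominator is the number of non-empty
    words in the sample (an assumption made for this sketch).
    """
    cleaned = [w.lower() for w in words if w]
    counts = Counter()
    for word in cleaned:
        for pos, letter in enumerate(word[:max_positions], start=1):
            counts[(pos, letter)] += 1
    total = len(cleaned)
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


# Toy samples invented for illustration only (hypothetical data).
simple_words = ["ball", "wind", "happy", "jump", "sun"]
complex_words = ["analysis", "evaluate", "derive", "inherent", "virtual"]

simple_lpp = letter_positional_probabilities(simple_words)
complex_lpp = letter_positional_probabilities(complex_words)


def lpp_difference(key):
    """Positive values mean the letter-position is more common in the simple sample."""
    return simple_lpp.get(key, 0.0) - complex_lpp.get(key, 0.0)


if __name__ == "__main__":
    # Show the first-letter LPPs that differ between the two toy samples.
    first_letter_keys = {k for k in simple_lpp if k[0] == 1}
    first_letter_keys |= {k for k in complex_lpp if k[0] == 1}
    for key in sorted(first_letter_keys):
        print(key, round(lpp_difference(key), 2))
```

With real samples, each (position, letter) pair would then be tested for association with the class label, and the significant pairs could serve as classifier features. The specific test and classifier are not stated in the abstract, so they are left out of this sketch.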

