
Exploring language relations through syntactic distances and geographic proximity (2403.18430v2)

Published 27 Mar 2024 in cs.CL, physics.data-an, physics.soc-ph, and stat.AP

Abstract: Languages are grouped into families that share common linguistic traits. While this approach has been successful in understanding genetic relations between diverse languages, more analyses are needed to accurately quantify their relatedness, especially in less studied linguistic levels such as syntax. Here, we explore linguistic distances using series of parts of speech (POS) extracted from the Universal Dependencies dataset. Within an information-theoretic framework, we show that employing POS trigrams maximizes the possibility of capturing syntactic variations while being at the same time compatible with the amount of available data. Linguistic connections are then established by assessing pairwise distances based on the POS distributions. Intriguingly, our analysis reveals definite clusters that correspond to well-known language families and groups, with exceptions explained by distinct morphological typologies. Furthermore, we obtain a significant correlation between language similarity and geographic distance, which underscores the influence of spatial proximity on language kinships.
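
As a rough illustration of the pipeline the abstract describes (POS trigram distributions followed by pairwise distances), the sketch below counts trigrams over tagged sentences and compares two distributions. The abstract does not name the distance measure, so the Jensen-Shannon distance used here is an assumption, as are the function names and the made-up tag sequences, which are not taken from Universal Dependencies.

```python
from collections import Counter
from math import log2, sqrt


def trigram_distribution(sentences):
    """Relative frequencies of POS trigrams over a list of tagged sentences."""
    counts = Counter()
    for tags in sentences:
        # Slide a window of three consecutive tags inside each sentence.
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    total = sum(counts.values())
    return {tri: c / total for tri, c in counts.items()} if total else {}


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence (assumed measure)."""
    divergence = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            divergence += 0.5 * pk * log2(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * log2(qk / mk)
    # Guard against tiny negative values from floating-point rounding.
    return sqrt(max(divergence, 0.0))


# Toy, invented tag sequences for two hypothetical languages.
lang_a = [["DET", "NOUN", "VERB", "DET", "NOUN"], ["PRON", "VERB", "DET", "NOUN"]]
lang_b = [["DET", "NOUN", "VERB", "NOUN"], ["PRON", "VERB", "NOUN", "DET"]]

dist_a = trigram_distribution(lang_a)
dist_b = trigram_distribution(lang_b)
distance_ab = jensen_shannon_distance(dist_a, dist_b)
print(f"distance between the toy languages: {distance_ab:.3f}")
```

With base-2 logarithms the distance lies between 0 (identical distributions) and 1 (no shared trigrams), so a full pairwise matrix over many languages could be passed directly to a clustering routine.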

