Are Sounds Sound for Phylogenetic Reconstruction?

Published 5 Feb 2024 in cs.CL, cs.SD, and eess.AS | (2402.02807v3)

Abstract: In traditional studies on language evolution, scholars often emphasize the importance of sound laws and sound correspondences for phylogenetic inference of language family trees. However, to date, computational approaches have typically not taken this potential into account. Most computational studies still rely on lexical cognates as the major data source for phylogenetic reconstruction in linguistics, although there do exist a few studies in which authors praise the benefits of comparing words at the level of sound sequences. Building on (a) ten diverse datasets from different language families, and (b) state-of-the-art methods for automated cognate and sound correspondence detection, we test, for the first time, the performance of sound-based versus cognate-based approaches to phylogenetic reconstruction. Our results show that phylogenies reconstructed from lexical cognates are topologically closer, by approximately one third with respect to the generalized quartet distance on average, to the gold standard phylogenies than phylogenies reconstructed from sound correspondences.

Summary

The paper shows that phylogenies reconstructed from lexical cognates are closer to expert classifications than those reconstructed from sound correspondences, by approximately one third on average in terms of generalized quartet distance.
It uses Bayesian Inference and Maximum Likelihood on ten gold standard datasets to compare sound-based and cognate-based approaches.
It identifies challenges in Bayesian prior selection that arise from a bimodal distribution of α values in linguistic data, prompting a re-evaluation of standard analysis settings.

Introduction

The study of language evolution and the methods used to reconstruct phylogenetic relationships among languages have advanced considerably with the advent of computational techniques. Lexical cognates have been the mainstay of computational studies, whereas historical linguists have traditionally cited sound correspondences as key evidence for subgrouping languages. Which data type, lexical cognates or sound sequences, better supports accurate reconstruction of language phylogenies has remained an open question. The paper addresses it by evaluating sound-based and cognate-based approaches to phylogenetic reconstruction on a collection of diverse language datasets.

Methodology

The research relies on state-of-the-art methods for automated cognate detection and for inferring sound correspondence patterns in multilingual datasets. These methods are applied to ten gold standard datasets from different language families, and the phylogenetic trees inferred from them are compared with gold standard trees based on expert classifications. The paper uses two inference approaches: Bayesian Inference and Maximum Likelihood (ML). It analyzes the cognate data and the sound correspondence data separately and in combination. The Bayesian analyses pay particular attention to the prior on α: default priors designed for molecular data produced notable discrepancies, which prompted a critical reassessment of these priors for language data.
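For readers unfamiliar with α: in standard phylogenetic models of among-site rate heterogeneity, α is the shape parameter of a gamma distribution over per-site rates, scaled to have mean 1. Assuming the paper follows this conventional parameterization (the summary here does not spell it out), the density and its first two moments are:

```latex
f(r \mid \alpha) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha - 1} e^{-\alpha r}, \quad r > 0,
\qquad \mathbb{E}[r] = 1,
\qquad \operatorname{Var}[r] = \frac{1}{\alpha}.
```

Under this parameterization, a small α corresponds to strong rate heterogeneity and a large α to nearly equal rates across sites, so a prior on α tuned to molecular data carries expectations about this variance that may not transfer to language data.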

Results

A central result is that phylogenies reconstructed from lexical cognates were, on average, topologically closer to expert-derived classification trees than those inferred from sound correspondences, by approximately one third as measured by the generalized quartet distance (GQD). Bayesian Inference and ML analyses produced consistent results, both favoring cognate-based over sound-based reconstruction, and the combined dataset did not conclusively outperform the cognate dataset. The study also reports an unusual bimodal distribution of α values (α indicates the degree of among-site rate heterogeneity), which is not typically observed in molecular datasets and suggests challenges specific to linguistic data.
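To make the GQD comparison concrete, here is a minimal Python sketch of a quartet-based distance between a gold standard tree and an inferred tree. It is not the paper's implementation. The nested-tuple tree encoding and the function names `clades`, `quartet_topology`, and `gqd` are hypothetical, chosen for illustration. The sketch divides the number of quartets resolved differently in the two trees by the number of quartets resolved in the gold tree, and it assumes that a quartet left unresolved in the inferred tree counts as different.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf set of tree, leaf sets of all internal nodes).

    A tree is a nested tuple of leaf names, e.g. (("A", "B"), "C").
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def quartet_topology(quartet, clade_list):
    """Return the 2|2 split the tree induces on four leaves, or None.

    A clade holding exactly two of the four leaves marks a branch that
    separates them from the other two (a "butterfly"). None means the
    quartet is unresolved (a "star").
    """
    q = frozenset(quartet)
    for clade in clade_list:
        inside = q & clade
        if len(inside) == 2:
            return frozenset([inside, q - inside])
    return None


def gqd(gold, inferred):
    """Share of gold-resolved quartets that the inferred tree resolves
    differently (assumption: unresolved in the inferred tree = different)."""
    gold_leaves, gold_clades = clades(gold)
    inferred_leaves, inferred_clades = clades(inferred)
    if gold_leaves != inferred_leaves:
        raise ValueError("both trees must contain the same languages")
    resolved = different = 0
    for quartet in combinations(sorted(gold_leaves), 4):
        gold_split = quartet_topology(quartet, gold_clades)
        if gold_split is None:
            continue  # stars in the gold tree carry no information
        resolved += 1
        if quartet_topology(quartet, inferred_clades) != gold_split:
            different += 1
    if resolved == 0:
        raise ValueError("gold tree resolves no quartets")
    return different / resolved


if __name__ == "__main__":
    # Toy trees over five hypothetical languages A-E.
    gold_tree = ((("A", "B"), "C"), ("D", "E"))
    inferred_tree = ((("A", "C"), "B"), ("D", "E"))
    print(gqd(gold_tree, inferred_tree))
```

For the two toy trees, two of the five resolved quartets (A, B, C, D and A, B, C, E) differ, so the sketch yields 0.4. Skipping quartets that are unresolved in the gold tree is what lets the measure cope with expert classifications that contain polytomies.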

Discussion

The research presents a nuanced view of the utility of computational tools for language phylogeny. It concludes that sound correspondence-based phylogenies cannot be disregarded but appear less reliable than cognate-based ones, and it favors lexical cognates as the stronger basis for phylogenetic reconstruction. The paper also calls for closer scrutiny of the priors customarily used in Bayesian analyses of language data, a methodological question that bears on computational and evolutionary biology as well. Supplementary material accompanies the study to support reproducibility. Overall, the findings can inform future work bridging computational linguistics and historical linguistics.
