- The paper introduces a novel approach using self-supervised learning models to accurately classify child speech maturity across diverse languages.
- It leverages transformer-based architectures fine-tuned on the extensive SpeechMaturity dataset, achieving an Unweighted Average Recall of 74.2%.
- The study demonstrates robust performance across urban and rural environments, highlighting its potential for early detection of developmental speech disorders.
Cross-Linguistic Child Speech Maturity Classification Using Self-Supervised Learning Models
The paper "Employing self-supervised learning models for cross-linguistic child speech maturity classification" presents a significant advancement in the field of child speech recognition technology. This paper addresses the intricate challenge of classifying child vocalizations across a diverse range of languages and acoustic environments by leveraging self-supervised learning (SSL) models, particular transformer-based architectures, to deliver improved classification accuracy over previous methodologies.
The researchers introduce a dataset named SpeechMaturity, comprising 242,004 labeled vocalizations captured from children acquiring over 25 languages in geographically and environmentally diverse regions, including the U.S., Bolivia, Vanuatu, Papua New Guinea, the Solomon Islands, and France. The dataset, substantially larger than predecessors such as BabbleCorpus, offers unprecedented ecological validity by encompassing both urban and rural settings across different cultures and linguistic contexts. This scale exposes models to the natural variability of child speech, allowing them to identify maturational stages with greater accuracy.
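To give a feel for what working with such a corpus looks like, here is a minimal sketch of summarizing a clip manifest. The column layout ("clip_path", "language", "environment", "label") and the rows are assumptions for illustration, not the actual SpeechMaturity release format.

```python
# Minimal sketch: summarizing a (hypothetical) corpus manifest with pandas.
# Column names and rows are illustrative assumptions, not the real release.
import pandas as pd

manifest = pd.DataFrame([
    {"clip_path": "clips/0001.wav", "language": "English", "environment": "urban", "label": "canonical"},
    {"clip_path": "clips/0002.wav", "language": "French",  "environment": "rural", "label": "cry"},
    {"clip_path": "clips/0003.wav", "language": "English", "environment": "urban", "label": "non_canonical"},
])

print(len(manifest), "labeled vocalizations")
print(manifest["language"].nunique(), "languages represented")
print(manifest.groupby(["environment", "label"]).size())  # clips per environment/label
```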
The paper evaluates three Wav2Vec2 variants: W2V2-base, W2V2-LL4300h, and W2V2-LL4300-Pro. These models are fine-tuned to classify child speech into five categories: cry, laughter, canonical syllables (mature speech), non-canonical syllables (immature speech), and other/junk sounds. Their performance significantly surpasses existing benchmarks, reaching classification accuracy comparable to human annotators, with W2V2-LL4300-Pro fine-tuned on SpeechMaturity achieving the highest accuracy.
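To make the setup concrete, here is a minimal sketch of fine-tuning a Wav2Vec2 classifier on the five categories using Hugging Face transformers. The public facebook/wav2vec2-base checkpoint stands in for the paper's W2V2-LL4300h variants, and the random waveforms stand in for SpeechMaturity clips; both are assumptions for illustration, not the authors' exact pipeline.

```python
# Minimal sketch: fine-tuning Wav2Vec2 for five-class vocalization classification.
# "facebook/wav2vec2-base" and the toy batch are illustrative stand-ins.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

LABELS = ["cry", "laughter", "canonical", "non_canonical", "junk"]

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=len(LABELS),
    label2id={l: i for i, l in enumerate(LABELS)},
    id2label=dict(enumerate(LABELS)),
)

# One toy batch: two 1-second clips of 16 kHz audio (random noise as placeholder).
waveforms = [torch.randn(16000).numpy(), torch.randn(16000).numpy()]
inputs = extractor(waveforms, sampling_rate=16000, return_tensors="pt", padding=True)
labels = torch.tensor([0, 2])  # e.g. "cry" and "canonical"

# A single supervised fine-tuning step over the encoder and classification head.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
optimizer.step()
print(f"loss: {outputs.loss.item():.3f}")
```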
In terms of numerical results, the W2V2-LL4300-Pro model fine-tuned on SpeechMaturity-Cleaned achieves an Unweighted Average Recall (UAR) of 74.2%, clearly outperforming past models tested on BabbleCorpus, which reported UARs in the range of 59.5% to 64.6%. These results mark a substantial improvement in handling large-scale, cross-linguistic child speech data with SSL.
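For readers unfamiliar with the metric, UAR is the mean of the per-class recalls, i.e. macro-averaged recall, which weights each class equally regardless of how common it is. A minimal sketch with toy labels:

```python
# Minimal sketch: UAR = macro-averaged recall. Labels below are toy placeholders.
from sklearn.metrics import recall_score

y_true = ["cry", "laughter", "canonical", "canonical", "non_canonical", "junk"]
y_pred = ["cry", "canonical", "canonical", "canonical", "non_canonical", "junk"]

uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.3f}")  # mean of the five per-class recall values
```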
Moreover, the paper demonstrates the robustness of these SSL models: classification performance holds up across distinct acoustic environments, with a UAR of 70.7% in urban areas and 67.8% in rural settings. This indicates that the models generalize across diverse ecological conditions, which matters for practical applications such as early detection of speech disorders across varied linguistic and cultural backgrounds.
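Such subgroup results come from scoring the same predictions separately per recording environment. A minimal sketch, assuming a hypothetical "environment" field in the clip metadata:

```python
# Minimal sketch: per-environment UAR. The "environment" field and the records
# are hypothetical; the paper's urban/rural split may be organized differently.
from collections import defaultdict
from sklearn.metrics import recall_score

records = [  # (environment, true label, predicted label)
    ("urban", "cry", "cry"),
    ("urban", "canonical", "canonical"),
    ("urban", "junk", "canonical"),
    ("rural", "laughter", "laughter"),
    ("rural", "non_canonical", "junk"),
    ("rural", "canonical", "canonical"),
]

by_env = defaultdict(lambda: ([], []))
for env, true, pred in records:
    by_env[env][0].append(true)
    by_env[env][1].append(pred)

for env, (y_true, y_pred) in by_env.items():
    uar = recall_score(y_true, y_pred, average="macro", zero_division=0)
    print(f"{env}: UAR = {uar:.3f}")
```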
The implications of this research are extensive. The successful application of SSL models to child speech classification paves the way for broader deployment in clinical and educational settings, enabling early identification of children at risk of developmental disorders. Furthermore, the diverse linguistic data in the SpeechMaturity corpus can enhance the adaptability and inclusivity of speech technologies in global contexts. The paper also underscores the potential for future advances in artificial intelligence, particularly in enriching communicative strategies and technologies in early childhood settings.
In conclusion, this paper enriches the landscape of child speech recognition by integrating diverse, cross-cultural data with advanced SSL models, pushing the state of the art in speech maturity classification and opening new avenues for AI applications in linguistics and cognitive development.