Project MOSLA: Recording Every Moment of Second Language Acquisition

Published 26 Mar 2024 in cs.CL | (2403.17314v1)

Abstract: Second language acquisition (SLA) is a complex and dynamic process. Many SLA studies that have attempted to record and analyze this process have typically focused on a single modality (e.g., textual output of learners), covered only a short period of time, and/or lacked control (e.g., failed to capture every aspect of the learning process). In Project MOSLA (Moments of Second Language Acquisition), we have created a longitudinal, multimodal, multilingual, and controlled dataset by inviting participants to learn one of three target languages (Arabic, Spanish, and Chinese) from scratch over a span of two years, exclusively through online instruction, and recording every lesson using Zoom. The dataset is semi-automatically annotated with speaker/language IDs and transcripts by both human annotators and fine-tuned state-of-the-art speech models. Our experiments reveal linguistic insights into learners' proficiency development over time, as well as the potential for automatically detecting the areas of focus on the screen purely from the unannotated multimodal data. Our dataset is freely available for research purposes and can serve as a valuable resource for a wide range of applications, including but not limited to SLA, proficiency assessment, language and speech processing, pedagogy, and multimodal learning analytics.



Summary

The paper introduces a novel dataset that captures over 250 hours of controlled, multimodal online language instruction in Arabic, Spanish, and Mandarin Chinese.
It employs a semi-automatic annotation strategy in which human annotators label a subset of the data and fine-tuned speech models annotate the rest.
The dataset supports analyses of lexical diversity and language proficiency development over time, and opens research directions in educational technology and AI.

Project MOSLA: A Comprehensive Dataset for Second Language Acquisition Research

Introduction

Second language acquisition (SLA) involves complex, dynamic processes that have been studied across various disciplines. Existing SLA datasets have offered insights, but they remain limited in modality, duration, and comprehensiveness. Project MOSLA (Moments of Second Language Acquisition) addresses these limitations with a dataset that is longitudinal, multimodal, multilingual, and collected in a controlled learning environment. It captures over 250 hours of learner-teacher interaction in online lessons in Arabic, Spanish, and Mandarin Chinese, and it is annotated semi-automatically, combining human annotation with machine learning models for speech and language processing.

Dataset Overview and Annotation Process

MOSLA's dataset construction starts with its data collection method. Learners attended weekly online lessons over two years and were prohibited from studying their target language outside these lessons, so that the recordings capture the entire learning process. Each session was recorded using Zoom, covering verbal and non-verbal interaction, engagement with teaching materials, and proficiency development. This approach yielded over 250 hours of multimodal content across three languages, a scope and depth that earlier SLA datasets have not offered.

The annotation process combined human annotation with machine learning. Bilingual annotators produced the initial annotations for a subset of the recordings, covering speaker identification, language identification, and speech transcription. These human annotations then served as training data for machine learning models, which extended the annotations to the rest of the dataset. Notably, fine-tuning state-of-the-art speech models on the human-annotated subset improved their performance on this data, which suggests the dataset can help adapt speech models to learner-teacher speech in SLA research.
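The combination of human and model annotations described above can be pictured with a small sketch. Everything here is an assumption made for illustration: the `Segment` fields, the `source` labels, and the rule that a human annotation takes precedence over an overlapping model prediction are hypothetical and are not taken from the MOSLA release or its tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One annotated stretch of audio (hypothetical schema, not MOSLA's)."""
    start: float    # seconds from the start of the recording
    end: float      # seconds from the start of the recording
    speaker: str    # e.g. "teacher" or "learner"
    language: str   # e.g. "es", "ar", "zh", "en"
    text: str       # transcript of the segment
    source: str     # "human" or "model"


def overlaps(a: Segment, b: Segment) -> bool:
    """True when the two time intervals share any duration."""
    return a.start < b.end and b.start < a.end


def merge_annotations(human: list[Segment], model: list[Segment]) -> list[Segment]:
    """Keep every human segment; add model segments only where no human
    segment overlaps them. This precedence rule is an assumption."""
    merged = list(human)
    merged.extend(m for m in model if not any(overlaps(m, h) for h in human))
    return sorted(merged, key=lambda s: s.start)


human = [Segment(0.0, 2.0, "teacher", "es", "hola", "human")]
model = [
    Segment(1.0, 3.0, "teacher", "es", "ola", "model"),    # overlaps a human segment
    Segment(3.0, 5.0, "learner", "es", "hola", "model"),   # fills an unannotated gap
]
merged = merge_annotations(human, model)
```

In this sketch the overlapping model segment is dropped and the gap-filling one is kept, so `merged` holds one human and one model segment in time order.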

Insights and Applications

Analyses of the MOSLA dataset have begun to yield insights into SLA. Quantitative results show how learners' lexical diversity and language use change over time, contributing to a richer understanding of proficiency development. The dataset also opens new research avenues, such as applying the Matchmap method to detect the areas of focus on the screen from unannotated multimodal data alone.
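As an illustration of how lexical diversity could be tracked per session, the sketch below computes the type-token ratio (TTR) and a moving-average TTR over learner utterances. These are standard measures swapped in for illustration; this summary does not state which metric the paper uses. The `(session_id, speaker, text)` tuple layout, the `"learner"` label, and whitespace tokenization are assumptions, and whitespace tokenization in particular would not work for Chinese without a word segmenter.

```python
from collections import defaultdict


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens; 0.0 for empty input."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def moving_average_ttr(tokens, window=50):
    """Mean TTR over all sliding windows of a fixed size, which makes the
    score less sensitive to text length than plain TTR."""
    if len(tokens) < window:
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def diversity_by_session(utterances, window=50):
    """Map each session id to the moving-average TTR of learner speech.

    `utterances` is an iterable of (session_id, speaker, text) tuples;
    this layout is hypothetical, not the MOSLA file format.
    """
    tokens_by_session = defaultdict(list)
    for session_id, speaker, text in utterances:
        if speaker == "learner":
            # Naive whitespace tokenization (assumption; unsuitable for Chinese).
            tokens_by_session[session_id].extend(text.lower().split())
    return {
        session: moving_average_ttr(tokens, window)
        for session, tokens in sorted(tokens_by_session.items())
    }


sample = [
    (1, "learner", "hola hola"),
    (1, "teacher", "muy bien"),
    (2, "learner", "hola amigo"),
]
scores = diversity_by_session(sample)
```

Plotting such per-session scores against session number would give one rough view of vocabulary growth, under the assumptions stated above.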

Beyond offering a window into the nuanced processes of language learning, the MOSLA dataset promises to invigorate SLA research across various domains. Its applicability spans proficiency assessment, language processing, educational tool development, and multimodal learning analytics, among others. The dataset's release for research and non-commercial purposes ensures its accessibility to the broader academic community, aiming to catalyze future discoveries in the field.

Ethical Considerations and Conclusion

Project MOSLA integrates rigorous ethical considerations, focusing on participant privacy, fair use of materials, and fair compensation. These measures underscore the project's commitment to responsible research practices. As the dataset becomes a foundational resource for SLA research, ongoing ethical oversight and adherence to agreed-upon usage terms will remain paramount.

In conclusion, Project MOSLA contributes a uniquely comprehensive resource to the field of SLA. By capturing every moment of the learning process across multiple languages and modalities, it offers unprecedented opportunities for research and development. The dataset signposts future directions in linguistics, education technology, and AI, highlighting the collaborative potential across these intersecting fields to enhance our understanding of language learning.
