Project MOSLA: A Comprehensive Dataset for Second Language Acquisition Research
Introduction
Second language acquisition (SLA) involves complex, dynamic processes that have been studied across many disciplines. Existing SLA datasets have offered insights, but they remain limited in modality, duration, and comprehensiveness. Project MOSLA (Moments of Second Language Acquisition) addresses these limitations with a dataset that is longitudinal, multimodal, and multilingual, and that was collected in a controlled learning environment. It captures over 250 hours of learner-teacher interactions from online lessons in Arabic, Spanish, and Mandarin Chinese. The audio content was annotated semi-automatically, combining human annotations with machine learning predictions for speech and language processing.
Dataset Overview and Annotation Process
Dataset construction began with data collection. Learners took part in weekly online lessons over two years and were strictly prohibited from studying their target language outside these lessons, so that the recordings would cover the whole of their learning. Each session was recorded on Zoom, capturing verbal and non-verbal interaction, engagement with teaching materials, and the development of proficiency. The result is over 250 hours of multimodal content across three languages, which sets a new standard for SLA datasets in scope and depth.
The annotation process combined human expertise with machine learning efficiency. Bilingual annotators produced the initial annotations, which covered speaker identification, language identification, and speech transcription, among other linguistic features. These human annotations then served as training data for machine learning models, which extended the annotations to the entire dataset. Notably, fine-tuning state-of-the-art speech models on the human-annotated subset significantly improved their performance, which suggests the dataset can also help improve machine learning applications in SLA research.
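As an illustration of how a transcription improvement from fine-tuning might be quantified, the sketch below computes word error rate (WER), a common metric for speech recognition. This is an assumption for illustration only: the text above does not say which metric the project used, and the function name `word_error_rate` is hypothetical. Whitespace tokenization is also an assumption and would not suit unsegmented Mandarin Chinese text, where a character-level rate is more usual.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Toy comparison of two hypothetical model outputs against a human transcript.
human = "yo quiero aprender español"
baseline_output = "yo quiero prender"
finetuned_output = "yo quiero aprender español"
print(word_error_rate(human, baseline_output))
print(word_error_rate(human, finetuned_output))
```

A lower value after fine-tuning would indicate that the model's transcripts are closer to the human annotations.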
Insights and Applications
Analyses of the MOSLA dataset have already begun to yield insights into SLA. Quantitative results show that the dataset can be used to track lexical diversity and language use over time, contributing to a richer understanding of how language proficiency develops. The dataset also opens new research avenues, such as applying the Matchmap method to unannotated multimodal data to identify areas of focus in the teaching materials.
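One simple way to examine lexical diversity over time is the type-token ratio (TTR), the number of distinct words divided by the total number of words, computed per session. The sketch below is a minimal illustration under stated assumptions: the text above does not specify which diversity measure was used, the function names are hypothetical, and the sample sentences are invented rather than taken from the dataset. Whitespace tokenization is assumed, which would need replacing with a word segmenter for Mandarin Chinese.

```python
def type_token_ratio(text: str) -> float:
    """Distinct words divided by total words; 0.0 for empty input."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def diversity_by_session(transcripts: list[str]) -> list[float]:
    """Compute the TTR of each session transcript, in chronological order."""
    return [type_token_ratio(text) for text in transcripts]


# Invented learner utterances from three hypothetical sessions.
sessions = [
    "la casa la casa la mesa",
    "la casa es grande y la mesa es pequeña",
    "mi hermana vive en una casa grande cerca del mar",
]
for number, ttr in enumerate(diversity_by_session(sessions), start=1):
    print(f"session {number}: TTR = {ttr:.2f}")
```

Plain TTR falls as transcripts get longer, so comparisons across sessions of very different lengths would usually call for a length-normalized variant.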
Beyond offering a window into the nuanced processes of language learning, the MOSLA dataset is intended to support SLA research across several domains, including proficiency assessment, language processing, educational tool development, and multimodal learning analytics. The dataset is released for research and non-commercial purposes, which makes it accessible to the broader academic community and is meant to catalyze future discoveries in the field.
Ethical Considerations and Conclusion
Project MOSLA integrates rigorous ethical considerations, focusing on participant privacy, fair use of materials, and fair compensation. These measures underscore the project's commitment to responsible research practices. As the dataset becomes a foundational resource for SLA research, ongoing ethical oversight and adherence to agreed-upon usage terms will remain paramount.
In conclusion, Project MOSLA contributes a uniquely comprehensive resource to the field of SLA. By capturing every moment of the learning process across multiple languages and modalities, it offers unprecedented opportunities for research and development. The dataset points to future directions in linguistics, education technology, and AI, and highlights how collaboration across these intersecting fields can enhance our understanding of language learning.