Towards Data-Efficient Language Models: A Child-Inspired Approach to Language Learning (2503.04611v1)

Published 6 Mar 2025 in cs.CL

Abstract: In this work, we explain our approach in the BabyLM Challenge, which uses various methods of training LMs with significantly less data than traditional LLMs, inspired by how human children learn. While a human child is exposed to far less linguistic input than an LLM, children still achieve remarkable language understanding and generation abilities. To this end, we develop a model trained on a curated dataset of 10 million words, primarily sourced from child-directed transcripts. The 2024 BabyLM Challenge initial dataset of 10M words is filtered to 8.5M and then supplemented with a randomly selected subset of the TVR dataset consisting of 1.5M words of television dialogues. The latter dataset ensures that, like children, the model is also exposed to language through media. Furthermore, we reduce the vocabulary size to 32,000 tokens, aligning it with the limited vocabulary of children in the early stages of language acquisition. We use curriculum learning and are able to match the baseline on certain benchmarks while surpassing it on others. Additionally, incorporating common LLM training datasets, such as MADLAD-400, degrades performance. These findings underscore the importance of dataset selection, vocabulary scaling, and curriculum learning in creating more data-efficient language models that better mimic human learning processes.

Summary

  • The paper introduces a child-inspired language model training framework that reduces dataset size and vocabulary requirements while maintaining competitive performance.
  • It employs a curated 8.5M-word dataset supplemented with television dialogues and integrates curriculum learning to gradually introduce linguistic complexity.
  • Results demonstrate that precise vocabulary scaling and structured input ordering improve performance on benchmarks such as BLiMP and GLUE.

Overview of Data-Efficient Language Models Inspired by Child Language Acquisition

This paper analyzes an approach to improving data efficiency in LMs by mimicking the language acquisition process of human children. The researchers challenge conventional LM training methodologies, which typically require extensive datasets, and present a training framework that uses far less linguistic input while achieving competitive performance. The work forms part of the BabyLM Challenge and delivers several key methodologies and experimental insights.

Methodology and Dataset Preparation

Dataset Curation

The authors first craft a specialized 10 million-word dataset, primarily derived from child-directed transcripts. The 2024 BabyLM Challenge data is filtered down to 8.5 million words and then augmented with an additional 1.5 million words of television dialogues from the TVR dataset. This supplementation is intended to simulate the media-based language exposure that significantly influences modern children's language development.
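The paper does not publish its exact preprocessing pipeline, but the assembly step it describes can be sketched in a few lines of Python. The file names, the line-level sampling granularity, and the word-count stopping rule below are assumptions for illustration:

```python
import random

def load_lines(path):
    # Read a corpus as one text unit per non-empty line (assumed layout).
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

babylm = load_lines("babylm_filtered_8.5M.txt")  # hypothetical: filtered challenge data
tvr = load_lines("tvr_dialogues.txt")            # hypothetical: TVR television dialogues

# Randomly sample TVR lines until roughly 1.5M words are collected.
random.shuffle(tvr)
target_words, sampled, word_count = 1_500_000, [], 0
for line in tvr:
    sampled.append(line)
    word_count += len(line.split())
    if word_count >= target_words:
        break

corpus = babylm + sampled  # ~10M-word training corpus
```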

Vocabulary Scaling

In alignment with a child's restricted lexical scope during early language learning stages, the vocabulary size is intentionally limited to 32,000 tokens. This constraint aims to encourage efficient representation learning and generalization strategies within the LM, paralleling early human language acquisition.
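As a concrete illustration, a 32,000-token vocabulary can be trained with the HuggingFace `tokenizers` library. The paper does not state which tokenizer or corpus layout it uses, so the byte-level BPE choice, the file path, and the special tokens here are assumptions:

```python
from tokenizers import ByteLevelBPETokenizer

# Illustrative sketch: byte-level BPE is assumed; only the
# 32,000-token cap comes from the paper.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],  # hypothetical path to the ~10M-word corpus
    vocab_size=32_000,
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>"],
)
tokenizer.save("tokenizer-32k.json")
```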

Model Architecture

The authors employ the SmolLM model, a decoder-only Transformer with 125 million parameters, trained for 5 epochs. The compact model size suits settings with limited resources and demonstrates a practical route to quality LM performance without extensive computational power or datasets.
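The paper does not list its full hyperparameters, so the configuration below is only a rough sketch of a decoder-only model at roughly this parameter scale, built with HuggingFace `transformers`. The width, depth, and head counts are illustrative guesses, not the authors' settings:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Hypothetical decoder-only configuration in the low-100M-parameter
# range; only the 32,000-token vocabulary is taken from the paper.
config = LlamaConfig(
    vocab_size=32_000,
    hidden_size=768,
    intermediate_size=2048,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=2048,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```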

Curriculum Learning

Finally, curriculum learning structures the training data so that complexity increases gradually over training. Scoring functions, such as word count and average word length, categorize and sort the dataset; the resulting scores dictate the learning schedule, paralleling natural human learning and grounding the model in simpler linguistic structures before it advances to more intricate ones.
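A minimal sketch of this difficulty-based ordering is shown below; the unweighted combination of word count and average word length is an assumption, since the paper does not specify how its scoring signals are weighted:

```python
def difficulty_score(text: str) -> float:
    # Combine the two scoring signals named above (word count and
    # average word length); an unweighted sum is assumed here.
    words = text.split()
    if not words:
        return 0.0
    avg_word_length = sum(len(w) for w in words) / len(words)
    return len(words) + avg_word_length

# Order training examples from "easy" to "hard" so early batches
# contain the simplest sentences.
corpus = [
    "Because the weather deteriorated, the expedition was postponed indefinitely.",
    "The dog ran.",
]
curriculum = sorted(corpus, key=difficulty_score)
print(curriculum)  # simple sentence first
```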

Experimental Evaluations

Impact of Television Dialogues

Incorporating television data yields significant performance gains, especially on benchmarks such as BLiMP and the BLiMP Supplement. The diversified linguistic input from television dialogues adds valuable breadth to the training signal. Notably, however, adding common large-scale LLM training datasets such as MADLAD-400 degrades performance, underscoring how critical careful data selection is under low-resource conditions.

Effects of Vocabulary Size

Performance peaks at the 32,000-token vocabulary size, marking it as an important hyperparameter. Both smaller and larger vocabulary configurations were suboptimal, suggesting that precise vocabulary tailoring can significantly affect data-efficient LM training.

Curriculum Learning Efficacy

Curriculum learning notably improved performance by letting the model progress systematically from simpler to more complex linguistic material, fostering robust language understanding. It emerges as a potent strategy for making LM training both more efficient and more effective.

Results and Comparative Analysis

The developed model consistently meets or surpasses baseline performance across varied evaluation metrics, with particular strengths on the GLUE and BLiMP benchmarks. This comparison establishes the competitive standing of the proposed framework against established baselines.

Discussion and Future Implications

The paper identifies avenues for future work, most notably data valuation techniques that go beyond traditional selection heuristics and could enable more impactful data curation. Methods such as influence functions, or dynamic approaches like TracIn, could refine data selection, emphasizing quality over quantity in LM training inputs. Such advances could set precedents for data efficiency in machine learning and AI.
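For intuition, TracIn (Pruthi et al., 2020) estimates a training example's influence on a test example by summing, over saved checkpoints, the learning rate times the dot product of their loss gradients. The paper does not implement this; the PyTorch sketch below assumes hypothetical interfaces (`checkpoints` as saved state dicts, `lrs` as per-checkpoint learning rates, and a `loss_fn(model, example)` returning a scalar loss):

```python
import torch

def tracin_influence(model, checkpoints, lrs, train_ex, test_ex, loss_fn):
    # Sum over checkpoints of lr_t * <grad L(train), grad L(test)>,
    # following the TracIn approximation.
    score = 0.0
    for state, lr in zip(checkpoints, lrs):
        model.load_state_dict(state)
        params = [p for p in model.parameters() if p.requires_grad]
        g_train = torch.autograd.grad(loss_fn(model, train_ex), params)
        g_test = torch.autograd.grad(loss_fn(model, test_ex), params)
        score += lr * sum(
            torch.dot(a.flatten(), b.flatten()) for a, b in zip(g_train, g_test)
        ).item()
    return score
```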

Conclusion

The approach delineated in this paper demonstrates viable pathways for crafting data-efficient LMs inspired by human learning. Through systematic dataset curation, vocabulary scaling, and curriculum-based training, the paper contributes to NLP and carries implications for cognitive science. These findings mark a substantive step in the dialogue between human and artificial language acquisition, pointing toward more resource-efficient, cognitively coherent AI systems.
