
BabyLM Challenge 2024 Dataset

Updated 19 November 2025
  • BabyLM Challenge 2024 Dataset is a standardized, developmentally plausible corpus curated to mimic child language input with a strict 100M token budget.
  • It integrates text-only and multimodal tracks by combining child-directed speech, children's literature, and image-caption pairs for comprehensive cognitive and multimodal benchmarking.
  • The dataset enforces strict data budgets and rigorous preprocessing, enabling research on data-efficient language modeling and robust model evaluation.

The BabyLM Challenge 2024 dataset is a standardized, developmentally plausible corpus for pretraining data-efficient LLMs under strict resource ceilings that approximate the linguistic experience of human children through early adolescence. It serves as the foundation for a community-driven benchmark emphasizing cognitively plausible language modeling, with structured tracks for text-only and multimodal (vision–language) learning. The dataset is constructed from child-relevant domains, with quantitative controls on size, source selection, domain balance, and preprocessing, facilitating rigorous comparisons of model architectures, objectives, and training paradigms (Warstadt et al., 10 Apr 2025, Hu et al., 2024, Choshen et al., 2024).

1. Dataset Design Principles and Rationale

The BabyLM 2024 dataset operationalizes developmentally plausible pretraining by enforcing a total word-token budget of 100 million words (Strict Track), grounded in empirical estimates of child linguistic input exposure—approximately 2–7 million word tokens per year, accumulating to ~84 million by age 12 (Gilkerson et al., 2017). The stricture on data volume addresses the vast data-inefficiency gap between contemporary LLMs and human learners, providing a tractable, standardized corpus for cognitive and NLP research (Warstadt et al., 10 Apr 2025, Warstadt et al., 2023).
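For concreteness, the ~84 M figure follows from taking the upper end of the cited 2–7 M words/year range and accumulating it over twelve years (a rough back-of-the-envelope illustration, not a calculation from the challenge documentation):

    $7 \times 10^6\ \text{words/year} \times 12\ \text{years} = 84 \times 10^6\ \text{words}$,

which falls just under the 100 M-word Strict budget.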

Source selection is driven by two central criteria: plausibility of exposure to a child (emphasizing transcribed speech and child-directed or child-appropriate text) and sufficient diversity of syntactic/semantic constructions. The dataset deliberately excludes domains such as news, academic text, or web forums to maintain alignment with child language acquisition scenarios (Hu et al., 2024).

A key innovation in 2024 is the introduction of a vision-and-language multimodal track, combining 50 million text-only word tokens with 50 million image-captioned word tokens, thereby enabling research into sample-efficient multimodal pretraining (Choshen et al., 2024, Hu et al., 2024).

2. Corpus Composition and Sources

The 2024 BabyLM baseline corpus consists of two major components:

A. Text-Only Pretraining Corpus (Strict Track, 100 M words)

This component is assembled from six English-language data sources, summarized below (percentages by word count):

| Source | Domain | Words (M) | % of 100 M |
|---|---|---|---|
| CHILDES (MacWhinney, 2000) | Child-directed speech | 29 | 29% |
| BNC Spoken Dialogue | Adult spoken dialogue | 8 | 8% |
| Project Gutenberg (children's books) | Written children's literature | 26 | 26% |
| OpenSubtitles | Movie/TV subtitles | 20 | 20% |
| Simple English Wikipedia | Simplified encyclopedia text | 15 | 15% |
| Switchboard Dialog Act | Telephone dialogue | 1 | 1% |

This structure yields a text corpus in which child-oriented content (CHILDES, children’s books, Simple English Wikipedia) constitutes 70% of tokens, and speech-derived material (CHILDES, BNC, OpenSubtitles, Switchboard) covers 58% (Hu et al., 2024). The corpus provides train/dev/test splits on a per-source basis (83.3%/8.3%/8.3% by word count), and a 10 M-word “Strict-Small” track is generated via random downsampling of the Strict split (Hu et al., 2024, Warstadt et al., 10 Apr 2025).
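The aggregate shares quoted above can be recovered directly from the table; the following minimal Python sketch does so (the dictionary keys are illustrative labels, not official corpus identifiers, and counts are in millions of words):

```python
# Recompute the aggregate domain shares from the per-source word counts.
# Counts are rounded to the nearest million (they sum to ~99 M), so shares
# are quoted against the nominal 100 M Strict-track budget.
corpus_m_words = {
    "CHILDES": 29,                  # child-directed speech
    "BNC_spoken": 8,                # adult spoken dialogue
    "Gutenberg_children": 26,       # written children's literature
    "OpenSubtitles": 20,            # movie/TV subtitles
    "SimpleEnglishWikipedia": 15,   # simplified encyclopedia text
    "Switchboard": 1,               # telephone dialogue
}
child_oriented = {"CHILDES", "Gutenberg_children", "SimpleEnglishWikipedia"}
speech_derived = {"CHILDES", "BNC_spoken", "OpenSubtitles", "Switchboard"}

budget_m = 100  # nominal Strict-track budget, in millions of words
child_share = sum(corpus_m_words[s] for s in child_oriented) / budget_m    # 0.70
speech_share = sum(corpus_m_words[s] for s in speech_derived) / budget_m   # 0.58
print(f"child-oriented: {child_share:.0%}, speech-derived: {speech_share:.0%}")
```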

B. Multimodal Vision–Language Corpus (100 M words)

This component comprises:

  • 50 M tokens: Stratified sample from the text-only corpus, mirroring the domain proportions above.
  • 50 M tokens: Captioned data drawn from two image–text sources:
    • Localized Narratives (27 M words, ~0.6 M images)
    • Conceptual Captions 3M (23 M words, ~2.3 M images)

Together, these yield approximately 2.9 million image–caption pairs, each consisting of one image and one associated caption (Hu et al., 2024, Choshen et al., 2024).
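As a rough illustration of how the two caption sources differ, the sketch below derives the total pair count and average caption length from the approximate figures above (all values are approximate; the names and structure are illustrative only):

```python
# Back-of-the-envelope figures for the captioned half of the multimodal corpus
# (values in millions, taken from the approximate counts quoted above).
caption_sources = {
    "Localized Narratives":   {"words_m": 27, "images_m": 0.6},
    "Conceptual Captions 3M": {"words_m": 23, "images_m": 2.3},
}
total_pairs_m = sum(s["images_m"] for s in caption_sources.values())  # ~2.9 M pairs
for name, s in caption_sources.items():
    # Localized Narratives captions are far longer on average (~45 words)
    # than Conceptual Captions 3M captions (~10 words).
    print(f"{name}: ~{s['words_m'] / s['images_m']:.0f} words per caption on average")
print(f"total image-caption pairs: ~{total_pairs_m:.1f} M")
```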

3. Data Preparation, Preprocessing, and Release Structure

Dataset curation emphasizes minimal intervention. Preprocessing steps include the following (a minimal sketch appears after the list):

  • Stripping of XML/annotation tags and document-level metadata.
  • Removal of duplicate documents (notably in OpenSubtitles).
  • Preservation of original newlines—no forced sentence/paragraph segmentation.
  • No spelling normalization, downcasing, or forced tokenization; participants construct their own tokenizers (BPE, WordPiece).
  • For Project Gutenberg, filtering to English texts and authors born after 1850 (Hu et al., 2024, Warstadt et al., 10 Apr 2025).
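The cleaning pass can be sketched as follows; this is illustrative only and assumes markup can be stripped with a generic tag pattern, whereas the released scripts handle each source's format specifically:

```python
import re

def light_clean(documents):
    """Minimal sketch of the light-touch preprocessing described above
    (not the official pipeline): strip XML-style tags, drop exact
    duplicate documents, and leave newlines, casing, spelling, and
    tokenization untouched."""
    seen = set()
    cleaned = []
    for doc in documents:
        doc = re.sub(r"<[^>]+>", "", doc)  # strip XML/annotation tags
        if doc in seen:                    # drop exact duplicates
            continue
        seen.add(doc)
        cleaned.append(doc)                # no segmentation, casing, or
    return cleaned                         # spelling changes applied
```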

Each source is split into train/dev/test partitions by random sampling of minimally processed “chunks” (≥2,000 lines) to maximize document coherence within splits. The dataset is distributed as raw UTF-8 text files, accompanied by scripts and datasheets for transparency and reproducibility (Choshen et al., 2024).
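A minimal sketch of this chunk-based splitting procedure is given below; it is illustrative only (it allocates by chunk count rather than by word count, and the official preprocessing scripts in the BabyLM repository are authoritative):

```python
import random

def split_source(lines, chunk_lines=2000, seed=0):
    """Illustrative per-source splitting sketch (not the official script):
    group a source's lines into contiguous chunks of roughly chunk_lines lines,
    shuffle the chunks, and allocate whole chunks to train/dev/test at about
    83.3%/8.3%/8.3%. The final chunk may be shorter than chunk_lines."""
    chunks = [lines[i:i + chunk_lines] for i in range(0, len(lines), chunk_lines)]
    random.Random(seed).shuffle(chunks)
    n_train = round(len(chunks) * 10 / 12)  # ~83.3%
    n_dev = round(len(chunks) * 1 / 12)     # ~8.3%
    return {
        "train": [l for c in chunks[:n_train] for l in c],
        "dev":   [l for c in chunks[n_train:n_train + n_dev] for l in c],
        "test":  [l for c in chunks[n_train + n_dev:] for l in c],
    }
```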

In the multimodal vision–language track, images are released at native resolution with no resizing or feature extraction applied. Download scripts and precomputed DINOv2 embeddings are provided for reproducibility. Caption vocabularies are not standardized across text-only and image–caption sources; overlap analysis is possible via intersection of word types (Hu et al., 2024).
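Such an overlap analysis can be sketched as follows, assuming whitespace word types over the raw UTF-8 files; the file names are hypothetical and the tokenization choice is one possible convention, not an official one:

```python
def word_types(path):
    """Collect whitespace-delimited word types from a raw UTF-8 text file.
    (Illustrative: the corpus imposes no official tokenization.)"""
    types = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            types.update(line.split())
    return types

# Hypothetical file names for the text-only and caption training portions.
text_vocab = word_types("text_only.train")
caption_vocab = word_types("captions.train")

shared = text_vocab & caption_vocab
jaccard = len(shared) / len(text_vocab | caption_vocab)
print(f"shared word types: {len(shared):,}; Jaccard overlap: {jaccard:.2%}")
```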

4. Track-Based Data Budgets and Usage Rules

Three tracks regulate the flow of linguistic input:

  1. Strict Track:
    • Budget: $\sum_i N_i = 100 \times 10^6$ words, where $N_i$ is the word-token count contributed by source $i$.
    • All trainable components (tokenizer, reranker, etc.) must remain within the 100 M-word ceiling.
  2. Strict-Small Track:
    • Budget: $\sum_i N_i = 10 \times 10^6$ words.
    • Generated by downsampling the Strict split; not independently resplit.
  3. Multimodal Vision Track:
    • Budget: $\text{text} \le 100 \times 10^6$ words, with a recommended split of ≤50 M words of paired image–caption data and ≤50 M words of text-only data; the number of images is unlimited.

Participants may construct custom datasets provided they adhere to these strict word budgets and submit a fully detailed datasheet (source, licensing, filtering, quality control) (Choshen et al., 2024). Repeated exposure to the same text (multiple epochs, data augmentation) does not increase the word budget unless new tokens are introduced. For leaderboard eligibility, participants must use only specified train splits, employing dev/test splits for tuning or secondary evaluation (Warstadt et al., 10 Apr 2025, Choshen et al., 2024).
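A simple budget check in the spirit of these rules might look as follows; it is a sketch only (whitespace word counting is one reasonable convention, and the paths shown are placeholders):

```python
def within_budget(train_files, budget_words=100_000_000):
    """Sketch of a track-budget check: count word tokens in the underlying
    training data (each file counted once, since repeated epochs over the
    same text do not add to the budget) and compare to the track ceiling."""
    total = 0
    for path in train_files:
        with open(path, encoding="utf-8") as f:
            total += sum(len(line.split()) for line in f)
    return total, total <= budget_words

# Hypothetical usage for the Strict-Small track (paths are placeholders):
# total, ok = within_budget(["childes.train", "gutenberg.train"],
#                           budget_words=10_000_000)
```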

5. Quantitative Properties and Domain Distribution

The BabyLM 2024 corpora are quantitatively characterized by:

  • Text-only corpus: 100 M words (Strict); 10 M words (Strict-Small).

  • Domain balance (Strict, 2024; categories overlap, as a source may count toward more than one):

    • Child-oriented text: 70%
    • Transcribed speech: 58%
    • Movie subtitles: 20%
    • Children’s literature: 26%
    • Simplified encyclopedia: 15%

Notably, the 2024 dataset substantially increases the proportion of child-oriented content (from 39% in 2023 to 70%) and transcribed speech (from 55% to 58%), and eliminates general English Wikipedia and QED corpora, thus enhancing developmental plausibility (Hu et al., 2024).

Sequence lengths for model pretraining are left unconstrained except by each source's native structure. Baseline models typically use maximum sequence lengths between 128 and 512 tokens. No sentence- or paragraph-level segmentation is imposed during preprocessing (Warstadt et al., 10 Apr 2025).

The type–token ratio (TTR) is reported as $\mathrm{TTR} = |V| / N$, where $|V|$ is the number of unique word types (vocabulary size) and $N$ is the total word-token count. Exact values are tokenizer- and split-dependent (Hu et al., 2024).
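A minimal computation of this quantity, assuming naive whitespace tokenization (one possible convention, since the corpus imposes none), is:

```python
def type_token_ratio(tokens):
    """TTR = |V| / N for a sequence of tokens; the value depends on the
    tokenizer and on which split is measured, as noted above."""
    return len(set(tokens)) / len(tokens)

# Toy example with whitespace tokenization:
sample = "the dog saw the ball and the dog ran".split()
print(type_token_ratio(sample))  # 6 types / 9 tokens ≈ 0.67
```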

6. Impact on Model Training, Evaluation, and Data Efficiency

The BabyLM Challenge dataset enables experimentation with architectures, training regimes, and data augmentation strategies tailored to sample-efficient, cognitively plausible pretraining. It provides the substrate for comparing models on a shared evaluation pipeline, which scores pre-trained LMs on grammatical ability (BLiMP), natural language understanding (GLUE/SuperGLUE), and cognitive science–inspired judgments. Vision track models are further benchmarked on multimodal evaluation tasks (e.g., VQA, pragmatic grounding) (Warstadt et al., 10 Apr 2025, Hu et al., 2024).

Empirical findings suggest that hybrid architectures (e.g., causal–masked LMs), modifications to the training data (such as the injection of structured variation sets), and innovations in the training objective yield measurable improvements under fixed-budget regimes. No submission to date has outperformed the strong baselines in the multimodal track, indicating that sample-efficient image–text pretraining remains an open problem (Hu et al., 2024).

A strong relationship between total training FLOPs and average evaluation performance has been observed, emphasizing the continuing importance of training efficiency even under strict data constraints. Approaches such as curriculum learning and student–teacher distillation yield mixed results; curriculum-based variants have shown only modest or inconsistent improvements under the BabyLM constraints (Warstadt et al., 10 Apr 2025).

7. Evolution, Reproducibility, and Future Directions

Relative to its 2023 predecessor, the 2024 dataset increases child-directed input, eliminates broad-domain outliers, incorporates tightly controlled multimodal content, and relaxes restrictions to allow participant-constructed corpora (subject to ceiling constraints and rigorous datasheet documentation) (Choshen et al., 2024, Hu et al., 2024).

All raw corpora, preprocessing scripts, and splits are made publicly available through official repositories (https://osf.io/ad7qg/, https://github.com/babylm/babylm_data_preprocessing), ensuring reproducibility and transparency. This design supports both direct benchmark replication and the exploration of augmentation, filtering, or synthetic data generation strategies under the developmentally motivated resource ceiling (Hu et al., 2024).

Areas for further research include fine-grained domain balancing, evaluation on cognitively diagnostic benchmarks, improved handling of multimodal data, and disentanglement of variation set effects on structural generalization. The ongoing challenge structure supports cross-lab, cross-method comparisons, driving the field toward more sample-efficient, cognitively inspired LLMs.
