2nd BabyLM Challenge Overview
- The 2nd BabyLM Challenge is an academic shared task that promotes sample-efficient language model pretraining on small, developmentally plausible corpora, with an optional multimodal setting.
- This challenge introduces new competition tracks—including a paper track and vision-language track—to assess both linguistic and multimodal performance.
- Evaluations span grammaticality, pragmatic reasoning, and classification benchmarks, offering actionable insights to bridge the child–machine data-efficiency gap.
The 2nd BabyLM Challenge is an academic shared task designed to empirically advance language model (LM) sample efficiency and cognitive plausibility by restricting pretraining to small, developmentally plausible corpora, mirroring the limited linguistic input available to young children. This edition builds directly on the inaugural 2023 challenge, introducing new competition tracks, relaxing the data-construction rules, and adding a multimodal vision-and-language training regime. Evaluations span grammaticality, downstream classification, pragmatic reasoning, and multimodal grounding, with innovations designed for cumulative benchmarking and comparative research (Choshen et al., 9 Apr 2024).
1. Motivations and Evolution of the Challenge
The challenge arose from the observation that humans, and specifically children, achieve remarkable linguistic generalization after exposure to fewer than 100 million words, whereas standard LLMs require roughly 1,000×–10,000× more data for comparable performance and often lack the robustness and cognitive plausibility sought in scientific modeling. The 2024/2025 BabyLM Challenge retains the focus on sample-efficient LM training under tightly controlled word budgets but introduces several notable modifications: (i) participants may assemble custom corpora within the stated budgets (100 M or 10 M words), (ii) the previous 'Loose' track is replaced with a 'Paper' track to incentivize non-model contributions, and (iii) a vision-and-language track is established to reflect the multimodal nature of human language acquisition (Choshen et al., 9 Apr 2024).
2. Competition Tracks, Data Rules, and Rationale
The challenge accommodates three main tracks:
- Track A (Strict & Strict-Small): Participants train language-only LMs using, at most, 100 M or 10 M words, respectively. The task suite is fixed to isolate modeling advances.
- Track B (Vision-Language): Models are trained on up to 100 M words (counting only the text), with guidance to split the budget 50/50 between text-only and image-text paired data: 50 M words from curated texts, supplemented by 50 M words paired with images from Localized Narratives and Conceptual Captions 3M. Models in this track are evaluated on both the classical linguistic benchmarks and image-conditioned (multimodal) log-likelihood estimation.
- Track C (Paper-Only): Submissions may consist of archival papers proposing new cognitively inspired benchmarks, analysis techniques, or evaluation metrics in lieu of end-to-end model training.
Key rule changes include relaxed corpus selection (a standardized datasheet is required for custom data), retention of strict word-count budgets, and the addition of a multimodal track, reflecting evidence that multimodal interaction is central to early language learning in humans (Choshen et al., 9 Apr 2024).
3. Corpus Curation and Pretraining Methodology
Participants may construct training corpora using sources such as child-directed speech (CHILDES), children’s books (Project Gutenberg), dialogue corpora (BNC), movie subtitles (OpenSubtitles), and Simple English Wikipedia, provided the total does not exceed the track’s word budget. A standardized datasheet is mandated for novel datasets, aligning with established best practices for transparency and ethical curation. For the vision-language track, organizers provide 50 M words of image-caption pairs (Localized Narratives, Conceptual Captions 3M) and 50 M words of text-only data. All source splits are public, with accompanying preprocessing scripts (Choshen et al., 9 Apr 2024).
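As a concrete illustration of budget-constrained corpus assembly, the following minimal sketch mixes several candidate sources under the Strict-Small budget; the file names and mixture proportions are illustrative assumptions rather than an official recipe.

```python
# Minimal sketch of assembling a Strict-Small (10M-word) corpus from candidate
# sources. File names and mixture proportions are illustrative, not official.
WORD_BUDGET = 10_000_000

# Hypothetical source files with target shares of the word budget.
sources = {
    "childes_child_directed.txt": 0.40,
    "gutenberg_childrens_books.txt": 0.25,
    "opensubtitles_dialogue.txt": 0.20,
    "simple_wikipedia.txt": 0.15,
}

def take_words(path: str, max_words: int) -> list[str]:
    """Read whitespace-delimited words from a file, up to max_words."""
    words: list[str] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            words.extend(line.split())
            if len(words) >= max_words:
                break
    return words[:max_words]

corpus: list[str] = []
for path, share in sources.items():
    corpus.extend(take_words(path, int(share * WORD_BUDGET)))

assert len(corpus) <= WORD_BUDGET, "corpus exceeds the track's word budget"
print(f"Assembled corpus: {len(corpus):,} words")
```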
Corpus selection and filtering have proven crucial; research demonstrates that structural richness and relevance to child input (e.g., inclusion of media dialogue) yield higher zero-shot syntactic and NLU scores than general-domain corpora such as MADLAD (Ghanizadeh et al., 6 Mar 2025). Vocabulary size tuning (typically ~32k tokens), stringent deduplication, and curriculum learning schedules that incrementally expose models to more complex data further augment efficiency.
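The vocabulary-size and deduplication choices described above can be realized, for example, with the Hugging Face tokenizers library. The sketch below filters exact-duplicate lines and trains a ~32k-entry byte-level BPE vocabulary; the file names and special tokens are illustrative assumptions, not a prescribed configuration.

```python
# Sketch: exact-duplicate line filtering plus training a ~32k-entry BPE
# vocabulary with the Hugging Face `tokenizers` library. File names and
# special tokens are illustrative.
import hashlib
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def deduplicate_lines(in_path: str, out_path: str) -> None:
    """Drop exact duplicate lines by hashing; a simple stand-in for the more
    aggressive near-duplicate filtering some submissions applied."""
    seen: set[str] = set()
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            digest = hashlib.sha1(line.strip().encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                fout.write(line)

deduplicate_lines("babylm_10M.txt", "babylm_10M.dedup.txt")

# Byte-level BPE with a vocabulary in the ~32k range reported to work well.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(["babylm_10M.dedup.txt"], trainer)
tokenizer.save("babylm_bpe_32k.json")
```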
4. Evaluation Protocols and Benchmarks
Evaluations are conducted via an official pipeline (Google Colab with catwalk integration; local execution is also supported). Models are required to assign (pseudo) log-likelihoods to text strings and demonstrate fine-tunability for classification tasks. Principal benchmarks include:
- BLiMP and BLiMP Supplement: Minimal-pair acceptability judgments covering syntax, agreement, and pragmatic phenomena.
- (Super)GLUE: Suite of classification and QA tasks enabling downstream performance assessment via fine-tuning.
- Pragmatic and Commonsense Reasoning: world-knowledge probing via EWoK and additional hidden tasks.
- Multimodal Tasks: vision-language models must additionally perform image-conditioned generation and scoring on the VQA and Winoground datasets.
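A rough sketch of image-conditioned caption scoring in the Winoground style is shown below. It uses a GIT-family checkpoint from Hugging Face transformers as a stand-in for the official multimodal baseline; the checkpoint name and the per-token scoring convention are assumptions, not the challenge's pipeline.

```python
# Rough sketch of image-conditioned caption scoring (Winoground-style), using
# a GIT-family checkpoint from Hugging Face as a stand-in for the official
# multimodal baseline. Checkpoint name and scoring convention are assumptions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base").eval()

def caption_score(image: Image.Image, caption: str) -> float:
    """Mean per-token log-likelihood of a caption conditioned on an image."""
    inputs = processor(images=image, text=caption, return_tensors="pt")
    with torch.no_grad():
        out = model(input_ids=inputs.input_ids,
                    pixel_values=inputs.pixel_values,
                    labels=inputs.input_ids)
    return -out.loss.item()  # higher (less negative) means more plausible

image = Image.open("example.jpg")  # illustrative path
for caption in ["a dog chases a ball", "a ball chases a dog"]:
    print(caption, caption_score(image, caption))
```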
Baselines are provided (GPT-2, LTG-BERT, Contextualizer for text-only; GIT and Flamingo for multimodal), and success is measured both by improvements over these baselines under identical data budgets and by insights into multimodal grounding or cognitive plausibility (Choshen et al., 9 Apr 2024).
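For the text-only benchmarks, the required (pseudo) log-likelihood interface amounts to comparing sentence scores. A minimal sketch of BLiMP-style minimal-pair scoring follows; the public "gpt2" checkpoint stands in for a BabyLM-trained GPT-2 baseline, and the example pair is illustrative.

```python
# Sketch of BLiMP-style minimal-pair scoring: the grammatical sentence should
# receive a higher total log-likelihood than its minimally different
# counterpart. The public "gpt2" checkpoint stands in for a BabyLM-trained
# GPT-2 baseline; the example pair is illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of next-token log-probabilities under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # `loss` is the mean negative log-likelihood over the predicted tokens.
    return -out.loss.item() * (ids.size(1) - 1)

good = "The cats that the dog chases are hungry."
bad = "The cats that the dog chases is hungry."
print("prefers grammatical sentence:", sentence_logprob(good) > sentence_logprob(bad))
```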
5. Key Innovations and Empirical Findings
Submissions frequently employed hybrid causal-masked objectives, architectural advances (e.g., Every-Layer-Counts BERT, GEGLU feed-forward modules, disentangled attention), and exhaustive pretraining regimens (up to 2,000 epochs in some cases) (Warstadt et al., 10 Apr 2025, Hu et al., 6 Dec 2024). Ensemble knowledge distillation (BabyLlama-2) was shown to outperform teachers and baselines significantly even with identical model size and data (Tastet et al., 25 Sep 2024). Data augmentation approaches, such as synthetic story generation and corpus-interleaving (Contextualizer), produced notable gains in GLUE and world-knowledge tasks (Theodoropoulos et al., 20 Oct 2024, Warstadt et al., 10 Apr 2025).
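A minimal sketch of the ensemble-distillation idea follows: the student is trained against the averaged predictive distribution of several teachers plus the usual cross-entropy on hard labels. The temperature, mixing weight, and tensor shapes are illustrative assumptions, not the BabyLlama-2 hyperparameters.

```python
# Minimal sketch of ensemble knowledge distillation: the student is trained on
# a mix of the usual next-token cross-entropy and a KL term toward the
# averaged teacher distribution. Temperature and mixing weight are
# illustrative assumptions, not the BabyLlama-2 hyperparameters.
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits: torch.Tensor,        # (B, S, V)
                               teacher_logits_list: list[torch.Tensor],
                               labels: torch.Tensor,                 # (B, S)
                               temperature: float = 2.0,
                               alpha: float = 0.5) -> torch.Tensor:
    # Average the teachers' temperature-softened predictive distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    # KL divergence between the ensemble distribution and the student.
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  teacher_probs, reduction="batchmean") * temperature ** 2
    # Standard cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)),
                         labels.reshape(-1))
    return alpha * kl + (1.0 - alpha) * ce
```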
Curriculum learning attempts, while conceptually motivated, yielded mixed results: context/window-size restriction provided consistent benefits for syntactic generalization, but complexity-based or vocabulary-stage curricula offered only marginal or non-significant improvements (Martinez et al., 2023, Edman et al., 2023). Paraphrastic and contrastive data, inspired by L2 pedagogy, proved more effective at boosting NLU metrics than explicit instructive knowledge (e.g., definitions or grammatical tags) (Edman et al., 28 Oct 2024). StructFormer-style architectural biases improved performance on several syntactic phenomena, supporting the use of hierarchical inductive priors under scarce data (Momen et al., 2023).
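The context-size restriction finding can be operationalized as a simple schedule that trains on short windows first and widens them later. The phase boundaries and lengths in the sketch below are illustrative assumptions, not taken from any submission.

```python
# Sketch of a context-length curriculum: train on short windows early (which
# the cited results associate with better syntactic generalization) and widen
# later. Phase boundaries and lengths are illustrative, not from any paper.
def context_length_at(step: int, total_steps: int) -> int:
    frac = step / total_steps
    if frac < 0.5:
        return 128   # short windows for the first half of training
    if frac < 0.8:
        return 256
    return 512       # full window for the final phase

def rechunk(token_ids: list[int], step: int, total_steps: int) -> list[list[int]]:
    """Re-chunk a token stream into training sequences of the current length."""
    length = context_length_at(step, total_steps)
    return [token_ids[i:i + length] for i in range(0, len(token_ids), length)]
```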
6. Impact, Recommendations, and Open Challenges
The challenge has demonstrated that architectural modifications, carefully constructed objectives, and targeted data augmentation together can close a substantial portion of the child–machine data-efficiency gap for linguistic and NLU generalization (Hu et al., 6 Dec 2024). RNN architectures remain competitive at these scales. Task-dependent context-size selection (256–512 tokens for syntax, up to 8192 for document-level tasks) yields significant compute savings with no drop in performance (Salhan et al., 22 Oct 2025).
Multimodal grounding remains an unsolved frontier—none of the 2024 vision-language submissions outperformed strong text-only adaptations of existing V+L architectures under the 100 M-word regime (Hu et al., 6 Dec 2024). Future directions include more flexible evaluation protocols, explicit tracking of compute/FLOPs, exploration of unsupervised structural bias (hybrid parsing-tree induction), advanced distillation schemes, and extension to L2 and paraphrase-driven curricula.
A plausible implication is that carefully chosen mixture ratios of curated child-oriented and synthetic data, combined with hybrid CLM/MLM objectives, represent the current best practice for sample-efficient LM pretraining under developmentally plausible budgets (Hu et al., 6 Dec 2024, Theodoropoulos et al., 20 Oct 2024).
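One way to read the hybrid CLM/MLM recommendation is as a per-batch choice between a next-token loss and a masked-prediction loss on the same model. The sketch below illustrates that idea; the mask rate, mixing probability, and mask-token handling are assumptions, not any specific submission's recipe.

```python
# Sketch of a hybrid causal/masked objective: each batch is routed either to a
# standard next-token (CLM) loss or to a masked-prediction loss on the same
# model. Mask rate, mixing probability, and mask-token handling are
# illustrative assumptions, not any specific submission's recipe.
import torch
import torch.nn.functional as F

def hybrid_lm_loss(model, input_ids: torch.Tensor, mask_token_id: int,
                   p_mlm: float = 0.5, mask_rate: float = 0.15) -> torch.Tensor:
    """`model(input_ids)` is assumed to return logits of shape (B, S, V)."""
    if torch.rand(()) < p_mlm:
        # MLM-style step: corrupt a random subset of positions and predict
        # only those positions.
        mask = torch.rand(input_ids.shape) < mask_rate
        corrupted = input_ids.masked_fill(mask, mask_token_id)
        logits = model(corrupted)
        labels = input_ids.masked_fill(~mask, -100)   # ignore unmasked tokens
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1), ignore_index=-100)
    # CLM step: predict each token from its left context only.
    logits = model(input_ids)
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           input_ids[:, 1:].reshape(-1))
```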
7. Timeline, Community Process, and Accessibility
The 2nd BabyLM Challenge follows a clearly defined timeline in 2024: data release (March 30), evaluation pipeline launch (April 30), submission deadlines (September 13/20), peer review and leaderboard publication (October), and conference participation (December, NeurIPS subject to acceptance) (Choshen et al., 9 Apr 2024). All corpora, baselines, evaluation scripts, and code for many submissions are publicly available. Participants submit models and papers via Dynabench and OpenReview, with a community-driven peer-review process and active engagement in shared infrastructure improvements.
References:
Choshen et al. (9 Apr 2024); Hu et al. (6 Dec 2024); Warstadt et al. (10 Apr 2025); Theodoropoulos et al. (20 Oct 2024); Martinez et al. (2023); Tastet et al. (25 Sep 2024); Salhan et al. (22 Oct 2025); Edman et al. (2023); Edman et al. (28 Oct 2024); Momen et al. (2023); Ghanizadeh et al. (6 Mar 2025).