[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus (2404.06214v2)
Abstract: After last year's successful BabyLM Challenge, the competition will be hosted again in 2024/2025. The overarching goals of the challenge remain the same; however, some of the competition rules will be different. The big changes for this year's competition are as follows: First, we replace the loose track with a paper track, which allows (for example) non-model-based submissions, novel cognitively-inspired benchmarks, or analysis techniques. Second, we are relaxing the rules around pretraining data, and will now allow participants to construct their own datasets provided they stay within the 100M-word or 10M-word budget. Third, we introduce a multimodal vision-and-language track, and will release a corpus of 50% text-only and 50% image-text multimodal data as a starting point for LM training. The purpose of this CfP is to provide the rules for this year's challenge, explain these rule changes and their rationale in greater detail, give a timeline of this year's competition, and answer frequently asked questions from last year's challenge.
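The word budgets above apply to the pretraining corpus a participant assembles. As a minimal sketch (assuming whitespace-separated word counting, which may differ from the challenge's official counting method), a budget check over a set of corpus files could look like:

```python
# Hypothetical budget check: count whitespace-separated words across text
# files and compare against the 10M/100M-word limit. The official challenge
# may define "word" differently; this is only an illustrative sketch.
from pathlib import Path


def corpus_word_count(paths):
    """Total whitespace-separated word count across the given text files."""
    total = 0
    for p in paths:
        total += len(Path(p).read_text(encoding="utf-8").split())
    return total


def within_budget(paths, budget=100_000_000):
    """True if the corpus fits within the given word budget."""
    return corpus_word_count(paths) <= budget
```

A participant could run such a check before submission to confirm their custom dataset stays under the chosen track's limit.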
Authors:
- Leshem Choshen
- Ryan Cotterell
- Michael Y. Hu
- Tal Linzen
- Aaron Mueller
- Candace Ross
- Alex Warstadt
- Ethan Wilcox
- Adina Williams
- Chengxu Zhuang