Pre-training LLMs using human-like development data corpus (2311.04666v4)
Abstract: Pre-trained LLMs have shown success in a diverse set of language inference and understanding tasks. The pre-training stage exposes LLMs to a large corpus of raw textual data. The BabyLM shared task compares LLM pre-training to human language acquisition, noting that the number of tokens seen by 13-year-old children is orders of magnitude smaller than the number of tokens seen by LLMs. In this work, we pre-train and evaluate LLMs on their ability to learn contextual word representations using roughly the same number of tokens as seen by children. We provide a strong set of baselines with different architectures, evaluate changes in performance across epochs, and report pre-training metrics for the strict-small and strict tracks of the task. We also loosely replicate the RoBERTa baseline provided by the task organizers to examine the robustness of training to hyperparameter selection and its replicability. We provide the submission details for the strict and strict-small tracks in this report.
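To make the setup concrete, the following is a minimal sketch of pre-training a small RoBERTa-style masked language model from scratch on a developmentally plausible corpus of roughly 10M tokens (the strict-small setting), assuming the Hugging Face `datasets` and `transformers` libraries. The file paths (`babylm_10M.txt`, `babylm_dev.txt`, `./babylm_tokenizer`), model dimensions, and hyperparameters are illustrative placeholders, not the authors' reported configuration.

```python
# Minimal sketch (illustrative, not the paper's exact setup): pre-train a
# small RoBERTa-style masked LM from scratch on a ~10M-token text corpus.
from datasets import load_dataset
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical paths: plain-text training/dev splits and a tokenizer
# trained on the same corpus.
dataset = load_dataset(
    "text", data_files={"train": "babylm_10M.txt", "validation": "babylm_dev.txt"}
)
tokenizer = RobertaTokenizerFast.from_pretrained("./babylm_tokenizer")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Small configuration; the paper compares several architectures, so treat
# these sizes as placeholders.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=1024,
    max_position_embeddings=130,  # max_length + 2 (RoBERTa position offset)
)
model = RobertaForMaskedLM(config)

# Standard masked-LM objective with 15% masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Argument names follow the Transformers 4.x Trainer API.
args = TrainingArguments(
    output_dir="./babylm_roberta",
    per_device_train_batch_size=64,
    num_train_epochs=10,            # performance can be tracked across epochs
    learning_rate=1e-4,
    evaluation_strategy="epoch",    # validation loss reported each epoch
    save_strategy="epoch",
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
)
trainer.train()
```

Evaluating the resulting checkpoints with the BabyLM evaluation pipeline (e.g., BLiMP and SuperGLUE-style probes) at each epoch is how per-epoch performance changes, as described in the abstract, would be observed.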