ChapGTP, ILLC's Attempt at Raising a BabyLM: Improving Data Efficiency by Automatic Task Formation (2310.11282v1)

Published 17 Oct 2023 in cs.CL

Abstract: We present the submission of the ILLC at the University of Amsterdam to the BabyLM challenge (Warstadt et al., 2023), in the strict-small track. Our final model, ChapGTP, is a masked language model that was trained for 200 epochs, aided by a novel data augmentation technique called Automatic Task Formation. We discuss in detail the performance of this model on the three evaluation suites: BLiMP, (Super)GLUE, and MSGS. Furthermore, we present a wide range of methods that were ultimately not included in the model, but may serve as inspiration for training LMs in low-resource settings.
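
The headline recipe in the abstract (masked language modeling for 200 epochs on a small fixed corpus) is standard enough to sketch. Below is a minimal, hypothetical illustration using Hugging Face Transformers; the RoBERTa-style architecture, the off-the-shelf tokenizer, the file name babylm_train.txt, and all hyperparameters except the epoch count are assumptions for illustration, and the paper's Automatic Task Formation augmentation is not reproduced here because the abstract does not specify its mechanism.

    # A minimal sketch of masked-LM pretraining, not the authors' implementation.
    # Hypothetical names: "babylm_train.txt"; hyperparameters are illustrative.
    from datasets import load_dataset
    from transformers import (
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        RobertaConfig,
        RobertaForMaskedLM,
        Trainer,
        TrainingArguments,
    )

    # Placeholder tokenizer; a real strict-small submission would train one
    # on the BabyLM corpus itself.
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")

    # Fresh weights from a config, i.e. no pretrained model, matching the
    # track's restriction to the provided low-resource data.
    model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))

    # Hypothetical file standing in for the strict-small training corpus.
    data = load_dataset("text", data_files={"train": "babylm_train.txt"})["train"]
    data = data.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True,
        remove_columns=["text"],
    )

    # Dynamic masking: 15% of tokens per batch become prediction targets.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir="mlm-sketch",
        num_train_epochs=200,  # the epoch count reported in the abstract
        per_device_train_batch_size=32,
        learning_rate=1e-4,
        save_strategy="no",
    )
    Trainer(model=model, args=args, train_dataset=data, data_collator=collator).train()

Initializing from a fresh config rather than pretrained weights mirrors the strict-small track's restriction to the provided low-resource data; the paper's Automatic Task Formation would additionally augment this fixed training set in a way only the paper itself specifies.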

References (65)
  1. The AMARA corpus: Building parallel language resources for the educational domain. In LREC, volume 14, pages 1044–1054.
  2. Transferring Inductive Biases through Knowledge Distillation. ArXiv:2006.00555 [cs, stat].
  3. Nur Ahmed and Muntasir Wahed. 2020. The De-democratization of AI: Deep Learning and the Compute Divide in Artificial Intelligence Research. ArXiv:2010.15581 [cs].
  4. Diane Bouchacourt and Marco Baroni. 2018. How agents see things: On visual representations in an emergent language game. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 981–985.
  5. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  6. Grzegorz Chrupała. 2023. Putting Natural in Natural Language Processing. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7820–7827, Toronto, Canada. Association for Computational Linguistics.
  7. Generalising to German plural noun classes, from the perspective of a recurrent neural network. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 94–108, Online. Association for Computational Linguistics.
  8. Unsupervised latent tree induction with deep inside-outside recursive autoencoders. In North American Association for Computational Linguistics.
  9. Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4508–4513, Online. Association for Computational Linguistics.
  10. A Survey of Data Augmentation Approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics.
  11. Martin Gerlach and Francesc Font-Clos. 2018. A standardized project gutenberg corpus for statistical analysis of natural language and quantitative linguistics. CoRR, abs/1812.08092.
  12. Prosodic Bootstrapping. In Carlos Gussenhoven and Aoju Chen, editors, The Oxford Handbook of Language Prosody. Oxford University Press.
  13. Serhii Havrylov and Ivan Titov. 2017. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. Advances in neural information processing systems, 30.
  14. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  15. The Goldilocks principle: Reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301.
  16. An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems, 35:30016–30030.
  17. An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 385–393, Dublin, Ireland. Association for Computational Linguistics.
  18. BabyBERTa: Learning more grammar with small-scale child-directed language. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 624–646, Online. Association for Computational Linguistics.
  19. Question Answering Infused Pre-training of General-Purpose Contextualized Representations. In Findings of the Association for Computational Linguistics: ACL 2022, pages 711–728, Dublin, Ireland. Association for Computational Linguistics.
  20. Jean Kaddour. 2023. The MiniPile challenge for data-efficient language models. arXiv preprint arXiv:2304.08442.
  21. Scaling laws for neural language models. CoRR, abs/2001.08361.
  22. Simon Kirby. 2002. Natural language from artificial life. Artificial life, 8(2):185–215.
  23. Compression and communication in the cultural evolution of linguistic structure. Cognition, 141:87–102.
  24. Natural language does not emerge ‘naturally’ in multi-agent dialog. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2962–2967.
  25. Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
  26. Angeliki Lazaridou and Marco Baroni. 2020. Emergent multi-agent communication in the deep learning era. arXiv preprint arXiv:2006.02419.
  27. Multi-agent communication meets natural language: Synergies between functional and structural language learning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7663–7674.
  28. Deduplicating Training Data Makes Language Models Better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland. Association for Computational Linguistics.
  29. Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators. In Proceedings of the 5th International Conference on Conversational User Interfaces, CUI ’23, pages 1–6, New York, NY, USA. Association for Computing Machinery.
  30. Tal Linzen. 2020. How Can We Accelerate Progress Towards Human-like Linguistic Generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217, Online. Association for Computational Linguistics.
  31. Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles.
  32. RoBERTa: A robustly optimized BERT pretraining approach.
  33. Towards understanding grokking: An effective theory of representation learning. In Advances in Neural Information Processing Systems.
  34. Omnigrok: Grokking beyond algorithmic data. In The Eleventh International Conference on Learning Representations.
  35. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
  36. Internal and external pressures on language emergence: least effort, object constancy and frequency. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4428–4437.
  37. Brian MacWhinney. 2000. The CHILDES project: The database, volume 2. Psychology Press.
  38. Prosodic cues enhance infants’ sensitivity to nonadjacent regularities. Science Advances, 9(15):eade4083. Publisher: American Association for the Advancement of Science.
  39. R. Thomas McCoy and Thomas L. Griffiths. 2023. Modeling rapid language learning by distilling Bayesian priors into artificial neural networks. ArXiv:2305.14701 [cs].
  40. Inverse scaling: When bigger isn’t better.
  41. Swaroop Mishra and Bhavdeep Singh Sachdeva. 2020. Do we need to create big datasets to learn a task? In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 169–173, Online. Association for Computational Linguistics.
  42. Grokking of hierarchical structure in vanilla transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 439–448, Toronto, Canada. Association for Computational Linguistics.
  43. Isabel Papadimitriou and Dan Jurafsky. 2023. Pretrain on just structure: Understanding linguistic inductive biases using transfer learning. ArXiv:2304.13060 [cs].
  44. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
  45. Grokking: Generalization beyond overfitting on small algorithmic datasets.
  46. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
  47. Green AI. Commun. ACM, 63(12):54–63.
  48. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
  49. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.
  50. Noam Shazeer. 2020. GLU variants improve transformer.
  51. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
  52. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational linguistics, 26(3):339–373.
  53. RoFormer: Enhanced transformer with rotary position embedding.
  54. Hierarchical representation and estimation of prosody using continuous wavelet transform. Computer Speech & Language, 45:123–136.
  55. The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon.
  56. LLaMA: Open and efficient foundation language models.
  57. Quantifying the perceptual value of lexical and non-lexical channels in speech. In Interspeech 2023: 24th Annual Conference of the International Speech Communication Association.
  58. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. Curran Associates Inc., Red Hook, NY, USA.
  59. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
  60. Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge. Association for Computational Linguistics (ACL).
  61. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392.
  62. Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 217–235, Online. Association for Computational Linguistics.
  63. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  64. Linking emergent and natural languages via corpus transfer. arXiv preprint arXiv:2203.13344.
  65. Biao Zhang and Rico Sennrich. 2019. Root Mean Square Layer Normalization. Curran Associates Inc., Red Hook, NY, USA.
Authors (6)
  1. Jaap Jumelet
  2. Michael Hanna
  3. Marianne de Heer Kloots
  4. Anna Langedijk
  5. Charlotte Pouw
  6. Oskar van der Wal
Citations (3)