Matching Domain Experts by Training from Scratch on Domain Knowledge
The paper under discussion investigates whether domain-specific training can bring smaller language models (LMs) up to expert-level performance, using neuroscience as the test domain. It challenges the notion that large language models (LLMs), with their vast parameter counts and extensive training corpora, are the only viable path to superior predictive accuracy in specialized fields. The authors run their experiments on a 124M-parameter GPT-2 model, probing how much tailored training data and custom tokenization contribute to expert-level performance.
The assessment centers on this 124-million-parameter GPT-2, trained on 1.3 billion tokens of neuroscience text. Evaluation uses BrainBench, a benchmark of forward-looking prediction in neuroscience built from real scientific abstracts whose reported outcomes have been strategically modified. Despite GPT-2's far smaller size than the LLMs evaluated by Luo et al., the authors show that, with the right training strategy, such a model can reach accuracy comparable to expert neuroscientists.
Importantly, the paper demonstrates two effective training routes: fine-tuning a pretrained model on domain-specific data, and training a model from scratch with a tokenizer built from the same domain corpus. Both approaches reach roughly 63% accuracy on BrainBench, on par with the human experts' 63.4%. That these results are achieved with a fraction of the parameters and training data of larger LLMs underscores how much domain specificity in the training procedure matters.
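As a minimal sketch of how the two routes differ in setup, assuming the Hugging Face transformers library and the standard 124M GPT-2 configuration (the training loop, omitted here, would use the same causal-language-modeling objective in both cases):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Route 1 (fine-tune): start from the pretrained 124M GPT-2 checkpoint and
# continue training on the neuroscience corpus.
ft_model = GPT2LMHeadModel.from_pretrained("gpt2")

# Route 2 (from scratch): the same 124M architecture, randomly initialized.
# GPT2Config's defaults (12 layers, 768 hidden units, 12 heads) match this size;
# this route would additionally swap in the domain tokenizer sketched below.
scratch_model = GPT2LMHeadModel(GPT2Config())
```

The only structural difference between the routes is the starting weights; the from-scratch route pairs the fresh model with the domain tokenizer described next.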
The customized tokenizer shows the value of preserving domain-specific terminology: the neuro-tokenizer contains twice the proportion of neuroscience-related tokens found in a generic tokenizer. Because domain terms occupy fewer tokens, more context fits within the model's fixed context window and terminology survives intact, which translates into better model performance.
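A sketch of how such a domain tokenizer could be built with transformers' `train_new_from_iterator`, which retrains GPT-2's BPE merges on domain text; the corpus filename and batching scheme are hypothetical stand-ins for the paper's actual neuroscience corpus:

```python
from transformers import AutoTokenizer

def corpus_batches(path="neuro_abstracts.txt", batch_size=1_000):
    # Hypothetical corpus file: one abstract (or line of text) per line.
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

base = AutoTokenizer.from_pretrained("gpt2")
# Retrain GPT-2's BPE merges on domain text at the same vocabulary size, so
# frequent neuroscience terms become single tokens rather than subword splits.
neuro_tok = base.train_new_from_iterator(corpus_batches(), vocab_size=len(base))
neuro_tok.save_pretrained("neuro-tokenizer")
```

Comparing `base.tokenize(...)` and `neuro_tok.tokenize(...)` on a term like "optogenetics" would typically show the generic tokenizer splitting it into several subwords while the retrained one keeps it closer to whole, the term-fidelity effect described above.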
Methodologically, models are evaluated with perplexity: for each BrainBench item, the model is scored correct when it assigns lower perplexity to the original abstract than to the altered one. Interestingly, the paper further establishes that domain-trained models not only match human competency but also capture the statistical underpinnings of scientific discovery, hinting at a future role for LMs as robust tools for experimental prediction and hypothesis generation.
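A sketch of this perplexity-based scoring, assuming the transformers API; the decision rule (lower perplexity on the original counts as correct) follows the description above, while the function names are illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    # Mean negative log-likelihood over the abstract's tokens, exponentiated.
    ids = tok(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # cross-entropy averaged over tokens
    return torch.exp(loss).item()

def is_correct(original_abstract: str, altered_abstract: str) -> bool:
    # BrainBench-style scoring: the model is credited with a correct answer
    # when it finds the true abstract less surprising than the altered one.
    return perplexity(original_abstract) < perplexity(altered_abstract)
```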
These findings suggest that moderately sized LMs, when trained thoughtfully on domain-relevant data with domain-aware tokenization, can serve specialized fields like neuroscience effectively, putting advanced AI capabilities within reach of smaller research groups with limited computational resources. High-level performance, in other words, does not depend solely on sheer parameter count or corpus breadth, which opens the door to more resource-efficient, specialized AI applications.
Looking ahead, this research invites further investigation into the balance between model scale and domain specificity, as well as applications in other knowledge-intensive fields such as medical diagnostics or legal document analysis. It shifts attention toward efficient deployment: strategic data preparation and tailored tokenization can deliver expert-level capability without the computational overhead typically associated with larger LLMs.