
Pretraining on the Test Set Is All You Need (2309.08632v1)

Published 13 Sep 2023 in cs.CL and cs.AI

Abstract: Inspired by recent work demonstrating the promise of smaller Transformer-based LLMs pretrained on carefully curated data, we supercharge such approaches by investing heavily in curating a novel, high quality, non-synthetic data mixture based solely on evaluation benchmarks. Using our novel dataset mixture consisting of less than 100 thousand tokens, we pretrain a 1 million parameter transformer-based LLM phi-CTNL (pronounced "fictional") that achieves perfect results across diverse academic benchmarks, strictly outperforming all known foundation models. phi-CTNL also beats power-law scaling and exhibits a never-before-seen grokking-like ability to accurately predict downstream evaluation benchmarks' canaries.

Introduction

To explore the limits of efficiency in LLM pretraining, the paper introduces phi-CTNL, a transformer-based LLM with only 1 million parameters. The work positions itself against conventional large-scale pretraining, arguing that a small model can reach high performance when trained on a meticulously curated, high-quality dataset. phi-CTNL achieves perfect scores across diverse academic benchmarks, outperforming all known foundation models, and exhibits a "grokking-like" ability to accurately predict the canary strings embedded in downstream evaluation benchmarks.

Pretraining Data

To accomplish these results, the phi-CTNL model is pretrained on fewer than 100 thousand tokens drawn from the very evaluation benchmarks it is eventually tested on, including well-known datasets such as the AI2 Reasoning Challenge, BoolQ, and SQuAD. The paper argues that this targeted pretraining on benchmark data yields far better results than pretraining on a broader array of datasets, which underpins the model's unprecedented performance.
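
To make the (satirical) recipe concrete, here is a minimal sketch of assembling such a benchmark-only pretraining mixture. This is not the authors' code; the dataset identifiers, field names, and whitespace token counting follow common Hugging Face conventions and are assumptions for illustration only.

```python
# Minimal sketch (not from the paper): build a tiny pretraining mixture
# directly from benchmark evaluation splits, up to a rough token budget.
from datasets import load_dataset

def benchmark_mixture(max_tokens: int = 100_000) -> str:
    """Concatenate benchmark questions until a rough token budget is reached."""
    # (dataset name, config, split, text field) -- all assumed identifiers.
    sources = [
        ("boolq", None, "validation", "question"),
        ("squad", None, "validation", "question"),
        ("ai2_arc", "ARC-Challenge", "test", "question"),
    ]
    pieces, budget = [], max_tokens
    for name, config, split, field in sources:
        for row in load_dataset(name, config, split=split):
            n = len(row[field].split())  # crude whitespace "token" count
            if n > budget:
                return "\n".join(pieces)
            pieces.append(row[field])
            budget -= n
    return "\n".join(pieces)
```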

Novel Capabilities

The paper claims two groundbreaking properties for phi-CTNL. First, it learns faster than the power-law relationship between loss and pretraining compute would predict, i.e., it beats traditional power-law scaling. Second, it exhibits a grokking-like ability to abruptly and accurately predict benchmark canaries: unique strings embedded in evaluation datasets so that a model reproducing them reveals that the benchmark leaked into its training data. This sudden leap in canary prediction is a property not observed in other models.
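
For reference on the scaling claim, the sketch below fits the power-law baseline L(C) = a · C^(-alpha) that phi-CTNL is said to beat. The compute and loss values are invented for illustration and do not come from the paper.

```python
# Illustrative only: fit a power-law scaling baseline in log-log space.
import numpy as np

compute = np.array([1e15, 1e16, 1e17, 1e18])  # training FLOPs (hypothetical)
loss = np.array([3.2, 2.6, 2.1, 1.7])         # eval loss (hypothetical)

# A power law L(C) = a * C**(-alpha) is a straight line in log-log space:
# log L = log a - alpha * log C.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha, a = -slope, np.exp(intercept)
print(f"fitted baseline: L(C) ~ {a:.2f} * C^(-{alpha:.3f})")
# "Beating power-law scaling" would mean observed losses falling below this
# fitted curve as compute grows.
```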

Discussion

The paper concludes that phi-CTNL, with far fewer parameters, not only outshines larger models on academic evaluations but also prompts a rethink of the industry's focus on ever-larger models, with data quality and careful curation emerging as the pivotal factors for pretraining effectiveness. The authors then reveal the twist: the paper is a satire, written to encourage readers to critically assess ambitious claims in AI research and to pay attention to data contamination, that is, evaluation data leaking into pretraining corpora. The disclaimer underscores the need for rigorous, contamination-aware evaluation of LLMs as datasets and model sizes continue to grow.
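
For readers who want to probe such claims themselves, a common contamination audit is an n-gram overlap check between benchmark examples and the pretraining corpus. The sketch below is illustrative (whitespace tokenization, 13-gram window are assumptions) and is not described in the paper.

```python
# Hedged sketch of a data-contamination check via n-gram overlap.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All lowercase whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_examples: list[str],
                       corpus_text: str, n: int = 13) -> float:
    """Fraction of benchmark examples sharing at least one n-gram with the corpus."""
    corpus_grams = ngrams(corpus_text, n)
    hits = sum(1 for ex in benchmark_examples if ngrams(ex, n) & corpus_grams)
    return hits / max(len(benchmark_examples), 1)
```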

Authors (1)
  1. Rylan Schaeffer (33 papers)
Citations (21)