- The paper introduces a self-training framework in which Transformers iteratively generate and learn from their own solutions, progressively tackling harder task difficulties.
- It employs techniques like length filtering and majority voting to enhance accuracy and achieve near-perfect generalization on tasks such as arithmetic and maze solving.
- Experiments demonstrate that pretrained models with a calibrated difficulty schedule accelerate self-improvement while mitigating error avalanches due to label noise.
The paper introduces a self-improvement framework for training Transformer models to overcome challenges in length and easy-to-hard generalization. The approach involves iteratively generating and learning from the model's own solutions, progressively tackling harder problems without modifying the base Transformer architecture. The framework's effectiveness is demonstrated across diverse tasks, including arithmetic, string manipulation, and maze solving.
The key contributions and findings include:
- A self-training framework is applied to train Transformers on arithmetic, maze, and string manipulation tasks, successfully generalizing to extreme out-of-distribution test data.
- A carefully crafted self-improvement schedule and label filtering based on length and majority voting are highlighted as central to consistent self-improvement.
- The rate of self-improvement can be exponential, and starting from pretrained models further accelerates easy-to-hard generalization.
- Key failure modes, in which label noise in self-generated data leads to an error avalanche, are investigated, along with methods for overcoming them through weak verification.
The paper evaluates the approach on several tasks:
- Reverse Addition: Models trained with self-improvement generalize from 10-digit to 100-digit addition without saturation.
- Forward Addition: Self-improvement with length filtering enables models to generalize to significantly longer sequences than seen during training.
- Multiplication: Combining majority voting and length filtering allows models to achieve near-perfect length generalization. The paper uses a chain-of-thought (CoT) data format similar to \citet{deng2024explicit}, where the input prompt is structured as 9172*9431= and the label expands the multiplication into stepwise additions, e.g., 17442+067801(132331)+0075180(1398490)+00091720=13976630 (a formatting sketch follows this list).
- String Copying and Reversing: Models generalize to sequences significantly longer than those in the initial training data.
- Maze Solving: Self-improvement, especially with majority voting, enables models to find shortest paths in mazes with a larger number of nodes and hops.
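To make the arithmetic data formats concrete, here is a minimal Python sketch of the reverse-addition and multiplication-CoT formats. The helper names (`rev`, `reverse_addition`, `multiplication_cot`) and the exact padding rules are our assumptions, inferred from the multiplication example above rather than taken from the paper's data pipeline.

```python
def rev(n: int, width: int = 0) -> str:
    """Write n least-significant digit first, right-padded with zeros to `width`."""
    return str(n)[::-1].ljust(width, "0")

def reverse_addition(a: int, b: int) -> str:
    """Reverse-addition format: operands and answer written least-significant digit first."""
    return f"{rev(a)}+{rev(b)}={rev(a + b)}"

def multiplication_cot(a: int, b: int) -> str:
    """Multiplication CoT format: reversed digits, shifted partial products, running sums."""
    prompt = f"{rev(a)}*{rev(b)}="
    digits = str(b)[::-1]
    steps, running = [], 0
    for i, digit in enumerate(digits):      # one step per digit of b, least significant first
        width = len(str(a)) + 1 + i         # max width of this shifted partial product
        partial = a * int(digit) * 10**i
        running += partial
        step = rev(partial, width)
        if 0 < i < len(digits) - 1:         # intermediate running sums shown in parentheses
            step += f"({rev(running, width)})"
        steps.append(step)
    answer = rev(a * b, len(str(a)) + len(str(b)))
    return prompt + "+".join(steps) + "=" + answer

print(reverse_addition(123, 456))      # 321+654=975
print(multiplication_cot(2719, 1349))
# 9172*9431=17442+067801(132331)+0075180(1398490)+00091720=13976630
```

Writing every number least-significant digit first lets the model emit each output digit after seeing only the operand digits and carries it depends on, which is the usual motivation given for the reversed format in prior length-generalization work.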
The following points describe some key components of the approach.
- Task Difficulty: The difficulty level of a problem instance x is denoted by an integer Difficulty(x).
- Data Generation: The initial supervised training dataset D_0 is generated at difficulty level d_0 by uniformly sampling a difficulty d ≤ d_0 and then independently sampling data conditioned on that difficulty.
- D_0: the initial supervised training dataset
- x_i: the inputs
- y_i: the labels
- N_0: the number of examples in D_0
- Self-Improvement: For each subsequent round r, the problem difficulty is increased to d_r. Using the previous model M_{r-1}, N_r new self-improve data samples D_r are generated.
- r: the self-improvement round
- M_{r-1}: the previous model
- N_r: the number of new self-improve data samples
- D_r: the new self-improve data samples
- Data Filtering: Two unsupervised data-filtering methods are used: length filtering, which discards self-generated labels whose length is inconsistent with the given difficulty, and majority voting, which keeps a label only when enough of the k independently trained models agree on it (see the sketch following these definitions).
- k: the number of models trained
- M_{r-1}^{(1)}, ..., M_{r-1}^{(k)}: the k trained models
- τ: the agreement threshold used in majority voting
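To tie these pieces together, below is a minimal Python sketch of one self-improvement round with both filters applied. The callables `sample_problem`, `expected_length`, and the model wrappers in `models` are hypothetical placeholders (not the authors' code), and length filtering is shown here as a simple minimum-length check, one straightforward variant.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

def majority_vote(labels: List[str], tau: float) -> Optional[str]:
    """Keep a generated label only if at least a tau fraction of the k models agree on it."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= tau else None

def self_improve_round(
    sample_problem: Callable[[int], str],     # draws an unlabeled input at a given difficulty
    models: List[Callable[[str], str]],       # the k trained models M_{r-1}^{(1)}, ..., M_{r-1}^{(k)}
    expected_length: Callable[[str], int],    # task-specific expected label length for an input
    difficulty: int,                          # d_r, the increased difficulty for round r
    n_samples: int,                           # N_r, the number of self-improve samples to collect
    tau: float,                               # agreement threshold for majority voting
) -> List[Tuple[str, str]]:
    """Build the round-r self-improve dataset D_r using length and majority-vote filtering."""
    d_r: List[Tuple[str, str]] = []
    while len(d_r) < n_samples:
        x = sample_problem(difficulty)
        y = majority_vote([m(x) for m in models], tau)
        # Length filtering: drop labels that are shorter than expected for this difficulty.
        if y is not None and len(y) >= expected_length(x):
            d_r.append((x, y))
    return d_r
```

The resulting D_r is added to the training data and the models are retrained on it before the difficulty is increased for the next round.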
The dynamics of self-improvement are investigated, revealing that:
- Controlling the weak-to-strong curriculum is crucial to avoid catastrophic failure.
- Self-improvement accelerates over time as models increasingly benefit from harder examples.
- Starting with pretrained models significantly accelerates self-improvement.
The results show that OOD extrapolation capabilities improve progressively as the model undergoes more rounds of self-improvement. A "safe range" for sampling next-round difficulty exists, and the sampling schedule can be accelerated by sampling multiple difficulty levels within this safe range per round, leading to faster improvements in performance.
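As a concrete illustration, an accelerated schedule can draw several difficulty levels from the safe range each round instead of stepping up by a single level. The sketch below is a hypothetical implementation; `safe_width` stands in for however wide the safe range is estimated to be in practice.

```python
import random

def next_round_difficulties(current_d: int, safe_width: int, n_levels: int) -> list[int]:
    """Sample several next-round difficulty levels from the safe range above current_d.

    `safe_width` is an assumed stand-in for the width of the safe range;
    repeats are allowed, so some levels may receive more self-improve data.
    """
    safe_range = range(current_d + 1, current_d + 1 + safe_width)
    return sorted(random.choices(safe_range, k=n_levels))

print(next_round_difficulties(current_d=10, safe_width=3, n_levels=4))
# e.g. [11, 11, 12, 13]: self-improve data is generated at each sampled level this round
```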
The paper also analyzes error avalanches in self-improvement, where inaccuracies in self-generated data accumulate, leading to a collapse in performance. The model tolerates errors up to a certain point before a crash in accuracy occurs. Additionally, models can generalize despite memorizing past mistakes.