- The paper introduces a self-training framework in which Transformers iteratively generate and learn from their own solutions, progressively tackling harder task difficulties.
- It employs techniques like length filtering and majority voting to enhance accuracy and achieve near-perfect generalization on tasks such as arithmetic and maze solving.
- Experiments demonstrate that pretrained models with a calibrated difficulty schedule accelerate self-improvement while mitigating error avalanches due to label noise.
The paper introduces a self-improvement framework for training Transformer models to overcome challenges in length and easy-to-hard generalization. The approach involves iteratively generating and learning from the model's own solutions, progressively tackling harder problems without modifying the base Transformer architecture. The framework's effectiveness is demonstrated across diverse tasks, including arithmetic, string manipulation, and maze solving.
The key contributions and findings include:
- A self-training framework is applied to train Transformers on arithmetic, maze, and string manipulation tasks, successfully generalizing to extreme out-of-distribution test data.
- A carefully crafted self-improvement schedule and label filtering based on length and majority voting are highlighted as central to consistent self-improvement.
- The rate of self-improvement can be exponential, and starting from pretrained models further accelerates easy-to-hard generalization.
- Key failure modes, in which label noise in self-generated data leads to an error avalanche, are investigated, along with methods for overcoming them through weak verification.
The paper evaluates the approach on several tasks:
- Reverse Addition: Models trained with self-improvement generalize from 10-digit to 100-digit addition without saturation.
- Forward Addition: Self-improvement with length filtering enables models to generalize to significantly longer sequences than seen during training.
- Multiplication: Combining majority voting and length filtering allows models to achieve near-perfect length generalization. The paper uses a chain-of-thought (CoT) data format similar to \citet{deng2024explicit}, where the input prompt is structured as 9172*9431= and the label expands the multiplication into stepwise additions, e.g., 17442+067801(132331)+0075180(1398490)+00091720=13976630 (a formatting sketch follows this list).
- String Copying and Reversing: Models generalize to sequences significantly longer than those in the initial training data.
- Maze Solving: Self-improvement, especially with majority voting, enables models to find shortest paths in mazes with a larger number of nodes and hops.
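To make the arithmetic data formats concrete, here is a minimal Python sketch of the reverse-addition and multiplication-CoT formats. The helper names (`rev`, `reverse_addition`, `multiplication_cot`) and the exact padding rules are our assumptions, inferred from the multiplication example above rather than taken from the paper's data pipeline.

```python
def rev(n: int, width: int = 0) -> str:
    """Write n least-significant digit first, right-padded with zeros to `width`."""
    return str(n)[::-1].ljust(width, "0")

def reverse_addition(a: int, b: int) -> str:
    """Reverse-addition format: operands and answer written least-significant digit first."""
    return f"{rev(a)}+{rev(b)}={rev(a + b)}"

def multiplication_cot(a: int, b: int) -> str:
    """Multiplication CoT format: reversed digits, shifted partial products, running sums."""
    prompt = f"{rev(a)}*{rev(b)}="
    digits = str(b)[::-1]
    steps, running = [], 0
    for i, digit in enumerate(digits):      # one step per digit of b, least significant first
        width = len(str(a)) + 1 + i         # max width of this shifted partial product
        partial = a * int(digit) * 10**i
        running += partial
        step = rev(partial, width)
        if 0 < i < len(digits) - 1:         # intermediate running sums shown in parentheses
            step += f"({rev(running, width)})"
        steps.append(step)
    answer = rev(a * b, len(str(a)) + len(str(b)))
    return prompt + "+".join(steps) + "=" + answer

print(reverse_addition(123, 456))      # 321+654=975
print(multiplication_cot(2719, 1349))
# 9172*9431=17442+067801(132331)+0075180(1398490)+00091720=13976630
```

Writing every number least-significant digit first lets the model emit each output digit after seeing only the operand digits and carries it depends on, which is the usual motivation given for the reversed format in prior length-generalization work.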
The following points describe some key components of the approach.
- Task Difficulty: The difficulty level of a problem instance x is denoted by an integer Difficulty(x).
- Data Generation: The initial supervised training dataset D_0 is generated at difficulty level d_0 by uniformly sampling a difficulty d ≤ d_0 and then independently sampling data conditioned on that difficulty.
- D_0: the initial supervised training dataset
- x_i: the inputs
- y_i: the labels
- N_0: the number of examples in D_0
- Self-Improvement: For each subsequent round r, the problem difficulty is increased to d_r. Using the previous model M_{r-1}, N_r new self-improve data samples D_r are generated.
- r: the self-improvement round
- M_{r-1}: the previous model
- N_r: the number of new self-improve data samples
- D_r: the new self-improve data samples
- Data Filtering: Two unsupervised data-filtering methods are used: length filtering, which discards self-generated labels whose length is inconsistent with the given difficulty, and majority voting, which keeps a label only when enough of the k independently trained models agree on it (see the sketch following these definitions).
- k: the number of models trained
- M_{r-1}^{(1)}, ..., M_{r-1}^{(k)}: the k trained models
- τ: the agreement threshold used in majority voting
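To tie these pieces together, below is a minimal Python sketch of one self-improvement round with both filters applied. The callables `sample_problem`, `expected_length`, and the model wrappers in `models` are hypothetical placeholders (not the authors' code), and length filtering is shown here as a simple minimum-length check, one straightforward variant.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

def majority_vote(labels: List[str], tau: float) -> Optional[str]:
    """Keep a generated label only if at least a tau fraction of the k models agree on it."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= tau else None

def self_improve_round(
    sample_problem: Callable[[int], str],     # draws an unlabeled input at a given difficulty
    models: List[Callable[[str], str]],       # the k trained models M_{r-1}^{(1)}, ..., M_{r-1}^{(k)}
    expected_length: Callable[[str], int],    # task-specific expected label length for an input
    difficulty: int,                          # d_r, the increased difficulty for round r
    n_samples: int,                           # N_r, the number of self-improve samples to collect
    tau: float,                               # agreement threshold for majority voting
) -> List[Tuple[str, str]]:
    """Build the round-r self-improve dataset D_r using length and majority-vote filtering."""
    d_r: List[Tuple[str, str]] = []
    while len(d_r) < n_samples:
        x = sample_problem(difficulty)
        y = majority_vote([m(x) for m in models], tau)
        # Length filtering: drop labels that are shorter than expected for this difficulty.
        if y is not None and len(y) >= expected_length(x):
            d_r.append((x, y))
    return d_r
```

The resulting D_r is added to the training data and the models are retrained on it before the difficulty is increased for the next round.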
The dynamics of self-improvement are investigated, revealing that:
- Controlling the weak-to-strong curriculum is crucial to avoid catastrophic failure.
- Self-improvement accelerates over time as models increasingly benefit from harder examples.
- Starting with pretrained models significantly accelerates self-improvement.
The results show that OOD extrapolation capabilities improve progressively as the model undergoes more rounds of self-improvement. A "safe range" for sampling next-round difficulty exists, and the sampling schedule can be accelerated by sampling multiple difficulty levels within this safe range per round, leading to faster improvements in performance.
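As a concrete illustration, an accelerated schedule can draw several difficulty levels from the safe range each round instead of stepping up by a single level. The sketch below is a hypothetical implementation; `safe_width` stands in for however wide the safe range is estimated to be in practice.

```python
import random

def next_round_difficulties(current_d: int, safe_width: int, n_levels: int) -> list[int]:
    """Sample several next-round difficulty levels from the safe range above current_d.

    `safe_width` is an assumed stand-in for the width of the safe range;
    repeats are allowed, so some levels may receive more self-improve data.
    """
    safe_range = range(current_d + 1, current_d + 1 + safe_width)
    return sorted(random.choices(safe_range, k=n_levels))

print(next_round_difficulties(current_d=10, safe_width=3, n_levels=4))
# e.g. [11, 11, 12, 13]: self-improve data is generated at each sampled level this round
```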
The paper also analyzes error avalanches in self-improvement, where inaccuracies in self-generated data accumulate, leading to a collapse in performance. The model tolerates errors up to a certain point before a crash in accuracy occurs. Additionally, models can generalize despite memorizing past mistakes.