Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges (2502.01612v2)

Published 3 Feb 2025 in cs.LG and cs.AI

Abstract: LLMs often struggle with length generalization and solving complex problem instances beyond their training distribution. We present a self-improvement approach where models iteratively generate and learn from their own solutions, progressively tackling harder problems while maintaining a standard transformer architecture. Across diverse tasks including arithmetic, string manipulation, and maze solving, self-improvement enables models to solve problems far beyond their initial training distribution; for instance, generalizing from 10-digit to 100-digit addition without apparent saturation. We observe that in some cases filtering for correct self-generated examples leads to exponential improvements in out-of-distribution performance across training rounds. Additionally, starting from pretrained models significantly accelerates this self-improvement process for several tasks. Our results demonstrate how controlled weak-to-strong curricula can systematically teach a model logical extrapolation without any changes to the positional embeddings or the model architecture.

Summary

  • The paper introduces a self-training framework where Transformers iteratively generate and learn from their own solutions to progressively overcome task difficulties.
  • It employs techniques like length filtering and majority voting to enhance accuracy and achieve near-perfect generalization on tasks such as arithmetic and maze solving.
  • Experiments demonstrate that pretrained models with a calibrated difficulty schedule accelerate self-improvement while mitigating error avalanches due to label noise.

The paper introduces a self-improvement framework for training Transformer models to overcome challenges in length and easy-to-hard generalization. The approach involves iteratively generating and learning from the model's own solutions, progressively tackling harder problems without modifying the base Transformer architecture. The framework's effectiveness is demonstrated across diverse tasks, including arithmetic, string manipulation, and maze solving.

The key contributions and findings include:

  1. A self-training framework is applied to train Transformers on arithmetic, maze, and string manipulation tasks, successfully generalizing to extreme out-of-distribution test data.
  2. A carefully crafted self-improvement schedule and label filtering based on length and majority voting are identified as central to consistent self-improvement.
  3. The rate of self-improvement can be exponential, and starting from pretrained models markedly accelerates easy-to-hard generalization.
  4. Key failure modes, in which label noise in self-generated data accumulates into an error avalanche, are investigated, and methods for overcoming them through weak verification are discussed.

The paper evaluates the approach on several tasks:

  • Reverse Addition: Models trained with self-improvement generalize from 10-digit to 100-digit addition without saturation.
  • Forward Addition: Self-improvement with length filtering enables models to generalize to significantly longer sequences than seen during training.
  • Multiplication: Combining majority voting and length filtering allows models to achieve near-perfect length generalization. The paper uses a chain-of-thought (CoT) data format similar to Deng et al. (2024), where the input prompt is structured as 9172*9431= and the label expands the multiplication into stepwise additions, such as 17442+067801(132331)+0075180(1398490)+00091720=13976630 (see the sketch after this list).
  • String Copying and Reversing: Models generalize to sequences significantly longer than those in the initial training data.
  • Maze Solving: Self-improvement, especially with majority voting, enables models to find shortest paths in mazes with a larger number of nodes and hops.
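
To make the multiplication format concrete, here is a minimal Python sketch that reconstructs such a CoT label in the reversed-digit (least-significant-digit-first) style of the example above. The helper names (rev, multiplication_cot_label) and the exact zero-padding rule are a reconstruction from the single example shown, not the paper's released code.

```python
def rev(n: int, width: int) -> str:
    """Write n with its digits reversed (least-significant first), zero-padded to `width` digits."""
    return str(n).zfill(width)[::-1]

def multiplication_cot_label(a: int, b: int) -> str:
    """Build a stepwise-addition CoT label for a*b in reversed-digit notation (assumed padding rule)."""
    digits_b = [int(d) for d in str(b)[::-1]]        # least-significant digit of b first
    pieces, running = [], 0
    for i, d in enumerate(digits_b):
        partial = a * d * 10 ** i                    # shifted partial product
        running += partial
        width = len(str(a)) + i + 1                  # digit budget grows by one per step (inferred)
        last = i == len(digits_b) - 1
        if i == 0 and not last:
            pieces.append(rev(partial, width))                               # first partial product
        elif not last:
            pieces.append(f"{rev(partial, width)}({rev(running, width)})")   # partial + running sum
        else:
            pieces.append(f"{rev(partial, width)}={rev(running, width)}")    # final partial + answer
    return "+".join(pieces)

a, b = 2719, 1349                                    # displayed reversed in the prompt as 9172*9431
print(f"{str(a)[::-1]}*{str(b)[::-1]}=" + multiplication_cot_label(a, b))
# 9172*9431=17442+067801(132331)+0075180(1398490)+00091720=13976630
```

The reversed-digit convention matches the Reverse Addition task above; the padding widths are inferred solely from the quoted example.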

The following points describe the key components of the approach; a minimal sketch of one self-improvement round follows the list.

  • Task Difficulty: The difficulty level of a problem instance x is denoted by an integer Difficulty(x).
    • x: a problem instance
  • Data Generation: An initial supervised training dataset D_0 = {(x_i, y_i)}_{i=1}^{N_0} is generated at a fixed difficulty level d_0 by uniformly sampling a difficulty d ≤ d_0, then sampling data conditioned on that difficulty.
    • D_0: the initial supervised training dataset
    • x_i: the inputs
    • y_i: the labels
    • N_0: the number of examples in D_0
  • Self-Improvement: In each subsequent round r, the problem difficulty is increased to d_r. Using the previous model M_{r-1}, N_r new self-improvement samples D_r are generated.
    • r: the self-improvement round
    • M_{r-1}: the previous model
    • N_r: the number of new self-improvement samples
    • D_r: the round-r self-improvement dataset
  • Data Filtering: Two unsupervised data-filtering methods are used: length filtering, which discards self-generated labels whose length exceeds what the problem difficulty implies, and majority voting, in which k independently trained models M_{r-1}^(1), ..., M_{r-1}^(k) must agree on a label with frequency at least a threshold τ for it to be kept.
    • k: the number of independently trained models
    • M_{r-1}^(1), ..., M_{r-1}^(k): the k trained models
    • τ: the agreement threshold for majority voting
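
The following sketch ties these components together into one self-improvement round, combining the length and majority-vote filters described above. It is a minimal illustration, not the paper's implementation: sample_inputs, max_label_len, train_fn, and the .generate method are assumed helpers, and the default tau value is arbitrary.

```python
from collections import Counter

def self_improve_round(models, prior_data, d_r, n_r,
                       sample_inputs, max_label_len, train_fn, tau=0.6):
    """One self-improvement round, following the components above.

    models        -- k independently trained copies of M_{r-1}, each with a .generate(x) method
    prior_data    -- all (input, label) pairs accumulated in earlier rounds
    d_r, n_r      -- difficulty level and number of new samples for this round
    sample_inputs -- callable producing problem inputs at a given difficulty (assumed helper)
    max_label_len -- callable giving the longest plausible label length at difficulty d_r
    train_fn      -- callable that fine-tunes a model on a dataset and returns M_r (assumed helper)
    tau           -- agreement fraction required by the majority-vote filter
    """
    new_data = []
    for x in sample_inputs(difficulty=d_r, n=n_r):            # harder problems than last round
        candidates = [m.generate(x) for m in models]           # one candidate label per model copy
        label, votes = Counter(candidates).most_common(1)[0]   # most frequent candidate
        if votes / len(models) < tau:                          # majority-vote filter
            continue
        if len(label) > max_label_len(d_r):                    # length filter: over-long labels are likely wrong
            continue
        new_data.append((x, label))
    model_r = train_fn(models[0], prior_data + new_data)       # retrain on all accumulated data
    return model_r, prior_data + new_data
```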

The dynamics of self-improvement are investigated, revealing that:

  • Controlling the weak-to-strong curriculum is crucial to avoid catastrophic failure.
  • Self-improvement accelerates over time as models increasingly benefit from harder examples.
  • Starting with pretrained models significantly accelerates self-improvement.

The results show that OOD extrapolation capabilities improve progressively as the model undergoes more rounds of self-improvement. A "safe range" for sampling the next-round difficulty exists, and the schedule can be accelerated by sampling multiple difficulty levels within this safe range in each round.
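
As a rough illustration, an accelerated schedule could sample every difficulty level inside the estimated safe range rather than a single increment per round. The function below is a hypothetical sketch; its name, arguments, and the idea of an explicit safe_margin estimate are assumptions, not the paper's implementation.

```python
def next_round_difficulties(d_prev: int, safe_margin: int, accelerate: bool = False) -> list[int]:
    """Pick the difficulty level(s) to sample for the next self-improvement round.

    safe_margin is an assumed per-task estimate of how far beyond the current
    difficulty d_prev the model's self-generated labels remain reliable.
    """
    if not accelerate:
        return [d_prev + 1]                                      # conservative: one step harder per round
    return list(range(d_prev + 1, d_prev + safe_margin + 1))     # several levels inside the safe range

print(next_round_difficulties(10, safe_margin=3))                    # [11]
print(next_round_difficulties(10, safe_margin=3, accelerate=True))   # [11, 12, 13]
```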

The paper also analyzes error avalanches in self-improvement, where inaccuracies in self-generated data accumulate and eventually cause performance to collapse. The model tolerates a certain amount of label noise before accuracy crashes, and models can continue to generalize even while memorizing incorrect labels from earlier rounds.