Aligning Instruction Tuning with Pre-training (2501.09368v3)

Published 16 Jan 2025 in cs.AI

Abstract: Instruction tuning enhances LLMs to follow human instructions across diverse tasks, relying on high-quality datasets to guide behavior. However, these datasets, whether manually curated or synthetically generated, are often narrowly focused and misaligned with the broad distributions captured during pre-training, limiting LLM generalization and effective use of pre-trained knowledge. We propose Aligning Instruction Tuning with Pre-training (AITP), a method that bridges this gap by identifying coverage shortfalls in instruction-tuning datasets and rewriting underrepresented pre-training data into high-quality instruction-response pairs. This approach enriches dataset diversity while preserving task-specific objectives. Evaluations on three fully open LLMs across eight benchmarks demonstrate consistent performance improvements with AITP. Ablations highlight the benefits of adaptive data selection, controlled rewriting, and balanced integration, emphasizing the importance of aligning instruction tuning with pre-training distributions to unlock the full potential of LLMs.

Summary

  • The paper presents AITP, a method that bridges the distributional gap between instruction tuning and pre-training datasets to enhance LLM generalization.
  • It introduces a three-stage process—difference set generation, data transformation, and combined training—using techniques like PCA and KDE.
  • Experimental results show consistent performance improvements across multiple benchmarks, with optimal gains achieved using less than 10% rewritten data.

The paper introduces Aligning Instruction Tuning with Pre-training (AITP), a methodology designed to address the distributional disparities between instruction-tuning datasets and pre-training corpora in LLMs. Instruction tuning, which is crucial for adapting LLMs to follow human instructions, often relies on narrowly focused datasets that fail to capture the broad distributions present during pre-training, thus limiting the generalization capabilities of LLMs. AITP seeks to bridge this gap by identifying coverage shortfalls in instruction-tuning datasets and rewriting underrepresented pre-training data into high-quality instruction-response pairs.

The AITP method involves three key stages:

  • Difference Set Generation: This stage isolates data points from the pre-training corpora that are dissimilar to those in the Supervised Fine-Tuning (SFT) dataset. This process identifies regions in the pre-training data distribution that are either absent or sparsely populated in the SFT data. The difference set, $D_{diff}$, is formally defined as:

    $D_{diff} = \{\, d_i \mid d_i \in D_{pretrain},\ A(d_i, D_{SFT}) < T \,\}$

    where:

    • $D_{pretrain}$ is the pre-training dataset.
    • $D_{SFT}$ is the SFT dataset.
    • $A(d_i, D_{SFT})$ is the density estimate of the data point $d_i$ under the SFT dataset.
    • $T$ is the threshold determining whether a data point is included in the difference set.

    Each data point is represented as a vector derived from the final-layer embedding of the model, followed by dimensionality reduction using Principal Component Analysis (PCA) to project these high-dimensional embeddings into two-dimensional coordinates, which is formalized as:

    $(x_i, y_i) = DR(\mathrm{Model}(d_i))$

    where:

    • $DR$ is the dimensionality reduction function.
    • $\mathrm{Model}(d_i)$ is the model's embedding of the data point $d_i$.

    Kernel Density Estimation (KDE) is then employed to visualize the density of points for each dataset, using the function:

    $f(x, y) = \frac{1}{n h_x h_y} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h_x}, \frac{y - y_i}{h_y}\right)$

    where:

    • $K(\cdot, \cdot)$ is the kernel function (typically Gaussian).
    • $(x, y)$ and $(x_i, y_i)$ are two-dimensional data points.
    • $h_x$ and $h_y$ are bandwidth parameters controlling smoothness in the x and y directions, respectively.

    A minimal code sketch of this selection pipeline appears after this list.

  • Data Transformation of the Difference Set: In this phase, raw text from the pre-training data within the difference set is converted into instruction-response pairs suitable for SFT. A query-generation prompt guides the model in creating relevant questions from the raw text. A query-scoring prompt then assesses the quality of each generated query, filtering out low-quality queries to conserve computational resources. Finally, an answer-generation prompt produces responses to the remaining high-quality queries. A sketch of this three-prompt pipeline is also given after the list.
  • Training: The model is trained on a combined dataset that includes both the rewritten data from the difference set and the original SFT dataset.
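To make the selection stage concrete, here is a minimal Python sketch of the pipeline described above, assuming final-layer embeddings have already been extracted. The choice to fit PCA on the union of both datasets, scipy's default bandwidth rule, and the threshold value are illustrative assumptions rather than details specified in the paper.

```python
# Difference-set selection: project embeddings to 2-D with PCA, estimate
# the SFT density with a Gaussian KDE, and keep low-density points.
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import gaussian_kde

def select_difference_set(pretrain_emb, sft_emb, threshold):
    """Return indices of pre-training points whose density under the
    SFT distribution falls below the threshold T."""
    # Fit PCA on the union so both point clouds share one 2-D space:
    # (x_i, y_i) = DR(Model(d_i)).
    pca = PCA(n_components=2)
    pca.fit(np.vstack([pretrain_emb, sft_emb]))
    pre_2d = pca.transform(pretrain_emb)
    sft_2d = pca.transform(sft_emb)

    # Gaussian KDE over the SFT points; scipy derives the bandwidths
    # automatically from the data covariance (Scott's rule).
    kde = gaussian_kde(sft_2d.T)

    # A(d_i, D_SFT): density of each pre-training point under the SFT
    # distribution; select points with A(d_i, D_SFT) < T.
    density = kde(pre_2d.T)
    return np.where(density < threshold)[0]

# Example with random stand-in embeddings:
# pre = np.random.randn(10_000, 4096); sft = np.random.randn(5_000, 4096)
# diff_idx = select_difference_set(pre, sft, threshold=0.01)
```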
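Similarly, the rewriting stage reduces to a three-call pipeline, sketched below. This is a hypothetical skeleton: `generate` is a placeholder for whatever LLM endpoint performs the rewriting, and the prompt wording and score cutoff are illustrative stand-ins for the paper's actual prompts.

```python
# Three-prompt rewriting of difference-set documents into
# instruction-response pairs (query generation -> scoring -> answering).
def rewrite_difference_set(docs, generate, min_score=7):
    pairs = []
    for doc in docs:
        # 1. Query generation: derive a question from the raw text.
        query = generate(
            f"Write one question answerable from this text:\n{doc}")

        # 2. Query scoring: discard low-quality queries before paying
        #    the cost of answer generation.
        score = int(generate(
            f"Rate this question from 1-10 for clarity and relevance. "
            f"Reply with a number only:\n{query}"))
        if score < min_score:
            continue

        # 3. Answer generation: produce the response half of the pair.
        answer = generate(
            f"Answer the question using only the text.\n"
            f"Text: {doc}\nQuestion: {query}")
        pairs.append({"instruction": query, "response": answer})
    return pairs
```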

Experiments were conducted using three open-source models: OLMo, MAP-Neo, and Pythia, all of which release not only model weights but also training datasets and intermediate checkpoints. OLMo-7B-base, MAP-Neo-7B-base, and Pythia-12B served as the foundational setup for AITP, with OLMo-7B-SFT and MAP-Neo-7B-SFT-v0.1 as baselines. Since the SFT dataset for Pythia has not been released, a Pythia model fine-tuned on Tulu-v2 served as its baseline.

The IFEval benchmark was used to evaluate instruction-following ability, reporting four accuracy scores: Prompt-level Strict-accuracy (P-S), Instruction-level Strict-accuracy (I-S), Prompt-level Loose-accuracy (P-L), and Instruction-level Loose-accuracy (I-L). Additional benchmarks included MMLU, ARC-c, GPQA-diamond, HumanEval, MBPP, HellaSwag, and GSM8K.
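For reference, the prompt-level metrics credit a prompt only if every verifiable instruction in it is satisfied, while the instruction-level metrics count instructions individually (the loose variants apply relaxed response matching before checking). A minimal sketch of the strict aggregation, assuming a boolean pass flag per instruction:

```python
# Sketch of IFEval-style strict aggregation. `results` maps each prompt
# to a list of booleans, one per verifiable instruction in that prompt.
def ifeval_strict(results):
    prompts = list(results.values())
    # Prompt-level strict (P-S): every instruction in the prompt passes.
    p_s = sum(all(flags) for flags in prompts) / len(prompts)
    # Instruction-level strict (I-S): instructions counted individually.
    flat = [flag for flags in prompts for flag in flags]
    i_s = sum(flat) / len(flat)
    return p_s, i_s

# Example: ifeval_strict({"p1": [True, True], "p2": [True, False]})
# returns (0.5, 0.75).
```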

The results showed that models trained with AITP achieved average performance improvements of 3.77, 1.11, and 0.97 points across the eight benchmarks over the SFT baselines of OLMo, MAP-Neo, and Pythia, respectively. The authors attribute this improvement to AITP supplementing the original SFT dataset with the data it lacks, expanding its coverage and optimizing its distribution. The difference set contains pre-training data that is underrepresented in SFT datasets, such as code and scientific literature. Although the distribution narrows during rewriting, the final combined dataset still expands the coverage of the original SFT dataset.

Ablation studies evaluated the impact of dataset size and of distillation during the data transformation process. Even at matched dataset size, AITP achieved an average absolute improvement of 2.08. A pure distillation setting did not outperform the OLMo-SFT baseline, indicating that the gains do not simply result from distillation by an aligned model. Varying the ratio of rewritten difference data showed that AITP performs best when the rewritten set comprises less than 10% of the original SFT dataset, with performance declining as the rewritten set grows. This suggests that a small amount of rewritten data substantially improves the model by filling gaps in the original SFT data, while larger ratios degrade overall data quality, since the rewritten data is of lower quality than the original SFT dataset.
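As a worked example of the balanced-integration finding, a mixing step might cap the rewritten data at a fraction of the SFT set. The 10% cap below reflects the ablation's sweet spot; the uniform sampling scheme is an illustrative assumption.

```python
import random

def mix_datasets(sft_data, rewritten_data, max_ratio=0.10, seed=0):
    """Combine the SFT set with at most max_ratio * len(sft_data)
    rewritten examples, per the <10% sweet spot in the ablation."""
    cap = int(len(sft_data) * max_ratio)
    rng = random.Random(seed)
    sampled = rng.sample(rewritten_data, min(cap, len(rewritten_data)))
    combined = sft_data + sampled
    rng.shuffle(combined)
    return combined
```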
