- The paper presents AITP, a method that bridges the distributional gap between instruction tuning and pre-training datasets to enhance LLM generalization.
- It introduces a three-stage process—difference set generation, data transformation, and combined training—using techniques like PCA and KDE.
- Experimental results show consistent performance improvements across multiple benchmarks, with the best gains achieved when the rewritten data comprises less than 10% of the original SFT dataset.
The paper introduces Aligning Instruction Tuning with Pre-training (AITP), a methodology designed to address the distributional disparities between instruction-tuning datasets and pre-training corpora in LLMs. Instruction tuning, which is crucial for adapting LLMs to follow human instructions, often relies on narrowly focused datasets that fail to capture the broad distributions present during pre-training, thus limiting the generalization capabilities of LLMs. AITP seeks to bridge this gap by identifying coverage shortfalls in instruction-tuning datasets and rewriting underrepresented pre-training data into high-quality instruction-response pairs.
The AITP method involves three key stages:
- Difference Set Generation: This stage isolates data points from the pre-training corpus that are dissimilar to those in the Supervised Fine-Tuning (SFT) dataset, identifying regions of the pre-training data distribution that are absent or sparsely populated in the SFT data. The difference set, $D_{\text{diff}}$, is formally defined as:

$$D_{\text{diff}} = \{\, d_i \mid d_i \in D_{\text{pretrain}},\ A(d_i, D_{\text{SFT}}) < T \,\}$$

where:
- $D_{\text{pretrain}}$ is the pre-training dataset.
- $D_{\text{SFT}}$ is the SFT dataset.
- $A(d_i, D_{\text{SFT}})$ is the density estimate of the data point $d_i$ under the SFT dataset's distribution.
- $T$ is the threshold below which a data point is included in the difference set.
Each data point is represented by the model's final-layer embedding, which is then projected into two-dimensional coordinates via Principal Component Analysis (PCA). This is formalized as:

$$(x_i, y_i) = \mathrm{DR}(\mathrm{Model}(d_i))$$

where:
- $\mathrm{DR}$ is the dimensionality reduction function.
- $\mathrm{Model}(d_i)$ is the model's embedding of the data point $d_i$.
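To make this step concrete, here is a minimal sketch of the embedding and projection, assuming mean-pooled final-layer hidden states from a Hugging Face model and scikit-learn's PCA; the model name and the pooling choice are illustrative assumptions, not details from the paper.

```python
# Sketch: embed documents with a base LM, then project to 2-D with PCA.
# Model name and mean-pooling are assumptions for illustration only.
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

def embed(texts, model_name="allenai/OLMo-7B"):  # illustrative base model
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    vecs = []
    with torch.no_grad():
        for t in texts:
            inputs = tok(t, return_tensors="pt", truncation=True, max_length=512)
            hidden = model(**inputs).last_hidden_state           # final-layer states
            vecs.append(hidden.mean(dim=1).squeeze(0).numpy())   # mean-pool over tokens
    return np.stack(vecs)

def project_2d(embeddings):
    # (x_i, y_i) = DR(Model(d_i)): reduce embeddings to 2-D coordinates.
    return PCA(n_components=2).fit_transform(embeddings)
```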
Kernel Density Estimation (KDE) is then employed to visualize the density of points for each dataset, using the function:
$$f(x, y) = \frac{1}{n h_x h_y} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h_x},\ \frac{y - y_i}{h_y}\right)$$

where:
- $K(\cdot, \cdot)$ is the kernel function (typically Gaussian).
- $(x, y)$ is the point at which the density is evaluated, and $(x_i, y_i)$ are the projected data points.
- $h_x$ and $h_y$ are bandwidth parameters controlling smoothness in the $x$ and $y$ directions, respectively.
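Putting the pieces together, the sketch below selects the difference set by estimating the SFT density in the 2-D PCA space and thresholding it. The Gaussian KDE with Scott's-rule bandwidths (standing in for explicit $h_x$, $h_y$) and the percentile-based choice of $T$ are assumptions, since the paper does not pin these details down here.

```python
# Sketch: A(d_i, D_SFT) via Gaussian KDE in PCA space, then D_diff by threshold.
# Scott's-rule bandwidths and a percentile-based T are illustrative assumptions.
import numpy as np
from scipy.stats import gaussian_kde

def difference_set(pretrain_coords, sft_coords, threshold_pct=10.0):
    """Return indices i with A(d_i, D_SFT) < T.

    pretrain_coords, sft_coords: arrays of shape (n_points, 2) from PCA.
    """
    kde = gaussian_kde(sft_coords.T)           # f(x, y) fit on the SFT points
    density = kde(pretrain_coords.T)           # A(d_i, D_SFT) for each pre-training point
    T = np.percentile(density, threshold_pct)  # low-density cutoff (assumed heuristic)
    return np.where(density < T)[0]
```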
- Data Transformation of the Difference Set: In this stage, raw text from the pre-training data in the difference set is converted into instruction-response pairs suitable for SFT. A query generation prompt guides the model in creating relevant questions from the raw text; a query scoring prompt then assesses the quality of each generated query, filtering out low-quality queries to conserve computational resources; finally, an answer generation prompt produces responses to the remaining high-quality queries. A sketch of this pipeline follows.
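A minimal sketch of the three-prompt pipeline, assuming a generic `generate(prompt) -> str` helper; the prompt wording, the 1-to-10 scoring scale, and the cutoff score are hypothetical stand-ins, not the paper's actual prompts.

```python
# Sketch of the three-prompt rewrite pipeline. Prompts, scoring scale, and
# the `generate` helper are hypothetical placeholders, not the paper's exact ones.
QUERY_PROMPT = "Write one question that the following text could answer:\n{text}"
SCORE_PROMPT = "Rate this question's quality from 1 to 10. Reply with a number only:\n{query}"
ANSWER_PROMPT = "Answer the question using the text.\nText: {text}\nQuestion: {query}"

def rewrite(doc_text, generate, min_score=7.0):
    """Turn one raw document into an instruction-response pair, or None if filtered."""
    query = generate(QUERY_PROMPT.format(text=doc_text))       # query generation
    score = float(generate(SCORE_PROMPT.format(query=query)))  # query scoring
    if score < min_score:
        return None                    # drop low-quality queries before answering
    answer = generate(ANSWER_PROMPT.format(text=doc_text, query=query))  # answer generation
    return {"instruction": query, "response": answer}
```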
- Training: The model is trained on a combined dataset that includes both the rewritten data from the difference set and the original SFT dataset.
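For completeness, a small sketch of the combined-training setup, assuming both datasets are stored as JSONL files of instruction-response pairs; the file names are placeholders, and the fine-tuning itself follows whatever standard SFT recipe the base model uses.

```python
# Sketch: merge the rewritten difference-set pairs with the original SFT set.
# File names are placeholders; training follows the usual SFT recipe.
import json
import random

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

sft = load_jsonl("sft_data.jsonl")              # original SFT pairs
rewritten = load_jsonl("rewritten_diff.jsonl")  # rewritten difference-set pairs

combined = sft + rewritten
random.shuffle(combined)  # interleave the two sources before instruction tuning
```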
Experiments were conducted with three open-source models: OLMo, MAP-Neo, and Pythia, all of which release not only model weights but also training datasets and intermediate checkpoints. OLMo-7B-base, MAP-Neo-7B-base, and Pythia-12B served as the base models for AITP, with OLMo-7B-SFT and MAP-Neo-7B-SFT-v0.1 serving as baselines. Since Pythia's SFT dataset has not been released, a Pythia model fine-tuned on Tulu-v2 served as its baseline.
The IFEval benchmark was used to evaluate instruction-following ability, reporting four accuracy scores: Prompt-level Strict-accuracy (P-S), Instruction-level Strict-accuracy (I-S), Prompt-level Loose-accuracy (P-L), and Instruction-level Loose-accuracy (I-L). Additional benchmarks included MMLU, ARC-c, GPQA-diamond, HumanEval, MBPP, HellaSwag, and GSM8K.
The results showed that models trained with AITP achieved average improvements of 3.77, 1.11, and 0.97 points across the eight benchmarks relative to the SFT baselines of OLMo, MAP-Neo, and Pythia, respectively. The authors attribute this to AITP supplementing the original SFT dataset with the data it lacks, expanding its coverage and optimizing its distribution. The difference set captures pre-training data that is underrepresented in SFT datasets, such as code and scientific literature. Although the distribution narrows during rewriting, the final combined dataset still expands the coverage of the original SFT dataset.
Ablation studies evaluated the impact of dataset size and of distillation during the data transformation process. With the combined dataset held to the same size as the baseline's, AITP still achieves an average absolute improvement of 2.08 points. A pure distillation setting did not outperform the OLMo-SFT baseline, indicating that the gains do not come from distillation by an aligned model. A further study of the ratio of rewritten difference data showed that AITP performs best when the rewritten set comprises less than 10% of the original SFT dataset, with performance declining as the rewritten set grows. This suggests that a small amount of rewritten data substantially improves performance by filling gaps in the original SFT data, while a larger rewritten ratio degrades overall data quality, since the rewritten data is of lower quality than the original SFT data.