Selection of LLM Fine-Tuning Data based on Orthogonal Rules

Published 7 Oct 2024 in cs.CL, cs.AI, and cs.LG | (2410.04715v3)

Abstract: High-quality training data is critical to the performance of LLMs. Recent work has explored using LLMs to rate and select data based on a small set of human-designed criteria (rules), but these approaches often rely heavily on heuristics, lack principled metrics for rule evaluation, and generalize poorly to new tasks. We propose a novel rule-based data selection framework that introduces a metric based on the orthogonality of rule score vectors to evaluate and select complementary rules. Our automated pipeline first uses LLMs to generate diverse rules covering multiple aspects of data quality, then rates samples according to these rules and applies the determinantal point process (DPP) to select the most independent rules. These rules are then used to score the full dataset, and high-scoring samples are selected for downstream tasks such as LLM fine-tuning. We evaluate our framework in two experiment setups: (1) alignment with ground-truth ratings and (2) performance of LLMs fine-tuned on the selected data. Experiments across IMDB, Medical, Math, and Code domains demonstrate that our DPP-based rule selection consistently improves both rating accuracy and downstream model performance over strong baselines.


Authors (5)

Summary

The paper introduces a fully automated pipeline for rule-based data selection, establishing a novel orthogonality metric that guides the curation of diverse, high-quality fine-tuning data.
It uses LLMs to generate candidate rules and to score data against them, then applies determinantal point process (DPP) sampling to favor rule subsets whose score vectors are close to uncorrelated.
Empirical results across sentiment (IMDB), medical, math, and code domains show that DPP-driven rule selection outperforms static rule-based and rule-free baselines in downstream performance.

Rule Orthogonality-Based Selection for LLM Fine-Tuning Data

Motivation and Context

Fine-tuning LLMs critically depends on training data quality. Empirical results demonstrate that carefully selected data subsets often outperform larger, less curated corpora in downstream tasks. Despite considerable recent interest in rule-based rating—utilizing either human-crafted or LLM-generated heuristics to judge data quality—existing methods typically rely on ad hoc human intuition and offer limited mechanisms for principled rule evaluation or selection. Prior frameworks (e.g., QuRating, Constitutional AI) often employ static or randomly sampled rule sets, which consequently introduce redundancy, fail to generalize across domains, and may miss complementary signals necessary for robust data quality assessment.

This work introduces a fully automated pipeline for rule-based data selection, establishing a rigorous mathematical foundation for evaluating and selecting rule sets based on their orthogonality—i.e., the statistical independence of rule score vectors. Leveraging LLMs for both rule generation and data rating, the method employs determinantal point process (DPP) sampling to construct compact, diverse sets of rules tailored to each downstream task.

Figure 1: Pipeline for rule-based data rating and selection in five steps, integrating automated rule generation, scoring, DPP-based rule selection, stochastic data sampling, and application to fine-tuning.

Methodological Framework

Rule Generation and Rating

The framework begins by prompting an LLM (typically GPT-4) with dataset and task metadata to automatically generate $R$ candidate rules that cover a broad spectrum of data quality dimensions. The candidate set is then lightly filtered for clarity and deduplicated. Unlike static approaches, the method encourages rule diversity and coverage across semantic and domain-specific axes.

Each candidate rule is then used in conjunction with an LLM (e.g., Llama3-8B-Instruct) to rate a batch of $n$ examples, producing a score matrix $S \in \mathbb{R}^{n \times R}$, where $S_{i,j} \in [0,1]$ indicates the score assigned by rule $j$ to sample $i$.

Orthogonality Metric and Rule Selection

Central to the pipeline is an orthogonality metric that measures how independent the rule scores are. For any submatrix $S' \in \mathbb{R}^{n \times r}$ formed by $r$ of the $R$ rule columns of $S$, the empirical rule correlation is defined as:

$\rho(S') = \frac{1}{r} \left\|\widehat{C}(S') - I_r\right\|_F$

where $\widehat{C}(S')$ is the sample correlation matrix of the selected rules and $\|\cdot\|_F$ denotes the Frobenius norm. The metric quantifies deviation from perfect orthogonality, for which the correlation matrix equals the identity; lower values indicate less redundant rules. The paper proves that $\rho(S')$ concentrates around its population value for sufficiently large $n$, which justifies estimating it from a batch of rated samples when selecting rules.
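The metric above can be sketched directly from its definition. The following is a minimal illustration, not the authors' implementation: the function names `correlation_matrix` and `rule_correlation` are hypothetical, scores are assumed to be stored as a list of per-sample rows, and constant rule columns (whose correlation is undefined) are rejected by assumption.

```python
import math
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Sample Pearson correlation matrix of equal-length score columns."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        mu, sd = mean(col), pstdev(col)
        if sd == 0:
            # Assumption of this sketch: constant rules are rejected,
            # because their correlation with other rules is undefined.
            raise ValueError("constant rule column has undefined correlation")
        standardized.append([(x - mu) / sd for x in col])
    return [
        [sum(a * b for a, b in zip(u, v)) / n for v in standardized]
        for u in standardized
    ]


def rule_correlation(scores, rules):
    """rho(S') = (1/r) * ||C_hat(S') - I_r||_F for the chosen rule columns.

    scores: list of n rows, each holding R rule scores (the matrix S).
    rules:  list of r column indices that define the submatrix S'.
    """
    columns = [[row[j] for row in scores] for j in rules]
    c = correlation_matrix(columns)
    r = len(rules)
    squared = sum(
        (c[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(r)
        for j in range(r)
    )
    return math.sqrt(squared) / r


# Toy score matrix: rules 0 and 1 are uncorrelated, rule 2 duplicates rule 0.
S = [[0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 1]]
rho_orthogonal = rule_correlation(S, [0, 1])  # 0.0
rho_redundant = rule_correlation(S, [0, 2])   # sqrt(2)/2, about 0.707
```

In the toy matrix, the duplicated rule pair has one off-diagonal correlation of 1 in each triangle, so the Frobenius norm is $\sqrt{2}$ and $\rho = \sqrt{2}/2$, while the uncorrelated pair reaches the minimum of 0.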

Selecting the most diverse rule subset is NP-hard; the method therefore uses DPP sampling over the Gram matrix of the rule score vectors. A DPP assigns higher probability to rule subsets whose score vectors span a larger volume (that is, subsets that are more statistically diverse), and it remains efficiently computable for moderate $R$.
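To make the volume intuition concrete, the sketch below greedily picks the rule subset whose Gram submatrix has the largest determinant. This is a deliberate simplification: it swaps the DPP sampling described above for a deterministic greedy approximation, and it is not the paper's sampler. The names `gram_kernel`, `det`, and `greedy_diverse_rules` are hypothetical, and normalizing each score column to unit length is an assumption made here so that the determinant reflects angles between rules rather than their magnitudes.

```python
import math


def gram_kernel(scores):
    """Gram matrix of unit-normalized rule score columns (assumed kernel)."""
    cols = [list(c) for c in zip(*scores)]
    unit = []
    for c in cols:
        norm = math.sqrt(sum(x * x for x in c))
        if norm == 0:
            raise ValueError("all-zero rule column cannot be normalized")
        unit.append([x / norm for x in c])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    d = 1.0
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if abs(a[p][k]) < 1e-12:
            return 0.0  # numerically singular: zero volume
        if p != k:
            a[k], a[p] = a[p], a[k]
            d = -d
        d *= a[k][k]
        for i in range(k + 1, n):
            f = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= f * a[k][j]
    return d


def greedy_diverse_rules(kernel, r):
    """Greedily add the rule that maximizes det of the selected submatrix."""
    if r > len(kernel):
        raise ValueError("cannot select more rules than are available")
    selected, remaining = [], set(range(len(kernel)))
    for _ in range(r):
        best, best_det = None, -1.0
        for j in sorted(remaining):
            idx = selected + [j]
            sub = [[kernel[p][q] for q in idx] for p in idx]
            value = det(sub)
            if value > best_det:
                best, best_det = j, value
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy scores: rule 1 duplicates rule 0, so a diverse pair must include rule 2.
S = [[0.9, 0.9, 0.1], [0.1, 0.1, 0.8], [0.5, 0.5, 0.5]]
chosen_rules = greedy_diverse_rules(gram_kernel(S), r=2)
```

Because rules 0 and 1 are identical, any subset containing both has determinant 0 (zero volume), so the greedy step avoids pairing them. A true DPP sampler would instead draw subsets at random with probability proportional to this determinant.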

Data Selection and Sampling

After orthogonal rule selection, all data samples are rated using these $r$ rules. Aggregated sample scores (typically the mean across rules) yield a distribution over $\mathcal{D}$. Rather than hard top-$k$ selection, stochastic sampling is applied (softmax over scores with temperature parameter $\tau$), introducing further diversity and mitigating bias toward outlier high-scoring samples.
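The sampling step can be illustrated as follows. This is a sketch under stated assumptions rather than the paper's procedure: the helper names `softmax` and `sample_subset` are hypothetical, the mean is used as the aggregate, and samples are drawn without replacement one at a time, which the text above does not specify.

```python
import math
import random


def softmax(values, tau):
    """Numerically stable softmax of values / tau."""
    if tau <= 0:
        raise ValueError("temperature tau must be positive")
    top = max(values)
    exps = [math.exp((v - top) / tau) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_subset(rule_scores, k, tau=0.5, seed=0):
    """Draw k distinct sample indices with softmax(mean score / tau) weights.

    rule_scores: list of rows, one per data sample, each holding r rule scores.
    Sampling without replacement is an assumption of this sketch.
    """
    if k > len(rule_scores):
        raise ValueError("k exceeds the number of samples")
    rng = random.Random(seed)
    aggregated = [sum(row) / len(row) for row in rule_scores]
    weights = softmax(aggregated, tau)
    pool = list(range(len(aggregated)))
    chosen = []
    for _ in range(k):
        # random.choices accepts unnormalized weights, so no renormalization
        # is needed after removing the previously chosen sample.
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(pick))
        weights.pop(pick)
    return chosen


scores = [[0.2, 0.4], [0.9, 0.7], [0.5, 0.5], [0.1, 0.3]]
subset = sample_subset(scores, k=2, tau=0.5, seed=0)
```

A small $\tau$ concentrates probability on the highest-scoring samples and approaches top-$k$ selection, while a large $\tau$ approaches uniform sampling. With very small $\tau$ the weights of low-scoring samples can underflow to zero, so a production version would need to guard against that case.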

Downstream Application

Fine-tuning is performed on the selected subset for domain-specific tasks (IMDB sentiment, Medical MMLU, GSM8K math, and code generation). The pipeline is validated across multiple LLM backbones (Pythia-1B, Llama3-8B with LoRA) and compared against a suite of competitive baselines: rule-free methods (Uniform, DSIR, LESS), prior rule-based methods (QuRating, GPT-Uncorrelated), and the AllData baseline.

Empirical Evaluation and Results

Ground-Truth Alignment and Ablation

Experiments show a strong positive correlation between rule diversity (low $\rho$) and alignment with human ground-truth ratings. Across $10^6$ sampled rule subsets, even randomly chosen domain-specific rules outperform static, general-purpose sets (e.g., QuRating). DPP-selected rules achieve near-maximal accuracy, highlighting the importance of adaptive, task-aware rule selection.

Fine-Tuning Performance

Consistently, models trained on DPP-selected data subsets outperform all baselines across all domains. Notably:

On IMDB sentiment, Medical MMLU, Math, and Code tasks, DPP selection yields higher accuracy than DSIR, LESS, QuRating, and GPT-Uncorrelated.
Statistical significance is confirmed via $t$-tests across multiple independent trials.
DPP-based rule subsets have demonstrably lower average pairwise correlation compared to random selection or LLM-generated "uncorrelated" rules.

Qualitative Effects

Distributional analyses of the selected samples show that DPP-based selection shifts the data toward the target domain (for example, a higher concentration of code-related sources for code tasks) and toward richer text (greater length and bigram entropy for review tasks). This reflects closer alignment between the rule-selected data and the downstream task, an effect not observed with rule-free or static rule-based baselines.

Robustness and Generality

The method generalizes well: results hold across data sizes (with diminishing returns as quantity increases—supporting "less is more" claims), backbone architectures, and domains. Additional experiments confirm performance stability when varying $r$ within reasonable bounds; rule orthogonality consistently tracks rating accuracy.

Practical and Theoretical Implications

Automated Rule Evaluation

This work introduces the first rigorous, fully automated metric for rule evaluation in LLM-as-a-judge pipelines, enabling systematic study of rule diversity, redundancy, and downstream task adaptation. The mathematical relationship between orthogonal rule selection and robust data rating is formalized and empirically validated.

Generalization Beyond Fine-Tuning

The framework is modular and can be integrated into pre-training data selection, RLHF pipelines, and annotation workflows. By supporting automated rule generation and domain-tailored selection, it can facilitate the construction of scalable, interpretable datasets across diverse modalities and tasks.

Future Directions

Promising extensions include dynamic or weighted rule selection (enabled by orthogonality metrics), application to multi-modal settings, interactive RLHF annotation, and integration with causal or distributional shift detection methods. There is potential for further theoretical development in maximizing data utility for few-shot generalization and curriculum learning in LLMs.

Conclusion

By combining LLM-driven rule generation, rigorous orthogonality-based evaluation, and efficient DPP subset selection, this pipeline advances the state of the art in data selection for LLM fine-tuning. The method surpasses both static rule sets and rule-free baselines in rating accuracy and downstream task performance, generalizes across domains and model architectures, and provides a principled foundation for scalable, interpretable, and efficient LLM training data selection (2410.04715).
