The paper "Automated Data Curation for Robust LLM Fine-Tuning" (Chen et al., 19 Mar 2024 ) introduces CLEAR (Confidence-based LLM Evaluation And Rectification), an automated pipeline designed to improve the quality of instruction tuning datasets for LLMs. The core idea is a data-centric approach: instead of solely focusing on refining the fine-tuning algorithm, CLEAR systematically improves the dataset used for training. This is particularly relevant in real-world scenarios where instruction tuning data is often noisy, containing inaccurate responses, poor formatting, or irrelevant examples, which can significantly degrade the performance of fine-tuned models.
CLEAR operates in two main stages: Auto-Filter and Auto-Correct. Both stages rely on confidence estimates derived from LLMs to make informed decisions about data quality. The key is to perform these modifications conservatively, ensuring that only confidently low-quality data is removed or confidently better alternatives are used for correction.
The CLEAR Pipeline
The pipeline begins with an original instruction tuning dataset consisting of (prompt, target response) pairs.
- Auto-Filter: The first step is to identify and remove low-quality data confidently. This is done before the main fine-tuning process.
- Auto-Correct: After an initial fine-tuning phase (preferably on the Auto-Filtered data), the resulting model is used to generate candidate responses for some or all prompts. These candidates are then evaluated against the original target responses, and confidently better candidates replace the original targets in the dataset.
- Iterative Improvement: The fine-tuned LLM can be retrained on the Auto-Corrected dataset. This process of fine-tuning and data correction can potentially be iterated to further refine the dataset and model.
This process is illustrated in Figure 1 of the paper, showing the flow from original data through filtering and correction steps, leading to an improved dataset for fine-tuning.
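To make the flow concrete, here is a minimal orchestration sketch of the pipeline. The helper names (`auto_filter`, `auto_correct`, `finetune_LLM`) are illustrative placeholders in the spirit of the code snippets below, not an API defined by the paper.

```python
# Illustrative orchestration of the CLEAR pipeline (helper names are assumptions, not from the paper)
def clear_pipeline(base_LLM, dataset, correction_rounds=1):
    filtered = auto_filter(base_LLM, dataset)        # drop confidently low-quality pairs
    model = finetune_LLM(base_LLM, filtered)         # first fine-tuning pass
    for _ in range(correction_rounds):
        corrected = auto_correct(model, base_LLM, dataset)  # replace targets with confidently better candidates
        model = finetune_LLM(model, corrected)              # retrain on the curated dataset
    return model
```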
Confidence-Based Evaluation
A critical component of CLEAR is the method for estimating the quality of responses or comparing two responses. The paper highlights that directly prompting an LLM to score response quality (e.g., on a 1-5 scale, as shown in Table 5) can be unreliable. Instead, CLEAR leverages BSDetector (Chen & Mueller, 2023), a technique that provides confidence estimates (between 0 and 1) about an LLM's output quality or preference decisions.
BSDetector works by considering two factors:
- Observed Consistency: The LLM generates multiple candidate responses for the same prompt (e.g., via temperature sampling). Confidence is higher if the target response is semantically similar to these diverse generations.
- Self-Reflection Certainty: The LLM is also prompted to directly evaluate the target response and report its confidence.
These factors are combined to produce a single confidence score. This approach is model-agnostic, working with any LLM (including black-box APIs like GPT-3.5/4), and doesn't require access to model parameters or specific training. The paper's experiments (Figure 2, Table 3) show that this confidence-based approach is more effective at identifying low-quality data than direct LLM scoring.
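As a rough illustration of how these two factors yield a single score, the sketch below combines them with a weighted average. The sampling count `k`, the `semantic_similarity` and `self_reflect` helpers, and the weight are assumptions for illustration, not values prescribed by BSDetector.

```python
def quality_confidence(llm, prompt, target, k=5, weight=0.7):
    """Sketch of a BSDetector-style confidence score (helpers and weight are illustrative)."""
    # Observed consistency: sample k diverse responses and measure agreement with the target.
    samples = [llm.generate(prompt, temperature=1.0) for _ in range(k)]
    consistency = sum(semantic_similarity(target, s) for s in samples) / k

    # Self-reflection certainty: ask the LLM to directly assess whether the target is correct,
    # mapped to a value in [0, 1].
    reflection = llm.self_reflect(prompt, target)

    # Combine the two factors into a single confidence score in [0, 1].
    return weight * consistency + (1 - weight) * reflection
```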
Implementing Auto-Filter
The Auto-Filter stage aims to create a cleaner subset of the original dataset for initial fine-tuning.
Implementation Steps:
- Confidence Estimation: For every (prompt, response) pair in the original dataset, use the base pre-trained LLM and the BSDetector method to compute a confidence score that the response is a high-quality answer to its prompt.
```python
from bsdetector import BSDetector  # Assuming a library implementation of BSDetector

base_LLM = load_base_LLM()  # Load your base LLM or configure API access
bsdetector = BSDetector(base_LLM)

dataset = load_instruction_tuning_data()  # List of (prompt, response) tuples

# Compute a confidence score for every (prompt, response) pair
confidence_scores = []
for prompt, response in dataset:
    confidence = bsdetector.estimate_quality_confidence(prompt, response)
    confidence_scores.append(confidence)

# Store scores alongside the data: [(prompt_i, response_i, confidence_i), ...]
annotated_dataset = [
    (prompt, response, score)
    for (prompt, response), score in zip(dataset, confidence_scores)
]
```
- Set Threshold: Determine a confidence threshold γ. The paper uses the median confidence score of the dataset as a simple heuristic. Alternatively, γ could be tuned on a small validation set or set based on manual inspection of examples around different confidence levels.
```python
import numpy as np

# Example: Using the median confidence as the threshold gamma
gamma = np.median(confidence_scores)
```
- Filter Data: Create the filtered dataset by keeping only the pairs whose confidence score exceeds γ.
```python
filtered_dataset = [(p, r) for p, r, c in annotated_dataset if c > gamma]
```
- Fine-tune: Fine-tune the LLM on the `filtered_dataset`. This is the first fine-tuning pass.

```python
finetune_LLM(base_LLM, filtered_dataset)  # Use your fine-tuning script/API
```
Practical Considerations for Auto-Filter:
- Computational Cost: Running BSDetector involves multiple LLM calls per data point, which can be expensive, especially for large datasets and expensive models (like GPT-4). Optimizing BSDetector calls or using a cheaper base model for this stage might be necessary.
- Threshold γ: Setting γ is a trade-off. A high γ removes more potentially noisy data but also reduces the total training set size. A low γ retains more data but includes more noise. The median heuristic is simple but may not be optimal for all datasets; a quick threshold sweep (sketched after this list) can make the trade-off concrete.
- Base LLM Choice: Using the same base LLM for BSDetector as is being fine-tuned ensures the confidence estimates are relevant to the model's capabilities, which is a key aspect of the paper's methodology.
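To see how different choices of γ would play out on a given dataset, the following sketch sweeps candidate thresholds (reusing `confidence_scores` from the earlier snippet) and reports how much data each would retain; the percentile grid is just an example.

```python
import numpy as np

# Sweep candidate thresholds and report how much data each choice would retain
for pct in (25, 50, 75, 90):
    gamma_candidate = np.percentile(confidence_scores, pct)
    kept = sum(c > gamma_candidate for c in confidence_scores)
    print(f"{pct}th percentile (gamma={gamma_candidate:.2f}): "
          f"keeps {kept}/{len(confidence_scores)} examples")
```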
Implementing Auto-Correct
The Auto-Correct stage aims to improve low-quality examples, whether they were filtered out or retained but flagged as potentially problematic.
Implementation Steps:
- Generate Candidate Responses: Use the LLM fine-tuned on the Auto-Filtered data (or the original data if skipping Auto-Filter) to generate a candidate response for each prompt. This is done for examples where the original response was flagged as low confidence (e.g., confidence ≤ γ). The paper shows benefits in using the fine-tuned model for this step compared to the base model (Table 4).
```python
finetuned_LLM = load_finetuned_LLM()  # Load the model trained on filtered_dataset

corrected_dataset = []
# Iterate through the original dataset or the filtered-out portion
for prompt, original_response, confidence in annotated_dataset:
    if confidence <= gamma:  # Or some other criterion for potential correction
        candidate_response = finetuned_LLM.generate(prompt)
        # Keep track of original and candidate for comparison
        corrected_dataset.append((prompt, original_response, candidate_response))
    else:
        # Keep high-confidence examples as they are (no correction needed)
        corrected_dataset.append((prompt, original_response, original_response))
```
- Evaluate Candidate vs. Original: For examples where a candidate response was generated, use an LLM-as-judge approach to determine whether the candidate is better than the original target. The paper uses the base LLM and the prompt from Table 1 for this evaluation. BSDetector is then used to estimate the confidence in this judgment (i.e., confidence that the judge's verdict is correct, specifically that the candidate is indeed better than the original).
```python
# corrected_dataset holds (prompt, original_response, candidate_response) tuples
eta = 0.8  # Confidence threshold for accepting a correction (the paper uses eta = 0.8)

final_dataset_for_finetuning = []
for prompt, original_response, candidate_response in corrected_dataset:
    if original_response == candidate_response:
        # No correction attempted or needed
        final_dataset_for_finetuning.append((prompt, original_response))
        continue

    # Use the base LLM as judge with the prompt from Table 1 (pseudo-code for invoking the judge)
    judge_output = base_LLM.judge(prompt, original_response, candidate_response)

    # Use BSDetector to estimate confidence in the judge's verdict ("[[B]]" meaning the
    # candidate response is better). This requires adapting BSDetector to preference
    # judgments, as described in Chen and Mueller (2023) [2308.16175].
    preference_confidence = bsdetector.estimate_preference_confidence(
        prompt, original_response, candidate_response, judge_output
    )

    if judge_output == "[[B]]" and preference_confidence > eta:
        # Candidate is confidently better: replace the original response
        final_dataset_for_finetuning.append((prompt, candidate_response))
    else:
        # If the candidate is not confidently better, the example is filtered out entirely
        # (following the logic in Figure 3). A variant could instead keep the original
        # response, e.g., depending on its original confidence score.
        print(f"Example filtered out: Prompt='{prompt[:50]}...', Preference Conf: {preference_confidence}")
```

(Note: the exact BSDetector function `estimate_preference_confidence` is conceptual here, based on the paper's description of BSDetector estimating confidence for preference predictions.)

- Fine-tune (Again): Fine-tune the LLM on the resulting `final_dataset_for_finetuning`. This dataset contains high-confidence original examples and examples where the original response was replaced by a confidently better LLM-generated candidate.

```python
finetune_LLM(finetuned_LLM, final_dataset_for_finetuning)  # Retrain on the refined dataset
```
Practical Considerations for Auto-Correct:
- Computational Cost: Generating candidate responses requires LLM calls. The LLM-as-judge step and its BSDetector confidence estimation also add computational overhead.
- Threshold η: The threshold η controls how aggressively corrections are applied. A higher η means fewer corrections but higher confidence in the changes.
- Choice of LLM for Correction: Using the fine-tuned LLM to generate candidates is beneficial because it is specialized to the domain. Using the base LLM as a judge provides a more objective assessment, less influenced by the fine-tuned model's potential biases or errors.
- Iterative Refinement: The paper suggests the process can be iterated. Each iteration might use the newly fine-tuned model to generate candidates for the next round of correction. The number of iterations is a practical hyperparameter.
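A minimal sketch of such an iteration, reusing the pieces above; the fixed number of rounds and the `auto_correct` helper (wrapping the candidate-generation and judging code shown earlier) are assumptions, not part of the paper's API.

```python
num_correction_rounds = 2  # Number of correction rounds (a practical hyperparameter)

model = finetuned_LLM  # Model trained on the Auto-Filtered data
for _ in range(num_correction_rounds):
    # auto_correct wraps candidate generation, judging, and BSDetector confidence as sketched above
    corrected = auto_correct(generator=model, judge=base_LLM,
                             data=annotated_dataset, gamma=gamma, eta=0.8)
    model = finetune_LLM(model, corrected)
```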
Real-World Applications and Implications
CLEAR's practical value lies in its ability to systematically improve data quality for instruction tuning without requiring manual data annotation or relying on stronger teacher models that may be unavailable or expensive (e.g., using GPT-4 outputs to fine-tune Llama 2). This makes it applicable in scenarios where:
- Domain-Specific Fine-tuning: Datasets are collected for niche domains where generic powerful models might not perform well, and creating high-quality data manually is expensive.
- Noisy Public Datasets: Fine-tuning on publicly available datasets that are known to contain errors or inconsistencies (like datasets scraped from the web or user interactions).
- Improving Existing Models: Enhancing the performance of an already fine-tuned model by curating a better dataset for subsequent training rounds.
- Data Scarcity (relative): While filtering removes data, the Auto-Correct stage attempts to salvage potentially useful prompts by fixing responses, mitigating the impact of simple filtering alone.
The paper's results across SQUAD-N, Emails-N, and DROP-N datasets (Tables 2, 3, 4) demonstrate consistent improvements in both response accuracy and format adherence (Valid JSON %). This highlights that investing in data curation via methods like CLEAR can be more impactful than solely focusing on model or algorithm changes, aligning with the principles of data-centric AI.
Limitations
A key limitation mentioned in the paper is that CLEAR does not explicitly account for or mitigate biases present in the original dataset. If the training data contains harmful biases, the fine-tuned model might perpetuate or even amplify them, even with corrected responses, as the underlying patterns of bias might remain in the data distribution or be introduced by the LLM judge or generator. This is a crucial consideration for deploying models trained with CLEAR in sensitive applications.
In summary, the CLEAR pipeline offers a practical, automated framework for enhancing the quality of instruction tuning data for LLMs by leveraging LLM-derived confidence scores to filter and correct data. Its model-agnostic nature and ability to improve models without relying on stronger teacher models make it a valuable tool for practitioners dealing with real-world, noisy datasets for specialized LLM tasks. Implementing CLEAR involves integrating confidence estimation (like BSDetector) and LLM-as-judge components into a standard fine-tuning workflow, considering the associated computational costs and the tuning of confidence thresholds.