#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models

Published 14 Aug 2023 in cs.CL, cs.AI, and cs.LG | (2308.07074v2)

Abstract: Foundation LLMs obtain the instruction-following ability through supervised fine-tuning (SFT). Diversity and complexity are considered critical factors of a successful SFT dataset, while their definitions remain obscure and lack quantitative analyses. In this work, we propose InsTag, an open-set fine-grained tagger, to tag samples within SFT datasets based on semantics and intentions and define instruction diversity and complexity regarding tags. We obtain 6.6K tags to describe comprehensive user queries. Then we analyze popular open-sourced SFT datasets and find that the model ability grows with more diverse and complex data. Based on this observation, we propose a data selector based on InsTag to select 6K diverse and complex samples from open-source datasets and fine-tune models on InsTag-selected data. The resulting models, TagLM, outperform open-source models based on considerably larger SFT data evaluated by MT-Bench, echoing the importance of query diversity and complexity. We open-source InsTag in https://github.com/OFA-Sys/InsTag.


Authors (8)

Citations (48)


Summary

The paper introduces InsTag, a system that uses ChatGPT to automatically tag samples in supervised fine-tuning (SFT) datasets for LLMs.
It applies a tag normalization process to address lexical inconsistencies, uncontrolled granularity, and spurious correlations in the raw tags.
An analysis of 17 datasets, together with models fine-tuned on InsTag-selected samples, indicates that more diverse and complex data improves alignment with human expectations.

Overview of Instruction Tagging for Analyzing SFT of LLMs

The paper investigates supervised fine-tuning (SFT) of LLMs using InsTag, an automated instruction tagging approach that categorizes samples in SFT datasets by their semantics and intentions. The resulting tags make it possible to quantify instruction diversity and complexity, which in turn helps refine SFT data so that LLMs align better with human preferences.

Methodology and Implementation

The authors introduce InsTag as an open-set, fine-grained tagging system that uses ChatGPT to annotate each sample with tags describing the semantics and intentions of the user query. This approach yields around 6,600 unique tags, which serve as the vocabulary for measuring the diversity and complexity of SFT datasets.
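As an illustration of the tagging step, the sketch below builds a prompt for a chat model and parses its reply into a list of tags. The prompt wording, the JSON reply format with `tag` and `explanation` keys, and the function names `build_prompt` and `parse_tags` are assumptions made for this example, not the authors' exact implementation; the call to the model itself is deliberately left out.

```python
import json
from typing import List

# Hypothetical prompt template; the authors' exact wording is not given here.
PROMPT_TEMPLATE = (
    "Identify the intentions in the user query below and return a JSON list "
    'of objects with the keys "tag" and "explanation".\n\nQuery: {query}'
)


def build_prompt(query: str) -> str:
    """Fill the (assumed) prompt template with one user query."""
    return PROMPT_TEMPLATE.format(query=query)


def parse_tags(raw_response: str) -> List[str]:
    """Extract tag strings from a JSON reply; return [] if it is malformed."""
    try:
        items = json.loads(raw_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    tags = []
    for item in items:
        # Keep only well-formed entries that carry a non-empty string tag.
        if isinstance(item, dict) and isinstance(item.get("tag"), str):
            tag = item["tag"].strip()
            if tag:
                tags.append(tag)
    return tags
```

Because the tagger is open-set, the parser accepts any tag string rather than checking it against a fixed label list; cleaning up the resulting vocabulary is left to a later normalization step.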

The paper identifies three main types of noise in the raw tagging results: lexical inconsistencies, uncontrolled granularity, and spurious correlations. To address them, the authors apply a normalization process that makes the tag lexicon compact and consistent. The process consists of frequency filtering, rule aggregation, semantic aggregation through embeddings, and association aggregation using frequent-pattern mining algorithms such as FP-Growth.
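The first two normalization steps can be sketched with the standard library alone. The example below is a minimal illustration under stated assumptions: the threshold `min_count`, the specific cleaning rule (lowercasing and collapsing punctuation), and the function names are hypothetical, and the embedding-based semantic aggregation and FP-Growth association aggregation are omitted.

```python
import re
from collections import Counter
from typing import Dict, List


def rule_normalize(tag: str) -> str:
    """Lowercase a tag and collapse runs of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", tag.lower()).strip()


def normalize_tags(
    tagged: Dict[str, List[str]], min_count: int = 2
) -> Dict[str, List[str]]:
    """Apply frequency filtering, then rule aggregation, to per-query tags."""
    # Count in how many queries each raw tag occurs.
    counts = Counter(t for tags in tagged.values() for t in set(tags))
    result = {}
    for qid, tags in tagged.items():
        # Frequency filtering: drop long-tail tags seen in too few queries.
        kept = [t for t in tags if counts[t] >= min_count]
        # Rule aggregation: merge surface variants of the same tag.
        result[qid] = sorted({rule_normalize(t) for t in kept} - {""})
    return result
```

The steps are applied in the order listed above, so a rare surface variant is dropped before it can be merged with its more common form; applying the rules first would instead pool the counts of variants, which is a design choice this sketch does not explore.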

Experimental Evaluation

InsTag is applied to analyze open-source SFT datasets, revealing how dataset diversity and complexity relate to model performance. The analysis covers 17 datasets, including ShareGPT, OpenChat, UltraChat, and WizardLM, which differ in their levels of diversity and complexity.

A key finding is that models fine-tuned on more diverse and complex datasets align better with human expectations. For instance, TagLM, the models fine-tuned on about 6K InsTag-selected samples, outperform open-source models trained on considerably larger SFT datasets when evaluated on MT-Bench.

Analysis and Implications

The study derives two metrics from the tagging results: diversity, measured by tag coverage (the scope of intentions a dataset represents), and complexity, measured by the average number of tags per query (the number of intentions embedded in a query). Both metrics are reported to correlate with improved alignment, which supports the importance of diverse and complex SFT data.
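A minimal sketch of how these two metrics could be computed from per-query tags follows. It assumes diversity is the fraction of a reference tag vocabulary that a dataset covers and complexity is the mean number of distinct tags per query; the function names and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def tag_coverage(dataset_tags: Dict[str, List[str]], vocabulary: Set[str]) -> float:
    """Diversity: share of the reference vocabulary that appears in the dataset."""
    if not vocabulary:
        return 0.0
    seen = {t for tags in dataset_tags.values() for t in tags}
    return len(seen & vocabulary) / len(vocabulary)


def average_tag_count(dataset_tags: Dict[str, List[str]]) -> float:
    """Complexity: mean number of distinct tags per query."""
    if not dataset_tags:
        return 0.0
    total = sum(len(set(tags)) for tags in dataset_tags.values())
    return total / len(dataset_tags)


# Toy example: 3 queries covering 4 of 8 vocabulary tags.
toy = {"q1": ["a", "b"], "q2": ["b"], "q3": ["b", "c", "d"]}
vocab = {"a", "b", "c", "d", "e", "f", "g", "h"}
print(tag_coverage(toy, vocab))   # 0.5
print(average_tag_count(toy))     # 2.0
```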

In practice, InsTag can guide the construction of SFT datasets by identifying diverse and complex samples, so that a small, carefully selected subset can stand in for a much larger collection. Beyond data selection, the tags may also support work on alignment techniques and on model training and evaluation.
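One way such a selector could work is a greedy procedure that visits the most complex queries first and keeps a query only if it contributes a tag not yet covered in the current pass. The sketch below is an assumption-laden illustration of that idea, not a reproduction of the authors' selector; `select_diverse_complex` and its `budget` parameter are hypothetical names.

```python
from typing import Dict, List


def select_diverse_complex(dataset_tags: Dict[str, List[str]], budget: int) -> List[str]:
    """Greedily pick up to `budget` query ids, favoring many and unseen tags."""
    # Most complex queries first; ties are broken by id for determinism.
    remaining = sorted(dataset_tags, key=lambda q: (-len(set(dataset_tags[q])), q))
    selected: List[str] = []
    while remaining and len(selected) < budget:
        covered = set()  # tags covered during this pass only
        leftover = []
        for qid in remaining:
            tags = set(dataset_tags[qid])
            if len(selected) < budget and tags - covered:
                selected.append(qid)  # adds at least one new tag to this pass
                covered |= tags
            else:
                leftover.append(qid)
        if len(leftover) == len(remaining):
            break  # no progress: only untagged queries remain
        remaining = leftover
    return selected


toy = {"a": ["x", "y", "z"], "b": ["x", "y"], "c": ["w"], "d": ["x"]}
print(select_diverse_complex(toy, budget=3))  # ['a', 'c', 'b']
```

Resetting the covered set on each pass lets the selector keep filling the budget after every tag has been seen once, while still preferring queries with more tags.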

Conclusion and Future Work

The paper shows that InsTag is a useful tool for analyzing the composition of SFT datasets for LLMs, and its findings can inform better methods for measuring and improving model alignment. Future work could apply InsTag beyond data selection, for example to comprehensive evaluation and to self-instruct methods, to enhance LLM capabilities across varied tasks.

Overall, this research offers a quantitative, tag-based framework for analyzing and constructing SFT data for LLMs, and it highlights the role of query diversity and complexity in effective alignment.
