LESS: Selecting Influential Data for Targeted Instruction Tuning (2402.04333v3)

Published 6 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Instruction tuning has unlocked powerful capabilities in LLMs, effectively using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills (e.g., reasoning). The challenge lies in identifying the most relevant data from these extensive datasets to effectively develop specific capabilities, a setting we frame as targeted instruction tuning. We propose LESS, an optimizer-aware and practically efficient algorithm to effectively estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. Crucially, LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data. LESS first constructs a highly reusable and transferable gradient datastore with low-dimensional gradient features and then selects examples based on their similarity to few-shot examples embodying a specific capability. Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Furthermore, the selected data is highly transferable: smaller models can be leveraged to select useful data for larger models and models from different families. Our qualitative analysis shows that our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.

LESS: An Efficient Algorithm for Targeted Instruction Tuning in LLMs

Introduction to LESS

LLMs have gained significant traction for their ability to serve as general-purpose chatbots, capable of generating human-like text based on provided instructions. However, for real-world applications that demand specialized capabilities, such as advanced reasoning, the challenge of sifting through extensive instruction tuning datasets to identify and utilize the most relevant data becomes apparent. This process, termed "targeted instruction tuning," is crucial for developing specific skills within LLMs without having to train on the entire dataset, which may contain irrelevant or even counterproductive information.

The proposed solution to this challenge is LESS (Low-rank gradiEnt Similarity Search), an algorithm for selecting influential data from large instruction tuning datasets. LESS estimates data influences with optimizer-aware formulations and performs a low-rank gradient similarity search to pinpoint the examples most likely to improve the model's performance on a given task.
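To make the selection step concrete, the following is a minimal sketch of the final ranking stage, assuming that low-dimensional gradient features for both the candidate pool and a handful of target-task examples have already been computed (how that datastore is built is sketched further below). The tensor and function names are illustrative, not the authors' released API.

```python
import torch

def select_top_fraction(candidate_feats: torch.Tensor,
                        target_feats: torch.Tensor,
                        fraction: float = 0.05) -> torch.Tensor:
    """Return indices of the top `fraction` of candidates by similarity.

    candidate_feats: (N, d) projected gradient features of the training pool.
    target_feats:    (M, d) projected gradient features of the few-shot
                     examples embodying the target capability.
    """
    cand = torch.nn.functional.normalize(candidate_feats, dim=-1)
    targ = torch.nn.functional.normalize(target_feats, dim=-1)
    # Score each candidate by its average cosine similarity to the targets.
    scores = cand @ targ.mean(dim=0)
    k = max(1, int(fraction * scores.numel()))
    return torch.topk(scores, k).indices
```

In this sketch the 5% budget reported in the paper corresponds to `fraction=0.05`; the returned indices identify the subset of the instruction pool used for fine-tuning.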

LESS: The Underlying Mechanism

Compatibility with Instruction Tuning

At its core, LESS adapts existing influence estimation methods to work with the Adam optimizer and with variable-length instruction data. These adaptations matter because classical influence formulations assume plain (stochastic) gradient descent, where the model moves along the raw gradient; LLMs are almost always fine-tuned with Adam, whose per-parameter adaptive updates follow a different direction. LESS therefore measures influence along the Adam update direction, and it relies on normalized (cosine) similarity so that the larger gradient norms typical of shorter sequences do not dominate the selection.
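As a rough illustration of what "optimizer-aware" means here, the snippet below computes the per-example update direction that Adam would take from its running first- and second-moment estimates; it is this direction, rather than the raw gradient, that the influence score is built on. The moment tensors, hyperparameters, and the omission of bias correction are simplifying assumptions for illustration, not the authors' exact implementation.

```python
import torch

def adam_update_direction(grad: torch.Tensor,
                          exp_avg: torch.Tensor,
                          exp_avg_sq: torch.Tensor,
                          beta1: float = 0.9,
                          beta2: float = 0.999,
                          eps: float = 1e-8) -> torch.Tensor:
    """Direction Adam would move the parameters given this example's gradient."""
    m = beta1 * exp_avg + (1 - beta1) * grad           # updated first-moment estimate
    v = beta2 * exp_avg_sq + (1 - beta2) * grad ** 2   # updated second-moment estimate
    return m / (v.sqrt() + eps)                        # element-wise Adam step
```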

Efficiency Through LoRA and Random Projections

To address the computational and storage overhead of working with full-model gradients, LESS employs LoRA (Low-Rank Adaptation) and random projections to construct a gradient datastore. LoRA restricts gradients to a small set of adapter parameters, and random projections compress them further into low-dimensional features that still approximately preserve inner products (in the spirit of the Johnson-Lindenstrauss lemma). The resulting datastore of gradient features supports efficient dataset selection and can be reused for new target tasks, significantly reducing the overall computational cost.
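The sketch below shows one way such a datastore could be assembled, under the assumption that the model's only trainable parameters are its LoRA adapters and that the dataloader yields one example per batch. The dense random projection matrix and `loss_fn` helper are purely illustrative; for realistic parameter counts a memory-efficient projection (e.g., the kind used in TRAK) would replace the explicit matrix.

```python
import torch

def lora_grad_vector(model) -> torch.Tensor:
    """Flatten the gradients of all trainable (LoRA) parameters into one vector."""
    grads = [p.grad.reshape(-1) for p in model.parameters()
             if p.requires_grad and p.grad is not None]
    return torch.cat(grads)

def build_gradient_datastore(model, dataloader, loss_fn,
                             proj_dim: int = 8192) -> torch.Tensor:
    """Return an (N, proj_dim) matrix of randomly projected LoRA gradients."""
    feats, proj = [], None
    for batch in dataloader:                  # assume one example per batch
        model.zero_grad()
        loss_fn(model, batch).backward()      # per-example loss on the adapters
        g = lora_grad_vector(model)
        if proj is None:                      # fixed random projection, d -> proj_dim
            proj = torch.randn(g.numel(), proj_dim, device=g.device) / proj_dim ** 0.5
        feats.append(g @ proj)
    return torch.stack(feats)                 # reusable low-dimensional features
```

Because the projection is fixed, the same datastore can be queried later for any new target task without recomputing gradients over the candidate pool.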

Transferable Knowledge Across Models

A significant advantage of LESS is its ability to select data using gradients from smaller models to induce strong performance in larger models or even different model families. This transferability is crucial for practical applications where computational resources may be limited.

Interpretable Data Selection

LESS diverges from traditional methods that often rely on surface form cues for data selection. Instead, it focuses on identifying data that showcases similar reasoning and skill types required for the target task. This approach ensures that the selected data aligns more closely with the specific capabilities being targeted, rather than merely matching on language or topic.

Experimental Findings and Implications

The effectiveness of LESS is demonstrated through experiments on diverse downstream tasks, where training on a 5% subset selected by LESS often outperforms training on the full dataset. This result underscores the potential of LESS to enable more focused and efficient training, especially when the available instruction-tuning pool is far larger than the in-domain data a specialized task actually requires.

Additionally, the ability of LESS to select transferable data across models introduces a promising avenue for reducing the computational costs associated with data selection and model training. Smaller models can be utilized to curate training datasets for larger, more complex models, facilitating a more resource-efficient workflow without compromising performance.

The Road Ahead

While LESS presents a significant advance in targeted instruction tuning for LLMs, several avenues remain open for further exploration. These include extending LESS for real-time model adaptation, optimizing the algorithm for even greater efficiency, and investigating its potential for reducing unintended model biases by selectively focusing on data that promotes fairness and inclusivity.

In summary, LESS stands as a testament to the potential of intelligent data selection in unlocking more specialized and efficient capabilities within the field of LLMs, paving the way for their broader application across a myriad of tasks demanding high degrees of specificity and complexity.

Authors (5)
  1. Mengzhou Xia
  2. Sadhika Malladi
  3. Suchin Gururangan
  4. Sanjeev Arora
  5. Danqi Chen