Abstract: “In this work we explore recent advances in instruction-tuning language models on a range of open instruction-following datasets. Despite recent claims that open models can be on par with state-of-the-art proprietary models, these claims are often accompanied by limited evaluation, making it difficult to compare models across the board and determine the utility of various resources. We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets ranging from manually curated (e.g., OpenAssistant) to synthetic and distilled (e.g., Alpaca) and systematically evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities through a collection of automatic, model-based, and human-based metrics. We further introduce Tülu, our best performing instruction-tuned model suite finetuned on a combination of high-quality open resources. Our experiments show that different instruction-tuning datasets can uncover or enhance specific skills, while no single dataset (or combination) provides the best performance across all evaluations. Interestingly, we find that model and human preference-based evaluations fail to reflect differences in model capabilities exposed by benchmark-based evaluations, suggesting the need for the type of systemic evaluation performed in this work. Our evaluations show that the best model in any given evaluation reaches on average 87% of ChatGPT performance, and 73% of GPT-4 performance, suggesting that further investment in building better base models and instruction-tuning data is required to close the gap. We release our instruction-tuned models, including a fully finetuned 65B Tülu, along with our code, data, and evaluation framework at https://github.com/allenai/open-instruct to facilitate future research.”
Overview of "How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources"
This paper provides a comprehensive evaluation of instruction-tuning language models on a diverse range of publicly available datasets. The authors aim to assess whether open models can rival proprietary counterparts such as ChatGPT and GPT-4 by systematically evaluating their factual knowledge, reasoning, multilinguality, coding abilities, safety, and open-ended instruction-following skills.
The authors introduce a broad set of instruction-tuned models, varying in size from 6.7B to 65B parameters and trained on 12 distinct instruction datasets spanning both manually curated (e.g., OpenAssistant) and synthetic/distilled (e.g., Alpaca) data. A key contribution is the release of Tülu, their best-performing model suite, fine-tuned on a combination of high-quality open resources and including a fully finetuned 65B model.
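To ground the training setup, here is a minimal sketch of supervised instruction tuning with Hugging Face transformers. The model name, chat format, and loss handling are illustrative assumptions rather than the authors' exact recipe (their actual training code lives in the open-instruct repository).

```python
# Minimal sketch of supervised instruction tuning: train a causal LM on
# (instruction, response) pairs with the standard next-token loss.
# Model, chat format, and learning rate are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # assumption: small stand-in for a LLaMA base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One instruction-response pair rendered in a simple chat format.
text = (
    "<|user|>\nExplain instruction tuning in one sentence.\n"
    "<|assistant|>\nIt is finetuning a language model on instruction-response pairs."
)
batch = tokenizer(text, return_tensors="pt")

# Plain causal-LM loss over the full sequence; the paper's recipe masks the
# loss on prompt tokens, which this sketch omits for brevity.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
torch.optim.AdamW(model.parameters(), lr=2e-5).step()
```

In practice this loop runs over a full instruction dataset with batching and prompt-token masking; the snippet shows only a single gradient step.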
Key Findings
- Dataset Impact: The paper reveals that specific datasets enhance corresponding skills, with no single dataset excelling across all evaluated facets. For instance, datasets like CoT were particularly effective for reasoning tasks, while Code-Alpaca improved coding abilities.
- Base Model Importance: Higher-quality base models yield superior performance after instruction tuning; larger models, and those pretrained on more data, consistently outperformed smaller counterparts.
- Evaluation Discrepancies: Intriguingly, models that performed best in model-based evaluations did not always score highest in benchmark-based evaluations, indicating potential biases in model preference-based assessments.
- Competitive Gap: The top-performing models reached, on average, roughly 87% of ChatGPT's performance and 73% of GPT-4's, underscoring the need for better base models and more comprehensive instruction datasets to close this gap (a toy version of this calculation is sketched after this list).
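As a toy illustration of that relative-performance figure, the snippet below averages per-benchmark score ratios between a hypothetical best open model and ChatGPT. All scores are invented placeholders, not numbers from the paper.

```python
# Toy reconstruction of the "fraction of ChatGPT performance" aggregate:
# per benchmark, divide the best open model's score by ChatGPT's, then
# average the ratios. Scores below are placeholders, not the paper's data.
best_open = {"MMLU": 0.55, "GSM8K": 0.40, "Codex-Eval": 0.25, "TydiQA": 0.45}
chatgpt = {"MMLU": 0.67, "GSM8K": 0.62, "Codex-Eval": 0.48, "TydiQA": 0.52}

ratios = [best_open[b] / chatgpt[b] for b in best_open]
print(f"Average fraction of ChatGPT performance: {sum(ratios) / len(ratios):.0%}")
```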
Technical Insights
- Model Development: The paper examines how well models can synthesize knowledge across multiple domains using only open resources. The Tülu models, trained on the combined Human+GPT data mixture, showcase improved performance, highlighting the efficacy of diverse data mixes.
- Safety and Truthfulness: Evaluations on the ToxiGen and TruthfulQA metrics underscore the models' varying tendencies to generate toxic content or misinformation, with larger models often yielding safer outputs.
- Model Assessment: The paper emphasizes the role of systemic evaluation across different capabilities and cautions against relying solely on model-based preferences, which appear sensitive to surface features such as token diversity rather than nuanced comprehension (illustrated in the snippet after this list).
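To make the token-diversity caveat concrete, here is a hypothetical diagnostic (not the authors' code) for the surface feature a model-based judge can end up rewarding.

```python
# Hypothetical diagnostic for the surface feature discussed above: the
# number of distinct tokens in a response. The paper finds that model
# preference scores correlate with such counts, so a judge may favor the
# longer, more varied answer regardless of added substance.
def unique_token_count(text: str) -> int:
    return len(set(text.lower().split()))

terse = "Paris is the capital of France."
verbose = ("Paris, the storied capital of France, is celebrated worldwide "
           "for its art, architecture, gastronomy, and centuries of history.")
print(unique_token_count(terse), unique_token_count(verbose))
```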
Implications and Future Directions
This work exposes the strengths and limitations of current open models and datasets, suggesting several future research directions:
- Enhanced Model Training: Increasing the diversity and coverage of instruction datasets could foster models that better generalize across varied tasks.
- Bias Mitigation: Developing evaluation frameworks that minimize biases and more accurately reflect model capabilities could advance instruction-tuning methodologies.
- Open Research: By open-sourcing the Tülu models, along with their code, data, and evaluation framework, the authors aim to facilitate further exploration and improvements in instruction-tuning paradigms.
In conclusion, this paper sheds light on how instruction tuning with open resources can potentially approach proprietary models’ capabilities, albeit with clear gaps remaining. The detailed analysis and released models serve as a foundation for continued advancements in open-source AI development.
Authors
- Yizhong Wang
- Hamish Ivison
- Pradeep Dasigi
- Jack Hessel
- Tushar Khot
- Khyathi Raghavi Chandu
- David Wadden
- Kelsey MacMillan
- Noah A. Smith
- Iz Beltagy
- Hannaneh Hajishirzi