
Ada-Instruct: Adapting Instruction Generators for Complex Reasoning (2310.04484v3)

Published 6 Oct 2023 in cs.CL and cs.AI

Abstract: Instruction augmentation is a crucial step for unleashing the full potential of LLMs in downstream tasks. Existing Self-Instruct methods primarily simulate new instructions from a few initial instructions with in-context learning. However, our study identifies a critical flaw in this approach: even with GPT-4o, Self-Instruct cannot generate complex instructions of length $\ge 100$, which is necessary in complex tasks such as code completion. To address this issue, our key insight is that fine-tuning open-source LLMs with only ten examples can produce complex instructions that maintain distributional consistency for complex reasoning tasks. We introduce Ada-Instruct, an adaptive instruction generator developed through fine-tuning. We empirically validate Ada-Instruct's efficacy across different applications. The results highlight Ada-Instruct's capacity to generate long, intricate, and distributionally consistent instructions.

Ada-Instruct: Adapting Instruction Generators for Complex Reasoning

The paper "Ada-Instruct: Adapting Instruction Generators for Complex Reasoning," authored by Wanyun Cui and Qianle Wang, introduces a novel approach to instruction generation based on fine-tuning rather than in-context learning (ICL). This work addresses the limitations of existing ICL-based methods in generating long, complex instructions for reasoning tasks such as code completion and mathematical reasoning.

Problem Statement and Background

The prevailing approaches to instruction generation leverage closed-source LLMs and rely heavily on ICL. While effective in many scenarios, ICL struggles to generate instructions of increased complexity and length (≥ 100 tokens), which is crucial for sophisticated tasks such as code completion. The authors highlight that existing Self-Instruct methodologies based on ICL fall short of maintaining the distributional consistency required by downstream tasks.
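To make the baseline concrete, the ICL-based Self-Instruct loop critiqued here can be sketched as follows: a prompt is assembled from a handful of seed instructions and the model is asked to continue the list in the same style. The seed tasks and function name below are illustrative, not taken from the paper's code.

```python
# A few hand-written seed instructions (stand-ins for a task's initial examples).
SEED_INSTRUCTIONS = [
    "Write a function that returns the n-th Fibonacci number.",
    "Given a list of integers, return the two numbers that sum to a target.",
    "Implement a stack with push, pop, and min, all in O(1) time.",
]

def build_icl_prompt(seeds):
    """Format seed instructions as numbered in-context examples and leave the
    next slot open for the model to fill with a new instruction."""
    lines = ["Come up with new programming task instructions."]
    for i, s in enumerate(seeds, 1):
        lines.append(f"Task {i}: {s}")
    lines.append(f"Task {len(seeds) + 1}:")
    return "\n".join(lines)

prompt = build_icl_prompt(SEED_INSTRUCTIONS)
```

The paper's observation is that, even with a strong closed-source model completing such a prompt, the sampled continuations stay short and simple, whereas complex tasks need instructions of 100+ tokens.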

Key Contributions

  1. Fine-Tuning for Instruction Generation: The paper introduces Ada-Instruct, an adaptive instruction generator that uses fine-tuning rather than ICL. The authors demonstrate that even with as few as ten samples, fine-tuning open-source LLMs can produce long and complex instructions that align well with the target distribution of downstream tasks.
  2. Empirical Validation Across Applications: Ada-Instruct's efficacy is empirically validated across diverse applications, including code completion (HumanEval, MBPP), mathematical reasoning (GSM8k, MATH), and commonsense reasoning (CommonsenseQA). The results reveal significant improvements over the base models and current state-of-the-art methods.
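The fine-tuning alternative can be sketched minimally: rather than placing the ~10 seed instructions in a prompt, Ada-Instruct trains an open-source causal LM on them directly, then samples new instructions from the tuned model. The record format and special tokens below are assumptions for illustration; the paper's actual training setup may differ.

```python
def make_finetuning_records(seed_instructions, bos="<s>", eos="</s>"):
    """Wrap each seed instruction as a standalone training sequence for a
    causal LM. After fine-tuning with the usual next-token objective, new
    instructions are obtained by sampling from a BOS-only prompt."""
    return [{"text": f"{bos}{inst}{eos}"} for inst in seed_instructions]

seeds = [
    "Write a function that merges two sorted linked lists.",
    "Given a string s, return the length of the longest palindromic substring.",
]
records = make_finetuning_records(seeds)
```

Because the model's weights, rather than its context window, absorb the seed distribution, generated instructions can be long and structurally complex while remaining distributionally consistent with the target task.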

Numerical Results and Key Findings

  1. Code Completion:
    • On the HumanEval benchmark, Ada-Instruct achieves a pass@1 score of 64.0%, a relative improvement of 47.8% over the 13B-parameter base model Code LLAMA-Python.
    • Comparisons with other state-of-the-art models like WizardCoder and Code LLAMA-Instruct reveal Ada-Instruct’s competitive performance, achieved with fewer initial and fine-tuning data points.
  2. Mathematical Reasoning:
    • Ada-Instruct attains a pass@1 score of 48.6% on GSM8k, improving the performance of the 13B model by 69.3%.
    • It achieves a significant enhancement on the MATH benchmark, highlighting Ada-Instruct's capacity to handle more challenging tasks.
  3. Commonsense Reasoning:
    • On the CommonsenseQA benchmark, Ada-Instruct achieves an accuracy of 75.5%, demonstrating a substantial relative improvement of 28.0% over its base model.
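As a quick arithmetic cross-check, the relative improvements quoted above imply the base-model scores via base = improved / (1 + relative improvement). The helper name is illustrative; the derived values follow purely from the numbers in this summary, not from the paper's tables.

```python
def implied_base(score, rel_improvement):
    """Recover the base-model score implied by an improved score and a
    relative improvement: improved = base * (1 + rel)."""
    return score / (1.0 + rel_improvement)

# Implied base scores from the figures reported above (percentages).
humaneval_base = implied_base(64.0, 0.478)   # ~43.3% pass@1
gsm8k_base     = implied_base(48.6, 0.693)   # ~28.7% pass@1
csqa_base      = implied_base(75.5, 0.280)   # ~59.0% accuracy
```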

Analysis of Instruction Generation

Task Creativity:

Utilizing t-SNE to visualize the instruction distributions, the authors show that Ada-Instruct generates diverse and expansive instructions that align closely with the actual task distribution, going beyond the oversampling of initial training samples.
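The t-SNE analysis described above can be sketched as follows: embed both generated and real instructions, project the pooled embeddings to 2D, and compare the two point clouds. Random vectors stand in for sentence embeddings here, since the paper's embedding model is not reproduced; parameters are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
generated = rng.normal(0.0, 1.0, size=(40, 32))  # stand-in embeddings of generated instructions
real      = rng.normal(0.2, 1.0, size=(40, 32))  # stand-in embeddings of real task instructions

# Joint 2D projection of both sets; a scatter plot of points[:40] vs points[40:]
# then shows how well the generated distribution covers the real one.
points = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(
    np.vstack([generated, real])
)
```

Distributional consistency shows up qualitatively as the generated cloud overlapping and spanning the real cloud, rather than clustering tightly around the handful of seed points.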

Quality and Diversity:

By leveraging ChatGPT for annotation, the paper demonstrates that the quality of the instructions generated by Ada-Instruct approximates that of real samples. Even though a small fraction of generated samples are incorrect, the resulting performance degradation is minimal, illustrating Ada-Instruct's robustness in generating high-quality instructions.

Implications and Future Directions

Practical Implications:
    • The findings advocate for fine-tuning over ICL, especially for tasks requiring complex, long-form instructions.
    • Ada-Instruct's methodology proves cost-effective, utilizing open-source LLMs and reducing reliance on expensive closed-source models (e.g., ChatGPT).

Theoretical Implications:
    • The work challenges the conventional wisdom that ICL is superior for out-of-distribution generalization, instead demonstrating the strong potential of fine-tuning in this domain.
    • The approach presented in Ada-Instruct could pave the way for refining instruction generation strategies, particularly in scenarios where data sparsity and diversity present significant challenges.

Future Developments:
    • Further research could extend Ada-Instruct to additional domains and task types, broadening the applicability of fine-tuning-based instruction generation.
    • Investigations into optimizing the fine-tuning process, possibly through hybrid approaches that blend the strengths of ICL and fine-tuning, are promising avenues for future work.

In conclusion, the Ada-Instruct framework offers a robust alternative to ICL for generating complex instructions, thereby enhancing the adaptability and versatility of LLMs across a variety of reasoning tasks.

References (37)
  1. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  2. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  3. Exploring the landscape of distributional robustness for question answering models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp.  5971–5987, 2022.
  4. Sahil Chaudhary. Code alpaca: An instruction-following llama model for code generation, 2023.
  5. An empirical survey of data augmentation for limited data learning in nlp. Transactions of the Association for Computational Linguistics, 11:191–211, 2023.
  6. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  7. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  8. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  9. Chatgpt outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056, 2023.
  10. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.
  11. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
  12. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689, 2022.
  13. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
  14. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  15. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023a.
  16. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023b.
  17. Generating training data with language models: Towards zero-shot language understanding. Advances in Neural Information Processing Systems, 35:462–477, 2022.
  18. Tuning language models as training data generators for augmentation-enhanced few-shot learning. In International Conference on Machine Learning, pp. 24457–24477. PMLR, 2023.
  19. Few-shot fine-tuning vs. in-context learning: A fair comparison and evaluation. arXiv preprint arXiv:2305.16938, 2023.
  20. R OpenAI. Gpt-4 technical report. arXiv, pp.  2303–08774, 2023.
  21. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
  22. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  23. Generating datasets with pretrained language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  6943–6951, 2021.
  24. Prompting gpt-3 to be reliable. In The Eleventh International Conference on Learning Representations, 2022.
  25. Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.
  26. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047, 2023.
  27. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4149–4158, 2019.
  28. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  29. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  30. Avoiding inference heuristics in few-shot prompt-based finetuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  9063–9074, 2021.
  31. Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  32. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
  33. Codet5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922, 2023.
  34. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023.
  35. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
  36. Zerogen: Efficient zero-shot learning via dataset generation. arXiv preprint arXiv:2202.07922, 2022.
  37. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2019.
Authors (2)
  1. Wanyun Cui (16 papers)
  2. Qianle Wang (3 papers)