LLMCRIT: Teaching Large Language Models to Use Criteria

Published 2 Mar 2024 in cs.CL (arXiv:2403.01069v2)

Abstract: Humans follow criteria when they execute tasks, and these criteria are used directly to assess the quality of task completion. Teaching models to use criteria when providing feedback can therefore help humans or models perform tasks better. However, existing research in this area tends to consider only a limited set of criteria or quality-assessment aspects. To fill this gap, we propose a general framework that enables LLMs to use comprehensive, task-specific criteria when delivering natural language feedback on task execution. In particular, we present a model-in-the-loop framework that semi-automatically derives criteria from collected guidelines for different writing tasks and constructs in-context demonstrations for each criterion. We operationalize this idea on three tasks drawn from real-world scenarios (paper introduction writing, Python code writing, and Reddit post writing) and evaluate our feedback-generation framework with different LLMs. The results reveal the fine-grained effects of incorporating criteria and demonstrations and provide valuable insights into how to teach LLMs to use criteria more effectively.
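The abstract describes prompting an LLM with derived criteria plus per-criterion in-context demonstrations. As an illustrative sketch only (not the paper's implementation; the `Criterion` structure and `build_feedback_prompt` helper are hypothetical names), assembling such a feedback prompt might look like:

```python
# Illustrative sketch: composing a criterion-grounded feedback prompt
# with one in-context demonstration per criterion. All names here are
# assumptions for illustration, not the authors' actual code.

from dataclasses import dataclass, field

@dataclass
class Criterion:
    name: str                 # short label, e.g. "context-before-gap"
    description: str          # natural-language statement of the criterion
    # (excerpt, feedback) pairs serving as in-context demonstrations
    demonstrations: list = field(default_factory=list)

def build_feedback_prompt(task: str, submission: str, criteria: list) -> str:
    """Ask an LLM to critique `submission` against each criterion,
    showing the demonstrations collected for that criterion."""
    lines = [f"Task: {task}",
             "Assess the submission against each criterion below."]
    for c in criteria:
        lines.append(f"\nCriterion: {c.name} -- {c.description}")
        for excerpt, feedback in c.demonstrations:
            lines.append(f"Example excerpt: {excerpt}")
            lines.append(f"Example feedback: {feedback}")
    lines.append(f"\nSubmission:\n{submission}")
    lines.append("Feedback:")
    return "\n".join(lines)
```

The resulting string would be sent to any of the evaluated LLMs; ablating the criteria or the demonstrations from the prompt corresponds to the fine-grained conditions the paper compares.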
