
Automatic Evaluation of Generative Models with Instruction Tuning (2310.20072v1)

Published 30 Oct 2023 in cs.CL and cs.LG

Abstract: Automatic evaluation of natural language generation has long been an elusive goal in NLP. A recent paradigm fine-tunes pre-trained LLMs to emulate human judgements for a particular task and evaluation criterion. Inspired by the generalization ability of instruction-tuned models, we propose a learned metric based on instruction tuning. To test our approach, we collected HEAP, a dataset of human judgements across various NLG tasks and evaluation criteria. Our findings demonstrate that instruction-tuning LLMs on HEAP yields good performance on many evaluation tasks, though some criteria are less trivial to learn than others. Further, jointly training on multiple tasks can yield additional performance improvements, which can be beneficial for future tasks with little to no human-annotated data.
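
The abstract does not spell out the data format, but a learned metric of this kind is typically trained on (instruction, rating) pairs built from human judgements. Below is a minimal sketch of how one such training example might be constructed; the record schema, field names, rating scale, and prompt template are all assumptions for illustration, not the paper's actual HEAP format.

```python
from dataclasses import dataclass

# Hypothetical HEAP-style record: the actual schema is not given in
# the abstract, so these field names are assumptions.
@dataclass
class Judgement:
    task: str        # e.g. "summarization", "dialogue"
    criterion: str   # e.g. "coherence", "fluency"
    source: str      # the input to the NLG system
    output: str      # the system-generated text being judged
    score: int       # human rating, assumed here to be on a 1-5 scale

def to_instruction_example(j: Judgement) -> dict:
    """Turn one human judgement into a (prompt, target) pair for
    instruction tuning; the template wording is illustrative only."""
    prompt = (
        f"Task: {j.task}\n"
        f"Evaluate the following output for {j.criterion} on a scale of 1 to 5.\n"
        f"Input: {j.source}\n"
        f"Output: {j.output}\n"
        f"Rating:"
    )
    return {"prompt": prompt, "target": str(j.score)}

# Toy usage: one judgement becomes one instruction-tuning example.
example = Judgement(
    task="summarization",
    criterion="coherence",
    source="Long news article ...",
    output="Short summary ...",
    score=4,
)
print(to_instruction_example(example))
```

Pooling examples like this across tasks and criteria is what would allow the joint multi-task training the abstract describes, since each criterion is expressed through the instruction rather than a separate model head.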

Authors (2)
  1. Shuhaib Mehri (5 papers)
  2. Vered Shwartz (49 papers)
Citations (1)