
Which is better? Exploring Prompting Strategy For LLM-based Metrics (2311.03754v1)

Published 7 Nov 2023 in cs.CL

Abstract: This paper describes the DSBA submissions to the Prompting LLMs as Explainable Metrics shared task, where systems were submitted to two tracks: the small and large summarization tracks. With advanced LLMs such as GPT-4, evaluating the quality of Natural Language Generation (NLG) has become increasingly important. Traditional similarity-based metrics such as BLEU and ROUGE have been shown to misalign with human evaluation and are ill-suited for open-ended generation tasks. To address this issue, we explore the potential of LLM-based metrics, especially those leveraging open-source LLMs. In this study, a wide range of prompts and prompting techniques is systematically analyzed along three dimensions: prompting strategy, score aggregation, and explainability. Our research focuses on formulating effective prompt templates, determining the granularity of NLG quality scores, and assessing the impact of in-context examples on LLM-based evaluation. Furthermore, three aggregation strategies are compared to identify the most reliable method for aggregating NLG quality scores. To examine explainability, we devise a strategy that generates rationales for the scores and analyze the characteristics of the explanations produced by the open-source LLMs. Extensive experiments provide insights into the evaluation capabilities of open-source LLMs and suggest effective prompting strategies.
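To make the described pipeline concrete, here is a minimal sketch of an LLM-as-metric setup: a direct-scoring prompt with a rationale, score extraction from free-form output, and sampled-score aggregation. The prompt wording, the `llm` callable, and mean aggregation are illustrative assumptions, not the paper's exact templates or chosen strategy.

```python
import re
import statistics

def build_prompt(source: str, summary: str) -> str:
    """Direct-scoring prompt: ask the LLM to rate summary quality on a
    1-5 scale and justify it. Template wording is hypothetical."""
    return (
        "You are a summarization quality evaluator.\n\n"
        f"Source document:\n{source}\n\n"
        f"Summary:\n{summary}\n\n"
        "Rate the overall quality of the summary from 1 (poor) to 5 "
        "(excellent). First give a brief rationale, then output the "
        "score on its own line as 'Score: <number>'."
    )

def parse_score(response: str) -> float | None:
    """Extract the numeric score from the model's free-form response."""
    match = re.search(r"Score:\s*([1-5](?:\.\d+)?)", response)
    return float(match.group(1)) if match else None

def evaluate(source: str, summary: str, llm, n_samples: int = 5) -> float:
    """Sample the metric several times and aggregate the scores.
    `llm` is assumed to be any callable mapping a prompt string to a
    completion string (temperature > 0 so samples differ)."""
    scores = []
    for _ in range(n_samples):
        score = parse_score(llm(build_prompt(source, summary)))
        if score is not None:
            scores.append(score)
    # Mean aggregation shown here; the paper compares three strategies,
    # and alternatives such as median or majority vote are common.
    return statistics.mean(scores) if scores else float("nan")
```

Requesting the rationale before the score also yields the explanation text whose characteristics the paper analyzes.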

Authors (7)
  1. Joonghoon Kim (3 papers)
  2. Saeran Park (1 paper)
  3. Kiyoon Jeong (3 papers)
  4. Sangmin Lee (85 papers)
  5. Seung Hun Han (1 paper)
  6. Jiyoon Lee (2 papers)
  7. Pilsung Kang (28 papers)
Citations (7)