Overview of SEM SCORE Metric
The paper critiques the current state of automated evaluation metrics for instruction-tuned LLMs, underscoring the poor scalability of manual review and the misalignment of conventional metrics such as BLEU and ROUGE-L with human judgment. The authors propose a new metric, SEM SCORE, which leverages semantic textual similarity (STS) to compare LLM outputs against gold-standard target responses. Their analysis shows that SEM SCORE correlates more strongly with human evaluations than the other metrics they examine.
Human-Judged Ranking and Automated Metrics
The researchers built a human-judged ranking from a dataset of 252 natural-language tasks suited to instruction-tuned LLMs. Four models were appraised manually: GPT-4, GPT-3.5, LLaMA, and an Alpaca-tuned LLaMA. This qualitative analysis produced a ranked order of LLM effectiveness. Eight existing text-generation metrics were then examined alongside the proposed SEM SCORE for their correlation with the human judgments. Of these, SEM SCORE showed the highest correlation, suggesting its potential superiority as an automated assessment tool.
Introducing SEM SCORE
SEM SCORE is put forward as a direct application of semantic textual similarity. Its two-step process embeds the model response and the target response with a sentence-transformer model, then computes the cosine similarity of the two embeddings. This simplicity makes SEM SCORE an accessible alternative to costly or proprietary evaluation tools, enabling broader application in LLM evaluation.
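Concretely, a minimal sketch of this two-step computation might look like the following, assuming the open-source sentence-transformers library; the choice of the all-mpnet-base-v2 encoder is an illustrative assumption rather than a detail taken from the summary above.

```python
# Minimal sketch of a SEM SCORE-style computation (assumptions: the
# sentence-transformers library and the all-mpnet-base-v2 encoder).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def sem_score(model_response: str, target_response: str) -> float:
    """Embed both responses and return the cosine similarity of the embeddings."""
    embeddings = encoder.encode([model_response, target_response], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

# Example: a model response scored against a gold-standard target response.
print(sem_score("Paris is the capital of France.", "The capital of France is Paris."))
```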
Results and Implications
In a rank-correlation analysis against human judgments, SEM SCORE registered the strongest correlation, followed by G-Eval and BERTScore. However, G-Eval required excluding the LLM under evaluation to avoid bias, whereas SEM SCORE maintained its performance without such an adjustment. The authors observe that while LLM-based evaluation approaches such as G-Eval are promising, they may be susceptible to the biases of the LLMs themselves.
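As an illustration of this kind of rank-correlation check, the sketch below correlates a hypothetical human ranking of four systems with made-up metric scores using Kendall's tau from SciPy; all numbers are placeholders, not results from the paper.

```python
# Illustrative rank-correlation check (all scores are invented placeholders).
from scipy.stats import kendalltau

# Hypothetical human-judged ranking (1 = best) and automated metric scores
# (higher = better) for four systems.
human_rank = [1, 2, 3, 4]
metric_scores = [0.81, 0.74, 0.69, 0.55]

# Negate the scores so both sequences order systems from best to worst
# in the same direction before correlating.
tau, p_value = kendalltau(human_rank, [-s for s in metric_scores])
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```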
In conclusion, the paper suggests that SEM SCORE not only aligns more consistently with human evaluative standards but also avoids the pitfalls of proprietary models and the potential biases of LLM-based evaluation methods. Adopting it could therefore streamline and strengthen the assessment of instruction-tuned LLMs. The authors acknowledge, however, that like other reference-based metrics, SEM SCORE requires at least one gold-standard target output and is influenced by the choice of the underlying transformer model used to compute textual similarity.
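To make the last point concrete, the short sketch below scores the same response pair under two different sentence-transformer encoders; both model names are illustrative assumptions, and the point is only that the resulting similarity values, and hence any downstream ranking, can shift with the choice of encoder.

```python
# Sketch of SEM SCORE's sensitivity to the underlying embedding model
# (both encoder names are assumptions chosen for illustration).
from sentence_transformers import SentenceTransformer, util

model_response = "Paris is the capital of France."
target_response = "The capital of France is Paris."

for name in ("all-mpnet-base-v2", "all-MiniLM-L6-v2"):
    encoder = SentenceTransformer(name)
    embeddings = encoder.encode([model_response, target_response], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"{name}: cosine similarity = {similarity:.3f}")
```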