Overview of SEM SCORE Metric
The paper critiques the current state of automated evaluation metrics for instruction-tuned LLMs, underscoring the poor scalability of manual review and the misalignment of conventional metrics such as BLEU and ROUGE-L with human judgment. The authors propose a new metric, SEM SCORE, which leverages semantic textual similarity (STS) to compare LLM outputs against gold-standard target responses. Their analysis shows that SEM SCORE correlates more strongly with human evaluations than the other metrics they examine.
Human-Judged Ranking and Automated Metrics
The researchers built a human-judged ranking from a dataset of 252 natural-language tasks suited to instruction-tuned LLMs. Four models were appraised manually: GPT-4, GPT-3.5, LLaMA, and an Alpaca-tuned LLaMA. This qualitative analysis produced a ranked order of LLM effectiveness. Eight existing text-generation metrics were then examined alongside the proposed SEM SCORE for their correlation with the human judgments. Of these, SEM SCORE showed the highest correlation, suggesting its potential superiority as an automated assessment tool.
Introducing SEM SCORE
SEM SCORE is put forward as a direct application of semantic textual similarity. Its two-step process embeds the model response and the target response with a sentence-transformer model, then computes the cosine similarity of the two embeddings. This simplicity makes SEM SCORE an accessible alternative to costly or proprietary evaluation tools, enabling broader application in LLM evaluation.
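Concretely, a minimal sketch of this two-step computation might look like the following, assuming the open-source sentence-transformers library; the choice of the all-mpnet-base-v2 encoder is an illustrative assumption rather than a detail taken from the summary above.

```python
# Minimal sketch of a SEM SCORE-style computation (assumptions: the
# sentence-transformers library and the all-mpnet-base-v2 encoder).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def sem_score(model_response: str, target_response: str) -> float:
    """Embed both responses and return the cosine similarity of the embeddings."""
    embeddings = encoder.encode([model_response, target_response], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

# Example: a model response scored against a gold-standard target response.
print(sem_score("Paris is the capital of France.", "The capital of France is Paris."))
```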
Results and Implications
In a rank-correlation analysis against human judgments, SEM SCORE registered the strongest correlation, followed by G-Eval and BERTScore. However, G-Eval required excluding the LLM under evaluation to avoid bias, whereas SEM SCORE maintained its performance without such an adjustment. The authors observe that while LLM-based evaluation approaches such as G-Eval are promising, they may be susceptible to the biases of the LLMs themselves.
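As an illustration of this kind of rank-correlation check, the sketch below correlates a hypothetical human ranking of four systems with made-up metric scores using Kendall's tau from SciPy; all numbers are placeholders, not results from the paper.

```python
# Illustrative rank-correlation check (all scores are invented placeholders).
from scipy.stats import kendalltau

# Hypothetical human-judged ranking (1 = best) and automated metric scores
# (higher = better) for four systems.
human_rank = [1, 2, 3, 4]
metric_scores = [0.81, 0.74, 0.69, 0.55]

# Negate the scores so both sequences order systems from best to worst
# in the same direction before correlating.
tau, p_value = kendalltau(human_rank, [-s for s in metric_scores])
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```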
In conclusion, the paper suggests that SEM SCORE not only aligns more consistently with human evaluative standards but also avoids the pitfalls of proprietary models and the potential biases of LLM-based evaluation methods. Adopting it could therefore streamline and strengthen the assessment of instruction-tuned LLMs. The authors acknowledge, however, that like other reference-based metrics, SEM SCORE requires at least one gold-standard target output and is influenced by the choice of the underlying transformer model used to compute textual similarity.
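To make the last point concrete, the short sketch below scores the same response pair under two different sentence-transformer encoders; both model names are illustrative assumptions, and the point is only that the resulting similarity values, and hence any downstream ranking, can shift with the choice of encoder.

```python
# Sketch of SEM SCORE's sensitivity to the underlying embedding model
# (both encoder names are assumptions chosen for illustration).
from sentence_transformers import SentenceTransformer, util

model_response = "Paris is the capital of France."
target_response = "The capital of France is Paris."

for name in ("all-mpnet-base-v2", "all-MiniLM-L6-v2"):
    encoder = SentenceTransformer(name)
    embeddings = encoder.encode([model_response, target_response], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"{name}: cosine similarity = {similarity:.3f}")
```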