Machine Translation Evaluation Resources and Methods: A Survey (1605.04515v8)

Published 15 May 2016 in cs.CL

Abstract: We present a survey of Machine Translation (MT) evaluation covering both manual and automatic evaluation methods. The traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness. Advanced human assessments include task-oriented measures, post-editing, segment ranking, and extended criteria. We classify the automatic evaluation methods into two categories: the lexical similarity scenario and the application of linguistic features. The lexical similarity methods cover edit distance, precision, recall, F-measure, and word order. The linguistic features divide into syntactic and semantic features: the syntactic features include part-of-speech tags, phrase types, and sentence structures, while the semantic features include named entities, synonyms, textual entailment, paraphrase, semantic roles, and language models. Deep learning models for evaluation have been proposed only recently. We also introduce methods for evaluating the MT evaluation metrics themselves, including different correlation scores, as well as the recent quality estimation (QE) tasks for MT. This paper differs from existing works \cite{GALEprogram2009,EuroMatrixProject2007} in several aspects: it introduces recent developments in MT evaluation measures, a classification that runs from manual to automatic evaluation measures, an introduction to recent QE tasks for MT, and a concise organization of the content. We hope this work will help MT researchers pick the metrics best suited to their specific MT model development, and give MT evaluation researchers a general picture of how MT evaluation research has developed. Furthermore, this work may also shed light on evaluation tasks in NLP fields other than translation.

The paper "Machine Translation Evaluation Resources and Methods: A Survey" offers a comprehensive overview of the various methods and resources used in evaluating machine translation (MT) systems. The survey spans both manual and automatic evaluation techniques, aimed at providing a holistic understanding to researchers and developers in the field.

Manual Evaluation Methods

Traditional human evaluation criteria are thoroughly discussed, including:

  • Intelligibility: The ease with which a reader understands the translation.
  • Fidelity: The accuracy with which the translation reflects the meaning of the source text.
  • Fluency: The grammatical and stylistic quality of the translation.
  • Adequacy: The degree to which the translation conveys the information in the source text.
  • Comprehension: How well the translation is understood.
  • Informativeness: The richness of the content conveyed by the translation.

Advanced human assessments go further, incorporating:

  • Task-oriented measures: How well the translation supports performance of a specific downstream task.
  • Post-editing: The effort required to correct the translation (see the sketch after this list).
  • Segment ranking: Relative ranking of alternative translations of the same source segment.
  • Extended criteria: Additional qualitative assessments.
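
A common way to quantify post-editing effort is an edit rate in the spirit of HTER (human-targeted translation edit rate): the number of edits needed to turn the MT output into its human post-edit, divided by the post-edit's length. The sketch below is a minimal token-level approximation using plain Levenshtein edits (the full TER metric also counts block shifts); the example sentences are invented.

```python
def levenshtein(a, b):
    """Token-level edit distance (insertions, deletions, substitutions)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # delete a hypothesis token
                        dp[j - 1] + 1,                  # insert a token
                        prev + (a[i - 1] != b[j - 1]))  # substitute (0 cost if equal)
            prev = cur
    return dp[n]


def hter(mt_output: str, post_edited: str) -> float:
    """Edit rate of the MT output against its human post-edit (HTER-style)."""
    hyp, ref = mt_output.split(), post_edited.split()
    return levenshtein(hyp, ref) / max(len(ref), 1)


# 2 edits (substitute "in" -> "on", insert "the") over 6 post-edit tokens
print(hter("the cat sat in mat", "the cat sat on the mat"))  # ~0.333
```

Lower is better: a score of 0 means the post-editor changed nothing.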

Automatic Evaluation Methods

Automatic methods are classified into two main categories: lexical similarity and linguistic features.

Lexical Similarity Scenario

  • Edit Distance: Measures such as Levenshtein distance.
  • Precision, Recall, and F-measure: Based on n-grams shared between the translation and the reference text (see the sketch after this list).
  • Word Order: Metrics concerned with the sequence of words.
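
To make the precision/recall family concrete, here is a minimal sketch of clipped unigram overlap against a single reference. Real metrics in this family (e.g., BLEU) combine several n-gram orders and add a brevity penalty; the example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def precision_recall_f1(hypothesis: str, reference: str, n: int = 1):
    """Clipped n-gram overlap between a hypothesis and one reference."""
    hyp, ref = ngrams(hypothesis.split(), n), ngrams(reference.split(), n)
    overlap = sum((hyp & ref).values())        # counts clipped to the reference
    p = overlap / max(sum(hyp.values()), 1)    # precision
    r = overlap / max(sum(ref.values()), 1)    # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Unigram overlap: 5 of 6 tokens match -> P = R = F1 ~ 0.833
print(precision_recall_f1("the cat sat on the mat", "the cat is on the mat"))
```

The `Counter` intersection takes the minimum count per n-gram, which mirrors BLEU's clipping of each n-gram's count to its reference count for the single-reference case.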

Linguistic Features Application

  • Syntactic Features:
    • Part of Speech (POS) tags.
    • Phrase types and sentence structures.
  • Semantic Features:
    • Named entities.
    • Synonyms and paraphrases (see the WordNet sketch after this list).
    • Textual entailment.
    • Semantic roles.
    • Language models.
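
As one concrete instance of the synonym feature, metrics in the METEOR family relax exact matching by also accepting WordNet synonyms. The sketch below is a greedy simplification rather than any published metric's actual alignment algorithm; it assumes NLTK is installed and the WordNet corpus has been downloaded (`nltk.download("wordnet")`).

```python
from nltk.corpus import wordnet as wn  # one-time setup: nltk.download("wordnet")

def synonym_match(h: str, r: str) -> bool:
    """True if two words are identical or share at least one WordNet synset."""
    return h == r or bool(set(wn.synsets(h)) & set(wn.synsets(r)))

def synonym_unigram_precision(hypothesis: str, reference: str) -> float:
    """Fraction of hypothesis tokens matched exactly or via synonymy;
    each reference token may be consumed by at most one match."""
    hyp, ref = hypothesis.split(), reference.split()
    matched = 0
    for h in hyp:
        for i, r in enumerate(ref):
            if synonym_match(h, r):
                matched += 1
                del ref[i]   # consume this reference token
                break
    return matched / max(len(hyp), 1)

# "kid" and "child" share a WordNet synset, so all three tokens match
print(synonym_unigram_precision("the kid smiled", "the child smiled"))  # 1.0
```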

Deep Learning Models

The paper notes that deep learning-based evaluation models are relatively new and represent an emerging area of research in MT evaluation.
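
The idea these neural metrics share is to compare hypothesis and reference in a learned embedding space rather than by surface n-grams. The toy sketch below illustrates greedy embedding matching; `embed` is a hypothetical stand-in, using fixed random vectors, for the contextual representations a real pretrained encoder would produce.

```python
import numpy as np

# Hypothetical encoder: fixed random vectors per word stand in for the
# contextual embeddings a real pretrained model would produce.
_rng = np.random.default_rng(0)
_vectors = {}

def embed(word):
    if word not in _vectors:
        _vectors[word] = _rng.normal(size=16)
    return _vectors[word]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def greedy_embedding_score(hypothesis: str, reference: str) -> float:
    """Average best cosine similarity of each hypothesis token against the
    reference tokens (a greedy, much simplified embedding-matching score)."""
    hyp = [embed(w) for w in hypothesis.split()]
    ref = [embed(w) for w in reference.split()]
    return sum(max(cosine(h, r) for r in ref) for h in hyp) / max(len(hyp), 1)

print(greedy_embedding_score("the cat sat", "the cat sat"))  # 1.0
print(greedy_embedding_score("the dog ran", "the cat sat"))  # well below 1.0
```

With real embeddings, near-synonyms score highly even without a surface match; with the random stand-ins here, only identical words do.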

Meta-Evaluation and Quality Estimation

The survey also covers how evaluation metrics themselves are assessed, and how translation quality can be estimated without references:

  • Correlation Scores: Statistical measures that quantify how well automatic metric scores agree with human judgments (see the sketch after this list).
  • Quality Estimation (QE): Tasks that predict translation quality without reference texts, offering the potential for real-time evaluation.
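
Concretely, metric validation is usually reported as Pearson, Spearman, or Kendall correlation between metric scores and human scores over a set of systems or segments. A minimal sketch with SciPy, using invented scores:

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Invented example: one metric score and one human adequacy score per system
metric_scores = [0.31, 0.42, 0.27, 0.55, 0.48]
human_scores = [3.1, 3.9, 2.8, 4.6, 4.1]

print("Pearson r:    %.3f" % pearsonr(metric_scores, human_scores)[0])
print("Spearman rho: %.3f" % spearmanr(metric_scores, human_scores)[0])
print("Kendall tau:  %.3f" % kendalltau(metric_scores, human_scores)[0])
```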

Contributions to the Field

The paper stands out from previous works by presenting:

  • Recent developments in MT evaluation metrics.
  • A new classification approach from manual to automatic evaluation measures.
  • An introduction to recent QE tasks in MT.
  • A concise organization of the content for ease of understanding.

Implications

The authors hope this survey will assist MT researchers in selecting suitable evaluation metrics for their models and offer MT evaluation researchers a broad overview of the field's evolution. This survey could also inspire methodologies for evaluating other NLP tasks beyond translation.

In essence, the paper is a pivotal resource for anyone aiming to navigate the complex landscape of MT evaluation, providing both foundational knowledge and insights into the latest advancements.

Authors (1)
  1. Lifeng Han