BARTScore: A Generation-Based NLG Metric
- BARTScore is an unsupervised evaluation metric that models NLG tasks as sequence generation using a pretrained BART model.
- It assesses key dimensions like fluency, adequacy, and faithfulness, outperforming traditional token-overlap metrics.
- Extensions such as BARTScore++ and F-score variants enhance robustness and mitigate bias in applications like machine translation and summarization.
BARTScore is an unsupervised, generation-based automatic evaluation metric for natural language generation (NLG) tasks—including machine translation (MT), summarization, and data-to-text—constructed atop a pretrained BART sequence-to-sequence model. Unlike token-overlap (BLEU, ROUGE) or embedding-similarity (BERTScore) metrics, BARTScore directly models the evaluation task as sequence generation, with metric values based on the conditional log-likelihood under BART. This approach enables computation of flexible, context-aware assessments of hypothesis/reference (or source/hypothesis) pairs, targeting dimensions such as fluency, adequacy, and faithfulness (Yuan et al., 2021).
1. Mathematical Definition and Variants
Given a pretrained BART model parameterized by $\theta$, conditioning text $x$ (reference $r$ or source $s$), and candidate output $y = (y_1, \ldots, y_m)$, the log-probability assigned by BART is:

$$\mathrm{BARTScore} = \sum_{t=1}^{m} \omega_t \log p(y_t \mid y_{<t}, x; \theta)$$

All tokens are typically weighted equally ($\omega_t = 1$) (Yuan et al., 2021, Yan et al., 2023). Variants are induced by the choice of conditioning:
- $\mathrm{BARTScore}_{r \to h}$: assesses adequacy/fluency by evaluating how reconstructable the hypothesis $h$ is from the reference $r$.
- $\mathrm{BARTScore}_{s \to h}$: assesses faithfulness in non-English translation or fact-verification tasks by conditioning on the source $s$.
- $\mathrm{BARTScore}_F$: the F-score, defined as the arithmetic mean of the precision ($r \to h$) and recall ($h \to r$) variants: $\mathrm{BARTScore}_F = \tfrac{1}{2}\big(\mathrm{BARTScore}_{r \to h} + \mathrm{BARTScore}_{h \to r}\big)$.
- Prompting and fine-tuning: Additional prompt tokens (prepended or appended) and task-specific fine-tuning (e.g., on CNN/DM summarization, paraphrase) further adapt BARTScore for informativeness or semantic overlap (Yuan et al., 2021).
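The scoring rule above reduces to a weighted sum of token log-probabilities. A minimal sketch (function names and the toy log-probs are illustrative, not from the released implementation):

```python
def bartscore(token_logprobs, weights=None):
    """Weighted sum of per-token log-probabilities log p(y_t | y_<t, x).

    `token_logprobs` are the candidate's per-token log-probs under BART,
    conditioned on the reference or source (from a model forward pass).
    With uniform weights (omega_t = 1) this is the conditional log-likelihood.
    """
    if weights is None:
        weights = [1.0] * len(token_logprobs)
    return sum(w * lp for w, lp in zip(weights, token_logprobs))


def bartscore_f(score_r_to_h, score_h_to_r):
    """F-score variant: arithmetic mean of precision (r->h) and recall (h->r)."""
    return 0.5 * (score_r_to_h + score_h_to_r)


# Toy example with made-up log-probs for a three-token candidate.
lp = [-0.2, -0.5, -0.1]
print(round(bartscore(lp), 6))            # -0.8
print(round(bartscore_f(-0.8, -1.2), 6))  # -1.0
```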
2. Operational Paradigm and Implementation
BARTScore conceptualizes metric evaluation as a text generation task: the score reflects how likely BART would generate the candidate given a particular input (reference or source). This evaluation paradigm enables BARTScore to capture global fluency, semantic correctness, and syntactic coherence, beyond the local n-gram matching in classical metrics (Yuan et al., 2021, Lu et al., 2022).
Implementation uses publicly available BART checkpoints (“facebook/bart-large”, “facebook/bart-large-cnn”) and the HuggingFace Transformers stack (Yuan et al., 2021). To compare candidates of different lengths, the sequence-level score is length-normalized, i.e., divided by the number of tokens. Practical details, including the code release, batch scoring, and variant computation, are provided in the original repository.
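The length-normalization step can be sketched as follows, assuming per-token log-probabilities have already been extracted from a BART forward pass (the helper name is hypothetical):

```python
def length_normalized_score(token_logprobs):
    """Sequence-level BARTScore divided by candidate length, so candidates
    of different lengths are comparable."""
    if not token_logprobs:
        raise ValueError("empty candidate")
    return sum(token_logprobs) / len(token_logprobs)


# A longer candidate accumulates more total negative log-probability,
# but per-token normalization reveals it is actually the better fit.
short = [-0.4, -0.4]              # total -0.8
long_ = [-0.3, -0.3, -0.3, -0.3]  # total -1.2
print(length_normalized_score(short) < length_normalized_score(long_))  # True
```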
3. Applications in NLG Evaluation
BARTScore has been applied across a broad spectrum of NLG tasks:
- Machine Translation: Used in WMT evaluations; the reference-based score proxies adequacy, while source-based variants target faithfulness in non-English pairs. The F-score variant averages both directions, in line with precision-recall evaluation (Yan et al., 2023).
- Summarization: Assesses informativeness, coverage, and factuality. On benchmarks such as REALSumm, SummEval, and NeR18, vanilla and prompt-variant BARTScore outperform BERTScore, MoverScore, and, in many settings, supervised metrics (Yuan et al., 2021, Lu et al., 2022).
- Data-to-Text: Demonstrates higher Spearman correlation for informativeness against prior unsupervised metrics.
- Factuality: Achieves strong correlation with human judgments on datasets like QAGS and Rank19 (Lu et al., 2022).
Competitive results (Spearman ρ, Kendall τ, Pearson r, ranking accuracy) are reported across MT, summarization, data-to-text, and factuality settings, with BARTScore achieving state-of-the-art alignment with human judgments in multiple experiments (Yuan et al., 2021, Lu et al., 2022).
4. Robustness, Pitfalls, and Mitigation Strategies
Empirical analysis reveals that BARTScore, while context- and semantics-aware, exhibits robustness deficits when used directly as an objective during Minimum Risk Training (MRT):
- Universal Adversarial Translations: Under sequence-level MRT with BARTScore as the target metric, models can collapse to degenerate outputs (e.g., high-probability uninformative repetitions like “Mallorca! Mallorca! ...”), which receive erroneously high BARTScore (Yan et al., 2023).
- Root Causes:
- Distribution biases in the BART pretraining or parallel corpus data induce attractors—frequently observed sentence patterns can become adversarial optima.
- The generation-based metric paradigm rewards sequences that are easy for BART to generate, regardless of semantic soundness.
- Mitigation Techniques:
- Combined MRT + Token-level NLL Objective: Introduce a joint loss $\mathcal{L} = \mathcal{L}_{\mathrm{MRT}} + \alpha\,\mathcal{L}_{\mathrm{NLL}}$, blending global sequence-level and token-level supervision to prevent collapse.
- Metric Ensemble: Construct a composite metric (e.g., $\lambda_1 \cdot$BARTScore $+\ \lambda_2 \cdot$BERTScore) to counteract the aforementioned attractor effects, improving robustness without sacrificing human metric alignment (Yan et al., 2023).
These interventions restore output entropy close to that of human targets and produce substantial gains in standard metrics (e.g., ∼7-point boosts in COMET and UniTE scores, 14.5% relative improvement in COMET/UniTE with robust ensembles) (Yan et al., 2023).
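Both mitigations can be sketched schematically; the interpolation weights `alpha` and `lam` below are hypothetical placeholders, not values from the paper:

```python
def joint_loss(mrt_risk, token_nll, alpha=0.5):
    """Combined sequence-level MRT risk and token-level NLL.

    `alpha` is a hypothetical interpolation weight; the token-level term
    anchors training and prevents collapse to degenerate outputs.
    """
    return mrt_risk + alpha * token_nll


def ensemble_metric(bartscore, bertscore, lam=0.5):
    """Composite metric lam*BARTScore + (1-lam)*BERTScore, intended to break
    attractors that any single metric alone rewards."""
    return lam * bartscore + (1.0 - lam) * bertscore


# A degenerate output may fool BARTScore (high score) but not BERTScore (low),
# so the ensemble ranks the faithful output above the degenerate one.
degenerate = ensemble_metric(bartscore=-0.5, bertscore=-5.0)
faithful = ensemble_metric(bartscore=-1.5, bertscore=-1.0)
print(faithful > degenerate)  # True
```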
5. Extensions: BARTScore++ and Human-like Error Analysis
To address the need for more human-aligned automatic evaluation, BARTScore++ incorporates explicit error analysis inspired by Multidimensional Quality Metrics (MQM) (Lu et al., 2022). The methodology introduces:
- Explicit and Implicit Error Splitting:
- Explicit Errors: Easily identifiable, high-impact errors—e.g., mistranslations, missing or spurious tokens.
- Implicit Errors: Subtle flaws affecting fluency or style.
- Scoring Functions:
- $\mathrm{Dist}_{\mathrm{explicit}} = \mathrm{BARTScore}(\hat{h}) - \mathrm{BARTScore}(h)$, quantifying corrections after automatically detecting and refining explicit errors in the hypothesis $h$ (yielding the refined hypothesis $\hat{h}$).
- $\mathrm{Dist}_{\mathrm{implicit}} = -\mathrm{BARTScore}(\hat{h})$, capturing the remaining gap between the refined hypothesis and a perfect sentence.
- Aggregate score: the negative weighted sum $-\big(w_1\,\mathrm{Dist}_{\mathrm{explicit}} + w_2\,\mathrm{Dist}_{\mathrm{implicit}}\big)$.
- Iterative Detect–Correct Loop: Automated refinement via token-rank substitutions and re-scoring.
- Empirical Gains: Across 20/25 benchmark settings, BARTScore++ outperforms vanilla BARTScore, including substantial increases in correlation with human MT and summarization judgments (Lu et al., 2022).
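The detect-correct loop can be illustrated with a toy scorer standing in for BARTScore (the function names, vocabulary, and scorer below are all illustrative; the actual method ranks substitutions using BART's own token distribution):

```python
def refine(tokens, token_logprobs_fn, vocab, max_iters=3):
    """Toy sketch of the iterative detect-correct loop: locate the token the
    scorer finds least probable, try substituting alternatives from `vocab`,
    and keep a substitution only if the whole-sequence score improves."""
    best = list(tokens)
    best_total = sum(token_logprobs_fn(best))
    for _ in range(max_iters):
        lps = token_logprobs_fn(best)
        worst = min(range(len(lps)), key=lambda i: lps[i])  # detected error
        improved = False
        for cand in vocab:
            trial = best[:worst] + [cand] + best[worst + 1:]
            total = sum(token_logprobs_fn(trial))
            if total > best_total:  # keep correction, then re-score
                best, best_total, improved = trial, total, True
                break
        if not improved:
            break
    return best, best_total


# Toy scorer: assigns log-prob 0 to tokens matching a target sentence.
target = ["the", "cat", "sat"]
def toy_logprobs(tokens):
    return [0.0 if t == g else -2.0 for t, g in zip(tokens, target)]

refined, score = refine(["the", "dog", "sat"], toy_logprobs, vocab=["cat", "mat"])
print(refined, score)  # ['the', 'cat', 'sat'] 0.0
```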
6. Known Limitations and Model Bias
BARTScore is susceptible to several limitations:
- Self-model/Architecture Bias: When used as an evaluator, BARTScore systematically favors outputs from the same (or similarly fine-tuned) underlying BART architecture, especially in reference-free settings. This “narcissistic” tendency is evidenced by heatmap analyses and weak human alignment metrics (Liu et al., 2023). For example, BARTScore assigned highest scores to BART-generated summaries, even when they did not reflect higher human-judged quality.
- Length Sensitivity: BARTScore (especially reference-free) correlates positively with summary length, leading to inflated scores for verbose generators.
- Low Human Alignment in Reference-free Settings: Correlations with human coherence/consistency judgments are weak ($-0.24$) on SummEval (Liu et al., 2023).
- Bias Mitigation Practices: Recommendations include ensembling metrics from diverse model families, normalizing for length, and avoiding use of the same model family for generator and evaluator.
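One way to combine evaluators from diverse model families is rank averaging, so that no single family's score scale (or self-bias) dominates; this is an illustrative procedure, not one prescribed by the cited papers, and the scores below are hypothetical:

```python
def rank_average(score_lists):
    """Average per-system ranks across several metrics (higher rank = better),
    rather than averaging raw scores on incompatible scales."""
    n = len(score_lists[0])
    ranks = []
    for scores in score_lists:
        order = sorted(range(n), key=lambda i: scores[i])  # ascending
        r = [0] * n
        for rank, idx in enumerate(order):
            r[idx] = rank
        ranks.append(r)
    return [sum(r[i] for r in ranks) / len(ranks) for i in range(n)]


# Hypothetical scores for three systems from two evaluator families:
bart_family = [-0.4, -0.9, -0.6]   # favors system 0 (possible self-bias)
other_family = [-1.2, -0.7, -0.5]  # favors system 2
print(rank_average([bart_family, other_family]))  # [1.0, 0.5, 1.5]
```

Here system 2 receives the highest averaged rank even though the BART-family evaluator alone would have preferred system 0.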
7. Practical Recommendations and Future Directions
For robust use of BARTScore in NLG evaluation:
- Avoid sole reliance on pure sequence-level BARTScore during training; apply token-level/ensemble constraints for metric robustness.
- For MT, use F-score variants (averaging both the $r \to h$ and $h \to r$ directions) and supplement automated scores with human evaluation, particularly in near-SOTA regimes.
- In summarization and data-to-text, prompt ensembling and fine-tuning push state-of-the-art alignment with human metrics.
- Employ BARTScore++ or similar error-analytic extensions to improve major-minor error sensitivity.
- Remain vigilant to evaluator-generator model overlap; combine diverse evaluation metrics and conduct targeted human analyses where bias or alignment issues may arise (Liu et al., 2023, Yan et al., 2023, Lu et al., 2022).
The BARTScore framework remains a flexible, model-agnostic, and empirically competitive neural metric for a wide array of generative evaluation scenarios, provided its vulnerabilities are addressed through informed system design (Yuan et al., 2021, Lu et al., 2022, Yan et al., 2023, Liu et al., 2023).