
Relaxed F1 and BERTScore Metrics

Updated 12 January 2026
  • The paper demonstrates how combining Relaxed F1 and BERTScore improves reinforcement learning reward signals by balancing lexical fidelity with semantic similarity.
  • Relaxed F1 reduces sensitivity to trivial text variations through normalization, while BERTScore leverages contextual embeddings to capture graded semantic matches.
  • The hybrid metric, deployed in SRAS, shows enhanced performance in NLG and retrieval tasks through empirical improvements in reward stability and alignment.

Relaxed F1 and BERTScore are evaluation metrics widely used for assessing the quality of generated text in natural language generation and retrieval-augmented generation (RAG) systems. Relaxed F1 is a lexical overlap-based metric designed to mitigate the brittleness of standard F1 to surface-form variations, while BERTScore leverages contextualized token embeddings to capture semantic similarity between generated and reference sequences. Their hybridization in reinforcement learning settings enables both stable policy learning and fidelity to downstream generation quality, as exemplified by their deployment in the SRAS lightweight RL-based document selector (Muttur, 5 Jan 2026). BERTScore, introduced by Zhang et al. (Zhang et al., 2019), has become a standard for semantic evaluation in NLG tasks.

1. Formal Definitions

Relaxed F1

Relaxed F1 quantifies normalized token overlap between a generated answer $\hat{y}$ and a reference answer $y$. Tokens are normalized by lowercasing, punctuation stripping, and stopword removal. Let

  • $\mathrm{tok}(\cdot)$ denote the multiset of normalized word tokens.

Precision and recall are defined as:

$$P_{\mathrm{lex}} = \frac{|\mathrm{tok}(\hat{y}) \cap \mathrm{tok}(y)|}{|\mathrm{tok}(\hat{y})|}, \quad R_{\mathrm{lex}} = \frac{|\mathrm{tok}(\hat{y}) \cap \mathrm{tok}(y)|}{|\mathrm{tok}(y)|}.$$

The Relaxed F1 score is the harmonic mean:

$$\mathrm{Relaxed\text{-}F1} = \frac{2 P_{\mathrm{lex}} R_{\mathrm{lex}}}{P_{\mathrm{lex}} + R_{\mathrm{lex}} + \varepsilon},$$

where $\varepsilon$ is a small constant to avoid division by zero.
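The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the SRAS implementation: the tiny stopword set is an assumption for the example (a real system would use a fuller list, e.g. NLTK's), and multiset overlap is computed with `collections.Counter`.

```python
import re
from collections import Counter

# Toy stopword list for illustration only; a real implementation
# would use a fuller list (e.g. NLTK's English stopwords).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and"}

def tok(text):
    """Normalize: lowercase, strip punctuation, drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def relaxed_f1(hyp, ref, eps=1e-8):
    """Harmonic mean of normalized-token precision and recall
    over the multiset intersection of normalized tokens."""
    h, r = Counter(tok(hyp)), Counter(tok(ref))
    if not h or not r:
        return 0.0
    overlap = sum((h & r).values())  # multiset intersection size
    p = overlap / sum(h.values())
    rcl = overlap / sum(r.values())
    return 2 * p * rcl / (p + rcl + eps)
```

Because of normalization, `relaxed_f1("The cat sat.", "the cat sat")` is (up to $\varepsilon$) a perfect score, whereas a raw exact-match F1 over unnormalized strings would penalize the case and punctuation differences.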

BERTScore

Let $\{h_i\}_{i=1}^m$ be the contextualized token embeddings of $\hat{y}$, and $\{h'_j\}_{j=1}^n$ those of $y$, computed by a frozen RoBERTa-large encoder. Compute cosine similarities between all token pairs.

Semantic precision and recall:

$$P_{\mathrm{sem}} = \frac{1}{m} \sum_{i=1}^m \max_{1 \leq j \leq n} \cos(h_i, h'_j), \quad R_{\mathrm{sem}} = \frac{1}{n} \sum_{j=1}^n \max_{1 \leq i \leq m} \cos(h_i, h'_j).$$

The BERTScore F1 is then:

$$\mathrm{BERTScore} = \frac{2 P_{\mathrm{sem}} R_{\mathrm{sem}}}{P_{\mathrm{sem}} + R_{\mathrm{sem}} + \varepsilon}.$$
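Given token embedding matrices, the greedy max-cosine matching reduces to a single matrix product. The sketch below assumes the embeddings are already produced by an encoder (it takes plain NumPy arrays rather than calling RoBERTa, which is an assumption made to keep the example self-contained):

```python
import numpy as np

def bertscore_f1(H, Hp, eps=1e-8):
    """BERTScore-style F1 from token embeddings.

    H  : (m, d) embeddings of the candidate y-hat
    Hp : (n, d) embeddings of the reference y
    """
    # l2-normalize rows so the inner product is cosine similarity
    H = H / np.linalg.norm(H, axis=1, keepdims=True)
    Hp = Hp / np.linalg.norm(Hp, axis=1, keepdims=True)
    S = H @ Hp.T                       # (m, n) cosine similarity matrix
    p_sem = S.max(axis=1).mean()       # each candidate token -> best reference match
    r_sem = S.max(axis=0).mean()       # each reference token -> best candidate match
    return 2 * p_sem * r_sem / (p_sem + r_sem + eps)
```

Identical embedding sets score (up to $\varepsilon$) 1.0; unmatched candidate tokens lower $P_{\mathrm{sem}}$ and unmatched reference tokens lower $R_{\mathrm{sem}}$, mirroring the precision/recall asymmetry of the lexical case.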

2. Motivation and Rationale

Standard F1 metrics based on raw token overlap are sensitive to superficial differences (punctuation, case, filler words), often rendering learning signals brittle and uninformative in the presence of minor paraphrases. Relaxed F1 reduces this sensitivity, offering a more robust lexical signal under trivial surface changes (Muttur, 5 Jan 2026).

BERTScore generalizes the notion of alignment from discrete tokens to contextual embeddings, enabling graded, continuous matching. This provides robustness to paraphrasing, synonymy, and word order variation, and correlates strongly with human semantic judgments (Zhang et al., 2019).

Hybridizing the two metrics allows for a reward signal that balances local lexical agreement (crucial for answer faithfulness) and global semantic similarity (essential for downstream task performance and robustness to paraphrasing).

3. Metric Computation and Implementation

| Metric | Token Processing | Matching Basis | Scoring Details |
| --- | --- | --- | --- |
| Relaxed F1 | Lowercase, strip punctuation, remove stopwords | Normalized token overlap | Harmonic mean of normalized precision/recall |
| BERTScore | WordPiece tokenizer; l2-normalized embeddings | Contextualized token embeddings | Max-over-token-pair cosine similarities, F1 combination |

For Relaxed F1, lexical normalization precedes computation. BERTScore employs contextual embeddings (typically from RoBERTa-large for English) with token alignment via greedy maximum cosine similarity. The implementation uses l2-normalization for embeddings and typically selects an intermediate transformer layer for scoring (e.g., layer 17 in RoBERTa-large (Zhang et al., 2019)). Scores can be rescaled to [0,1] by subtracting a baseline derived from random sentence pair similarities.
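The baseline rescaling mentioned above is a simple affine map, sketched here under the stated assumption that `baseline` is the mean similarity of random sentence pairs (precomputed per model and language):

```python
def rescale(score, baseline):
    """Baseline rescaling: maps the typical similarity of random
    sentence pairs (`baseline`) to 0 and a perfect score to 1.
    Raw scores below the baseline come out slightly negative."""
    return (score - baseline) / (1.0 - baseline)
```

The rescaling does not change rankings (it is monotonic); it only spreads the typically compressed raw cosine scores over a more interpretable range.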

4. Role in Reinforcement Learning for Retrieval-Augmented Generation

SRAS integrates Relaxed F1 and BERTScore as reward components for reinforcement learning-based document selection under edge constraints (Muttur, 5 Jan 2026). The hybrid reward is computed as:

$$R = \alpha\,\mathrm{Relaxed\text{-}F1} + (1-\alpha)\,\mathrm{BERTScore},$$

where $\alpha = 0.6$, selected by grid search to emphasize lexical fidelity early in training while leveraging the more stable gradients of BERTScore as learning progresses.
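The convex combination itself is a one-liner; the sketch below uses the reported $\alpha = 0.6$ as its default:

```python
def hybrid_reward(relaxed_f1, bertscore, alpha=0.6):
    """Convex combination of lexical (Relaxed F1) and semantic
    (BERTScore) components; alpha = 0.6 per the SRAS grid search."""
    return alpha * relaxed_f1 + (1.0 - alpha) * bertscore
```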

The reward for each PPO trajectory is normalized to zero mean and unit variance before advantage estimation. Policy updates use standard PPO hyperparameters (AdamW learning rate $1 \times 10^{-5}$, discount $\gamma = 0.99$, clip $\epsilon = 0.2$) and batch sizes of 8 trajectories (each selecting $k = 3$ from 8 candidates). The system additionally employs supervised warmup, reward shaping (via the hybrid reward), and curriculum learning for stabilization.
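The per-batch reward standardization described above can be sketched as follows (a generic zero-mean, unit-variance normalization; the batch shape and epsilon are assumptions for the example):

```python
import numpy as np

def normalize_rewards(rewards, eps=1e-8):
    """Standardize a batch of trajectory rewards to zero mean and
    unit variance before advantage estimation, as in the PPO setup."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Standardizing per batch keeps the advantage scale roughly constant across training, which matters here because the hybrid reward's raw magnitude drifts as the policy improves.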

5. Empirical Evaluation and Ablation

Ablation studies in SRAS (Muttur, 5 Jan 2026) elucidate the impact of Relaxed F1 and BERTScore in reward shaping. Removal of hybrid shaping ("NoRS") leads to substantial drops in both metrics: Relaxed F1 falls from 0.1473 (full system) to 0.0562, and the reward plateaus around 0.03, indicating stalled training under sparse rewards. Omitting supervised warmup degrades BERTScore F1 and increases variance, while omitting curriculum learning results in slightly lower Relaxed F1 and higher reward variance. On SQuAD v2, the full hybrid reward achieves a BERTScore F1 of 0.8546, substantially outperforming random and top-$k$ cosine baselines in both lexical and semantic alignment.

6. Comparison and Broader Applications

BERTScore extends classic exact-match and relaxed matching metrics. Unlike unigram F1, which is oblivious to synonymy and paraphrase, BERTScore’s use of contextual similarity allows it to match semantically equivalent but lexically dissimilar phrases, outperforming string-based metrics in correlating with human judgments and in handling adversarial paraphrasing (Zhang et al., 2019). The metric is task-agnostic and applicable across languages, relying only on a pre-trained Transformer encoder with simple normalization and pooling strategies.

Relaxed F1, while a substantial improvement over naive token matching for RL and evaluation, cannot capture semantic equivalence beyond surface overlap, nor paraphrases that common pre-processing does not collapse. Its hybrid use in SRAS demonstrates that the combination supports both answer faithfulness and a generalizable semantic reward, especially in environments with limited token or compute budgets.

7. Concluding Remarks

Relaxed F1 and BERTScore exemplify two families of evaluation metrics—robust lexical overlap and embedding-based semantic similarity—whose hybridization supports more stable, informative reward signals in RL-based generative modeling and retrieval selection tasks. BERTScore, leveraging Transformer-based contextual embeddings, generalizes classic precision/recall/F1 matching to continuous spaces, enabling unsupervised, language-agnostic evaluation that outperforms prior string-based metrics. The demonstrated impact of these metrics on RL-driven systems such as SRAS, especially under edge and latency constraints, underscores their importance in advancing practical, efficient, and high-fidelity NLG and RAG pipelines (Muttur, 5 Jan 2026; Zhang et al., 2019).
