
Stemming Effectiveness Score (SES)

Updated 2 December 2025
  • Stemming Effectiveness Score (SES) is a metric that measures the trade-off between vocabulary compression and semantic fidelity in natural language processing.
  • It is computed as the product of the Compression Ratio (CR) and the Information Retention Score (IRS), providing a single scalar for stemming evaluation.
  • SES is used alongside metrics such as ANLD and MPD to flag over-stemming risks and to verify document-level semantic retention.

Stemming Effectiveness Score (SES) is a task-oriented metric devised for the quantitative evaluation of stemming algorithms in natural language processing pipelines. It captures the trade-off between vocabulary compression and semantic fidelity at the document level, enabling direct comparison of stemmers across languages and domains. SES serves as the core component within a multi-metric framework for stemming evaluation, complementing Model Performance Delta (MPD) for downstream impact, and Average Normalized Levenshtein Distance (ANLD) for morphological safety (Kafi et al., 25 Nov 2025).

1. Formal Definition

SES is defined as the product of the Compression Ratio (CR) and the Information Retention Score (IRS):

$$\mathrm{SES} = \mathrm{IRS} \times \mathrm{CR}$$

where:

  • $U_{\rm orig}$: set of unique word types before stemming
  • $U_{\rm stem}$: set of unique word types after stemming
  • $V_{\rm orig}$, $V_{\rm stem}$: document-level embedding vectors (mean-pooled transformer representations) of the original and stemmed text, respectively

The constituent metrics are:

  • Compression Ratio (CR):

$$\mathrm{CR} = \frac{\lvert U_{\rm orig}\rvert}{\lvert U_{\rm stem}\rvert}$$

where CR > 1 indicates a reduction in vocabulary size.

  • Information Retention Score (IRS):

$$\mathrm{IRS} = \cos(V_{\rm orig}, V_{\rm stem}) = \frac{V_{\rm orig}\cdot V_{\rm stem}}{\|V_{\rm orig}\|\,\|V_{\rm stem}\|}$$

with $0 \leq \mathrm{IRS} \leq 1$; a value of 1 denotes perfect semantic preservation at the contextual embedding level.

Thus, SES rewards stemmers that yield substantial vocabulary compression while preserving semantic content as measured by document-level embeddings.
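These definitions translate directly into code. A minimal sketch in Python (NumPy only; the document vectors are assumed to come from whatever encoder the pipeline already uses):

```python
import numpy as np

def compression_ratio(orig_types: set, stem_types: set) -> float:
    """CR = |U_orig| / |U_stem|; CR > 1 means the stemmer shrank the vocabulary."""
    return len(orig_types) / len(stem_types)

def information_retention(v_orig: np.ndarray, v_stem: np.ndarray) -> float:
    """IRS = cosine similarity between the document-level embedding vectors."""
    return float(np.dot(v_orig, v_stem)
                 / (np.linalg.norm(v_orig) * np.linalg.norm(v_stem)))

def ses(cr: float, irs: float) -> float:
    """SES = IRS x CR."""
    return irs * cr
```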

2. Conceptual Motivation

The motivation for SES arises from the dual objectives of stemming: reducing vocabulary (and hence dimensionality and computational load) while preserving the semantic content required for downstream tasks. Pure vocabulary compression, as measured by CR, can lead to over-stemming, conflating semantically distinct tokens and harming interpretability and model performance. IRS quantifies the preservation of end-to-end document meaning despite token- or word-level alterations. Multiplying IRS by CR yields a scalar that reflects this trade-off:

  • High CR with low IRS penalizes destructive over-stemming.
  • Moderate CR with high IRS reflects balanced stemming.
  • High values in both components are rare and highly desirable.

3. Computation Procedure

The procedure for calculating SES on a corpus of $N$ documents is as follows:

  1. Preprocessing:
    • Tokenize all documents to obtain $U_{\rm orig}$.
    • Apply the stemmer to obtain $U_{\rm stem}$.
  2. Compression Ratio:

$$\mathrm{CR} = \frac{\lvert U_{\rm orig}\rvert}{\lvert U_{\rm stem}\rvert}$$

  3. Embeddings:
    • For each document $i$, compute mean-pooled contextual embeddings of the original and stemmed tokens: $V_{\rm orig}^{(i)}$, $V_{\rm stem}^{(i)}$.
  4. Information Retention:

$$\mathrm{IRS}^{(i)} = \frac{V_{\rm orig}^{(i)}\cdot V_{\rm stem}^{(i)}}{\|V_{\rm orig}^{(i)}\|\,\|V_{\rm stem}^{(i)}\|}$$

    • Average over all $N$ documents:

$$\mathrm{IRS} = \frac{1}{N} \sum_{i=1}^N \mathrm{IRS}^{(i)}$$

  5. Combine:

$$\mathrm{SES} = \mathrm{IRS} \times \mathrm{CR}$$
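Put together, the five steps amount to a short loop. The sketch below assumes naive whitespace tokenization and takes `stemmer` (token to stem) and `embed` (text to mean-pooled vector) as caller-supplied callables, since the source does not prescribe a specific tokenizer or encoder:

```python
import numpy as np

def evaluate_ses(documents, stemmer, embed):
    """Corpus-level SES following steps 1-5 above.

    stemmer: maps a token to its stem (e.g., a Snowball stemmer's .stem)
    embed:   maps text to a mean-pooled document embedding (np.ndarray)
    """
    orig_types, stem_types, irs_scores = set(), set(), []

    for doc in documents:
        tokens = doc.split()                  # step 1: naive tokenization
        stems = [stemmer(t) for t in tokens]
        orig_types.update(tokens)             # unique types before stemming
        stem_types.update(stems)              # unique types after stemming

        v_orig = embed(" ".join(tokens))      # step 3: document embeddings
        v_stem = embed(" ".join(stems))
        irs_scores.append(                    # step 4: per-document cosine
            float(np.dot(v_orig, v_stem)
                  / (np.linalg.norm(v_orig) * np.linalg.norm(v_stem))))

    cr = len(orig_types) / len(stem_types)    # step 2: compression ratio
    irs = float(np.mean(irs_scores))          # step 4: average over N documents
    return irs * cr, cr, irs                  # step 5: SES = IRS x CR
```

Note that only the type sets are accumulated across documents, matching the corpus-level definition of CR, while IRS is averaged per document.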

4. Comparison with Complementary Metrics

SES is distinct in tightly integrating semantic and efficiency considerations:

| Metric | Focus | Limitations |
|---|---|---|
| CR | Vocabulary compression | Ignores semantic loss |
| IRS | Semantic preservation | Ignores efficiency |
| SES | Trade-off (semantics × compression) | May miss micro-level errors |
| ANLD | Morphological safety | No embedding semantics |
| MPD | Downstream task impact | No decomposition of causes |

SES must be interpreted in conjunction with ANLD and MPD. ANLD, calculated as the average normalized Levenshtein distance between original and stemmed tokens,

$$\mathrm{ANLD} = \frac{1}{M} \sum_{j=1}^M \frac{\mathrm{lev}(\text{orig}_j, \text{stem}_j)}{|\text{orig}_j|}$$

quantifies aggressive token truncation that may be missed by document-level IRS. MPD assesses the actual effect of stemming on downstream tasks (e.g., classification accuracy). SES alone does not guarantee that stemming is safe; for instance, an inflated SES driven by destructive over-stemming may be exposed by a high ANLD and a negative MPD.
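A compact sketch of ANLD in pure Python (`pairs` is an iterable of (original token, stemmed token) tuples; the edit-distance helper is a standard dynamic-programming implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Standard DP edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anld(pairs) -> float:
    """Average normalized Levenshtein distance over (orig, stem) pairs."""
    pairs = list(pairs)
    return sum(levenshtein(o, s) / len(o) for o, s in pairs) / len(pairs)
```

For example, `levenshtein("running", "run")` is 4, so that pair contributes 4/7 ≈ 0.57 to the average.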

5. Interpretation Guidelines and Thresholds

  • SES > 1: Indicates that the combined effect of compression and semantic retention outperforms the unstemmed baseline.
  • SES ≳ 1.2 and ANLD ≤ 0.15: Suggests effective and morphologically safe stemming.
  • High SES with high ANLD (e.g., > 0.25): Warns of harmful over-stemming.
  • Empirically, in (Kafi et al., 25 Nov 2025), SES values for English (Snowball) and Bangla (BNLTK) stemmers are 1.31 and 1.67, respectively, but BNLTK’s high SES is associated with ANLD = 0.26, highlighting over-stemming and a negative impact on downstream model performance.

A plausible implication is that SES must not be interpreted in isolation—complementary safety (ANLD) and effectiveness (MPD) checks are required.
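These thresholds can be folded into a small triage helper. A sketch (the cut-offs mirror the guidelines above; the branch order encodes the rule that a high ANLD overrides a high SES, as in the BNLTK case):

```python
def interpret(ses: float, anld: float) -> str:
    """Heuristic reading of an SES/ANLD pair per the guideline thresholds."""
    if anld > 0.25:
        return "warning: likely harmful over-stemming, regardless of SES"
    if ses >= 1.2 and anld <= 0.15:
        return "effective and morphologically safe stemming"
    if ses > 1.0:
        return "net-positive compression; verify with ANLD and MPD"
    return "no benefit over the unstemmed baseline"
```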

6. Worked Examples

The following table summarizes SES calculations for empirical and toy cases:

| System | $\lvert U_{\rm orig}\rvert$ | $\lvert U_{\rm stem}\rvert$ | CR | IRS | SES | ANLD |
|---|---|---|---|---|---|---|
| English Snowball | 2175 | 1325 | 1.64 | 0.80 | 1.31 | 0.14 |
| Bangla BNLTK | 2956 | 1555 | 1.90 | 0.88 | 1.67 | 0.26 |
| Toy Example | 9 | 9 | 1.00 | 0.95 | 0.95 | 0.0 |

For the toy corpus (“cats chasing mice”, etc.), naïve stemming achieved no compression (CR = 1.0) and high IRS (0.95), yielding SES = 0.95, i.e., no benefit over the unstemmed baseline.
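The tabulated values are internally consistent; a quick arithmetic check reproduces the CR and SES columns from the type counts and IRS:

```python
rows = {
    "English Snowball": (2175, 1325, 0.80),
    "Bangla BNLTK":     (2956, 1555, 0.88),
    "Toy Example":      (9, 9, 0.95),
}
for name, (u_orig, u_stem, irs) in rows.items():
    cr = u_orig / u_stem
    print(f"{name}: CR = {cr:.2f}, SES = {irs * cr:.2f}")
# English Snowball: CR = 1.64, SES = 1.31
# Bangla BNLTK: CR = 1.90, SES = 1.67
# Toy Example: CR = 1.00, SES = 0.95
```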

7. Strengths and Limitations

Strengths:

  • Captures the efficiency/semantics trade-off in a single, interpretable scalar.
  • Enables standardized comparison across languages, domains, and normalization systems.
  • SES > 1 has a direct interpretation: stemming is net-positive in terms of efficiency and meaning.

Limitations:

  • IRS depends on embedding quality, which may fail to capture subtle morphological errors or over-stemming at the token level.
  • Document-level semantic similarity can overestimate correctness; multiple destructive changes can yield a misleadingly high SES.
  • SES, if used alone, does not safeguard against over-aggressive stemming detected by ANLD.
  • The metric assumes vocabulary reduction is always beneficial, which may not hold for specific tasks (e.g., NER) where token variants can be semantically relevant.

In practice, rigorous stemming evaluation demands concurrent reporting of SES, CR, IRS, ANLD, and MPD (with appropriate statistical significance tests). This multi-metric approach ensures that efficiency gains reflected in SES do not come at the cost of semantic or morphological harm, providing a comprehensive picture of normalization quality (Kafi et al., 25 Nov 2025).
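In code, that recommendation amounts to never returning SES on its own. A sketch of such a bundled report (field names and the `safe` heuristic are illustrative, not prescribed by the paper):

```python
from dataclasses import dataclass

@dataclass
class StemmingReport:
    """Keeps the full metric suite together so SES is never read in isolation."""
    cr: float    # compression ratio
    irs: float   # information retention score
    ses: float   # irs * cr
    anld: float  # morphological safety
    mpd: float   # downstream performance delta (stemmed minus unstemmed)

    def safe(self) -> bool:
        # Net-positive SES, morphologically safe ANLD, non-negative task impact.
        return self.ses > 1.0 and self.anld <= 0.15 and self.mpd >= 0.0
```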
