Stemming Effectiveness Score (SES)
- Stemming Effectiveness Score (SES) is a metric that measures the trade-off between vocabulary compression and semantic fidelity in natural language processing.
- It is computed as the product of the Compression Ratio (CR) and the Information Retention Score (IRS), providing a single scalar for stemming evaluation.
- SES complements metrics such as ANLD and MPD to flag over-stemming risks and to balance compression against document-level semantic retention.
Stemming Effectiveness Score (SES) is a task-oriented metric devised for the quantitative evaluation of stemming algorithms in natural language processing pipelines. It captures the trade-off between vocabulary compression and semantic fidelity at the document level, enabling direct comparison of stemmers across languages and domains. SES serves as the core component within a multi-metric framework for stemming evaluation, complementing Model Performance Delta (MPD) for downstream impact, and Average Normalized Levenshtein Distance (ANLD) for morphological safety (Kafi et al., 25 Nov 2025).
1. Formal Definition
SES is defined as the product of the Compression Ratio (CR) and the Information Retention Score (IRS):

$$\mathrm{SES} = \mathrm{CR} \times \mathrm{IRS}$$

where:
- $V_{\text{orig}}$: set of unique word types before stemming
- $V_{\text{stem}}$: set of unique word types after stemming
- $e_{\text{orig}}$, $e_{\text{stem}}$: document-level embedding vectors (mean-pooled transformer representations) of the original and stemmed text, respectively
The constituent metrics are:
- Compression Ratio (CR): $\mathrm{CR} = |V_{\text{orig}}| / |V_{\text{stem}}|$, where CR > 1 indicates a reduction in vocabulary size.
- Information Retention Score (IRS): $\mathrm{IRS} = \cos\!\left(e_{\text{orig}}, e_{\text{stem}}\right)$, with $\mathrm{IRS} \in [-1, 1]$; 1 denotes perfect semantic preservation at the contextual embedding level.
Thus, SES rewards stemmers that yield substantial vocabulary compression while preserving semantic content as measured by document-level embeddings.
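This definition translates directly into code. Below is a minimal sketch, assuming vocabularies are plain Python sets and document embeddings are NumPy vectors; the function names are illustrative and this is not the reference implementation from Kafi et al.

```python
import numpy as np

def compression_ratio(vocab_orig: set, vocab_stem: set) -> float:
    # CR = |V_orig| / |V_stem|; CR > 1 means the stemmer shrank the vocabulary.
    return len(vocab_orig) / len(vocab_stem)

def information_retention(e_orig: np.ndarray, e_stem: np.ndarray) -> float:
    # IRS = cosine similarity of the original and stemmed document embeddings.
    return float(np.dot(e_orig, e_stem) /
                 (np.linalg.norm(e_orig) * np.linalg.norm(e_stem)))

def ses(vocab_orig: set, vocab_stem: set,
        e_orig: np.ndarray, e_stem: np.ndarray) -> float:
    # SES = CR * IRS: compression counts only insofar as meaning is preserved.
    return compression_ratio(vocab_orig, vocab_stem) * information_retention(e_orig, e_stem)
```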
2. Conceptual Motivation
The motivation for SES arises from the dual objectives of stemming: reducing vocabulary (hence, dimensionality and computational load), while ensuring the preservation of semantic content required for downstream tasks. Pure vocabulary compression, as measured by CR, can lead to over-stemming, conflating semantically distinct tokens and harming interpretability and model performance. IRS quantifies the preservation of end-to-end document meaning despite token- or word-level alterations. Multiplying IRS by CR yields a scalar that reflects this trade-off:
- High CR with low IRS signals destructive over-stemming, which the product penalizes.
- Moderate CR with high IRS reflects balanced stemming.
- High values in both components are rare and highly desirable.
3. Computation Procedure
The procedure for calculating SES on a corpus of $N$ documents is as follows (a minimal code sketch follows the list):
- Preprocessing:
  - Tokenize all documents to obtain the original vocabulary $V_{\text{orig}}$.
  - Apply the stemmer to obtain the stemmed vocabulary $V_{\text{stem}}$.
- Compression Ratio: $\mathrm{CR} = |V_{\text{orig}}| / |V_{\text{stem}}|$.
- Embeddings:
  - For each document $d$, compute mean-pooled contextual embeddings of the original and stemmed tokens: $e_{\text{orig}}^{(d)}$, $e_{\text{stem}}^{(d)}$.
- Information Retention:
  - Compute the per-document cosine similarity $\mathrm{IRS}_d = \cos\!\left(e_{\text{orig}}^{(d)}, e_{\text{stem}}^{(d)}\right)$.
  - Average over all documents: $\mathrm{IRS} = \frac{1}{N} \sum_{d=1}^{N} \mathrm{IRS}_d$.
- Combine: $\mathrm{SES} = \mathrm{CR} \times \mathrm{IRS}$.
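The sketch below follows this procedure end to end, assuming a sentence-transformers model as the mean-pooled document encoder and a caller-supplied `stem` function; the model name and whitespace tokenization are illustrative assumptions, not the setup of Kafi et al.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def ses_for_corpus(documents, stem, model_name="all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)

    # Preprocessing: tokenize, then stem every token.
    orig_tokens = [doc.split() for doc in documents]
    stem_tokens = [[stem(tok) for tok in toks] for toks in orig_tokens]

    # Compression Ratio over the corpus-wide vocabularies.
    v_orig = {tok for toks in orig_tokens for tok in toks}
    v_stem = {tok for toks in stem_tokens for tok in toks}
    cr = len(v_orig) / len(v_stem)

    # Document-level embeddings of the original and stemmed text.
    e_orig = model.encode([" ".join(toks) for toks in orig_tokens])
    e_stem = model.encode([" ".join(toks) for toks in stem_tokens])

    # Per-document cosine similarity, averaged to give IRS.
    sims = [float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in zip(e_orig, e_stem)]
    irs = float(np.mean(sims))

    return {"CR": cr, "IRS": irs, "SES": cr * irs}
```

For English, for example, the `stem` argument could be `nltk.stem.SnowballStemmer("english").stem`; any token-level stemmer with the same call signature works.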
4. Positioning Among Related Metrics
SES is distinct in tightly integrating semantic and efficiency considerations:
| Metric | Focus | Limitations |
|---|---|---|
| CR | Vocabulary compression | Ignores semantic loss |
| IRS | Semantic preservation | Ignores efficiency |
| SES | Trade-off (semantics × compression) | May miss micro-level errors |
| ANLD | Morphological safety | No embedding semantics |
| MPD | Downstream task impact | No decomposition of causes |
SES must be interpreted in conjunction with ANLD and MPD. ANLD, calculated as the average normalized Levenshtein distance between original and stemmed tokens,
quantifies aggressive token truncation that may be missed by document-level IRS. MPD assesses the actual effect of stemming on downstream tasks (e.g., classification accuracy). SES alone does not guarantee that stemming is safe; for instance, an inflated SES driven by destructive over-stemming may be exposed by a high ANLD and a negative MPD.
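ANLD itself is simple to compute. The sketch below pairs each original token with its stemmed form and normalizes the edit distance by the longer token's length; this normalization is an assumption consistent with the description above, not necessarily the exact formula of Kafi et al.

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anld(orig_tokens, stem_tokens):
    # Average normalized Levenshtein distance over aligned token pairs.
    dists = [levenshtein(o, s) / max(len(o), len(s), 1)
             for o, s in zip(orig_tokens, stem_tokens)]
    return sum(dists) / len(dists)
```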
5. Interpretation Guidelines and Thresholds
- SES > 1: Indicates that the combined effect of compression and semantic retention outperforms the unstemmed baseline.
- SES ≳ 1.2 and ANLD ≤ 0.15: Suggests effective and morphologically safe stemming.
- High SES with high ANLD (e.g., > 0.25): Warns of harmful over-stemming.
- Empirically, in (Kafi et al., 25 Nov 2025), SES values for English (Snowball) and Bangla (BNLTK) stemmers are 1.31 and 1.67, respectively, but BNLTK’s high SES is associated with ANLD = 0.26, highlighting over-stemming and a negative impact on downstream model performance.
A plausible implication is that SES must not be interpreted in isolation—complementary safety (ANLD) and effectiveness (MPD) checks are required.
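The guidance above can be encoded as a rough decision helper; the cutoff values simply mirror the thresholds listed in this section and are heuristics, not hard rules.

```python
def interpret_stemming(ses, anld, mpd=None):
    # Heuristic reading of SES together with ANLD and (optionally) MPD.
    if anld > 0.25:
        return "warning: likely over-stemming (high ANLD), regardless of SES"
    if mpd is not None and mpd < 0:
        return "warning: stemming degrades the downstream task (negative MPD)"
    if ses >= 1.2 and anld <= 0.15:
        return "effective and morphologically safe stemming"
    if ses > 1.0:
        return "net-positive compression/semantics trade-off; verify with MPD"
    return "no clear benefit over the unstemmed baseline"
```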
6. Worked Examples
The following table summarizes SES calculations for empirical and toy cases:
| System | $\lvert V_{\text{orig}} \rvert$ | $\lvert V_{\text{stem}} \rvert$ | CR | IRS | SES | ANLD |
|---|---|---|---|---|---|---|
| English Snowball | 2175 | 1325 | 1.64 | 0.80 | 1.31 | 0.14 |
| Bangla BNLTK | 2956 | 1555 | 1.90 | 0.88 | 1.67 | 0.26 |
| Toy Example | 9 | 9 | 1.00 | 0.95 | 0.95 | 0.00 |
For the toy corpus (“cats chasing mice”, etc.), naïve stemming achieved no compression (CR = 1.00) but high IRS (0.95), yielding SES = 0.95 and thus no benefit over the unstemmed baseline.
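The tabulated figures follow directly from the definition; a short check using only the vocabulary sizes and IRS values:

```python
rows = {
    "English Snowball": (2175, 1325, 0.80),
    "Bangla BNLTK": (2956, 1555, 0.88),
    "Toy Example": (9, 9, 0.95),
}
for name, (v_orig, v_stem, irs) in rows.items():
    cr = v_orig / v_stem
    print(f"{name}: CR={cr:.2f}, SES={cr * irs:.2f}")
# English Snowball: CR=1.64, SES=1.31
# Bangla BNLTK: CR=1.90, SES=1.67
# Toy Example: CR=1.00, SES=0.95
```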
7. Strengths and Limitations
Strengths:
- Captures the efficiency/semantics trade-off in a single, interpretable scalar.
- Enables standardized comparison across languages, domains, and normalization systems.
- SES > 1 has a direct interpretation: stemming is net-positive in terms of efficiency and meaning.
Limitations:
- IRS depends on embedding quality, which may fail to capture subtle morphological errors or over-stemming at the token level.
- Document-level semantic similarity can overestimate correctness; multiple destructive changes can yield a misleadingly high SES.
- SES, if used alone, does not safeguard against over-aggressive stemming detected by ANLD.
- The metric assumes vocabulary reduction is always beneficial, which may not hold for specific tasks (e.g., NER), where token variants can be semantically relevant.
In practice, rigorous stemming evaluation demands concurrent reporting of SES, CR, IRS, ANLD, and MPD (with appropriate statistical significance tests). This multi-metric approach ensures that efficiency gains reflected in SES do not come at the cost of semantic or morphological harm, providing a comprehensive picture of normalization quality (Kafi et al., 25 Nov 2025).