Soft Biometric Leakage Score (SBLS)
- SBLS is a unified metric that quantifies the risk of soft-biometric leakage in anonymized speech systems by measuring direct attribute inference, linkage detection, and subgroup robustness.
- It aggregates normalized sub-scores using a weighted convex combination, providing an actionable assessment of vulnerabilities in speaker de-identification pipelines.
- Empirical evaluations reveal that while high protection against individual attribute inference is achievable, subgroup resilience remains a key challenge for robust anonymization.
The Soft Biometric Leakage Score (SBLS) is a unified quantitative metric for assessing the vulnerability of speaker de-identification systems to zero-shot inference attacks targeting soft biometric traits, such as channel type, age range, dialect, sex, or speaking style. SBLS integrates three orthogonal axes of evaluation—direct attribute inference, systematic linkage detection, and subgroup robustness—within a convex aggregation framework designed to rigorously capture the residual risk of attribute recovery from anonymized speech outputs. In contrast to traditional evaluation protocols that focus on individual-level identity re-identification, SBLS reveals that non-unique, group-level properties can be reliably inferred by adversaries armed only with pre-trained models, independent of access to original speech data or system details, exposing vulnerabilities overlooked by standard distributional or speaker-centric metrics (Seo et al., 17 Sep 2025).
1. Formal Definition of SBLS
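SBLS is formally defined as a weighted convex combination of three normalized protection sub-scores, $S_{\mathrm{inf}}$, $S_{\mathrm{link}}$, and $S_{\mathrm{sub}}$, each measuring a distinct dimension of soft-biometric information leakage. The composite score is given by:
$$\mathrm{SBLS} = w_{\mathrm{inf}}\, S_{\mathrm{inf}} + w_{\mathrm{link}}\, S_{\mathrm{link}} + w_{\mathrm{sub}}\, S_{\mathrm{sub}}$$
subject to $w_{\mathrm{inf}}, w_{\mathrm{link}}, w_{\mathrm{sub}} \ge 0$ and $w_{\mathrm{inf}} + w_{\mathrm{link}} + w_{\mathrm{sub}} = 1$. In the primary reference experiments, weights are set as $(w_{\mathrm{inf}}, w_{\mathrm{link}}, w_{\mathrm{sub}}) = (0.4, 0.4, 0.2)$, the values that reproduce the aggregated scores in the results table. Each component is interpreted such that 1 denotes perfect protection (absence of measurable leakage), while 0 denotes maximal leakage for the corresponding axis.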
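The aggregation step can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name `sbls` is hypothetical, and the default weights (0.4, 0.4, 0.2) for the inference, linkage, and subgroup sub-scores are assumed because they reproduce the aggregated values in the results table.

```python
from typing import Sequence


def sbls(sub_scores: Sequence[float],
         weights: Sequence[float] = (0.4, 0.4, 0.2)) -> float:
    """Convex combination of protection sub-scores.

    Order of both sequences: (inference, linkage, subgroup).
    The default weights are an assumption inferred from the results table.
    """
    if len(sub_scores) != len(weights):
        raise ValueError("need exactly one weight per sub-score")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    if any(not 0.0 <= s <= 1.0 for s in sub_scores):
        raise ValueError("sub-scores must lie in [0, 1]")
    return sum(w * s for w, s in zip(weights, sub_scores))


# Perfect protection on every axis gives the maximum score.
print(sbls((1.0, 1.0, 1.0)))
# Total subgroup failure with otherwise perfect protection: the score can
# drop by at most the subgroup weight.
print(sbls((1.0, 1.0, 0.0)))
```

Because the combination is convex, the composite always lies between the smallest and largest sub-score, and no single axis can lower it by more than its weight.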
2. Sub-Scores: Attribute Inference, Linkage Detection, and Subgroup Robustness
The three constituent sub-scores operationalize complementary attacks and measurement targets:
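Direct Attribute Inference Score ($S_{\mathrm{inf}}$)
This assesses the extent to which a fixed (pre-trained, off-the-shelf) classifier can infer each soft-biometric attribute in the attribute set $\mathcal{A}$ from anonymized output. For $a \in \mathcal{A}$, with $K_a$ classes, one computes the class-wise one-versus-rest AUC:
$$\mathrm{AUC}_{a,k} = \mathrm{AUC}\big(\mathbb{1}[y_a = k],\ \hat{p}_{a,k}\big), \qquad k = 1, \dots, K_a$$
where $\hat{p}_{a,k}$ is the classifier's score for class $k$. Label permutation ambiguities are resolved by maximizing the mean AUC over all class permutations:
$$\mathrm{AUC}_a^{*} = \max_{\pi \in \Pi_{K_a}} \frac{1}{K_a} \sum_{k=1}^{K_a} \mathrm{AUC}\big(\mathbb{1}[y_a = k],\ \hat{p}_{a,\pi(k)}\big)$$
Leakage is the deviation of $\mathrm{AUC}_a^{*}$ above chance (0.5), normalized to $[0, 1]$, then inverted to yield a protection score:
$$S_{\mathrm{inf}}^{(a)} = 1 - \max\big(0,\ 2\,\mathrm{AUC}_a^{*} - 1\big)$$
with $S_{\mathrm{inf}}$ summarizing the per-attribute scores over $\mathcal{A}$. Where only hard class assignments $\hat{y}_a$ are available, macro balanced accuracy can substitute for $\mathrm{AUC}_a^{*}$.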
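The inference score for a single attribute can be sketched as follows. This is a dependency-free illustration under stated assumptions: class labels are integers `0..K-1`, `probs` holds one score per class for each sample, and the function names (`binary_auc`, `aligned_auc`, `inference_protection`) are hypothetical. The brute-force search over permutations is only practical for small numbers of classes.

```python
from itertools import permutations
from typing import Sequence


def binary_auc(is_pos: Sequence[bool], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for p, s in zip(is_pos, scores) if p]
    neg = [s for p, s in zip(is_pos, scores) if not p]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def aligned_auc(labels: Sequence[int], probs: Sequence[Sequence[float]]) -> float:
    """Mean one-vs-rest AUC, maximized over assignments of output columns to classes."""
    k = len(probs[0])
    # auc[c][j]: AUC for true class c when scored with classifier output column j.
    auc = [[binary_auc([y == c for y in labels], [row[j] for row in probs])
            for j in range(k)]
           for c in range(k)]
    return max(sum(auc[c][perm[c]] for c in range(k)) / k
               for perm in permutations(range(k)))


def inference_protection(labels: Sequence[int],
                         probs: Sequence[Sequence[float]]) -> float:
    """1 minus the AUC excess over chance (0.5), rescaled to [0, 1]."""
    return 1.0 - max(0.0, 2.0 * aligned_auc(labels, probs) - 1.0)


# A classifier whose output columns are swapped relative to the true labels
# still leaks everything once the permutation is resolved.
labels = [0, 0, 1, 1]
swapped = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.2], [0.7, 0.3]]
print(inference_protection(labels, swapped))
```

The swapped-column example shows why the permutation step matters: without it, the same classifier would score an AUC of 0 and be reported as leaking nothing.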
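Systematic Linkage Detection Score ($S_{\mathrm{link}}$)
This axis quantifies the mutual dependence between true attribute labels and classifier predictions after optimal permutation alignment. Mutual information for attribute $a$ is calculated as:
$$I(Y_a; \hat{Y}_a) = \sum_{y} \sum_{\hat{y}} p(y, \hat{y}) \log \frac{p(y, \hat{y})}{p(y)\, p(\hat{y})}$$
This quantity is normalized by its upper bound for comparability across attributes (taken here to be the label entropy $H(Y_a)$), and then inverted to produce the normalized protection score:
$$S_{\mathrm{link}}^{(a)} = 1 - \frac{I(Y_a; \hat{Y}_a)}{H(Y_a)}$$
A value of 1 indicates no empirical linkage (perfect anonymization) and 0 indicates maximal (deterministic) association.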
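A plug-in estimate of the linkage score from hard labels and predictions might look like the sketch below. The normalizer is an assumption: the label entropy $H(Y)$ is used because it is the value mutual information reaches under a deterministic association, which makes the score hit 0 exactly in that case. The function names are hypothetical.

```python
from collections import Counter
from math import log
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy (in nats) of a label sequence."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def mutual_information(y_true: Sequence[Hashable],
                       y_pred: Sequence[Hashable]) -> float:
    """Plug-in mutual information (in nats) between labels and predictions."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    n_true, n_pred = Counter(y_true), Counter(y_pred)
    return sum((c / n) * log((c / n) / ((n_true[y] / n) * (n_pred[z] / n)))
               for (y, z), c in joint.items())


def linkage_protection(y_true: Sequence[Hashable],
                       y_pred: Sequence[Hashable]) -> float:
    """1 - I(Y; Y_hat) / H(Y); the choice of H(Y) as normalizer is an assumption."""
    h = entropy(y_true)
    if h == 0.0:
        return 1.0  # only one class present: nothing to link
    ratio = mutual_information(y_true, y_pred) / h
    return 1.0 - min(1.0, max(0.0, ratio))  # clamp against rounding error


# Predictions that determine the label (under any naming) give maximal linkage.
print(linkage_protection(["m", "m", "f", "f"], ["a", "a", "b", "b"]))
# Predictions independent of the label give no linkage.
print(linkage_protection([0, 0, 1, 1], [0, 1, 0, 1]))
```

Mutual information is unchanged by relabeling the predicted classes, so this computation gives the same result before and after permutation alignment.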
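Subgroup Robustness ($S_{\mathrm{sub}}$)
This dimension addresses resilience to attacks over all sufficiently large intersections of attributes, $g \in \mathcal{G}$ (e.g., specific combinations like “young male”). For subgroup $g$, leakage is measured as:
$$L_g = \max\big(0,\ 2\,\mathrm{AUC}^{*}\big|_{g} - 1\big)$$
where $\cdot\,|_{g}$ restricts to subgroup samples. The aggregate score combines the maximum (worst-case) leakage and the consistency of protection across subgroups, in a form such as:
$$S_{\mathrm{sub}} = 1 - \Big[\lambda \max_{g \in \mathcal{G}} L_g + (1 - \lambda)\, \operatorname{std}_{g \in \mathcal{G}}(L_g)\Big]$$
with $\lambda \in [0, 1]$ (fixed empirically in the reference experiments) balancing emphasis between worst-case and dispersion.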
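The subgroup axis can be illustrated with the sketch below, which rests on several assumptions that are not confirmed by the source: the aggregate is taken to be one minus a convex mix of the worst-case subgroup leakage and the population standard deviation of leakages, the balance parameter `lam` defaults to a placeholder value of 0.5, and the minimum subgroup size is left to the caller. All function names are hypothetical.

```python
from collections import defaultdict
from statistics import pstdev
from typing import Dict, Iterable, List, Mapping, Sequence, Tuple


def subgroups(records: Sequence[Mapping[str, str]],
              keys: Sequence[str],
              min_size: int) -> Dict[Tuple[str, ...], List[int]]:
    """Sample indices grouped by attribute intersection, keeping large groups only."""
    groups: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
    for i, rec in enumerate(records):
        groups[tuple(rec[k] for k in keys)].append(i)
    return {g: idx for g, idx in groups.items() if len(idx) >= min_size}


def subgroup_protection(leakages: Iterable[float], lam: float = 0.5) -> float:
    """1 - [lam * worst-case leakage + (1 - lam) * dispersion].

    Assumed form; lam = 0.5 is a placeholder, not the value used in the
    reference experiments.
    """
    vals = list(leakages)
    if not vals:
        raise ValueError("need at least one subgroup leakage value")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    penalty = lam * max(vals) + (1.0 - lam) * pstdev(vals)
    return 1.0 - penalty


records = [
    {"age": "young", "sex": "male"},
    {"age": "young", "sex": "male"},
    {"age": "old", "sex": "female"},
]
# Only ("young", "male") has at least two samples and is kept.
print(subgroups(records, ("age", "sex"), min_size=2))
# Hypothetical per-subgroup leakages: one badly protected subgroup drags the
# score down even though the others are well protected.
print(subgroup_protection([0.05, 0.10, 0.80]))
```

Under this form a system can score well on attribute inference over the full population and still be penalized heavily when a single subgroup is exposed, which is the pattern the metric is meant to surface.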
3. Normalization, Weighting, and Aggregation
All three sub-scores are normalized to the $[0, 1]$ interval. By construction, $S_{\mathrm{inf}} = 1$, $S_{\mathrm{link}} = 1$, and $S_{\mathrm{sub}} = 1$ each indicate the absence of measurable leakage along their respective axes, whereas a value of 0 denotes maximal, exploitable leakage. Aggregation uses the weights $(w_{\mathrm{inf}}, w_{\mathrm{link}}, w_{\mathrm{sub}})$ to reflect the evaluator’s prioritization of inference, linkage, and subgroup vulnerabilities, respectively.
4. Interpretation and Thresholding Considerations
An overall SBLS near 1 suggests the system resists all evaluated zero-shot attribute inference attacks; adversaries cannot recover soft biometric information with better than random performance. SBLS near 0 indicates strong recoverability of soft biometrics, i.e., the system fails to shield soft attributes from classifier-based attacks. Mid-range SBLS (e.g., 0.4–0.7) signals residual but non-negligible vulnerability to attribute recovery (a “warning zone”). While the metric is not intended to prescribe sharp certification boundaries, the empirical results suggest treating SBLS > 0.85 as an indicator of robust protection and SBLS < 0.75 as a sign of significant systemic leakage (Seo et al., 17 Sep 2025).
5. Empirical Evaluation of Speaker De-Identification Systems
SBLS was applied to five recent speaker de-identification systems, using publicly available classifiers and the recommended weighting $(w_{\mathrm{inf}}, w_{\mathrm{link}}, w_{\mathrm{sub}}) = (0.4, 0.4, 0.2)$. Results are summarized in the table below, presenting all per-component sub-scores and aggregated SBLS:
| System | $S_{\mathrm{inf}}$ | $S_{\mathrm{link}}$ | $S_{\mathrm{sub}}$ | SBLS |
|---|---|---|---|---|
| PHORTRESS | 0.994 | 0.998 | 0.531 | 0.903 |
| SHADOW | 0.936 | 1.000 | 0.501 | 0.874 |
| kNN-VC | 0.877 | 0.993 | 0.604 | 0.869 |
| RASP | 0.910 | 0.995 | 0.435 | 0.849 |
| VOXLET | 0.690 | 0.950 | 0.332 | 0.723 |
For all systems evaluated, subgroup resilience and direct attribute inference were the principal sources of SBLS degradation, while linkage detection scores stayed at or above 0.95. This suggests that, under adversarial conditions where attackers have access to widely available pre-trained classifiers, state-of-the-art de-identification pipelines remain susceptible to soft-biometric leakage, with the level of protection varying with model design and post-processing choices.
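As a consistency check, the aggregated scores in the table can be recomputed from the three sub-score columns. The sketch below assumes the columns are ordered inference, linkage, subgroup and that the weights are (0.4, 0.4, 0.2); the band labels in `band` are hypothetical names for the 0.85 and 0.75 thresholds discussed in the interpretation section.

```python
WEIGHTS = (0.4, 0.4, 0.2)  # assumed (inference, linkage, subgroup) weights

# (sub-scores, reported SBLS) copied from the results table.
REPORTED = {
    "PHORTRESS": ((0.994, 0.998, 0.531), 0.903),
    "SHADOW": ((0.936, 1.000, 0.501), 0.874),
    "kNN-VC": ((0.877, 0.993, 0.604), 0.869),
    "RASP": ((0.910, 0.995, 0.435), 0.849),
    "VOXLET": ((0.690, 0.950, 0.332), 0.723),
}


def recompute(sub_scores, weights=WEIGHTS):
    """Weighted sum of the three sub-scores."""
    return sum(w * s for w, s in zip(weights, sub_scores))


def band(score):
    """Map a score to an illustrative band using the 0.85 / 0.75 thresholds."""
    if score > 0.85:
        return "robust"
    if score < 0.75:
        return "significant leakage"
    return "intermediate"


for name, (subs, reported) in REPORTED.items():
    value = recompute(subs)
    print(f"{name:10s} recomputed={value:.4f} reported={reported:.3f} "
          f"diff={abs(value - reported):.4f} band={band(reported)}")
```

The recomputed values agree with the reported ones to within 0.001, which is the precision expected when sub-scores are themselves rounded to three decimals.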
6. Scope, Limitations, and Research Significance
SBLS is designed to quantify resistance to zero-shot, attribute-level attacks where adversaries are restricted to using pre-trained classifiers and lack original speech or internal system access. No absolute privacy guarantee is stipulated, but the metric demonstrably exposes vulnerabilities that standard distributional and individual-level re-identification assessments cannot. A plausible implication is that future de-identification benchmarks might need to adopt or extend SBLS-type multi-axis evaluations to robustly characterize real-world exposure to inference and linkage attacks on soft biometrics (Seo et al., 17 Sep 2025).