Segment Error Rate in Speaker Diarization
- Segment Error Rate (SER) is a metric that quantifies diarization errors by counting mismatched speaker segments, focusing on short and infrequent utterances.
- SER is computed through optimal speaker mapping, graph-based segment grouping, and an adaptive IoU threshold, ensuring precise segment-level evaluation.
- Experimental results show that SER is more sensitive to segmentation errors than DER and JER, offering deeper insights for improving diarization systems.
Segment Error Rate (SER) is a metric introduced to provide a segment-level evaluation of speaker diarization systems, addressing limitations of traditional duration-weighted metrics such as Diarization Error Rate (DER) and Jaccard Error Rate (JER) (Liu et al., 2022). SER quantifies the fraction of reference speaker segments that are not correctly matched at the segment level, thereby emphasizing the accurate detection of short and infrequent speaker turns rather than total duration alone. This makes it particularly sensitive to errors involving short utterances or rarely speaking individuals, to which conventional metrics are largely insensitive.
1. Formal Definition
SER is defined as the ratio:

$$\mathrm{SER} = \frac{N_{\text{err}}}{N_{\text{ref}}}$$

where:
- $N_{\text{ref}}$ is the total number of reference (ground-truth) speaker segments across all speakers;
- $N_{\text{err}}$ is the number of reference segments that fail to be matched by any hypothesis segment according to a temporal-overlap and IoU-based matching rule.

SER is typically reported as a percentage ($100 \times N_{\text{err}} / N_{\text{ref}}$).
2. Algorithmic Procedure: Connected Sub-Graphs and Adaptive IoU
SER is computed according to the following multi-stage procedure:
A. Optimal Speaker Mapping
- Perform bipartite (Hungarian) matching between reference and hypothesis speakers, so that each reference speaker is mapped to at most one hypothesis speaker. Hypothesis speakers not mapped to any reference speaker are ignored for SER.
B. Per-Speaker Segment Matching
- For each mapped reference-hypothesis speaker pair:
1. Collect the set of all reference segments (for the reference speaker) and all hypothesis segments (for the mapped hypothesis speaker).
2. Construct an undirected graph in which each node is a segment (reference or hypothesis), with an edge between two nodes if their segments overlap in time.
3. Decompose the graph into connected components (sub-graphs) $G_1, \dots, G_K$; each $G_k$ contains temporally co-located reference and hypothesis segments.
4. For each component $G_k$, compute the Intersection-over-Union (IoU) between the union $R_k$ of its reference segments and the union $H_k$ of its hypothesis segments:

$$\mathrm{IoU}(G_k) = \frac{|R_k \cap H_k|}{|R_k \cup H_k|}$$
5. Compare $\mathrm{IoU}(G_k)$ to an adaptive threshold $\tau_k$, computed from $D_k$, the total duration of reference segments in $G_k$; $n_k$, the number of reference segments in $G_k$; and a small boundary tolerance $\delta$. A lower bound on $\tau_k$ prevents unreasonably low thresholds for very short segments.
6. If $\mathrm{IoU}(G_k) < \tau_k$, all reference segments in $G_k$ are counted as errors; otherwise, all are counted as matched.
7. Any isolated reference node (one with no overlapping hypothesis segment) forms its own component with $\mathrm{IoU} = 0$ and is counted as an error.
C. SER Finalization
- The per-component error counts are accumulated over all mapped speakers and their components; the final SER is then computed per the definition above.
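The per-speaker matching procedure above can be sketched in Python. The interval-sweep grouping and IoU computation follow the steps as described; the adaptive-threshold formula below (with tolerance `delta` and floor `tau_min`) is an illustrative assumption, not the exact form from Liu et al. (2022).

```python
def union_length(intervals):
    """Total time covered by a set of (start, end) intervals."""
    total, end = 0.0, float("-inf")
    for s, e in sorted(intervals):
        total += max(0.0, e - max(s, end))
        end = max(end, e)
    return total

def components(ref, hyp):
    """Connected components of the temporal-overlap graph over segments."""
    nodes = [(s, e, "ref") for s, e in ref] + [(s, e, "hyp") for s, e in hyp]
    nodes.sort()
    comps, cur, cur_end = [], [], float("-inf")
    for s, e, kind in nodes:
        if cur and s >= cur_end:      # gap: no overlap with the open component
            comps.append(cur)
            cur, cur_end = [], float("-inf")
        cur.append((s, e, kind))
        cur_end = max(cur_end, e)
    if cur:
        comps.append(cur)
    return comps

def segment_errors(ref, hyp, delta=0.25, tau_min=0.3):
    """Count missed reference segments for one mapped speaker pair.

    delta (boundary tolerance), tau_min (threshold floor), and the
    threshold formula itself are illustrative assumptions.
    """
    n_err = 0
    for comp in components(ref, hyp):
        R = [(s, e) for s, e, k in comp if k == "ref"]
        H = [(s, e) for s, e, k in comp if k == "hyp"]
        if not R:
            continue                  # hypothesis-only component
        if not H:
            n_err += len(R)           # isolated reference nodes: IoU = 0
            continue
        union = union_length(R + H)
        inter = union_length(R) + union_length(H) - union
        iou = inter / union
        d = sum(e - s for s, e in R)  # total reference duration in the component
        tau = max(tau_min, (d - 2 * len(R) * delta) / d)  # assumed form
        if iou < tau:
            n_err += len(R)           # the whole component fails
    return n_err, len(ref)
```

Accumulating the returned counts over all mapped speaker pairs and dividing the totals yields SER.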
3. Concrete Example
Consider a reference speaker with three segments R1, R2, R3, and a matched hypothesis speaker with four segments H1, H2, H3, H4, grouped into three components:
- G1: R1 overlaps H1 and H2 (high IoU).
- G2: R2 overlaps H3 (high IoU).
- G3: R3 overlaps H4 only marginally (low IoU).
G1 and G2 meet the adaptive threshold and are counted as matched; G3 falls below it, so R3 is counted as an error. If this pattern holds across all speakers such that $5$ out of $20$ reference segments are errors, then $\mathrm{SER} = 25\%$.
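With hypothetical boundaries chosen to reproduce this pattern (the specific times below and the fixed threshold of $0.5$ are illustrative assumptions, not values from the metric's definition), the three component IoUs can be worked out directly:

```python
# Hypothetical segments: R1=[0.0,2.0], R2=[3.0,4.0], R3=[5.0,5.5];
# H1=[0.0,1.0], H2=[1.1,2.1], H3=[3.1,4.0], H4=[5.4,6.2].

# G1: R1 vs {H1, H2}
inter_g1 = 1.0 + 0.9          # R1 covers all of H1 and [1.1, 2.0] of H2
union_g1 = 2.1                # the combined span [0.0, 2.1]
iou_g1 = inter_g1 / union_g1  # ~0.905

# G2: R2 vs H3
inter_g2 = 0.9                # H3 lies entirely inside R2
union_g2 = 1.0                # R2 itself, [3.0, 4.0]
iou_g2 = inter_g2 / union_g2  # 0.9

# G3: R3 vs H4
inter_g3 = 0.1                # only [5.4, 5.5] overlaps
union_g3 = 1.2                # the combined span [5.0, 6.2]
iou_g3 = inter_g3 / union_g3  # ~0.083

tau = 0.5                     # illustrative fixed threshold
matched = [iou >= tau for iou in (iou_g1, iou_g2, iou_g3)]
# matched == [True, True, False]: only R3 is counted as an error
```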
4. Comparison: SER vs. DER and JER
| Metric | Unit of Error Weighting | Sensitivity to Short Segments | Speaker Normalization |
|---|---|---|---|
| DER | Duration | Low | No |
| JER | Duration (per speaker) | Low | Yes |
| SER | Segment | High | Yes |
DER aggregates errors by total temporal duration, so errors in short segments or from speakers with little total speech are diluted. JER balances errors across speakers but remains duration-centric within each speaker. SER instead counts each reference segment as a single unit, giving short utterances (such as "yes"/"no") and rarely present speakers full influence on the error rate. A system that splits or merges many short segments may therefore achieve low DER and JER yet high SER, making SER a sensitive diagnostic for segmentation mistakes.
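A toy tally (all numbers hypothetical) makes the weighting difference concrete: suppose a recording contains one correctly handled 60-second segment and ten one-second back-channels that the system misses entirely.

```python
# Toy comparison of duration-weighted vs segment-weighted error counting.
# One long 60 s segment is diarized correctly; ten 1 s back-channels are
# all missed. All numbers are hypothetical.

durations = [60.0] + [1.0] * 10    # reference segment durations in seconds
missed    = [False] + [True] * 10  # which segments the system missed

total_time  = sum(durations)                                  # 70 s
missed_time = sum(d for d, m in zip(durations, missed) if m)  # 10 s

# Duration-weighted miss rate (the missed-speech term of a DER-style score)
duration_error = 100.0 * missed_time / total_time   # ~14.3%

# Segment-weighted miss rate (SER-style: each segment counts once)
segment_error = 100.0 * sum(missed) / len(missed)   # ~90.9%
```

The same mistakes that barely move the duration-weighted figure dominate the segment-weighted one, which is exactly the sensitivity SER is designed to provide.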
5. Experimental Results and Observations
Empirical evaluation of SER on five public benchmarks (AMI, CALLHOME, DIHARD2, VoxConverse, MSDWild) demonstrates distinct diagnostic properties:
- On the CALLHOME 2-speaker subset, a baseline modular pipeline provides the reference operating point for DER, JER, and SER comparisons.
- A Bayesian HMM clustering (VBx) system lowers both DER and SER relative to this baseline, indicating improved segmentation as well as overall time accuracy.
- End-to-end EEND-VC, which operates without hand-crafted segment priors, attains a low SER, highlighting effective short-segment delineation.
- On MSDWild, a multi-modal system yields a modest DER reduction but a substantially larger SER/BER reduction, illustrating that visual cues particularly aid segment-boundary accuracy.
SER thus exposes segmentation errors overlooked by duration-centric metrics and offers a complementary perspective on diarization performance.
6. Broader Impact and Integration in Comprehensive Metrics
SER has been incorporated into the Balanced Error Rate (BER) metric, combining duration error, segment error, and speaker-weighted error for a more complete evaluation of diarization systems. Through its segment-centric perspective, SER supports rigorous diagnosis of errors involving short or infrequently active speakers, which may be critical for applications demanding reliable detection of brief, yet semantically significant, utterances. Its design encourages the diarization community to optimize both temporal alignment and fine-grained segment detection, particularly in conditions where duration-only statistics are insufficiently discriminative.
7. Limitations and Use Considerations
SER relies on accurate segment-level annotations and robust speaker mapping; its value is maximized when used alongside DER and JER. Overly simplistic segmentation or collapsed matching can either artificially inflate or suppress SER, depending on segment granularity. Attention must be paid to the choice of collar and IoU lower bound thresholds to ensure metric stability, particularly for very short segments. Deploying SER in benchmarking or system development offers a principled approach to evaluating diarization models, especially in domains emphasizing local utterance discrimination rather than aggregate time allocation.