
Segment Error Rate in Speaker Diarization

Updated 16 November 2025
  • Segment Error Rate (SER) is a metric that quantifies diarization errors by counting mismatched speaker segments, focusing on short and infrequent utterances.
  • SER is computed through optimal speaker mapping, graph-based segment grouping, and an adaptive IoU threshold, ensuring precise segment-level evaluation.
  • Experimental results show that SER is more sensitive to segmentation errors than DER and JER, offering deeper insights for improving diarization systems.

Segment Error Rate (SER) is a metric introduced to provide a segment-level evaluation of speaker diarization systems, addressing limitations of traditional duration-weighted metrics such as Diarization Error Rate (DER) and Jaccard Error Rate (JER) (Liu et al., 2022). SER quantifies the fraction of reference speaker segments that are not correctly matched at the segment level, thereby emphasizing the accurate detection of short and infrequent speaker turns rather than total duration alone. The metric is particularly sensitive to errors involving short utterances or rarely speaking participants, to which conventional metrics tend to be insensitive.

1. Formal Definition

SER is defined as the ratio

$$\mathrm{SER} = \frac{\#\text{error\_segs}}{\#\text{REF\_segs}}, \qquad \mathrm{SER} \in [0, 1],$$

where:

  • $\#\text{REF\_segs}$ is the total number of reference (ground-truth) speaker segments across all speakers;
  • $\#\text{error\_segs}$ is the number of reference segments that fail to be matched by any hypothesis segment according to a temporal-overlap and IoU-based matching rule.

SER is typically reported as a percentage ($\mathrm{SER} \times 100$).
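
As a minimal illustrative sketch in Python (with hypothetical names, not a reference implementation), the metric reduces to a simple ratio of counts once the segment matching described below has been performed:

```python
def segment_error_rate(n_error_segs: int, n_ref_segs: int) -> float:
    """SER = (# unmatched reference segments) / (# reference segments)."""
    if n_ref_segs == 0:
        raise ValueError("SER is undefined when there are no reference segments")
    return n_error_segs / n_ref_segs

# e.g., 5 unmatched reference segments out of 20 -> 0.25, reported as 25%
print(f"{segment_error_rate(5, 20):.2%}")  # 25.00%
```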

2. Algorithmic Procedure: Connected Sub-Graphs and Adaptive IoU

SER is computed according to the following multi-stage procedure:

A. Optimal Speaker Mapping

  • Perform bipartite (Hungarian) matching between reference and hypothesis speakers. Each reference speaker $s$ is mapped to a unique hypothesis speaker $h$. Hypothesis speakers not mapped to any reference speaker are ignored for SER.
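
A minimal sketch of this mapping step is given below, assuming SciPy's Hungarian solver (`scipy.optimize.linear_sum_assignment`) and a cost equal to the negative total overlap duration between each reference/hypothesis speaker pair; the exact cost used in the original work may differ, and the data layout (dicts of `(start, end)` tuples) is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def overlap(a, b):
    """Temporal overlap (in seconds) between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def map_speakers(ref_segs, hyp_segs):
    """Map each reference speaker to at most one hypothesis speaker.

    ref_segs / hyp_segs: dict speaker_label -> list of (start, end) tuples.
    Returns a dict {reference_speaker: hypothesis_speaker}.
    """
    ref_spks, hyp_spks = sorted(ref_segs), sorted(hyp_segs)
    # Cost = negative total overlap, so the assignment maximizes shared speech time.
    cost = np.zeros((len(ref_spks), len(hyp_spks)))
    for i, r in enumerate(ref_spks):
        for j, h in enumerate(hyp_spks):
            cost[i, j] = -sum(overlap(rs, hs) for rs in ref_segs[r] for hs in hyp_segs[h])
    rows, cols = linear_sum_assignment(cost)
    # Keep only pairings with nonzero overlap; leftover hypothesis speakers are ignored for SER.
    return {ref_spks[i]: hyp_spks[j] for i, j in zip(rows, cols) if cost[i, j] < 0}
```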

B. Per-Speaker Segment Matching

  • For each mapped speaker pair $(s, h)$:

    1. Collect the set of all reference segments $U^{\text{REF}}$ (for $s$) and all hypothesis segments $U^{\text{HYP}}$ (for $h$).
    2. Construct an undirected graph where each node is a segment (reference or hypothesis). Place an edge between nodes whose segments overlap in time.
    3. Decompose the graph into connected components (sub-graphs) $\{G_k\}$; each $G_k$ contains temporally co-located reference and hypothesis segments.
    4. For each component $G_k$, compute the Intersection-over-Union (IoU) between the union of reference segments and the union of hypothesis segments:

      $$\mathrm{IoU} = \frac{\left|\bigcup_i r_i \,\cap\, \bigcup_j h_j\right|}{\left|\bigcup_i r_i \,\cup\, \bigcup_j h_j\right|}$$

    5. Compare the IoU to an adaptive threshold $\tau_k$:

      $$\tau_k = \max\!\left(\frac{D_k - 2 \cdot \mathrm{collar} \cdot n_k}{D_k + 2 \cdot \mathrm{collar} \cdot n_k},\ \mathrm{lb}\right)$$

      where $D_k$ is the total duration of reference segments in $G_k$, $n_k$ is the number of reference segments in $G_k$, $\mathrm{collar}$ is a small tolerance, and $\mathrm{lb}$ (a lower bound, e.g., $\mathrm{lb} = 0.5$) prevents the threshold from becoming unreasonably low.
    6. If $\mathrm{IoU} < \tau_k$, all reference segments in $G_k$ are counted as errors; otherwise, they are counted as matched.
    7. Any isolated reference node (one with no overlapping hypothesis segment) forms its own component with $\mathrm{IoU} = 0$ and is counted as an error.
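
A compact Python sketch of this per-speaker matching (steps 1-7) follows. The segment representation, the union-find component search, and the default `collar`/`lb` values (0.25 s and 0.5) are assumptions for illustration rather than a reference implementation.

```python
def union_duration(intervals):
    """Total time covered by a set of (start, end) intervals, with overlaps merged."""
    total, prev_end = 0.0, None
    for start, end in sorted(intervals):
        if prev_end is None or start > prev_end:
            total += end - start
            prev_end = end
        elif end > prev_end:
            total += end - prev_end
            prev_end = end
    return total


def match_speaker_segments(ref, hyp, collar=0.25, lb=0.5):
    """Count error segments for one mapped speaker pair; returns (errors, n_ref_segments)."""
    # Steps 1-3: build the temporal-overlap graph over all segments and extract
    # connected components (union-find over pairwise overlaps).
    nodes = [("R", seg) for seg in ref] + [("H", seg) for seg in hyp]
    parent = list(range(len(nodes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            a, b = nodes[i][1], nodes[j][1]
            if min(a[1], b[1]) > max(a[0], b[0]):  # segments overlap in time
                parent[find(i)] = find(j)

    components = {}
    for i, node in enumerate(nodes):
        components.setdefault(find(i), []).append(node)

    errors = 0
    for comp in components.values():
        r = [seg for kind, seg in comp if kind == "R"]
        h = [seg for kind, seg in comp if kind == "H"]
        if not r:
            continue                 # hypothesis-only component: no reference segment to count
        if not h:
            errors += len(r)         # step 7: isolated reference segments have IoU = 0
            continue
        # Step 4: IoU between the union of reference and the union of hypothesis segments.
        union_all = union_duration(r + h)
        inter = union_duration(r) + union_duration(h) - union_all
        iou = inter / union_all
        # Step 5: adaptive threshold from reference duration, segment count, and collar.
        d_k, n_k = union_duration(r), len(r)
        tau = max((d_k - 2 * collar * n_k) / (d_k + 2 * collar * n_k), lb)
        # Step 6: a component below threshold contributes all of its reference segments as errors.
        if iou < tau:
            errors += len(r)
    return errors, len(ref)
```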

C. SER Finalization

  • $\#\text{error\_segs}$ is accumulated over all speakers and components; the final SER is then computed according to the definition above.
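
The aggregation across speakers can be sketched as below, assuming the `map_speakers` and `match_speaker_segments` helpers from the earlier sketches (hypothetical names) and the same dict-of-intervals layout; reference speakers left unmapped contribute all of their segments as errors.

```python
def compute_ser(ref_segs, hyp_segs, collar=0.25, lb=0.5):
    """Accumulate error segments over all speakers and return SER in [0, 1]."""
    mapping = map_speakers(ref_segs, hyp_segs)
    total_errors = total_refs = 0
    for ref_spk, ref in ref_segs.items():
        hyp = hyp_segs.get(mapping.get(ref_spk), [])  # unmapped speaker -> no hypothesis segments
        errors, n_ref = match_speaker_segments(ref, hyp, collar, lb)
        total_errors += errors
        total_refs += n_ref
    return total_errors / total_refs if total_refs else 0.0
```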

3. Concrete Example

Consider a reference speaker $s$ with three segments, R1 $[0.0, 1.0]$, R2 $[1.5, 2.5]$, and R3 $[3.0, 4.0]$, and the matched hypothesis speaker $h$ with four segments, H1 $[0.0, 0.7]$, H2 $[0.8, 1.1]$, H3 $[1.4, 2.6]$, and H4 $[3.5, 3.8]$.

  • G1: R1 overlaps H1 and H2 ($\mathrm{IoU} = 0.9 / 1.1 \approx 0.82$).
  • G2: R2 overlaps H3 ($\mathrm{IoU} = 1.0 / 1.2 \approx 0.83$).
  • G3: R3 overlaps H4 ($\mathrm{IoU} = 0.3$).

Assuming the adaptive threshold evaluates to $\tau_k = 0.5$ for each component, G1 and G2 are matched, while G3 fails ($\mathrm{IoU} < \tau_k$), so R3 is counted as an error. If this pattern holds across all speakers such that $5$ out of $20$ reference segments are errors, then $\mathrm{SER} = 25\%$.
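
Running the matching sketch from Section 2 on this example (same assumed helpers and data layout) reproduces the component decisions:

```python
ref = {"s": [(0.0, 1.0), (1.5, 2.5), (3.0, 4.0)]}
hyp = {"h": [(0.0, 0.7), (0.8, 1.1), (1.4, 2.6), (3.5, 3.8)]}

# G1 and G2 clear the 0.5 threshold; G3 (IoU = 0.3) does not.
errors, n_ref = match_speaker_segments(ref["s"], hyp["h"], collar=0.25, lb=0.5)
print(errors, n_ref)  # 1 3  -> one error segment out of three for this speaker
```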

4. Comparison: SER vs. DER and JER

| Metric | Unit of Error Weighting | Sensitivity to Short Segments | Speaker Normalization |
|--------|-------------------------|-------------------------------|-----------------------|
| DER    | Duration                | Low                           | No                    |
| JER    | Duration (per speaker)  | Low                           | Yes                   |
| SER    | Segment                 | High                          | Yes                   |

DER aggregates errors by total temporal duration, so errors in short segments or from speakers with little speech are diluted. JER balances errors across speakers but remains duration-centric within each speaker. SER instead counts each reference segment as a single unit, giving short utterances (such as "yes"/"no") and rarely present speakers full influence on the error rate. Systems that split or merge many short segments may therefore achieve low DER/JER yet high SER, making SER a sensitive diagnostic for segmentation mistakes.

5. Experimental Results and Observations

Empirical evaluation of SER on five public benchmarks (AMI, CALLHOME, DIHARD2, VoxConverse, MSDWild) demonstrates distinct diagnostic properties:

  • On CALLHOME 2-speaker: a baseline modular pipeline achieves SER $\approx 44.9\%$, DER $\approx 27.7\%$, and JER $\approx 48.5\%$.
  • A Bayesian HMM (VBx) system lowers both DER ($21.1\%$) and SER ($36.5\%$), indicating improved segmentation and time accuracy.
  • End-to-end EEND-VC, without segment priors, results in DER $\approx 23.5\%$ and SER $\approx 23.3\%$, highlighting effective short-segment delineation.
  • A multi-modal MSDWild system yields a ${\sim}27\%$ DER drop but a ${\sim}46\%$ SER/BER drop, illustrating that visual cues particularly aid segment boundary accuracy.

SER thus exposes segmentation errors overlooked by duration-centric metrics and offers a complementary perspective on diarization performance.

6. Broader Impact and Integration in Comprehensive Metrics

SER has been incorporated into the Balanced Error Rate (BER) metric, combining duration error, segment error, and speaker-weighted error for a more complete evaluation of diarization systems. Through its segment-centric perspective, SER supports rigorous diagnosis of errors involving short or infrequently active speakers, which may be critical for applications demanding reliable detection of brief, yet semantically significant, utterances. Its design encourages the diarization community to optimize both temporal alignment and fine-grained segment detection, particularly in conditions where duration-only statistics are insufficiently discriminative.

7. Limitations and Use Considerations

SER relies on accurate segment-level annotations and robust speaker mapping; its value is maximized when used alongside DER and JER. Overly simplistic segmentation or collapsed matching can either artificially inflate or suppress SER, depending on segment granularity. Attention must be paid to the choice of collar and IoU lower bound thresholds to ensure metric stability, particularly for very short segments. Deploying SER in benchmarking or system development offers a principled approach to evaluating diarization models, especially in domains emphasizing local utterance discrimination rather than aggregate time allocation.
