
Segment Error Rate in Speaker Diarization

Updated 16 November 2025
  • Segment Error Rate (SER) is a metric that quantifies diarization errors by counting mismatched speaker segments, focusing on short and infrequent utterances.
  • SER is computed through optimal speaker mapping, graph-based segment grouping, and an adaptive IoU threshold, ensuring precise segment-level evaluation.
  • Experimental results show that SER is more sensitive to segmentation errors than DER and JER, offering deeper insights for improving diarization systems.

Segment Error Rate (SER) is a metric introduced to provide a segment-level evaluation of speaker diarization systems, addressing limitations of traditional duration-weighted metrics such as Diarization Error Rate (DER) and Jaccard Error Rate (JER) (Liu et al., 2022). SER quantifies the fraction of reference speaker segments that are not correctly matched at the segment level, thereby emphasizing the accurate detection of short and infrequent speaker turns rather than total duration alone. The metric is particularly sensitive to errors involving short utterances or rarely speaking participants, to which conventional metrics tend to be insensitive.

1. Formal Definition

SER is defined as the ratio

$$\mathrm{SER} = \frac{\#\text{error\_segs}}{\#\text{REF\_segs}}, \qquad \mathrm{SER} \in [0, 1],$$

where:

  • $\#\text{REF\_segs}$ is the total number of reference (ground-truth) speaker segments across all speakers;
  • $\#\text{error\_segs}$ is the number of reference segments that fail to be matched by any hypothesis segment according to a temporal-overlap and IoU-based matching rule.

SER is typically reported as a percentage ($\mathrm{SER} \times 100$).
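
As a minimal illustrative sketch in Python (with hypothetical names, not a reference implementation), the metric reduces to a simple ratio of counts once the segment matching described below has been performed:

```python
def segment_error_rate(n_error_segs: int, n_ref_segs: int) -> float:
    """SER = (# unmatched reference segments) / (# reference segments)."""
    if n_ref_segs == 0:
        raise ValueError("SER is undefined when there are no reference segments")
    return n_error_segs / n_ref_segs

# e.g., 5 unmatched reference segments out of 20 -> 0.25, reported as 25%
print(f"{segment_error_rate(5, 20):.2%}")  # 25.00%
```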

2. Algorithmic Procedure: Connected Sub-Graphs and Adaptive IoU

SER is computed according to the following multi-stage procedure:

A. Optimal Speaker Mapping

  • Perform bipartite (Hungarian) matching between reference and hypothesis speakers. Each reference speaker $s$ is mapped to a unique hypothesis speaker $h$. Hypothesis speakers not mapped to any reference speaker are ignored for SER.
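
A minimal sketch of this mapping step is given below, assuming SciPy's Hungarian solver (`scipy.optimize.linear_sum_assignment`) and a cost equal to the negative total overlap duration between each reference/hypothesis speaker pair; the exact cost used in the original work may differ, and the data layout (dicts of `(start, end)` tuples) is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def overlap(a, b):
    """Temporal overlap (in seconds) between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def map_speakers(ref_segs, hyp_segs):
    """Map each reference speaker to at most one hypothesis speaker.

    ref_segs / hyp_segs: dict speaker_label -> list of (start, end) tuples.
    Returns a dict {reference_speaker: hypothesis_speaker}.
    """
    ref_spks, hyp_spks = sorted(ref_segs), sorted(hyp_segs)
    # Cost = negative total overlap, so the assignment maximizes shared speech time.
    cost = np.zeros((len(ref_spks), len(hyp_spks)))
    for i, r in enumerate(ref_spks):
        for j, h in enumerate(hyp_spks):
            cost[i, j] = -sum(overlap(rs, hs) for rs in ref_segs[r] for hs in hyp_segs[h])
    rows, cols = linear_sum_assignment(cost)
    # Keep only pairings with nonzero overlap; leftover hypothesis speakers are ignored for SER.
    return {ref_spks[i]: hyp_spks[j] for i, j in zip(rows, cols) if cost[i, j] < 0}
```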

B. Per-Speaker Segment Matching

  • For each mapped speaker pair $(s, h)$:

    1. Collect the set of all reference segments $U^{\text{REF}}$ (for $s$) and all hypothesis segments $U^{\text{HYP}}$ (for $h$).
    2. Construct an undirected graph where each node is a segment (reference or hypothesis). Place an edge between nodes whose segments overlap in time.
    3. Decompose the graph into connected components (sub-graphs) $\{G_k\}$; each $G_k$ contains temporally co-located reference and hypothesis segments.
    4. For each component $G_k$, compute the Intersection-over-Union (IoU) between the union of reference segments and the union of hypothesis segments:

      $$\mathrm{IoU} = \frac{\left|\bigcup_i r_i \,\cap\, \bigcup_j h_j\right|}{\left|\bigcup_i r_i \,\cup\, \bigcup_j h_j\right|}$$

    5. Compare the IoU to an adaptive threshold $\tau_k$:

      $$\tau_k = \max\!\left(\frac{D_k - 2 \cdot \mathrm{collar} \cdot n_k}{D_k + 2 \cdot \mathrm{collar} \cdot n_k},\ \mathrm{lb}\right)$$

      where $D_k$ is the total duration of reference segments in $G_k$, $n_k$ is the number of reference segments in $G_k$, $\mathrm{collar}$ is a small tolerance, and $\mathrm{lb}$ (a lower bound, e.g., $\mathrm{lb} = 0.5$) prevents the threshold from becoming unreasonably low.
    6. If $\mathrm{IoU} < \tau_k$, all reference segments in $G_k$ are counted as errors; otherwise, they are counted as matched.
    7. Any isolated reference node (one with no overlapping hypothesis segment) forms its own component with $\mathrm{IoU} = 0$ and is counted as an error.
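
A compact Python sketch of this per-speaker matching (steps 1-7) follows. The segment representation, the union-find component search, and the default `collar`/`lb` values (0.25 s and 0.5) are assumptions for illustration rather than a reference implementation.

```python
def union_duration(intervals):
    """Total time covered by a set of (start, end) intervals, with overlaps merged."""
    total, prev_end = 0.0, None
    for start, end in sorted(intervals):
        if prev_end is None or start > prev_end:
            total += end - start
            prev_end = end
        elif end > prev_end:
            total += end - prev_end
            prev_end = end
    return total


def match_speaker_segments(ref, hyp, collar=0.25, lb=0.5):
    """Count error segments for one mapped speaker pair; returns (errors, n_ref_segments)."""
    # Steps 1-3: build the temporal-overlap graph over all segments and extract
    # connected components (union-find over pairwise overlaps).
    nodes = [("R", seg) for seg in ref] + [("H", seg) for seg in hyp]
    parent = list(range(len(nodes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            a, b = nodes[i][1], nodes[j][1]
            if min(a[1], b[1]) > max(a[0], b[0]):  # segments overlap in time
                parent[find(i)] = find(j)

    components = {}
    for i, node in enumerate(nodes):
        components.setdefault(find(i), []).append(node)

    errors = 0
    for comp in components.values():
        r = [seg for kind, seg in comp if kind == "R"]
        h = [seg for kind, seg in comp if kind == "H"]
        if not r:
            continue                 # hypothesis-only component: no reference segment to count
        if not h:
            errors += len(r)         # step 7: isolated reference segments have IoU = 0
            continue
        # Step 4: IoU between the union of reference and the union of hypothesis segments.
        union_all = union_duration(r + h)
        inter = union_duration(r) + union_duration(h) - union_all
        iou = inter / union_all
        # Step 5: adaptive threshold from reference duration, segment count, and collar.
        d_k, n_k = union_duration(r), len(r)
        tau = max((d_k - 2 * collar * n_k) / (d_k + 2 * collar * n_k), lb)
        # Step 6: a component below threshold contributes all of its reference segments as errors.
        if iou < tau:
            errors += len(r)
    return errors, len(ref)
```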

C. SER Finalization

  • $\#\text{error\_segs}$ is accumulated over all speakers and components; the final SER is then computed according to the definition above.
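
The aggregation across speakers can be sketched as below, assuming the `map_speakers` and `match_speaker_segments` helpers from the earlier sketches (hypothetical names) and the same dict-of-intervals layout; reference speakers left unmapped contribute all of their segments as errors.

```python
def compute_ser(ref_segs, hyp_segs, collar=0.25, lb=0.5):
    """Accumulate error segments over all speakers and return SER in [0, 1]."""
    mapping = map_speakers(ref_segs, hyp_segs)
    total_errors = total_refs = 0
    for ref_spk, ref in ref_segs.items():
        hyp = hyp_segs.get(mapping.get(ref_spk), [])  # unmapped speaker -> no hypothesis segments
        errors, n_ref = match_speaker_segments(ref, hyp, collar, lb)
        total_errors += errors
        total_refs += n_ref
    return total_errors / total_refs if total_refs else 0.0
```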

3. Concrete Example

Consider a reference speaker $s$ with three segments, R1 $[0.0, 1.0]$, R2 $[1.5, 2.5]$, and R3 $[3.0, 4.0]$, and the matched hypothesis speaker $h$ with four segments, H1 $[0.0, 0.7]$, H2 $[0.8, 1.1]$, H3 $[1.4, 2.6]$, and H4 $[3.5, 3.8]$.

  • G1: R1 overlaps H1 and H2 ($\mathrm{IoU} = 0.9 / 1.1 \approx 0.82$).
  • G2: R2 overlaps H3 ($\mathrm{IoU} = 1.0 / 1.2 \approx 0.83$).
  • G3: R3 overlaps H4 ($\mathrm{IoU} = 0.3$).

Assuming the adaptive threshold evaluates to $\tau_k = 0.5$ for each component, G1 and G2 are matched, while G3 fails ($\mathrm{IoU} < \tau_k$), so R3 is counted as an error. If this pattern holds across all speakers such that $5$ out of $20$ reference segments are errors, then $\mathrm{SER} = 25\%$.
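
Running the matching sketch from Section 2 on this example (same assumed helpers and data layout) reproduces the component decisions:

```python
ref = {"s": [(0.0, 1.0), (1.5, 2.5), (3.0, 4.0)]}
hyp = {"h": [(0.0, 0.7), (0.8, 1.1), (1.4, 2.6), (3.5, 3.8)]}

# G1 and G2 clear the 0.5 threshold; G3 (IoU = 0.3) does not.
errors, n_ref = match_speaker_segments(ref["s"], hyp["h"], collar=0.25, lb=0.5)
print(errors, n_ref)  # 1 3  -> one error segment out of three for this speaker
```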

4. Comparison: SER vs. DER and JER

| Metric | Unit of Error Weighting | Sensitivity to Short Segments | Speaker Normalization |
|--------|-------------------------|-------------------------------|-----------------------|
| DER    | Duration                | Low                           | No                    |
| JER    | Duration (per speaker)  | Low                           | Yes                   |
| SER    | Segment                 | High                          | Yes                   |

DER aggregates errors by total temporal duration, so errors in short segments or from speakers with little speech are diluted. JER balances errors across speakers but remains duration-centric within each speaker. SER instead counts each reference segment as a single unit, giving short utterances (such as "yes"/"no") and rarely present speakers full influence on the error rate. Systems that split or merge many short segments may therefore achieve low DER/JER yet high SER, making SER a sensitive diagnostic for segmentation mistakes.

5. Experimental Results and Observations

Empirical evaluation of SER on five public benchmarks (AMI, CALLHOME, DIHARD2, VoxConverse, MSDWild) demonstrates distinct diagnostic properties:

  • On CALLHOME 2-speaker: a baseline modular pipeline achieves SER $\approx 44.9\%$, DER $\approx 27.7\%$, and JER $\approx 48.5\%$.
  • A Bayesian HMM (VBx) system lowers both DER ($21.1\%$) and SER ($36.5\%$), indicating improved segmentation and time accuracy.
  • End-to-end EEND-VC, without segment priors, results in DER $\approx 23.5\%$ and SER $\approx 23.3\%$, highlighting effective short-segment delineation.
  • A multi-modal MSDWild system yields a ${\sim}27\%$ DER drop but a ${\sim}46\%$ SER/BER drop, illustrating that visual cues particularly aid segment boundary accuracy.

SER thus exposes segmentation errors overlooked by duration-centric metrics and offers a complementary perspective on diarization performance.

6. Broader Impact and Integration in Comprehensive Metrics

SER has been incorporated into the Balanced Error Rate (BER) metric, combining duration error, segment error, and speaker-weighted error for a more complete evaluation of diarization systems. Through its segment-centric perspective, SER supports rigorous diagnosis of errors involving short or infrequently active speakers, which may be critical for applications demanding reliable detection of brief, yet semantically significant, utterances. Its design encourages the diarization community to optimize both temporal alignment and fine-grained segment detection, particularly in conditions where duration-only statistics are insufficiently discriminative.

7. Limitations and Use Considerations

SER relies on accurate segment-level annotations and robust speaker mapping; its value is maximized when used alongside DER and JER. Overly simplistic segmentation or collapsed matching can either artificially inflate or suppress SER, depending on segment granularity. Attention must be paid to the choice of collar and IoU lower bound thresholds to ensure metric stability, particularly for very short segments. Deploying SER in benchmarking or system development offers a principled approach to evaluating diarization models, especially in domains emphasizing local utterance discrimination rather than aggregate time allocation.
