Speech Length Compliant (SLC) Score
- The SLC score measures how closely the duration of output speech matches the duration of the input, using their ratio to assess temporal consistency in translation systems.
- It uses a standardized EvalRecord to map diverse data inputs, supporting both streaming and offline system evaluations.
- The metric aids in comparing systems by balancing latency, translation quality, and speech naturalness across various evaluation dimensions.
OpenSTBench is an open-source, Python-based framework for the unified, multidimensional evaluation of speech translation systems, encompassing both speech-to-text translation (“S2TT”) and speech-to-speech translation (“S2ST”) in offline and streaming configurations. It addresses the heterogeneity of system outputs—varying in modality, realization, and temporal behavior—by organizing data into a standardized format (“EvalRecord”) and applying synchronized protocols for metric computation, aggregation, and reporting. OpenSTBench supports comprehensive joint evaluation across translation quality, speech quality, speaker and emotion preservation, paralinguistic fidelity, temporal consistency, and latency, facilitating application-oriented comparisons of end-to-end speech translation systems (An et al., 29 May 2026).
1. Framework Architecture and Data Flow
The OpenSTBench workflow is structured in four sequential stages designed to ensure consistent metric computation and reproducibility:
- Input Processing and Normalization
- System output is read by a system-adapter supporting text transcripts, audio files, and timing.
- For S2ST outputs, an ASR model (e.g., Whisper) may provide an automatic transcript for downstream metrics.
- Streaming outputs provide per-segment or per-token timestamps; offline outputs supply utterance-level timing.
- Conversion to Evaluation Record (“EvalRecord”)
- All normalized data is mapped into a Python object with fields for identifiers, audio inputs, segment timestamps, source/target text, output audio, timing, and metadata describing language pair and system type.
- Metric Computation
- An extensible registry of “Evaluator” modules selectively operates on the EvalRecord, invoking only metrics that are appropriate for each sample and system type.
- Text-side metrics: BLEU, chrF++, COMET, BLEURT.
- Speech-side metrics: UTMOS, CER/WER, speaker similarity, emotion similarity, event F1.
- Temporal metrics: SLC (speech length compliance), Start Offset, Average Token Delay (ATD), Custom ATD, Real-Time Factor (RTF).
- Aggregation and Reporting
- Per-sample scores for each metric are aggregated to obtain means.
- Summarized results are exported to configurable JSON/CSV, with optional radar plot visualizations after fixed-range normalization.
All system outputs are thus mapped to triplets of text, audio, and timing information, supporting direct comparison between S2TT (no audio output) and S2ST (offline or streaming).
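The record-then-dispatch flow described above can be sketched in a few lines. This is a minimal illustration, not the OpenSTBench API: the field names, the `register` and `evaluate` helpers, and the two toy metrics are all hypothetical stand-ins chosen to mirror the fields listed in the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class EvalRecord:
    """Hypothetical schema; field names are illustrative only."""
    sample_id: str
    source_audio: str                       # path to the input audio
    system_type: str                        # "s2tt" or "s2st"
    mode: str                               # "offline" or "streaming"
    lang_pair: Tuple[str, str]              # e.g. ("en", "zh")
    reference_text: Optional[str] = None
    hypothesis_text: Optional[str] = None   # system text or ASR transcript
    output_audio: Optional[str] = None      # stays None for S2TT
    segment_times: List[Tuple[float, float]] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


# Registry: metric name -> (applicability predicate, scoring function).
Evaluator = Tuple[Callable[[EvalRecord], bool], Callable[[EvalRecord], float]]
REGISTRY: Dict[str, Evaluator] = {}


def register(name: str, applies, score) -> None:
    REGISTRY[name] = (applies, score)


def evaluate(record: EvalRecord) -> Dict[str, float]:
    """Run only the evaluators whose predicate accepts this record."""
    return {name: score(record)
            for name, (applies, score) in REGISTRY.items()
            if applies(record)}


# Toy evaluators standing in for real text-side and speech-side metrics.
register("hyp_length",
         lambda r: r.hypothesis_text is not None,
         lambda r: float(len(r.hypothesis_text.split())))
register("has_audio",
         lambda r: r.system_type == "s2st" and r.output_audio is not None,
         lambda r: 1.0)

s2tt = EvalRecord("utt1", "in.wav", "s2tt", "offline", ("en", "de"),
                  hypothesis_text="guten morgen")
print(evaluate(s2tt))  # the speech-side toy metric is skipped for S2TT
```

Keeping applicability as a predicate next to each scorer is one way to let a single runner handle S2TT and S2ST records without branching on system type.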
2. Supported Modalities, Modes, and Data Handling
OpenSTBench is engineered for flexibility across system and operational axes:
- System Types
- S2TT: Speech-to-text translation; outputs transcripts, skips speech metrics.
- S2ST: Speech-to-speech translation; supports both text and audio evaluation dimensions.
- Operating Modes
- Offline: Utterance-level input/output.
- Streaming: Incremental output with fine-grained temporal data. Partial outputs are timestamped per token (text) or segment end (audio).
- Alignment and Interface
- OpenSTBench provides a SimulEval-style server/client interface, segmenting source audio and aligning system-generated outputs temporally for latency and consistency metrics. Output tokens or segments are mapped to input segments by timestamp alignment.
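Timestamp alignment of this kind can be illustrated with a simple rule: assign each emitted token or audio segment to the source segment whose start time is the latest one not after the emission time. The rule and the function name `align_to_segments` are assumptions made for illustration; the source does not specify the exact alignment procedure.

```python
from bisect import bisect_right
from typing import List, Tuple


def align_to_segments(emit_times: List[float],
                      segments: List[Tuple[float, float]]) -> List[int]:
    """Map each emission time to a source segment index.

    `segments` holds (start, end) pairs sorted by start time. An emission
    is assigned to the last segment that has started by that time;
    emissions before the first start fall back to segment 0.
    """
    starts = [start for start, _ in segments]
    indices = []
    for t in emit_times:
        idx = bisect_right(starts, t) - 1
        indices.append(max(idx, 0))
    return indices


segments = [(0.0, 1.0), (1.0, 2.5), (2.5, 4.0)]
print(align_to_segments([0.4, 1.0, 2.6, 5.0], segments))
```

Emissions that arrive after the final source segment has ended are attributed to that final segment, which is the usual situation for trailing output in streaming systems.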
3. Evaluation Dimensions and Metric Definitions
OpenSTBench jointly evaluates a comprehensive vector of dimensions, each with established or purpose-built metrics:
| Dimension | Metrics | Applicability |
|---|---|---|
| Translation Quality | BLEU, chrF++, COMET, BLEURT | S2TT, S2ST (transcripts) |
| Speech Quality | UTMOS, CER/WER | S2ST |
| Speaker Preservation | Resemblyzer/WavLM cosine similarity | S2ST (paired samples) |
| Emotion Preservation | Emotion2Vec similarity, classification accuracy | S2ST (paired samples) |
| Paralinguistic Fidelity | Event Content F1, Event Timing F1 (CLAP-detected event matching) | S2ST (event-annotated) |
| Temporal Consistency | SLC (speech length compliance), typically reported at two tolerance thresholds | All |
| Latency | Start Offset, ATD, Custom ATD, RTF | Streaming (S2TT, S2ST) |
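BLEU is calculated as

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

with brevity penalty $\mathrm{BP}$, clipped n-gram precision $p_n$, and n-gram weights $w_n$ (uniform, $w_n = 1/N$, in the standard setting). chrF++ computes a character/word-level F-score. COMET and BLEURT are neural and BERT-based learned estimators, respectively.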
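To make the roles of the brevity penalty and clipped n-gram precision concrete, here is a minimal single-reference, sentence-level BLEU with uniform weights and no smoothing. It is a teaching sketch only; reported scores should come from an established implementation, and the function names here are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hyp: List[str], ref: List[str], max_n: int = 4) -> float:
    """Single-reference BLEU with uniform weights and no smoothing."""
    if not hyp:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # one zero precision zeroes the geometric mean
        log_sum += math.log(clipped / total) / max_n
    # Brevity penalty: only hypotheses shorter than the reference are penalized.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_sum)


ref = "the cat sat on the mat".split()
print(bleu(ref, ref))        # identical hypothesis
print(bleu(ref[:5], ref))    # truncated hypothesis, penalized only by BP
```

In the second call every n-gram precision is 1, so the score is the brevity penalty alone, exp(1 - 6/5).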
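UTMOS predicts a mean opinion score for naturalness. CER/WER use ASR transcripts of the output speech to compute the character/word error rate:

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where $S$, $D$, and $I$ are the numbers of substitutions, deletions, and insertions, and $N$ is the number of words in the reference (characters for CER).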
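The error rates reduce to an edit distance divided by the reference length. The sketch below uses a standard Levenshtein dynamic program over words (WER) or characters (CER); it omits text normalization such as casing and punctuation handling, which real pipelines apply first, and the function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate against a non-empty reference."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate against a non-empty reference."""
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


print(wer("the cat sat", "the dog sat"))
print(cer("abc", "abd"))
```

Because insertions count as errors, both rates can exceed 1 when the hypothesis is much longer than the reference.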
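Speaker similarity employs embedding cosine similarity between the source-speech embedding $\mathbf{e}_{\text{src}}$ and the output-speech embedding $\mathbf{e}_{\text{out}}$:

$$\mathrm{sim}(\mathbf{e}_{\text{src}}, \mathbf{e}_{\text{out}}) = \frac{\mathbf{e}_{\text{src}} \cdot \mathbf{e}_{\text{out}}}{\lVert \mathbf{e}_{\text{src}} \rVert \, \lVert \mathbf{e}_{\text{out}} \rVert}$$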
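Cosine similarity itself is independent of the embedding model. The sketch below operates on plain lists of floats and assumes the speaker embeddings (from Resemblyzer, WavLM, or any other encoder) have already been extracted elsewhere; the function name is illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
```

The score depends only on direction, so rescaling either embedding leaves it unchanged.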
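Emotion and Paralinguistic Event Preservation: Emotion preservation is scored with Emotion2Vec embedding similarity. Paralinguistic fidelity is scored with Event Content F1 (whether the correct event label is reproduced) and Event Timing F1 (whether the event occurs at the right time), with events localized by CLAP and voice activity detection.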
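Temporal Consistency is measured by SLC at a tolerance threshold $p$, written $\mathrm{SLC}_p$, the fraction of samples whose output-to-input duration ratio stays within $p$ of 1:

$$\mathrm{SLC}_p = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\, \left| \frac{d^{\text{out}}_i}{d^{\text{src}}_i} - 1 \right| \le p \,\right]$$

where $d^{\text{src}}_i$ and $d^{\text{out}}_i$ are the input and output speech durations of sample $i$.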
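A small sketch helps show how a length-compliance score behaves. It assumes the commonly used thresholded reading of SLC: a sample is compliant when its output-to-input duration ratio lies within a tolerance `p` of 1, and the score is the compliant fraction. The function name and the example durations are illustrative.

```python
from typing import Sequence


def slc(src_durations: Sequence[float],
        out_durations: Sequence[float],
        p: float) -> float:
    """Fraction of samples with |out / src - 1| <= p (assumed definition)."""
    if len(src_durations) != len(out_durations) or not src_durations:
        raise ValueError("need equally sized, non-empty duration lists")
    compliant = 0
    for src, out in zip(src_durations, out_durations):
        if src <= 0:
            raise ValueError("source durations must be positive")
        if abs(out / src - 1.0) <= p:
            compliant += 1
    return compliant / len(src_durations)


src = [10.0, 10.0, 10.0, 10.0]          # input durations in seconds
out = [10.0, 11.5, 13.0, 7.0]           # output durations in seconds
print(slc(src, out, 0.2))               # tighter tolerance
print(slc(src, out, 0.4))               # looser tolerance
```

A looser tolerance can only keep or raise the score, which is why reporting SLC at more than one threshold gives a fuller picture of how far outputs drift from the input length.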
Latency: Start Offset, ATD (average token delay), Custom ATD, and RTF (processing time divided by input duration).
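Two of these latency quantities are simple enough to sketch directly. RTF follows the definition given above. Start Offset is taken here, as an assumption, to be the time of the first emitted output measured from the start of the source audio. ATD and Custom ATD are omitted because their definitions are not spelled out in this section.

```python
from typing import Sequence


def real_time_factor(processing_seconds: float, input_seconds: float) -> float:
    """Processing time divided by input audio duration."""
    if input_seconds <= 0:
        raise ValueError("input duration must be positive")
    return processing_seconds / input_seconds


def start_offset(emit_times: Sequence[float], source_start: float = 0.0) -> float:
    """Delay between the source start and the first emitted output (assumed)."""
    if not emit_times:
        raise ValueError("no outputs were emitted")
    return min(emit_times) - source_start


print(real_time_factor(3.0, 10.0))        # faster than real time
print(start_offset([2.3, 3.0, 4.1]))      # first output after 2.3 s
```

An RTF below 1 means the system processes audio faster than it arrives; a value above 1 means it cannot keep up with a live stream.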
4. Protocol for Aggregation, Normalization, and Reporting
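All per-metric scores are aggregated and stored in JSON (per-system, per-direction) for transparency. For visual joint reporting (e.g., radar plots), each metric value $x$ is normalized via uniform linear scaling to $[0, 1]$ using fixed bounds $x_{\min}$ and $x_{\max}$ for that metric:
- If larger is better: $\tilde{x} = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$
- If smaller is better: $\tilde{x} = \dfrac{x_{\max} - x}{x_{\max} - x_{\min}}$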
Multi-metric axes are averaged post-normalization (e.g., translation quality as mean of BLEU, chrF++, COMET, BLEURT). Reports can be further customized via user-defined weights to produce a composite scalar for system ranking.
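The normalize, average, and weight steps can be sketched as follows. The bounds and example scores are made up for illustration, the helper names are hypothetical, and clipping out-of-range values to the bounds is an added assumption rather than something the text specifies.

```python
from typing import Dict, Sequence


def normalize(x: float, lo: float, hi: float, higher_is_better: bool = True) -> float:
    """Linearly scale x from the fixed range [lo, hi] to [0, 1]."""
    if hi <= lo:
        raise ValueError("upper bound must exceed lower bound")
    clipped = min(max(x, lo), hi)          # assumption: clip outliers
    scaled = (clipped - lo) / (hi - lo)
    return scaled if higher_is_better else 1.0 - scaled


def axis_score(normalized_values: Sequence[float]) -> float:
    """Average several normalized metrics into one axis."""
    return sum(normalized_values) / len(normalized_values)


def composite(axes: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of axis scores for a single ranking scalar."""
    total_weight = sum(weights[name] for name in axes)
    return sum(axes[name] * weights[name] for name in axes) / total_weight


# Illustrative numbers only.
translation = axis_score([normalize(25.0, 0.0, 100.0),      # a BLEU-like score
                          normalize(0.80, 0.0, 1.0)])       # a COMET-like score
latency = normalize(0.3, 0.0, 2.0, higher_is_better=False)  # an RTF-like score
print(composite({"translation": translation, "latency": latency},
                {"translation": 2.0, "latency": 1.0}))
```

Normalizing before averaging matters because the raw metrics live on different scales; without it, the metric with the largest numeric range would dominate the axis.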
5. Datasets, Usage Workflow, and Reproducibility
OpenSTBench is distributed with code components for all stages:
- openstbench/core/: eval_record.py (EvalRecord definition), runner.py (workflow execution).
- openstbench/evaluator/: Modular wrappers for each metric in text_metrics.py, speech_metrics.py, speaker_metrics.py, emotion_metrics.py, event_metrics.py, temporal_metrics.py.
- openstbench/configs/: Editable YAML configuration specifying systems, languages, and metrics.
- openstbench/datasets/: Loaders for public datasets: MSLT (translation/latency), LibriTTS-paired (speaker), RAVDESS and MCAE-SPPS (emotion), NonverbalTTS and SynParaSpeech (paralinguistic fidelity).
Experiments employ sampled and preprocessed subsets:
- MSLT dev: 1,000 samples/direction for translation, speech, temporal, and latency evaluation.
- LibriTTS-paired: 300 samples for speaker preservation.
- RAVDESS, MCAE-SPPS: 1,440/1,029 for emotion.
- NonverbalTTS, SynParaSpeech: 359/500 for event fidelity.
Running a full evaluation requires a YAML config describing the system (mode, system_type, language pair, paths to outputs and source data). The main runner.py script executes the pipeline; preprocessing enforces 16 kHz WAV audio and aligns event tags via VAD and CLAP.
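Before launching a run, it can help to check the two preconditions named above: that the configuration carries the required fields and that audio is 16 kHz WAV. The sketch below is standalone Python using only the standard library. The key names in `REQUIRED_KEYS` are hypothetical, since the actual config schema is not given here, and the config is shown as an already-loaded dictionary rather than parsed from YAML.

```python
import wave
from typing import Dict, List

# Hypothetical key names; the real schema may differ.
REQUIRED_KEYS = ("mode", "system_type", "lang_pair", "output_dir", "source_dir")


def missing_keys(config: Dict[str, str]) -> List[str]:
    """Return required keys that are absent from the config."""
    return [key for key in REQUIRED_KEYS if key not in config]


def is_16khz_wav(path: str) -> bool:
    """True if the file is a readable WAV sampled at 16 kHz."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == 16000


config = {"mode": "streaming", "system_type": "s2st", "lang_pair": "en-zh"}
print(missing_keys(config))  # lists the path entries still to be filled in
```

Files that fail the sample-rate check would need resampling before evaluation; `wave.open` raises `wave.Error` for files that are not valid WAV.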
6. Empirical Findings Across Evaluated Systems
Cross-system experiments yield key insights:
- Translation vs. Speech Quality: Qwen3-LiveTranslate achieves top BLEU/COMET but is matched or outperformed in UTMOS and CER/WER by Doubao AST 2.0 and UniSS. Speech transcription fidelity (CER/WER) and perceived naturalness (UTMOS) are not consistently correlated.
- Paralinguistic Event Preservation: All systems achieve low event content F1 (<0.15) and event timing F1 (<0.09), indicating that acoustic event information is rarely preserved.
- Temporal Consistency and Latency: UniSS produces near-perfect SLC, but at a higher RTF (1.54 vs. 0.30 for SeamlessM4T). In streaming settings, Doubao achieves the lowest Start Offset (~2.3 s), Qwen3 yields the lowest Custom ATD (3.45 s), and GPTRT provides the best SLC (0.64).
- System Trade-offs: No evaluated system dominates all metric dimensions, underscoring the necessity for application-specific trade-off analysis.
7. Protocol Recommendations and Extensibility
Best practices recommended by OpenSTBench include:
- Report metrics from all relevant dimensions—translation quality, speech quality, temporal consistency—rather than relying solely on BLEU.
- Use same-language, same-speaker anchors for speaker preservation comparisons; avoid cross-language speaker evaluations.
- Normalize scores into fixed ranges before aggregating for system comparison.
- Distinguish between “streaming S2ST” and “streaming-input S2TT,” as only the former produces incremental audio output.
- Release system outputs under the shared EvalRecord schema for compatibility and plug-and-play evaluation.
- Validate automatic speech-side metrics, such as UTMOS and speaker similarity, with human judgments where feasible.
The open API and modular architecture of OpenSTBench enable straightforward extension to new language pairs, system paradigms, and metrics, supporting reproducible and standardized evaluation across the evolving landscape of speech translation research (An et al., 29 May 2026).