
OpenSTBench: Unified Speech Translation Eval

Updated 4 June 2026
  • OpenSTBench is an open-source Python-based framework that unifies the multi-dimensional evaluation of heterogeneous speech translation outputs.
  • It standardizes assessments across both speech-to-text and speech-to-speech systems using metrics like BLEU, UTMOS, and latency measures for comprehensive analysis.
  • The framework supports reproducible experiments through a unified EvalRecord schema, configurable reporting, and detailed visualizations.

OpenSTBench is an open-source, Python-based multidimensional evaluation framework designed to unify and standardize the assessment of heterogeneous speech translation outputs. Using a common evaluation format, OpenSTBench jointly evaluates speech-to-text translation (S2TT) and speech-to-speech translation (S2ST) systems in both offline and streaming modes. The framework covers translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency, which supports application-oriented comparison of modern speech translation systems (An et al., 29 May 2026).

1. Architecture and Data Flow

The OpenSTBench architecture decomposes into four primary stages:

  1. Input Processing and Normalization: A system adapter ingests raw system outputs, including text transcripts, audio files, and timing signals. Speech outputs can be transcribed with an automatic speech recognition (ASR) system such as Whisper. Streaming outputs carry fine-grained timestamps, whereas offline outputs are assumed to have utterance-level timing only.
  2. Shared Evaluation-Record Conversion: All processed data are organized into a unified Python object, EvalRecord, with fields:
    • id, source_audio, source_segments (with timestamps), source_text
    • hyp_text (for S2TT and transcribed S2ST), hyp_audio (for S2ST)
    • hyp_timestamps (for streaming), and system-level metadata
  3. Metric Computation: A registry of Evaluator modules operates over EvalRecord instances, selectively applying metrics based on system type:
    • Text: BLEU, chrF++, COMET, BLEURT
    • Speech: UTMOS, CER/WER, speaker similarity, emotion similarity, event-detection F1
    • Temporal: Speech Length Compliance (SLC), latency metrics, Real-Time Factor (RTF)
  4. Result Aggregation and Reporting: Sample-level scores are accumulated and averaged. Output is generated as user-configurable JSON/CSV summaries, with optional radar-plot visualization.

Heterogeneous output unification is achieved by converting all system outputs into (hyp_text, hyp_audio, hyp_timestamps) representations, with fields left empty or skipped as appropriate for S2TT or S2ST modes (An et al., 29 May 2026).
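
The record-and-dispatch idea described above can be sketched as follows. This is a minimal illustration only: the field names follow the list in this section, but the dataclass layout, the `applicable_metric_groups` helper, and the group names are assumptions made for the example and are not taken from the OpenSTBench codebase.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalRecord:
    """Illustrative stand-in for the shared record (not the project's actual class)."""
    id: str
    source_audio: str
    source_text: str
    source_segments: list = field(default_factory=list)  # (start_s, end_s) pairs
    hyp_text: Optional[str] = None        # S2TT output or ASR transcript of S2ST output
    hyp_audio: Optional[str] = None       # path to synthesized speech (S2ST only)
    hyp_timestamps: Optional[list] = None  # emission times in seconds (streaming only)
    metadata: dict = field(default_factory=dict)


def applicable_metric_groups(record: EvalRecord) -> list:
    """Hypothetical helper: pick metric groups from whichever fields are filled."""
    groups = []
    if record.hyp_text is not None:
        groups.append("text")
    if record.hyp_audio is not None:
        groups.append("speech")
    if record.hyp_timestamps:
        groups.append("latency")
    return groups


# An offline S2TT record has text only; a streaming S2ST record fills all three.
s2tt = EvalRecord(id="utt-001", source_audio="src/utt-001.wav",
                  source_text="hello", hyp_text="bonjour")
s2st_stream = EvalRecord(id="utt-002", source_audio="src/utt-002.wav",
                         source_text="hello", hyp_text="bonjour",
                         hyp_audio="hyp/utt-002.wav", hyp_timestamps=[0.8, 1.4])

print(applicable_metric_groups(s2tt))         # ['text']
print(applicable_metric_groups(s2st_stream))  # ['text', 'speech', 'latency']
```

Leaving unused fields as `None` rather than defining separate record types is what lets a single evaluator registry run over S2TT and S2ST outputs alike.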

2. Supported Modalities and Operational Settings

OpenSTBench supports:

  • System Types:
    • Speech-to-Text Translation (S2TT)
    • Speech-to-Speech Translation (S2ST)
  • Operating Modes:
    • Offline mode (full-utterance input and output)
    • Streaming mode (incremental input with partial outputs)
  • Streaming Alignment:
    • Partial text outputs are timestamped per token; partial audio outputs use segment-end timestamps.
    • A SimulEval-style server/client interface feeds audio chunks and records output times.
    • Alignment of hypothesis text tokens to input segments is performed by matching token timestamps with audio chunk boundaries.

This structure enables systematic comparison of systems with divergent output modalities, temporal behaviors, and interface protocols.
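
As a rough illustration of the timestamp-to-chunk matching described under Streaming Alignment, the sketch below assigns each emitted token to the input chunk that was being fed when the token appeared. The function name and the tie-breaking rule (a token emitted exactly at a chunk boundary belongs to the chunk that just ended) are assumptions for the example; the framework's own alignment logic may differ.

```python
from bisect import bisect_left


def align_tokens_to_chunks(token_times, chunk_ends):
    """Map each token emission time to an input chunk index.

    chunk_ends must be the sorted end times (seconds) of the input audio chunks.
    A token emitted at time t is assigned to the first chunk whose end time is
    >= t; tokens emitted after the last chunk are assigned to the last chunk.
    """
    last = len(chunk_ends) - 1
    return [min(bisect_left(chunk_ends, t), last) for t in token_times]


chunk_ends = [0.5, 1.0, 1.5, 2.0]   # four 0.5 s input chunks
token_times = [0.7, 1.0, 1.8, 2.6]  # the last token arrives after the input ended

print(align_tokens_to_chunks(token_times, chunk_ends))  # [1, 1, 3, 3]
```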

3. Evaluation Dimensions and Metrics

OpenSTBench implements a broad suite of metrics spanning the following dimensions:

| Dimension | Metrics | Applicable System Types |
| --- | --- | --- |
| Translation Quality | BLEU, chrF++, COMET, BLEURT | S2TT, S2ST |
| Speech Quality | UTMOS, CER/WER | S2ST |
| Speaker Preservation | Cosine sim. (Resemblyzer/WavLM) | S2ST |
| Emotion Preservation | Emotion2Vec sim., classification accuracy | S2ST |
| Paralinguistic Fidelity | Event Content F1, Event Timing F1 (CLAP) | S2ST |
| Temporal Consistency | SLC_τ (Speech Length Compliance) | S2TT, S2ST |
| Latency | Start Offset, ATD, Custom ATD, RTF | S2TT-stream, S2ST |

Key metric formulas:

  • BLEU:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right)$$

  • CER/WER:

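$$\mathrm{WER} = \frac{S + D + I}{N} \times 100\%$$

Here S, D, and I are the numbers of substituted, deleted, and inserted words, and N is the number of words in the reference. CER applies the same formula at the character level.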

  • Speaker similarity:

$$\mathrm{sim}(x, y) = \frac{\langle E(x), E(y) \rangle}{\|E(x)\|\,\|E(y)\|}$$

  • Speech Length Compliance:

$$\mathrm{SLC}_\tau = \frac{1}{N_{\text{samples}}}\sum_{i=1}^{N_{\text{samples}}} \mathbf{1}_{[1-\tau,\,1+\tau]}(\mathrm{ratio}_i)$$

  • ATD (Average Token Delay):

$$\mathrm{ATD} = \frac{1}{T}\sum_{t=1}^{T} (\tau_t - t)$$

Metrics are selectively run based on the output type, with speech-side metrics skipped for S2TT and certain temporal metrics omitted for non-streaming S2ST.
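
The formulas for BLEU, WER, speaker similarity, and SLC can be illustrated with a small pure-Python sketch. It is not the framework's implementation. Several details are assumptions because the text above does not state them: uniform BLEU weights w_n = 1/4 and the standard brevity penalty, whitespace tokenization for WER, plain lists of floats standing in for the embeddings E(x) and E(y), and a length ratio supplied by the caller for SLC. All function names are hypothetical.

```python
import math


def bleu_from_precisions(precisions, hyp_len, ref_len):
    """BP * exp(sum_n w_n log p_n) with uniform weights (assumed)."""
    if hyp_len == 0 or min(precisions) == 0:
        return 0.0
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(sum(math.log(p) / len(precisions) for p in precisions))


def word_error_rate(reference, hypothesis):
    """(S + D + I) / N * 100, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def cosine_similarity(u, v):
    """<u, v> / (||u|| ||v||) for two embedding vectors of equal length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("embeddings must be non-zero")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def speech_length_compliance(ratios, tau):
    """Fraction of samples whose length ratio lies in [1 - tau, 1 + tau]."""
    return sum(1 for r in ratios if 1 - tau <= r <= 1 + tau) / len(ratios)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
# Two of four ratios fall inside [0.6, 1.4].
print(speech_length_compliance([0.9, 1.3, 1.5, 0.55], tau=0.4))  # 0.5
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))                 # 0.0
```

In practice the text metrics would come from an established scorer and the embeddings from a speaker encoder; the sketch only makes the arithmetic behind each formula concrete.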

4. Reporting Protocol and Normalization

All metric scores, kept in their original units, are serialized per system and per language direction to JSON files. For visualization (radar plots), scores are normalized against a fixed range $[a, b]$: for metrics where larger is better, $s(x)=\mathrm{clip}((x-a)/(b-a),0,1)$; for metrics where smaller is better, $s(x)=\mathrm{clip}((b-x)/(b-a),0,1)$. Axes tied to multiple metrics (e.g., translation quality spanning BLEU, chrF++, COMET, BLEURT) are averaged post-normalization. Final multi-dimensional comparisons may incorporate user-defined weights to yield a single scalar, though all primary raw and normalized results are retained for inspection.
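
The normalization and aggregation steps can be sketched as below. The clipping formulas follow the text; the specific ranges (0 to 100 for BLEU and chrF++, 0 to 2 for RTF), the example scores, and the equal weights are illustrative assumptions and are not the ranges OpenSTBench ships with.

```python
def normalize(x, a, b, higher_is_better=True):
    """Fixed-range normalization to [0, 1] with clipping."""
    raw = (x - a) / (b - a) if higher_is_better else (b - x) / (b - a)
    return min(1.0, max(0.0, raw))


def axis_score(values, ranges, higher_is_better):
    """Average the normalized scores of all metrics that share one radar axis."""
    scores = [normalize(values[m], *ranges[m], higher_is_better=higher_is_better[m])
              for m in values]
    return sum(scores) / len(scores)


# Hypothetical ranges and raw scores, chosen only for the example.
ranges = {"BLEU": (0.0, 100.0), "chrF++": (0.0, 100.0), "RTF": (0.0, 2.0)}
direction = {"BLEU": True, "chrF++": True, "RTF": False}

translation = axis_score({"BLEU": 25.0, "chrF++": 50.0}, ranges, direction)
latency = axis_score({"RTF": 0.5}, ranges, direction)

# Optional user-defined weights collapse the axes into a single scalar.
weights = {"translation": 0.5, "latency": 0.5}
overall = weights["translation"] * translation + weights["latency"] * latency

print(translation, latency, overall)  # 0.375 0.75 0.5625
```

Because the ranges are fixed in advance rather than derived from the systems under test, adding or removing a system does not change any other system's normalized score.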

5. Reproducibility, Usage, and Dataset Integration

The open-source codebase (https://github.com/sjtuayj/OpenSTBench) is structured for extensibility and reproducibility:

  • Code Organization:
    • openstbench/core/: Orchestration and EvalRecord
    • openstbench/evaluator/: Metric wrappers for text, speech, speaker, emotion, paralinguistic, and temporal evaluators
    • openstbench/configs/: System configurations
    • openstbench/datasets/: Loaders for reference datasets (MSLT, LibriTTS_paired, RAVDESS, MCAE-SPPS, NonverbalTTS, SynParaSpeech)
  • Usage Example:

python -m openstbench.runner \
  --config configs/my_system.yaml \
  --output_dir ./results/my_system

The configuration YAML specifies modality, system type, language pair, output data locations, and dataset paths.

  • Included Datasets and Splits:
    • MSLT dev: translation quality, speech quality, temporal consistency, latency (1000 samples per direction)
    • LibriTTS_paired: speaker preservation (300 samples)
    • RAVDESS, MCAE-SPPS: emotion preservation (1440/1029 samples)
    • NonverbalTTS, SynParaSpeech: paralinguistic fidelity (359/500 samples)
    • Preprocessing: audio is resampled to 16 kHz WAV, with fixed-size subsampling and anchor/label generation as needed.
  • Output Structure: Evaluations generate per-sample and aggregate metric arrays, exported for downstream statistical analysis or visualization.

6. Experimental Findings

Evaluation of representative speech translation systems with OpenSTBench demonstrates differentiated system performance across evaluation axes:

  • Translation vs. Speech Quality: Systems like Qwen3-LiveTranslate achieve peak scores on BLEU and COMET, while Doubao AST 2.0 and UniSS excel on UTMOS and CER/WER, indicating that high translation quality does not imply superior speech quality. Notably, lower CER/WER does not always correspond to higher subjective naturalness (UTMOS).
  • Paralinguistic Fidelity: All tested systems achieve low Event Content F1 (<0.15) and Event Timing F1 (<0.09), highlighting widespread challenges in preserving acoustic events through translation.
  • Temporal Consistency and Latency: UniSS achieves near-perfect SLC (>0.99) but at the cost of higher RTF (1.54), while SeamlessM4T is faster (RTF=0.30) but less temporally consistent. In streaming, Doubao exhibits the lowest Start Offset (≈2.3s), Qwen3 achieves the lowest Custom ATD (3.45s), and GPTRT demonstrates the best SLC_0.4 (0.64). No system outperforms others on all axes, indicating inherent trade-offs dependent on application context.

7. Best Practices and Practical Recommendations

OpenSTBench provides a set of recommendations for robust, reproducible speech translation evaluation:

  • Report multiple evaluation dimensions (translation, speech, temporal metrics), not BLEU alone.
  • Speaker preservation should utilize same-language, same-speaker anchors; cross-language comparison is discouraged.
  • System comparisons must use fixed-range normalization for heterogeneous outputs.
  • Distinguish streaming S2ST from streaming-input S2TT in reporting, as streaming S2TT systems only emit final transcripts.
  • System outputs should be released in the standardized EvalRecord schema to facilitate plug-and-play benchmarking.
  • Validate automatic speech-side metrics (e.g., UTMOS, speaker similarity) through periodic human annotation where feasible.

OpenSTBench's unified, extensible design and open API facilitate addition of new language pairs, metrics, or system types, ensuring results are fully reproducible and supporting its adoption as a standard for end-to-end speech translation evaluation (An et al., 29 May 2026).

References (1)
