CLASI: Cross-Language Real-Time Interpretation
- CLASI is a modular, end-to-end AI framework for real-time, high-fidelity multilingual interpretation, integrating ASR, segmentation, translation, and TTS modules.
- It leverages advanced techniques such as DPO-tuned segmentation, context-aware translation, and adaptive policy management to ensure minimal latency and maximal semantic fidelity.
- On metrics such as VIP, AL, and BLEU, CLASI outperforms existing systems while keeping end-to-end latency under 3 seconds in high-stakes, live multilingual communication.
A Cross Language Agent for Simultaneous Interpretation (CLASI) is a modular, end-to-end artificial intelligence system that performs high-fidelity, real-time translation of live spoken language across multiple languages, emulating and extending the cognitive and operational strategies of professional human interpreters. Modern CLASI systems integrate advances in speech recognition, linguistic segmentation, context-aware sequence modeling, adaptive translation policies, human-aligned chunking, and ultra-low-latency speech synthesis, often leveraging LLMs, preference tuning, and domain-adaptive retrieval to balance semantic fidelity against practical latency. CLASI targets deployment in environments such as conferences, large-scale expositions, live streaming, and high-stakes multilingual communications, with evaluation focused on metrics like Valid Information Proportion (VIP), Average Lagging (AL), and human-centered usability thresholds (Cheng et al., 2024, Cheng et al., 23 Jul 2025, Yang et al., 14 Oct 2025, Paul et al., 1 May 2026, Zhang et al., 16 Jan 2026).
1. System Architecture and Workflow
CLASI systems are highly modular and designed for streaming, ultra-low-latency operation. The canonical architecture comprises the following components:
- Front-end ASR: Processes incoming speech in real time, emitting partial transcripts with word-level timestamps under strict latency requirements (often <200 ms per segment) (Fantinuoli et al., 2022, Paul et al., 1 May 2026).
- Segmentation Module: Dynamically segments the text or audio stream into semantically coherent “chunks” or Information Units (IUs), using human-aligned segmentation policies implemented via LLMs with preference tuning (e.g., Direct Preference Optimization, DPO) (Yang et al., 14 Oct 2025, Paul et al., 1 May 2026, Xiong et al., 2019).
- Context-Aware Translation Engine: Performs chunk-wise or prefix-to-prefix neural machine translation, leveraging not only the current chunk but past context, retrieved terminology, and domain/speaker tags. Models typically use adaptive attention mechanisms and, for highly distant language pairs, may incorporate chunk monotonicity constraints (Xiong et al., 2019, Paul et al., 1 May 2026, Doi et al., 2024).
- Post-processing & Policy Manager: Facilitates adaptive restructuring, including strategic actions such as SENTENCE_CUT, DROP, PARTIAL_SUMMARIZATION, and PRONOMINALIZATION, implemented via a decoder-only LLM trained with action-aware prompting (Zhang et al., 16 Jan 2026, Zhang et al., 26 Sep 2025).
- Speech Synthesis / Voice Cloning: Converts the generated translation into real-time target language audio, often with speaker voice cloning and prosody preservation (Cheng et al., 23 Jul 2025).
- User Interfaces & Feedback Loops: Provide visual or auditory overlays, terminology suggestions, and customizability (e.g., latency–precision sliders), feeding user/interpreter corrections back for continual adaptation (Vogler et al., 2019, Fantinuoli et al., 2022).
Data and compute considerations: Modern CLASI deployments often mix offline and SI-style (simultaneous interpretation) corpora, using dual-mode models with style tags for real-time style control (Ko et al., 2023).
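The component list above can be read as a streaming pipeline in which each stage is a replaceable function. The following is a minimal sketch of that wiring, not any cited system's implementation: `StreamingPipeline`, `punctuation_segmenter`, and the stage signatures are hypothetical names introduced here, the segmenter is a toy punctuation rule standing in for a learned model, and ASR and TTS are assumed to sit outside the class.

```python
from typing import Callable, List, Tuple


def punctuation_segmenter(buffer: str) -> Tuple[List[str], str]:
    """Toy stand-in for the segmentation module: cut after '.', '?' or '!'.

    Returns the completed chunks and the unconsumed remainder of the buffer.
    """
    chunks, start = [], 0
    for i, ch in enumerate(buffer):
        if ch in ".?!":
            chunks.append(buffer[start:i + 1].strip())
            start = i + 1
    return chunks, buffer[start:]


class StreamingPipeline:
    """Hypothetical wiring of segmentation, translation and post-processing."""

    def __init__(
        self,
        segment: Callable[[str], Tuple[List[str], str]],
        translate: Callable[[str, List[str]], str],
        postprocess: Callable[[str], str],
        context_size: int = 2,
    ) -> None:
        self.segment = segment
        self.translate = translate
        self.postprocess = postprocess
        self.context_size = context_size
        self.buffer = ""               # partial transcript not yet segmented
        self.context: List[str] = []   # most recent source chunks

    def feed(self, partial_text: str) -> List[str]:
        """Accept new ASR text and return any translations that are ready."""
        self.buffer += partial_text
        chunks, self.buffer = self.segment(self.buffer)
        outputs = []
        for chunk in chunks:
            if not chunk:
                continue
            translated = self.translate(chunk, list(self.context))
            outputs.append(self.postprocess(translated))
            # Keep only the last `context_size` chunks as translation context.
            self.context = (self.context + [chunk])[-self.context_size:]
        return outputs


# Usage with placeholder stages (upper-casing stands in for translation).
pipeline = StreamingPipeline(
    punctuation_segmenter,
    lambda chunk, ctx: f"[{len(ctx)}] {chunk.upper()}",
    str.strip,
)
print(pipeline.feed("Hello world. How "))  # ['[0] HELLO WORLD.']
print(pipeline.feed("are you? Fine"))      # ['[1] HOW ARE YOU?']
```

Keeping each stage behind a plain callable means a heuristic segmenter can later be swapped for a learned one without touching the buffering logic.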
2. Segmentation and Chunking Strategies
Online segmentation is critical to the latency–quality trade-off. Recent CLASI systems employ preference-aligned segmentation models trained with DPO, surpassing earlier heuristic or supervised methods (e.g., SHAS, VAD, or fixed-length chunking) (Yang et al., 14 Oct 2025). The Qwen2.5-Omni-3B LLM, fine-tuned with segment pairs ranked by BLEU and latency, predicts semantically optimal breakpoints via a sliding window over acoustic features. Key findings include:
- Segmentation accuracy (SegAcc) improves by 4–5% over SHAS; BLEU gains are 1.5–2 points, with sub-100 ms impact on latency.
- Human-aligned segmentation reduces both over- and under-segmentation, keeps the interval between consecutive cuts at 500 ms or more, and adapts thresholds per language pair (e.g., En→Ja requires delayed cuts to preserve clause integrity).
During deployment, chunking decisions are continuously updated based on real-time feedback, and overlapping context is maintained to prevent information loss at chunk boundaries (Yang et al., 14 Oct 2025, Xiong et al., 2019).
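The two constraints described above, a minimum interval between cuts and overlapping context across chunk boundaries, can be sketched as follows. This is an illustration under stated assumptions: `boundary_scores` is a hypothetical per-word score standing in for the output of a preference-tuned segmentation model, and the function names, the `threshold` default, and the `overlap` size are introduced here rather than taken from the cited work. Only the 500 ms minimum gap comes from the text.

```python
from typing import List, Tuple


def select_cuts(
    word_end_ms: List[int],
    boundary_scores: List[float],
    threshold: float = 0.5,
    min_gap_ms: int = 500,
) -> List[int]:
    """Return indices of words after which a cut is placed.

    A cut is accepted only if the (hypothetical) model score reaches the
    threshold AND at least `min_gap_ms` has passed since the previous cut.
    """
    cuts = []
    last_cut_ms = 0
    for i, (end_ms, score) in enumerate(zip(word_end_ms, boundary_scores)):
        if score >= threshold and end_ms - last_cut_ms >= min_gap_ms:
            cuts.append(i)
            last_cut_ms = end_ms
    return cuts


def chunks_with_overlap(
    words: List[str], cuts: List[int], overlap: int = 2
) -> List[Tuple[List[str], List[str]]]:
    """Pair each chunk with the last `overlap` words that preceded it."""
    result = []
    start = 0
    for cut in cuts:
        context = words[max(0, start - overlap):start]
        result.append((context, words[start:cut + 1]))
        start = cut + 1
    return result


words = ["so", "today", "we", "talk", "about", "speech", "translation", "systems"]
ends = [200, 450, 600, 900, 1100, 1500, 2100, 2600]
scores = [0.1, 0.7, 0.2, 0.8, 0.9, 0.1, 0.95, 0.6]

cuts = select_cuts(ends, scores)
print(cuts)  # [3, 6, 7]
for context, chunk in chunks_with_overlap(words, cuts):
    print(context, chunk)
```

In this toy run the high-scoring boundaries after "today" and "about" are rejected because they fall less than 500 ms after the stream start or the previous cut. A per-language-pair `threshold` would be one simple way to delay cuts for pairs such as En→Ja.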
Representative comparison of segmentation methods:
| Method | SegAcc (%) | BLEU | AL (ms) |
|---|---|---|---|
| Fixed-length | 75–76 | 16–18 | ≈3,000 |
| SHAS | 78–83 | 17–23 | ≈3,100 |
| DPO-tuned LLM | 83–88 | 18–25 | ≈3,100 |
Note: Ranges span En→De, En→Ja, and En→Zh on the ACL 60/60 corpus (Yang et al., 14 Oct 2025).
3. Translation and Policy Adaptation
Modern CLASI translation modules implement either end-to-end chunk-wise translation or adaptive policy networks incorporating both READ/WRITE actions and human-derived restructuring operations. The policy inventory typically comprises:
- READ/WRITE: Classic prefix-to-prefix streaming paradigm.
- SENTENCE_CUT: Enforces strategic segmentation mirroring the "salami technique," reducing reordering and latency.
- DROP: Omits filler or redundant material without losing information; BLEU and latency gains are reported empirically for this action.
- PARTIAL_SUMMARIZATION: Merges repetitive or overlapping clauses.
- PRONOMINALIZATION: Compresses repeated referents to pronouns (Zhang et al., 16 Jan 2026, Zhang et al., 26 Sep 2025).
Training leverages action-annotated data, often produced via action-aware prompting of an LLM (e.g., GPT-4o), and fine-tunes the model with a cross-entropy loss over sequences of actions and target tokens. Latency–quality trade-offs are estimated per action (e.g., for DROP, AL ≈ 0.85 s and BLEU ≈ 59) to inform adaptive runtime decisions (Zhang et al., 16 Jan 2026).
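The fine-tuning objective described above can be written as a standard token-level cross-entropy. The formulation below is a sketch that assumes each training target is serialized as a single interleaved sequence $z = (z_1, \dots, z_T)$ of action tags and target-language tokens, conditioned on the source input $x$. The symbols $x$, $z$, $\theta$, $\mathcal{A}$ (action inventory), and $\mathcal{V}$ (target vocabulary) are notation introduced here, not taken from the cited work.

```latex
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_\theta\!\left(z_t \mid z_{<t},\, x\right),
\qquad z_t \in \mathcal{A} \cup \mathcal{V}
```

Under this assumption, action tags such as SENTENCE_CUT or DROP are predicted by the same decoder and the same loss as ordinary target tokens, which is consistent with the decoder-only setup described in Section 1.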
4. Context Awareness, Retrieval, and Error Mitigation
CLASI modules employ contextual embeddings spanning past and current chunks, speaker/scene/domain tags, as well as multi-modal retrieval for in-domain terminology (Paul et al., 1 May 2026, Cheng et al., 2024). The “multi-modal retrieving module” outputs top-k translation candidates for rare or OOV terms via cross-attention between representations of the current audio/text and a key-value terminology database.
- Retrieval-augmented generation achieves a top-10 recall of 91.3% for rare terminology; removing the retrieval module lowers VIP by roughly 10 points (Cheng et al., 2024).
- Error-tolerant decoding is achieved by fusing retrieved information and historical context during LLM decoding, enabling robust handling of ASR errors or translational ambiguity.
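The top-k lookup against a key-value terminology database can be illustrated with a much simpler scoring rule. The sketch below swaps the cross-attention described above for plain cosine similarity over fixed vectors, so it shows only the retrieval interface, not the cited module. The function names, the three-dimensional toy embeddings, and the database entries are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Each entry: (key embedding, source term, target term). Toy data only.
TermEntry = Tuple[Sequence[float], str, str]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float], term_db: List[TermEntry], k: int = 2
) -> List[Tuple[str, str]]:
    """Return the k (source term, target term) pairs most similar to the query."""
    scored = [(cosine(query, key), src, tgt) for key, src, tgt in term_db]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(src, tgt) for _, src, tgt in scored[:k]]


term_db = [
    ([1.0, 0.0, 0.0], "beam search", "Strahlsuche"),
    ([0.0, 1.0, 0.0], "latency", "Latenz"),
    ([0.9, 0.1, 0.0], "search tree", "Suchbaum"),
]
print(retrieve_top_k([1.0, 0.05, 0.0], term_db, k=2))
# [('beam search', 'Strahlsuche'), ('search tree', 'Suchbaum')]
```

The retrieved pairs would then be supplied to the translation model as candidate renderings, alongside the historical context, rather than substituted into the output directly.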
Some CLASI frameworks further employ tree-based beam search to maintain multiple potential target streams, committing partial outputs only when a confidence threshold is met, reducing both perceived lag and error propagation (Iida et al., 2024).
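One way to picture confidence-gated commitment is to release only the prefix on which the competing hypotheses sufficiently agree. The sketch below assumes a flat list of scored hypotheses rather than an explicit tree, and it is not the algorithm from the cited work. `committable_prefix`, the probability-mass criterion, and the 0.8 threshold are illustrative choices introduced here.

```python
from typing import List, Tuple


def committable_prefix(
    hypotheses: List[Tuple[List[str], float]], threshold: float = 0.8
) -> List[str]:
    """Return the longest prefix whose next token always carries
    at least `threshold` of the total hypothesis probability.

    `hypotheses` is a list of (tokens, probability) pairs.
    """
    committed: List[str] = []
    position = 0
    while True:
        # Sum probability per candidate token at this position, counting
        # only hypotheses that agree with everything committed so far.
        mass = {}
        for tokens, prob in hypotheses:
            if len(tokens) > position and tokens[:position] == committed:
                token = tokens[position]
                mass[token] = mass.get(token, 0.0) + prob
        if not mass:
            break
        token, best = max(mass.items(), key=lambda kv: kv[1])
        if best < threshold:
            break
        committed.append(token)
        position += 1
    return committed


hypotheses = [
    (["the", "meeting", "starts", "now"], 0.50),
    (["the", "meeting", "begins", "now"], 0.35),
    (["a", "meeting", "starts"], 0.15),
]
print(committable_prefix(hypotheses))  # ['the', 'meeting']
```

Here "the meeting" is released because 85% of the probability mass agrees on it, while the verb is withheld until further input resolves "starts" versus "begins".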
5. Evaluation Metrics and Human Parity
Standard corpus metrics (BLEU, BLEURT, COMET, chrF, TER) are routinely reported, but CLASI research emphasizes human-aligned “information fidelity” metrics:
- Valid Information Proportion (VIP): Percentage of semantic units in the source faithfully conveyed in the translation, assessed fragment-by-fragment by professional interpreters (Cheng et al., 2024, Cheng et al., 23 Jul 2025).
- SVIP: VIP extended to the speech domain, including audio fidelity and fluency.
- Latency metrics: Average Lagging (AL), Length-Adaptive AL (LAAL), First-Letter Appearance Lagging (FLAL), Time-based LAAL (LAAL_sec), and Equilibrium Efficiency (EE) are reported to capture translation output timing and streaming performance (Cheng et al., 23 Jul 2025, Zhang et al., 16 Jan 2026, Xiong et al., 2019).
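For concreteness, the sketch below computes VIP as a simple proportion and Average Lagging in its standard token-level form, $\mathrm{AL} = \frac{1}{\tau}\sum_{t=1}^{\tau}\left[g(t) - \frac{t-1}{\gamma}\right]$, where $g(t)$ is the number of source tokens read before emitting target token $t$, $\gamma = |Y|/|X|$, and $\tau$ is the first step at which the whole source has been read. The function names and the boolean encoding of interpreter judgements are assumptions made here; the time-based variants reported in the cited work (AL in seconds, LAAL, FLAL) replace token counts with timestamps and are not shown.

```python
from typing import List


def valid_information_proportion(judgements: List[bool]) -> float:
    """VIP in percent: share of source semantic units judged as conveyed.

    Each entry is one unit's human judgement (True = faithfully conveyed).
    """
    if not judgements:
        raise ValueError("at least one judged unit is required")
    return 100.0 * sum(judgements) / len(judgements)


def average_lagging(delays: List[int], source_len: int) -> float:
    """Token-level Average Lagging.

    delays[t-1] = g(t), the number of source tokens read when target
    token t was emitted. The sum stops at the first step where the full
    source has been read.
    """
    if not delays or source_len <= 0:
        raise ValueError("need a non-empty hypothesis and source")
    gamma = len(delays) / source_len  # target-to-source length ratio
    total = 0.0
    for t, g in enumerate(delays, start=1):
        total += g - (t - 1) / gamma
        if g >= source_len:
            return total / t
    # Source never fully read: fall back to averaging over all steps.
    return total / len(delays)


# wait-3 policy on a 6-token source with a 6-token translation
print(average_lagging([3, 4, 5, 6, 6, 6], source_len=6))  # 3.0
# offline translation: everything is read before the first output
print(average_lagging([6, 6, 6, 6, 6, 6], source_len=6))  # 6.0
print(valid_information_proportion([True, True, False, True]))  # 75.0
```

A wait-k policy with equal source and target lengths yields an AL of exactly k tokens, which gives a quick sanity check for any implementation.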
Empirical results:
- CLASI systems achieve VIP of 81.3% (zh→en) and 78.0% (en→zh) vs. best commercial baselines at 35–42%; on extremely hard sets, CLASI maintains 70%+ VIP while all baselines score <13% VIP (Cheng et al., 2024).
- End-to-end speech-to-speech AL is routinely <3 s, matching or improving on professional interpreter latency (Cheng et al., 23 Jul 2025, Xiong et al., 2019, Fantinuoli et al., 2022).
- BLEU/COMET improvements of 2–5 points and consistent reductions in lag are documented for action-policy and DPO-tuned systems (Yang et al., 14 Oct 2025, Zhang et al., 26 Sep 2025).
6. Practical Deployment, Latency, and User Adaptation
Operational deployments require rigorous latency budget management. The sum of component latencies (audio capture, transmission, ASR, segmentation, translation, TTS, and UI rendering) must remain below interpreters' cognitive acceptability thresholds. Empirical studies show interpreters tolerate up to 3 s of end-to-end system latency without significant accuracy or fluency loss; beyond this, accuracy and usability decline rapidly (Fantinuoli et al., 2022). Real-world deployments (e.g., Expo 2025 Osaka) report median end-to-end per-chunk latencies of ≈400 ms, confirming feasibility at scale (Paul et al., 1 May 2026).
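The budget arithmetic is a straightforward sum against the 3 s threshold. The sketch below shows one way to track it; the per-component numbers are illustrative placeholders, not measurements from any cited system, and `check_budget` is a hypothetical helper.

```python
from typing import Dict, Tuple

BUDGET_MS = 3000  # 3 s end-to-end tolerance reported by Fantinuoli et al. (2022)


def check_budget(
    components_ms: Dict[str, int], budget_ms: int = BUDGET_MS
) -> Tuple[int, int, str, bool]:
    """Return (total, headroom, slowest component, within budget)."""
    total = sum(components_ms.values())
    slowest = max(components_ms, key=components_ms.get)
    return total, budget_ms - total, slowest, total <= budget_ms


# Placeholder latencies in milliseconds (assumed values for illustration).
components_ms = {
    "audio_capture": 50,
    "transmission": 80,
    "asr": 200,
    "segmentation": 100,
    "translation": 600,
    "tts": 400,
    "ui_rendering": 30,
}

total, headroom, slowest, ok = check_budget(components_ms)
print(total, headroom, slowest, ok)  # 1460 1540 translation True
```

Reporting the slowest component alongside the total makes it clear where to spend optimization effort when headroom shrinks, for example under degraded network conditions.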
CLASI user interfaces can expose live latency–precision controls, display terminology suggestions, and collect feedback for further preference tuning. Interpreter-centric metrics such as Ear–Voice Span (EVS) are monitored in real time to adapt segmentation or chunk size dynamically (Cheng et al., 2024, Vogler et al., 2019).
7. Language Pair Challenges and Future Directions
Highly divergent language pairs (e.g., English–Japanese) present distinct monotonicity and syntactic reordering challenges. Data-driven monotonic chunking algorithms, chunk-level monotonicity constraints, and optimizations that expand chunk size at syntactic boundaries (e.g., prepositional phrase, post-modifier) are necessary to maintain fidelity and minimize unnecessary repeats or omissions (Doi et al., 2024). Mixed-corpus, style-tagged models allow real-time mode switching between SI and offline output styles, offering robust solutions for diverse event scenarios (Ko et al., 2023).
Future research directions:
- Expansion to >100 languages, robust low-resource adaptation, and more advanced persona and prosody modeling.
- Integration of additional multimodal cues (visual, environmental) for disambiguation (Iida et al., 2024).
- Deployment of continually updated human–machine preference learning pipelines for segmentation and policy modules.
- Scalable architectures with hybrid edge/cloud inference to meet sub-3 s end-to-end latency constraints under variable network conditions (Paul et al., 1 May 2026, Fantinuoli et al., 2022).
References:
- (Vogler et al., 2019) Lost in Interpretation: Predicting Untranslated Terminology in Simultaneous Interpretation
- (Cheng et al., 23 Jul 2025) Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice
- (Cheng et al., 2024) Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
- (Paul et al., 1 May 2026) Language-free Experience at Expo 2025 Osaka
- (Zhang et al., 16 Jan 2026, Zhang et al., 26 Sep 2025) Redefining Machine Simultaneous Interpretation: From Incremental Translation to Human-Like Strategies
- (Ko et al., 2023) Tagged End-to-End Simultaneous Speech Translation Training using Simultaneous Interpretation Data
- (Yang et al., 14 Oct 2025) DPO-Tuned LLMs for Segmentation in Simultaneous Speech Translation
- (Xiong et al., 2019) DuTongChuan: Context-aware Translation Model for Simultaneous Interpreting
- (Doi et al., 2024) Word Order in English-Japanese Simultaneous Interpretation: Analyses and Evaluation using Chunk-wise Monotonic Translation
- (Iida et al., 2024) Predictive Simultaneous Interpretation: Harnessing LLMs for Democratizing Real-Time Multilingual Communication
- (Fantinuoli et al., 2022) Defining maximum acceptable latency of AI-enhanced CAI tools