
EmoAra: Emotion-Preserving Speech Pipeline

Updated 8 February 2026
  • EmoAra is an end-to-end pipeline for emotion-preserving cross-lingual spoken communication, integrating SER, ASR, MT, and TTS modules.
  • It employs a CNN-based SER achieving an F1-score of 0.94, Transformer models for transcription and translation, and emotion-conditioned TTS with 85% emotion congruence.
  • The system demonstrates practical application in banking by accurately conveying emotional nuances during cross-lingual communication.

EmoAra is an end-to-end pipeline designed for emotion-preserving cross-lingual spoken communication, with a deployment scenario motivated by banking customer service where the accurate conveyance of emotional nuance can significantly affect service outcomes. The pipeline integrates four major modules: Speech Emotion Recognition (SER), Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech (TTS), forming a sequential architecture that operates on English audio input to produce emotionally congruent Arabic speech output (Hassan et al., 1 Feb 2026).

1. System Architecture and Workflow

EmoAra processes input as a cascade:

  1. Speech Emotion Recognition (SER): A convolutional neural network (CNN)-based classifier is applied to the source English audio. Input features comprise zero-crossing rate (ZCR), root-mean-square energy (RMSE), and Mel-frequency cepstral coefficients (MFCCs), with data augmentation strategies including Gaussian noise, pitch shift, time-stretch, and time-shift. The CNN architecture consists of three successive 1D Conv–ReLU–BatchNorm–MaxPool–Dropout blocks, followed by dense layers, producing an 8-class softmax prediction with categorical cross-entropy loss:

L = -\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C y_{i,c} \log p_{i,c}

where y_{i,c} is the one-hot label and p_{i,c} is the predicted probability.
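As an illustration of the hand-crafted SER inputs and the loss above, a minimal NumPy sketch (function names are ours, not from the EmoAra codebase):

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent-sample pairs whose sign flips (ZCR)."""
    return float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))

def rms_energy(frame: np.ndarray) -> float:
    """Root-mean-square energy of one frame (RMSE)."""
    return float(np.sqrt(np.mean(frame ** 2)))

def categorical_cross_entropy(y_onehot: np.ndarray, p: np.ndarray) -> float:
    """L = -(1/N) * sum_i sum_c y_{i,c} * log p_{i,c}."""
    eps = 1e-12  # guard against log(0)
    return float(-np.mean(np.sum(y_onehot * np.log(p + eps), axis=1)))
```

MFCC extraction is omitted here; in practice a library such as librosa computes it from the same frames.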

  2. Automatic Speech Recognition (ASR): The Whisper Base model (74M parameters) is employed off-the-shelf for robust English transcription. The system extracts log-Mel spectrogram features, which are processed by the encoder-decoder Transformer to output an English token sequence. Decoding uses greedy or default beam search.
  3. Machine Translation (MT): The module uses MarianMT, a standard Transformer architecture (multi-head attention, encoder-decoder), fine-tuned on both general and banking-domain English–Arabic parallel datasets. Input preprocessing includes normalization, script-based filtering, and tokenization with a 128-token maximum. The translation head is trained using token-level cross-entropy:

L = -\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C y_{i,c} \log p_{i,c}

Decoding employs beam search (width 8).
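A minimal beam-search sketch over a matrix of per-step token log-probabilities. The scores here are prefix-independent for brevity; MarianMT's actual decoder rescores every hypothesis with the model at each step:

```python
import numpy as np

def beam_search(logprobs: np.ndarray, beam_width: int = 8):
    """Toy beam search over a [T, V] matrix of per-step token
    log-probabilities. Returns (best token sequence, its score)."""
    beams = [((), 0.0)]  # (token sequence, cumulative log-prob)
    for step_scores in logprobs:
        candidates = [
            (seq + (tok,), score + float(step_scores[tok]))
            for seq, score in beams
            for tok in range(len(step_scores))
        ]
        # keep the top-k highest-scoring partial hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # best full hypothesis
```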

  4. Text-to-Speech (TTS): In MMS-TTS-Ara, a Transformer-based text encoder generates latent linguistic representations, followed by a sequence generator producing a mel-spectrogram, which is then converted to a waveform via a HiFi-GAN vocoder. The TTS is conditioned on emotion embeddings from the SER, modulating prosodic attributes (pitch, energy) to reconstruct the detected emotion in the output Arabic speech. Training uses a combination of mel-reconstruction L1 loss and adversarial loss.
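The combined TTS generator objective can be sketched as below. The least-squares form of the adversarial term is an assumption on our part (the paper states only "adversarial loss"), though it matches the HiFi-GAN convention:

```python
import numpy as np

def tts_generator_loss(mel_pred, mel_target, disc_scores, lambda_adv=1.0):
    """Combined TTS training loss: L1 mel-reconstruction plus a
    least-squares adversarial term (the generator pushes the
    discriminator's scores toward 1)."""
    l1 = float(np.mean(np.abs(mel_pred - mel_target)))
    adv = float(np.mean((disc_scores - 1.0) ** 2))
    return l1 + lambda_adv * adv
```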

The full flow is: [English Audio] → [SER CNN → Emotion Label] → [Whisper ASR → English Transcript] → [MarianMT → Arabic Text] → [MMS-TTS-Ara (Emotion-conditioned) → Arabic Speech].
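The cascade above can be sketched as a composition of injectable stages, so the real models (SER CNN, Whisper, MarianMT, MMS-TTS-Ara) can be swapped in. Stage signatures and the `EmoAraResult` container are hypothetical, not from the released code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EmoAraResult:
    emotion: str
    transcript: str
    translation: str
    arabic_audio: bytes

def run_pipeline(audio, ser: Callable, asr: Callable,
                 mt: Callable, tts: Callable) -> EmoAraResult:
    """Sequential EmoAra cascade; the SER label conditions the TTS stage."""
    emotion = ser(audio)                      # English audio -> emotion label
    transcript = asr(audio)                   # English audio -> English text
    translation = mt(transcript)              # English text  -> Arabic text
    arabic_audio = tts(translation, emotion)  # emotion-conditioned synthesis
    return EmoAraResult(emotion, transcript, translation, arabic_audio)
```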

2. Data Sources and Training Protocols

The constituent modules draw from the following resources:

  • SER: RAVDESS (1,440 .wav clips, 8 emotions, split 85%/7.5%/7.5% for train/dev/test).
  • ASR: Whisper Base pre-trained, requiring no additional ASR-specific training data.
  • MT: 24,000 English–Arabic general domain pairs and 10,000 Banking77 domain pairs, 80/10/10% splits.
  • TTS: MMS-TTS-Ara, pre-trained on extensive Arabic corpora, no additional fine-tuning.
  • Training regime: SER uses batch size 64 with Adam (LR = 1e-3, 100 epochs); MarianMT fine-tuning uses AdamW (LR = 3e-5, 1,000 warm-up steps, 10 epochs, FP16, effective batch size 32 via gradient accumulation). Experiments ran on an RTX 3060 / Ryzen 3700X / 64 GB RAM workstation, with an A100 or T4 optionally used for MT.

Audio is processed at 16kHz mono, with silence trimming and augmentation for SER. Text data undergoes normalization, script filtering, and tokenization.
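A minimal sketch of the mono downmix and silence trimming. The threshold-based trimming method and its value are our assumptions; resampling to 16 kHz is left to the audio loader:

```python
import numpy as np

def preprocess(audio: np.ndarray, threshold: float = 1e-3) -> np.ndarray:
    """Downmix stereo to mono and trim leading/trailing silence by an
    amplitude threshold. Input is assumed already resampled to 16 kHz."""
    if audio.ndim == 2:                      # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    voiced = np.flatnonzero(np.abs(audio) > threshold)
    if voiced.size == 0:
        return audio[:0]                     # all-silence input
    return audio[voiced[0]: voiced[-1] + 1]
```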

3. Evaluation Metrics and Experimental Results

Key performance metrics and outcomes for each module are as follows:

  • SER: The implemented CNN attained an average F1-score of 0.94 (baseline 0.53), with classwise F1 ranging from 0.92 (calm/fear/sad) to 0.97 (angry).
  • MT: On the Banking77 domain, fine-tuned MarianMT reached BLEU 56.0 (95% CI: 54.5–57.5) and BERTScore F1 88.7%. Baselines scored BLEU 23.78 (from-scratch) and 25.48 (pre-trained), with BERTScore F1 of 67% and 73%, respectively.

    • BLEU calculation:

    BLEU = BP \cdot \exp\left(\sum_{n=1}^N w_n \log p_n\right)

    with BP the brevity penalty and p_n the modified n-gram precisions.

    • BERTScore F1:

    F1 = \frac{2PR}{P + R}

    where P and R are defined over token-embedding cosine similarities.

  • TTS and Emotion Preservation: Subjective listening tests (N = 20) found an 85% congruence rate for emotion match pre- and post-translation. Objective prosodic alignment was measured via F0 correlation (0.72) and energy correlation (0.68).
  • Human evaluation: For 100 banking-domain test sentences rated on a 1–3 rubric (accuracy, fluency, terminology), the average human score was 81%.
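The BLEU and F1 formulas above can be computed directly. This sketch assumes precomputed n-gram precisions and token-level precision/recall (helper names are ours):

```python
import math

def bleu(precisions, cand_len, ref_len, weights=None):
    """BLEU = BP * exp(sum_n w_n log p_n) with the standard brevity penalty."""
    n = len(precisions)
    weights = weights or [1.0 / n] * n
    bp = 1.0 if cand_len > ref_len else math.exp(1.0 - ref_len / cand_len)
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

def f1(p, r):
    """Harmonic mean of precision and recall (BERTScore F1)."""
    return 2 * p * r / (p + r)
```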

4. Emotion Conditioning and Related Work

EmoAra employs a CNN-based SER module to extract emotional context and propagate it downstream, crucial for customer-facing systems. The emotion label conditions the TTS, modulating its output to enhance cross-lingual emotional fidelity. Compared to approaches like ArabEmoNet (Abouzeid et al., 1 Sep 2025), which utilizes a hybrid 2D CNN-BiLSTM with attention on Arabic data, EmoAra emphasizes cross-lingual transfer and end-to-end integration. ArabEmoNet demonstrates that operating on log-Mel spectrograms (vs. MFCCs in 1D CNNs) and leveraging bidirectionality and soft attention further improves Arabic emotion recognition at a reduced parameter count. Both systems use categorical cross-entropy as the loss but differ in feature representation and architectural depth.

A plausible implication is that further improvements in EmoAra’s SER, especially under domain/data shift, may arise from lightweight and spectro-temporally-aware models as exemplified by ArabEmoNet.

5. Implementation and Open-Source Availability

The EmoAra codebase is provided at https://github.com/besherhasan/Emotion-Driven-Speech-Transcription-and-Cross-Lingual-Translation-with-Arabic-TTS-Integration. Modules include scripts for SER training (/src/ser_train.py), ASR inference (/src/asr_whisper.py), MT fine-tuning/inference (/src/mt_finetune.py, /src/mt_translate.py), TTS synthesis (/src/tts_synthesize.py), and evaluation tools (/src/evaluate.py). Reproduction instructions, environment setup, and demo notebooks are included.

6. Limitations and Prospects

EmoAra is subject to several constraints:

  • SER depends on acted RAVDESS data, possibly reducing generalization to spontaneous speech.
  • MT is limited by token truncation for long sentences (>20 words), potentially dropping relevant content.
  • TTS prosody conditioning is heuristic, which may inadequately capture subtle emotional shifts.
  • No in-domain SER training data is currently utilized.

Future enhancements include: expanding training data via back-translation/paraphrasing, integrating adapter layers or LoRA for efficient MT fine-tuning, in-domain SER data collection and fine-tuning, and optimizing TTS emotional renderings via large-scale subjective A/B tests.

EmoAra demonstrates the feasibility of integrating cross-lingual speech technologies with explicit emotion preservation, targeting high-stakes communication domains such as banking customer service, where empathetic machine translation and synthesis substantially impact interaction quality (Hassan et al., 1 Feb 2026).
