Service-Oriented TTS Architecture
- Service-Oriented TTS Architecture is a modular framework that decouples computationally intensive, context-aware phonemization from the core text-to-speech engine for improved speed and accuracy.
- It integrates a lightweight embedded phonemizer with a separate context-aware service that refines phoneme sequences using statistical disambiguation and distilled neural models for Ezafe detection.
- Empirical results show reduced phoneme error rates and latency improvements, making the architecture effective for both low-resource environments and high-performance applications.
A service-oriented text-to-speech (TTS) architecture is a modular framework that enables real-time, high-quality speech synthesis by decoupling computationally intensive, context-aware phonemization modules from the core TTS engine. This design addresses the trade-off between inference speed and linguistic accuracy in G2P-aided TTS systems, facilitating improved pronunciation and prosody within strict latency constraints and on resource-limited devices (Fetrat et al., 8 Dec 2025).
1. System Components and Workflow
The service-oriented TTS system is composed of three principal modules:
- Core TTS Engine ("Piper P2S"): Implements a non-autoregressive phoneme-to-speech model, exported to ONNX, running as a persistent process.
- Lightweight Phonemizer (eSpeak-based): Handles default grapheme-to-phoneme conversion, embedded directly within the core to generate an initial phoneme sequence.
- Context-Aware Phonemizer Service: Runs as a separate, long-lived process and includes:
  - Statistical homograph disambiguation based on co-occurrence statistics.
  - Ezafe detection via a distilled ALBERT POS-tagger (ONNX format).
The system operates via the following sequence:
- Core TTS receives raw text input.
- Embedded lightweight phonemizer generates initial phonemes.
- Core TTS writes a request (containing phonemes and optional word context) to a named request pipe.
- The context-aware service refines the phoneme sequence (correcting homographs and inserting Ezafe) before writing the output to a response pipe.
- Core TTS retrieves the enhanced phoneme sequence and proceeds with P2S synthesis to generate the audio waveform.
The overall workflow allows for flexible, on-demand incorporation of high-quality linguistic normalization without incurring module load overhead per request (Fetrat et al., 8 Dec 2025).
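As a concrete illustration, the following Python sketch mirrors this flow. Every function is a hypothetical stub standing in for eSpeak, the pipe-based service round-trip, and the ONNX P2S model; none of these names come from the PiperTTS API (a fuller sketch of the pipe protocol appears in Section 2).

```python
# Minimal sketch of the per-request flow. All functions are stubs: the real
# components are eSpeak, the named-pipe service, and the ONNX P2S model.

def espeak_phonemize(text: str) -> list[str]:
    """Stand-in for the embedded eSpeak grapheme-to-phoneme pass."""
    return list(text.replace(" ", ""))  # placeholder: one pseudo-phoneme per char

def refine_phonemes(phonemes: list[str], context: list[str]) -> list[str]:
    """Stand-in for the named-pipe round-trip to the context-aware service."""
    return phonemes  # placeholder: identity (no corrections applied)

def synthesize(phonemes: list[str]) -> bytes:
    """Stand-in for the non-autoregressive P2S model exported to ONNX."""
    return bytes(len(phonemes))  # placeholder waveform

def synthesize_utterance(text: str) -> bytes:
    phonemes = espeak_phonemize(text)                   # initial phonemes
    enhanced = refine_phonemes(phonemes, text.split())  # service round-trip
    return synthesize(enhanced)                         # P2S synthesis
```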
2. Service Interfaces and Communication Protocol
Interprocess communication is established using UNIX-style or Windows named pipes for bidirectional, low-overhead message streaming. Both core and service processes initialize at system startup to avoid per-request startup latency.
The API leverages JSON message schemas:
- Request:

```json
{
  "request_id": "<UUID>",
  "phonemes": ["p1", "p2", …],
  "context_tokens": ["w-2", "w-1", "w", "w+1", "w+2"]
}
```
- Response:

```json
{
  "request_id": "<UUID>",
  "enhanced_phonemes": ["p1′", "p2′", …],
  "corrections": {
    "homograph_positions": [i, j, …],
    "ezafe_insertions": [k, l, …]
  }
}
```
API usage consists of writing and flushing JSON requests to the pipe and reading/processing JSON responses. This protocol supports streaming, batching, and asynchronous I/O, creating potential for parallel processing within the TTS pipeline.
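A minimal client-side sketch of this round-trip follows, assuming hypothetical pipe paths and newline-delimited JSON framing; neither detail is specified in the paper.

```python
import json
import uuid

# Hypothetical pipe paths; the actual names are not specified in the paper.
REQUEST_PIPE = "/tmp/tts_ctx_requests"
RESPONSE_PIPE = "/tmp/tts_ctx_responses"

def refine_phonemes(phonemes: list[str], context_tokens: list[str]) -> list[str]:
    """One request/response round-trip following the JSON schema above."""
    request = {
        "request_id": str(uuid.uuid4()),
        "phonemes": phonemes,
        "context_tokens": context_tokens,
    }
    # Write one newline-delimited JSON message and flush so the service
    # sees a complete request immediately.
    with open(REQUEST_PIPE, "w") as req:
        req.write(json.dumps(request) + "\n")
        req.flush()
    # Block on the response pipe; a production client would read
    # asynchronously and match responses by request_id.
    with open(RESPONSE_PIPE, "r") as resp:
        response = json.loads(resp.readline())
    assert response["request_id"] == request["request_id"]
    return response["enhanced_phonemes"]
```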
3. Latency and Parallel Execution
Let the following denote timing components:
- $T_{\text{core}}$: one-time core TTS load,
- $T_{\text{srv}}$: one-time service load,
- $t_{\text{g2p}}$: lightweight phonemizer inference,
- $t_{\text{hom}}$: homograph disambiguation time,
- $t_{\text{ez}}$: Ezafe detection time,
- $t_{\text{io}}$: per-request pipe I/O overhead,
- $t_{\text{p2s}}$: phoneme-to-speech synthesis time.
The per-request end-to-end latency (the one-time loads $T_{\text{core}}$ and $T_{\text{srv}}$ are paid only at startup) is:

$$T_{\text{req}} = t_{\text{g2p}} + t_{\text{io}} + t_{\text{hom}} + t_{\text{ez}} + t_{\text{p2s}}$$

With asynchronous I/O and parallelization, the critical path is:

$$T_{\text{crit}} = t_{\text{g2p}} + \max\left(t_{\text{io}} + t_{\text{hom}} + t_{\text{ez}},\; t_{\text{p2s}}\right)$$
This structure ensures that phonemization corrections can be processed concurrently with speech synthesis. When the context modules are faster than P2S, there is no significant impact on the critical path, maintaining real-time responsiveness (Fetrat et al., 8 Dec 2025).
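One plausible realization of this overlap is chunk-level pipelining, sketched below using the stand-ins defined earlier (`refine_phonemes`, `synthesize`): while the P2S stage synthesizes chunk n, the service refines chunk n+1, so the steady-state per-chunk cost approaches the max() term of the critical-path expression above.

```python
import queue
import threading

def run_pipeline(chunks: list[tuple[list[str], list[str]]]) -> list[bytes]:
    """Refine chunk n+1 while synthesizing chunk n (two-stage pipeline)."""
    refined: queue.Queue = queue.Queue(maxsize=2)

    def refine_stage() -> None:
        for phonemes, context in chunks:
            refined.put(refine_phonemes(phonemes, context))  # t_io + t_hom + t_ez
        refined.put(None)  # sentinel: all chunks refined

    threading.Thread(target=refine_stage, daemon=True).start()

    audio = []
    while (enhanced := refined.get()) is not None:
        audio.append(synthesize(enhanced))  # t_p2s overlaps the next refinement
    return audio
```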
4. Context-Aware Phonemization Strategies
Statistical Homograph Disambiguation
Ambiguous words are pre-indexed with their possible pronunciations $P(w) = \{p_1, \dots, p_m\}$ and context co-occurrence statistics $C(p_i, c)$. During inference, the score for each pronunciation is computed from the sum of co-occurrence counts with surrounding context tokens. The optimal pronunciation is selected via:

$$p^* = \arg\max_{p_i \in P(w)} \sum_{c \in \mathrm{context}(w)} C(p_i, c)$$

Phoneme replacements are applied according to $p^*$.
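A toy instance of this scoring rule, with made-up pronunciations and co-occurrence counts (the actual lexicon and statistics are not published in this form):

```python
from collections import Counter

# C[pronunciation][context_word] -> co-occurrence count (illustrative values)
C = {
    "pron_A": Counter({"river": 9, "shore": 7}),
    "pron_B": Counter({"money": 5, "loan": 4}),
}

def disambiguate(context_tokens: list[str]) -> str:
    """p* = argmax over pronunciations of the summed context co-occurrences."""
    scores = {p: sum(counts[c] for c in context_tokens) for p, counts in C.items()}
    return max(scores, key=scores.get)

print(disambiguate(["by", "the", "river"]))  # -> "pron_A"
```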
Distilled Ezafe Detector
Ezafe detection is treated as a sequence labeling/POS-tagging task:
- Student model: ALBERT-Persian (11M parameters), distilled and fine-tuned on SpaCy-labeled Ezafe tags.
- Inference per token boundary $i$:

$$\hat{y}_i = \operatorname{softmax}(W h_i + b)$$

where $h_i$ is the ALBERT hidden state for token $i$.
The context-aware service applies the correction functions as a pipeline:

$$\Phi' = f_{\text{ez}}\left(f_{\text{hom}}(\Phi)\right)$$

Here, $\Phi$ denotes the initial phoneme sequence, and $\Phi'$ is the enhanced output.
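At inference time, the distilled tagger can be served with ONNX Runtime. The sketch below assumes an exported model file and the input names `input_ids` and `attention_mask`, which are conventions of typical transformer exports rather than details given in the paper.

```python
import numpy as np
import onnxruntime as ort

# Assumed export path and input names for the distilled ALBERT student.
session = ort.InferenceSession("ezafe_albert_student.onnx")

def tag_ezafe(input_ids: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Per-token labels via argmax of softmax(W h_i + b); the argmax is
    unchanged by the softmax, so raw logits suffice."""
    (logits,) = session.run(
        None, {"input_ids": input_ids, "attention_mask": attention_mask}
    )
    return logits.argmax(axis=-1)

# The service then composes the two passes, Phi' = f_ez(f_hom(Phi)):
# homograph replacements first, Ezafe insertions second.
```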
5. Quantitative Performance and Comparative Metrics
Empirical evaluation demonstrates significant improvements in phonemization accuracy and real-time throughput:
| Model | PER ↓ | EzF1 ↑ | HomAcc ↑ | MOS ↑ | RTF (direct) ↓ | RTF (service) ↓ |
|---|---|---|---|---|---|---|
| Piper (Base) | 6.32 | 19.58 | 43.87 | 2.41 | 0.153 | – |
| + Neural G2P | 4.95 | 87.70 | 74.53 | – | 3.840 | 0.396 |
| + LCA-G2P (service-oriented) | 4.80 | 90.08 | 77.67 | 3.14 | 5.519 | 0.167 |
Key metrics:
- PER (Phoneme Error Rate) decreases from 6.32 (base) to 4.80 (LCA-G2P).
- EzF1 (Ezafe F1-score) increases from 19.6 to 90.1.
- HomAcc (Homograph Accuracy) increases from 43.9 to 77.7.
- MOS (Mean Opinion Score) increases from 2.41 to 3.14.
- RTF (Real-Time Factor) for service-based design is reduced to 0.167, restoring near-baseline throughput.
For Ezafe detection, the distilled ALBERT model achieves 94.19% F1 with a far smaller memory and disk footprint than the SpaCy teacher.
| Model | #Params | Memory (MB) | Disk (MB) | EzF1 (%) | Inference time (s) ↓ |
|---|---|---|---|---|---|
| SpaCy Teacher | 162.8M | 621 | 1258 | 97.67 | 0.110 |
| ALBERT Student | 11.1M | 42.3 | 41.4 | 94.19 | 0.037 |
6. Architectural and Deployment Trade-Offs
Service-based TTS architectures enable flexible trade-offs between quality and computational cost:
- Integrating a full neural G2P (300M parameters) substantially improves Ezafe F1 and HomAcc but incurs high direct inference latency (RTF ≈ 3.84).
- LCA-G2P, leveraging service decoupling and lightweight models, yields a lower PER and a higher Ezafe F1 than the full neural G2P, with a 15× smaller footprint and a real-time RTF (0.167) on CPU.
- On low-resource devices (CPU, restricted RAM), deployment of distilled ALBERT and rule/statistical units is recommended.
- For high-end/offline scenarios (GPU, abundant RAM), more complex G2P (e.g., Homo-GE2PE or LLM-based) can be hosted, still benefiting from the same decoupled service interface and without rearchitecting the core engine. This suggests an extensible deployment pathway.
7. Contributions and Implications
Key contributions of the service-oriented TTS architecture include:
- Decoupling heavy phonemization from the core engine, eliminating per-request load overhead.
- Incorporation of a lightweight context-aware phonemizer combining statistical and distilled neural modules.
- Enabling PiperTTS to achieve real-time CPU throughput with notable improvements in pronunciation and linguistic features.
- Empirical validation, demonstrating enhancements in PER, Ezafe F1, Homograph Accuracy, and MOS—all with low-latency operation.
- Deployment guidance supporting both end-device and server/offline use cases under stringent latency and resource constraints.
A plausible implication is that modular, service-oriented designs can generalize to other language-specific or contextual normalization tasks in speech and language systems, facilitating diverse deployment scenarios without sacrificing latency or scalability (Fetrat et al., 8 Dec 2025).