TACA-VITS: Expressive Audiobook TTS
- TACA-VITS is a neural text-to-speech architecture that enhances audiobook synthesis through text-aware style embedding and context encoding.
- It leverages cross-modal contrastive learning to align text and speech in a shared style space, enabling diverse and coherent expressiveness.
- The model integrates a context encoder processing multi-sentence inputs to ensure long-form prosodic coherence without manual style annotations.
TACA-VITS is a neural text-to-speech (TTS) architecture for expressive audiobook synthesis that integrates a text-aware style embedding and a context encoder into the VITS 2.0 backbone. It is designed to capture a broad range of expressive speaking styles, as found in professional audiobook narration, directly from text inputs—without requiring manually labeled style annotations or reference speech at inference. TACA-VITS leverages cross-modal contrastive learning to align text and speech in a shared style space, and employs cross-sentence textual context to yield long-form, coherent, and highly expressive synthetic speech (Guo et al., 2024).
1. Architectural Foundations
TACA-VITS extends the single-stage VITS 2.0 architecture, which unifies variational inference, adversarial training, and normalizing flows for end-to-end TTS. In the vanilla VITS pipeline, a text encoder generates hidden representations $h_{text}$ from input phonemes $c_{text}$. The prior encoder parameterizes a normalizing-flow prior $p_\theta(z \mid c_{text})$ over the latent acoustic variable $z$, the posterior encoder $q_\phi(z \mid x)$ infers $z$ from ground-truth speech $x$, and a generator $G$—paired with a multi-period discriminator $D$—maps $z$ to the waveform.
TACA-VITS introduces two core augmentations:
- A text-aware style embedding $e_t$, inferred purely from the input text and trained to inhabit a speech-derived “style space” via cross-modal alignment.
- A context encoder that processes not only the sentence in focus but also its neighboring sentence embeddings and the quantized style code, producing context-aware representations $h_{ctx}$ that augment or replace the canonical $h_{text}$ in all generative steps.
This architectural adjustment allows TACA-VITS models to handle text-predicted expressive styles and model prosody over multiple sentences, which is central to the audiobook domain where stylistic consistency and long-range prosodic coherence are required.
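The resulting data flow differs from vanilla VITS mainly in what conditions the prior. The minimal sketch below (PyTorch, with toy stand-in modules and assumed dimensions, not the released implementation) illustrates the substitution: when a context-aware representation $h_{ctx}$ is available, it replaces $h_{text}$ as the conditioning for the latent $z$ that the generator decodes.

```python
import torch
import torch.nn as nn

class TinyVITSFlow(nn.Module):
    """Toy stand-in for the VITS prior/generator data flow (illustration only)."""
    def __init__(self, n_phones=100, d_model=192):
        super().__init__()
        self.text_encoder = nn.Embedding(n_phones, d_model)   # produces h_text
        self.prior_proj = nn.Linear(d_model, 2 * d_model)     # mean / log-var of p(z | conditioning)
        self.generator = nn.Linear(d_model, 1)                 # stand-in for the HiFi-GAN-style decoder

    def forward(self, phonemes, h_ctx=None):
        h_text = self.text_encoder(phonemes)                   # [B, T, d]
        cond = h_ctx if h_ctx is not None else h_text          # TACA-VITS: condition on h_ctx instead
        mean, logvar = self.prior_proj(cond).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)
        return self.generator(z)                               # toy "waveform" frames

phonemes = torch.randint(0, 100, (2, 17))                      # dummy phoneme IDs
print(TinyVITSFlow()(phonemes).shape)                          # torch.Size([2, 17, 1])
```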
2. Text-Aware Style Space Learning
2.1 Speech-Style Encoder
A speech-style embedding $e_s$ is extracted using an off-the-shelf SRL speech encoder initialized from HuBERT (layer 6). For each speech-text pair $(x_i, t_i)$, the embedding $e_s^i$ is computed from the speech $x_i$. The embedding space is spread out via a semi-supervised contrastive loss, encouraging a diverse expressive latent manifold to form in the absence of explicit labels.
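A minimal sketch of this feature extraction, assuming the public `facebook/hubert-base-ls960` checkpoint from Hugging Face `transformers` and mean pooling over time (the exact checkpoint and pooling used by the paper are not specified here):

```python
import torch
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-base-ls960", output_hidden_states=True)
model.eval()

waveform = torch.randn(1, 16000)       # 1 s of dummy 16 kHz audio standing in for an utterance x_i
with torch.no_grad():
    out = model(waveform)

# hidden_states[0] is the convolutional front-end output, so index 6 is transformer layer 6
layer6 = out.hidden_states[6]          # [1, T, 768]
e_s = layer6.mean(dim=1)               # utterance-level style embedding e_s (mean pooling is an assumption)
```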
2.2 Cross-modal Contrastive Training
A text encoder $E_{text}$, initialized from Chinese T5, maps the text $t_i$ to a text-style embedding $e_t^i$, which is trained to align with its speech counterpart $e_s^i$ and to preserve the structural geometry of speech-based style relations. The contrastive objective samples positive and negative text-text pairs based on the cosine similarity of their associated speech-style embeddings:
- Positive if $\cos(e_s^i, e_s^j) > \tau^{+}$
- Negative if $\cos(e_s^i, e_s^j) < \tau^{-}$
- Pairs whose similarity lies between the two thresholds are ignored.
The text-style loss combines an InfoNCE objective $\mathcal{L}_{nce}$ and a direct cosine-alignment term $\mathcal{L}_{cos}$:

$$\mathcal{L}_{style} = \mathcal{L}_{nce} + \mathcal{L}_{cos}$$
Only the T5 encoder is optimized in this stage; the speech encoder remains frozen.
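The sketch below illustrates this objective in PyTorch. The threshold values, temperature, and equal weighting of the two terms are placeholders, and both embeddings are assumed to have been projected to a shared dimensionality:

```python
import torch
import torch.nn.functional as F

def text_style_loss(e_t, e_s, tau_pos=0.8, tau_neg=0.3, temperature=0.1):
    """e_t: [B, D] text-style embeddings (trainable); e_s: [B, D] frozen speech-style embeddings."""
    e_t = F.normalize(e_t, dim=-1)
    e_s = F.normalize(e_s, dim=-1)

    # Positive / negative text-text pairs are labeled from speech-side cosine similarity
    sim_speech = e_s @ e_s.T
    pos_mask = (sim_speech > tau_pos).float()
    neg_mask = (sim_speech < tau_neg).float()
    pos_mask.fill_diagonal_(0)                       # a sample is not its own positive

    # InfoNCE over text-side similarities, restricted to labeled pairs
    sim_text = (e_t @ e_t.T) / temperature
    exp_sim = torch.exp(sim_text)
    numer = (exp_sim * pos_mask).sum(dim=-1) + 1e-8
    denom = (exp_sim * (pos_mask + neg_mask)).sum(dim=-1) + 1e-8
    has_pos = (pos_mask.sum(dim=-1) > 0).float()
    l_nce = -(torch.log(numer / denom) * has_pos).sum() / has_pos.sum().clamp(min=1)

    # Direct cross-modal cosine alignment of each text embedding to its own speech embedding
    l_cos = (1 - (e_t * e_s).sum(dim=-1)).mean()
    return l_nce + l_cos

loss = text_style_loss(torch.randn(8, 256, requires_grad=True), torch.randn(8, 256))
```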
A vector quantization (VQ) module discretizes $e_t$ to a codebook index $q$ with quantized code $e_q$, promoting the reuse of a finite number of prototypical expressive styles and stabilizing generation.
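A minimal nearest-neighbour VQ sketch with a straight-through estimator; the codebook size and dimensionality below are arbitrary placeholders:

```python
import torch
import torch.nn as nn

class StyleVQ(nn.Module):
    def __init__(self, num_codes=64, dim=768):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, e_t):
        # e_t: [B, dim] -> nearest codebook entry by L2 distance
        dist = torch.cdist(e_t, self.codebook.weight)   # [B, num_codes]
        q = dist.argmin(dim=-1)                          # discrete style index q
        e_q = self.codebook(q)                           # quantized style code e_q
        e_q = e_t + (e_q - e_t).detach()                 # straight-through estimator for gradients
        return e_q, q

e_q, q = StyleVQ()(torch.randn(4, 768))
```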
3. Context Encoder Integration
The context encoder $E_{ctx}$ combines:
- Phoneme embeddings from a three-sentence window (previous, current, next)
- The corresponding BERT embeddings
- The quantized style code $e_q$
The network architecture consists of two 1D convolutional layers over concatenated phoneme-BERT inputs, a stack of four Transformer layers (dimension 768, 8 heads), injection of the style code via addition at each frame, and a final projection to 512 dimensions (VITS-compatible). During training, the context encoder is fine-tuned on top of a pre-trained VITS model: the system is initialized from the official Bert-VITS 2.0 checkpoint, with its BERT block replaced by the context encoder.
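A hedged PyTorch sketch of this module; the phoneme-embedding dimension, kernel sizes, activation, and the assumption that BERT features are already aligned to the phoneme sequence are not given in the text:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, phone_dim=192, bert_dim=768, d_model=768, out_dim=512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(phone_dim + bert_dim, d_model, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.proj = nn.Linear(d_model, out_dim)

    def forward(self, phone_emb, bert_emb, e_q):
        # phone_emb: [B, T, phone_dim] over the three-sentence window
        # bert_emb:  [B, T, bert_dim];  e_q: [B, d_model] quantized style code
        x = torch.cat([phone_emb, bert_emb], dim=-1).transpose(1, 2)   # [B, C, T]
        x = self.convs(x).transpose(1, 2)                              # [B, T, d_model]
        x = x + e_q.unsqueeze(1)                                       # add style code at every frame
        h_ctx = self.transformer(x)
        return self.proj(h_ctx)                                        # [B, T, 512], VITS-compatible

enc = ContextEncoder()
h_ctx = enc(torch.randn(2, 120, 192), torch.randn(2, 120, 768), torch.randn(2, 768))
```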
A two-phase procedure is used:
- Pre-train VITS on standard text input.
- Fine-tune with the full context encoder, sampling style codes from either speech or text encoders to promote generalization.
4. Training Regimen and Objective Functions
4.1 Loss Formulation
During fine-tuning, the model minimizes the combined VITS objective

$$\mathcal{L}_{VITS} = \lambda_{kl}\,\mathcal{L}_{kl} + \lambda_{fm}\,\mathcal{L}_{fm} + \lambda_{mel}\,\mathcal{L}_{mel} + \lambda_{adv}\,\mathcal{L}_{adv},$$

the weighted sum of the standard VITS losses:
- KL divergence: $\mathcal{L}_{kl}$
- Flow matching: $\mathcal{L}_{fm}$
- Mel-spectrogram L1 loss: $\mathcal{L}_{mel}$
- Adversarial loss: $\mathcal{L}_{adv}$

The loss weights are fixed throughout fine-tuning.
During this phase, gradients are back-propagated through the context encoder's style pathway only; the text encoder weights are frozen.
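A sketch of how the individual loss terms could be combined; the weight values here are illustrative placeholders rather than the fixed values used in the paper:

```python
import torch

def vits_total_loss(l_kl, l_fm, l_mel, l_adv,
                    w_kl=1.0, w_fm=1.0, w_mel=45.0, w_adv=1.0):
    """Weighted sum of per-term scalar losses (weights are placeholders)."""
    return w_kl * l_kl + w_fm * l_fm + w_mel * l_mel + w_adv * l_adv

total = vits_total_loss(torch.tensor(0.2), torch.tensor(0.5),
                        torch.tensor(1.3), torch.tensor(0.7))
```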
4.2 Data and Optimization
The model is trained and fine-tuned on:
- 100H-Multi-Style (66 hours manually labeled) for the speech-style encoder
- 6kH unlabeled audiobook speech (auto-transcribed) for text-style contrastive learning
- 20H-Audiobook-HQ (utterance-ordered) for full-system fine-tuning
Optimizers are AdamW for the T5 text encoder and the speech-style encoder (each with its own fixed learning rate) and Adam for the full TACA-VITS system. The style space is discretized with a fixed-size VQ codebook.
5. Empirical Evaluation
5.1 Style Alignment
TACA-VITS’s text encoder achieves strong alignment with the speech-style encoder: the T5-based text encoder reaches a higher cosine similarity with the speech-style embeddings than a BERT-based baseline. t-SNE visualization of 450 sentences across 9 chapters shows that the learned text-style vectors cluster by chapter while remaining well dispersed within each cluster, reflecting both stylistic consistency and intra-chapter variability.
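A minimal sketch of how such a visualization can be produced with scikit-learn; the style vectors and chapter labels below are synthetic placeholders:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

style_vectors = np.random.randn(450, 768)   # stand-in for 450 sentence-level text-style vectors e_t
chapters = np.repeat(np.arange(9), 50)      # 9 chapters x 50 sentences each

coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(style_vectors)
plt.scatter(coords[:, 0], coords[:, 1], c=chapters, cmap="tab10", s=8)
plt.title("Text-style embeddings by chapter (t-SNE)")
plt.show()
```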
5.2 Objective and Subjective Metrics
Objective TTS performance is reported via Mel Cepstral Distortion (MCD, dB) and Character Error Rate (CER, %):
| TTS Model | MCD (dB) ↓ | CER (%) ↓ |
|---|---|---|
| VITS (base) | 4.23 | 5.80 |
| TACA-VITS | 4.21 | 5.99 |
| LM (base) | 8.92 | 13.9 |
| TACA-LM | 8.15 | 13.1 |
TACA-VITS stays on par with the baseline VITS in both metrics (marginally lower MCD, marginally higher CER). For LM-based models, the TACA variant yields measurable improvements in both MCD and CER.
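For reference, a hedged sketch of the standard MCD computation reported above, assuming frame-aligned mel-cepstra with the 0th (energy) coefficient excluded:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """mcep_*: [T, D] aligned mel-cepstra; returns MCD in dB."""
    diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]                 # skip the energy coefficient c0
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(per_frame)

ref = np.random.randn(200, 25)                                # dummy aligned cepstra
syn = ref + 0.05 * np.random.randn(200, 25)
print(round(mel_cepstral_distortion(ref, syn), 3))
```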
Subjective naturalness and expressiveness are assessed using MOS (1–5 scale, 20 native listeners):
| Model | NMOS↑ | EMOS↑ |
|---|---|---|
| VITS (base) | 3.84±0.11 | 3.61±0.14 |
| TACA-VITS | 3.90±0.10 | 3.93±0.11 |
| LM (base) | 2.91±0.09 | 3.80±0.11 |
| TACA-LM | 3.22±0.10 | 4.05±0.10 |
TACA-VITS yields a +0.06 gain in NMOS and +0.32 in EMOS over VITS. The improvements for TACA-LM over LM (0.31 NMOS, 0.25 EMOS) further confirm the effectiveness of the TACA methodology for both VITS- and LM-based TTS.
6. Analytical Perspectives
- Cross-modal contrastive learning taps large-scale unannotated audiobook data to induce a nuanced, high-dimensional style manifold from text. This facilitates fine-grained control of style and prosody without style labels or reference audio.
- Vector quantization in the style space promotes re-use of prototypical style embeddings, which contributes to synthesis stability and coverage of diverse expressive modes.
- The context encoder, with its access to past and future sentences, enables coherent long-form prosody—addressing a well-known shortcoming in prior sentence-level TTS systems.
- Text-based style prediction allows fully text-driven inference, removing the need for external style references and supporting scalable deployment.
- Both the text-style modeling and context encoder modules are modular and can be integrated into other single-stage TTS systems.
7. Significance and Future Directions
TACA-VITS advances the state of expressive TTS in the audiobook domain. By combining a robust text-predicted style embedding, context modeling over sentence windows, and a rigorously balanced VITS-style training objective, TACA-VITS achieves high-fidelity, expressive, and contextually coherent speech synthesis without explicit reference data or style annotation requirements. A plausible implication is that these mechanisms could generalize well across languages and genres, especially for long-form, multi-style synthesizer deployments (Guo et al., 2024).