
TACA-VITS: Expressive Audiobook TTS

Updated 5 April 2026
  • TACA-VITS is a neural text-to-speech architecture that enhances audiobook synthesis through text-aware style embedding and context encoding.
  • It leverages cross-modal contrastive learning to align text and speech in a shared style space, enabling diverse and coherent expressiveness.
  • The model integrates a context encoder processing multi-sentence inputs to ensure long-form prosodic coherence without manual style annotations.

TACA-VITS is a neural text-to-speech (TTS) architecture for expressive audiobook synthesis that integrates a text-aware style embedding and a context encoder into the VITS 2.0 backbone. It is designed to capture a broad range of expressive speaking styles, as found in professional audiobook narration, directly from text inputs—without requiring manually labeled style annotations or reference speech at inference. TACA-VITS leverages cross-modal contrastive learning to align text and speech in a shared style space, and employs cross-sentence textual context to yield long-form, coherent, and highly expressive synthetic speech (Guo et al., 2024).

1. Architectural Foundations

TACA-VITS extends the single-stage VITS 2.0 architecture, which unifies variational inference, adversarial training, and normalizing flows for end-to-end TTS. In the vanilla VITS pipeline, a text encoder generates hidden representations $h$ from input phonemes $x$. The prior $p_\theta(z|h)$ parameterizes a normalizing-flow prior for the latent acoustic variable $z$, the posterior $q_\phi(z|h,y)$ infers $z$ from ground-truth speech $y$, and a generator $G_\theta(z)$ (paired with a multi-period discriminator $D_\psi$) maps $z$ to the waveform.
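
For reference, these components realize the standard conditional-VAE training objective of VITS, written here in the notation above (this is the generic evidence-lower-bound form, not an equation quoted from the paper; the adversarial and duration losses are added on top):

$$\log p_\theta(y \mid h) \;\ge\; \mathbb{E}_{q_\phi(z \mid h, y)}\!\left[\log p_\theta(y \mid z)\right] \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid h, y)\,\|\,p_\theta(z \mid h)\right)$$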

TACA-VITS introduces two core augmentations:

  • A text-aware style embedding $s_{\text{text}}$, inferred purely from the input text and trained to inhabit a speech-derived “style space” via cross-modal alignment.
  • A context encoder that processes not only the sentence in focus but also its neighboring sentence embeddings and the quantized style code, producing context-aware representations $h_{\text{ctx}}$ that augment or replace the canonical hidden representation $h$ in all generative steps.

This adjustment allows TACA-VITS to predict expressive styles directly from text and to model prosody across multiple sentences, which is central to the audiobook domain, where stylistic consistency and long-range prosodic coherence are required.
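
A minimal PyTorch-style sketch of this wiring follows; it is purely illustrative, and the module names (TACAVITS, TextStyleEncoder-style interfaces, backbone.synthesize) are assumptions, not the paper's code:

```python
import torch.nn as nn

class TACAVITS(nn.Module):
    """Illustrative wiring of the two TACA modules onto a VITS-style backbone."""

    def __init__(self, backbone, text_style_encoder, context_encoder):
        super().__init__()
        self.backbone = backbone                      # pretrained VITS 2.0 (prior/posterior/generator)
        self.text_style_encoder = text_style_encoder  # T5-based, aligned to the speech style space
        self.context_encoder = context_encoder        # conv + Transformer over a 3-sentence window

    def forward(self, phoneme_window, bert_window, current_text):
        # Predict a style embedding from the current sentence's text alone,
        # then quantize it against the learned style codebook.
        style = self.text_style_encoder(current_text)
        style_code, _ = self.text_style_encoder.quantize(style)
        # The context-aware representation replaces the plain text encoding h.
        h_ctx = self.context_encoder(phoneme_window, bert_window, style_code)
        return self.backbone.synthesize(h_ctx)
```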

2. Text-Aware Style Space Learning

2.1 Speech-Style Encoder

A speech-style embedding $s_{\text{speech}}$ is extracted using an off-the-shelf SRL speech encoder initialized from HuBERT (layer 6). For each speech-text pair $(y, x)$, $s_{\text{speech}}$ is computed from the audio $y$. The embedding space is spread out via a semi-supervised contrastive loss, encouraging the formation of a diverse expressive latent manifold in the absence of explicit labels.
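
A sketch of how such an embedding might be extracted with the Hugging Face transformers library; the checkpoint name and the mean pooling over frames are assumptions, and the paper's SRL encoder is further trained on top of this initialization:

```python
import torch
from transformers import HubertModel

# HuBERT base; hidden_states[6] is the output of Transformer layer 6.
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

@torch.no_grad()
def speech_style_embedding(waveform_16k: torch.Tensor) -> torch.Tensor:
    # waveform_16k: (batch, samples) at 16 kHz
    out = hubert(waveform_16k, output_hidden_states=True)
    layer6 = out.hidden_states[6]   # (batch, frames, 768)
    return layer6.mean(dim=1)       # utterance-level style embedding (assumed pooling)
```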

2.2 Cross-modal Contrastive Training

A text encoder $E_t$, initialized from Chinese T5, maps the input text $x$ to a text-style embedding $s_{\text{text}}$, which is trained to align with its speech counterpart and to preserve the structural geometry of speech-based style relations. The contrastive objective samples positive and negative text-text pairs based on the cosine similarity of their associated speech-style embeddings:

  • Positive if the speech-style cosine similarity exceeds an upper threshold $\tau_{\text{pos}}$
  • Negative if it falls below a lower threshold $\tau_{\text{neg}}$
  • Pairs with similarity between the two thresholds are ignored.

The text-style loss combines an InfoNCE objective $\mathcal{L}_{\text{InfoNCE}}$ and a direct cosine-alignment term $\mathcal{L}_{\cos}$:

$\mathcal{L}_{\text{style}} = \mathcal{L}_{\text{InfoNCE}} + \mathcal{L}_{\cos}$

Only the T5 encoder is optimized in this stage; the speech encoder remains frozen.
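
A sketch of this stage in PyTorch; the thresholds, temperature, and the exact InfoNCE reduction are illustrative placeholders (the paper's values and formulation are not reproduced here):

```python
import torch
import torch.nn.functional as F

def text_style_loss(text_emb, speech_emb, tau_pos=0.8, tau_neg=0.3, temp=0.07):
    """Sketch of L_style = L_InfoNCE + L_cos over a batch of paired embeddings."""
    t = F.normalize(text_emb, dim=-1)    # (batch, dim) trainable text-style embeddings
    s = F.normalize(speech_emb, dim=-1)  # (batch, dim) frozen speech-style embeddings

    # Pair labels come from cosine similarity in the *speech* style space.
    speech_sim = s @ s.T
    pos = (speech_sim > tau_pos).fill_diagonal_(False)  # positive text-text pairs
    neg = speech_sim < tau_neg                          # negative text-text pairs
    valid = pos | neg                                   # in-between pairs are ignored

    # InfoNCE over text-text similarities, restricted to labeled pairs.
    logits = (t @ t.T) / temp
    logits = logits.masked_fill(~valid, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    has_pos = pos.any(dim=1)  # anchors with at least one positive (assumed non-empty)
    l_nce = -(log_prob.masked_fill(~pos, 0.0).sum(1)[has_pos] / pos.sum(1)[has_pos])

    # Direct cosine alignment of each text embedding to its own speech embedding.
    l_cos = 1.0 - (t * s).sum(-1)

    return l_nce.mean() + l_cos.mean()
```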

A vector quantization (VQ) module discretizes $s_{\text{text}}$ to a codebook index $c$, promoting the reuse of a finite set of prototypical expressive styles and stabilizing generation.
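
A minimal VQ layer of this kind, with a straight-through gradient estimator; the codebook size shown is a placeholder, not the paper's value:

```python
import torch

class StyleVQ(torch.nn.Module):
    """Nearest-neighbor quantization of a style embedding against a codebook."""

    def __init__(self, num_codes=256, dim=768):
        super().__init__()
        self.codebook = torch.nn.Embedding(num_codes, dim)

    def forward(self, style):                            # style: (batch, dim)
        dists = torch.cdist(style, self.codebook.weight)  # (batch, num_codes)
        idx = dists.argmin(dim=-1)                        # nearest prototype index c
        quantized = self.codebook(idx)
        # Straight-through estimator keeps gradients flowing to the encoder.
        return style + (quantized - style).detach(), idx
```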

3. Context Encoder Integration

The context encoder $E_{\text{ctx}}$ combines:

  • Phoneme embeddings from a three-sentence window (previous, current, next)
  • The corresponding BERT embeddings
  • The quantized style code $c$

The network architecture consists of two 1D convolutional layers over concatenated phoneme-BERT inputs, a stack of four Transformer layers (dimension 768, 8 heads), injection of the style code via addition at each frame, and a final projection to 512 dimensions (VITS-compatible). During training, the context encoder is fine-tuned on top of a pre-trained VITS model—initializing from the official Bert-VITS 2.0 checkpoint and replacing its BERT block.
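
A sketch of this network; the Transformer width (768), head count (8), depth (4), and 512-dim output come from the text above, while the phoneme-embedding dimension and convolution kernel sizes are assumptions:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Conv + Transformer context encoder over a 3-sentence input window."""

    def __init__(self, phn_dim=192, bert_dim=768, model_dim=768, out_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(phn_dim + bert_dim, model_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(model_dim, model_dim, kernel_size=5, padding=2),
        )
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.proj = nn.Linear(model_dim, out_dim)  # VITS-compatible 512 dims

    def forward(self, phn_emb, bert_emb, style_code):
        # phn_emb, bert_emb: (batch, frames, dim) over the previous/current/next
        # sentences; style_code: (batch, model_dim), added at every frame.
        x = torch.cat([phn_emb, bert_emb], dim=-1).transpose(1, 2)
        x = self.conv(x).transpose(1, 2)
        x = self.transformer(x + style_code.unsqueeze(1))
        return self.proj(x)                        # (batch, frames, 512)
```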

A two-phase procedure is used:

  1. Pre-train VITS on standard text input.
  2. Fine-tune with the full context encoder, sampling style codes from either speech or text encoders to promote generalization.
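
A small sketch of the style-source sampling in phase 2; the sampling probability and the interfaces are placeholders consistent with the sketches above:

```python
import random

def sample_style_code(batch, speech_style_enc, text_style_enc, vq, p_speech=0.5):
    """Draw the style code from either modality during fine-tuning, so the
    backbone generalizes to the text-only path used at inference."""
    if random.random() < p_speech:
        style = speech_style_enc(batch["waveform"])  # ground-truth speech path
    else:
        style = text_style_enc(batch["text"])        # text-only path
    code, _ = vq(style)                              # quantize to a prototype
    return code
```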

4. Training Regimen and Objective Functions

4.1 Loss Formulation

During fine-tuning, the model minimizes:

$\mathcal{L}_{\text{total}} = \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}} + \mathcal{L}_{\text{fm}} + \lambda_{\text{mel}}\,\mathcal{L}_{\text{mel}} + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}$

where the terms are the standard VITS losses:

  • KL divergence between the posterior and the flow prior: $\mathcal{L}_{\text{KL}} = D_{\mathrm{KL}}\big(q_\phi(z|h,y)\,\|\,p_\theta(z|h)\big)$
  • Flow matching: $\mathcal{L}_{\text{fm}}$
  • Mel-spectrogram L1 loss: $\mathcal{L}_{\text{mel}} = \lVert \mathrm{mel}(y) - \mathrm{mel}(G_\theta(z)) \rVert_1$
  • Adversarial loss against the discriminator $D_\psi$: $\mathcal{L}_{\text{adv}}$

The loss weights $\lambda_{\text{KL}}$, $\lambda_{\text{mel}}$, and $\lambda_{\text{adv}}$ are fixed throughout fine-tuning.

During this phase, zz8 is only back-propagated through the context encoder's style pathway; the text encoder weights are frozen.

4.2 Data and Optimization

The model is trained and fine-tuned on:

  • 100H-Multi-Style (66 hours manually labeled) for the speech-style encoder
  • 6kH unlabeled audiobook speech (auto-transcribed) for text-style contrastive learning
  • 20H-Audiobook-HQ (utterance-ordered) for full-system fine-tuning

AdamW is used for the T5 text-style encoder and for the speech-style encoder (each with its own learning rate), and Adam for the full TACA-VITS system. The VQ codebook size is a fixed hyperparameter.
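
A configuration sketch, assuming the encoder objects from the sketches above; the learning rates shown are placeholders, since the paper's values are not reproduced in this summary:

```python
import torch

opt_text = torch.optim.AdamW(text_style_encoder.parameters(), lr=1e-5)    # T5 contrastive stage
opt_srl  = torch.optim.AdamW(speech_style_encoder.parameters(), lr=1e-4)  # speech-style encoder
opt_vits = torch.optim.Adam(model.parameters(), lr=2e-4)                  # full TACA-VITS fine-tuning
```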

5. Empirical Evaluation

5.1 Style Alignment

TACA-VITS’s text encoder achieves strong alignment with the speech-style encoder: the T5-based encoder attains a substantially higher cosine similarity to the speech-style embeddings than a BERT-based baseline. t-SNE visualization of 450 sentences across 9 chapters shows that the learned text-style vectors cluster by chapter while remaining well-dispersed within each cluster, reflecting both stylistic consistency and intra-chapter variability.

5.2 Objective and Subjective Metrics

Objective TTS performance is reported via Mel Cepstral Distortion (MCD, dB) and Character Error Rate (CER, %):

  TTS Model     MCD (dB) ↓   CER (%) ↓
  VITS (base)   4.23         5.80
  TACA-VITS     4.21         5.99
  LM (base)     8.92         13.9
  TACA-LM       8.15         13.1

TACA-VITS stays essentially on par with baseline VITS on both metrics (slightly better MCD, slightly worse CER). For LM-based models, the TACA variant yields measurable improvements on both.
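
For concreteness, the standard MCD computation underlying the table is sketched below; the 0th (energy) coefficient is excluded by convention, and time alignment (e.g. via DTW) is assumed to have been done beforehand:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref: np.ndarray, mcep_syn: np.ndarray) -> float:
    """MCD in dB between time-aligned mel-cepstral sequences of shape (frames, dims)."""
    diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]            # drop the energy coefficient
    k = (10.0 / np.log(10.0)) * np.sqrt(2.0)            # standard scaling constant
    return float(k * np.sqrt((diff ** 2).sum(axis=1)).mean())
```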

Subjective naturalness and expressiveness are assessed using MOS (1–5 scale, 20 native listeners):

  Model         NMOS ↑        EMOS ↑
  VITS (base)   3.84 ± 0.11   3.61 ± 0.14
  TACA-VITS     3.90 ± 0.10   3.93 ± 0.11
  LM (base)     2.91 ± 0.09   3.80 ± 0.11
  TACA-LM       3.22 ± 0.10   4.05 ± 0.10

TACA-VITS yields a +0.06 gain in NMOS and +0.32 in EMOS over VITS. The improvements for TACA-LM over LM (0.31 NMOS, 0.25 EMOS) further confirm the effectiveness of the TACA methodology for both VITS- and LM-based TTS.

6. Analytical Perspectives

  • Cross-modal contrastive learning taps large-scale unannotated audiobook data to induce a nuanced, high-dimensional style manifold from text. This facilitates fine-grained control of style and prosody without style labels or reference audio.
  • Vector quantization in the style space promotes re-use of prototypical style embeddings, which contributes to synthesis stability and coverage of diverse expressive modes.
  • The context encoder, with its access to past and future sentences, enables coherent long-form prosody—addressing a well-known shortcoming in prior sentence-level TTS systems.
  • Text-based style prediction allows fully text-driven inference, removing the need for external style references and supporting scalable deployment.
  • Both the text-style modeling and context encoder modules are modular and can be integrated into other single-stage TTS systems.

7. Significance and Future Directions

TACA-VITS advances the state of expressive TTS in the audiobook domain. By combining a robust text-predicted style embedding, context modeling over sentence windows, and a rigorously balanced VITS-style training objective, TACA-VITS achieves high-fidelity, expressive, and contextually coherent speech synthesis without explicit reference data or style annotation requirements. A plausible implication is that these mechanisms could generalize well across languages and genres, especially for long-form, multi-style synthesis deployments (Guo et al., 2024).

References

  • Guo et al. (2024)
