TACA-VITS: Expressive Audiobook TTS
- TACA-VITS is a neural text-to-speech architecture that enhances audiobook synthesis through text-aware style embedding and context encoding.
- It leverages cross-modal contrastive learning to align text and speech in a shared style space, enabling diverse and coherent expressiveness.
- The model integrates a context encoder processing multi-sentence inputs to ensure long-form prosodic coherence without manual style annotations.
TACA-VITS is a neural text-to-speech (TTS) architecture for expressive audiobook synthesis that integrates a text-aware style embedding and a context encoder into the VITS 2.0 backbone. It is designed to capture a broad range of expressive speaking styles, as found in professional audiobook narration, directly from text inputs—without requiring manually labeled style annotations or reference speech at inference. TACA-VITS leverages cross-modal contrastive learning to align text and speech in a shared style space, and employs cross-sentence textual context to yield long-form, coherent, and highly expressive synthetic speech (Guo et al., 2024).
1. Architectural Foundations
TACA-VITS extends the single-stage VITS 2.0 architecture, which unifies variational inference, adversarial training, and normalizing flows for end-to-end TTS. In the vanilla VITS pipeline, a text encoder generates hidden representations $h_{text}$ from input phonemes $c_{text}$. The prior encoder parameterizes a normalizing-flow prior $p_\theta(z \mid c_{text})$ over the latent acoustic variable $z$, the posterior encoder $q_\phi(z \mid x)$ infers $z$ from ground-truth speech $x$, and a generator $G$—paired with a multi-period discriminator $D$—maps $z$ to the waveform.
TACA-VITS introduces two core augmentations:
- A text-aware style embedding $e_t$, inferred purely from the input text and trained to inhabit a speech-derived “style space” via cross-modal alignment.
- A context encoder that processes not only the sentence in focus but also its neighboring sentence embeddings and the quantized style code, producing context-aware representations $h_{ctx}$ that augment or replace the canonical $h_{text}$ in all generative steps.
This architectural adjustment allows TACA-VITS models to handle text-predicted expressive styles and model prosody over multiple sentences, which is central to the audiobook domain where stylistic consistency and long-range prosodic coherence are required.
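The resulting data flow differs from vanilla VITS mainly in what conditions the prior. The minimal sketch below (PyTorch, with toy stand-in modules and assumed dimensions, not the released implementation) illustrates the substitution: when a context-aware representation $h_{ctx}$ is available, it replaces $h_{text}$ as the conditioning for the latent $z$ that the generator decodes.

```python
import torch
import torch.nn as nn

class TinyVITSFlow(nn.Module):
    """Toy stand-in for the VITS prior/generator data flow (illustration only)."""
    def __init__(self, n_phones=100, d_model=192):
        super().__init__()
        self.text_encoder = nn.Embedding(n_phones, d_model)   # produces h_text
        self.prior_proj = nn.Linear(d_model, 2 * d_model)     # mean / log-var of p(z | conditioning)
        self.generator = nn.Linear(d_model, 1)                 # stand-in for the HiFi-GAN-style decoder

    def forward(self, phonemes, h_ctx=None):
        h_text = self.text_encoder(phonemes)                   # [B, T, d]
        cond = h_ctx if h_ctx is not None else h_text          # TACA-VITS: condition on h_ctx instead
        mean, logvar = self.prior_proj(cond).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)
        return self.generator(z)                               # toy "waveform" frames

phonemes = torch.randint(0, 100, (2, 17))                      # dummy phoneme IDs
print(TinyVITSFlow()(phonemes).shape)                          # torch.Size([2, 17, 1])
```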
2. Text-Aware Style Space Learning
2.1 Speech-Style Encoder
A speech-style embedding $e_s$ is extracted using an off-the-shelf SRL speech encoder initialized from HuBERT (layer 6). For each speech-text pair $(x_i, t_i)$, the embedding $e_s^i$ is computed from the speech $x_i$. The embedding space is spread out via a semi-supervised contrastive loss, encouraging a diverse expressive latent manifold to form in the absence of explicit labels.
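A minimal sketch of this feature extraction, assuming the public `facebook/hubert-base-ls960` checkpoint from Hugging Face `transformers` and mean pooling over time (the exact checkpoint and pooling used by the paper are not specified here):

```python
import torch
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-base-ls960", output_hidden_states=True)
model.eval()

waveform = torch.randn(1, 16000)       # 1 s of dummy 16 kHz audio standing in for an utterance x_i
with torch.no_grad():
    out = model(waveform)

# hidden_states[0] is the convolutional front-end output, so index 6 is transformer layer 6
layer6 = out.hidden_states[6]          # [1, T, 768]
e_s = layer6.mean(dim=1)               # utterance-level style embedding e_s (mean pooling is an assumption)
```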
2.2 Cross-modal Contrastive Training
A text encoder $E_{text}$, initialized from Chinese T5, maps the text $t_i$ to a text-style embedding $e_t^i$, which is trained to align with its speech counterpart $e_s^i$ and to preserve the structural geometry of speech-based style relations. The contrastive objective samples positive and negative text-text pairs based on the cosine similarity of their associated speech-style embeddings:
- Positive if $\cos(e_s^i, e_s^j) > \tau^{+}$
- Negative if $\cos(e_s^i, e_s^j) < \tau^{-}$
- Pairs whose similarity lies between the two thresholds are ignored.
The text-style loss combines an InfoNCE objective $\mathcal{L}_{nce}$ and a direct cosine-alignment term $\mathcal{L}_{cos}$:

$$\mathcal{L}_{style} = \mathcal{L}_{nce} + \mathcal{L}_{cos}$$
Only the T5 encoder is optimized in this stage; the speech encoder remains frozen.
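The sketch below illustrates this objective in PyTorch. The threshold values, temperature, and equal weighting of the two terms are placeholders, and both embeddings are assumed to have been projected to a shared dimensionality:

```python
import torch
import torch.nn.functional as F

def text_style_loss(e_t, e_s, tau_pos=0.8, tau_neg=0.3, temperature=0.1):
    """e_t: [B, D] text-style embeddings (trainable); e_s: [B, D] frozen speech-style embeddings."""
    e_t = F.normalize(e_t, dim=-1)
    e_s = F.normalize(e_s, dim=-1)

    # Positive / negative text-text pairs are labeled from speech-side cosine similarity
    sim_speech = e_s @ e_s.T
    pos_mask = (sim_speech > tau_pos).float()
    neg_mask = (sim_speech < tau_neg).float()
    pos_mask.fill_diagonal_(0)                       # a sample is not its own positive

    # InfoNCE over text-side similarities, restricted to labeled pairs
    sim_text = (e_t @ e_t.T) / temperature
    exp_sim = torch.exp(sim_text)
    numer = (exp_sim * pos_mask).sum(dim=-1) + 1e-8
    denom = (exp_sim * (pos_mask + neg_mask)).sum(dim=-1) + 1e-8
    has_pos = (pos_mask.sum(dim=-1) > 0).float()
    l_nce = -(torch.log(numer / denom) * has_pos).sum() / has_pos.sum().clamp(min=1)

    # Direct cross-modal cosine alignment of each text embedding to its own speech embedding
    l_cos = (1 - (e_t * e_s).sum(dim=-1)).mean()
    return l_nce + l_cos

loss = text_style_loss(torch.randn(8, 256, requires_grad=True), torch.randn(8, 256))
```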
A vector quantization (VQ) module discretizes $e_t$ to a codebook index $q$ with quantized code $e_q$, promoting the reuse of a finite number of prototypical expressive styles and stabilizing generation.
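A minimal nearest-neighbour VQ sketch with a straight-through estimator; the codebook size and dimensionality below are arbitrary placeholders:

```python
import torch
import torch.nn as nn

class StyleVQ(nn.Module):
    def __init__(self, num_codes=64, dim=768):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, e_t):
        # e_t: [B, dim] -> nearest codebook entry by L2 distance
        dist = torch.cdist(e_t, self.codebook.weight)   # [B, num_codes]
        q = dist.argmin(dim=-1)                          # discrete style index q
        e_q = self.codebook(q)                           # quantized style code e_q
        e_q = e_t + (e_q - e_t).detach()                 # straight-through estimator for gradients
        return e_q, q

e_q, q = StyleVQ()(torch.randn(4, 768))
```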
3. Context Encoder Integration
The context encoder $E_{ctx}$ combines:
- Phoneme embeddings from a three-sentence window (previous, current, next)
- The corresponding BERT embeddings
- The quantized style code $e_q$
The network architecture consists of two 1D convolutional layers over concatenated phoneme-BERT inputs, a stack of four Transformer layers (dimension 768, 8 heads), injection of the style code via addition at each frame, and a final projection to 512 dimensions (VITS-compatible). During training, the context encoder is fine-tuned on top of a pre-trained VITS model: the system is initialized from the official Bert-VITS 2.0 checkpoint, with its BERT block replaced by the context encoder.
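A hedged PyTorch sketch of this module; the phoneme-embedding dimension, kernel sizes, activation, and the assumption that BERT features are already aligned to the phoneme sequence are not given in the text:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, phone_dim=192, bert_dim=768, d_model=768, out_dim=512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(phone_dim + bert_dim, d_model, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.proj = nn.Linear(d_model, out_dim)

    def forward(self, phone_emb, bert_emb, e_q):
        # phone_emb: [B, T, phone_dim] over the three-sentence window
        # bert_emb:  [B, T, bert_dim];  e_q: [B, d_model] quantized style code
        x = torch.cat([phone_emb, bert_emb], dim=-1).transpose(1, 2)   # [B, C, T]
        x = self.convs(x).transpose(1, 2)                              # [B, T, d_model]
        x = x + e_q.unsqueeze(1)                                       # add style code at every frame
        h_ctx = self.transformer(x)
        return self.proj(h_ctx)                                        # [B, T, 512], VITS-compatible

enc = ContextEncoder()
h_ctx = enc(torch.randn(2, 120, 192), torch.randn(2, 120, 768), torch.randn(2, 768))
```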
A two-phase procedure is used:
- Pre-train VITS on standard text input.
- Fine-tune with the full context encoder, sampling style codes from either speech or text encoders to promote generalization.
4. Training Regimen and Objective Functions
4.1 Loss Formulation
During fine-tuning, the model minimizes the combined VITS objective

$$\mathcal{L}_{VITS} = \lambda_{kl}\,\mathcal{L}_{kl} + \lambda_{fm}\,\mathcal{L}_{fm} + \lambda_{mel}\,\mathcal{L}_{mel} + \lambda_{adv}\,\mathcal{L}_{adv},$$

the weighted sum of the standard VITS losses:
- KL divergence: $\mathcal{L}_{kl}$
- Flow matching: $\mathcal{L}_{fm}$
- Mel-spectrogram L1 loss: $\mathcal{L}_{mel}$
- Adversarial loss: $\mathcal{L}_{adv}$

The loss weights are fixed throughout fine-tuning.
During this phase, gradients are back-propagated through the context encoder's style pathway only; the text encoder weights are frozen.
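A sketch of how the individual loss terms could be combined; the weight values here are illustrative placeholders rather than the fixed values used in the paper:

```python
import torch

def vits_total_loss(l_kl, l_fm, l_mel, l_adv,
                    w_kl=1.0, w_fm=1.0, w_mel=45.0, w_adv=1.0):
    """Weighted sum of per-term scalar losses (weights are placeholders)."""
    return w_kl * l_kl + w_fm * l_fm + w_mel * l_mel + w_adv * l_adv

total = vits_total_loss(torch.tensor(0.2), torch.tensor(0.5),
                        torch.tensor(1.3), torch.tensor(0.7))
```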
4.2 Data and Optimization
The model is trained and fine-tuned on:
- 100H-Multi-Style (66 hours manually labeled) for the speech-style encoder
- 6kH unlabeled audiobook speech (auto-transcribed) for text-style contrastive learning
- 20H-Audiobook-HQ (utterance-ordered) for full-system fine-tuning
Optimizers are AdamW for the T5 text encoder and the speech-style encoder (each with its own fixed learning rate) and Adam for the full TACA-VITS system. The style space is discretized with a fixed-size VQ codebook.
5. Empirical Evaluation
5.1 Style Alignment
TACA-VITS’s text encoder achieves strong alignment with the speech-style encoder: the T5-based text encoder reaches a higher cosine similarity with the speech-style embeddings than a BERT-based baseline. t-SNE visualization of 450 sentences across 9 chapters shows that the learned text-style vectors cluster by chapter while remaining well dispersed within each cluster, reflecting both stylistic consistency and intra-chapter variability.
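A minimal sketch of how such a visualization can be produced with scikit-learn; the style vectors and chapter labels below are synthetic placeholders:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

style_vectors = np.random.randn(450, 768)   # stand-in for 450 sentence-level text-style vectors e_t
chapters = np.repeat(np.arange(9), 50)      # 9 chapters x 50 sentences each

coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(style_vectors)
plt.scatter(coords[:, 0], coords[:, 1], c=chapters, cmap="tab10", s=8)
plt.title("Text-style embeddings by chapter (t-SNE)")
plt.show()
```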
5.2 Objective and Subjective Metrics
Objective TTS performance is reported via Mel Cepstral Distortion (MCD, dB) and Character Error Rate (CER, %):
| TTS Model | MCD (dB) ↓ | CER (%) ↓ |
|---|---|---|
| VITS (base) | 4.23 | 5.80 |
| TACA-VITS | 4.21 | 5.99 |
| LM (base) | 8.92 | 13.9 |
| TACA-LM | 8.15 | 13.1 |
TACA-VITS stays on par with the baseline VITS in both metrics (marginally lower MCD, marginally higher CER). For LM-based models, the TACA variant yields measurable improvements in both MCD and CER.
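For reference, a hedged sketch of the standard MCD computation reported above, assuming frame-aligned mel-cepstra with the 0th (energy) coefficient excluded:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """mcep_*: [T, D] aligned mel-cepstra; returns MCD in dB."""
    diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]                 # skip the energy coefficient c0
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(per_frame)

ref = np.random.randn(200, 25)                                # dummy aligned cepstra
syn = ref + 0.05 * np.random.randn(200, 25)
print(round(mel_cepstral_distortion(ref, syn), 3))
```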
Subjective naturalness and expressiveness are assessed using MOS (1–5 scale, 20 native listeners):
| Model | NMOS↑ | EMOS↑ |
|---|---|---|
| VITS (base) | 3.84±0.11 | 3.61±0.14 |
| TACA-VITS | 3.90±0.10 | 3.93±0.11 |
| LM (base) | 2.91±0.09 | 3.80±0.11 |
| TACA-LM | 3.22±0.10 | 4.05±0.10 |
TACA-VITS yields a +0.06 gain in NMOS and +0.32 in EMOS over VITS. The improvements for TACA-LM over LM (0.31 NMOS, 0.25 EMOS) further confirm the effectiveness of the TACA methodology for both VITS- and LM-based TTS.
6. Analytical Perspectives
- Cross-modal contrastive learning taps large-scale unannotated audiobook data to induce a nuanced, high-dimensional style manifold from text. This facilitates fine-grained control of style and prosody without style labels or reference audio.
- Vector quantization in the style space promotes re-use of prototypical style embeddings, which contributes to synthesis stability and coverage of diverse expressive modes.
- The context encoder, with its access to past and future sentences, enables coherent long-form prosody—addressing a well-known shortcoming in prior sentence-level TTS systems.
- Text-based style prediction allows fully text-driven inference, removing the need for external style references and supporting scalable deployment.
- Both the text-style modeling and context encoder modules are modular and can be integrated into other single-stage TTS systems.
7. Significance and Future Directions
TACA-VITS advances the state of expressive TTS in the audiobook domain. By combining a robust text-predicted style embedding, context modeling over sentence windows, and a rigorously balanced VITS-style training objective, TACA-VITS achieves high-fidelity, expressive, and contextually coherent speech synthesis without explicit reference data or style annotation requirements. A plausible implication is that these mechanisms could generalize well across languages and genres, especially for long-form, multi-style synthesizer deployments (Guo et al., 2024).