PAST: Phonetic-Acoustic Speech Tokenizer

Published 20 May 2025 in cs.SD, cs.CL, cs.LG, and eess.AS | (2505.14470v2)

Abstract: We present PAST, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. Unlike previous approaches that rely on pretrained self-supervised models, PAST employs supervised phonetic data, directly integrating domain knowledge into the tokenization process via auxiliary tasks. Additionally, we introduce a streamable, causal variant of PAST, enabling real-time speech applications. Results demonstrate that PAST surpasses existing evaluated baseline tokenizers across common evaluation metrics, including phonetic representation and speech reconstruction. Notably, PAST also achieves superior performance when serving as a speech representation for speech LLMs, further highlighting its effectiveness as a foundation for spoken language generation. To foster further research, we release the full implementation. For code, model checkpoints, and samples see: https://pages.cs.huji.ac.il/adiyoss-lab/PAST

Authors (3)

Summary

The paper presents a novel supervised framework that integrates phonetic classification and CTC tasks to improve both phonetic representation and signal reconstruction.
The architecture combines an encoder (a convolutional module followed by a transformer), a Residual Vector Quantization (RVQ) quantizer, and a decoder, with auxiliary heads attached to the output of the first quantizer.
Experimental results demonstrate that PAST outperforms state-of-the-art models with a PNMI of 0.75 and SISNR of 4.84 while reducing computational overhead.

Introduction

The paper presents PAST, a Phonetic-Acoustic Speech Tokenizer that models phonetic information jointly with signal reconstruction. Unlike previous approaches that rely on pretrained self-supervised learning (SSL) models, PAST uses supervised phonetic data, integrating domain knowledge directly into the tokenization process via auxiliary tasks. The paper also introduces a streamable, causal variant of PAST for real-time speech applications. PAST outperforms the evaluated baseline tokenizers on common metrics for both phonetic representation and speech reconstruction.

Method and Architecture

PAST introduces a novel architecture comprising three main components: Encoder, Quantizer, and Decoder. The encoder leverages a convolutional module followed by a transformer encoder, while the quantizer employs Residual Vector Quantization (RVQ) to generate discrete tokens. The model's design is shown in Figure 1, where auxiliary heads process the output of the first vector quantization module.

Figure 1: Schematic of the PAST pipeline. The auxiliary heads use the output of the first vector quantization module as input.
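The residual quantization step can be illustrated with a short, dependency-free sketch. It shows generic RVQ only: each stage picks the nearest codebook entry for whatever the previous stages left unexplained, and decoding sums the chosen entries. The function names and the toy codebooks are assumptions made for illustration and are not taken from the PAST implementation.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next stage sees only the error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy example (hypothetical values): a coarse stage followed by a fine stage.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.1, -0.1]],
]
toy_indices = rvq_encode([1.1, 0.9], toy_codebooks)
toy_reconstruction = rvq_decode(toy_indices, toy_codebooks)
```

In this picture, the first entry of the index list corresponds to the first quantization stage, which is the stream the auxiliary heads in Figure 1 read from.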

PAST adds two auxiliary tasks, phonetic classification and connectionist temporal classification (CTC) for character alignment, so that the tokens capture phonetic content directly. These tasks replace distillation from SSL models, which simplifies the architecture and reduces computational overhead.
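To make the two auxiliary objectives concrete, the sketch below computes a frame-level cross-entropy for phone labels and a CTC loss using the standard forward algorithm in log space. It is a minimal illustration written in plain Python, not the paper's training code; the function names and the loss weights in `combined_auxiliary_loss` are hypothetical, and the weighting PAST actually uses is not stated in this summary.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the arguments."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_softmax(logits):
    z = logsumexp(*logits)
    return [x - z for x in logits]


def frame_cross_entropy(logits_per_frame, phone_labels):
    """Mean negative log-likelihood of the labelled phone at each frame."""
    total = 0.0
    for logits, label in zip(logits_per_frame, phone_labels):
        total -= log_softmax(logits)[label]
    return total / len(phone_labels)


def ctc_loss(logits_per_frame, target, blank=0):
    """Negative log-probability of target, summed over all CTC alignments."""
    log_probs = [log_softmax(frame) for frame in logits_per_frame]
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]  # interleave blanks around every symbol
    size = len(ext)
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]
    for lp in log_probs[1:]:
        new = [NEG_INF] * size
        for s in range(size):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # A blank may be skipped only between two different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = logsumexp(*cands) + lp[ext[s]]
        alpha = new
    tail = [alpha[-1]] + ([alpha[-2]] if size > 1 else [])
    return -logsumexp(*tail)


def combined_auxiliary_loss(phone_ce, ctc, w_phone=1.0, w_ctc=1.0):
    """Weighted sum of the two auxiliary terms (weights are hypothetical)."""
    return w_phone * phone_ce + w_ctc * ctc
```

In a tokenizer trained this way, a term like `combined_auxiliary_loss` would be added to the reconstruction objective, so gradients from the phonetic heads shape the quantized representation.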

Experimental Evaluation

The efficacy of PAST is demonstrated through comprehensive experiments comparing it with state-of-the-art tokenizers such as SpeechTokenizer and X-Codec. PAST outperforms these baselines across phonetic and acoustic metrics as seen in Table 1 and Table 2. Notably, PAST achieves a Phonetic Normalized Mutual Information (PNMI) score of 0.75 and superior signal reconstruction quality, with a Scale-Invariant Signal-to-Noise Ratio (SISNR) of 4.84.
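PNMI is commonly defined as the mutual information between frame-level phone labels and token indices, divided by the entropy of the phone labels, so that 1 means the tokens fully determine the phone and 0 means they carry no phonetic information. The sketch below follows that common definition and assumes the phone labels and tokens are already aligned frame by frame; the function name and toy inputs are illustrative, and the paper's exact evaluation setup may differ.

```python
import math
from collections import Counter


def pnmi(phones, units):
    """Phone-normalized mutual information: I(phone; unit) / H(phone)."""
    n = len(phones)
    joint = Counter(zip(phones, units))
    phone_counts = Counter(phones)
    unit_counts = Counter(units)

    # Entropy of the phone labels.
    h_phone = -sum((c / n) * math.log(c / n) for c in phone_counts.values())
    if h_phone == 0.0:
        return 0.0  # a single phone class carries no information to explain

    # Mutual information between phone labels and discrete units.
    mi = 0.0
    for (phone, unit), c in joint.items():
        p_joint = c / n
        p_indep = (phone_counts[phone] / n) * (unit_counts[unit] / n)
        mi += p_joint * math.log(p_joint / p_indep)
    return mi / h_phone


# Toy frame sequences (hypothetical): tokens that track phones vs. tokens that do not.
aligned_score = pnmi(["a", "a", "b", "b"], [0, 0, 1, 1])
unrelated_score = pnmi(["a", "a", "b", "b"], [0, 1, 0, 1])
```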

Phonetic Information and Reconstruction Metrics

Table 1 reports PAST's performance on phonetic metrics, and Table 2 reports signal reconstruction quality. The results indicate that PAST captures notably more phonetic information without compromising signal quality.
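For the reconstruction side, SI-SNR can be sketched as follows. This uses the widely used zero-mean definition: the estimate is projected onto the reference, and the ratio of projected energy to residual energy is reported in dB. The function name and the `eps` stabiliser are assumptions for illustration; the paper's evaluation code may differ in details.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB for equal-length signals."""
    n = len(reference)
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]   # zero-mean estimate
    r = [x - mean_r for x in reference]  # zero-mean reference

    # Project the estimate onto the reference to isolate the target component.
    scale = sum(a * b for a, b in zip(e, r)) / (sum(b * b for b in r) + eps)
    target = [scale * b for b in r]
    noise = [a - t for a, t in zip(e, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


# Toy signals (hypothetical values).
toy_reference = [1.0, -1.0, 1.0, -1.0]
toy_estimate = [1.1, -0.9, 1.0, -1.0]
toy_score = si_snr(toy_estimate, toy_reference)
```

Because of the projection step, rescaling the estimate by a constant gain leaves the score essentially unchanged, which is what makes the metric scale-invariant.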

Speech Language Modeling

PAST is further validated on Speech Language Modeling (SLM). Using a common backbone based on AudioGen's architecture, language models built on PAST tokens achieve higher sWUGGY scores, indicating that the tokens preserve phonetic information in a form the language model can use.
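sWUGGY is a spot-the-word test: the language model sees pairs consisting of a real word and a similar-sounding non-word, and it is counted as correct when it assigns the real word a higher likelihood. The sketch below shows only that scoring logic. The `score` callable stands in for a speech language model's log-likelihood over a token sequence and is purely hypothetical, as are the toy token sequences; the benchmark's official tooling may normalise scores differently.

```python
def swuggy_accuracy(pairs, score):
    """Fraction of (real, fake) pairs where the real word scores higher."""
    correct = sum(1 for real, fake in pairs if score(real) > score(fake))
    return correct / len(pairs)


# Hypothetical log-likelihoods for token sequences, keyed by tuple of token ids.
toy_log_likelihoods = {
    (3, 7, 7, 2): -4.1,   # real word
    (3, 7, 9, 2): -6.8,   # matched non-word
    (5, 1, 4): -5.0,      # real word
    (5, 8, 4): -4.2,      # matched non-word (the toy model gets this pair wrong)
}


def toy_score(tokens):
    return toy_log_likelihoods[tuple(tokens)]


toy_pairs = [
    ((3, 7, 7, 2), (3, 7, 9, 2)),
    ((5, 1, 4), (5, 8, 4)),
]
toy_accuracy = swuggy_accuracy(toy_pairs, toy_score)
```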

Component Ablation and Analysis

An ablation study examines the contribution of each component of PAST. The auxiliary losses in particular are critical for encoding phonetic information in the latent space. The transformer module, combined with skip-connection sampling, improves sequence modeling and training stability.

Conclusion

PAST advances speech tokenization by optimizing phonetic and acoustic representations jointly through supervised learning. By removing the dependency on SSL models, it simplifies the tokenization architecture while outperforming the evaluated baselines. Future work may extend PAST to multilingual settings, broadening its applicability.

By presenting PAST, this paper contributes a robust framework that enhances the integration of phonetic information within tokenization processes, providing a pathway for more accurate and efficient speech LLM applications.
