
Incremental FastPitch: Chunk-based High Quality Text to Speech (2401.01755v1)

Published 3 Jan 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Parallel text-to-speech models have been widely applied for real-time speech synthesis, offering more controllability and a much faster synthesis process than conventional auto-regressive models. Although parallel models have benefits in many respects, their fully parallel architectures, such as the Transformer, make them naturally unsuited to incremental synthesis. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks by improving the architecture with chunk-based FFT blocks, training with receptive-field constrained chunk attention masks, and performing inference with fixed-size past model states. Experimental results show that our proposal produces speech quality comparable to parallel FastPitch, with significantly lower latency that allows even lower response times for real-time speech applications.
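
The receptive-field constrained chunk attention mask mentioned in the abstract can be illustrated with a minimal sketch: frames within a Mel chunk attend only to their own chunk and to a fixed number of past chunks, which bounds the receptive field and keeps the cached past state fixed in size during incremental inference. The function name and parameters below (chunk_size, past_chunks) are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def chunk_attention_mask(seq_len: int, chunk_size: int, past_chunks: int = 1) -> torch.Tensor:
    """Sketch of a receptive-field constrained chunk attention mask.

    Each query frame may attend to key frames in its own chunk and in up to
    `past_chunks` preceding chunks; all future chunks are masked out.
    Hypothetical helper -- not the authors' exact API.
    """
    chunk_ids = torch.arange(seq_len) // chunk_size   # chunk index of each frame, shape (T,)
    q_chunk = chunk_ids.unsqueeze(1)                   # (T, 1) query-side chunk indices
    k_chunk = chunk_ids.unsqueeze(0)                   # (1, T) key-side chunk indices
    # Allow keys that are not in a future chunk and lie within the receptive field.
    allowed = (k_chunk <= q_chunk) & (k_chunk >= q_chunk - past_chunks)
    return allowed                                     # (T, T) boolean mask

# Example: 12 frames, chunks of 4 frames, attend to the current chunk plus 1 past chunk.
mask = chunk_attention_mask(seq_len=12, chunk_size=4, past_chunks=1)
# The mask can be applied in scaled dot-product attention by setting
# disallowed positions to -inf before the softmax.
```
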

