
Techniques and Challenges in Speech Synthesis (1709.07552v1)

Published 22 Sep 2017 in cs.SD and eess.AS

Abstract: The aim of this project was to develop and implement an English language Text-to-Speech synthesis system. This involved a study of mechanisms of human speech production, a review of techniques in speech synthesis, and analysis of tests used to evaluate the effectiveness of synthesized speech. It was determined that a diphone synthesis system was the most effective choice for the scope of this project. A method of automatically identifying and extracting diphones from prompted speech was designed, allowing for the creation of a diphone database by a speaker in less than 40 minutes. CMUdict was used to determine the pronunciation of known words. A system for smoothing the transitions between diphone recordings was designed and implemented. CMUdict was then used to train a maximum-likelihood prediction system to determine the correct pronunciation of unknown English language alphabetic words. Then, a Part Of Speech tagger was designed to find the lexical class of words within a sentence. A method of altering the pitch, duration, and volume of the produced voice over time was designed, being a combination of the TD-PSOLA algorithm and a novel approach referred to as Unvoiced Speech Duration Shifting. This minimises distortion of the voice when shifting the pitch or duration, while maximising computational efficiency by operating in the time domain. This approach was used to add correct lexical stress to vowels within words. A text tokenisation system was developed to handle arbitrary text input, allowing pronunciation of numerical input tokens and use of appropriate pauses for punctuation. Methods for further improving sentence level speech naturalness were discussed. Finally, the system was tested with listeners for its intelligibility and naturalness.
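The text tokenisation step mentioned in the abstract (spelling out numerical tokens and inserting pauses at punctuation) could be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the function names `number_to_words` and `tokenise`, the pause durations in `PAUSE_MS`, and the British-style "hundred and" wording are all assumptions made for the example.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical pause lengths in milliseconds; the paper does not state these.
PAUSE_MS = {",": 200, ";": 300, ":": 300, ".": 500, "!": 500, "?": 500}

# A token is a run of digits, a run of letters/apostrophes, or one punctuation mark.
TOKEN_RE = re.compile(r"\d+|[A-Za-z']+|[,.;:!?]")


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999999 as space-separated words."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = "" if rest == 0 else " and " + number_to_words(rest)
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = "" if rest == 0 else " " + number_to_words(rest)
    return number_to_words(thousands) + " thousand" + tail


def tokenise(text: str) -> list:
    """Turn raw text into ("word", str) and ("pause", milliseconds) items."""
    items = []
    for tok in TOKEN_RE.findall(text):
        if tok.isdigit():
            if int(tok) <= 999_999:
                words = number_to_words(int(tok)).split()
            else:
                # Very long numbers are read digit by digit (an assumption).
                words = [ONES[int(d)] for d in tok]
            items.extend(("word", w) for w in words)
        elif tok in PAUSE_MS:
            items.append(("pause", PAUSE_MS[tok]))
        else:
            items.append(("word", tok.lower()))
    return items
```

In a pipeline like the one described, the resulting word items would then be passed to pronunciation lookup (CMUdict for known words, the prediction system for unknown ones), while pause items would become silence of the given length.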
