RNN Approaches to Text Normalization: A Challenge
The paper "RNN Approaches to Text Normalization: A Challenge" authored by Richard Sproat and Navdeep Jaitly from Google, Inc., addresses the intricate problem of text normalization through the lens of recurrent neural networks (RNNs). This work is pivotal for advancements in text-to-speech (TTS) and automatic speech recognition (ASR) systems, where converting written forms into spoken equivalents is essential.
Overview
Text normalization in the context of this research involves transforming non-standard written forms, such as numerical expressions, into a format suitable for verbalization. For instance, the written token "6ft" would be normalized to "six feet" when spoken. The authors pose a challenge to the research community by releasing a publicly available dataset and encouraging the development of effective RNN-based models.
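To make the task concrete, the sketch below shows a toy rule-based mapping for a single measure pattern. The function name, lookup tables, and regular expression are illustrative stand-ins and are not part of the paper's system, which covers many more semiotic classes (dates, times, currency, measures, and so on).

```python
import re

# Toy written-to-spoken lookup tables (hypothetical; a real normalizer covers
# many semiotic classes, each with its own grammar).
UNIT_WORDS = {"ft": "feet", "kg": "kilograms", "mi": "miles"}
NUMBER_WORDS = {"2": "two", "6": "six", "900": "nine hundred"}

def normalize_measure(token: str) -> str:
    """Map a written measure expression such as '6ft' to its spoken form."""
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", token)
    if not match or match.group(2) not in UNIT_WORDS:
        return token  # tokens with no special reading pass through unchanged
    number, unit = match.groups()
    return f"{NUMBER_WORDS.get(number, number)} {UNIT_WORDS[unit]}"

print(normalize_measure("6ft"))   # six feet
print(normalize_measure("word"))  # word
```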
Experimental Methodology
The authors conducted experiments using various RNN architectures on a large dataset derived from English and Russian texts, which had been processed through Google's Kestrel text normalization system. The experiments evaluated the efficacy of different neural architectures in accurately predicting the normalized output. The architectures included both shallow and deep configurations of LSTM models, as well as sequence-to-sequence models with attention mechanisms.
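As a rough illustration of the sequence-to-sequence setup, the sketch below builds a minimal character-level encoder-decoder with dot-product attention in Keras. The vocabulary size, layer widths, and single-layer LSTMs are assumptions chosen for brevity; they do not reproduce the authors' architectures or hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 256   # assumed byte-level vocabulary
EMB_DIM, HIDDEN = 64, 256

# Encoder: embed the written-form characters and run an LSTM over them.
enc_inputs = layers.Input(shape=(None,), dtype="int32")
enc_embedded = layers.Embedding(VOCAB_SIZE, EMB_DIM)(enc_inputs)
enc_outputs, state_h, state_c = layers.LSTM(
    HIDDEN, return_sequences=True, return_state=True)(enc_embedded)

# Decoder: predict spoken-form characters while attending over encoder states.
dec_inputs = layers.Input(shape=(None,), dtype="int32")
dec_embedded = layers.Embedding(VOCAB_SIZE, EMB_DIM)(dec_inputs)
dec_outputs = layers.LSTM(HIDDEN, return_sequences=True)(
    dec_embedded, initial_state=[state_h, state_c])
context = layers.Attention()([dec_outputs, enc_outputs])  # dot-product attention
logits = layers.Dense(VOCAB_SIZE)(layers.Concatenate()([dec_outputs, context]))

model = Model([enc_inputs, dec_inputs], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
```

Training such a model requires pairs of written and spoken token sequences with teacher forcing on the decoder inputs; the paper's dataset provides exactly this kind of aligned supervision.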
Key Findings
The results showed that while some RNN architectures achieved high overall accuracy, significant challenges persisted with certain semiotic classes such as measure expressions and currency amounts. The models occasionally substituted related but incorrect terms, for example reading £900 as "nine hundred euros."
The experiments demonstrated the limitations of relying solely on RNNs for text normalization: the models were prone to errors that would produce incorrect verbalizations in practical applications. Notably, the correct reading was often among the candidates the models proposed, but that alone was not enough to guarantee dependable output in every case.
Finite-State Transducer Integration
To address these shortcomings, the authors explored the integration of finite-state transducers (FSTs) as a filtering mechanism. This combination allowed for more precise predictions by constraining the output of the RNNs with rule-based filters. The FST-based filter effectively mitigated many errors by ensuring that the output matched a predefined set of acceptable transformations.
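The sketch below illustrates the filtering idea in plain Python. A small function stands in for the overgenerating covering grammar (in the paper, an FST), enumerating acceptable readings of a written token, and the filter keeps the highest-ranked RNN hypothesis that the grammar accepts. The grammar contents, function names, and fallback behaviour are assumptions for illustration only.

```python
def covering_grammar(written: str) -> set[str]:
    """Stand-in for the overgenerating grammar: acceptable readings of a token."""
    # Hypothetical coverage for one currency pattern; a real grammar is far richer.
    if written == "£900":
        return {"nine hundred pounds", "nine hundred pounds sterling"}
    return {written}  # by default, a token may be read as itself

def filter_candidates(written: str, ranked_hypotheses: list[str]) -> str:
    """Return the best RNN hypothesis that the covering grammar accepts."""
    allowed = covering_grammar(written)
    for hypothesis in ranked_hypotheses:   # hypotheses assumed ordered best-first
        if hypothesis in allowed:
            return hypothesis
    return ranked_hypotheses[0]            # fall back to the top hypothesis

# The RNN's top guess ("nine hundred euros") is rejected; the filter recovers
# the acceptable reading further down the n-best list.
print(filter_candidates("£900", ["nine hundred euros", "nine hundred pounds"]))
```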
Implications and Future Directions
This research challenges the community to innovate beyond the "turn-the-crank" approaches commonly associated with deep learning. The findings underscore the need for hybrid models that combine learned distributions with rule-based constraints, particularly for tasks where precision is critical, such as TTS systems.
In future work, researchers might explore alternative architectures or hybrid strategies that bridge the gap between the broad modeling power of RNNs and the strict accuracy requirements of text normalization. Moreover, expanding the dataset to additional languages could further clarify the challenges and guide the development of more universally applicable solutions.
The paper stands as a call to action for the research community to engage with the proposed dataset and challenge, advancing the field of text normalization with innovative, AI-driven solutions. The area remains fertile ground for exploration, with significant potential to improve the linguistic fidelity of digital voice systems.