
VietNormalizer: An Open-Source, Dependency-Free Python Library for Vietnamese Text Normalization in TTS and NLP Applications

Published 4 Mar 2026 in cs.CL and cs.NE | (2603.04145v1)

Abstract: We present VietNormalizer, an open-source, zero-dependency Python library for Vietnamese text normalization targeting Text-to-Speech (TTS) and NLP applications. Vietnamese text normalization is a critical yet underserved preprocessing step: real-world Vietnamese text is densely populated with non-standard words (NSWs), including numbers, dates, times, currency amounts, percentages, acronyms, and foreign-language terms, all of which must be converted to fully pronounceable Vietnamese words before TTS synthesis or downstream language processing. Existing Vietnamese normalization tools either require heavy neural dependencies while covering only a narrow subset of NSW classes, or are embedded within larger NLP toolkits without standalone installability. VietNormalizer addresses these gaps through a unified, rule-based pipeline that: (1) converts arbitrary integers, decimals, and large numbers to Vietnamese words; (2) normalizes dates and times to their spoken Vietnamese forms; (3) handles VND and USD currency amounts; (4) expands percentages; (5) resolves acronyms via a customizable CSV dictionary; (6) transliterates non-Vietnamese loanwords and foreign terms to Vietnamese phonetic approximations; and (7) performs Unicode normalization and emoji/special-character removal. All regular expression patterns are pre-compiled at initialization, enabling high-throughput batch processing with minimal memory overhead and no GPU or external API dependency. The library is installable via pip install vietnormalizer, available on PyPI and GitHub at https://github.com/nghimestudio/vietnormalizer, and released under the MIT license. We discuss the design decisions, limitations of existing approaches, and the generalizability of the rule-based normalization paradigm to other low-resource tonal and agglutinative languages.

Summary

  • The paper presents a comprehensive rule-based normalization pipeline that converts Vietnamese non-standard words, including numbers, dates, and loanwords, into spoken forms.
  • It employs a seven-stage process, from Unicode normalization to acronym expansion, ensuring high throughput with sub-millisecond processing per utterance.
  • By eliminating neural dependencies, VietNormalizer offers robust and scalable text normalization for TTS and NLP in resource-constrained environments.

VietNormalizer: A Zero-Dependency Vietnamese Text Normalization Library for TTS and NLP

Introduction

Text normalization (TN) is fundamental in TTS and NLP pipelines, enabling conversion of written-form input, often replete with NSWs (numbers, dates, currencies, acronyms, foreign terms), into canonical spoken forms. The Vietnamese language poses acute normalization challenges: tonal morphology encoded in diacritics, irregular number verbalization patterns, dense NSW populations in real-world text, and extensive loanword integration. Existing Vietnamese TN tools either embed normalization within larger frameworks with neural dependencies or cover only a limited set of NSW classes. VietNormalizer introduces a comprehensive, zero-dependency Python library that delivers robust rule-based normalization, addressing limitations in scope, deployability, and extensibility previously seen in the Vietnamese NLP ecosystem (2603.04145).

Technical Contributions

VietNormalizer implements a seven-stage normalization pipeline, entirely in pure Python 3.8+:

  • Unicode normalization: NFC standardization, removal of emojis/special characters.
  • Date/time verbalization: Conversion of DD/MM/YYYY and HH:MM patterns to fully spoken Vietnamese strings, ensuring syntactic correctness and proper tone assignment.
  • Currency/percentage expansion: Handling both VND and USD, as well as percentage expressions.
  • Recursive number verbalization: Robust decomposition with language-specific handling for forms like "mười", "hai mươi", etc., covering the entire integer and decimal range.
  • Acronym and loanword resolution: CSV-configurable dictionaries for expansion/transliteration, with single-regex batch matching for O(n) throughput relative to input length.
  • High-throughput batch processing: All regex patterns pre-compiled at initialization; no neural or external API dependencies; optimized for large-scale TTS/ASR corpus preparation.
  • Public API: Direct pip installation, modular dictionary management, flexible preprocessing modes.

This design yields practical benefits: minimal memory footprint, sub-millisecond processing latencies per utterance, and immediate suitability for resource-constrained deployments, including embedded hardware and production-grade batch pipelines.
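To make the number-verbalization stage concrete, the following is a minimal sketch of recursive Vietnamese integer verbalization for 0–999. It is illustrative only, not the library's implementation (which covers the full integer and decimal range); it does, however, show the language-specific irregularities the paper mentions, such as "một" becoming "mốt" and "năm" becoming "lăm" after "mươi". The connective "lẻ" is one regional variant; "linh" is also common.

```python
# Illustrative sketch of Vietnamese number verbalization (0-999 only).
# Not the library's actual code; it demonstrates the irregular forms
# that a recursive decomposition must handle.
DIGITS = ["không", "một", "hai", "ba", "bốn",
          "năm", "sáu", "bảy", "tám", "chín"]

def verbalize(n: int) -> str:
    if n < 10:
        return DIGITS[n]
    if n < 20:
        unit = n % 10
        if unit == 0:
            return "mười"
        return "mười lăm" if unit == 5 else "mười " + DIGITS[unit]
    if n < 100:
        tens, unit = divmod(n, 10)
        words = DIGITS[tens] + " mươi"
        if unit == 0:
            return words
        if unit == 1:
            return words + " mốt"   # "một" -> "mốt" after "mươi"
        if unit == 5:
            return words + " lăm"   # "năm" -> "lăm" after "mươi"
        return words + " " + DIGITS[unit]
    hundreds, rest = divmod(n, 100)
    words = DIGITS[hundreds] + " trăm"
    if rest == 0:
        return words
    if rest < 10:
        return words + " lẻ " + DIGITS[rest]  # "lẻ" joins a bare unit digit
    return words + " " + verbalize(rest)
```

Extending this scheme to thousands ("nghìn"), millions ("triệu"), and billions ("tỷ") follows the same recursive pattern, which is presumably what enables the library's full-range coverage.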

Comparative Analysis with Prior Approaches

VietNormalizer distinctly diverges from neural approaches (e.g., BERT-BiGRU-CRF pipelines [trang2022nsw], ViSoLex [nguyen2025visolex]) by eliminating dependency and inference overhead. Neural TN methods exhibit strong in-distribution accuracy but require several gigabytes of model weight storage and incur nontrivial inference latencies, fundamentally limiting their deployability in real-time applications and low-resource environments. Furthermore, neural approaches are sensitive to error propagation from NSW detection, with limited domain generalization.

Toolkit-embedded normalization (e.g., underthesea [underthesea2022]) is restricted to Unicode/diacritic correction, omitting functional NSW expansion and loanword transliteration required for TTS. These toolkits force users to install irrelevant NLP components, adding unnecessary overhead.

VietNormalizer addresses all NSW categories encountered in Vietnamese TTS/NLP (numbers, dates/times, currencies, percentages, acronyms, loanwords) through explicit, auditable rule-based mappings. Context-dependent ambiguity is handled via ordered rule application and priority-based pattern matching, defaulting to statistically predominant interpretations in Vietnamese corpora, a practical solution drawn from production TN workflows [ebden2015kestrel].
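The effect of ordered rule application can be illustrated with a small sketch (the rules and verbalizations here are hypothetical examples, not the library's actual rule set): a high-priority date rule consumes slash-delimited dates before a lower-priority fraction rule can misinterpret them.

```python
import re

# Rules are applied in order; earlier rules have higher priority.
rules = [
    # Date rule first: otherwise "12/03/2024" would be picked apart
    # by the generic slash rule below.
    (re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b"),
     lambda m: f"ngày {m.group(1)} tháng {int(m.group(2))} năm {m.group(3)}"),
    # Lower-priority rule for bare fractions ("trên" = "over").
    (re.compile(r"\b(\d+)/(\d+)\b"),
     lambda m: f"{m.group(1)} trên {m.group(2)}"),
]

def normalize(text: str) -> str:
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    return text
```

A full pipeline would additionally verbalize the remaining digits; the point here is only that rule ordering resolves the date-versus-fraction ambiguity deterministically.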

Empirical Results and Claims

While the paper focuses on architectural and functional coverage rather than direct accuracy evaluation, it asserts:

  • Comprehensive coverage: All major Vietnamese NSW types are normalized with deterministic, interpretable rules.
  • Performance: Batch normalization throughput is reported as tens of thousands of utterances per minute per CPU core, suitable for usage in pipelines like VietSuperSpeech [do2026vietsuperspeech].
  • Deployability: Zero external dependencies and sub-millisecond normalization enable robust usage even on serverless or embedded platforms.
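The throughput claim rests largely on compiling every pattern once at initialization and reusing it across the corpus. A minimal sketch of that structure (with hypothetical pattern names and example verbalizations, not the library's API):

```python
import re

# Compiled once at module/initialization time, then reused for the
# whole batch -- no per-call compilation cost, no external dependencies.
PERCENT = re.compile(r"(\d+(?:[.,]\d+)?)\s*%")
TIME = re.compile(r"\b(\d{1,2}):(\d{2})\b")

def normalize_batch(lines):
    out = []
    for line in lines:
        line = PERCENT.sub(lambda m: m.group(1) + " phần trăm", line)
        line = TIME.sub(lambda m: m.group(1) + " giờ " + m.group(2) + " phút", line)
        out.append(line)
    return out
```

Because each stage is a single pass of a pre-compiled pattern over the input, total work grows linearly with corpus size, which is consistent with the reported CPU-only throughput.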

These claims are supported by API examples and by a comparison table contrasting VietNormalizer with contemporary Vietnamese TN solutions.

Practical and Theoretical Implications

VietNormalizer represents a blueprint for rule-based TN in low-resource languages, especially those with tonal and agglutinative structures. For languages lacking labeled TN corpora or neural TN tools, the architecture enables rapid authoring, extensibility, and scalable deployment, leveraging linguistic expertise without the burden of neural infrastructure.

Practically, VietNormalizer allows immediate integration in large TTS/ASR corpus pipelines, supporting corpus construction, fine-tuning, and real-time applications with negligible overhead. The CSV-driven extensibility supports adaptation to domain-specific acronyms or loanwords.
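A plausible shape for that CSV-driven extensibility is a simple two-column mapping from acronym to expansion, loaded into a dictionary at initialization. The schema and entries below are assumptions for illustration; the library's actual CSV format may differ.

```python
import csv
import io

def load_acronyms(csv_text: str) -> dict:
    """Parse a hypothetical two-column CSV (acronym, expansion)."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}

# Example entries: UBND = "people's committee", THPT = "high school".
sample = "UBND,ủy ban nhân dân\nTHPT,trung học phổ thông\n"
table = load_acronyms(sample)
```

Users could then append domain-specific rows (medical, legal, financial acronyms) without touching library code, which is the practical payoff of keeping the dictionaries in plain CSV.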

Theoretically, the pipeline demonstrates the continued relevance and necessity of rule-based TN in contexts where neural TN is infeasible due to data or resource constraints. It also highlights the structural adaptability of its methods to related languages (Thai, Khmer, Lao, Burmese), potentially facilitating multilingual TTS development without requiring extensive labeled datasets or advanced linguistic engineering.

Limitations and Prospects

The paper acknowledges several limitations:

  • Contextual disambiguation: Ambiguous NSWs requiring syntactic analysis cannot be exhaustively resolved; future work may include lightweight POS-tagging or context-window heuristics.
  • Proper noun recognition and code-switching: Person/place/organization names and unseen foreign terms are not always covered; future versions may incorporate NER modules and token-level language identification.
  • Dictionary coverage: Built-in dictionaries are not exhaustive; community contributions are encouraged.

The authors plan to address inverse text normalization (ITN) in future releases, broadening the utility into written-to-spoken and spoken-to-written conversions.

Conclusion

VietNormalizer delivers a robust, dependency-free, rule-based Vietnamese TN solution tailored for TTS and NLP applications. By combining comprehensive NSW coverage, high-throughput batch processing, and extensible dictionary-based handling, it fills a critical gap in the Vietnamese NLP landscape. Its architecture and paradigm establish a replicable framework for other low-resource, tonal, and agglutinative languages, offering practical and scalable text normalization without the prohibitive cost and complexity of neural models. The open-source nature and modularity support broad adoption and adaptation for domain-specific requirements and multilingual pipeline development.
