- The paper presents a comprehensive rule-based normalization pipeline that converts Vietnamese non-standard words, including numbers, dates, and loanwords, into spoken forms.
- It employs a seven-stage process, from Unicode normalization to acronym expansion, ensuring high throughput with sub-millisecond processing per utterance.
- By eliminating neural dependencies, VietNormalizer offers robust and scalable text normalization for TTS and NLP in resource-constrained environments.
VietNormalizer: A Zero-Dependency Vietnamese Text Normalization Library for TTS and NLP
Introduction
Text normalization (TN) is fundamental in TTS and NLP pipelines, enabling conversion of written-form input, often replete with non-standard words (NSWs: numbers, dates, currencies, acronyms, foreign terms), into canonical spoken forms. Vietnamese poses acute normalization challenges: tonal morphology encoded in diacritics, irregular number verbalization patterns, a high density of NSWs in real-world text, and extensive loanword integration. Existing Vietnamese TN tools either embed normalization within larger frameworks with neural dependencies or cover only a limited set of NSW classes. VietNormalizer introduces a comprehensive, zero-dependency Python library that delivers robust rule-based normalization, addressing limitations in scope, deployability, and extensibility previously seen in the Vietnamese NLP ecosystem (2603.04145).
Technical Contributions
VietNormalizer implements a seven-stage normalization pipeline, entirely in pure Python 3.8+:
- Unicode normalization: NFC standardization, removal of emojis/special characters.
- Date/time verbalization: Conversion of DD/MM/YYYY and HH:MM patterns to fully spoken Vietnamese strings, ensuring syntactic correctness and proper tone assignment.
- Currency/percentage expansion: Handling both VND and USD, as well as percentage expressions.
- Recursive number verbalization: Robust decomposition with language-specific handling for forms like "mười", "hai mươi", etc., covering the entire integer and decimal range.
- Acronym and loanword resolution: CSV-configurable dictionaries for expansion/transliteration, with single-regex batch matching that keeps matching time O(n) in the input length.
- High-throughput batch processing: All regex patterns pre-compiled at initialization; no neural or external API dependencies; optimized for large-scale TTS/ASR corpus preparation.
- Public API: Direct pip installation, modular dictionary management, flexible preprocessing modes.
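To make the irregular number forms concrete, here is a minimal sketch of recursive verbalization for 0–999. This illustrates the technique only; it is not the library's actual implementation, and larger ranges (nghìn, triệu, tỷ) would recurse the same way.

```python
# Illustrative sketch (not VietNormalizer's actual code): recursive Vietnamese
# number verbalization with the irregular forms "mười", "mốt", "lăm", "linh".
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm",
          "sáu", "bảy", "tám", "chín"]

def verbalize(n: int) -> str:
    """Verbalize an integer in 0-999 as spoken Vietnamese."""
    if n < 10:
        return DIGITS[n]
    if n < 20:                        # 10-19: "mười" + unit ("lăm" for 15)
        unit = n % 10
        if unit == 0:
            return "mười"
        if unit == 5:
            return "mười lăm"
        return "mười " + DIGITS[unit]
    if n < 100:                       # 20-99: tens + irregular unit forms
        tens, unit = divmod(n, 10)
        words = DIGITS[tens] + " mươi"
        if unit == 1:
            words += " mốt"           # 21 -> "hai mươi mốt", not "một"
        elif unit == 5:
            words += " lăm"           # 25 -> "hai mươi lăm", not "năm"
        elif unit:
            words += " " + DIGITS[unit]
        return words
    hundreds, rest = divmod(n, 100)   # 100-999: recurse on the remainder
    words = DIGITS[hundreds] + " trăm"
    if rest == 0:
        return words
    if rest < 10:
        return words + " linh " + DIGITS[rest]  # 105 -> "một trăm linh năm"
    return words + " " + verbalize(rest)
```

The key point is that irregularities live in small, auditable branches rather than in learned weights, which is what makes the rule set easy to inspect and extend.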
This design yields practical benefits: minimal memory footprint, sub-millisecond processing latencies per utterance, and immediate suitability for resource-constrained deployments, including embedded hardware and production-grade batch pipelines.
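The single-regex batch-matching strategy for the acronym/loanword stage can be sketched as follows. The dictionary entries and transliterations here are illustrative assumptions; the library loads its own entries from CSV files.

```python
# Hedged sketch of single-pass dictionary substitution: one alternation
# regex, compiled once at initialization, so each input is scanned once.
import re

# Hypothetical entries for illustration; not the library's shipped dictionary.
LOANWORDS = {"internet": "in tơ nét", "email": "i meo", "wifi": "oai phai"}

# Longest-first alternation avoids partial matches inside longer keys.
PATTERN = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, LOANWORDS), key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

def expand_loanwords(text: str) -> str:
    """Replace every dictionary hit in a single scan of the input."""
    return PATTERN.sub(lambda m: LOANWORDS[m.group(1).lower()], text)
```

Compiling one alternation pattern up front is what turns k dictionary lookups per token into a single linear scan per utterance.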
Comparative Analysis with Prior Approaches
VietNormalizer distinctly diverges from neural approaches (e.g., BERT-BiGRU-CRF pipelines [trang2022nsw], ViSoLex [nguyen2025visolex]) by eliminating model dependencies and inference overhead. Neural TN methods exhibit strong in-distribution accuracy but require several gigabytes of model weight storage and incur nontrivial inference latencies, fundamentally limiting their deployability in real-time applications and low-resource environments. Furthermore, neural approaches are sensitive to error propagation from NSW detection and generalize poorly across domains.
Toolkit-embedded normalization (e.g., underthesea [underthesea2022]) is restricted to Unicode/diacritic correction, omitting the NSW expansion and loanword transliteration required for TTS. Such toolkits also force users to install unrelated NLP components, adding unnecessary overhead.
VietNormalizer addresses all NSW categories encountered in Vietnamese TTS/NLP (numbers, dates/times, currencies, percentages, acronyms, loanwords) through explicit, auditable rule-based mappings. Context-dependent ambiguity is handled via ordered rule application and priority-based pattern matching, defaulting to the statistically predominant interpretation in Vietnamese corpora, a practical solution drawn from production TN workflows [ebden2015kestrel].
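Ordered, priority-based matching can be illustrated with a small rule table; the patterns and tags below are hypothetical rather than the library's actual rule set. Trying the date pattern before the bare-number pattern ensures a token like "12/05/2024" is classified as a date rather than split into three separate numbers.

```python
# Illustrative sketch of ordered rule application: patterns are tried in
# priority order, and the first (highest-priority) full match wins.
import re

RULES = [  # highest priority first; tags are hypothetical labels
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "DATE"),
    (re.compile(r"\b\d{1,2}:\d{2}\b"), "TIME"),
    (re.compile(r"\b\d+%"), "PERCENT"),
    (re.compile(r"\b\d+\b"), "NUMBER"),
]

def classify(token: str) -> str:
    """Return the tag of the first rule that fully matches the token."""
    for pattern, tag in RULES:
        if pattern.fullmatch(token):
            return tag
    return "PLAIN"
```

Because the list is ordered, ambiguity resolution is deterministic and auditable: changing a priority is a one-line, reviewable edit.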
Empirical Results and Claims
While the paper focuses on architectural and functional coverage rather than direct accuracy evaluation, it asserts:
- Comprehensive coverage: All major Vietnamese NSW types are normalized with deterministic, interpretable rules.
- Performance: Batch normalization throughput is reported at tens of thousands of utterances per minute per CPU core, suitable for use in pipelines like VietSuperSpeech [do2026vietsuperspeech].
- Deployability: Zero external dependencies and sub-millisecond normalization enable robust usage even on serverless or embedded platforms.
These claims are substantiated by API examples and by a comparison table contrasting VietNormalizer with contemporary Vietnamese TN solutions.
Practical and Theoretical Implications
VietNormalizer represents a blueprint for rule-based TN in low-resource languages, especially tonal, isolating ones like Vietnamese. For languages lacking labeled TN corpora or neural TN tools, the architecture enables rapid rule authoring, extensibility, and scalable deployment, leveraging linguistic expertise without the burden of neural infrastructure.
Practically, VietNormalizer allows immediate integration in large TTS/ASR corpus pipelines, supporting corpus construction, fine-tuning, and real-time applications with negligible overhead. The CSV-driven extensibility supports adaptation to domain-specific acronyms or loanwords.
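A minimal sketch of the CSV-driven extension path described above, assuming a simple two-column schema; the real column names and file layout may differ from this illustration.

```python
# Hedged sketch: loading a domain-specific acronym dictionary from CSV.
# The column names ("acronym", "expansion") are assumptions for illustration.
import csv
import io

CSV_TEXT = """acronym,expansion
AI,ây ai
TPHCM,Thành phố Hồ Chí Minh
"""

def load_dictionary(fp) -> dict:
    """Read acronym -> expansion pairs from a CSV file object."""
    reader = csv.DictReader(fp)
    return {row["acronym"]: row["expansion"] for row in reader}

# In practice this would be open("my_domain.csv", encoding="utf-8").
custom = load_dictionary(io.StringIO(CSV_TEXT))
```

Because the extension point is a plain CSV file rather than code, domain experts can add entries without touching the library itself.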
Theoretically, the pipeline demonstrates the continued relevance and necessity of rule-based TN in contexts where neural TN is infeasible due to data or resource constraints. It also highlights the structural adaptability of its methods to related languages (Thai, Khmer, Lao, Burmese), potentially facilitating multilingual TTS development without requiring extensive labeled datasets or advanced linguistic engineering.
Limitations and Prospects
The paper acknowledges several limitations:
- Contextual disambiguation: Ambiguous NSWs requiring syntactic analysis cannot be exhaustively resolved; future work may include lightweight POS-tagging or context-window heuristics.
- Proper noun recognition and code-switching: Person/place/organization names and unseen foreign terms are not always covered; future versions may incorporate NER modules and token-level language identification.
- Dictionary coverage: Built-in dictionaries are not exhaustive; community contributions are encouraged.
The authors plan to address inverse text normalization (ITN) in future releases, broadening the library's utility to cover both written-to-spoken and spoken-to-written conversion.
Conclusion
VietNormalizer delivers a robust, dependency-free, rule-based Vietnamese TN solution tailored for TTS and NLP applications. By combining comprehensive NSW coverage, high-throughput batch processing, and extensible dictionary-based handling, it fills a critical gap in the Vietnamese NLP landscape. Its architecture and paradigm establish a replicable framework for other low-resource tonal, isolating languages, offering practical and scalable text normalization without the prohibitive cost and complexity of neural models. The open-source nature and modularity support broad adoption and adaptation for domain-specific requirements and multilingual pipeline development.