Phonemic Common Label Set (CLS)

Updated 26 November 2025
  • Phonemic CLS is a unified phonemic inventory that merges diverse language-specific units into a standardized set of IPA-style labels for multilingual speech processing.
  • Mapping procedures leverage rule-based and learned G2P methods to normalize native scripts into CLS, facilitating robust cross-lingual applications.
  • CLS integration in neural architectures reduces output space and improves metrics like WER and CER, demonstrating significant gains in low-resource settings.

A Phonemic Common Label Set (CLS) is a unified inventory of phonemic symbols that enables cross-lingual and multilingual speech processing by collapsing language- or script-specific units into a set of acoustically and articulatorily motivated labels. Distinct from grapheme-based approaches, the CLS is defined phonemically, supporting direct mapping of native scripts or orthographies onto a set of canonical phones or IPA-style tokens. The CLS paradigm underpins modern approaches to multilingual automatic speech recognition (ASR), speech synthesis, and computational phonology, with widespread application in the Indian subcontinent, cross-linguistic child speech resources, and typologically diverse multilingual ASR systems (A et al., 2022, Jayakumar et al., 2023, Yusuyin et al., 4 Jun 2024, Goriely et al., 3 Apr 2025, Gangwar et al., 19 Nov 2025).

1. Formal Definition and Construction

The Phonemic CLS is formally constructed as follows:

  • For a collection of languages $\mathcal{L} = \{\ell_1, \ell_2, \dots, \ell_L\}$, let $P_\ell$ be the context-independent phone inventory for language $\ell$ (derived from pronunciation lexica, G2P modules, or databases like Phoible or LanguageNet).
  • The CLS $C$ is the union of all per-language phoneme sets: $C = \bigcup_{\ell \in \mathcal{L}} P_\ell$.
  • Each native grapheme set $G_i$ (or orthographic form for language $i$) is mapped into $C$ via a rule-based or learned G2P (grapheme-to-phoneme) mapping $f_i: G_i \to C$.
  • In practice, the CLS may be standardized as a set of IPA symbols (after normalization, diacritic-stripping, and variant-collapsing rules) or as a compact set of ASCII string labels for implementation efficiency in large-vocabulary models (A et al., 2022, Yusuyin et al., 4 Jun 2024, Goriely et al., 3 Apr 2025).

For instance, in an Indian context, the CLS may consist of 56–72 labels representing fundamental phones with coverage across all major scripts, vowels (including long, nasalized, diphthongs), consonants, and select suprasegmental properties (A et al., 2022, Gangwar et al., 19 Nov 2025, P, 14 Oct 2024). In multilingual or cross-linguistic models, 60–73 IPA-derived segments are typical (Yusuyin et al., 4 Jun 2024, Goriely et al., 3 Apr 2025).
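
To make the set-union construction concrete, here is a minimal Python sketch; the toy inventories and language codes are illustrative placeholders, not data from Phoible or the cited papers:

# A minimal sketch of CLS construction as a union of per-language
# inventories; the toy phone sets below are illustrative only.
from typing import Dict, Set

inventories: Dict[str, Set[str]] = {
    "hi": {"a", "aː", "k", "kʰ", "ʈ", "ɳ"},  # toy Hindi subset
    "ta": {"a", "aː", "k", "ʈ", "ɳ", "ɻ"},   # toy Tamil subset
    "en": {"æ", "k", "t", "n", "ɹ"},          # toy English subset
}

# C = union of all per-language phone sets P_ell
cls: Set[str] = set().union(*inventories.values())
print(sorted(cls))  # the shared CLS inventory

Shared phones such as /k/ appear once in the union, which is what shrinks the joint inventory relative to concatenating per-language label sets.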

2. Mapping Procedures and Normalization

Mapping native-language data into the CLS proceeds via deterministic or learned functions:

  • Rule-Based G2P: For phonetically transparent orthographies (e.g., many Indian scripts), deterministic mappings assign each script grapheme to a CLS label. The process covers multiple scripts and languages, exploiting shared phonemic space (A et al., 2022, Gangwar et al., 19 Nov 2025).
  • IPA-Based G2P: In typologically diverse contexts, G2P toolkits (Phonetisaurus, LanguageNet FSTs, Epitran, etc.) assign IPA labels to tokens. Post-processing applies normalization ($f_1$: diacritic stripping, $f_2$: symbol collapsing) to yield a unified symbol inventory (Yusuyin et al., 4 Jun 2024, Goriely et al., 3 Apr 2025).
  • Folding Maps: Ambiguous or language-specific phones are handled via “folding maps,” merging near-equivalent raw G2P outputs to canonical CLS entries, or splitting multi-character tokens as required for the inventory (Goriely et al., 3 Apr 2025).
  • Feature-Based Mapping: For dynamic inventories, such as in CLTS, IPA symbols are parsed into sets of phonological features, and then mapped onto binary/ternary feature vectors for high-granularity comparison and inventory expansion (Rubehn et al., 7 May 2024).

A typical mapping pseudocode (adapted from A et al., 2022 and Goriely et al., 3 Apr 2025) is:

cls_sequence = []
for token in source_text:
    symbol = g2p(token)               # script-/orthography-informed G2P lookup
    symbol_norm = normalize(symbol)   # e.g., strip diacritics, collapse variants
    cls_label = fold(symbol_norm)     # merge or split as per the target CLS
    cls_sequence.append(cls_label)
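
To make the normalize and fold steps concrete, a small runnable sketch follows; the Unicode-based diacritic stripping and the folding-map entries are illustrative assumptions, not the published mapping rules:

import unicodedata

# Hypothetical folding map: merge near-equivalent symbols, split ligatures.
FOLDING_MAP = {"ɡ": "g", "ʦ": "ts"}

def normalize(symbol: str) -> str:
    """Strip combining diacritics via Unicode NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", symbol)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def fold(symbol: str) -> str:
    """Map a normalized symbol onto its canonical CLS entry."""
    return FOLDING_MAP.get(symbol, symbol)

print(fold(normalize("ɡ̊")))  # devoiced [ɡ̊] -> normalized "ɡ" -> folded "g"
print(fold(normalize("ʦ")))  # ligature [ʦ] -> split into "ts"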

3. Architectural Integration in Speech Systems

CLS is exploited as an output target and intermediate representation in various neural architectures:

  • Multilingual CTC/Transformer ASR: A CLS-aware decoder predicts sequences of CLS symbols, sharply reducing the output space compared to multilingual graphemes and supporting robust transfer learning (Müller et al., 2017, A et al., 2022, Gangwar et al., 19 Nov 2025, Jayakumar et al., 2023); a minimal sketch follows this list.
  • Multi-Decoder Architectures: A first decoder produces the CLS sequence from speech; a second decoder maps CLS to native-script output, enabling back-projection to user-facing text and leveraging shared representations for improved grapheme prediction (e.g., schwa deletion, context-specific geminates) (Gangwar et al., 19 Nov 2025, A et al., 2022).
  • CLS-to-Script Conversion and Language Tagging: Post-ASR, dedicated modules convert from CLS back to language-specific scripts, applying script-aware, language-conditioned transformations. Explicit language-ID embeddings can also condition the sequence decoder for further disambiguation (Jayakumar et al., 2023).
  • Distinctive-Feature Probing: Embeddings induced over CLS tokens can be probed for well-structured phonological features, confirming that neural LMs encode feature structure distributionally (Goriely et al., 3 Apr 2025).
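
As a minimal illustration of the CLS-aware CTC output target above, the following PyTorch sketch projects encoder states onto a shared CLS inventory plus a blank symbol; the CLSCTCHead class, dimensions, and label count are hypothetical:

import torch
import torch.nn as nn

class CLSCTCHead(nn.Module):
    """Linear projection from encoder states onto the shared CLS inventory (+ CTC blank)."""
    def __init__(self, encoder_dim: int, num_cls_labels: int):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, num_cls_labels + 1)  # index 0 = blank

    def forward(self, encoder_out):  # encoder_out: (time, batch, encoder_dim)
        return self.proj(encoder_out).log_softmax(dim=-1)

# Toy usage: ~64 shared CLS labels instead of hundreds of multilingual graphemes.
head = CLSCTCHead(encoder_dim=256, num_cls_labels=64)
enc = torch.randn(100, 8, 256)             # dummy encoder states
log_probs = head(enc)                      # (100, 8, 65) log-probabilities
targets = torch.randint(1, 65, (8, 20))    # CLS label ids (0 reserved for blank)
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    torch.full((8,), 100), torch.full((8,), 20),  # input/target lengths
)

The point of the sketch is the small, shared output dimension: one softmax over roughly 60–70 CLS labels serves all languages, rather than one output unit per grapheme per script.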

4. Empirical Impact and Performance

Training in CLS space consistently yields substantial gains in multilingual ASR scenarios:

| System | Avg. WER (seen) | WER (unseen) | Notes |
|---|---|---|---|
| Monolingual-Phn | 8.14% | 99–100% (1 h) | Poor cross-lingual transfer; overfit to source |
| Multi-Phn-CLS | 7.64% | 4.3% / 2.43% | 18% gain over subword units (seen languages) |
| Multi-Subword | 9.30% | >3× higher | Outperformed by phonemic CLS |

CLS-based systems retain or improve accuracy for language/dialect identification and facilitate code-switching scenarios in spoken language modeling. Notable empirical advances have been observed in Indian ASR and the CV-Lang10 challenge contexts (Gangwar et al., 19 Nov 2025, Yusuyin et al., 4 Jun 2024, Jayakumar et al., 2023).

5. Phonological Feature Label Sets and Analytical Applications

Beyond atomic label inventories, several resources and toolkits yield CLS with explicit phonological feature vectors for analytical work:

  • Phoible-Centric Inventories: CLS as the union of best-fit per-language phoneme sets, grounded in IPA transcriptions and distinctive-feature matrices, facilitates discovery of typological universals and efficient model probing (Goriely et al., 3 Apr 2025).
  • CLTS/Soundvectors (Feature Bundles): Phoneme labels are mapped to binary/ternary vectors along up to 39 dimensions (major class, manner, place, laryngeal, tone, trajectory), created dynamically using pyclts, providing a foundation for quantitative cross-linguistic phonetic comparison, alignment, and dialectometry (Rubehn et al., 7 May 2024). This approach ensures that any attested IPA symbol can be assigned to the CLS space, enabling both granular analysis and practical coverage.

$$\begin{array}{l|rrrrr}
\text{Segment} & \text{Sonorant} & \text{Continuant} & \text{Labial} & \text{Voice} & \text{Nasal} \\ \hline
p & -1 & -1 & +1 & -1 & 0 \\
b & -1 & -1 & +1 & +1 & 0 \\
n & +1 & -1 & +1 & +1 & +1 \\
a & +1 & 0 & 0 & 0 & 0
\end{array}$$

(Rubehn et al., 7 May 2024)

These representations allow quantitative measurement (cosine or Hamming) of sound similarity and are directly usable in linguistics, ASR pre-training, and typology.
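
For instance, a minimal sketch of such similarity computations over the ternary vectors above (values transcribed from the table; the helper functions are illustrative, not a published API):

# Ternary feature vectors for a few segments, transcribed from the table above.
features = {
    "p": [-1, -1, +1, -1, 0],
    "b": [-1, -1, +1, +1, 0],
    "n": [+1, -1, +1, +1, +1],
    "a": [+1, 0, 0, 0, 0],
}

def hamming(u, v):
    """Number of feature dimensions on which two segments disagree."""
    return sum(a != b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

print(hamming(features["p"], features["b"]))           # 1: differ only in voice
print(round(cosine(features["p"], features["b"]), 3))  # 0.5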

6. Coverage, Extensibility, and Limitations

CLS coverage is tailored by target language sets and design goals:

  • Breadth: Current implementations cover dozens of scripts (e.g., 13 Indian scripts plus English), multilingual corpora spanning dozens of languages (e.g., IPA-CHILDES: 31), and large-scale typological samples (CLTS: >8,500 segments) (P, 14 Oct 2024, A et al., 2022, Goriely et al., 3 Apr 2025, Rubehn et al., 7 May 2024).
  • Extensibility: IPA-based CLS frameworks (CLTS, Phoible-union) support dynamic augmentation, allowing representation of speech data from new languages, newly observed phones, or rare diacritic combinations (Rubehn et al., 7 May 2024, Goriely et al., 3 Apr 2025).
  • Limitations: CLS approaches face challenges regarding schwa deletion/restoration, geminate correction, error-propagation in cascaded models, and limited script-specific nuance. Improvements may include learnable G2P mappings, integration of prosodic/tonal features, or explicit language-ID signals in neural architectures (A et al., 2022, Gangwar et al., 19 Nov 2025).

A plausible implication is that future work will increasingly emphasize learnable, feature-augmented mapping and greater adaptability to phonological variance beyond current Indo-Aryan or European script boundaries.

7. Applications and Significance

Phonemic CLS has enabled significant advances in both engineering and scientific analysis:

  • ASR and TTS Models: Cross-lingual and multilingual neural architectures leverage CLS for robust, scalable, and interpretable output spaces, with demonstrated success in low-resource and code-switching scenarios (A et al., 2022, Jayakumar et al., 2023, Gangwar et al., 19 Nov 2025, P, 14 Oct 2024).
  • Phonological Modeling: Large cross-linguistic speech corpora (e.g., IPA-CHILDES) and neural LMs trained over CLS tokens enable probing of phonological class and feature learning (Goriely et al., 3 Apr 2025).
  • Linguistic Analysis and Distance Measurement: CLS-based, feature-vector representations underpin new approaches to phonetic alignment, dialectometry, protoform reconstruction, and typological generalization (Rubehn et al., 7 May 2024, Goriely et al., 3 Apr 2025).
  • Code-Switching: CLS enables fluid, script-oblivious modeling of code-switched utterances, especially prevalent in polyglossic contexts such as the Indian subcontinent (P, 14 Oct 2024).

The overarching significance of CLS lies in its capacity to unify data, boost transfer learning, and promote systematic, scalable machine learning across the world’s linguistic diversity, while remaining extensible to both phonemic and feature-based frameworks.
