- The paper demonstrates that reducing the fully-connected bottleneck to 8 channels preserves phonotactic constraints even when bypassing lexical influences.
 
- The methodology isolates lexically-independent phonetic patterns by leveraging the translation-invariance of convolutional layers with randomized feature maps.

- The findings reveal that convolutional layers can encode structured, interpretable outputs, which paves the way for advances in speech synthesis and phonological modeling.

Isolating Lexically-Independent Phonetic Dependencies in Generative CNNs
This paper explores whether deep neural networks (DNNs) form phonotactic generalizations through lexical learning, focusing on generative convolutional neural networks (CNNs) trained on raw audio waveforms of lexical items. It investigates whether the phonetic dependencies learned by the convolutional layers generalize beyond the lexically-constrained configurations modeled by the fully-connected (FC) layers. The paper proposes a novel experimental technique that isolates these lexically-independent phonetic dependencies by manipulating the FC layer, specifically testing a narrow FC bottleneck of just 8 channels instead of the conventional 1024.
The paper suggests that convolutional layers, owing to their translation-invariance, can capture phonetic dependencies beyond lexical structures. By feeding randomized feature maps directly to the convolutional layers, bypassing the FC layer, the paper tests whether the convolutional layers alone can generate outputs that adhere to learned phonotactic restrictions. In essence, the technique distinguishes what the convolutional layers encode independently of the FC layer's influence.
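The bypass manipulation can be sketched in a few lines of numpy. This is a toy illustration only: the 100-dimensional latent, the 16-step feature map, the kernel widths, and the layer sizes are assumptions loosely modeled on a WaveGAN-style generator, not the paper's exact architecture. The full pipeline maps a latent vector through the FC bottleneck to an 8-channel feature map and then through transposed convolutions to a waveform; the Conv-only probe swaps the FC output for a random feature map of the same shape.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_transpose1d(x, w, stride=2):
    """Naive 1-D transposed convolution. x: (C_in, T), w: (C_in, C_out, K).
    Each input frame scatters a length-K kernel into the upsampled output."""
    c_in, t = x.shape
    _, c_out, k = w.shape
    out = np.zeros((c_out, (t - 1) * stride + k))
    for i in range(t):
        out[:, i * stride : i * stride + k] += np.einsum('i,iok->ok', x[:, i], w)
    return out

# Hypothetical, shrunken layer shapes; BOTTLENECK is the FC width under study.
BOTTLENECK = 8
fc_w = rng.normal(0, 0.1, (100, BOTTLENECK * 16))        # z (100-d) -> 8 ch x 16 steps
conv_ws = [rng.normal(0, 0.1, (BOTTLENECK, 16, 25)),
           rng.normal(0, 0.1, (16, 8, 25)),
           rng.normal(0, 0.1, (8, 1, 25))]

def generate(feature_map):
    """Run only the convolutional stack on a (channels, time) feature map."""
    h = feature_map
    for w in conv_ws[:-1]:
        h = np.maximum(conv_transpose1d(h, w), 0)        # ReLU
    return np.tanh(conv_transpose1d(h, conv_ws[-1]))     # waveform in [-1, 1]

# Full pipeline: latent vector -> FC bottleneck -> convolutions.
z = rng.uniform(-1, 1, 100)
wave_full = generate((z @ fc_w).reshape(BOTTLENECK, 16))

# Conv-only probe: bypass the FC layer and feed a *random* feature map of the
# same shape, isolating what the convolutional layers encode on their own.
wave_conv_only = generate(rng.uniform(-1, 1, (BOTTLENECK, 16)))
```

Because both calls share `generate`, any phonotactic structure in `wave_conv_only` can only come from the convolutional weights, which is the point of the probe.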
Key Findings
- Narrow FC Bottleneck: Reducing the FC bottleneck from the conventional 1024 channels to 8 still allows the model to generate outputs resembling the training lexical items. The reduced dimensionality is hypothesized to yield more structured and interpretable outputs when random feature maps are fed directly into the convolutional layers.
 
- Lexically-Independent Phonetic Generalization: Models with a smaller FC bottleneck still adhere to the phonotactic restrictions present in the training data even when the FC layer is bypassed. Without the lexical configurations supplied by the FC layer, the convolutional layers nonetheless maintain local phonetic dependencies, as seen in voice onset time (VOT) measurements consistent with the training data's restrictions against certain consonant-vowel sequences.
 
- Qualitative Differences in Outputs: The reduced bottleneck models produce spectrally structured and variable outputs in their Conv-only configurations, demonstrating the convolutional layers' ability to encode interpretable linguistic structures independently.
 
Implications and Future Directions
Understanding how DNNs can distinguish and generate phonetic patterns independent of lexical constraints improves both the interpretability of these models and their plausibility as cognitive models of speech processing. The insight into convolutional-layer dynamics offers a methodological advance for isolating phonetic features outside explicit lexical contexts, paving the way for cognitive models that prioritize local phonetic dependencies.
Theoretically, the research advances the conversation about how neural networks divide computation between phonetic and lexical processes. Practically, a narrow bottleneck reduces model complexity and allows more exhaustive exploration of the convolutional layers' capacities, with implications for modeling phonotactic phenomena and augmenting speech synthesis systems.
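The complexity reduction is easy to quantify for the dense layer alone. Assuming a WaveGAN-style generator whose FC layer maps a 100-dimensional latent vector to a 16-timestep feature map (both dimensions are assumptions for illustration), shrinking the bottleneck from 1024 to 8 channels cuts that layer's parameter count by a factor of 128:

```python
# Parameter count of a single dense layer mapping latent -> (channels x steps),
# weights plus biases. latent_dim=100 and steps=16 are assumed values.
latent_dim, steps = 100, 16

def fc_params(channels):
    out_dim = channels * steps
    return latent_dim * out_dim + out_dim  # weight matrix + bias vector

wide   = fc_params(1024)  # conventional bottleneck: 1,654,784 parameters
narrow = fc_params(8)     # 8-channel bottleneck:       12,928 parameters
print(wide // narrow)     # 128-fold reduction
```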
Future work should focus on:
- Extending analyses of latent space disentanglement in models with narrow bottlenecks.
 
- Further exploration of random feature map configurations to delineate the boundaries of convolutional layer generalizations.
 
- Evaluating broader applications and modest architectural modifications to capture non-local phonotactic dependencies more efficiently.
 
This research contributes substantively to the field of phonological modeling by advocating for a structured approach to dissecting generative processes in DNNs where convolutional layers independently contribute to phonotactic learning.