Polish phonology and morphology through the lens of distributional semantics

Published 31 Mar 2026 in cs.CL | (2604.00174v1)

Abstract: This study investigates the relationship between the phonological and morphological structure of Polish words and their meanings using Distributional Semantics. In the present analysis, we ask whether there is a relationship between the form properties of words containing consonant clusters and their meanings. Is the phonological and morphonological structure of complex words mirrored in semantic space? We address these questions for Polish, a language characterized by non-trivial morphology and an impressive inventory of morphologically-motivated consonant clusters. We use statistical and computational techniques, such as t-SNE, Linear Discriminant Analysis and Linear Discriminative Learning, and demonstrate that -- apart from encoding rich morphosyntactic information (e.g. tense, number, case) -- semantic vectors capture information on sub-lexical linguistic units such as phoneme strings. First, phonotactic complexity, morphotactic transparency, and a wide range of morphosyntactic categories available in Polish (case, gender, aspect, tense, number) can be predicted from embeddings without requiring any information about the forms of words. Second, we argue that computational modelling with the discriminative lexicon model using embeddings can provide highly accurate predictions for comprehension and production, exactly because of the existence of extensive information in semantic space that is to a considerable extent isomorphic with structure in the form space.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper introduces a robust computational framework using fastText, Word2vec, and the Discriminative Lexicon Model to achieve over 97% comprehension accuracy for mapping Polish forms to meanings.
It employs techniques such as LDA and t-SNE to visualize and quantify the clear clustering of phonotactic and morphosyntactic features within semantic spaces.
The study challenges traditional views by demonstrating that detailed phonological and morphological cues systematically influence semantic representations in Polish.

Semantic Isomorphy of Form and Meaning in Polish: A Distributional Perspective

Introduction

This study presents a rigorous computational investigation into the interaction between phonological, morphonological, and morphosyntactic structure and semantics in Polish, a language characterized by highly complex consonant clusters, rich inflectional morphology, and nontrivial morphonotactic processes. The work leverages distributional semantics—specifically, fastText and Word2vec embeddings—and the Discriminative Lexicon Model (DLM) to systematically explore the extent to which semantic space encodes or mirrors sub-lexical, morphological, and morphotactic regularities. Linear Discriminant Analysis (LDA) and t-SNE projections are employed to quantify and visualize the semantic separability of features at different linguistic strata. The study postulates and empirically supports a strong isomorphy between form space and semantic space in Polish, challenging strict views on the arbitrariness of the sign.

Theoretical and Linguistic Background

Contrary to the traditional stance of dual articulation and arbitrariness (de Saussure, Martinet), a growing empirical literature contends that phonetic, phonological, and morphological structure is systematically reflected in word meanings and vice versa. The evidence spans iconicity, duration differences among English homophones, and morphologically-driven phonetic effects in diverse typologies. Polish provides an ideal testbed due to its extreme phonotactic permissiveness, exemplified by morphonotactically intricate initial clusters arising from productive and unproductive prefixation, vowel-zero alternations, and rich case, gender, and number distinctions. Of particular interest are productive aspectual prefixes (e.g., s-, z-, w-, wz-), the gradient parsability of morpheme boundaries, and alignment between morphological decomposition and perception/production mechanisms.

Dataset Construction and Representation

The study compiles a corpus-derived dataset of 8,015 unique words (strings) from the 8,000-lemma "Słownik Podstawowy Języka Polskiego dla Cudzoziemców" and extracts all forms containing initial morphonotactic clusters, subsequently deduplicated for syncretic forms. Each form is annotated for part of speech, detailed morphosyntactic categories (case, number, gender, person, tense, aspect), and cluster-level phonological/morphonological attributes (size, markedness, prefix identity, morpheme parsability). Embeddings are provided both by fastText (subword-sensitive) and standard Word2vec (whole-word-based), allowing assessment of the impact of sublexical distributional information.

Discriminative Lexicon Model (DLM) and Performance

The DLM is instantiated with form representations as n-gram presence vectors and meanings as embeddings. As a strictly compositional, analogy-based approach, the DLM eschews explicit morpheme segmentation, learning only linear mappings between form space and embedding space via analytic solutions (end-state learning). Comprehension (form-to-meaning) accuracy reaches >97% across train, test, and held-out sets; production accuracy is comparable for seen items and 77.8% for held-out forms, with most errors involving grammatical feature substitutions rather than stem misprediction. Crucially, out-of-vocabulary productions are overwhelmingly well-formed and legal, further confirming the generalizability of the learned mappings.

Isomorphy between Form, Morphology, and Semantic Space

Visualization of High-Dimensional Structure

Scatterplots of 2D t-SNE projections (e.g., for parts of speech, case, number, person, gender, cluster size/markedness) show nontrivial but robust clusterings for morphosyntactic categories (Figure 1, Figure 2). Most strikingly, even purely phonotactic properties (e.g., initial cluster markedness) induce weak—yet measurable—clustering in semantically induced structures, as evidenced in Figure 3.

Figure 1: Scatterplots in t-SNE space color-coded for POS and detailed morphosyntactic classes, demonstrating strong separation by major morphosyntactic categories.

Figure 2: Gender-based t-SNE clusterings, showing non-random distribution and partial overlaps directly corresponding to morphological syncretism.

Discriminability Analysis (LDA)

Supervised LDA with cross-validation quantifies the degree of semantic isomorphy for a wide spectrum of linguistic features:

Morphosyntactic categories (e.g., person, tense, aspect, case, number) are predicted from embeddings at accuracies between 89–99%, greatly exceeding majority baselines.
Fine-grained morphonotactic and phonotactic properties (prefix identity, cluster markedness, cluster length, morpheme boundary parsability) are also recoverable with 81–97% accuracy.

This holds for both fastText and Word2vec representations, though fastText consistently reports higher discriminability, particularly as cluster complexity and sublexical irregularity increase.

Principal Components Analysis

Variance partitioning (Figure 4, Figure 5) reveals that morphosyntactic features account for significant portions of the first 50–100 principal components, signifying direct entanglement with macro-axes of semantic variation. In contrast, phonotactic and morphonotactic variables are encoded diffusely, requiring 200+ PCs for full classification accuracy, indicating weaker, distributed representation.

Figure 4: Cumulative explained variance by principal components in semantic space.

Figure 5: Classification accuracy of morphosyntactic and phonotactic categories as a function of number of principal components used in LDA.

Implications and Contributions

This work decisively demonstrates that in Polish, morphological (inflectional, derivational), morphonological, and a broad spectrum of phonotactic properties are systematically encoded in semantic embeddings derived purely from distributional statistics. The correspondence transcends morphological boundaries and is recoverable even in embeddings not privy to subword information, highlighting the prevalence of non-arbitrary form-to-meaning mappings. Morphosyntactic variables—person, case, tense, gender—not only structure semantic space but also dominate principal axes of variation, underscoring their centrality in meaning organization. The recoverability of phonotactic features through embeddings provides compelling evidence for a gradient—and partially iconic—mapping between linguistic form and usage-induced semantics.

These findings challenge strict decompositional and symbolic models of morphology by empirically validating the DLM’s distributional, non-segmental mapping. The demonstration that production and comprehension can be highly accurate in a system with opaque morphology, heavy syncretism, and non-concatenative morphonotactics indicates that Polish learners may efficiently exploit analogical and distributional cues without morpheme segmentation.

For computational linguistics, the direct association between phonotactics/morphonotactics and semantic structure foreshadows efficient, language-generic approaches to processing languages with complex morphophonology. The results inform architecture choices for unsupervised LLMs and ground claims about the emergent nature of meaning in natural language.

Conclusion

This comprehensive study offers robust empirical support for the theoretical position that semantic space in natural language—even in a typologically complex system such as Polish—significantly reflects and encodes detailed properties of word form, morphological structure, and phonotactic legality. The isomorphism between meaning and form, quantified here, challenges the tenet of radical arbitrariness and sets the stage for more integrated, analogical models of the mental lexicon. Future research can generalize these findings to other morphologically rich, phonotactically intricate lexicons, examining typological parameters that mediate the strength of semantic-form alignment and further refining the representation of morphonotactic cues in neural and probabilistic models of language.

Markdown Report Issue