Morphology-Agnostic Latent Intent Spaces
- The paper introduces morphology-agnostic latent intent spaces that separate semantic content from surface-level morphological and syntactic features in neural architectures.
- It details innovative model designs like SEPARATOR that utilize parallel bottleneck structures and techniques such as VQ-VAE and KL regularization for disentangling form and meaning.
- Empirical evaluations demonstrate improved paraphrase quality and controlled form manipulation with faithfulness rates up to 90-95%, highlighting cross-lingual and practical application benefits.
A morphology-agnostic latent intent space is a representation within neural LLMs or encoder-decoder architectures that encodes semantic or pragmatic intent while being systematically disentangled from morphological and syntactic realizations. This concept enables models to generate or interpret utterances across diverse surface forms without entangling core meaning with inflectional, derivational, or syntactic variation. Theoretical and empirical advances in both controlled paraphrase generation and structural probing of large pretrained LLMs have crystallized formal approaches to learning and utilizing such spaces for intent-preserving manipulation, evaluation, and analysis.
1. Disentangling Meaning and Form: Model Architectures
Transformer-based encoder–decoder models can be architected to factor surface-level form (e.g., morphology, syntax, word order) from intent or semantics by explicitly separating network pathways and bottlenecks (Hosking et al., 2021). The SEPARATOR model instantiates this principle:
- Parallel bottleneck structure: The encoder output splits into "meaning" heads (for semantic content) and "form" heads (for morphological/syntactic content).
- Semantic bottleneck: Encoded as a continuous Gaussian variable using pooled representations from meaning heads, regularized by a KL divergence penalty to enforce information minimality and suppress surface-form leakage.
- Form bottleneck: Modeled with a discrete Vector-Quantized VAE (VQ-VAE), using M quantizer heads and codebooks to capture surface templates in a high-capacity, tractable space.
- Decoder: Reconstructs the target utterance autoregressively from the concatenated (or projected) latents z_sem and z_syn.
Schematic pathway:

```
X ─► Encoder ─► {eₕ,ₜ}
       ├─► Pool sem ─► q(z_sem|·) ─► z_sem ─┐
       └─► Pool syn ─► VQ quantizer ─► z_syn ─┤► Decoder ─► Ŷ
                                              └► cross-entropy with Y
```
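The parallel-bottleneck pathway above can be sketched numerically. This is a minimal illustration, not the SEPARATOR implementation: the dimensions, mean-pooling, and random codebooks are all toy assumptions chosen to show the data flow (a Gaussian semantic latent alongside per-head nearest-codebook quantization).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not the paper's hyperparameters).
T, D = 6, 16          # sequence length, hidden size
M, K = 4, 8           # quantizer heads, codebook entries per head
Dh = D // M           # per-head dimensionality

enc = rng.normal(size=(T, D))        # encoder outputs for one utterance

# --- Semantic bottleneck: pool "meaning" heads, parameterize a Gaussian ---
pooled = enc.mean(axis=0)            # mean-pool over time
mu, log_var = pooled, np.zeros(D)    # toy q(z_sem|x); a real model predicts both
z_sem = mu + np.exp(0.5 * log_var) * rng.normal(size=D)

# --- Form bottleneck: per-head vector quantization against codebooks ---
codebooks = rng.normal(size=(M, K, Dh))
heads = enc.mean(axis=0).reshape(M, Dh)      # pooled output split into M heads
codes, z_syn = [], []
for h in range(M):
    dists = np.linalg.norm(codebooks[h] - heads[h], axis=1)
    k = int(dists.argmin())                  # nearest codebook entry
    codes.append(k)
    z_syn.append(codebooks[h, k])
z_syn = np.concatenate(z_syn)

# The decoder would condition on [z_sem; z_syn]; here we only check shapes.
latent = np.concatenate([z_sem, z_syn])
print(latent.shape, codes)
```

The key property the sketch preserves is that z_sem is continuous (later regularized by a KL term) while z_syn is discrete, selected from a fixed codebook per head.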
2. Morphological Subspaces in Pretrained Transformers
Empirical investigations reveal that many morphological transformations enacted by large autoregressive transformers—including pluralization, derivational inflection, and degree change—are encoded by highly linear, low-dimensional operators in hidden state space (Xia et al., 19 Jul 2025).
- Affine Linear Relational Embedding (LRE): Via first-order Taylor approximation, the subject–object mapping is locally captured as o ≈ W s + b, where W is the mean Jacobian across in-context examples and b an additive bias.
- True Linear LRE: Morphology can often be captured by the multiplicative component alone (o ≈ W s), yielding ≈90% faithfulness for inflectional tasks.
- Low-dimensionality: The morphological subspace is the image of W, allowing explicit isolation, removal, or inhibition of morphological content from latent representations.
- Orthogonal projection: Representations can be projected orthogonally to this subspace to obtain stem-only, morphology-agnostic encodings, supporting semantic interpretation independent of surface form.
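The affine map and the orthogonal projection can be made concrete with a toy low-rank operator. The rank, dimensions, and random matrices below are illustrative assumptions; the point is that a basis for image(W) (obtained here via SVD) defines both the LRE prediction and the morphology-agnostic residual.

```python
import numpy as np

rng = np.random.default_rng(1)

D, r = 12, 3                       # hidden size, (low) rank of morphology subspace
# Toy low-rank "mean Jacobian" W and bias b (assumptions for illustration).
U = rng.normal(size=(D, r))
W = U @ rng.normal(size=(r, D))    # rank-r operator; image(W) = span(U)
b = rng.normal(size=D)

s = rng.normal(size=D)             # subject (stem) representation
o_affine = W @ s + b               # affine LRE prediction of the inflected object
o_linear = W @ s                   # "true linear" variant, bias dropped

# Orthonormal basis for image(W) via SVD, then project s orthogonally to it.
Uw, S, _ = np.linalg.svd(W)
basis = Uw[:, :np.sum(S > 1e-8)]   # columns spanning the morphological subspace
P = basis @ basis.T                # projector onto the subspace
s_agnostic = s - P @ s             # morphology-agnostic (stem-only) residual

# The residual has no component inside the morphological subspace.
print(np.linalg.norm(P @ s_agnostic))
```

Because P is idempotent, the residual s − P s is exactly orthogonal to the subspace, which is what licenses treating it as a surface-form-free encoding.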
3. Training Objectives and Invariance Guarantees
Enforcing a morphology-agnostic latent intent space necessitates carefully constructed objectives that penalize leakage of surface realization into semantic variables and distribute form-specific information onto discrete latent codes.
The primary losses in SEPARATOR (Hosking et al., 2021) include:
- Reconstruction loss (L_recon): Teacher-forced cross-entropy over decoder predictions given the intent and form latents.
- KL penalty (L_KL): Regularizes q(z_sem|x) toward a standard Gaussian, dissuading encoding of non-semantic features.
- VQ-VAE losses (codebook, commitment, and related quantization terms): Shape form-bottleneck quantization and codebook utilization.
- Classifier loss (L_cls): At test time, a small classifier predicts new form codes from the paraphrase cluster, enabling controllable paraphrasing without external exemplars.
Together, these objectives promote the separation of form and meaning, ensuring that intent representations are strictly morphology-agnostic in practice.
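The interaction of these objectives can be sketched as a single scalar total. The weights, the stand-in reconstruction value, and the closed-form diagonal-Gaussian KL are illustrative assumptions rather than SEPARATOR's actual training configuration; the VQ terms follow the standard VQ-VAE codebook/commitment formulation.

```python
import numpy as np

rng = np.random.default_rng(2)

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def vq_losses(z_e, z_q):
    # Standard VQ-VAE terms: the codebook loss pulls codes toward encoder
    # outputs; the commitment loss pulls encoder outputs toward their codes.
    # (In training, each term stops gradients on one of its arguments.)
    codebook = np.sum((z_q - z_e) ** 2)
    commit = np.sum((z_e - z_q) ** 2)
    return codebook, commit

D = 8
mu, log_var = rng.normal(size=D) * 0.1, rng.normal(size=D) * 0.1
z_e, z_q = rng.normal(size=D), rng.normal(size=D)

recon = 1.7                        # stand-in for teacher-forced cross-entropy
beta_kl, beta_c = 0.5, 0.25        # illustrative loss weights
cb, cm = vq_losses(z_e, z_q)
total = recon + beta_kl * kl_to_standard_normal(mu, log_var) + cb + beta_c * cm
print(total)
```

The division of labor is visible in the structure: the KL term pressures z_sem toward an uninformative prior so that surface detail is cheaper to store in the discrete codes, which the VQ terms keep well-utilized.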
4. Empirical Evaluation and Specialization
Empirical evaluation confirms that explicitly separated morphology-agnostic spaces yield superior intent preservation and surface-form control:
- Paraphrase quality: SEPARATOR achieves higher iBLEU scores (14.8 on Paralex, ≈5.8 on Quora Question Pairs), balancing semantic fidelity with surface novelty (Hosking et al., 2021).
- Faithfulness of linear decoding: For morphological relations in GPT-J, linear LRE achieves ≈90% faithfulness, outperforming semantic/encyclopedic analogies (≈40%) (Xia et al., 19 Jul 2025).
- Cross-lingual generality: Linearization of morphology holds across eight languages, demonstrating that even agglutinative forms exhibit substantial linear faithfulness.
- Head specialization: Learned quantizer heads tend to specialize in distinct syntactic/morphological features, e.g., question words or presence of complex phrases.
A summary of key results is given below.
| Method/Metric | Paralex iBLEU | GPT-J Morph Faithfulness |
|---|---|---|
| SEPARATOR (Hosking et al., 2021) | 14.8 | — |
| Linear LRE (Xia et al., 19 Jul 2025) | — | 90% |
| Affine LRE | — | 95% |
5. Practical Manipulation and Analysis of Latent Spaces
The explicit identification of morphological subspaces enables both circuit editing and controlled intent recovery:
- Subspace projection: Projecting representations orthogonally to the W-derived morphology subspace yields pure intent encodings.
- Circuit editing: Zeroing out morphological subspace components in network activations can inhibit undesired inflections without semantic degradation.
- Latent intent recovery: For downstream tasks (e.g., sentiment analysis, role labeling), using only the residual subspace leads to representations robust to inflectional or syntactic artifacts.
- Form manipulation: Discrete latent codes enable plug-and-play surface template selection, supporting diverse and controllable paraphrase generation without reliance on external exemplars.
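The last point, plug-and-play form selection, can be sketched by swapping discrete codes while holding the intent latent fixed. The codebooks, code assignments, and dimensions below are hypothetical; the sketch only shows that meaning and surface template are manipulated independently.

```python
import numpy as np

rng = np.random.default_rng(3)

M, K, Dh = 3, 5, 4                 # quantizer heads, codes per head, head dim
codebooks = rng.normal(size=(M, K, Dh))

def form_vector(codes):
    # Assemble z_syn from one discrete code per quantizer head.
    return np.concatenate([codebooks[h, k] for h, k in enumerate(codes)])

z_sem_a = rng.normal(size=8)       # intent latent of utterance A
codes_a = [0, 2, 4]                # A's own surface-template codes
codes_b = [1, 1, 3]                # codes for a different surface template

# Plug-and-play: keep A's meaning, swap in the other template's form codes.
latent_original = np.concatenate([z_sem_a, form_vector(codes_a)])
latent_restyled = np.concatenate([z_sem_a, form_vector(codes_b)])
print(latent_restyled.shape)
```

Feeding latent_restyled to the decoder would then yield a paraphrase with A's intent in a new surface realization, with no external exemplar required.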
6. Limitations and Extensions
Current formulations are subject to several limitations:
- Model scope: Results are centered on GPT-J and Llama-7b; applicability to larger or differently trained models may vary (Xia et al., 19 Jul 2025).
- Grammatical coverage: Experiments focus primarily on single-token subject–object pairs and specific morphological relations; compositional and discourse-level phenomena may require higher-rank or nonlinear representations.
- Causality of linear subspaces: While high faithfulness is observed, Jacobian-based identification does not confirm that W is causally responsible for morphological transformations in all cases.
- Hierarchy of grammatical subspaces: Potential exists to stack multiple operators corresponding to tense, aspect, degree, etc., yielding coarse-to-fine hierarchical intent spaces.
Potential future directions include:
- Extension to cross-lingual settings and non-question genres.
- Reduced supervision via unsupervised pair mining or back-translation.
- Integration of powerful form-code predictors.
- Probing beyond morphology, including semantic roles and discourse relations, to delineate which features admit linear subspace separation and which require more complex architectures (Hosking et al., 2021, Xia et al., 19 Jul 2025).