FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

Published 20 Apr 2026 in cs.CL and cs.SD | (2604.18109v1)

Abstract: This paper presents factorized linear projection (FLiP) models for understanding pretrained sentence embedding spaces. We train FLiP models to recover the lexical content from multilingual (LaBSE), multimodal (SONAR) and API-based (Gemini) sentence embedding spaces in several high- and mid-resource languages. We show that FLiP can recall more than 75% of lexical content from the embeddings, significantly outperforming existing non-factorized baselines. Using this as a diagnostic tool, we uncover the modality and language biases across the selected sentence encoders and provide practitioners with intrinsic insights about the encoders without relying on conventional downstream evaluation tasks. Our implementation is public https://github.com/BUTSpeechFIT/FLiP.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces FLiP as a factorized linear projection model to recover lexical information from multilingual and multimodal embeddings.
It demonstrates strong parameter efficiency and improved keyword extraction accuracy compared to baseline methods like SpLiCE.
FLiP reveals cross-modal and cross-lingual semantic linearity, while highlighting an English bias in embedding spaces.

FLiP: Factorized Linear Projection for Multimodal Multilingual Sentence Embedding Interpretability

Introduction and Motivation

The paper "FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings" (2604.18109) introduces the Factorized Linear Projection (FLiP) model as a systematic approach to diagnosing and interpreting sentence embedding spaces. FLiP serves to recover lexical content from multilingual (LaBSE), multimodal (SONAR), and API-based (Gemini) embedding spaces, facilitating insight into what semantic information is encoded and how it is distributed across modalities and languages. Unlike prior interpretability techniques focused primarily on English or on single modality embeddings, FLiP generalizes across modalities (text, speech) and languages (high/mid-resource), providing a unified diagnostic tool that does not rely on downstream task-centric evaluation.

Methodological Framework

FLiP is situated within the linear representation hypothesis paradigm, advocating for interpretable linear projections to probe semantic structure in embedding spaces. It reframes interpretability as a linear keyword extraction task, employing:

Log-linear Model: Mapping sentence embeddings to vocabulary space via a single linear mapping (projection matrix $W$ and bias $b$ ).
Factorization: Decomposition $\mathbf{W} = \mathbf{AB}$ , where $A$ is the word embedding matrix and $B$ projects modality-specific embeddings to a latent space. Factorization (with explicit low-rank and $L_1$ regularization) enhances parameter efficiency and imposes strong implicit regularization.
Cross-modal and Cross-lingual Training: By integrating speech and text embedding pairs in the training objective, FLiP probes whether semantic concepts align linearly across modalities and languages, compensating for variability and measuring semantic transfer.

Keyword extraction for unseen embeddings is performed by computing logits via the factorized linear projection and selecting top- $k$ indices associated with vocabulary words. The bias term $b$ learns a log-unigram prior, modulating retrieval of frequent words and affecting named entity recall.

Experimental Setup

FLiP and baselines (non-factorized linear projection; SpLiCE) are evaluated using:

Common Voice (MCV) corpus for speech-text pairs (EN, DE, FR)
Europarl and Samanantar for parallel bitexts (EN- $\{$ DE, FR, BN, HI, TA, TE $\}$ )
Standard, span-aware, and Jaccard metrics for keyword extraction performance
Comparison with SpLiCE using matched concept vocabulary and span-aware accuracy

Embeddings are extracted from SONAR, LaBSE, and Gemini models, each representing distinct training paradigms and modalities. Vocabulary is standardized at $b$ 0k unigrams; speech and text embeddings are matched via cross-modal training.

Results: Semantic Linearity and Diagnostic Insights

Factorization and Rank

FLiP demonstrates that factorizing the projection matrix yields substantial regularization benefits, significantly outperforming non-factorized models at all ranks. At rank $b$ 1, FLiP recalls $b$ 2 of text lexical items and $b$ 3 for speech embeddings, closely approaching full-rank performance with fewer parameters.

Modality Alignment

Cross-modal probing reveals robust semantic linearity within languages: keyword extraction accuracy is consistent between text-only and speech-only FLiP models, particularly in SONAR's unified space. Training on speech embeddings slightly enhances transferability to text, evidence that SONAR's multimodal alignment is effective.

Cross-Lingual Semantic Linearity

FLiP identifies a pronounced English bias in embedding spaces: training on EN embeddings allows transfer to related languages (DE, FR) but results in significant accuracy drops for less-related languages (e.g., EN-TA, EN-TE). Conversely, non-English embeddings do not yield comparable transfer, confirming a degradation in semantic linearity as linguistic distance increases.

Vocabulary Effects

Using non-target vocabularies exacerbates performance degradation, especially for linguistically distant languages and bilingual training regimes. For EN-TA, accuracy drops from $b$ 4 to $b$ 5 when switching from EN to TA vocabulary, illustrating the dominance of English as the conceptual anchor in SONAR and similar models.

Model Comparison

SONAR consistently outperforms LaBSE and Gemini across all vocabulary/embedding pairs, but all show the same trend: highest keyword recall with English vocabulary and embeddings, declining for non-English settings.

Named Entity Recall Dynamics

FLiP can recover named entities from embeddings with high fidelity, but retrieval is sensitive to the bias term (log-unigram prior). Removing the bias enhances information-centric retrieval by down-weighting frequent stop words, improving recall for entities as $b$ 6 increases.

Figure 1: Named-entity recall as function of top- $b$ 7 extracted keywords and the effect of bias vector on the logit scores.

Comparison with SpLiCE

FLiP achieves span-aware accuracy of $b$ 8 (text) and $b$ 9 (speech), nearly double that of SpLiCE (text $\mathbf{W} = \mathbf{AB}$ 0, speech $\mathbf{W} = \mathbf{AB}$ 1) using identical concept vocabularies. Thus, FLiP provides a superior, task-agnostic interpretability mechanism without relying on vocabulary heuristics.

Implications and Future Directions

FLiP establishes that semantic information in sentence embeddings is largely linearly recoverable, with well-trained multimodal encoders (notably SONAR) yielding $\mathbf{W} = \mathbf{AB}$ 2 recall of lexical content. This reinforces the linear representation hypothesis at scale for multilingual, multimodal scenarios. The English bias observed underscores critical challenges for genuinely language-agnostic embeddings, raising the need for more equitable multilingual training and diagnostic tools. Practically, FLiP enables rapid intrinsic model diagnosis and interpretability, supporting model debugging, bias analysis, and semantic probing without downstream task reliance.

In future developments, the FLiP paradigm could be extended to compositional concept probing, adaptive cross-lingual factorization, and integration with continuous semantic spaces beyond unigrams (multi-word expressions, structured knowledge). Further, FLiP-style diagnostics may become indispensable for benchmarking new generative and retrieval-centric models, driving representation learning toward greater transparency and fairness.

Conclusion

FLiP is a robust, efficient method for interpreting multimodal and multilingual sentence embedding spaces, revealing strong semantic linearity and substantial English bias in current encoder architectures. Through explicit factorization and cross-modal/cross-lingual probing, FLiP achieves parameter-efficient, high-fidelity lexical recovery from embeddings and offers superior interpretability compared to existing baselines. Its adoption as a diagnostic tool can drive progress in multilingual representation learning and fairer cross-modal modeling, with direct implications for both research and practical deployment in multilingual AI systems.

Markdown Report Issue