Token embeddings violate the manifold hypothesis (2504.01002v2)

Published 1 Apr 2025 in cs.CL and cs.AI

Abstract: A full understanding of the behavior of a LLM requires our understanding of its input token space. If this space differs from our assumptions, our understanding of and conclusions about the LLM will likely be flawed. We elucidate the structure of the token embeddings both empirically and theoretically. We present a novel statistical test assuming that the neighborhood around each token has a relatively flat and smooth structure as the null hypothesis. Failing to reject the null is uninformative, but rejecting it at a specific token $\psi$ implies an irregularity in the token subspace in a $\psi$-neighborhood, $B(\psi)$. The structure assumed in the null is a generalization of a manifold with boundary called a \emph{smooth fiber bundle} (which can be split into two spatial regimes -- small and large radius), so we denote our new hypothesis test as the ``fiber bundle hypothesis.'' Failure to reject the null hypothesis is uninformative, but rejecting it at $\psi$ indicates a statistically significant irregularity at $B(\psi)$. By running our test over several open-source LLMs, each with unique token embeddings, we find that the null is frequently rejected, and so the evidence suggests that the token subspace is not a fiber bundle and hence also not a manifold. As a consequence of our findings, when an LLM is presented with two semantically equivalent prompts, if one prompt contains a token implicated by our test, the response to that prompt will likely exhibit less stability than the other.

Summary

  • The paper challenges the manifold hypothesis by demonstrating that token embeddings often exhibit complex local geometries inconsistent with low-dimensional manifolds.
  • It applies a fiber bundle null test using local PCA and eigenvalue analysis to distinguish between signal and noise dimensions in token neighborhoods.
  • Findings indicate that tokens with intricate local structures drive unpredictable LLM output variability, influencing prompt engineering and model interpretation.

The paper "Token embeddings violate the manifold hypothesis" (2504.01002) investigates the geometric structure of the input space of LLMs, specifically the space populated by token embeddings. It challenges the common, often implicit, assumption that these embeddings lie on or near a low-dimensional manifold. The authors propose and test a more general model based on fiber bundles and find that even this structure fails to describe the local geometry around many tokens. This has implications for understanding LLM behavior, robustness, and the interpretation of model internals.

Theoretical Framework: Beyond the Manifold Hypothesis

The manifold hypothesis posits that high-dimensional data, such as word embeddings, often concentrates near a lower-dimensional, smooth manifold embedded within the ambient space. If this holds, the intrinsic dimensionality of the data is much smaller than the embedding dimension, and local neighborhoods should resemble Euclidean space $\mathbb{R}^d$ for some small $d$.
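
To make "local neighborhoods resemble $\mathbb{R}^d$" concrete, here is a minimal sketch on synthetic data rather than real token embeddings. It assumes a unit circle (intrinsic dimension 1) embedded in a 10-dimensional ambient space; the helper `local_pca_eigenvalues` and all sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def local_pca_eigenvalues(points, center, k):
    """Covariance eigenvalues (descending) of the k points nearest to center."""
    dists = np.linalg.norm(points - center, axis=1)
    nearest = points[np.argsort(dists)[:k]]
    centered = nearest - nearest.mean(axis=0)
    # Squared singular values of the centered data, divided by (k - 1),
    # are the eigenvalues of the sample covariance matrix.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return singular_values**2 / (k - 1)

rng = np.random.default_rng(0)
# Synthetic data: a unit circle (intrinsic dimension d = 1) embedded in R^10.
angles = rng.uniform(0.0, 2.0 * np.pi, size=2000)
circle = np.zeros((2000, 10))
circle[:, 0] = np.cos(angles)
circle[:, 1] = np.sin(angles)

eigs = local_pca_eigenvalues(circle, circle[0], k=40)
variance_share = eigs / eigs.sum()
print(variance_share[:3])
```

On a small enough neighborhood, almost all local variance falls along the single tangent direction, which is what the manifold hypothesis predicts for $d = 1$.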

The paper argues that the manifold hypothesis may not adequately capture the structure of token embedding spaces. Instead, the authors propose a local model based on fiber bundles, a generalization of manifolds. In this model, the neighborhood $N(e_i)$ of a token embedding $e_i$ is hypothesized to decompose into two distinct components: a "signal" subspace and a "noise" subspace. More formally, they model the local structure as being homeomorphic to a product space $B \times F$, where $B$ represents the base space (capturing meaningful local variations, or "signal") and $F$ represents the fiber (capturing stochastic or less meaningful variations, or "noise"). Every fiber bundle has this product form locally, a property known as local triviality. The dimensions associated with $B$ constitute the local signal dimension, and those associated with $F$ constitute the local noise dimension.
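
For reference, the product structure can be stated with the standard textbook definition of local triviality. The symbols $E$ (total space), $\pi$ (projection), $U$ (an open neighborhood in $B$), and $\varphi$ (local trivialization) are conventional notation assumed here, not necessarily the notation used in the paper.

```latex
\pi : E \to B, \qquad
\varphi : \pi^{-1}(U) \xrightarrow{\;\cong\;} U \times F, \qquad
\mathrm{pr}_1 \circ \varphi = \pi\big|_{\pi^{-1}(U)}, \qquad
\dim E = \dim B + \dim F .
```

When $F$ is a single point, $E$ is locally just $B$, which is the sense in which a fiber bundle generalizes a manifold.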

Methodology: The Fiber Bundle Null Hypothesis Test

To empirically assess the validity of the local fiber bundle structure, the authors introduce a statistical hypothesis test termed the "fiber bundle null."

  • Null Hypothesis ($H_0$): The local neighborhood of a given token embedding $e_i$ is consistent with a fiber bundle structure. Specifically, the local geometry can be adequately modeled as a product $B \times F$, implying a clear separation between signal and noise dimensions locally.
  • Alternative Hypothesis ($H_1$): The local neighborhood structure significantly deviates from a fiber bundle structure.

The implementation of this test involves analyzing the local geometry around each token embedding:

  1. Neighborhood Identification: For a target token embedding $e_i$, identify its $k$ nearest neighbors $\{e_j\}$ in the embedding space using a standard distance metric (e.g., Euclidean distance).
  2. Local PCA: Consider the set of difference vectors $\{v_j = e_j - e_i\}$. Perform Principal Component Analysis (PCA) on this set of vectors $\{v_j\}$. PCA identifies the principal directions of variation within the local neighborhood.
  3. Eigenvalue Analysis: Examine the eigenvalues $\{\lambda_k\}$ obtained from the PCA. Under the fiber bundle null hypothesis ($H_0$), the eigenvalues are expected to exhibit a clear split: a set of larger eigenvalues corresponding to the signal dimensions (base space $B$) followed by a set of smaller, relatively uniform eigenvalues corresponding to the noise dimensions (fiber $F$).
  4. Statistical Testing: A statistical test is formulated to determine if the observed eigenvalue spectrum is consistent with this hypothesized split. While the paper details the specifics, this could involve, for example, testing for a significant gap in the sorted eigenvalues or comparing the decay profile to theoretical distributions expected under $H_0$. Rejection occurs if the observed pattern significantly deviates from the expected structure.

Essentially, the test probes whether the local variance structure around a token can be cleanly partitioned into distinct signal and noise components. Rejecting the null suggests a more intricate local geometry that cannot be simply described as a manifold or even a trivial fiber bundle.
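
The kind of clean partition the null hypothesis expects can be illustrated on synthetic data. The sketch below assumes a neighborhood with 3 high-variance "signal" directions and low-variance "noise" in the remaining directions, then locates the largest gap in the eigenvalue spectrum. The gap heuristic and all sizes are illustrative assumptions; this is not the paper's statistical test.

```python
import numpy as np

rng = np.random.default_rng(1)
n_neighbors, ambient_dim, signal_dim = 200, 32, 3

# Synthetic neighborhood: 3 high-variance "signal" directions,
# low-variance "noise" in the remaining directions.
scales = np.full(ambient_dim, 0.05)
scales[:signal_dim] = 1.0
diff_vectors = rng.normal(size=(n_neighbors, ambient_dim)) * scales

# Eigenvalues of the sample covariance, sorted in decreasing order.
cov = np.cov(diff_vectors, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Locate the largest multiplicative gap between consecutive eigenvalues.
ratios = eigenvalues[:-1] / eigenvalues[1:]
split = int(np.argmax(ratios)) + 1  # number of eigenvalues above the gap
print(split)
```

A neighborhood for which the null is rejected would, by contrast, show no single dominant gap, or a decay profile that does not match a flat noise floor.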

Empirical Findings Across LLMs

The authors applied the fiber bundle null test to the token embedding spaces of several publicly available LLMs. The key finding is that the null hypothesis is rejected for a non-negligible fraction of tokens in every model tested.

This implies that for many tokens, the local geometric structure of their embedding neighborhoods is statistically inconsistent with a simple manifold or even the proposed fiber bundle model. The local geometry is more complex than previously assumed. The specific tokens that lead to rejection often vary between models, reflecting differences in their learned embedding spaces. The paper suggests that the frequency of rejection and the identified "local signal dimension" (derived from the number of significant principal components before the purported noise floor) are important characteristics of the embedding space.
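
A per-model rejection frequency could be tallied as sketched below. This assumes the test yields one p-value per token and uses a Bonferroni correction for the many simultaneous tests; both are assumptions made for illustration, since this summary does not say how the paper handles multiple comparisons. The p-values are made-up numbers, not results from the paper.

```python
def rejection_summary(p_values, alpha=0.05, bonferroni=True):
    """Return (rejected_indices, fraction_rejected) for per-token p-values."""
    if not p_values:
        return [], 0.0
    # Bonferroni: divide the significance level by the number of tests.
    threshold = alpha / len(p_values) if bonferroni else alpha
    rejected = [i for i, p in enumerate(p_values) if p < threshold]
    return rejected, len(rejected) / len(p_values)

# Hypothetical p-values for five tokens (illustrative only).
p_values = [0.20, 0.004, 0.51, 0.00001, 0.03]
rejected, fraction = rejection_summary(p_values)
print(rejected, fraction)
```

The list of rejected indices is what differs between models, while the fraction gives a single number for comparing embedding spaces.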

Practical Implications and Consequences

The rejection of the manifold and fiber bundle hypotheses for token embeddings has several practical consequences:

  1. Limitations of Global Analysis: Techniques that assume a global low-dimensional manifold structure (e.g., standard manifold learning algorithms like t-SNE or UMAP applied globally, or linear probes assuming simple geometric relationships) might provide an incomplete or potentially misleading picture of the embedding space. The local structure is heterogeneous and often complex.
  2. Understanding Output Variability: The paper posits a direct link between the local geometric structure of input tokens and the variability of the LLM's output. Specifically, if a prompt contains tokens for which the fiber bundle null is rejected (indicating significant, complex local structure or a high local signal dimension), the LLM is predicted to exhibit higher output variability or sensitivity for that prompt, even compared to semantically similar prompts using different tokens. This local structure can interact with the model's subsequent layers (e.g., attention mechanisms) in non-trivial ways.
  3. Prompt Engineering and Robustness: Tokens identified by the test as having complex local structure might be points of instability or sensitivity for the LLM. Understanding which tokens exhibit this behavior could inform robust prompt design strategies, aiming to either avoid such tokens or to specifically probe model behavior around these points of complexity. Adversarial attacks might potentially exploit tokens with such intricate local geometries.
  4. Model Interpretation: Analyzing the internal activations and attention patterns of LLMs might benefit from considering the local geometric properties of the input tokens. Attention scores or activation strengths might behave differently depending on whether they originate from tokens with simple versus complex local neighborhoods.
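
For prompt design, one simple use of the test output is to check which tokens in a prompt were flagged. The sketch below assumes prompts are already tokenized into integer IDs and that a set of rejected token IDs is available; the function names and the IDs are hypothetical.

```python
def implicated_positions(prompt_token_ids, rejected_token_ids):
    """Positions in the prompt whose token ID was flagged by the test."""
    rejected = set(rejected_token_ids)
    return [pos for pos, tok in enumerate(prompt_token_ids) if tok in rejected]

def compare_prompts(prompt_a, prompt_b, rejected_token_ids):
    """Count flagged tokens in two tokenized prompts meant to be equivalent."""
    return (len(implicated_positions(prompt_a, rejected_token_ids)),
            len(implicated_positions(prompt_b, rejected_token_ids)))

# Hypothetical token IDs (illustrative only).
rejected_ids = {17, 402}
prompt_a = [5, 17, 88, 402]
prompt_b = [5, 23, 88, 91]
print(implicated_positions(prompt_a, rejected_ids))
print(compare_prompts(prompt_a, prompt_b, rejected_ids))
```

Under the paper's prediction, the prompt with more flagged tokens would be the candidate for less stable responses; the counts are a screening aid, not a measurement of variability.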

Implementation Considerations

Implementing the fiber bundle null test requires careful consideration:

  • Computational Cost: The primary cost lies in the nearest neighbor search and repeated local PCA calculations for potentially every token in the vocabulary. For large vocabularies (tens of thousands of tokens) and high embedding dimensions, this can be computationally demanding. Approximate nearest neighbor algorithms might be necessary for scalability.
    
    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.decomposition import PCA
    
    def fiber_bundle_test_for_token(token_embedding, all_embeddings, token_index,
                                    k_neighbors, significance_level, nn_finder=None):
        """
        Sketch of the fiber bundle null test for a single token.
        Note: the statistical test itself is a placeholder and must be
        implemented from the paper's specification.
        Returns (reject_null, local_signal_dim); (None, None) means the test was not run.
        """
        # 1. Find the k nearest neighbors. Query k + 1 because the token itself is returned.
        #    Pass a pre-fitted nn_finder when looping over tokens to avoid refitting each call.
        if nn_finder is None:
            nn_finder = NearestNeighbors(n_neighbors=k_neighbors + 1, metric='euclidean')
            nn_finder.fit(all_embeddings)
        _, indices = nn_finder.kneighbors(token_embedding.reshape(1, -1),
                                          n_neighbors=k_neighbors + 1)
    
        # Drop the token itself and form difference vectors to its neighbors
        neighbor_indices = [idx for idx in indices[0] if idx != token_index][:k_neighbors]
        diff_vectors = all_embeddings[neighbor_indices] - token_embedding
    
        if diff_vectors.shape[0] < 3:  # Too few neighbors for a meaningful spectrum
            return None, None
    
        # 2. Local PCA; yields at most min(k_neighbors, embedding_dim) components
        pca = PCA()
        pca.fit(diff_vectors)
        eigenvalues = pca.explained_variance_  # Sorted in decreasing order
    
        # 3. Analyze the eigenvalue spectrum (placeholder for the paper's test)
        reject_null = performs_statistical_test(eigenvalues, significance_level)
    
        # 4. Estimate the local signal dimension from the same spectrum
        local_signal_dim = estimate_signal_dimension(eigenvalues)
    
        return reject_null, local_signal_dim
    
    # --- Placeholder functions (illustrative stand-ins, not the paper's method) ---
    def performs_statistical_test(eigenvalues, significance_level, min_gap_ratio=2.0):
        # TODO: implement the specific statistical test from the paper [2504.01002].
        # Stand-in heuristic: reject when no consecutive eigenvalue ratio reaches
        # min_gap_ratio, i.e. the spectrum shows no clear signal/noise split.
        # significance_level is unused here; it is kept for the real test.
        ev = eigenvalues[eigenvalues > 1e-12 * eigenvalues[0]]  # Drop numerically zero values
        if ev.size < 2:
            return False
        return bool(np.max(ev[:-1] / ev[1:]) < min_gap_ratio)
    
    def estimate_signal_dimension(eigenvalues):
        # Stand-in heuristic: place the signal/noise cut at the largest consecutive
        # eigenvalue ratio (the 'elbow'). Replace with the paper's estimator.
        ev = eigenvalues[eigenvalues > 1e-12 * eigenvalues[0]]
        if ev.size < 2:
            return int(ev.size)
        return int(np.argmax(ev[:-1] / ev[1:])) + 1
    
    # Example usage (conceptual)
    # Assume 'embeddings' is a NumPy array of shape (vocab_size, dim)
    # k = 100
    # nn = NearestNeighbors(n_neighbors=k + 1, metric='euclidean').fit(embeddings)
    # results = {}
    # for i, token_emb in enumerate(embeddings):
    #     reject, signal_dim = fiber_bundle_test_for_token(
    #         token_emb, embeddings, i, k_neighbors=k, significance_level=0.05, nn_finder=nn)
    #     results[i] = {'reject_null': reject, 'signal_dim': signal_dim}
  • Parameter Choices: The number of neighbors ($k$) and the significance level ($\alpha$) for the statistical test are important hyperparameters that can influence the results. Sensitivity analysis regarding these parameters might be necessary.
  • Statistical Test Details: The precise formulation of the statistical test used to evaluate the eigenvalue spectrum is crucial and needs to be carefully implemented according to the paper's specifications.
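
A sensitivity analysis over the neighborhood size might look like the following sketch. It re-estimates the signal/noise split at one token for several values of $k$ on synthetic embeddings with 2 signal directions. The gap-based estimate in `gap_split`, the brute-force distance computation, and all sizes are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def gap_split(eigenvalues):
    """Position of the largest consecutive eigenvalue ratio (stand-in estimate)."""
    ev = eigenvalues[eigenvalues > 1e-12 * eigenvalues[0]]
    if ev.size < 2:
        return int(ev.size)
    return int(np.argmax(ev[:-1] / ev[1:])) + 1

def sweep_k(embeddings, token_index, k_values):
    """Estimate the signal/noise split at one token for several neighborhood sizes."""
    dists = np.linalg.norm(embeddings - embeddings[token_index], axis=1)
    # Skip the first entry, assumed to be the token itself at distance 0.
    order = np.argsort(dists)[1:]
    results = {}
    for k in k_values:
        diffs = embeddings[order[:k]] - embeddings[token_index]
        centered = diffs - diffs.mean(axis=0)
        singular_values = np.linalg.svd(centered, compute_uv=False)
        results[k] = gap_split(singular_values**2 / (k - 1))
    return results

rng = np.random.default_rng(2)
# Synthetic embeddings: 2 high-variance directions, 14 low-variance ones.
scales = np.full(16, 0.01)
scales[:2] = 1.0
embeddings = rng.normal(size=(500, 16)) * scales
splits = sweep_k(embeddings, token_index=0, k_values=[50, 100, 200])
print(splits)
```

If the estimate changes substantially as $k$ varies, conclusions drawn at a single $k$ deserve caution. For a full vocabulary, the brute-force distance step would be replaced by an exact or approximate nearest neighbor index.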

Conclusion

The research presented in "Token embeddings violate the manifold hypothesis" (2504.01002) provides compelling empirical evidence that the input spaces of LLMs possess complex local geometric structures. The frequent rejection of both the manifold hypothesis and the more general fiber bundle null hypothesis indicates that many token neighborhoods are not well-approximated by simple low-dimensional Euclidean spaces or product structures. This finding encourages a shift in perspective, suggesting that understanding LLM behavior requires acknowledging and potentially characterizing these intricate local geometries, which appear to influence model sensitivity and output variability. Identifying tokens with such complex local structures may be valuable for robust prompt engineering, model interpretation, and analyzing failure modes.
