- The paper introduces persistent homology to quantify how adversarial influences compress LLM latent spaces, using topological features from Vietoris-Rips filtrations.
- It vectorizes persistence barcodes into statistical summaries to robustly differentiate clean activations from adversarially perturbed ones across multiple large language models.
- Global and local analyses uncover distinct topological signatures, offering new avenues for interpretability and adversarial detection in high-dimensional neural data.
This paper, "Holes in Latent Space: Topological Signatures Under Adversarial Influence" (2505.20435), introduces Persistent Homology (PH), a technique from Topological Data Analysis (TDA), as a principled method to characterize and understand the impact of adversarial conditions on the latent spaces of LLMs. The core problem addressed is the need for interpretability methods that can capture both the global structure and local details of high-dimensional LLM activation spaces under adversarial influence, which existing methods like linear probes or mechanistic interpretability struggle to fully address.
The authors propose using PH to analyze the geometric and topological structure of LLM activations. They view sets of activation vectors as point clouds in a high-dimensional space and apply the Vietoris-Rips filtration process. This involves "thickening" the points by increasing a radius parameter, connecting points as their balls intersect to form increasingly complex simplicial complexes (collections of points, edges, triangles, etc.). PH tracks how topological features, specifically connected components (0-dimensional features) and loops/cycles (1-dimensional holes), are "born" and "die" throughout this filtration process. The output is a persistence barcode: one bar per feature, running from the scale at which the feature is born to the scale at which it dies.
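The birth-and-death bookkeeping described above can be illustrated for connected components alone. The following is a minimal sketch, not the authors' pipeline: it assumes Euclidean distance, assumes an edge enters the filtration at a scale equal to the distance between its endpoints, and uses a hypothetical helper name `h0_barcode`. Loops (1-dimensional features) require a boundary-matrix reduction that is normally delegated to a dedicated TDA library and is not shown.

```python
import math
from itertools import combinations

def h0_barcode(points):
    """Return (birth, death) bars for connected components of a Vietoris-Rips filtration.

    Assumption: an edge between two points enters the filtration at a scale
    equal to their Euclidean distance (some texts use half that distance).
    """
    n = len(points)
    if n == 0:
        return []
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by the scale at which they appear.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    bars = []
    for scale, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge here, so one of them dies at this scale.
            parent[root_i] = root_j
            bars.append((0.0, scale))
    # The last surviving component never dies.
    bars.append((0.0, math.inf))
    return bars

# Two tight pairs of points that are far from each other.
cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
bars = h0_barcode(cloud)
```

For this toy cloud, two bars die at scale 1 (each pair merges), one dies at scale 9 (the two pairs join), and one bar is infinite. The long bar of length 9 is the kind of large-scale feature that persistence is designed to single out.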
To make PH applicable to large-scale LLM data and integrate it with machine learning, the authors vectorize the barcodes into 41-dimensional statistical summaries (e.g., mean/median/std deviation of birth/death times, total persistence, number of features, persistent entropy). They study these topological signatures under two distinct adversarial attack modes:
- Indirect Prompt Injection (XPIA): Where hidden instructions in data context override user prompts.
- Sandbagging: Where a model is fine-tuned to underperform until a specific "password" prompt elicits its full capability.
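The barcode vectorization mentioned above can be sketched as a function from a list of bars to a handful of statistics. This is an illustrative subset under stated assumptions, not the paper's 41-dimensional feature vector: the function name `summarize_bars`, the choice to drop infinite bars, and the use of population standard deviation are all assumptions. Persistent entropy is computed as the Shannon entropy of the bar lifetimes after normalizing them to sum to 1.

```python
import math
import statistics

def summarize_bars(bars):
    """Map a list of (birth, death) bars to a small dictionary of statistics.

    Illustrative subset only; the paper's 41 features are not reproduced here.
    Assumption: bars with infinite death time are dropped before summarizing.
    """
    finite = [(b, d) for b, d in bars if math.isfinite(d)]
    if not finite:
        return {"count": 0}
    births = [b for b, _ in finite]
    deaths = [d for _, d in finite]
    lifetimes = [d - b for b, d in finite]
    total = sum(lifetimes)
    # Persistent entropy: Shannon entropy of lifetimes normalized to sum to 1.
    probs = [life / total for life in lifetimes if life > 0] if total > 0 else []
    entropy = -sum(p * math.log(p) for p in probs)
    return {
        "count": len(finite),
        "mean_birth": statistics.mean(births),
        "median_death": statistics.median(deaths),
        "std_death": statistics.pstdev(deaths),
        "total_persistence": total,
        "persistent_entropy": entropy,
    }

# Four equal bars: persistence is spread evenly across features.
spread = summarize_bars([(0.0, 1.0)] * 4)
# Three tiny bars and one dominant bar: persistence is concentrated.
concentrated = summarize_bars([(0.0, 0.01)] * 3 + [(0.0, 3.97)])
```

With four equal bars the entropy equals log 4, the maximum for four features. When one bar carries almost all of the persistence, the entropy falls toward zero.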
They analyze activation data from the last token across six state-of-the-art LLMs (Phi3-mini-4k, Phi3-medium-128k, Mistral 7B, LLaMA3 8B, LLaMA3 70B, Mixtral-8x7B). Due to the computational intensity of PH, they employ random subsampling techniques, drawing numerous smaller point clouds (e.g., 4096 activations per subsample) from the full dataset per layer and computing barcodes for these subsamples.
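The subsampling step can be sketched as repeated uniform draws from the pool of activations for one layer. This is a minimal sketch under assumptions the summary does not confirm: the helper name `draw_subsamples` is hypothetical, points are drawn without replacement within a subsample, and subsamples are drawn independently of each other. The toy data stands in for real activations.

```python
import random

def draw_subsamples(activations, n_subsamples, subsample_size, seed=0):
    """Draw point clouds uniformly at random, without replacement within each draw.

    Assumption: subsamples are drawn independently, so a given activation
    may appear in more than one subsample.
    """
    if subsample_size > len(activations):
        raise ValueError("subsample_size exceeds the number of activations")
    rng = random.Random(seed)
    return [rng.sample(activations, subsample_size) for _ in range(n_subsamples)]

# Toy stand-in: 100 two-dimensional "activations" instead of real model data.
data_rng = random.Random(1)
toy_activations = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
subsamples = draw_subsamples(toy_activations, n_subsamples=5, subsample_size=16)
```

Each subsample would then be passed to the persistent homology computation separately, yielding one barcode summary per subsample and therefore a distribution of summaries per layer.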
The research presents two main analyses:
- Global Layer-Wise Analysis: They compute barcode summaries for subsamples of clean and adversarial activations across different layers of the models. Using techniques like PCA, CCA, logistic regression, and SHAP values, they demonstrate that these topological features alone can achieve near-perfect separation between normal and adversarial states. The key finding is that adversarial conditions consistently "compress" the latent space topology. This compression manifests as:
  - Fewer topological features (fewer 1-bars or loops).
  - Features emerging later in the filtration process (larger birth times).
  - Dominant features persisting longer (larger death times and persistence).
  - Lower persistent entropy, indicating that total persistence is concentrated in a few large-scale features rather than distributed across many small ones.
  - Conversely, clean activations exhibit higher topological diversity at smaller scales, with more numerous, shorter-lived features and higher entropy. This distinction is statistically robust across different models, sizes, and layers, though its prominence might shift to deeper layers in larger models.
- Local Information Flow Analysis: To understand finer-grained mechanisms, they analyze the relationships between neuron activations across pairs of layers. They project the D-dimensional activation vectors from two layers into a 2D point cloud where each point represents a neuron and its coordinates are its activation values in the two layers. The topology of this 2D point cloud captures how activations co-vary or transform between layers. They compare the topological features (specifically total persistence of 0- and 1-bars, mean birth/death times) for these 2D point clouds derived from clean and adversarial inputs.
  - They observe clear differences in topological features between clean and poisoned activations, particularly in early-to-mid layers. These differences largely wash out when activations are scaled, but they remain meaningful relative to a randomized control in which neuron indices are permuted, destroying the correspondence across layers.
  - For clean data, topological complexity decreases in deeper layers, suggesting stabilization. For poisoned data, complexity may initially decrease but then increase in deeper layers, indicating divergence from clean processing.
  - Analyzing poisoned states based on model response (executed, refused, ignored) reveals distinct topological patterns. Executed/ignored prompts often lead to higher dispersion in mid-layers (more representational capacity allocated), while refused prompts show reduced dispersion (compressed representation).
  - Analyzing pairs of non-consecutive layers shows that the topological differences related to neuron interactions diminish as the layer interval increases.
  - Crucially, they show that layers exhibiting high overall variance in their topological features (even without class labels) are strongly associated with layers where the topological signatures of clean vs. poisoned states differ most significantly. This suggests high-variance layers could be key monitoring points in real-world applications.
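The layer-pair construction used in the local analysis, together with its randomized control, can be sketched as follows. This is a minimal illustration with hypothetical helper names; which layer is permuted in the control, and the toy activation values, are assumptions rather than details taken from the paper.

```python
import random

def layer_pair_cloud(acts_a, acts_b):
    """Pair each neuron's activation in layer A with its activation in layer B."""
    if len(acts_a) != len(acts_b):
        raise ValueError("both layers must have the same number of neurons")
    return list(zip(acts_a, acts_b))

def permuted_control(acts_a, acts_b, seed=0):
    """Shuffle neuron indices in layer B so the pairing across layers is random.

    Assumption: only layer B is permuted; layer A keeps its original order.
    """
    shuffled = list(acts_b)
    random.Random(seed).shuffle(shuffled)
    return layer_pair_cloud(acts_a, shuffled)

# Toy activations for one input: 6 neurons in each of two layers.
layer_a = [0.1, 0.4, -0.3, 0.9, 0.0, -0.7]
layer_b = [0.2, 0.5, -0.2, 1.1, 0.1, -0.6]
cloud = layer_pair_cloud(layer_a, layer_b)
control = permuted_control(layer_a, layer_b)
```

The control keeps the marginal distribution of each layer's activations intact and only breaks the neuron-to-neuron pairing. Any topological difference between `cloud` and `control` therefore reflects how activations co-vary across the two layers rather than the values in either layer alone.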
Practical Implications:
- Interpretability: PH provides a novel, multiscale "topological lens" to understand the geometric shape of LLM latent spaces and how it changes under different inputs or conditions. It offers insights complementary to linear methods.
- Adversarial Detection/Monitoring: The distinct topological signatures identified could potentially be used to detect or flag adversarial inputs or compromised model states based on deviations from the expected "clean" topology. Monitoring layers with high topological variance might be particularly effective.
- Understanding Attack Mechanisms: By analyzing topology under different attack types (like executed vs. refused injections), researchers can gain a deeper understanding of how these attacks perturb the model's internal processing.
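The idea of choosing monitoring points by topological variance can be sketched as a simple ranking. This is a hedged illustration only: the function name `rank_layers_by_variance`, the layer names, and the feature values are made up, and the sketch assumes a single summary feature (such as persistent entropy) has already been computed per subsample and pooled without clean/adversarial labels.

```python
import statistics

def rank_layers_by_variance(feature_by_layer):
    """Sort layer names by the variance of one topological feature, highest first.

    `feature_by_layer` maps a layer name to that feature's values across
    subsamples, pooled without clean/adversarial labels.
    """
    variances = {
        layer: statistics.pvariance(values)
        for layer, values in feature_by_layer.items()
    }
    return sorted(variances, key=variances.get, reverse=True)

# Made-up persistent-entropy values for three hypothetical layers.
toy = {
    "layer_4": [2.0, 2.1, 1.9, 2.0],
    "layer_12": [1.0, 2.5, 0.8, 2.7],
    "layer_20": [1.5, 1.6, 1.4, 1.7],
}
ranking = rank_layers_by_variance(toy)
```

In this toy data the middle layer varies most, so it would be the first candidate for monitoring. A real deployment would still need to confirm on labeled data that the high-variance layers are the ones where clean and adversarial signatures separate.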
Implementation Considerations:
- The primary practical challenge is the computational cost of PH, which scales poorly with the number of data points. The authors mitigate this using subsampling, which is a standard practice in TDA but introduces sampling error (though the authors argue this is bounded and doesn't affect their main conclusions given their chosen subsample size).
- Implementing PH requires specialized libraries like Ripser++ and expertise in TDA concepts.
- Applying this in real-time inference would likely require significant optimization or focusing on summary statistics computed efficiently on smaller batches or subsamples.
Limitations:
- Computational cost limits analysis to subsamples, not the entire latent space of all activations.
- The paper focused on two specific adversarial attack types; generalization to other forms of misalignment or adversarial behavior needs further investigation.
Future Work:
- Investigate if topological compression is a universal property of misalignment and its relation to generalization/memorization.
- Develop topology-aware robustness mechanisms or defenses.
- Apply more advanced TDA techniques like persistent Morse theory or cycle matching for richer characterization.
- Test the topological approach on a broader range of adversarial scenarios.