The Information Geometry of Softmax: Probing and Steering
This presentation explores how traditional Euclidean approaches to steering AI model behavior are fundamentally limited by ignoring the true information-theoretic geometry of softmax-based models. The authors introduce dual steering, a principled method grounded in Bregman geometry that enables precise control of target concepts while provably preserving unrelated behaviors—addressing the brittleness and probability leakage that plague conventional steering techniques.
What if the way we've been steering AI models has been geometrically wrong from the start? When researchers try to edit a language model to prefer cats over dogs, or to shift any concept, really, the standard approach often leaks probability to completely unrelated outputs—a brittleness that hints at a deeper mismatch between method and mathematics.
Building on that tension, the researchers explain how current methods assume a flat Euclidean space when manipulating model representations. But the softmax operation that converts these representations into probabilities actually induces a curved, information-theoretic geometry—and ignoring that structure causes the steering to go awry.
So what is the right geometry, and how does recognizing it change everything?
The key insight is that softmax induces a dual geometry with two coordinate systems. The primal space holds the raw representation vectors, while the dual space—linked by the softmax transformation—is where probability distributions and concepts naturally live, connected by Kullback-Leibler divergence as the fundamental distance measure.
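As a minimal sketch of this duality (toy numbers invented for illustration, not taken from the presentation): the softmax map sends a primal logit vector to its dual probability coordinates, the log map goes back up to an additive constant, and KL divergence plays the role of distance on the dual side.

```python
import numpy as np

def softmax(z):
    """Map primal (logit) coordinates to dual (probability) coordinates."""
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def kl(p, q):
    """KL divergence: the natural 'distance' between dual-space points."""
    return float(np.sum(p * np.log(p / q)))

# A primal representation vector (hypothetical values).
theta = np.array([2.0, 0.5, -1.0])

# Its dual coordinates: the probability distribution softmax induces.
p = softmax(theta)

# The map is invertible up to an additive constant: taking logs recovers
# a logit vector that maps back to the same distribution.
assert np.allclose(softmax(np.log(p)), p)

# Distances between concepts are measured by KL, not Euclidean norm.
q = softmax(np.zeros(3))  # the uniform distribution as a reference point
print(kl(p, q))           # non-negative, zero only if p == q
```

The additive-constant ambiguity is exactly why the primal space alone is a misleading place to measure distances: infinitely many logit vectors correspond to one probability distribution.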
To see why this duality matters, consider interpolating between two distributions. Moving linearly in primal space creates an intersection of modes—you get only what both endpoints share. But interpolating in dual space produces a mixture that unions the modes, keeping the full semantic range of both, which is exactly what you want when steering concepts.
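A toy example makes the intersection-versus-union contrast concrete (the distributions below are invented for illustration). Linear interpolation of logits yields a geometric mixture, which concentrates mass where the endpoints overlap; linear interpolation of probabilities yields an arithmetic mixture, which keeps both endpoints' modes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Two distributions whose dominant modes differ (outcome 0 vs outcome 2),
# sharing only modest mass on outcome 1.
p = np.array([0.80, 0.19, 0.01])
q = np.array([0.01, 0.19, 0.80])

t = 0.5  # midpoint of the interpolation path

# Primal-space interpolation: linear in log-probabilities, then softmax.
# This is a geometric mixture — it keeps only what both endpoints share.
primal_mid = softmax((1 - t) * np.log(p) + t * np.log(q))

# Dual-space interpolation: linear in probabilities.
# This is an arithmetic mixture — it unions the modes of both endpoints.
dual_mid = (1 - t) * p + t * q

print(primal_mid)  # mass piles onto the shared outcome 1
print(dual_mid)    # [0.405, 0.19, 0.405] — both original modes survive
```

The midpoint distributions tell the story: the primal path forgets both endpoints' distinctive modes, while the dual path preserves the full semantic range, which is the behavior you want when steering a concept.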
This diagram illustrates the core advantage visually. When you steer in the dual space, probability mass moves cleanly from your base concept to your target concept—cat to dog, for instance—without bleeding into unrelated categories. That targeted transfer is precisely what Euclidean methods fail to achieve, and it comes directly from respecting the information geometry.
Now let's see how the authors turn this geometric principle into a practical algorithm.
Implementing dual steering requires solving a constrained optimization: you move linearly in dual coordinates, then use a regularized Newton method to find the corresponding primal representation. The regularization handles practical issues like low-entropy distributions and ensures the path stays geometrically feasible throughout the steering process.
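A rough sketch of the general idea (not the authors' exact solver): given a target point in dual coordinates, a regularized Newton iteration finds logits whose softmax matches it. The softmax Jacobian is singular along constant shifts, so a small ridge term keeps each Newton step well-posed.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def primal_from_dual(p_target, lam=1e-6, iters=50):
    """Recover primal logits z with softmax(z) ≈ p_target.

    Illustrative regularized Newton iteration; `lam` and `iters`
    are hypothetical choices, not values from the presentation.
    """
    z = np.zeros_like(p_target)
    for _ in range(iters):
        p = softmax(z)
        residual = p - p_target
        # Jacobian of softmax: diag(p) - p p^T. It is singular along the
        # all-ones direction, so add lam * I as regularization.
        J = np.diag(p) - np.outer(p, p) + lam * np.eye(len(p))
        z = z - np.linalg.solve(J, residual)
    return z

# Target dual-space point (hypothetical): where we want the steered
# distribution to land.
p_star = np.array([0.7, 0.2, 0.1])
z = primal_from_dual(p_star)
print(softmax(z))  # recovers the target distribution
```

In the full method, the target `p_target` would itself come from moving linearly along the dual-space interpolation path, with the regularization also guarding against near-deterministic, low-entropy targets.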
The empirical results are striking. Across language and vision-language models—Gemma and MetaCLIP—and across diverse concept types like verb tenses and object attributes, dual steering consistently preserves off-target distributions better than Euclidean methods. The bars show dual steering maintains lower KL divergence and higher probability mass on valid counterfactual pairs, exactly as the theory predicts.
These results have broad implications for AI safety and control. Dual steering provides the first geometrically principled method for targeted concept editing with formal guarantees about what stays unchanged. That level of precision is essential for deploying language models in high-stakes settings where unintended behavioral shifts could be catastrophic, and it fundamentally reframes how we should think about probing and steering neural representations.
The geometry we choose determines the control we achieve—and in softmax models, that geometry is decidedly non-Euclidean. Visit EmergentMind.com to explore how information geometry is reshaping our ability to interpret and steer AI systems.