Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations (2506.13901v1)

Published 16 Jun 2025 in cs.CL and cs.AI

Abstract: Alignment is no longer a luxury, it is a necessity. As LLMs enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking. To address this issue, we introduce the Alignment Quality Index (AQI). This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding invariant tool for behavior agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI's correlation with external judges and ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.

Summary

The paper introduces the Alignment Quality Index (AQI) as a novel method for intrinsically diagnosing AI model alignment, moving beyond reliance solely on evaluating refusal behaviors.
AQI quantifies alignment by analyzing the model's internal state through its latent geometric structure, cluster divergence, and layer-wise pooled representations.
This intrinsic diagnostic is designed to be prompt-invariant and decoding-invariant: it measures how well safe and unsafe activations separate in latent space rather than judging sampled outputs, so it can flag hidden misalignment even when responses appear compliant.
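To make the clustering idea concrete, here is a minimal sketch of two of the indices named in the abstract, the Davies-Bouldin Score and the Calinski-Harabasz Index, computed for exactly two clusters. It is not the paper's AQI formula, which combines several indices in a way not described here. The helper names (`centroid`, `davies_bouldin`, `calinski_harabasz`, `make_cluster`) are hypothetical, and the synthetic Gaussian points only stand in for pooled activations of safe and unsafe prompts.

```python
import math
import random


def centroid(points):
    """Mean vector of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def davies_bouldin(safe, unsafe):
    """Two-cluster Davies-Bouldin score; lower means better separation."""
    c_safe, c_unsafe = centroid(safe), centroid(unsafe)
    # Average distance of each cluster's points to its own centroid.
    scatter_safe = sum(math.dist(p, c_safe) for p in safe) / len(safe)
    scatter_unsafe = sum(math.dist(p, c_unsafe) for p in unsafe) / len(unsafe)
    # With two clusters, the max over "other clusters" has a single candidate.
    return (scatter_safe + scatter_unsafe) / math.dist(c_safe, c_unsafe)


def calinski_harabasz(safe, unsafe):
    """Two-cluster Calinski-Harabasz index; higher means better separation."""
    clusters = (safe, unsafe)
    everything = safe + unsafe
    c_all = centroid(everything)
    # Between-cluster dispersion, weighted by cluster size.
    between = sum(len(c) * math.dist(centroid(c), c_all) ** 2 for c in clusters)
    # Within-cluster dispersion around each cluster's own centroid.
    within = 0.0
    for c in clusters:
        c_mean = centroid(c)
        within += sum(math.dist(p, c_mean) ** 2 for p in c)
    n, k = len(everything), len(clusters)
    return (between / (k - 1)) / (within / (n - k))


def make_cluster(rng, center, n, dim):
    """Synthetic stand-in for activations: isotropic Gaussian around `center`."""
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    safe = make_cluster(rng, 0.0, 50, 8)
    unsafe_far = make_cluster(rng, 5.0, 50, 8)    # clearly separated
    unsafe_near = make_cluster(rng, 0.2, 50, 8)   # heavily overlapping
    print("DBS far/near:", davies_bouldin(safe, unsafe_far),
          davies_bouldin(safe, unsafe_near))
    print("CHI far/near:", calinski_harabasz(safe, unsafe_far),
          calinski_harabasz(safe, unsafe_near))
```

In this toy setup the well-separated pair yields a lower Davies-Bouldin score and a higher Calinski-Harabasz index than the overlapping pair, which is the direction of signal a separation-based diagnostic relies on.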

Tweets

https://twitter.com/aryaman2020/status/1935929046350893323

https://twitter.com/bohannon_bot/status/1943722640772087844
